[Bigjob-users] Inconsistency with number_nodes , processes_per_node & jd.number_of_processes parameters.
Andre Luckow
aluckow at cct.lsu.edu
Mon Oct 3 10:41:45 CDT 2011
Hi Pradeep,
> 1. Does bigjob automatically determine processes_per_node on all
> resources mentioned in the resource list?
No, currently the number of nodes is only determined on the agent side.
The number_nodes and processes_per_node parameters are internally mapped to:
jd.number_of_processes = str(number_nodes)
jd.processes_per_host = str(processes_per_node)
The main issue here is that the Globus adaptor, for example, only accepts
the number_of_nodes parameter and not the total_cpu_count attribute that
the SAGA spec defines; number_of_nodes is then translated to the RSL
count parameter. For simplicity, though, I should probably rename
number_nodes to number_of_processes. Then at least BJ and SAGA would be
consistent.
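To make the mapping above concrete, here is a minimal sketch of how the resource-dict keys end up on the SAGA job description. The helper function and the stand-in JobDescription class are hypothetical (the real mapping lives inside BigJob); only the field names come from the mapping shown above.

```python
class JobDescription:
    """Stand-in for saga.job.description, with only the two relevant fields."""
    def __init__(self):
        self.number_of_processes = None
        self.processes_per_host = None

def map_resource_to_saga(resource):
    # Hypothetical helper illustrating BigJob's internal mapping:
    # number_nodes -> jd.number_of_processes (stringified),
    # processes_per_node -> jd.processes_per_host (stringified).
    jd = JobDescription()
    jd.number_of_processes = str(resource["number_nodes"])
    if "processes_per_node" in resource:
        jd.processes_per_host = str(resource["processes_per_node"])
    return jd

jd = map_resource_to_saga({"number_nodes": 2, "processes_per_node": 4})
print(jd.number_of_processes, jd.processes_per_host)  # prints: 2 4
```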
> 2. Is this condition specifically only for LONI or is it for all
> infrastructures like LSU machines & Futuregrid?
What you're observing below is that you request 2 processes via Globus
but get a total of 8 processes. To make the manager aware of the correct
node count, you need to add the processes_per_host parameter in this
case. I know this is confusing, but the main issue is that different
Globus installations handle the count parameter differently (sometimes
it is mapped to #nodes and sometimes to #cores).
Please let me know whether the processes_per_node workaround works.
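Concretely, the workaround means adding a processes_per_node entry to the resource dictionary from your script. This is a sketch only: the value "4" is an assumed cores-per-node figure for illustration, so substitute the actual core count of the eric1 nodes.

```python
import os

resource_list = []
resource_list.append({
    "resource_url": "gram://eric1.loni.org/jobmanager-pbs",
    "number_nodes": "2",
    "processes_per_node": "4",  # workaround: tell BJ the cores per node (assumed value)
    "allocation": None,
    "queue": "checkpt",
    "working_directory": (os.getcwd() + "/agent"),
    "walltime": 20,
})
print(resource_list[0]["processes_per_node"])
```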
Best,
Andre
----
> I tried the script without specifying processes_per_node, but when
> jd.number_of_processes (processors) > number_nodes (nodes) it causes
> trouble even though enough resources are available. The script just
> keeps polling and jobs are not scheduled or queued, as it doesn't find
> enough resources.
> For example
> resource_list.append({"resource_url" :
> "gram://eric1.loni.org/jobmanager-pbs",
> "number_nodes" : "2",
> "allocation" : None, "queue" : "checkpt",
> "working_directory": (os.getcwd() + "/agent"),
> "walltime":20 })
> resource_list.append({"resource_url" :
> "gram://poseidon1.loni.org/jobmanager-pbs",
> "number_nodes" : "2",
> "allocation" : None, "queue" : "checkpt",
> "working_directory": (os.getcwd() + "/agent"),
> "walltime":20 })
> each bigjob size - 8 processors
> for i in range(0, NUMBER_JOBS):
> jd = saga.job.description()
> jd.executable = "/bin/sleep "
> jd.arguments = ["10"]
> jd.number_of_processes = "4"  # if 2 or 1 is given it works,
> # so anything <= number_nodes works.
> jd.spmd_variation = "single"
> jd.working_directory = os.getcwd()
> jd.output = "stdout-" + str(i) + ".txt"
> jd.error = "stderr-" + str(i) + ".txt"
> subjob = mjs.create_job(jd)
> subjob.run()
> print "Submitted sub-job " + str(i) + "."
> jobs.append(subjob)
> job_start_times[subjob]=time.time()
> job_states[subjob] = subjob.get_state()
>