[Bigjob-users] Inconsistency with number_nodes , processes_per_node & jd.number_of_processes parameters.
Andre Luckow
aluckow at cct.lsu.edu
Mon Oct 3 10:41:45 CDT 2011
Hi Pradeep,
> 1. Does bigjob automatically determine processes_per_node on all
> resources mentioned in the resource list?
No, currently the number of nodes is only determined on the agent side.
The number_nodes and processes_per_node parameters are internally mapped to:
jd.number_of_processes = str(number_nodes)
jd.processes_per_host = str(processes_per_node)
The main issue here is that the Globus adaptor, for example, only accepts
the number_of_nodes parameter and not the total_cpu_count attribute that
the SAGA spec defines; number_of_nodes is then translated to the RSL
count parameter. For simplicity, though, I should probably rename
number_nodes to number_of_processes. Then at least BJ and SAGA would be
consistent.
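To make the mapping above concrete, here is a minimal sketch of how the resource-dict keys end up on the SAGA job description. The helper function and the stand-in JobDescription class are hypothetical (the real mapping lives inside BigJob); only the field names come from the mapping shown above.

```python
class JobDescription:
    """Stand-in for saga.job.description, with only the two relevant fields."""
    def __init__(self):
        self.number_of_processes = None
        self.processes_per_host = None

def map_resource_to_saga(resource):
    # Hypothetical helper illustrating BigJob's internal mapping:
    # number_nodes -> jd.number_of_processes (stringified),
    # processes_per_node -> jd.processes_per_host (stringified).
    jd = JobDescription()
    jd.number_of_processes = str(resource["number_nodes"])
    if "processes_per_node" in resource:
        jd.processes_per_host = str(resource["processes_per_node"])
    return jd

jd = map_resource_to_saga({"number_nodes": 2, "processes_per_node": 4})
print(jd.number_of_processes, jd.processes_per_host)  # prints: 2 4
```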
> 2. Is this condition specifically only for LONI or is it for all
> infrastructures like LSU machines & Futuregrid?
What you're observing below is that you request 2 processes via Globus
but get a total of 8 processes. To make the manager aware of the correct
node count, you need to add the processes_per_host parameter in this
case. I know this is confusing, but the main issue is that different
Globus installations handle the count parameter differently (sometimes
it is mapped to #nodes and sometimes to #cores).
Please let me know whether the processes_per_node workaround works.
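Concretely, the workaround means adding a processes_per_node entry to the resource dictionary from your script. This is a sketch only: the value "4" is an assumed cores-per-node figure for illustration, so substitute the actual core count of the eric1 nodes.

```python
import os

resource_list = []
resource_list.append({
    "resource_url": "gram://eric1.loni.org/jobmanager-pbs",
    "number_nodes": "2",
    "processes_per_node": "4",  # workaround: tell BJ the cores per node (assumed value)
    "allocation": None,
    "queue": "checkpt",
    "working_directory": (os.getcwd() + "/agent"),
    "walltime": 20,
})
print(resource_list[0]["processes_per_node"])
```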
Best,
Andre
----
> I tried the script without specifying processes_per_node, but when
> jd.number_of_processes (processors) > number_nodes (nodes) it causes
> trouble even though enough resources are available. The script just
> keeps polling and jobs are not scheduled or queued, as it doesn't find
> enough resources.
> For example
> resource_list.append({"resource_url" :
> "gram://eric1.loni.org/jobmanager-pbs",
> "number_nodes" : "2",
> "allocation" : None, "queue" : "checkpt",
> "working_directory": (os.getcwd() + "/agent"),
> "walltime":20 })
> resource_list.append({"resource_url" :
> "gram://poseidon1.loni.org/jobmanager-pbs",
> "number_nodes" : "2",
> "allocation" : None, "queue" : "checkpt",
> "working_directory": (os.getcwd() + "/agent"),
> "walltime":20 })
> each bigjob size - 8 processors
> for i in range(0, NUMBER_JOBS):
> jd = saga.job.description()
> jd.executable = "/bin/sleep "
> jd.arguments = ["10"]
> jd.number_of_processes = "4"  # if 2 or 1 is given it works,
> # so anything <= number_nodes works.
> jd.spmd_variation = "single"
> jd.working_directory = os.getcwd()
> jd.output = "stdout-" + str(i) + ".txt"
> jd.error = "stderr-" + str(i) + ".txt"
> subjob = mjs.create_job(jd)
> subjob.run()
> print "Submitted sub-job " + str(i) + "."
> jobs.append(subjob)
> job_start_times[subjob]=time.time()
> job_states[subjob] = subjob.get_state()
>