[Bigjob-users] LA-SiGMA Execution Management meeting summary (on Pilot-Jobs)

pradeep kumar Mantha pmanth2 at cct.lsu.edu
Fri Dec 2 17:36:13 CST 2011


Hi Everyone!

We are sorry for the EVO technical problems during today's meeting. Here
is a summary of the meeting for those who missed the presentation. The
presentation covers Pilot-Jobs and their usage on LONI & XSEDE machines.

*Problem statement:*
   1. With traditional PBS submission on moderately or heavily loaded
systems such as LONI or XSEDE, every job we submit incurs queue wait
time. A simulation consisting of a bunch of jobs therefore takes longer
to complete, since each job waits in the queue.
   2. It is difficult to monitor free HPC resources and submit jobs
using PBS alone. This is a manual process of logging on to each HPC
resource, checking the queue status, and then submitting the job.

*Solution:*
    A Pilot-Job is an effective abstraction for handling the above two
problems.
   A Pilot-Job is a container job to which a number of tasks are
assigned up front. The user's responsibility is only to specify the HPC
resources (possibly multiple) and their attributes (such as walltime,
ppn, working directory) to be used, and the set of jobs/tasks (number of
processes, spmd_variation, working directory, input, output, error) to
be executed.
   Once the Pilot-Job becomes active, it takes care of assigning jobs by
distributing them to free slots among the requested resources, as the
sketch below illustrates.
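   As an illustration, here is a minimal sketch of a Pilot-Job
description in Python, modeled on the example_manyjob_local.py script
linked later in this mail. The module and attribute names
(many_job_service, resource_url, and so on) are assumptions based on
that example and may differ between BigJob versions; the resource
endpoint, queue, and allocation shown are hypothetical.

import os
import saga
from many_job import many_job_service

# Describe the HPC resource(s) the pilot should occupy; several
# dictionaries in this list would mean several resources.
resource_list = [{
    "resource_url": "gram://qb1.loni.org/jobmanager-pbs",  # hypothetical
    "number_nodes": "2",
    "processes_per_node": "4",
    "allocation": None,
    "queue": "workq",
    "working_directory": os.getcwd() + "/agent",
    "walltime": 60,
}]

# Start the pilot: it waits in the batch queue once, then hosts all
# tasks. The second argument is the coordination/advert server, which
# is deployment-specific.
mjs = many_job_service(resource_list, None)

# Describe one task/subjob to run inside the pilot.
jd = saga.job.description()
jd.executable = "/bin/date"
jd.number_of_processes = "1"
jd.spmd_variation = "single"
jd.working_directory = os.getcwd()
jd.output = "stdout.txt"
jd.error = "stderr.txt"

# Submit the task; BigJob places it on a free slot inside the pilot.
sj = mjs.create_job(jd)
sj.run()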

*Advantages:*
   The Pilot-Job uses resources effectively in several ways:
    1. Scaling simulations over multiple HPC resources is easy.
    2. It allows execution of jobs without queuing each individual job.
    3. There is no waiting time between tasks, i.e. when a job
completes, the next job is scheduled without waiting in the queue. This
continues until all requested tasks are completed or the requested
walltime expires (see the polling sketch just below this list).
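   To make the third point concrete, here is a sketch continuing the
example above: submit more tasks than the pilot has slots, and the pilot
backfills them as earlier tasks finish. The get_state/cancel calls
follow the referenced example script; the exact state names are
assumptions.

import time

subjobs = []
for i in range(45):
    sj = mjs.create_job(jd)   # jd as in the sketch above
    sj.run()
    subjobs.append(sj)

# Poll until every task reaches a terminal state. Tasks beyond the
# pilot's capacity start as slots free up, with no extra queue wait.
while True:
    states = [sj.get_state() for sj in subjobs]
    if all(s in ("Done", "Failed") for s in states):
        break
    time.sleep(10)

mjs.cancel()   # release the pilot's nodes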

*Types of Pilot-Jobs:*
   1. ManyJob
   2. BigJob

 *Requirements of Pilot-Jobs:*
    ManyJob needs Python (>= 2.4) and passwordless ssh/gsissh
authentication between the host machine and the remote cluster head
nodes, set up via SSH keys.
     BigJob requires SAGA, the SAGA dependencies (such as boost and
postgresql), the SAGA Python bindings, and Python (preferably >= 2.4).
A quick import check is sketched below.
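    One quick way to verify a BigJob installation is to try the imports
from Python. The module names saga and bigjob are assumptions; adjust
them to your installation.

import saga    # SAGA Python bindings
import bigjob  # BigJob framework
print("SAGA and BigJob imports OK")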

*Questions asked*

*BigJob questions:*

1. Can BigJob perform the input data transfer required to the remote
HPC cluster where the job needs to be executed?
      No. Input data transfer is not part of the BigJob framework. The
application that uses BigJob should perform the data transfer itself,
using file-transfer tools such as scp or GridFTP (see the sketch below).
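      For example, an application could stage its inputs with scp before
submitting tasks. A minimal sketch; the user name, host, and paths are
hypothetical, and passwordless SSH keys are assumed to be set up.

import subprocess

def stage_input(local_path, user, host, remote_dir):
    # Copy one input file to the remote working directory via scp.
    subprocess.check_call(
        ["scp", local_path, "%s@%s:%s/" % (user, host, remote_dir)])

stage_input("input_01.dat", "username", "qb1.loni.org",
            "/work/username/run")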

2. Can BigJob be used only on eric?
     No. It can be installed and executed on any HPC cluster that has
SAGA, its dependencies, and its Python bindings.
     Please see example
       -
https://svn.cct.lsu.edu/repos/saga-projects/applications/bigjob/trunk/generic/examples/example_manyjob_local.py

3. I want to run a job for each input file. For example, there are 45
input files; each file requires 20 hours and 1 node to process, and the
HPC resource has 4 processes per node. The maximum walltime allowed is
48 hours and the maximum allocation is 128 cores. Requesting more nodes
involves more initial queue waiting. What is the optimal way to use
BigJob to run all the jobs within 40 hours?
     There are several ways to handle this use case (the arithmetic is
worked out in the sketch after this list):
      1. Use a "single" BigJob to request 45 nodes (180 processors) for
20 hours of walltime, which would complete all jobs in 20 hours. But 180
processors exceed the 128-core limit, and a lot of initial queue waiting
would be involved.
      2. Use "two" BigJobs, each requesting 23 nodes (92 processors) for
20 hours of walltime. The first completes 23 jobs in 20 hours and the
second completes the remaining 22 jobs in 20 hours, so all jobs finish
in 20 hours with reduced initial queue waiting (92 processors are easier
to obtain).
      3. Use a "single" BigJob to request 23 nodes for 40 hours of
walltime, which schedules the first 23 jobs in parallel. As they
complete, the remaining 22 jobs are scheduled automatically with no
queue waiting in between, so all jobs finish in 40 hours.
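      The arithmetic behind these options, as a small Python sketch
(values taken from the question: 45 single-node tasks of 20 hours each,
4 processes per node):

import math

tasks, hours_per_task, ppn = 45, 20, 4

def plan(nodes):
    # Tasks run in sequential "waves" of at most `nodes` at a time.
    waves = int(math.ceil(tasks / float(nodes)))
    return nodes * ppn, waves * hours_per_task  # (cores, hours total)

print(plan(45))  # option 1: (180, 20) - 180 cores exceed the 128 limit
print(plan(23))  # option 3: (92, 40)  - one pilot, two waves of tasks
                 # option 2 runs two such pilots side by side: 20 hours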

*ManyJob questions:*
   1. Can we schedule ManyJob to start a single job only after 3 other
jobs have completed their execution, if their execution times vary?
       No. We cannot define multiple dependencies for a single job.
   2. Can ManyJob perform the input data transfer required to the remote
HPC cluster where the job needs to be executed?
       Yes.

Manuals & Tutorial:
    1. ManyJob - (http://dna.engr.latech.edu/ManyJobs/)
    2. BigJob - (http://faust.cct.lsu.edu/trac/bigjob/)

Please send your questions & use cases to:
   BigJob     -  bigjob-users at cct.lsu.edu
   ManyJob  -  rajibm at latech.edu , fuji at tulane.edu, bishop at latech.edu

Please find today's tutorial at
Dec2Tutorial_V2.ppt<https://wiki.cct.lsu.edu/lasigma-execution-management/images/2/23/Dec2Tutorial_V2.ppt>
  and for more details refer to the BigJob & ManyJob websites or write
to the addresses mentioned above.

 thanks
pradeep