[Bigjob-users] Attempt to run example_manyjob_affinity.py

Shantenu Jha sjha at cct.lsu.edu
Sat Dec 10 16:05:26 CST 2011


Paula,

> I don't think I have a grid certificate. When I do grid-proxy-init, I get

In which case you should be using "ssh".

i.e., use:

"ssh://hotel.futuregrid.org" as the "resource_url" value in the resource_list.append(...) call for Hotel.
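
A minimal sketch of the two resource_list entries, assuming the dictionary keys quoted further down in this thread ("resource_url", "number_of_processes", "allocation", "queue", "working_directory", "walltime", "affinity"). The helper name make_resource is hypothetical and only for illustration. The allocation, queue and process counts are the values from the quoted messages and may not fit your account. The "bigjob_agent" key is left out because, per Andre's note quoted below, it no longer needs to be set. The snippet only builds the list; it does not import BigJob or submit anything:

```python
import os


def make_resource(resource_url, processes, affinity, queue="workq"):
    """Build one resource_list entry (hypothetical helper, not part of BigJob)."""
    return {
        "resource_url": resource_url,
        # The thread passes the process count as a string.
        "number_of_processes": str(processes),
        "allocation": "myAllocation",
        "queue": queue,
        "working_directory": os.path.join(os.getcwd(), "agent"),
        "walltime": 10,
        "affinity": affinity,
    }


resource_list = []
# Local pilot on the submission host.
resource_list.append(make_resource("fork://localhost/", 2, "affinity1"))
# Remote pilot on Hotel via ssh, since no grid certificate is available for GRAM.
resource_list.append(make_resource("ssh://hotel.futuregrid.org", 4, "affinity1"))
```

The resulting resource_list would then be passed to many_job_affinity_service(resource_list, COORDINATION_URL), as in line 61 of the example.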

Shantenu






> 
> -bash: grid-proxy-init: command not found
> 
> Paula
> 
> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>       Paula,
>
>       Do you even have a grid certificate? If not you will have to use
>       non-globus solutions e.g., pbs-ssh.
>
>       Shantenu
> 
>
>       Hi Paula,
>       did you run grid-proxy-init? Could you set:
>
>       export SAGA_VERBOSE=100
>
>       in your shell and re-send the log, please?
>
>       Thanks!
>       Andre
>
>       On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
>       <psanem1 at tigers.lsu.edu> wrote:
>             Hi Andre,
>
>                   for launching remote jobs you certainly want to use Globus,
>                   i.e. GRAM URLs. The bigjob_agent key in the resource_list
>                   does not need to be set anymore!
> 
> 
>
>             When I have this configuration for hotel
>
>             resource_list.append({
>                 "resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs",
>                 "number_of_processes": "1",
>                 "allocation": "myAllocation",
>                 "queue": "workq",
>                 "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>                 "working_directory": os.getcwd() + "/agent",
>                 "walltime": 10,
>                 "affinity": "affinity1",
>             })
>
>             I get this error:
> 
>
>             DEBUG:root:Utilizing ADVERT Backend
>             DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>             DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>             DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>             DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>             DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>             DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>
>             DEBUG:root:update state of pilot job to: Unknown Stopped: False
>             DEBUG:root:set pilot state to: Unknown
>             Adaptor specific modifications: fork
>             Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>
>             use standard proxy
>             Submit pilot job to: fork://localhost/
>             DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>
>             DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>             DEBUG:root:Utilizing ADVERT Backend
>             DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>             DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>             DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>             DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>             DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>             DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>
>             DEBUG:root:update state of pilot job to: Unknown Stopped: False
>             DEBUG:root:set pilot state to: Unknown
>             Adaptor specific modifications: gram
>             DEBUG:root:Escape RSL
>             Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>             use standard proxy
>
>             Traceback (most recent call last):
>               File "example_manyjob_affinity.py", line 61, in <module>
>                 mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>               File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>                 super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>               File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>                 self.__init_bigjobs()
>               File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>                 self.__start_bigjob(i)
>               File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>                 ppn)
>               File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>                 js = saga.job.service(lrms_saga_url)
>             bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>
>             DEBUG:root:Cancel re-scheduler thread
>             Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>             Cancel Pilot Job
>             stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>             DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>             Cancel Pilot Job
>             stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>             DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>
>                   Please try to submit the job first from the hotel front node.
>                   Once you are sure that Globus etc. is working, you can do a
>                   remote submission. E.g., you need to make sure that you use
>                   the right BJ version on hotel as well (i.e., you need to
>                   double-check your PYTHONPATH etc.)
> 
> 
>
>             I think I have BigJob in Hotel working properly, at least
>             example_local_single.py and example_local_multiple worked. My
>             PYTHONPATH is
>
>             /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>
>             Thanks,
>
>             Paula
>
>                   Hope that helps.
>
>                   Best,
>                   Andre
>
>                   On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu
>                   <psanem1 at tigers.lsu.edu>
>                   wrote:
>                         Hi Andre,
> 
>
>                               Please change number_nodes to
>                               number_of_processes in line 51:
>
>                               resource_list.append(
>                               {"resource_url" :
>                               "fork://localhost/",
>                               "number_of_processes" : "2",
>                               "allocation" : "myAllocation",
>                               ...
> 
> 
>
>                         I changed number_nodes to number_of_processes and it now runs.
>                         However, it seems like it is in an infinite loop, perhaps
>                         because my second machine is not configured properly? This is
>                         the configuration for my second machine:
>
>                         resource_list.append({
>                             "resource_url": "ssh://hotel.futuregrid.org",
>                             "number_of_processes": "4",
>                             "allocation": "myAllocation",
>                             "queue": "workq",
>                             "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>                             "working_directory": os.getcwd() + "/agent",
>                             "walltime": 10,
>                             "affinity": "affinity1",
>                         })
>
>                         Here, I used the bigjob_agent_launcher.sh in the directory
>                         written above because it was the only place I could find it.
>                         Should I use something else for "bigjob_agent"? Also, I'm not
>                         sure whether ssh://hotel.futuregrid.org or
>                         gram://hotel.futuregrid.org/jobmanager-pbs should be used.
>
>                         This is the output that keeps coming up:
>
>                         Current states: {'New': 2, 'Unknown': 6}
>                         DEBUG:root:Reschedule Thread
>                         DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>                         DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>                         DEBUG:root:found no active resource for sub-job => (re-) queue it
>                         DEBUG:root:free_cores: [0, 0] total_free_cores: 0
> 
>
>                               Also, the BJ part of CSA is a bit
>                               old and contains some bugs. If
>                               possible, try to install BJ in
>                               userspace (as outlined on the
>                               Wiki
>                               page).
>
>                               It's fixed in BigJob-0.3.31 and
>                               SVN.
> 
> 
>
>                         I followed the instructions on the Wiki (b. Python Packaging
>                         and Virtualenv) and did an update. It looks like I have
>                         BigJob-0.3.31.
>
>                         Thanks,
>
>                         Paula
> 
>
>                         On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow
>                         <aluckow at cct.lsu.edu>
>                         wrote:
>
>                               Hi Paula,
>                               there is a small bug in the
>                               example:
>
>                               Please change number_nodes to
>                               number_of_processes in line 51:
>
>                               resource_list.append(
>                               {"resource_url" :
>                               "fork://localhost/",
>                               "number_of_processes" : "2",
>                               "allocation" : "myAllocation",
>                               ...
>
>                               Also, the BJ part of CSA is a bit
>                               old and contains some bugs. If
>                               possible, try to install BJ in
>                               userspace (as outlined on the
>                               Wiki
>                               page).
>
>                               It's fixed in BigJob-0.3.31 and
>                               SVN.
>
>                               Best,
>                               Andre
> 
>
>                               On Thu, Dec 8, 2011 at 9:42 PM,
>                               Paula Sanematsu
>                               <psanem1 at tigers.lsu.edu>
>                               wrote:
>                                     Hi,
>
>                                     I'm trying to run example_manyjob_affinity.py on Sierra, but it
>                                     doesn't complete (see my output below). I'm submitting the job
>                                     from Sierra and would like to use Hotel as my second machine.
>                                     Could you please advise me on how to proceed?
>
>                                     In addition, is there anything wrong with the cct advert service?
>                                     I could run example_local_single.py, but now it's not working.
>
>                                     Thanks,
>
>                                     Paula
>
>                                     ManyJob load test with 8 jobs.
>                                     Create manyjob service
>                                     DEBUG:root:start bigjob at: fork://localhost/
>                                     DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>                                     DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>                                     '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>                                     '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>                                     '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>                                     '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>                                     DEBUG:root:Utilizing ADVERT Backend
>                                     DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>                                     DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>                                     DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>                                     Traceback (most recent call last):
>                                       File "example_manyjob_affinity.py", line 61, in <module>
>                                         mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>                                       File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>                                         super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>                                       File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>                                         self.__init_bigjobs()
>                                       File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>                                         self.__start_bigjob(i)
>                                       File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>                                         bj_dict["number_of_processes"],
>                                     KeyError: 'number_of_processes'
>                                     Cancel Pilot Job
>                                     stop pilot job:
>                                     DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>                                     DEBUG:root:update state of pilot job to: Done Stopped: True
>                                     DEBUG:root:delete pilot job:
>
>                                     _______________________________________________
>                                     Bigjob-users mailing
>                                     list
>                                     Bigjob-users at mail.cct.lsu.edu
>                                     https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users

