[Bigjob-users] Attempt to run example_manyjob_affinity.py

Paula Sanematsu psanem1 at tigers.lsu.edu
Sat Dec 10 15:44:19 CST 2011


Hi Andre,

> for launching remote jobs you certainly want to use Globus, i.e. GRAM
> URLs. The bigjob_agent key in the resource_list does not need to be
> set anymore!
>

When I have this configuration for hotel

resource_list.append({
    "resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs",
    "number_of_processes": "1",
    "allocation": "myAllocation",
    "queue": "workq",
    "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
    "working_directory": os.getcwd() + "/agent",
    "walltime": 10,
    "affinity": "affinity1",
})

I get this error:

DEBUG:root:Utilizing ADVERT Backend
DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/(DB: )
DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
DEBUG:root:update state of pilot job to: Unknown Stopped: False
DEBUG:root:set pilot state to: Unknown
Adaptor specific modifications: fork
Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
use standard proxy
Submit pilot job to: fork://localhost/
DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
DEBUG:root:Utilizing ADVERT Backend
DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/(DB: )
DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
DEBUG:root:update state of pilot job to: Unknown Stopped: False
DEBUG:root:set pilot state to: Unknown
Adaptor specific modifications: gram
DEBUG:root:Escape RSL
Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
use standard proxy
Traceback (most recent call last):
  File "example_manyjob_affinity.py", line 61, in <module>
    mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
  File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
    super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
  File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
    self.__init_bigjobs()
  File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
    self.__start_bigjob(i)
  File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
    ppn)
  File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
    js = saga.job.service(lrms_saga_url)
bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.

DEBUG:root:Cancel re-scheduler thread
Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
Cancel Pilot Job
stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
Cancel Pilot Job
stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
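
As a side note, a small pre-flight check along these lines might catch a misspelled key or a broken resource_url before any pilot job is submitted. This is only a sketch: check_resource, REQUIRED_KEYS and the allowed_schemes default are hypothetical names and values picked from the entries quoted in this thread, not part of BigJob or SAGA, and which URL schemes actually work depends on the SAGA adaptors available on the submitting machine.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.7, as used in this thread

# Assumption: these are the keys seen in the resource_list entries in this
# thread; the authoritative set is whatever BigJob itself reads.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def check_resource(entry, allowed_schemes=("fork", "ssh", "gram")):
    """Return a list of human-readable problems found in one entry."""
    problems = []
    for key in REQUIRED_KEYS:
        if key not in entry:
            problems.append("missing key: " + key)

    url = entry.get("resource_url", "")
    # A line break pasted into the middle of a URL is easy to miss by eye.
    if any(c.isspace() for c in url):
        problems.append("resource_url contains whitespace: %r" % url)

    scheme = urlparse(url.strip()).scheme
    if url and scheme not in allowed_schemes:
        problems.append("unexpected URL scheme: %r" % scheme)
    return problems
```

Calling check_resource(entry) on each dictionary before building the many-job service returns an empty list when nothing looks wrong, so the results can simply be printed or asserted on.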

> Please try to submit the job first from the hotel front node. Once you
> are sure that Globus etc. is working you can do a remote submission.
> E.g you need to make sure that you use the right BJ version on hotel
> as well (ie. you need to double-check your PYTHONPATH etc.)
>

I think BigJob is working properly on Hotel; at least
example_local_single.py and example_local_multiple worked. My PYTHONPATH is

/gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
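
One way to double-check which copy of a package Python actually picks up is to list the PYTHONPATH entries in order and ask the interpreter where the module was loaded from. The sketch below assumes only the standard library; pythonpath_entries and locate_module are hypothetical helper names, and "bigjob" is used as the module name because that package directory appears in the tracebacks in this thread.

```python
import importlib
import os


def pythonpath_entries(value=None):
    """Split a PYTHONPATH-style string into its non-empty entries, in order."""
    if value is None:
        value = os.environ.get("PYTHONPATH", "")
    return [p for p in value.split(os.pathsep) if p]


def locate_module(name):
    """Return the file a module is imported from, or None if the import fails."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__file__", None)


if __name__ == "__main__":
    for entry in pythonpath_entries():
        print(entry)
    print("bigjob imported from: %s" % locate_module("bigjob"))
```

Running this on both the submitting machine and the remote front node shows whether the same BigJob egg is found on each.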

Thanks,

Paula

> Hope that helps.
>
> Best,
> Andre
>
> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu>
> wrote:
> > Hi Andre,
> >
> >>
> >> Please change number_nodes to number_of_processes in line 51:
> >>
> >> resource_list.append( {"resource_url" : "fork://localhost/",
> >> "number_of_processes" : "2", "allocation" : "myAllocation", ...
> >
> >
> > I changed number_nodes to number_of_processes and it now runs. However, it
> > seems to be stuck in an infinite loop, perhaps because my second machine is
> > not configured properly? This is the configuration for my second machine:
> >
> > resource_list.append({
> >     "resource_url": "ssh://hotel.futuregrid.org",
> >     "number_of_processes": "4",
> >     "allocation": "myAllocation",
> >     "queue": "workq",
> >     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
> >     "working_directory": os.getcwd() + "/agent",
> >     "walltime": 10,
> >     "affinity": "affinity1",
> > })
> >
> > Here, I used the bigjob_agent_launcher.sh in the directory written above
> > because it was the only place I could find it. Should I use something else
> > for "bigjob_agent"? Also, I'm not sure whether ssh://hotel.futuregrid.org or
> > gram://hotel.futuregrid.org/jobmanager-pbs should be used.
> >
> > This is the output that keeps coming up:
> >
> > Current states: {'New': 2, 'Unknown': 6}
> > DEBUG:root:Reschedule Thread
> > DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
> > DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
> > DEBUG:root:found no active resource for sub-job => (re-) queue it
> > DEBUG:root:free_cores: [0, 0] total_free_cores: 0
> >
> >>
> >> Also, the BJ part of CSA is a bit old and contains some bugs. If
> >> possible, try to install BJ in userspace (as outlined on the Wiki
> >> page).
> >>
> >> It's fixed in BigJob-0.3.31 and SVN.
> >
> >
> > I followed the instructions on the Wiki (b. Python Packaging and Virtualenv)
> > and did an update. It looks like I have BigJob-0.3.31.
> >
> > Thanks,
> >
> > Paula
> >
> >
> > On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow <aluckow at cct.lsu.edu> wrote:
> >>
> >> Hi Paula,
> >> there is a small bug in the example:
> >>
> >> Please change number_nodes to number_of_processes in line 51:
> >>
> >> resource_list.append( {"resource_url" : "fork://localhost/",
> >> "number_of_processes" : "2", "allocation" : "myAllocation", ...
> >>
> >> Also, the BJ part of CSA is a bit old and contains some bugs. If
> >> possible, try to install BJ in userspace (as outlined on the Wiki
> >> page).
> >>
> >> It's fixed in BigJob-0.3.31 and SVN.
> >>
> >> Best,
> >> Andre
> >>
> >>
> >> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu>
> >> wrote:
> >> > Hi,
> >> >
> >> > I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
> >> > complete (see my output below). I'm submitting the job from Sierra and would
> >> > like to use Hotel as my second machine. Could you please advise me on how to
> >> > proceed?
> >> >
> >> > In addition, is there anything wrong with the cct advert service? I could
> >> > run example_local_single.py, but now it's not working.
> >> >
> >> > Thanks,
> >> >
> >> > Paula
> >> >
> >> > ManyJob load test with 8 jobs.
> >> > Create manyjob service
> >> > DEBUG:root:start bigjob at: fork://localhost/
> >> > DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
> >> > DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
> >> > '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> >> > '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> >> > '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
> >> > '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
> >> > DEBUG:root:Utilizing ADVERT Backend
> >> > DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
> >> > DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
> >> > DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
> >> > Traceback (most recent call last):
> >> >   File "example_manyjob_affinity.py", line 61, in <module>
> >> >     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
> >> >   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
> >> >     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
> >> >   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
> >> >     self.__init_bigjobs()
> >> >   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
> >> >     self.__start_bigjob(i)
> >> >   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
> >> >     bj_dict["number_of_processes"],
> >> > KeyError: 'number_of_processes'
> >> > Cancel Pilot Job
> >> > stop pilot job:
> >> > DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
> >> > DEBUG:root:update state of pilot job to: Done Stopped: True
> >> > DEBUG:root:delete pilot job:
> >> >
> >> > _______________________________________________
> >> > Bigjob-users mailing list
> >> > Bigjob-users at mail.cct.lsu.edu
> >> > https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
> >> >
> >
> >
>