[Bigjob-users] Attempt to run example_manyjob_affinity.py
Paula Sanematsu
psanem1 at tigers.lsu.edu
Sat Dec 10 16:27:33 CST 2011
When I use ssh://hotel.futuregrid.org for "resource_url", I keep getting
this output over and over again.
DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
DEBUG:root:Big Job: bigjob:7a4c7d70-237c-11e1-83f2-002215124496:hotel.futuregrid.org Cores: 1/1 State: Unknown Terminated: False #Required Cores: 1
DEBUG:root:found no active resource for sub-job => (re-) queue it
DEBUG:root:free_cores: [0, 0] total_free_cores: 0
DEBUG:root:Reschedule Thread
DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
Current states: {'New': 2, 'Unknown': 6}
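
A minimal sketch, not part of BigJob, for spotting which pilot is holding things up in output like the above. It scans the scheduler's "Big Job:" debug lines and reports every pilot whose most recent state is something other than Running. The regular expression is inferred only from the log lines in this message, and the helper name stuck_pilots is made up for illustration.

```python
import re

# Pattern inferred from the "Big Job:" debug lines quoted in this message
# (an assumption; other BigJob versions may format the line differently).
PILOT_LINE = re.compile(
    r"Big Job:\s*(?P<pilot>bigjob:[0-9a-f-]+:\s*\S+)\s+"
    r"Cores:\s*(?P<free>\d+)/(?P<total>\d+)\s+"
    r"State:\s*(?P<state>\w+)"
)


def stuck_pilots(log_text):
    """Return {pilot_id: state} for pilots whose last reported state is not Running."""
    # Collapse newlines and repeated spaces so wrapped log lines still match.
    flat = " ".join(log_text.split())
    last_state = {}
    for match in PILOT_LINE.finditer(flat):
        pilot = match.group("pilot").replace(" ", "")
        last_state[pilot] = match.group("state")
    return {pilot: state for pilot, state in last_state.items() if state != "Running"}


if __name__ == "__main__":
    import sys

    # Usage: python example_manyjob_affinity.py 2>&1 | python this_script.py
    for pilot, state in stuck_pilots(sys.stdin.read()).items():
        print("{0} is in state {1}".format(pilot, state))
```

Run against the output above, this reports only the hotel.futuregrid.org pilot, in state Unknown. That matches the "found no active resource" message: the localhost pilot is Running but has no cores available, and the remote pilot never leaves Unknown.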
On Sat, Dec 10, 2011 at 4:05 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
> Paula,
>
> In which case you should be using "ssh".
>
> i.e., use:
>
> ssh://hotel.futuregrid.org in the resource_list.append("resource_url" :
>
> Shantenu
>
>> I don't think I have a grid certificate. When I do grid-proxy-init, I get
>>
>> -bash: grid-proxy-init: command not found
>>
>> Paula
>>
>> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>> Paula,
>>
>> Do you even have a grid certificate? If not you will have to use
>> non-globus solutions e.g., pbs-ssh.
>>
>> Shantenu
>>
>>
>> Hi Paula,
>> did you run grid-proxy-init? Could you set:
>>
>> export SAGA_VERBOSE=100
>>
>> in your shell and re-send the log, please?
>>
>> Thanks!
>> Andre
>>
>> On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu> wrote:
>> Hi Andre,
>>
>> for launching remote jobs you certainly want to use Globus, i.e. GRAM
>> URLs. The bigjob_agent key in the resource_list does not need to be
>> set anymore!
>>
>>
>>
>> When I have this configuration for hotel
>>
>> resource_list.append( {"resource_url" : "gram://hotel.futuregrid.org/jobmanager-pbs",
>>     "number_of_processes" : "1",
>>     "allocation" : "myAllocation", "queue" : "workq",
>>     "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>     "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>     "affinity" : "affinity1"})
>>
>> I get this error:
>>
>>
>> DEBUG:root:Utilizing ADVERT Backend
>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>> server_connect_url: None
>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>> DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>> DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost
>> ?
>>
>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>> DEBUG:root:set pilot state to: Unknown
>> Adaptor specific modifications: fork
>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>
>> use standard proxy
>> Submit pilot job to: fork://localhost/
>> DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>
>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Utilizing ADVERT Backend
>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>> server_connect_url: None
>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>> DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>> DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org
>> ?
>>
>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>> DEBUG:root:set pilot state to: Unknown
>> Adaptor specific modifications: gram
>> DEBUG:root:Escape RSL
>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>> use standard proxy
>> Traceback (most recent call last):
>>   File "example_manyjob_affinity.py", line 61, in <module>
>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>     self.__init_bigjobs()
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>     self.__start_bigjob(i)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>     ppn)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>     js = saga.job.service(lrms_saga_url)
>> bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>
>> DEBUG:root:Cancel re-scheduler thread
>> Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>> Cancel Pilot Job
>> stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> Cancel Pilot Job
>> stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>> DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
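
The BadParameter error in the log above shows the condor_job adaptor rejecting the 'gram' scheme, which suggests that no adaptor accepting gram:// URLs was available in that SAGA installation. A small pre-flight check along the following lines can surface that kind of mismatch before any pilot is started. This is a sketch only: it does not ask SAGA which adaptors are installed, the set of supported schemes is an assumption you would have to fill in for your own site, and check_scheme is a hypothetical helper name. The import is the Python 3 one; on the Python 2.7 used in this thread the equivalent is `from urlparse import urlparse`.

```python
from urllib.parse import urlparse


def check_scheme(resource_url, supported_schemes):
    """Return the URL scheme, or raise ValueError if it is not in supported_schemes.

    supported_schemes is supplied by the caller; this helper does not
    query SAGA for its installed adaptors.
    """
    scheme = urlparse(resource_url).scheme
    if scheme not in supported_schemes:
        raise ValueError(
            "resource_url {0!r} uses scheme {1!r}; expected one of {2}".format(
                resource_url, scheme, sorted(supported_schemes)
            )
        )
    return scheme


if __name__ == "__main__":
    # Assumed for illustration: only the fork and ssh schemes are usable here.
    usable = {"fork", "ssh"}
    for url in ("fork://localhost/", "ssh://hotel.futuregrid.org",
                "gram://hotel.futuregrid.org/jobmanager-pbs"):
        try:
            print(url, "->", check_scheme(url, usable))
        except ValueError as err:
            print("skipping:", err)
```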
>>
>> Please try to submit the job first from the hotel front node. Once you
>> are sure that Globus etc. is working you can do a remote submission.
>> E.g. you need to make sure that you use the right BJ version on hotel
>> as well (i.e. you need to double-check your PYTHONPATH etc.)
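
One way to double-check which BigJob version a given PYTHONPATH will pick up is to look for BigJob eggs on sys.path. The tracebacks in this thread show two different eggs in use (BigJob-0.3.2 from the system-wide SAGA install and BigJob-0.3.31 from the user's .bigjob directory), so the order matters. The sketch below is an illustration under the assumption that BigJob is installed as an egg named BigJob-<version>-py<X.Y>.egg, as in those tracebacks; bigjob_versions is a hypothetical helper, not a BigJob function.

```python
import re
import sys

# Matches egg directory names such as "BigJob-0.3.31-py2.7.egg".
EGG_VERSION = re.compile(r"BigJob-([0-9][0-9.]*)-py")


def bigjob_versions(paths=None):
    """Return the BigJob egg versions found on the path list, in search order."""
    if paths is None:
        paths = sys.path
    seen = []
    for entry in paths:
        match = EGG_VERSION.search(entry)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    versions = bigjob_versions()
    if not versions:
        print("no BigJob egg found on sys.path")
    else:
        # Python imports from the first matching entry on sys.path.
        print("BigJob eggs on sys.path, first wins:", versions)
```

Running this on both the submission host and the remote host, with the same environment the job uses, shows whether an older system-wide egg is shadowing the userspace install.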
>>
>>
>>
>> I think I have BigJob in Hotel working properly, at least
>> example_local_single.py and example_local_multiple worked. My
>> PYTHONPATH is
>>
>> /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>
>> Thanks,
>>
>> Paula
>>
>> Hope that helps.
>>
>> Best,
>> Andre
>>
>> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu> wrote:
>> Hi Andre,
>>
>>
>> Please change number_nodes to number_of_processes in line 51:
>>
>> resource_list.append( {"resource_url" : "fork://localhost/",
>>     "number_of_processes" : "2",
>>     "allocation" : "myAllocation",
>>     ...
>>
>>
>>
>> I changed number_nodes to number_of_processes and it now runs. However,
>> it seems like it is in an infinite loop, perhaps because my second
>> machine is not configured properly? This is the configuration for my
>> second machine:
>>
>> resource_list.append( {"resource_url" : "ssh://hotel.futuregrid.org",
>>     "number_of_processes" : "4", "allocation" : "myAllocation",
>>     "queue" : "workq",
>>     "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>     "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>     "affinity" : "affinity1"})
>>
>> Here, I used the bigjob_agent_launcher.sh in the directory written above
>> because it was the only place I could find it. Should I use something
>> else for "bigjob_agent"? Also, I'm not sure whether
>> ssh://hotel.futuregrid.org or gram://hotel.futuregrid.org/jobmanager-pbs
>> should be used.
>>
>> This is the output that keeps coming up:
>>
>> Current states: {'New': 2, 'Unknown': 6}
>> DEBUG:root:Reschedule Thread
>> DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>> DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>> DEBUG:root:found no active resource for sub-job => (re-) queue it
>> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>
>>
>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>> possible, try to install BJ in userspace (as outlined on the Wiki
>> page).
>>
>> It's fixed in BigJob-0.3.31 and SVN.
>>
>>
>>
>> I followed the instructions on the Wiki (b. Python Packaging and
>> Virtualenv) and did an update. It looks like I have BigJob-0.3.31.
>>
>> Thanks,
>>
>> Paula
>>
>>
>> On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow <aluckow at cct.lsu.edu> wrote:
>>
>> Hi Paula,
>> there is a small bug in the example:
>>
>> Please change number_nodes to number_of_processes in line 51:
>>
>> resource_list.append( {"resource_url" : "fork://localhost/",
>>     "number_of_processes" : "2",
>>     "allocation" : "myAllocation",
>>     ...
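
A mistyped or outdated key in a resource dictionary only shows up as a KeyError deep inside many_job.py, after the first pilot has already been set up. A quick check of the dictionaries before they are handed to many_job_affinity_service can catch this earlier. The sketch below is illustrative: missing_keys is a hypothetical helper, and the list of required keys is an assumption (only "number_of_processes" is confirmed by the KeyError in this thread; "resource_url" is included because every configuration shown here sets it).

```python
# Assumed required keys; only "number_of_processes" is confirmed by the
# KeyError reported in this thread.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def missing_keys(resource_list, required=REQUIRED_KEYS):
    """Return {index: [missing key, ...]} for entries that lack a required key."""
    problems = {}
    for index, entry in enumerate(resource_list):
        missing = [key for key in required if key not in entry]
        if missing:
            problems[index] = missing
    return problems


if __name__ == "__main__":
    resource_list = []
    # Old-style entry using the key that the example originally contained.
    resource_list.append({"resource_url": "fork://localhost/",
                          "number_nodes": "2",
                          "allocation": "myAllocation"})
    # Corrected entry.
    resource_list.append({"resource_url": "ssh://hotel.futuregrid.org",
                          "number_of_processes": "4",
                          "allocation": "myAllocation"})
    for index, keys in missing_keys(resource_list).items():
        print("resource_list[{0}] is missing: {1}".format(index, ", ".join(keys)))
```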
>>
>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>> possible, try to install BJ in userspace (as outlined on the Wiki
>> page).
>>
>> It's fixed in BigJob-0.3.31 and SVN.
>>
>> Best,
>> Andre
>>
>>
>> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu> wrote:
>> Hi,
>>
>> I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
>> complete (see my output below). I'm submitting the job from Sierra and
>> would like to use Hotel as my second machine. Could you please advise me
>> on how to proceed?
>>
>> In addition, is there anything wrong with the cct advert service? I
>> could run example_local_single.py, but now it's not working.
>>
>> Thanks,
>>
>> Paula
>>
>> ManyJob load test with 8 jobs.
>> Create manyjob service
>> DEBUG:root:start bigjob at: fork://localhost/
>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>> DEBUG:root:Utilizing ADVERT Backend
>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>> server_connect_url: None
>> DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>> Traceback (most recent call last):
>>   File "example_manyjob_affinity.py", line 61, in <module>
>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>     self.__init_bigjobs()
>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>     self.__start_bigjob(i)
>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>     bj_dict["number_of_processes"],
>> KeyError: 'number_of_processes'
>> Cancel Pilot Job
>> stop pilot job:
>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>> DEBUG:root:update state of pilot job to: Done Stopped: True
>> DEBUG:root:delete pilot job:
>>
>> _______________________________________________
>> Bigjob-users mailing list
>> Bigjob-users at mail.cct.lsu.edu
>> https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users