[Bigjob-users] Attempt to run example_manyjob_affinity.py

Paula Sanematsu psanem1 at tigers.lsu.edu
Sat Dec 10 16:01:22 CST 2011


I don't think I have a grid certificate. When I do grid-proxy-init, I get

-bash: grid-proxy-init: command not found
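
A quick way to confirm from Python whether the Globus client tools are on the PATH at all is sketched below. This is only an illustration: the helper name globus_proxy_tools_available is made up for this note and is not part of BigJob or SAGA, and shutil.which needs Python 3.3 or newer (the thread itself uses Python 2.7).

```python
import shutil


def globus_proxy_tools_available(command="grid-proxy-init"):
    """Return True if `command` can be found on the current PATH.

    Hypothetical helper for illustration; it only checks that the
    executable exists, not that a valid grid certificate is installed.
    """
    return shutil.which(command) is not None


if globus_proxy_tools_available():
    print("grid-proxy-init found; a proxy certificate can be requested")
else:
    print("grid-proxy-init not found; a non-Globus resource URL is needed")
```

If the command is missing, as in the shell output above, gram:// URLs are unlikely to work from that machine.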

Paula

On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:

> Paula,
>
> Do you even have a grid certificate? If not, you will have to use
> non-Globus solutions, e.g., pbs-ssh.
>
> Shantenu
>
>
>> Hi Paula,
>> did you run grid-proxy-init? Could you set:
>>
>> export SAGA_VERBOSE=100
>>
>> in your shell and re-send the log, please?
>>
>> Thanks!
>> Andre
>>
>> On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
>> <psanem1 at tigers.lsu.edu> wrote:
>>
>>> Hi Andre,
>>>
>>>> for launching remote jobs you certainly want to use Globus, i.e. GRAM
>>>> URLs. The bigjob_agent key in the resource_list does not need to be
>>>> set anymore!
>>>>
>>>
>>>
>>> When I have this configuration for hotel
>>>
>>> resource_list.append( {"resource_url" :
>>> "gram://hotel.futuregrid.org/jobmanager-pbs",
>>> "number_of_processes" : "1",
>>> "allocation" : "myAllocation", "queue" : "workq", "bigjob_agent":
>>> ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>> "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>> "affinity" : "affinity1"})
>>>
>>> I get this error:
>>>
>>>
>>> DEBUG:root:Utilizing ADVERT Backend
>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>> DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>>> DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>>>
>>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>> DEBUG:root:set pilot state to: Unknown
>>> Adaptor specific modifications: fork
>>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>>
>>> use standard proxy
>>> Submit pilot job to: fork://localhost/
>>> DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>>
>>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Utilizing ADVERT Backend
>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>> DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>>> DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>>>
>>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>> DEBUG:root:set pilot state to: Unknown
>>> Adaptor specific modifications: gram
>>> DEBUG:root:Escape RSL
>>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>> use standard proxy
>>>
>>> Traceback (most recent call last):
>>>   File "example_manyjob_affinity.py", line 61, in <module>
>>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>     self.__init_bigjobs()
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>     self.__start_bigjob(i)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>>     ppn)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>>     js = saga.job.service(lrms_saga_url)
>>> bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>>
>>> DEBUG:root:Cancel re-scheduler thread
>>> Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>>> Cancel Pilot Job
>>> stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> Cancel Pilot Job
>>> stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>> DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>>
>>>> Please try to submit the job first from the hotel front node. Once you
>>>> are sure that Globus etc. is working you can do a remote submission.
>>>> E.g., you need to make sure that you use the right BJ version on hotel
>>>> as well (i.e., you need to double-check your PYTHONPATH etc.)
>>>>
>>>
>>>
>>> I think I have BigJob on Hotel working properly; at least
>>> example_local_single.py and example_local_multiple worked. My PYTHONPATH is:
>>>
>>> /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>>
>>> Thanks,
>>>
>>> Paula
>>>
>>>> Hope that helps.
>>>>
>>>> Best,
>>>> Andre
>>>>
>>>> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu>
>>>> wrote:
>>>>
>>>>> Hi Andre,
>>>>>
>>>>>
>>>>>> Please change number_nodes to number_of_processes in line 51:
>>>>>>
>>>>>> resource_list.append( {"resource_url" : "fork://localhost/",
>>>>>> "number_of_processes" : "2", "allocation" : "myAllocation", ...
>>>>>>
>>>>>
>>>>>
>>>>> I changed number_nodes to number_of_processes and it now runs. However, it
>>>>> seems to be stuck in an infinite loop, perhaps because my second machine is
>>>>> not configured properly? This is the configuration for my second machine:
>>>>>
>>>>> resource_list.append( {"resource_url" : "ssh://hotel.futuregrid.org",
>>>>> "number_of_processes" : "4", "allocation" :  "myAllocation", "queue" :
>>>>> "workq", "bigjob_agent":
>>>>> ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>>>> "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>>>> "affinity" : "affinity1"})
>>>>>
>>>>> Here, I used the bigjob_agent_launcher.sh in the directory written above
>>>>> because it was the only place I could find it. Should I use something else
>>>>> for "bigjob_agent"? Also, I'm not sure whether
>>>>> ssh://hotel.futuregrid.org or
>>>>> gram://hotel.futuregrid.org/jobmanager-pbs should be used.
>>>>>
>>>>> This is the output that keeps coming up:
>>>>>
>>>>> Current states: {'New': 2, 'Unknown': 6}
>>>>> DEBUG:root:Reschedule Thread
>>>>> DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>>>>> DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>>>>> DEBUG:root:found no active resource for sub-job => (re-) queue it
>>>>> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>>>>
>>>>>
>>>>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>>>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>>>>> page).
>>>>>>
>>>>>> It's fixed in BigJob-0.3.31 and SVN.
>>>>>>
>>>>>
>>>>>
>>>>> I followed the instructions on the Wiki (b. Python Packaging and
>>>>> Virtualenv) and did an update. It looks like I have BigJob-0.3.31.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Paula
>>>>>
>>>>>
>>>>> On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow <aluckow at cct.lsu.edu>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hi Paula,
>>>>>> there is a small bug in the example:
>>>>>>
>>>>>> Please change number_nodes to number_of_processes in line 51:
>>>>>>
>>>>>> resource_list.append( {"resource_url" : "fork://localhost/",
>>>>>> "number_of_processes" : "2", "allocation" : "myAllocation", ...
>>>>>>
>>>>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>>>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>>>>> page).
>>>>>>
>>>>>> It's fixed in BigJob-0.3.31 and SVN.
>>>>>>
>>>>>> Best,
>>>>>> Andre
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu
>>>>>> <psanem1 at tigers.lsu.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
>>>>>>> complete (see my output below). I'm submitting the job from Sierra and
>>>>>>> would like to use Hotel as my second machine. Could you please advise me
>>>>>>> on how to proceed?
>>>>>>>
>>>>>>> In addition, is there anything wrong with the cct advert service? I could
>>>>>>> run example_local_single.py, but now it's not working.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Paula
>>>>>>>
>>>>>>> ManyJob load test with 8 jobs.
>>>>>>> Create manyjob service
>>>>>>> DEBUG:root:start bigjob at: fork://localhost/
>>>>>>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>>>>>> DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>>>>>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>>>>>>> DEBUG:root:Utilizing ADVERT Backend
>>>>>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>>>>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>>>>>> DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "example_manyjob_affinity.py", line 61, in <module>
>>>>>>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>>>>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>>>>>     self.__init_bigjobs()
>>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>>>>>     self.__start_bigjob(i)
>>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>>>>>>     bj_dict["number_of_processes"],
>>>>>>> KeyError: 'number_of_processes'
>>>>>>> Cancel Pilot Job
>>>>>>> stop pilot job:
>>>>>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>>>>>>> DEBUG:root:update state of pilot job to: Done Stopped: True
>>>>>>> DEBUG:root:delete pilot job:
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bigjob-users mailing list
>>>>>>> Bigjob-users at mail.cct.lsu.edu
>>>>>>> https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>
>>
>
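
The KeyError: 'number_of_processes' quoted earlier in this thread came from a resource dictionary that used a different key name. A small pre-flight check along the following lines could catch that before the manyjob service is created. This is a sketch only: check_resource_list and REQUIRED_KEYS are hypothetical names, not part of BigJob, and the set of required keys is an assumption based solely on the keys that appear in this thread.

```python
# Assumption: only these two keys are treated as mandatory here; the
# thread confirms that number_of_processes is read by many_job.py, and
# resource_url appears in every configuration shown above.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def check_resource_list(resource_list, required=REQUIRED_KEYS):
    """Return {index: [missing keys]} for entries lacking a required key."""
    problems = {}
    for index, resource in enumerate(resource_list):
        missing = [key for key in required if key not in resource]
        if missing:
            problems[index] = missing
    return problems


resource_list = [
    {"resource_url": "fork://localhost/", "number_of_processes": "2"},
    # Old key name, as in the example before the fix discussed above.
    {"resource_url": "fork://localhost/", "number_nodes": "2"},
]

for index, missing in check_resource_list(resource_list).items():
    print("resource %d is missing: %s" % (index, ", ".join(missing)))
```

Running the check before constructing the service turns a traceback deep inside many_job.py into a one-line message that names the offending entry.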