[Bigjob-users] Attempt to run example_manyjob_affinity.py

Shantenu Jha sjha at cct.lsu.edu
Sat Dec 10 15:53:40 CST 2011


Paula,

Do you even have a grid certificate? If not, you will have to use a
non-Globus solution, e.g., pbs-ssh.

Shantenu
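
A minimal sketch of what a non-Globus entry in resource_list might look
like. The "pbs-ssh://" scheme spelling is an assumption taken from the
wording above and has not been checked against a particular BigJob
release. The remaining keys are copied from the configuration quoted
below; bigjob_agent is left out because, as Andre notes, it no longer
needs to be set.

```python
import os

# Hypothetical entry: the "pbs-ssh" URL scheme is assumed, not verified.
# All other keys and values mirror the configuration quoted in this thread.
hotel_entry = {
    "resource_url": "pbs-ssh://hotel.futuregrid.org",
    "number_of_processes": "1",
    "allocation": "myAllocation",
    "queue": "workq",
    "working_directory": os.path.join(os.getcwd(), "agent"),
    "walltime": 10,
    "affinity": "affinity1",
}

resource_list = []
resource_list.append(hotel_entry)
```

No grid proxy should be needed for this route, but passwordless ssh to
the target host is presumably required.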

> Hi Paula,
> Did you run grid-proxy-init? Could you set:
>
> export SAGA_VERBOSE=100
>
> in your shell and re-send the log, please?
>
> Thanks!
> Andre
>
> On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
> <psanem1 at tigers.lsu.edu> wrote:
>> Hi Andre,
>>
>>> For launching remote jobs you certainly want to use Globus, i.e., GRAM
>>> URLs. The bigjob_agent key in the resource_list does not need to be
>>> set anymore!
>>
>>
>> When I have this configuration for hotel:
>>
>> resource_list.append({
>>     "resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs",
>>     "number_of_processes": "1",
>>     "allocation": "myAllocation",
>>     "queue": "workq",
>>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>>     "working_directory": os.getcwd() + "/agent",
>>     "walltime": 10,
>>     "affinity": "affinity1",
>> })
>>
>> I get this error:
>>
>>
>> DEBUG:root:Utilizing ADVERT Backend
>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>> DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>> DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>>
>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>> DEBUG:root:set pilot state to: Unknown
>> Adaptor specific modifications: fork
>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>
>> use standard proxy
>> Submit pilot job to: fork://localhost/
>> DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>
>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Utilizing ADVERT Backend
>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>> DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>> DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>>
>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>> DEBUG:root:set pilot state to: Unknown
>> Adaptor specific modifications: gram
>> DEBUG:root:Escape RSL
>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>> use standard proxy
>>
>> Traceback (most recent call last):
>>   File "example_manyjob_affinity.py", line 61, in <module>
>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>     self.__init_bigjobs()
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>     self.__start_bigjob(i)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>     ppn)
>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>     js = saga.job.service(lrms_saga_url)
>> bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>
>> DEBUG:root:Cancel re-scheduler thread
>> Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>> Cancel Pilot Job
>> stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>> Cancel Pilot Job
>> stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>> DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>
>>> Please try to submit the job first from the hotel front node. Once you
>>> are sure that Globus etc. is working, you can do a remote submission.
>>> E.g., you need to make sure that you use the right BJ version on hotel
>>> as well (i.e., you need to double-check your PYTHONPATH etc.).
>>
>>
>> I think BigJob is working properly on Hotel; at least
>> example_local_single.py and example_local_multiple worked. My PYTHONPATH is:
>>
>> /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>
>> Thanks,
>>
>> Paula
>>
>>> Hope that helps.
>>>
>>> Best,
>>> Andre
>>>
>>> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu>
>>> wrote:
>>>> Hi Andre,
>>>>
>>>>>
>>>>> Please change number_nodes to number_of_processes in line 51:
>>>>>
>>>>> resource_list.append( {"resource_url" : "fork://localhost/",
>>>>> "number_of_processes" : "2", "allocation" : "myAllocation", ...
>>>>
>>>>
>>>> I changed number_nodes to number_of_processes and it now runs. However, it
>>>> seems to be in an infinite loop, perhaps because my second machine is not
>>>> configured properly? This is the configuration for my second machine:
>>>>
>>>> resource_list.append({
>>>>     "resource_url": "ssh://hotel.futuregrid.org",
>>>>     "number_of_processes": "4",
>>>>     "allocation": "myAllocation",
>>>>     "queue": "workq",
>>>>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>>>>     "working_directory": os.getcwd() + "/agent",
>>>>     "walltime": 10,
>>>>     "affinity": "affinity1",
>>>> })
>>>>
>>>> Here, I used the bigjob_agent_launcher.sh in the directory written above
>>>> because it was the only place I could find it. Should I use something else
>>>> for "bigjob_agent"? Also, I'm not sure whether ssh://hotel.futuregrid.org or
>>>> gram://hotel.futuregrid.org/jobmanager-pbs should be used.
>>>>
>>>> This is the output that keeps coming up:
>>>>
>>>> Current states: {'New': 2, 'Unknown': 6}
>>>> DEBUG:root:Reschedule Thread
>>>> DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>>>> DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>>>> DEBUG:root:found no active resource for sub-job => (re-) queue it
>>>> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>>>
>>>>>
>>>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>>>> page).
>>>>>
>>>>> It's fixed in BigJob-0.3.31 and SVN.
>>>>
>>>>
>>>> I followed the instructions on the Wiki (b. Python Packaging and
>>>> Virtualenv) and did an update. It looks like I have BigJob-0.3.31.
>>>>
>>>> Thanks,
>>>>
>>>> Paula
>>>>
>>>>
>>>> On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow <aluckow at cct.lsu.edu>
>>>> wrote:
>>>>>
>>>>> Hi Paula,
>>>>> there is a small bug in the example:
>>>>>
>>>>> Please change number_nodes to number_of_processes in line 51:
>>>>>
>>>>> resource_list.append( {"resource_url" : "fork://localhost/",
>>>>> "number_of_processes" : "2", "allocation" : "myAllocation", ...
>>>>>
>>>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>>>> page).
>>>>>
>>>>> It's fixed in BigJob-0.3.31 and SVN.
>>>>>
>>>>> Best,
>>>>> Andre
>>>>>
>>>>>
>>>>> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu
>>>>> <psanem1 at tigers.lsu.edu>
>>>>> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
>>>>>> complete (see my output below). I'm submitting the job from Sierra and
>>>>>> would like to use Hotel as my second machine. Could you please advise
>>>>>> me on how to proceed?
>>>>>>
>>>>>> In addition, is there anything wrong with the cct advert service? I
>>>>>> could run example_local_single.py, but now it's not working.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Paula
>>>>>>
>>>>>> ManyJob load test with 8 jobs.
>>>>>> Create manyjob service
>>>>>> DEBUG:root:start bigjob at: fork://localhost/
>>>>>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>>>>> DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>>>>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>>>>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>>>>>> DEBUG:root:Utilizing ADVERT Backend
>>>>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>>>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>>>>> DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>>>>>> Traceback (most recent call last):
>>>>>>   File "example_manyjob_affinity.py", line 61, in <module>
>>>>>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>>>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>>>>     self.__init_bigjobs()
>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>>>>     self.__start_bigjob(i)
>>>>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>>>>>     bj_dict["number_of_processes"],
>>>>>> KeyError: 'number_of_processes'
>>>>>> Cancel Pilot Job
>>>>>> stop pilot job:
>>>>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>>>>>> DEBUG:root:update state of pilot job to: Done Stopped: True
>>>>>> DEBUG:root:delete pilot job:
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bigjob-users mailing list
>>>>>> Bigjob-users at mail.cct.lsu.edu
>>>>>> https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>>>>>>
>>>>
>>>>
>>
>>
>
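
The KeyError: 'number_of_processes' in the first traceback of this thread
comes from a resource_list entry that still used the old key name. A
small pre-flight check along the following lines could catch that before
many_job_affinity_service is constructed. This is only a sketch:
missing_keys and REQUIRED_KEYS are hypothetical names, not part of
BigJob, and only "number_of_processes" is confirmed as required by the
traceback; "resource_url" is assumed.

```python
# Hypothetical helper, not part of BigJob. Only "number_of_processes" is
# confirmed as required by the traceback; "resource_url" is an assumption.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def missing_keys(resource_list, required=REQUIRED_KEYS):
    """Return {index: [missing key, ...]} for entries lacking a required key."""
    problems = {}
    for index, entry in enumerate(resource_list):
        missing = [key for key in required if key not in entry]
        if missing:
            problems[index] = missing
    return problems


resource_list = [
    # Entry using the old key name, as in the original example script.
    {"resource_url": "fork://localhost/", "number_nodes": "2"},
    # Entry using the corrected key name.
    {"resource_url": "fork://localhost/", "number_of_processes": "2"},
]

print(missing_keys(resource_list))  # {0: ['number_of_processes']}
```

An empty result means every entry carries the keys checked here; it says
nothing about whether the URL scheme or the queue is valid on the target
machine.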