[Bigjob-users] Attempt to run example_manyjob_affinity.py
Andre Luckow
aluckow at cct.lsu.edu
Sat Dec 10 16:33:49 CST 2011
Hi Paula,
the reason is that hotel has a somewhat unusual login configuration:
This machine accepts SSH public key and One Time Password (OTP) logins only.
If you do not have a public key set up, you will be prompted for a password.
This is *not* your FutureGrid password, but the One Time Password
generated from your OTP token. Do not type your FutureGrid password,
it will not work. If you do not have a token or public key, you will
not be able to login.
Can you log in to hotel without a password?
Does:
ssh hotel /bin/date
work for you?
Does it work on another machine, e.g. india?
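If you would rather script that check, here is a minimal sketch (Python 3, not part of BigJob). It runs the same `ssh <host> /bin/date` test with the OpenSSH option BatchMode=yes, so ssh fails instead of prompting for a password or OTP. The function names and the timeout values are my own assumptions.

```python
import subprocess


def ssh_check_command(host, remote_cmd="/bin/date"):
    """Build an ssh command line that never prompts for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",       # fail instead of asking for a password/OTP
        "-o", "ConnectTimeout=10",   # assumed value; do not hang on unreachable hosts
        host,
        remote_cmd,
    ]


def can_login_without_password(host):
    """Return True if the remote command runs without any interactive prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(host),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection hung
        return False
    return result.returncode == 0
```

Calling can_login_without_password("hotel") and can_login_without_password("india") should show whether key-based login works on each machine.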
Best,
Andre
On Sat, Dec 10, 2011 at 11:27 PM, Paula Sanematsu
<psanem1 at tigers.lsu.edu> wrote:
> When I use ssh://hotel.futuregrid.org for "resource_url", I keep getting
> this output over and over again.
>
> DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
> DEBUG:root:Big Job: bigjob:7a4c7d70-237c-11e1-83f2-002215124496:hotel.futuregrid.org Cores: 1/1 State: Unknown Terminated: False #Required Cores: 1
>
> DEBUG:root:found no active resource for sub-job => (re-) queue it
> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
> DEBUG:root:Reschedule Thread
> DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>
> Current states: {'New': 2, 'Unknown': 6}
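
A rough way to see which pilot is stuck: the sketch below (Python 3, not part of BigJob) scans log text like the above for "Big Job:" entries and lists the hosts whose reported state is not Running. The regular expression and the function names are assumptions based only on the lines quoted in this thread, and the log text must be passed without the "> " quote markers.

```python
import re

# Pattern for entries such as:
#   Big Job: bigjob:<uuid>:<host> Cores: 1/1 State: Unknown ...
PILOT_RE = re.compile(
    r"Big Job:\s*bigjob:(?P<id>[0-9a-f-]+):(?P<host>\S+)\s+"
    r"Cores:\s*(?P<cores>\d+/\d+)\s+"
    r"State:\s*(?P<state>\w+)"
)


def pilot_states(log_text):
    """Map each host to the last state reported for its pilot job."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    states = {}
    for match in PILOT_RE.finditer(flat):
        states[match.group("host")] = match.group("state")
    return states


def inactive_hosts(log_text):
    """Return the hosts whose pilot is in any state other than Running."""
    return sorted(
        host for host, state in pilot_states(log_text).items()
        if state != "Running"
    )


sample_log = """
DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost
Cores: 0/2 State: Running Terminated: False #Required Cores: 1
DEBUG:root:Big Job:
bigjob:7a4c7d70-237c-11e1-83f2-002215124496:hotel.futuregrid.org Cores: 1/1
State: Unknown Terminated: False #Required Cores: 1
"""

print(inactive_hosts(sample_log))
```

For the output quoted here, only hotel.futuregrid.org is reported, which matches the pilot that stays in state Unknown.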
>
>
> On Sat, Dec 10, 2011 at 4:05 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>>
>> Paula,
>>
>>
>>> I don't think I have a grid certificate. When I do grid-proxy-init, I get
>>
>>
>> In which case you should be using "ssh".
>>
>> i.e., use:
>>
>> ssh://hotel.futuregrid.org as the "resource_url" value in resource_list.append().
>>
>> Shantenu
>>
>>
>>
>>
>>
>>
>>>
>>> -bash: grid-proxy-init: command not found
>>>
>>> Paula
>>>
>>> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>>> Paula,
>>>
>>> Do you even have a grid certificate? If not you will have to use
>>> non-globus solutions e.g., pbs-ssh.
>>>
>>> Shantenu
>>>
>>>
>>> Hi Paula,
>>> did you run grid-proxy-init? Could you set:
>>>
>>> export SAGA_VERBOSE=100
>>>
>>> in your shell and re-send the log, please?
>>>
>>> Thanks!
>>> Andre
>>>
>>> On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
>>> <psanem1 at tigers.lsu.edu> wrote:
>>> Hi Andre,
>>>
>>> for launching remote jobs you certainly want to use Globus, i.e. GRAM
>>> URLs. The bigjob_agent key in the resource_list does not need to be
>>> set anymore!
>>>
>>>
>>>
>>> When I have this configuration for hotel
>>>
>>> resource_list.append({"resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs",
>>>     "number_of_processes": "1",
>>>     "allocation": "myAllocation",
>>>     "queue": "workq",
>>>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>>>     "working_directory": os.getcwd() + "/agent",
>>>     "walltime": 10,
>>>     "affinity": "affinity1"})
>>>
>>> I get this error:
>>>
>>>
>>> DEBUG:root:Utilizing ADVERT Backend
>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>>> server_connect_url: None
>>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>> DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>>> DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>> DEBUG:root:set pilot state to: Unknown
>>> Adaptor specific modifications: fork
>>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>> use standard proxy
>>> Submit pilot job to: fork://localhost/
>>> DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Utilizing ADVERT Backend
>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>>> server_connect_url: None
>>> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>> DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>>> DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>>> DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>> DEBUG:root:set pilot state to: Unknown
>>> Adaptor specific modifications: gram
>>> DEBUG:root:Escape RSL
>>> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>> use standard proxy
>>>
>>> Traceback (most recent call last):
>>>   File "example_manyjob_affinity.py", line 61, in <module>
>>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>     self.__init_bigjobs()
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>     self.__start_bigjob(i)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>>     ppn)
>>>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>>     js = saga.job.service(lrms_saga_url)
>>> bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>>
>>> DEBUG:root:Cancel re-scheduler thread
>>> Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>>> Cancel Pilot Job
>>> stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>> Cancel Pilot Job
>>> stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>> DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>>
>>> Please try to submit the job first from the hotel front node. Once you
>>> are sure that Globus etc. is working, you can do a remote submission.
>>> E.g., you need to make sure that you use the right BJ version on hotel
>>> as well (i.e., you need to double-check your PYTHONPATH etc.)
>>>
>>>
>>>
>>> I think I have BigJob on Hotel working properly, at least
>>> example_local_single.py and example_local_multiple worked. My PYTHONPATH is
>>>
>>> /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>>
>>> Thanks,
>>>
>>> Paula
>>>
>>> Hope that helps.
>>>
>>> Best,
>>> Andre
>>>
>>> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu
>>> <psanem1 at tigers.lsu.edu>
>>> wrote:
>>> Hi Andre,
>>>
>>>
>>> Please change number_nodes to number_of_processes in line 51:
>>>
>>> resource_list.append(
>>> {"resource_url" :
>>> "fork://localhost/",
>>> "number_of_processes" : "2",
>>> "allocation" : "myAllocation",
>>> ...
>>>
>>>
>>>
>>> I changed number_nodes to number_of_processes and it now runs. However, it
>>> seems to be stuck in an infinite loop, perhaps because my second machine is
>>> not configured properly? This is the configuration for my second machine:
>>>
>>> resource_list.append({"resource_url": "ssh://hotel.futuregrid.org",
>>>     "number_of_processes": "4",
>>>     "allocation": "myAllocation",
>>>     "queue": "workq",
>>>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>>>     "working_directory": os.getcwd() + "/agent",
>>>     "walltime": 10,
>>>     "affinity": "affinity1"})
>>>
>>> Here, I used the bigjob_agent_launcher.sh in the directory written above
>>> because it was the only place I could find it. Should I use something else
>>> for "bigjob_agent"? Also, I'm not sure whether ssh://hotel.futuregrid.org or
>>> gram://hotel.futuregrid.org/jobmanager-pbs should be used.
>>>
>>> This is the output that keeps coming up:
>>>
>>> Current states: {'New': 2, 'Unknown': 6}
>>> DEBUG:root:Reschedule Thread
>>> DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>>> DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>>> DEBUG:root:found no active resource for sub-job => (re-) queue it
>>> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>>
>>>
>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>> page).
>>>
>>> It's fixed in BigJob-0.3.31 and SVN.
>>>
>>>
>>>
>>> I followed the instructions on the Wiki (b. Python Packaging and
>>> Virtualenv) and did an update. It looks like I have BigJob-0.3.31.
>>>
>>> Thanks,
>>>
>>> Paula
>>>
>>>
>>> On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow
>>> <aluckow at cct.lsu.edu>
>>> wrote:
>>>
>>> Hi Paula,
>>> there is a small bug in the example:
>>>
>>> Please change number_nodes to number_of_processes in line 51:
>>>
>>> resource_list.append(
>>> {"resource_url" :
>>> "fork://localhost/",
>>> "number_of_processes" : "2",
>>> "allocation" : "myAllocation",
>>> ...
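
To catch this kind of key mismatch before the manyjob service starts, one could check each resource entry first. This is only a sketch (Python 3, not part of BigJob): the set of required keys is an assumption drawn from the KeyError in the traceback and the snippet above, and missing_keys is a hypothetical helper.

```python
# Keys assumed to be required for every resource entry (an assumption based
# on the KeyError: 'number_of_processes' reported in this thread).
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def missing_keys(resource, required=REQUIRED_KEYS):
    """Return the required keys that are absent from one resource entry."""
    return [key for key in required if key not in resource]


resource_list = []
resource_list.append({
    "resource_url": "fork://localhost/",
    "number_of_processes": "2",
    "allocation": "myAllocation",
})

# Entry written with the old key name from the example script.
old_style_entry = {
    "resource_url": "fork://localhost/",
    "number_nodes": "2",
    "allocation": "myAllocation",
}

for entry in resource_list + [old_style_entry]:
    problems = missing_keys(entry)
    if problems:
        print("entry for", entry["resource_url"], "is missing:", problems)
```

Only the old-style entry is reported here, with number_of_processes listed as missing.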
>>>
>>> Also, the BJ part of CSA is a bit old and contains some bugs. If
>>> possible, try to install BJ in userspace (as outlined on the Wiki
>>> page).
>>>
>>> It's fixed in BigJob-0.3.31 and SVN.
>>>
>>> Best,
>>> Andre
>>>
>>>
>>> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu
>>> <psanem1 at tigers.lsu.edu> wrote:
>>> Hi,
>>>
>>> I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
>>> complete (see my output below). I'm submitting the job from Sierra and would
>>> like to use Hotel as my second machine. Could you please advise me on how to
>>> proceed?
>>>
>>> In addition, is there anything wrong with the cct advert service? I could run
>>> example_local_single.py before, but now it's not working.
>>>
>>> Thanks,
>>>
>>> Paula
>>>
>>> ManyJob load test with 8 jobs.
>>> Create manyjob service
>>> DEBUG:root:start bigjob at: fork://localhost/
>>> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>>
>>> DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>>> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>>> DEBUG:root:Utilizing ADVERT Backend
>>> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>> DEBUG:root:Server: advert.cct.lsu.edu Port 8080
>>> server_connect_url: None
>>> DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>>> Traceback (most recent call last):
>>>   File "example_manyjob_affinity.py", line 61, in <module>
>>>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>     self.__init_bigjobs()
>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>     self.__start_bigjob(i)
>>>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>>     bj_dict["number_of_processes"],
>>> KeyError: 'number_of_processes'
>>> Cancel Pilot Job
>>> stop pilot job:
>>> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>>> DEBUG:root:update state of pilot job to: Done Stopped: True
>>> DEBUG:root:delete pilot job:
>>>
>>>
>>> _______________________________________________
>>> Bigjob-users mailing list
>>> Bigjob-users at mail.cct.lsu.edu
>>> https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>