[Bigjob-users] Attempt to run example_manyjob_affinity.py
Shantenu Jha
sjha at cct.lsu.edu
Sat Dec 10 16:05:26 CST 2011
Paula,
> I don't think I have a grid certificate. When I do grid-proxy-init, I get
In which case you should be using "ssh", i.e., set
"resource_url" : "ssh://hotel.futuregrid.org"
in the dictionary passed to resource_list.append(...).
Shantenu
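
A minimal sketch of that change, not taken from the BigJob distribution: the entry below reuses the keys and values Paula posted earlier in this thread and only swaps the URL scheme to ssh. The helper name make_ssh_resource is hypothetical, and whether BigJob-0.3.31 still needs every one of these keys is an assumption. The bigjob_agent key is left out because Andre notes further down that it no longer has to be set. Building the dictionary needs no BigJob import.

```python
import os


def make_ssh_resource(host, processes, workdir):
    """Build one resource_list entry with an ssh:// URL (hypothetical helper)."""
    return {
        "resource_url": "ssh://" + host,    # ssh instead of gram, per the advice above
        "number_of_processes": str(processes),
        "allocation": "myAllocation",       # placeholder value copied from the thread
        "queue": "workq",
        "working_directory": workdir,
        "walltime": 10,
        "affinity": "affinity1",
    }


resource_list = []
resource_list.append(
    make_ssh_resource("hotel.futuregrid.org", 4, os.path.join(os.getcwd(), "agent"))
)
print(resource_list[0]["resource_url"])
```

The resulting list would then be passed to many_job_affinity_service(resource_list, COORDINATION_URL) exactly as in example_manyjob_affinity.py.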
>
> -bash: grid-proxy-init: command not found
>
> Paula
>
> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
> Paula,
>
> Do you even have a grid certificate? If not you will have to use
> non-globus solutions e.g., pbs-ssh.
>
> Shantenu
>
>
> Hi Paula,
> did you run grid-proxy-init? Could you set:
>
> export SAGA_VERBOSE=100
>
> in your shell and re-send the log, please?
>
> Thanks!
> Andre
>
> On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
> <psanem1 at tigers.lsu.edu> wrote:
> Hi Andre,
>
> for launching remote jobs you certainly want to use Globus, i.e. GRAM
> URLs. The bigjob_agent key in the resource_list does not need to be
> set anymore!
>
>
>
> When I have this configuration for hotel
>
> resource_list.append({"resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs",
>     "number_of_processes": "1",
>     "allocation": "myAllocation",
>     "queue": "workq",
>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>     "working_directory": os.getcwd() + "/agent",
>     "walltime": 10,
>     "affinity": "affinity1"})
>
> I get this error:
>
>
> DEBUG:root:Utilizing ADVERT Backend
> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
> DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
> DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
> DEBUG:root:update state of pilot job to: Unknown Stopped: False
> DEBUG:root:set pilot state to: Unknown
> Adaptor specific modifications: fork
> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
> use standard proxy
> Submit pilot job to: fork://localhost/
> DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
> DEBUG:root:Utilizing ADVERT Backend
> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
> DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
> DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
> DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
> DEBUG:root:update state of pilot job to: Unknown Stopped: False
> DEBUG:root:set pilot state to: Unknown
> Adaptor specific modifications: gram
> DEBUG:root:Escape RSL
> Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
> use standard proxy
>
> Traceback (most recent call last):
>   File "example_manyjob_affinity.py", line 61, in <module>
>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>     self.__init_bigjobs()
>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>     self.__start_bigjob(i)
>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>     ppn)
>   File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>     js = saga.job.service(lrms_saga_url)
> bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>
> DEBUG:root:Cancel re-scheduler thread
> Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
> Cancel Pilot Job
> stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
> DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
> Cancel Pilot Job
> stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
> DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>
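
The traceback above fails inside saga.job.service() when it is handed the gram:// URL. A rough pre-flight check along the following lines could flag such an entry before any pilot job is started. This is a sketch under assumptions, not part of BigJob or SAGA: the function name check_resource_urls is hypothetical, and the rule "gram:// is only usable when grid-proxy-init is on the PATH" is an illustrative assumption drawn from this thread. It is written for Python 3, whereas the installation in the thread used Python 2.7.

```python
import shutil
from urllib.parse import urlparse


def check_resource_urls(resource_list, have_grid_proxy=None):
    """Return warnings for entries whose URL scheme looks unusable on this host."""
    if have_grid_proxy is None:
        # Assumption: a working Globus setup implies grid-proxy-init is on the PATH.
        have_grid_proxy = shutil.which("grid-proxy-init") is not None
    warnings = []
    for entry in resource_list:
        url = entry["resource_url"]
        scheme = urlparse(url).scheme
        if scheme == "gram" and not have_grid_proxy:
            warnings.append(
                url + ": gram:// needs Globus, but grid-proxy-init was not found; "
                "consider an ssh:// URL instead"
            )
    return warnings


resources = [
    {"resource_url": "fork://localhost/"},
    {"resource_url": "gram://hotel.futuregrid.org/jobmanager-pbs"},
]
for line in check_resource_urls(resources, have_grid_proxy=False):
    print(line)
```

Passing have_grid_proxy explicitly keeps the check testable on machines without Globus; leaving it as None probes the local PATH.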
> Please try to submit the job first from the hotel front node. Once you
> are sure that Globus etc. is working you can do a remote submission.
> E.g., you need to make sure that you use the right BJ version on hotel
> as well (i.e., you need to double-check your PYTHONPATH etc.).
>
>
>
> I think I have BigJob in Hotel working properly, at least
> example_local_single.py and example_local_multiple worked. My PYTHONPATH is
>
> /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>
> Thanks,
>
> Paula
>
> Hope that helps.
>
> Best,
> Andre
>
> On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu> wrote:
> Hi Andre,
>
>
> Please change numer_nodes to number_of_processes in line 51:
>
> resource_list.append({"resource_url" : "fork://localhost/",
>     "number_of_processes" : "2",
>     "allocation" : "myAllocation",
>     ...
>
>
>
> I changed number_nodes to number_of_processes and it now runs. However,
> it seems like it is in an infinite loop, perhaps because my second
> machine is not configured properly? This is the configuration for my
> second machine:
>
> resource_list.append({"resource_url": "ssh://hotel.futuregrid.org",
>     "number_of_processes": "4",
>     "allocation": "myAllocation",
>     "queue": "workq",
>     "bigjob_agent": "/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh",
>     "working_directory": os.getcwd() + "/agent",
>     "walltime": 10,
>     "affinity": "affinity1"})
>
> Here, I used the bigjob_agent_launcher.sh in the directory written above
> because it was the only place I could find it. Should I use something
> else for "bigjob_agent"? Also, I'm not sure whether
> ssh://hotel.futuregrid.org or gram://hotel.futuregrid.org/jobmanager-pbs
> should be used.
>
> This is the output that keeps coming up:
>
> Current states: {'New': 2, 'Unknown': 6}
> DEBUG:root:Reschedule Thread
> DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
> DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
> DEBUG:root:found no active resource for sub-job => (re-) queue it
> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>
>
> Also, the BJ part of CSA is a bit old and contains some bugs. If
> possible, try to install BJ in userspace (as outlined on the Wiki
> page).
>
> It's fixed in BigJob-0.3.31 and SVN.
>
>
>
> I followed the instructions on the Wiki (b. Python Packaging and
> Virtualenv) and did an update. It looks like I have BigJob-0.3.31.
>
> Thanks,
>
> Paula
>
>
> On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow <aluckow at cct.lsu.edu> wrote:
>
> Hi Paula,
> there is a small bug in the example:
>
> Please change numer_nodes to number_of_processes in line 51:
>
> resource_list.append({"resource_url" : "fork://localhost/",
>     "number_of_processes" : "2",
>     "allocation" : "myAllocation",
>     ...
>
> Also, the BJ part of CSA is a bit old and contains some bugs. If
> possible, try to install BJ in userspace (as outlined on the Wiki
> page).
>
> It's fixed in BigJob-0.3.31 and SVN.
>
> Best,
> Andre
>
>
> On Thu, Dec 8, 2011 at 9:42 PM, Paula Sanematsu <psanem1 at tigers.lsu.edu> wrote:
> Hi,
>
> I'm trying to run example_manyjob_affinity.py on Sierra, but it doesn't
> complete (see my output below). I'm submitting the job from Sierra and
> would like to use Hotel as my second machine. Could you please advise me
> on how to proceed?
>
> In addition, is there anything wrong with the cct advert service? I
> could run example_local_single.py, but now it's not working.
>
> Thanks,
>
> Paula
>
> ManyJob load test with 8 jobs.
> Create manyjob service
> DEBUG:root:start bigjob at: fork://localhost/
> DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
> DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
> '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
> DEBUG:root:Utilizing ADVERT Backend
> DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
> DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
> DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
> Traceback (most recent call last):
>   File "example_manyjob_affinity.py", line 61, in <module>
>     mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>     super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>     self.__init_bigjobs()
>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>     self.__start_bigjob(i)
>   File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>     bj_dict["number_of_processes"],
> KeyError: 'number_of_processes'
> Cancel Pilot Job
> stop pilot job:
> DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
> DEBUG:root:update state of pilot job to: Done Stopped: True
> DEBUG:root:delete pilot job:
>
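
The KeyError above comes from a resource entry that lacks the number_of_processes key. A small check like the following sketch could report such entries before the service is created. The function name missing_keys and the REQUIRED_KEYS tuple are hypothetical; only number_of_processes is confirmed as required by the traceback, and treating resource_url as required is an assumption.

```python
# Assumption: only number_of_processes is confirmed by the traceback in this thread.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def missing_keys(resource_list, required=REQUIRED_KEYS):
    """Map the index of each incomplete entry to the list of keys it lacks."""
    problems = {}
    for index, entry in enumerate(resource_list):
        absent = [key for key in required if key not in entry]
        if absent:
            problems[index] = absent
    return problems


resource_list = [
    # Key name Paula reports having before the fix.
    {"resource_url": "fork://localhost/", "number_nodes": "2"},
    {"resource_url": "ssh://hotel.futuregrid.org", "number_of_processes": "4"},
]
print(missing_keys(resource_list))
```

An empty dictionary means every entry carries the keys listed in REQUIRED_KEYS.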
> _______________________________________________
> Bigjob-users mailing list
> Bigjob-users at mail.cct.lsu.edu
> https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>