[Bigjob-users] Attempt to run example_manyjob_affinity.py

Paula Sanematsu psanem1 at tigers.lsu.edu
Sat Dec 10 16:27:33 CST 2011


When I use ssh://hotel.futuregrid.org for "resource_url", I keep getting
this output over and over again.

DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
DEBUG:root:Big Job: bigjob:7a4c7d70-237c-11e1-83f2-002215124496:hotel.futuregrid.org Cores: 1/1 State: Unknown Terminated: False #Required Cores: 1
DEBUG:root:found no active resource for sub-job => (re-) queue it
DEBUG:root:free_cores: [0, 0] total_free_cores: 0
DEBUG:root:Reschedule Thread
DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
Current states: {'New': 2, 'Unknown': 6}
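
A small sketch for reading this output, in case it helps narrow things down. It is not part of BigJob: `STATUS_RE`, `summarize`, and `stuck_pilots` are hypothetical helper names, and the field layout is an assumption read off the "Big Job" status lines above. It lists which pilot has not reached the Running state, which in this log appears to be the reason no free cores are ever reported.

```python
import re

# Pattern for the "Big Job" status lines shown above. The field layout is an
# assumption taken from this log only, not from the BigJob sources.
STATUS_RE = re.compile(
    r"Big Job: bigjob:(?P<uuid>[0-9a-f-]+):\s*(?P<host>\S+)\s+"
    r"Cores: (?P<free>\d+)/(?P<total>\d+)\s+"
    r"State: (?P<state>\w+)"
)


def summarize(log_text):
    """Map each pilot host to (first count, second count, state), keeping the last status seen."""
    # Collapse line breaks added by the mail client so each status is one string.
    flat = " ".join(log_text.split())
    pilots = {}
    for m in STATUS_RE.finditer(flat):
        pilots[m.group("host")] = (
            int(m.group("free")),
            int(m.group("total")),
            m.group("state"),
        )
    return pilots


def stuck_pilots(pilots):
    """Return the hosts whose pilot is in any state other than Running."""
    return sorted(host for host, (_, _, state) in pilots.items() if state != "Running")


# Two status lines copied from the output above, including the mail wrapping.
sample = """
DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost
Cores: 0/2 State: Running Terminated: False #Required Cores: 1
DEBUG:root:Big Job: bigjob:7a4c7d70-237c-11e1-83f2-002215124496:
hotel.futuregrid.org Cores: 1/1 State: Unknown Terminated: False #Required
Cores: 1
"""

pilots = summarize(sample)
print(stuck_pilots(pilots))
```

For the sample above, the only host reported is hotel.futuregrid.org, whose pilot stays in state Unknown while the localhost pilot is Running.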


On Sat, Dec 10, 2011 at 4:05 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:

> Paula,
>
>> I don't think I have a grid certificate. When I do grid-proxy-init, I get
>>
>> -bash: grid-proxy-init: command not found
>>
>> Paula
>
> In which case you should be using "ssh", i.e., use
> ssh://hotel.futuregrid.org as the "resource_url" in
> resource_list.append({"resource_url" : ...}).
>
> Shantenu
>
>>
>> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>>      Paula,
>>
>>      Do you even have a grid certificate? If not you will have to use
>>      non-globus solutions e.g., pbs-ssh.
>>
>>      Shantenu
>>
>>
>>      Hi Paula,
>>      did you run grid-proxy-init? Could you set:
>>
>>      export SAGA_VERBOSE=100
>>
>>      in your shell and re-send the log, please?
>>
>>      Thanks!
>>      Andre
>>
>>      On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
>>      <psanem1 at tigers.lsu.edu> wrote:
>>            Hi Andre,
>>
>>                  for launching remote jobs you certainly want to use
>>                  Globus, i.e. GRAM URLs. The bigjob_agent key in the
>>                  resource_list does not need to be set anymore!
>>
>>
>>
>>            When I have this configuration for hotel
>>
>>            resource_list.append( {"resource_url" : "gram://hotel.futuregrid.org/jobmanager-pbs",
>>                "number_of_processes" : "1",
>>                "allocation" : "myAllocation", "queue" : "workq",
>>                "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>                "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>                "affinity" : "affinity1"})
>>
>>            I get this error:
>>
>>
>>            DEBUG:root:Utilizing ADVERT Backend
>>            DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>            DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>            DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>            DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>>            DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>            DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>>
>>            DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>            DEBUG:root:set pilot state to: Unknown
>>            Adaptor specific modifications: fork
>>            Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>
>>            use standard proxy
>>            Submit pilot job to: fork://localhost/
>>            DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>
>>            DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>            DEBUG:root:Utilizing ADVERT Backend
>>            DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>            DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>            DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>            DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>>            DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>            DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>>
>>            DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>            DEBUG:root:set pilot state to: Unknown
>>            Adaptor specific modifications: gram
>>            DEBUG:root:Escape RSL
>>            Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>            use standard proxy
>>
>>            Traceback (most recent call last):
>>              File "example_manyjob_affinity.py", line 61, in <module>
>>                mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>                super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>                self.__init_bigjobs()
>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>                self.__start_bigjob(i)
>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>                ppn)
>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>                js = saga.job.service(lrms_saga_url)
>>            bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>
>>            DEBUG:root:Cancel re-scheduler thread
>>            Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>>            Cancel Pilot Job
>>            stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>            DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>            Cancel Pilot Job
>>            stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>            DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>
>>                  Please try to submit the job first from the hotel front
>>                  node. Once you are sure that Globus etc. is working you
>>                  can do a remote submission. E.g. you need to make sure
>>                  that you use the right BJ version on hotel as well (i.e.
>>                  you need to double-check your PYTHONPATH etc.)
>>
>>
>>
>>            I think I have BigJob in Hotel working properly, at least
>>            example_local_single.py and example_local_multiple worked. My
>>            PYTHONPATH is
>>
>>            /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>
>>            Thanks,
>>
>>            Paula
>>
>>                  Hope that helps.
>>
>>                  Best,
>>                  Andre
>>
>>                  On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu
>>                  <psanem1 at tigers.lsu.edu>
>>                  wrote:
>>                        Hi Andre,
>>
>>
>>                              Please change number_nodes to
>>                              number_of_processes in line 51:
>>
>>                              resource_list.append(
>>                              {"resource_url" :
>>                              "fork://localhost/",
>>                              "number_of_processes" : "2",
>>                              "allocation" : "myAllocation",
>>                              ...
>>
>>
>>
>>                        I changed number_nodes to number_of_processes
>>                        and it now runs. However, it seems to be stuck in
>>                        an infinite loop, perhaps because my second
>>                        machine is not configured properly? This is the
>>                        configuration for my second machine:
>>
>>                        resource_list.append( {"resource_url" : "ssh://hotel.futuregrid.org",
>>                            "number_of_processes" : "4", "allocation" : "myAllocation",
>>                            "queue" : "workq",
>>                            "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>                            "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>                            "affinity" : "affinity1"})
>>
>>                        Here, I used the bigjob_agent_launcher.sh in the
>>                        directory written above because it was the only
>>                        place I could find it. Should I use something else
>>                        for "bigjob_agent"? Also, I'm not sure whether
>>                        ssh://hotel.futuregrid.org or
>>                        gram://hotel.futuregrid.org/jobmanager-pbs
>>                        should be used.
>>
>>                        This is the output that keeps coming up:
>>
>>                        Current states: {'New': 2, 'Unknown': 6}
>>                        DEBUG:root:Reschedule Thread
>>                        DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>>                        DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>>                        DEBUG:root:found no active resource for sub-job => (re-) queue it
>>                        DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>
>>
>>                              Also, the BJ part of CSA is a bit
>>                              old and contains some bugs. If
>>                              possible, try to install BJ in
>>                              userspace (as outlined on the
>>                              Wiki page).
>>
>>                              It's fixed in BigJob-0.3.31 and
>>                              SVN.
>>
>>
>>
>>                        I followed the instructions on the Wiki (b.
>>                        Python Packaging and Virtualenv) and did an
>>                        update. It looks like I have BigJob-0.3.31.
>>
>>                        Thanks,
>>
>>                        Paula
>>
>>
>>                        On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow
>>                        <aluckow at cct.lsu.edu>
>>                        wrote:
>>
>>                              Hi Paula,
>>                              there is a small bug in the
>>                              example:
>>
>>                              Please change number_nodes to
>>                              number_of_processes in line 51:
>>
>>                              resource_list.append(
>>                              {"resource_url" :
>>                              "fork://localhost/",
>>                              "number_of_processes" : "2",
>>                              "allocation" : "myAllocation",
>>                              ...
>>
>>                              Also, the BJ part of CSA is a bit
>>                              old and contains some bugs. If
>>                              possible, try to install BJ in
>>                              userspace (as outlined on the
>>                              Wiki page).
>>
>>                              It's fixed in BigJob-0.3.31 and
>>                              SVN.
>>
>>                              Best,
>>                              Andre
>>
>>
>>                              On Thu, Dec 8, 2011 at 9:42 PM,
>>                              Paula Sanematsu
>>                              <psanem1 at tigers.lsu.edu>
>>                              wrote:
>>                                    Hi,
>>
>>                                    I'm trying to run example_manyjob_affinity.py
>>                                    on Sierra, but it doesn't complete (see my
>>                                    output below). I'm submitting the job from
>>                                    Sierra and would like to use Hotel as my
>>                                    second machine. Could you please advise me
>>                                    on how to proceed?
>>
>>                                    In addition, is there anything wrong with
>>                                    the cct advert service? I could run
>>                                    example_local_single.py, but now it's not
>>                                    working.
>>
>>                                    Thanks,
>>
>>                                    Paula
>>
>>                                    ManyJob load test with 8 jobs.
>>                                    Create manyjob service
>>                                    DEBUG:root:start bigjob at: fork://localhost/
>>                                    DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>                                    DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>>                                    '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>                                    '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>>                                    DEBUG:root:Utilizing ADVERT Backend
>>                                    DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>                                    DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>                                    DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>>                                    Traceback (most recent call last):
>>                                      File "example_manyjob_affinity.py", line 61, in <module>
>>                                        mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>                                        super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>                                        self.__init_bigjobs()
>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>                                        self.__start_bigjob(i)
>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>                                        bj_dict["number_of_processes"],
>>                                    KeyError: 'number_of_processes'
>>                                    Cancel Pilot Job
>>                                    stop pilot job:
>>                                    DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>>                                    DEBUG:root:update state of pilot job to: Done Stopped: True
>>                                    DEBUG:root:delete pilot job:
>>
>>                                    _______________________________________________
>>                                    Bigjob-users mailing list
>>                                    Bigjob-users at mail.cct.lsu.edu
>>                                    https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>>
>>
>>
>>
>>