[Bigjob-users] Attempt to run example_manyjob_affinity.py

Andre Luckow aluckow at cct.lsu.edu
Sat Dec 10 16:33:49 CST 2011


Hi Paula,
the reason is that hotel has a somewhat unusual login configuration:

This machine accepts SSH public key and One Time Password (OTP) logins only.
If you do not have a public key set up, you will be prompted for a password.
This is *not* your FutureGrid password, but the One Time Password generated
from your OTP token.  Do not type your FutureGrid password, it will not work.
If you do not have a token or public key, you will not be able to login.

Can you log in to hotel without a password?

Does:

ssh hotel /bin/date

work for you?

Does it work on another machine, e.g. india?
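
The same check can be scripted. This is only a minimal sketch, assuming
OpenSSH's BatchMode option (with it, ssh fails instead of prompting for a
password). The function names and the injectable runner parameter are
made up for illustration and are not part of BigJob or SAGA:

```python
import subprocess


def build_ssh_check_command(host, remote_command="/bin/date"):
    """Return an ssh command line that fails instead of prompting."""
    # BatchMode=yes disables password/passphrase prompts, so a missing
    # public key shows up as a non-zero exit status rather than a hang.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=10",
            host, remote_command]


def can_login_without_password(host, runner=subprocess.run):
    """Return True if the remote command runs without any prompt."""
    try:
        result = runner(
            build_ssh_check_command(host),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ssh is not installed, or the connection did not finish in time.
        return False
    return result.returncode == 0
```

Calling can_login_without_password("hotel") and then
can_login_without_password("india") would show whether the problem is
specific to hotel.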

Best,
Andre

On Sat, Dec 10, 2011 at 11:27 PM, Paula Sanematsu
<psanem1 at tigers.lsu.edu> wrote:
> When I use ssh://hotel.futuregrid.org for "resource_url", I keep getting
> this output over and over again.
>
> DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
> DEBUG:root:Big Job: bigjob:7a4c7d70-237c-11e1-83f2-002215124496:hotel.futuregrid.org Cores: 1/1 State: Unknown Terminated: False #Required Cores: 1
>
> DEBUG:root:found no active resource for sub-job => (re-) queue it
> DEBUG:root:free_cores: [0, 0] total_free_cores: 0
> DEBUG:root:Reschedule Thread
> DEBUG:root:Big Job: bigjob:796b90c6-237c-11e1-83f2-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>
> Current states: {'New': 2, 'Unknown': 6}
>
>
> On Sat, Dec 10, 2011 at 4:05 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>>
>> Paula,
>>
>>
>>> I don't think I have a grid certificate. When I do grid-proxy-init, I get
>>
>>
>> In which case you should be using "ssh".
>>
>> i.e., use:
>>
>> "resource_url" : "ssh://hotel.futuregrid.org" in the resource_list.append(...) call.
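
The rule of thumb above (no grid tools or certificate, so use ssh) could
be sketched as follows. The helper name pick_resource_url is hypothetical,
and finding grid-proxy-init on the PATH is only a rough stand-in: it does
not prove that a valid grid certificate exists.

```python
import shutil


def pick_resource_url(host, have_grid_tools=None):
    """Pick a gram:// URL if Globus client tools are present, else ssh://."""
    if have_grid_tools is None:
        # A missing grid-proxy-init is what produced
        # "-bash: grid-proxy-init: command not found" in this thread.
        have_grid_tools = shutil.which("grid-proxy-init") is not None
    if have_grid_tools:
        return "gram://%s/jobmanager-pbs" % host
    return "ssh://%s" % host
```

The returned string would then be used as the "resource_url" value in the
dictionary passed to resource_list.append().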
>>
>> Shantenu
>>
>>
>>
>>
>>
>>
>>>
>>> -bash: grid-proxy-init: command not found
>>>
>>> Paula
>>>
>>> On Sat, Dec 10, 2011 at 3:53 PM, Shantenu Jha <sjha at cct.lsu.edu> wrote:
>>>      Paula,
>>>
>>>      Do you even have a grid certificate? If not you will have to use
>>>      non-globus solutions e.g., pbs-ssh.
>>>
>>>      Shantenu
>>>
>>>
>>>      Hi Paula,
>>>      did you run grid-proxy-init? Could you set:
>>>
>>>      export SAGA_VERBOSE=100
>>>
>>>      in your shell and re-send the log, please?
>>>
>>>      Thanks!
>>>      Andre
>>>
>>>      On Sat, Dec 10, 2011 at 10:44 PM, Paula Sanematsu
>>>      <psanem1 at tigers.lsu.edu> wrote:
>>>            Hi Andre,
>>>
>>>                  for launching remote jobs you certainly want to use
>>>                  Globus, i.e. GRAM
>>>                  URLs. The bigjob_agent key in the resource_list does not
>>>                  need to be
>>>                  set anymore!
>>>
>>>
>>>
>>>            When I have this configuration for hotel
>>>
>>>            resource_list.append( {"resource_url" : "gram://hotel.futuregrid.org/jobmanager-pbs",
>>>                "number_of_processes" : "1",
>>>                "allocation" : "myAllocation", "queue" : "workq",
>>>                "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>>                "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>>                "affinity" : "affinity1"})
>>>
>>>            I get this error:
>>>
>>>
>>>            DEBUG:root:Utilizing ADVERT Backend
>>>            DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>>            DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>>            DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>>            DEBUG:root:initialized BigJob: bigjob:7e9a80dc-2374-11e1-92c3-002215124496
>>>            DEBUG:root:create pilot job entry on backend server: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>>            DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7e9a80dc-2374-11e1-92c3-002215124496/localhost?
>>>            DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>>            DEBUG:root:set pilot state to: Unknown
>>>            Adaptor specific modifications: fork
>>>            Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>>            use standard proxy
>>>            Submit pilot job to: fork://localhost/
>>>            DEBUG:root:start bigjob at: gram://hotel.futuregrid.org/jobmanager-pbs
>>>            DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>>            DEBUG:root:Utilizing ADVERT Backend
>>>            DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>>            DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>>            DEBUG:root:Initialized Coordination to: advert://advert.cct.lsu.edu:8080/ (DB: )
>>>            DEBUG:root:initialized BigJob: bigjob:7f793e3a-2374-11e1-92c3-002215124496
>>>            DEBUG:root:create pilot job entry on backend server: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>>            DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080//bigjob/7f793e3a-2374-11e1-92c3-002215124496/hotel.futuregrid.org?
>>>            DEBUG:root:update state of pilot job to: Unknown Stopped: False
>>>            DEBUG:root:set pilot state to: Unknown
>>>            Adaptor specific modifications: gram
>>>            DEBUG:root:Escape RSL
>>>            Working directory: /N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/agent
>>>            use standard proxy
>>>
>>>            Traceback (most recent call last):
>>>              File "example_manyjob_affinity.py", line 61, in <module>
>>>                mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>                super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>                self.__init_bigjobs()
>>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>                self.__start_bigjob(i)
>>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob_dynamic/many_job.py", line 104, in __start_bigjob
>>>                ppn)
>>>              File "/N/u/paulasoo/.bigjob/python/lib/python2.7/site-packages/BigJob-0.3.31-py2.7.egg/bigjob/bigjob_manager.py", line 246, in start_pilot_job
>>>                js = saga.job.service(lrms_saga_url)
>>>            bad_parameter: SAGA(BadParameter): condor_job: Adaptor supports 'condor' and 'condorg' URL schemes, 'gram' is not supported.
>>>
>>>            DEBUG:root:Cancel re-scheduler thread
>>>            Exception AttributeError: "'many_job_affinity_service' object has no attribute 'stop'" in <bound method many_job_affinity_service.__del__ of 7e9a6c78-2374-11e1-92c3-002215124496> ignored
>>>            Cancel Pilot Job
>>>            stop pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>>            DEBUG:root:delete pilot job: bigjob:7e9a80dc-2374-11e1-92c3-002215124496:localhost
>>>            Cancel Pilot Job
>>>            stop pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>>            DEBUG:root:delete pilot job: bigjob:7f793e3a-2374-11e1-92c3-002215124496:hotel.futuregrid.org
>>>
>>>                  Please try to submit the job first from the hotel front
>>>                  node. Once you are sure that Globus etc. is working, you
>>>                  can do a remote submission. E.g., you need to make sure
>>>                  that you use the right BJ version on hotel as well (i.e.,
>>>                  you need to double-check your PYTHONPATH etc.)
>>>
>>>
>>>
>>>            I think I have BigJob in Hotel working properly, at least
>>>            example_local_single.py and example_local_multiple worked. My
>>>            PYTHONPATH is
>>>
>>>            /gpfs/home/paulasoo/.bigjob/python/lib/python2.7/site-packages/:/gpfs/software/x86_64/el5/hotel/SAGA/saga/1.6/gcc-4.1.2//lib/python2.7/site-packages/:
>>>
>>>            Thanks,
>>>
>>>            Paula
>>>
>>>                  Hope that helps.
>>>
>>>                  Best,
>>>                  Andre
>>>
>>>                  On Fri, Dec 9, 2011 at 9:21 PM, Paula Sanematsu
>>>                  <psanem1 at tigers.lsu.edu>
>>>                  wrote:
>>>                        Hi Andre,
>>>
>>>
>>>                              Please change number_nodes to
>>>                              number_of_processes in line 51:
>>>
>>>                              resource_list.append(
>>>                              {"resource_url" :
>>>                              "fork://localhost/",
>>>                              "number_of_processes" : "2",
>>>                              "allocation" : "myAllocation",
>>>                              ...
>>>
>>>
>>>
>>>                        I changed number_nodes to number_of_processes and it
>>>                        now runs. However, it seems like it is in an infinite
>>>                        loop, perhaps because my second machine is not
>>>                        configured properly? This is the configuration for my
>>>                        second machine:
>>>
>>>                        resource_list.append( {"resource_url" : "ssh://hotel.futuregrid.org",
>>>                            "number_of_processes" : "4", "allocation" : "myAllocation",
>>>                            "queue" : "workq",
>>>                            "bigjob_agent": ("/N/soft/SAGA/saga/1.5.3/gcc-4.1.2/lib/python2.7.1/site-packages/bigjob/bigjob_agent_launcher.sh"),
>>>                            "working_directory": (os.getcwd() + "/agent"), "walltime":10,
>>>                            "affinity" : "affinity1"})
>>>
>>>                        Here, I used the bigjob_agent_launcher.sh in the
>>>                        directory written above because it was the only place
>>>                        I could find it. Should I use something else for
>>>                        "bigjob_agent"? Also, I'm not sure whether
>>>                        ssh://hotel.futuregrid.org or
>>>                        gram://hotel.futuregrid.org/jobmanager-pbs should be
>>>                        used.
>>>
>>>                        This is the output that keeps coming up:
>>>
>>>                        Current states: {'New': 2, 'Unknown': 6}
>>>                        DEBUG:root:Reschedule Thread
>>>                        DEBUG:root:Big Job: bigjob:50ec186c-22a2-11e1-9a15-002215124496:localhost Cores: 0/2 State: Running Terminated: False #Required Cores: 1
>>>                        DEBUG:root:Big Job: bigjob:51d0e1cc-22a2-11e1-9a15-002215124496:hotel.futuregrid.org Cores: 4/4 State: Unknown Terminated: False #Required Cores: 1
>>>                        DEBUG:root:found no active resource for sub-job => (re-) queue it
>>>                        DEBUG:root:free_cores: [0, 0] total_free_cores: 0
>>>
>>>
>>>                              Also, the BJ part of CSA is a bit
>>>                              old and contains some bugs. If
>>>                              possible, try to install BJ in
>>>                              userspace (as outlined on the
>>>                              Wiki
>>>                              page).
>>>
>>>                              It's fixed in BigJob-0.3.31 and
>>>                              SVN.
>>>
>>>
>>>
>>>                        I followed the instructions on the Wiki (b. Python
>>>                        Packaging and Virtualenv) and did an update. It looks
>>>                        like I have BigJob-0.3.31.
>>>
>>>                        Thanks,
>>>
>>>                        Paula
>>>
>>>
>>>                        On Thu, Dec 8, 2011 at 3:07 PM, Andre Luckow
>>>                        <aluckow at cct.lsu.edu>
>>>                        wrote:
>>>
>>>                              Hi Paula,
>>>                              there is a small bug in the
>>>                              example:
>>>
>>>                              Please change number_nodes to
>>>                              number_of_processes in line 51:
>>>
>>>                              resource_list.append(
>>>                              {"resource_url" :
>>>                              "fork://localhost/",
>>>                              "number_of_processes" : "2",
>>>                              "allocation" : "myAllocation",
>>>                              ...
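
A wrong key name like this only surfaces as a KeyError deep inside
many_job.py, so it can help to check each resource dictionary before
creating the service. A minimal sketch: the set of required keys is an
assumption based only on the keys that appear in this thread, and
missing_keys is a hypothetical helper, not part of BigJob.

```python
# Assumption: these are the keys the many-job service needs at minimum.
REQUIRED_KEYS = ("resource_url", "number_of_processes")


def missing_keys(resource, required=REQUIRED_KEYS):
    """Return the required keys that are absent from a resource dict."""
    return [key for key in required if key not in resource]


# The old example used number_nodes instead of number_of_processes.
resource_list = [
    {"resource_url": "fork://localhost/",
     "number_nodes": "2",
     "allocation": "myAllocation"},
]

for index, resource in enumerate(resource_list):
    missing = missing_keys(resource)
    if missing:
        print("resource %d is missing: %s" % (index, ", ".join(missing)))
```

With the old key name this reports number_of_processes as missing for
resource 0; after the rename it prints nothing.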
>>>
>>>                              Also, the BJ part of CSA is a bit
>>>                              old and contains some bugs. If
>>>                              possible, try to install BJ in
>>>                              userspace (as outlined on the
>>>                              Wiki
>>>                              page).
>>>
>>>                              It's fixed in BigJob-0.3.31 and
>>>                              SVN.
>>>
>>>                              Best,
>>>                              Andre
>>>
>>>
>>>                              On Thu, Dec 8, 2011 at 9:42 PM,
>>>                              Paula Sanematsu
>>>                              <psanem1 at tigers.lsu.edu>
>>>                              wrote:
>>>                                    Hi,
>>>
>>>                                    I'm trying to run example_manyjob_affinity.py on Sierra,
>>>                                    but it doesn't complete (see my output below). I'm
>>>                                    submitting the job from Sierra and would like to use
>>>                                    Hotel as my second machine. Could you please advise me
>>>                                    on how to proceed?
>>>
>>>                                    In addition, is there anything wrong with the cct advert
>>>                                    service? I could run example_local_single.py, but now
>>>                                    it's not working.
>>>
>>>                                    Thanks,
>>>
>>>                                    Paula
>>>
>>>                                    ManyJob load test with 8 jobs.
>>>                                    Create manyjob service
>>>                                    DEBUG:root:start bigjob at: fork://localhost/
>>>                                    DEBUG:root:init BigJob w/: advert://advert.cct.lsu.edu:8080
>>>
>>>                                    DEBUG:root:['/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff/../',
>>>                                    '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/virtualenv-1.6.4-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/redis-2.2.4-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/threadpool-1.2.7-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/uuid-1.30-py2.7.egg',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>                                    '/N/u/paulasoo/HW06_E3/examples/manySJ_2BJ_diff',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python27.zip',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/plat-linux2',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-tk',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-old',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/lib-dynload',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/python2.7/site-packages',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>                                    '/N/soft/SAGA/external/python/2.7.1/gcc-4.1.2/lib/2.7/site-packages',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic',
>>>                                    '/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic']
>>>                                    DEBUG:root:Utilizing ADVERT Backend
>>>                                    DEBUG:root:Parsing URL: advert://advert.cct.lsu.edu:8080
>>>                                    DEBUG:root:Server: advert.cct.lsu.edu Port 8080 server_connect_url: None
>>>                                    DEBUG:root:initialized BigJob: bigjob:6638ce9e-21d6-11e1-ac76-002215124496
>>>                                    Traceback (most recent call last):
>>>                                      File "example_manyjob_affinity.py", line 61, in <module>
>>>                                        mjs = many_job_affinity_service(resource_list, COORDINATION_URL)
>>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job_affinity.py", line 19, in __init__
>>>                                        super(many_job_affinity_service, self).__init__(bigjob_list, advert_host)
>>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 59, in __init__
>>>                                        self.__init_bigjobs()
>>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 74, in __init_bigjobs
>>>                                        self.__start_bigjob(i)
>>>                                      File "/N/soft/SAGA/saga/1.6/gcc-4.1.2/lib/python2.7/site-packages/BigJob-0.3.2-py2.7.egg/bigjob_dynamic/many_job.py", line 98, in __start_bigjob
>>>                                        bj_dict["number_of_processes"],
>>>                                    KeyError: 'number_of_processes'
>>>                                    Cancel Pilot Job
>>>                                    stop pilot job:
>>>                                    DEBUG:root:create advert entry: advert://advert.cct.lsu.edu:8080/
>>>                                    DEBUG:root:update state of pilot job to: Done Stopped: True
>>>                                    DEBUG:root:delete pilot job:
>>>
>>>
>>>  _______________________________________________
>>>                                    Bigjob-users mailing
>>>                                    list
>>>                                    Bigjob-users at mail.cct.lsu.edu
>>>
>>>  https://mail.cct.lsu.edu/mailman/listinfo/bigjob-users
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>

