[Bigjob-users] (Many) Problems Running BigJob via GRAM

Ole Weidner oweidner at cct.lsu.edu
Sun Jan 22 20:57:52 CST 2012


Hi all,

BigJob doesn't seem to work properly with Globus GRAM. I have a complete and working SAGA & Globus stack installed on an OSG submit machine (engage-submit3.renci.org), and I'm using the latest BigJob from PyPI. I'm running example_local_single.py from the website with gram://qb1.loni.org/jobmanager-fork as the URL and Redis for command & control (C&C). I discovered several problems:

(1) Even though Globus is used, BigJob still insists on using SSH (with password-less login!!!) to create the remote directory.

01/22/2012 09:14:29 PM - bigjob - WARNING - Error creating directory: /home/oweidner/agent/bj-fad5478a-4567-11e1-adea-bc305b7ee8dc at: qb1.loni.org SSH password-less login activated?

(2) After password-less login was (reluctantly) activated, it still didn't work. When BigJob tried to create the working directory remotely via SSH, I saw the following error message:

01/22/2012 09:16:38 PM - paramiko.transport.sftp - INFO - [chan 1] Opened sftp connection (server version 3)
01/22/2012 09:16:38 PM - paramiko.transport.sftp - DEBUG - [chan 1] mkdir('/home/oweidner/agent/bj-47b9260c-4568-11e1-92a5-bc305b7ee8dc', 511)
*** print_exception:
Traceback (most recent call last):
  File "/home/oweidner/software/bigjob_pypi/lib/python2.7/site-packages/BigJob-0.4.33-py2.7.egg/bigjob/bigjob_manager.py", line 602, in __create_remote_directory
    sftp.mkdir(target_path)
  File "build/bdist.linux-x86_64/egg/paramiko/sftp_client.py", line 303, in mkdir
    self._request(CMD_MKDIR, path, attr)
IOError: [Errno 2] No such file

First of all, I think BigJob should *STOP* and *FAIL* at this point, instead of just printing the error and continuing (only to fail later)! The problem seems to be that it can't create the /home/oweidner/agent/<uuid> directory recursively; if I create /home/oweidner/agent manually, it works. Secondly, as a fall-back, BigJob could just dump everything into 'pwd' on the remote machine if it can't create directories remotely.
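The recursive-mkdir part could be fixed by walking the path components and creating each missing level in turn, raising on the first real error instead of swallowing it. A minimal sketch -- the `remote_makedirs` name is mine, not BigJob's API, and `sftp` is assumed to behave like a paramiko SFTPClient (with `stat` and `mkdir`):

```python
import posixpath

def remote_makedirs(sftp, path):
    """Create `path` and any missing parents over SFTP, like os.makedirs.

    `sftp` is assumed to expose paramiko-style stat()/mkdir(). Any level
    that cannot be created raises instead of being silently ignored, so
    the caller fails fast rather than failing later.
    """
    parts = [p for p in path.split("/") if p]
    current = "/" if path.startswith("/") else ""
    for part in parts:
        current = posixpath.join(current, part)
        try:
            sftp.stat(current)      # level already exists -> skip it
        except IOError:
            sftp.mkdir(current)     # create the missing level, or raise
```

With something like this in place, mkdir('/home/oweidner/agent/<uuid>') would no longer depend on /home/oweidner/agent already existing.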

(3) Once (1) and (2) were fixed, I got the encouraging message: "Pilot Job/BigJob URL: bigjob:bj-5f55ffce-4568-11e1-9479-bc305b7ee8dc:qb1.loni.org State: Running"

However, after that the job never leaves 'Unknown' state:

01/22/2012 09:17:19 PM - bigjob - DEBUG - SJ Attributes: <bigjob.description object at 0x1b33fc8>
state: Unknown
state: Unknown
state: Unknown
state: Unknown
state: Unknown
...

I don't have BigJob installed on QueenBee, since remote bootstrapping shouldn't be a problem when using Redis. However, when I check the logs in the agent directory, I see the following:

stdout-bigjob_agent.txt:
========================
SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.
Python version:  0
Python path: ['/home/oweidner/agent/bj-6ff435a6-456a-11e1-aa75-bc305b7ee8dc/../../', '/home/oweidner/agent/bj-6ff435a6-456a-11e1-aa75-bc305b7ee8dc/../', '', '/usr/lib64/python23.zip', '/usr/lib64/python2.3', '/usr/lib64/python2.3/plat-linux2', '/usr/lib64/python2.3/lib-tk', '/usr/lib64/python2.3/lib-dynload', '/usr/lib64/python2.3/site-packages', '/usr/lib64/python2.3/site-packages/gtk-2.0', '/usr/lib/python2.3/site-packages']
BigJob not installed. Attempting to install it.

So why is the Python version '0'? When I log in interactively, the default 'python' command points to 2.6 -- apparently this is not used:

stderr-bigjob_agent.txt:
========================
Python 2.3.4
Traceback (most recent call last):
  File "/home/oweidner/.bigjob/bigjob-bootstrap.py", line 23, in ?
    import subprocess
ImportError: No module named subprocess
Traceback (most recent call last):
  File "<string>", line 25, in ?
IOError: [Errno 2] No such file or directory: '/home/oweidner/.bigjob/python/bin/activate_this.py'

So apparently, the agent is never bootstrapped and started on the remote side. OK. But why does BigJob not tell me that, instead of claiming that "Pilot Job/BigJob URL: bigjob:bj-5f55ffce-4568-11e1-9479-bc305b7ee8dc:qb1.loni.org State: Running"?
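Instead of polling 'Unknown' forever, the client side could give up after a timeout and surface an error to the user. A hypothetical sketch -- `wait_for_state` and the `get_state` callable are my own stand-ins for whatever BigJob uses internally, not its actual API:

```python
import time

def wait_for_state(get_state, target="Running", timeout=60.0, interval=1.0):
    """Poll get_state() until it returns `target`, or raise on timeout.

    Raises RuntimeError instead of looping on 'Unknown' indefinitely,
    so a failed agent bootstrap becomes visible to the user quickly.
    """
    state = None
    deadline = time.time() + timeout
    while time.time() < deadline:
        state = get_state()
        if state == target:
            return state
        if state == "Failed":
            raise RuntimeError("agent reported failure")
        time.sleep(interval)
    raise RuntimeError("agent never reached state %r (last seen: %r)"
                       % (target, state))
```

That way, a bootstrap that dies on the remote side shows up as a client-side error after `timeout` seconds rather than an endless stream of "state: Unknown".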

It also seems that the 'ImportError: No module named subprocess' is not handled properly -- the bootstrap script just carries on.
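The bootstrap could check the interpreter version up front and abort with a clear message on anything older than 2.4 (where the subprocess module first appeared). A sketch under those assumptions -- this is not the actual bigjob-bootstrap.py logic, and `check_interpreter` is a name I made up:

```python
import sys

def check_interpreter(version_info=sys.version_info, minimum=(2, 4)):
    """Return True if this Python is new enough for the agent bootstrap.

    subprocess was added in Python 2.4, so anything older (like the
    2.3 that GRAM apparently picks up on QueenBee) cannot run the
    bootstrap at all and should fail loudly, not carry on.
    """
    return tuple(version_info[:2]) >= minimum

if not check_interpreter():
    sys.stderr.write("Python %s is too old for the BigJob agent; "
                     "need >= 2.4\n" % sys.version.split()[0])
    sys.exit(1)
```

A check like this would also have flagged the 'Python version: 0' oddity immediately, instead of letting the agent attempt (and botch) a self-install.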

Obviously, BigJob needs a lot of improvement, especially when it comes to error reporting and handling! I thought using BigJob with Globus was a standard procedure that had been used successfully for years. Apparently this is not the case (or I'm doing something wrong).

While we should think about how to improve BigJob and make it usable outside tightly controlled 'CSA' installations, I would like to know if there's a quick & dirty fix for the problems outlined above? I would really like to demonstrate BigJob interoperability between OSG and QueenBee in our paper. However, the deadline is in roughly 24 hours ;-)

Thoughts? 

Thanks,
Ole




