[Bigjob-users] (Many) Problems Running BigJob via GRAM

Andre Luckow andre.luckow at gmail.com
Sun Jan 22 23:58:52 CST 2012


Hi Ole,
> (1) Even though Globus is used, BigJob still insists on using SSH (with password-less login!!!) to create the remote directory.
>
> 01/22/2012 09:14:29 PM - bigjob - WARNING - Error creating directory: /home/oweidner/agent/bj-fad5478a-4567-11e1-adea-bc305b7ee8dc at: qb1.loni.org SSH password-less login activated?

Yes, this is just a warning! It also works without SSH. In that case
the Manager will not create a sub-directory on the resource but will
work in the specified working directory, which must already exist!

> (2) After password-less login was (reluctantly) activated, it still doesn't work. When BigJob tries to create the working directory remotely via ssh, I see the following error message:
>
> 01/22/2012 09:16:38 PM - paramiko.transport.sftp - INFO - [chan 1] Opened sftp connection (server version 3)
> 01/22/2012 09:16:38 PM - paramiko.transport.sftp - DEBUG - [chan 1] mkdir('/home/oweidner/agent/bj-47b9260c-4568-11e1-92a5-bc305b7ee8dc', 511)
> *** print_exception:
> Traceback (most recent call last):
>  File "/home/oweidner/software/bigjob_pypi/lib/python2.7/site-packages/BigJob-0.4.33-py2.7.egg/bigjob/bigjob_manager.py", line 602, in __create_remote_directory
>    sftp.mkdir(target_path)
>  File "build/bdist.linux-x86_64/egg/paramiko/sftp_client.py", line 303, in mkdir
>    self._request(CMD_MKDIR, path, attr)
> IOError: [Errno 2] No such file
>
> First of all, I think that BigJob should *STOP* and *FAIL* at this point and not just print the error and continue (and fail later)! The problem seems to be that it can't create the /home/oweidner/agent/<uuid> directory recursively. If I create /home/oweidner/agent manually, it seems to work. Secondly, I think that as a fall-back solution BigJob should just dump everything in 'pwd' on the remote machine if it can't create directories remotely.

I think you diagnosed this well. There actually is a fallback
solution. It might be that this fallback is somehow not triggered
correctly. I will check.
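
To make the intended behavior concrete, here is a minimal sketch of
recursive directory creation with a fallback. It is not the code in
bigjob_manager.py. The function name ensure_remote_dir and the
fallback_dir parameter are made up for illustration, and sftp is
assumed to be any object with paramiko-style stat(path) and
mkdir(path) methods that raise IOError on failure.

```python
import posixpath


def ensure_remote_dir(sftp, target_path, fallback_dir):
    """Create target_path and any missing parents; return the directory to use.

    `sftp` is any object with paramiko-style stat(path) and mkdir(path).
    `fallback_dir` is assumed to exist already (hypothetical parameter).
    """
    target_path = posixpath.normpath(target_path)

    # Collect the target and its ancestors, then order them shortest first.
    parts = []
    current = target_path
    while True:
        parent = posixpath.dirname(current)
        if parent == current or current == "":
            break
        parts.append(current)
        current = parent
    parts.reverse()

    try:
        for path in parts:
            try:
                sftp.stat(path)   # this level already exists
            except IOError:
                sftp.mkdir(path)  # this level is missing, create it
        return target_path
    except IOError:
        # The tree could not be created: work in the fallback directory.
        return fallback_dir
```

With something like this, a missing /home/oweidner/agent would be
created first, and only a real failure (e.g. no write permission)
would make the Manager use the working directory as-is.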


> (3) Once (1) and (2) were fixed, I got the encouraging message: "Pilot Job/BigJob URL: bigjob:bj-5f55ffce-4568-11e1-9479-bc305b7ee8dc:qb1.loni.org State: Running"
>
> However, after that the job never leaves 'Unknown' state:
>
> 01/22/2012 09:17:19 PM - bigjob - DEBUG - SJ Attributes: <bigjob.description object at 0x1b33fc8>
> state: Unknown
> state: Unknown
> state: Unknown
> state: Unknown
> state: Unknown
> ...
>
> I don't have bigjob installed on QueenBee, since remote bootstrapping shouldn't be a problem when using REDIS. However, when I check the logs in the agent directory, I see the following:
>
> stdout-bigjob_agent.txt:
> ========================
> SAGA and SAGA Python Bindings not found: BigJob only work w/ non-SAGA backends e.g. Redis, ZMQ.
> Python version:  0
> Python path: ['/home/oweidner/agent/bj-6ff435a6-456a-11e1-aa75-bc305b7ee8dc/../../', '/home/oweidner/agent/bj-6ff435a6-456a-11e1-aa75-bc305b7ee8dc/../', '', '/usr/lib64/python23.zip', '/usr/lib64/python2.3', '/usr/lib64/python2.3/plat-linux2', '/usr/lib64/python2.3/lib-tk', '/usr/lib64/python2.3/lib-dynload', '/usr/lib64/python2.3/site-packages', '/usr/lib64/python2.3/site-packages/gtk-2.0', '/usr/lib/python2.3/site-packages']
> BigJob not installed. Attempting to install it.
>
> So why is the Python version '0'? When I log-in interactively, the default 'python' command points to 2.6 -- apparently this is not used:

From looking at the Python path ('/usr/lib64/python2.3/site-packages'),
it looks like a Python 2.3 installation is first in the default path
of the job environment (2.3 is not supported by BJ). Thus, the
installation fails.
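
A check along the following lines at the start of the bootstrap would
at least make this failure obvious in the agent log. This is only a
sketch: check_interpreter and MIN_VERSION are hypothetical names, and
the 2.6 minimum is an assumption for illustration, not a documented
BJ requirement.

```python
import sys

# Assumption: oldest interpreter the agent is willing to bootstrap on.
MIN_VERSION = (2, 6)


def check_interpreter(version_info=None, minimum=MIN_VERSION):
    """Return (ok, message) for the given or the running interpreter."""
    if version_info is None:
        version_info = sys.version_info
    found = tuple(version_info[:2])  # compare major.minor only
    if found < minimum:
        message = "Python %d.%d found, but %d.%d or newer is required" % (
            found + minimum
        )
        return False, message
    return True, "Python %d.%d is supported" % found


if __name__ == "__main__":
    ok, message = check_interpreter()
    print(message)
    if not ok:
        sys.exit(1)  # stop early instead of failing later during install
```

Running the same check in an interactive login and inside a GRAM job
should show whether the two environments pick up different
interpreters.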


> Obviously, BigJob needs lots of improvement, especially when it comes to error reporting and handling! I thought using BigJob with Globus is a standard procedure that has been successfully used for years. Apparently this is not the case (or I'm doing something wrong).

What error do you see? Unfortunately, sites tend to deploy their own
custom JobManagers, which behave slightly differently on each machine.

Andre

