[Saga-devel] Fwd: help/ideas needed (fwd)

Hartmut Kaiser hkaiser at cct.lsu.edu
Fri Jan 23 10:02:58 CST 2009


Andre,

From what I can see the UUID generator is screwed up. I see identical UUIDs
for different entities, which might be the reason for the mess.
SAGA's UUID generator supports different modes (see
impl/engine/uuid/saga_uuid.h:66):

enum {
    SAGA_UUID_MAKE_V1 = (1 << 0), /* DCE 1.1 v1 SAGA_UUID */
    SAGA_UUID_MAKE_V3 = (1 << 1), /* DCE 1.1 v3 SAGA_UUID */
    SAGA_UUID_MAKE_V4 = (1 << 2), /* DCE 1.1 v4 SAGA_UUID */
#if defined(WIN32) || defined(WIN64)
    SAGA_UUID_MAKE_SYSTEM = (1 << 3), /* rely on the operating system to generate an uuid */
#else
    SAGA_UUID_MAKE_SYSTEM = SAGA_UUID_MAKE_V1, 
#endif
    SAGA_UUID_MAKE_MC = (1 << 4)  /* enforce multi-cast MAC address */
};

The current mode is always SAGA_UUID_MAKE_SYSTEM (see
impl/engine/uuid.hpp:120), which on non-Windows platforms is the same as
SAGA_UUID_MAKE_V1.

Could you try to change that to SAGA_UUID_MAKE_V3, SAGA_UUID_MAKE_V4,
SAGA_UUID_MAKE_V3|SAGA_UUID_MAKE_MC, or SAGA_UUID_MAKE_V4|SAGA_UUID_MAKE_MC
and see if that changes the picture?
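
One way to check whether repeated IDs really show up in the logs, before
and after switching the mode, is to pull every UUID-shaped token out of the
log files and list the ones that occur more than once. The following is
only a sketch: it assumes the object IDs are logged in the canonical
8-4-4-4-12 hex form, and find_duplicate_uuids is a made-up helper name, not
part of SAGA. It relies on grep -o, which GNU and BSD grep provide.

```shell
#!/bin/sh
# Hypothetical helper (not part of SAGA): print each UUID-shaped token
# that occurs more than once across the given log files.
# Assumption: IDs are logged in the canonical 8-4-4-4-12 hex form.
find_duplicate_uuids() {
    # -o: print only the matching part, -h: no file name prefix,
    # -E: extended regular expressions
    grep -ohE '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}' "$@" \
        | tr 'A-F' 'a-f' \
        | sort \
        | uniq -d    # keep one copy of every repeated line
}
```

For example, find_duplicate_uuids /tmp/e-0-0 /tmp/e-1-0 would list the
candidates. A hit only shows that an ID was logged more than once; whether
it belongs to different entities still needs a look at the surrounding log
lines.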

Just a thought...
Regards Hartmut

> -----Original Message-----
> From: saga-devel-bounces at cct.lsu.edu [mailto:saga-devel-
> bounces at cct.lsu.edu] On Behalf Of Shantenu Jha
> Sent: Friday, January 23, 2009 8:30 AM
> To: saga-devel at cct.lsu.edu
> Subject: [Saga-devel] Fwd: help/ideas needed (fwd)
> 
> 
> From Andre, who is having e-mail problems.
> 
> Please read through. Thanks.
> 
> Shantenu
> 
> 
> ---------- Forwarded message ----------
> From: Andre Merzky <andre at merzky.net>
> Date: 2009/1/23
> Subject: Re: help/ideas needed
> To: SAGA Impl List <saga-devel at cct.lsu.edu>
> 
> 
> attachments...
> 
> A
> 
> Quoting [Andre Merzky] (Jan 23 2009):
> > Date: Fri, 23 Jan 2009 14:06:12 +0100
> > From: Andre Merzky <andre at merzky.net>
> > To: SAGA Impl List <saga-devel at cct.lsu.edu>
> > Bcc: Andre Merzky <andre at merzky.net>
> > Subject: help/ideas needed
> >
> > Hi Folx,
> >
> > I stumbled over a really strange problem, and have no idea
> > how to explain, nor how to fix it.  So, I'd like to invite
> > you all to brainstorm - all ideas are appreciated.
> >
> > So, what is happening?
> >
> > I execute the following script:
> >
> > ------------------------------------------------------------------------
> > #!/bin/sh
> >
> > export SAGA_VERBOSE=100
> > export SAGA_PARENT_JOBID=saga-parent-id10039
> > export SAGA_SSH_KEY=`ls /tmp/saga_saga-parent-id*_ssh`
> > export SAGA_SSH_PUB=`ls /tmp/saga_saga-parent-id*_ssh.pub`
> > export SAGA_SSH_USER=amerzky
> >
> > env > /tmp/env-0
> > /root/saga/examples/misc/context ssh                 1> /tmp/o-0-0 2> /tmp/e-0-0
> > saga-file list_dir /                                 1> /tmp/o-0-1 2> /tmp/e-0-1
> > saga-file list_dir ssh://amerzky@gg101.cct.lsu.edu/  1> /tmp/o-0-2 2> /tmp/e-0-2
> >
> > sleep 600
> >
> > env > /tmp/env-1
> > /root/saga/examples/misc/context ssh                 1> /tmp/o-1-0 2> /tmp/e-1-0
> > saga-file list_dir /                                 1> /tmp/o-1-1 2> /tmp/e-1-1
> > saga-file list_dir ssh://amerzky@gg101.cct.lsu.edu/  1> /tmp/o-1-2 2> /tmp/e-1-2
> > ------------------------------------------------------------------------
> >
> > You will notice that this is doing the same thing twice,
> > with a sleep of 10 minutes (sleep 600) in between.  This
> > script is executed on a fresh EC2 virtual machine instance,
> > which has SAGA and all requisites preinstalled.  As soon as
> > the ssh daemon is up, I copy a couple of ssh keys around,
> > and then run that script.  Nobody else is accessing the
> > instance.  All other running processes look innocent, and
> > should not (tm) interfere with SAGA.
> >
> > Now, what is the problem?
> >
> > In the above setup, the following files should show no
> > differences, modulo logged saga object IDs:
> >
> >    /tmp/env-0  /tmp/env-1
> >    /tmp/o-0-0  /tmp/o-1-0
> >    /tmp/e-0-0  /tmp/e-1-0
> >    /tmp/o-0-1  /tmp/o-1-1
> >    /tmp/e-0-1  /tmp/e-1-1
> >    /tmp/o-0-2  /tmp/o-1-2
> >    /tmp/e-0-2  /tmp/e-1-2
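
A comparison "modulo logged saga object IDs" could be sketched as follows:
mask every ID-shaped token in both files, then diff the results. This
assumes the object IDs appear in the canonical 8-4-4-4-12 hex UUID form,
which the mail does not confirm; mask_ids and diff_modulo_ids are
hypothetical helper names, not SAGA tools.

```shell
#!/bin/sh
# Hypothetical helpers (not part of SAGA).
# Assumption: object IDs are logged as canonical 8-4-4-4-12 hex UUIDs.

# Print the file with every UUID-shaped token replaced by "<id>".
mask_ids() {
    sed -E 's/[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}/<id>/g' "$1"
}

# Diff two logs after masking their IDs.
# Returns diff's exit status: 0 = same, 1 = different, 2 = trouble.
diff_modulo_ids() {
    a=$(mktemp) && b=$(mktemp) || return 2
    mask_ids "$1" > "$a"
    mask_ids "$2" > "$b"
    diff "$a" "$b"
    status=$?
    rm -f "$a" "$b"
    return $status
}
```

With the file pairs above, a call would look like
diff_modulo_ids /tmp/e-0-0 /tmp/e-1-0; any remaining output is a difference
that the changing IDs alone do not explain.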
> >
> > Well, the env does not show any diff, but the SAGA logs do
> > (they are attached).
> > In particular, I make the following observations:
> >
> >   - the set of _loaded_ adaptors seems to be the same
> >   - the set of _used_   adaptors differs
> >
> > I don't understand how that can be!  In the
> > examples/misc/context case: the first run shows that the aws
> > context adaptor fails (this is the _only_ adaptor used).
> > The second run shows that aws_context, default_advert,
> > default_replica, and gridsam_job fail - they all implement
> > the context cpi.  The ssh_context adaptor (which is the one
> > I need) succeeds.
> >
> > ------------------------------------------------------------------------
> > e-0-0:
> > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > INFO       : engine.cpp                : loading static adaptors
> > INFO       : engine.cpp                : loading dynamic adaptors
> > Created exception: SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > Created exception: SAGA(NoAdaptor): adaptor_selector.cpp(184): Could not select any matching adaptor for: context_cpi::__init__
> > Created exception: SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> > Created exception: SAGA(BadParameter):
> >   SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> >   SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> >
> > SAGA(BadParameter):
> >   SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> >   SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> > ------------------------------------------------------------------------
> >
> > ------------------------------------------------------------------------
> > e-1-0:
> > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > INFO       : engine.cpp                : loading static adaptors
> > INFO       : engine.cpp                : loading dynamic adaptors
> > Created exception: SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > Created exception: SAGA(BadParameter): default_advert: context.cpp(34): Can't handle context types others than 'default_advert_db' (got: ssh)
> > Created exception: SAGA(BadParameter): default_replica: context.cpp(33): Can't handle context types others than 'default_replica_db' (got: ssh)
> > Created exception: SAGA(BadParameter): omii_gridsam_job: omii_gridsam_context.cpp(30): Can't handle context types others than 'omii_gridsam' (got: ssh)
> > (no error shown as the ssh_context wins)
> > ------------------------------------------------------------------------
> >
> > So, how can it be that identical runs behave differently
> > after some waiting time?  Why are some adaptors not
> > tried, after they loaded successfully?
> >
> > It is not the case that the second run always succeeds - the
> > number of runs does not seem to have any effect.  Rather,
> > the waiting time before a later run and its chance of
> > success seem to be correlated.
> >
> > The above behaviour is screwing with my performance
> > measurements, as you might expect.  Also, it explains the
> > sudden loss of some of the mapreduce workers I saw (the
> > early-starters fail).
> >
> > For now, I can add a 10 min waiting time - then the
> > following SAGA jobs mostly succeed (at the moment).  Well,
> > consider I want to do 10 MapReduce workers - that is 10 EC2
> > instances, times 10 minutes - that is 100 minutes (about
> > 1.7 hours) per test run - if everything else is perfect!
> > Difficult to do a parameter sweep, or to get some statistics.
> >
> > BTW, and unrelated: MapReduce should use async ops for
> > creating job services and for creating/running jobs!
> >
> > So, that is where I am stuck.  As said: any idea on what I
> > could investigate further would be appreciated!
> >
> > Cheers, Andre.
> >
> >
> > PS.: CCT is swallowing my mails again -- it is that time of
> > year again it seems.  So, I won't see any answers sent via
> > the cct mailer - my mail address itself seems fine (get the
> > usual amount of spam ;-).  Sorry for that additional
> > inconvenience - a ticket is filed (and dormant :-().
> --
> Nothing is ever easy.


