[Saga-devel] Re: help/ideas needed (fwd)

'Andre Merzky' andre at merzky.net
Sat Jan 24 18:56:10 CST 2009


Hi Hartmut, 

Thanks for the suggestions below!  Alas, they don't really
help.  The application either complains about invalid UUID
modes, or shows the same behaviour...

For the time being, I simply submit a test script which
loops until a simple saga test application completes
successfully.  That is ugly, and slow (obviously).  So, if
you think that boost::uuid or so could fix it, I'm all for
it!
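
For reference, the retry workaround described above can be sketched
as follows.  This is only an illustration, not the script actually
submitted: the helper name retry_until_success, the attempt cap, and
the delay between attempts are all assumptions.

```sh
#!/bin/sh
# Hypothetical helper (not part of SAGA): run a command repeatedly
# until it exits with status 0, or until the attempt cap is reached.
#
#   $1   maximum number of attempts (assumed cap, pick what fits)
#   $2   seconds to sleep between attempts
#   $3.. the command to run, with its arguments
retry_until_success() {
    max_attempts=$1
    delay=$2
    shift 2

    attempt=1
    while :; do
        # Run the command; stop as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the cap is reached, without a trailing sleep.
        if [ "$attempt" -ge "$max_attempts" ]; then
            break
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 1
}
```

With that in place, something like
"retry_until_success 60 10 saga-file list_dir /" would poll every 10
seconds, for up to 60 attempts, until the simple SAGA test succeeds.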

Thanks, Andre.


Quoting [Hartmut Kaiser] (Jan 23 2009):
> 
> Andre,
> 
> From what I can see, the UUID generator is screwed up. I see identical UUIDs
> for different entities, which might be the reason for the mess.
> SAGA's UUID generator supports different modes (see
> impl/engine/uuid/saga_uuid.h:66):
> 
> enum {
>     SAGA_UUID_MAKE_V1 = (1 << 0), /* DCE 1.1 v1 SAGA_UUID */
>     SAGA_UUID_MAKE_V3 = (1 << 1), /* DCE 1.1 v3 SAGA_UUID */
>     SAGA_UUID_MAKE_V4 = (1 << 2), /* DCE 1.1 v4 SAGA_UUID */
> #if defined(WIN32) || defined(WIN64)
>     SAGA_UUID_MAKE_SYSTEM = (1 << 3), /* rely on the operating system to generate an uuid */
> #else
>     SAGA_UUID_MAKE_SYSTEM = SAGA_UUID_MAKE_V1, 
> #endif
>     SAGA_UUID_MAKE_MC = (1 << 4)  /* enforce multi-cast MAC address */
> };
> 
> The current mode is always: SAGA_UUID_MAKE_SYSTEM (see:
> impl/engine/uuid.hpp:120). 
> 
> Could you try to change that to SAGA_UUID_MAKE_V3, SAGA_UUID_MAKE_V4,
> SAGA_UUID_MAKE_V3|SAGA_UUID_MAKE_MC, or SAGA_UUID_MAKE_V4|SAGA_UUID_MAKE_MC
> and see if that changes the picture?
> 
> Just a thought...
> Regards Hartmut
> 
> > -----Original Message-----
> > From: saga-devel-bounces at cct.lsu.edu [mailto:saga-devel-
> > bounces at cct.lsu.edu] On Behalf Of Shantenu Jha
> > Sent: Friday, January 23, 2009 8:30 AM
> > To: saga-devel at cct.lsu.edu
> > Subject: [Saga-devel] Fwd: help/ideas needed (fwd)
> > 
> > 
> > From Andre, who is having e-mail problems.
> > 
> > Please read through. Thanks.
> > 
> > Shantenu
> > 
> > 
> > ---------- Forwarded message ----------
> > From: Andre Merzky <andre at merzky.net>
> > Date: 2009/1/23
> > Subject: Re: help/ideas needed
> > To: SAGA Impl List <saga-devel at cct.lsu.edu>
> > 
> > 
> > attachments...
> > 
> > A
> > 
> > Quoting [Andre Merzky] (Jan 23 2009):
> > > Date: Fri, 23 Jan 2009 14:06:12 +0100
> > > From: Andre Merzky <andre at merzky.net>
> > > To: SAGA Impl List <saga-devel at cct.lsu.edu>
> > > Bcc: Andre Merzky <andre at merzky.net>
> > > Subject: help/ideas needed
> > >
> > > Hi Folx,
> > >
> > > I stumbled over a really strange problem, and have no idea
> > > how to explain, nor how to fix it.  So, I'd like to invite
> > > you all to brainstorm - all ideas are appreciated.
> > >
> > > So, what is happening?
> > >
> > > I execute the following script:
> > >
> > > ------------------------------------------------------------------------
> > > #!/bin/sh
> > >
> > > export SAGA_VERBOSE=100
> > > export SAGA_PARENT_JOBID=saga-parent-id10039
> > > export SAGA_SSH_KEY=`ls /tmp/saga_saga-parent-id*_ssh`
> > > export SAGA_SSH_PUB=`ls /tmp/saga_saga-parent-id*_ssh.pub`
> > > export SAGA_SSH_USER=amerzky
> > >
> > > env > /tmp/env-0
> > > /root/saga/examples/misc/context ssh                 1> /tmp/o-0-0 2> /tmp/e-0-0
> > > saga-file list_dir /                                 1> /tmp/o-0-1 2> /tmp/e-0-1
> > > saga-file list_dir ssh://amerzky@gg101.cct.lsu.edu/  1> /tmp/o-0-2 2> /tmp/e-0-2
> > >
> > > sleep 600
> > >
> > > env > /tmp/env-1
> > > /root/saga/examples/misc/context ssh                 1> /tmp/o-1-0 2> /tmp/e-1-0
> > > saga-file list_dir /                                 1> /tmp/o-1-1 2> /tmp/e-1-1
> > > saga-file list_dir ssh://amerzky@gg101.cct.lsu.edu/  1> /tmp/o-1-2 2> /tmp/e-1-2
> > > ------------------------------------------------------------------------
> > >
> > > You will notice that this is doing the same thing twice,
> > > with a sleep of 10 minutes in between.  This script is
> > > executed on a fresh EC2 virtual machine instance, which has
> > > SAGA and all requisites preinstalled.  As soon as the ssh
> > > daemon is up, I copy a couple of ssh keys around, and then
> > > run that script.  Nobody else is accessing the instance.
> > > All other running processes look innocent, and should not
> > > (tm) interfere with SAGA.
> > >
> > > Now, what is the problem?
> > >
> > > In the above setup, the following files should show no
> > > differences, modulo logged saga object IDs:
> > >
> > >    /tmp/env-0  /tmp/env-1
> > >    /tmp/o-0-0  /tmp/o-1-0
> > >    /tmp/e-0-0  /tmp/e-1-0
> > >    /tmp/o-0-1  /tmp/o-1-1
> > >    /tmp/e-0-1  /tmp/e-1-1
> > >    /tmp/o-0-2  /tmp/o-1-2
> > >    /tmp/e-0-2  /tmp/e-1-2
> > >
> > > Well, the env does not show any diff, but the saga logs do
> > > (they are attached).
> > > In particular, I make the following observations:
> > >
> > >   - the set of _loaded_ adaptors seems to be the same
> > >   - the set of _used_   adaptors differs
> > >
> > > I don't understand how that can be!  In the
> > > examples/misc/context case: the first run shows that the aws
> > > context adaptor fails (this is the _only_ adaptor used).
> > > The second run shows that aws_context, default_advert,
> > > default_replica, and gridsam_job fail - they all implement
> > > the context cpi.  The ssh_context adaptor (which is the one
> > > I need) succeeds.
> > >
> > > ------------------------------------------------------------------------
> > > e-0-0:
> > > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > > INFO       : engine.cpp                : loading static adaptors
> > > INFO       : engine.cpp                : loading dynamic adaptors
> > > Created exception: SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > > Created exception: SAGA(NoAdaptor): adaptor_selector.cpp(184): Could not select any matching adaptor for: context_cpi::__init__
> > > Created exception: SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> > > Created exception: SAGA(BadParameter):
> > >   SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > >   SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> > >
> > > SAGA(BadParameter):
> > >   SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > >   SAGA(NoAdaptor): proxy.cpp(227): No adaptor succeeded in executing constructor for context_cpi
> > > ------------------------------------------------------------------------
> > >
> > > ------------------------------------------------------------------------
> > > e-1-0:
> > > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > > Created exception: SAGA(NoSuccess): ini.cpp(646): Cannot open file /etc/saga.ini
> > > INFO       : engine.cpp                : loading static adaptors
> > > INFO       : engine.cpp                : loading dynamic adaptors
> > > Created exception: SAGA(BadParameter): aws_context: aws_context_adaptor.cpp(285): Can't handle context types others than ec2 eucalyptus gumbocloud - found ssh
> > > Created exception: SAGA(BadParameter): default_advert: context.cpp(34): Can't handle context types others than 'default_advert_db' (got: ssh)
> > > Created exception: SAGA(BadParameter): default_replica: context.cpp(33): Can't handle context types others than 'default_replica_db' (got: ssh)
> > > Created exception: SAGA(BadParameter): omii_gridsam_job: omii_gridsam_context.cpp(30): Can't handle context types others than 'omii_gridsam' (got: ssh)
> > > (no error shown as the ssh_context wins)
> > > ------------------------------------------------------------------------
> > >
> > > So, how can it be that identical runs behave differently
> > > after some waiting time?  Why are some adaptors not
> > > tried, after they loaded successfully?
> > >
> > > It is not the case that the second run always succeeds - the
> > > number of runs does not seem to have any effect.  Rather, it
> > > is the waiting time after the first run and the success
> > > chance of a later run which seem correlated.
> > >
> > > The above behaviour is screwing with my performance
> > > measurements, as you might expect.  Also, it explains the
> > > sudden loss of some of the mapreduce workers I saw (the
> > > early-starters fail).
> > >
> > > For now, I can add a 10 min waiting time - then the
> > > following SAGA jobs mostly succeed (at the moment).  Well,
> > > consider I want to do 10 MapReduce workers - that is 10 EC2
> > > instances, times 10 minutes - that is about 1.7 hours per
> > > test run - if everything else is perfect!  Difficult to do a
> > > parameter sweep, or to get some statistics.
> > >
> > > BTW, and unrelated: MapReduce should use async ops for
> > > creating job services and for creating/running jobs!
> > >
> > > So, that is where I am stuck.  As said: any idea on what I
> > > could investigate further would be appreciated!
> > >
> > > Cheers, Andre.
> > >
> > >
> > > PS.: CCT is swallowing my mails again -- it is that time of
> > > > year again it seems.  So, I won't see any answers sent via
> > > the cct mailer - my mail address itself seems fine (get the
> > > usual amount of spam ;-).  Sorry for that additional
> > > inconvenience - a ticket is filed (and dormant :-().
> > --
> > Nothing is ever easy.
> 



-- 
Nothing is ever easy.

