[Simfactory] Presubmitting
Ian Hinder
ian.hinder at aei.mpg.de
Thu Jul 9 11:50:00 CDT 2009
On 7 Jul 2009, at 16:46, Erik Schnetter wrote:
> On Jun 30, 2009, at 15:21:40, Ian Hinder wrote:
>
>> Hi,
>>
>> I currently create and submit simulations with "create-submit". If I
>> want to submit a chain of simulations, I then use "submit --presubmit"
>> for each one. If I want to add to the chain, I then either have to do
>> "submit --presubmit" again if there is at least one of the jobs in the
>> queue, or "cleanup-submit" if they have all finished.
>>
>> 1. Would it be possible to specify the number of "presubmits" at the
>> "create-submit" stage? And also when using the "submit --presubmit"
>> command?
>
> Would a --count argument be good for this? Or should the --presubmit
> option take a numeric argument? If so, should that argument be
> optional?
I think the presubmit option should take a numeric argument, and that
it should be optional.
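To make the "optional numeric argument" idea concrete, here is a minimal Python sketch using argparse. It is only an illustration of the option semantics, not simfactory's actual option handling; the parser layout, the defaults (0 when the option is absent, 1 when it is given without a number) and the name build_parser are all assumptions.

```python
import argparse


def build_parser():
    # Hypothetical parser; simfactory's real option handling may differ.
    parser = argparse.ArgumentParser(prog="sim")
    parser.add_argument(
        "--presubmit",
        nargs="?",    # the numeric argument is optional
        type=int,
        const=1,      # assumed: bare --presubmit means one presubmitted job
        default=0,    # assumed: option absent means no presubmitted jobs
        metavar="N",
    )
    parser.add_argument("simulation")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--presubmit", "3", "A"])
    print(args.simulation, args.presubmit)
```

One caveat of this particular approach: with argparse, a bare --presubmit has to come after the simulation name (or be written as --presubmit=3), otherwise the simulation name is taken as the count.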
>> 2. It would be nice to be able to "extend" the chain of simulations in
>> a uniform way which did not depend on whether the jobs had finished
>> yet or not.
>
> You mean something like maybe-cleanup-submit-presubmit?
Basically, I think that it should check to see if there are any jobs
in the queue. If there are, then it should presubmit the given number
of jobs. If there are not, then it should clean up the simulation,
submit a new one, and then presubmit any more that have been asked for.
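The decision described above can be sketched as a small Python function that returns the existing commands to run. This is only a sketch under assumptions: the function name plan_continue is hypothetical, the number of queued jobs is assumed to be known already, and "requested" is assumed to count the freshly submitted job as well as the presubmitted ones.

```python
from typing import List


def plan_continue(jobs_in_queue: int, requested: int) -> List[str]:
    """Return the commands needed to extend a chain by `requested` jobs."""
    if requested < 1:
        return []
    if jobs_in_queue > 0:
        # Something is still queued or running: only presubmit.
        return ["submit --presubmit"] * requested
    # Nothing is left in the queue: clean up and submit, then presubmit the rest.
    return ["cleanup-submit"] + ["submit --presubmit"] * (requested - 1)


if __name__ == "__main__":
    print(plan_continue(jobs_in_queue=0, requested=3))
```

The point is that the user asks for the same thing in both cases, and the tool picks between "submit --presubmit" and "cleanup-submit" itself.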
>> 3. Since each of these commands (when run remotely) causes an ssh
>> connection, it can take quite a while. It would be nice to be able to
>> "continue" more than one simulation in a single command. This would
>> be helpful when managing a batch of simulations.
>
> Here is a bit of brainstorming:
>
> sim remote kraken { submit A; submit B; submit C; }
> sim remote kraken submit { A B C }
> sim remote kraken { cleanup submit } A
> sim remote { kraken ranger } { cleanup submit } { A B C }
>
> sim remote kraken foreach S (A B C) cleanup-submit S
>
> Actually, the following works already:
>
> ./simfactory/sim remote damiana execute 'for id in 0006 0007 0008; do ./simfactory/sim cleanup $id; done'
>
> I think that the simulation factory should not try to duplicate what
> the shell can already do. I don't want to invent a new language --
> it would be better to make the simulation factory work smoothly with
> the shell, as the shell already has a for command. Alternatively,
> the simfactory commands could be exported as perl (or python)
> commands, and you could then use perl or python looping constructs.
I am already using shell commands, but on the local machine. I didn't
know you could use the "execute" command. However, I think that all
these "continuation" commands only need one argument, the simulation
name. So it would make MUCH more sense to just let you specify a list
of simulations, as in
> sim remote kraken submit A B C
Does this currently work?
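Until a list of simulations is accepted directly, the single-ssh-connection effect can be approximated by generating the "execute" loop from Erik's example. The Python sketch below only builds the command line and does not run anything; the helper name remote_loop is hypothetical, and it assumes that "execute" behaves on other machines as it does in the damiana example and that the subcommand takes the simulation name as its only argument.

```python
import shlex
from typing import List


def remote_loop(machine: str, subcommand: str, simulations: List[str]) -> List[str]:
    """Build one `sim remote <machine> execute` call looping over simulations."""
    names = " ".join(shlex.quote(s) for s in simulations)
    loop = (
        f"for id in {names}; "
        f'do ./simfactory/sim {shlex.quote(subcommand)} "$id"; done'
    )
    return ["./simfactory/sim", "remote", machine, "execute", loop]


if __name__ == "__main__":
    argv = remote_loop("kraken", "cleanup-submit", ["A", "B", "C"])
    print(shlex.join(argv))  # shell-quoted command line, ready to paste
```

This keeps the looping in the shell, as suggested, while only one remote connection is made for the whole batch.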
>> 4. Whenever I run create-submit, I invariably want to perform a sync
>> beforehand to make sure that the parfiles directory on the remote
>> machine is up to date. However, only the parfiles directory needs to
>> be synced, not the whole tree. Would it be possible for this to be
>> implemented in simfactory?
>
> What about a new command sync-parfiles? Or maybe sync
> --subdir=parfiles?
I think that this would only ever be used with "create", since that is
when the parameter file is used. So it would make more sense to have
it as a sub-option of create (and create-submit). Maybe
    sim remote kraken create-submit --sync-parfiles ...
I would want the sync-parfiles option to default to "true"; if that
is not what everyone wants, the default could be specified in the
configuration database.
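The precedence I have in mind is: an explicit command-line choice wins, then the configuration database, then a built-in default. A minimal Python sketch, where the configuration database is stood in for by a plain dict and both the key name "sync-parfiles" and the function name resolve_option are assumptions:

```python
from typing import Dict, Optional


def resolve_option(cli_value: Optional[bool], config: Dict[str, bool],
                   key: str, builtin_default: bool) -> bool:
    """Pick the effective value: command line, then config, then built-in."""
    if cli_value is not None:
        return cli_value  # the user said so explicitly
    return config.get(key, builtin_default)


if __name__ == "__main__":
    # A user whose configuration turns the sync off by default:
    print(resolve_option(None, {"sync-parfiles": False}, "sync-parfiles", True))
```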
>> 5. When I run "sim stop", only the first job in a chain is stopped.
>> Is there a way to make it stop all the jobs? Maybe a different
>> command, or an option to "stop"?
>
> The stop command aborts a job, removing it from the queue. What
> should happen to the other jobs? Should they be held or deleted?
> (Maybe either, depending on an option.) What should be the
> default? (They should probably be removed from the queue.) Should
> there also be "hold" and a "release" commands for jobs?
It would be nice to work at an abstraction level above that of jobs;
i.e. at the "simulation" level. On a high level, these are the
operations I like to perform:
* Start a simulation (with a given number of jobs in a chain)
* Stop a simulation (killing all jobs)
* Continue a simulation, with a given number of jobs
So yes, I agree that they should be removed, as that is the usual
case. You could then have an option to "stop" like "only-first" or
something (that is a bad name), but for me this is not a priority.
The things I find myself still doing manually by logging into the
cluster are:
* Checking what jobs are currently running and queued
* Killing jobs, because using "stop" does not kill all of them
* Sometimes releasing a hold on the first presubmitted job, as the
  mechanism sometimes gets confused. (Maybe when you "continue" a
  simulation, it could check whether there is a nonsensical hold on
  the first job, e.g. waiting for a job which doesn't exist, and
  remove it.)
* Watching the standard output and error (simfactory does not have
  continuous update for this, and is generally quite slow)
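The "nonsensical hold" check mentioned in the list above could look roughly like the following Python sketch. It assumes the queue state is already available as a mapping from job id to the id of the job it is waiting for (or None if it is not held); how simfactory would obtain that from the queuing system is not covered, and the name dangling_holds is hypothetical.

```python
from typing import Dict, List, Optional


def dangling_holds(queue: Dict[str, Optional[str]]) -> List[str]:
    """Return ids of jobs held on a dependency that is not in the queue."""
    return [
        job
        for job, waits_for in queue.items()
        if waits_for is not None and waits_for not in queue
    ]


if __name__ == "__main__":
    # Job 101 waits for job 100, which no longer exists in the queue.
    print(dangling_holds({"101": "100", "102": "101", "103": None}))
```

Any job this returns is a candidate for having its hold released when the simulation is continued.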
--
Ian Hinder
ian.hinder at aei.mpg.de