[Simfactory] "Cleanup" issues
Erik Schnetter
schnetter at cct.lsu.edu
Mon Feb 7 09:33:23 CST 2011
One of the new features of the Python SimFactory is that cleanup is
now executed automatically. This was always planned but never
implemented in the Perl version.
Background: A simulation goes through "cycles" in its evolution. It is
first created, which determines the executable and the parameter file,
and is then in an inactive state. Each time a restart is submitted,
the simulation becomes active. While active, it is first waiting in
the queue, then running in the queue, and then finished. Cleaning up
the restart moves the simulation back into the inactive state, where
the next restart can be submitted if necessary. Cleaning up a
simulation performs a number of currently small but possibly important
actions, such as correcting file permissions on stdout/stderr or
removing half-written checkpoint files. It can also (although this is
not yet done) save disk space by compressing files, hard-linking
Formaline tarballs, or scheduling archival of simulation results to a
tape archive to ensure simulation data are not lost. Basically,
everything that one would like to perform automatically after a
restart has finished should be done during cleanup, ensuring these
actions are performed automatically and consistently.
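To make the cycle described above concrete, here is a minimal sketch of it as a small state machine. The state and event names (State, "submit", "start", "finish", "cleanup") are made up for illustration and are not taken from the SimFactory code.

```python
from enum import Enum


class State(Enum):
    """Hypothetical names for the states a simulation passes through."""
    INACTIVE = "inactive"   # created, or cleaned up after a restart
    QUEUED = "queued"       # active: waiting in the queue
    RUNNING = "running"     # active: running in the queue
    FINISHED = "finished"   # active: done, but not yet cleaned up


# Allowed transitions: (current state, event) -> next state.
TRANSITIONS = {
    (State.INACTIVE, "submit"): State.QUEUED,
    (State.QUEUED, "start"): State.RUNNING,
    (State.RUNNING, "finish"): State.FINISHED,
    (State.FINISHED, "cleanup"): State.INACTIVE,
}


def advance(state, event):
    """Return the next state, or raise ValueError for a disallowed event."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            "cannot apply %r to a simulation that is %s" % (event, state.value)
        )
```

In this sketch, cleanup is the only way back to the inactive state, and applying "cleanup" to a queued or running restart is rejected instead of silently performed.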
In Perl, it used to be necessary to run cleanup manually, which means
that it often would not be run. In Python it runs automatically when
certain simfactory actions are performed (e.g. submitting a job). This
creates certain problems, and I am now looking for ideas and
suggestions to correct these.
1. If there is an error in the MDB, then simfactory erroneously thinks
that a restart is finished and runs cleanup, although the restart is
actually queued or running. Detecting the state of a restart depends
on regular expressions matching qstat output, which is somewhat
fragile: the output format may change on systems without warning, and
the expressions are not easy to get correct.
2. Since cleanup runs automatically and in the background, there is no
good way to output its actions to the screen: people do not expect it
to run, and are working on something else. For example, if someone
submits a simulation, they want to ensure this is done correctly,
which is somewhat stressful since this is a critical step where errors
can waste days or weeks. If cleanup outputs additional errors at this
time, people are annoyed or confused, or will ignore them.
3. Cleanup output is often littered with error messages that can be
safely ignored. For example, if data are deleted automatically, some
old simulations are incomplete, and cleanup will detect this and then
emit an error message about each such simulation.
It thus seems to me that we need a safer way to run cleanup. These
problems are fundamental, and we need a different approach, not just a
gradual improvement. Here is one idea:
- When a restart finishes successfully (and only then), it writes a
marker into the restart directory indicating that this restart is
ready for cleanup.
- Only simulations with such a marker are cleaned up automatically.
Cleanup output goes into a specific file in that restart, similar to
the job stdout and stderr. Output from an automatic cleanup is never
printed to the screen so that people are not confused.
- Whenever an action is performed on a simulation that changes its
state, this restart is also cleaned up automatically. This allows
re-submitting a failed restart without having to clean it up manually.
- There is a cleanup command that can be run explicitly if necessary,
and which is a harmless no-op if not necessary. Output from the manual
cleanup is written both to the restart and output to screen. (Of
course, all output is also written to the simulation's log file.)
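The marker idea above could be sketched roughly as follows. All file names (READY_FOR_CLEANUP, CLEANUP_DONE, cleanup.out) and both function names are hypothetical, the actual cleanup actions are replaced by placeholder messages, and writing to the simulation's log file is omitted. The second marker recording that cleanup has already happened is also an assumption, added here only so that a repeated manual cleanup becomes a no-op.

```python
import os

# Hypothetical file names inside a restart directory.
READY_MARKER = "READY_FOR_CLEANUP"   # written by a successfully finished restart
DONE_MARKER = "CLEANUP_DONE"         # assumption: records that cleanup has run
CLEANUP_LOG = "cleanup.out"          # cleanup output, next to stdout/stderr


def mark_ready(restart_dir):
    """To be called by the job itself, only after it finished successfully."""
    with open(os.path.join(restart_dir, READY_MARKER), "w") as f:
        f.write("ready\n")


def cleanup(restart_dir, manual=False):
    """Clean up one restart; return True if any work was done.

    Automatic cleanup (manual=False) only touches restarts that carry the
    ready marker and never prints to the screen. Manual cleanup also handles
    restarts without a marker (e.g. failed ones) and echoes its output.
    """
    ready = os.path.join(restart_dir, READY_MARKER)
    done = os.path.join(restart_dir, DONE_MARKER)

    if os.path.exists(done):
        return False              # already cleaned up: harmless no-op
    if not manual and not os.path.exists(ready):
        return False              # not marked: automatic cleanup skips it

    # Placeholder for the real actions (fixing permissions, removing
    # half-written checkpoints, ...).
    messages = ["cleanup started", "cleanup finished"]

    with open(os.path.join(restart_dir, CLEANUP_LOG), "a") as log:
        for msg in messages:
            log.write(msg + "\n")
            if manual:
                print(msg)        # only manual cleanup writes to the screen

    if os.path.exists(ready):
        os.remove(ready)
    with open(done, "w") as f:
        f.write("done\n")
    return True
```

With this layout, an automatic pass only needs to look for the ready marker in each restart directory, and anyone hunting for cleanup errors knows to look in the one output file of that restart.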
This new strategy would have several advantages:
- Cleanup is still run automatically and consistently
- Cleanup is faster, since fewer simulations need to be checked
- Automatic cleanup is less aggressive, and there is less danger
- Cleanup output is always found in a specific place, so that people
know where to look for errors
- Cleanup output never confuses people
- Spurious errors (about old simulations) are avoided
One issue that is not addressed is that cleaning up a simulation may
take a significant amount of time, e.g. if the qstat command is slow,
or if scheduling the archival of a simulation takes a long time. If this is
the case, then cleanups should be run in the background. There could
also be a cron job running automatic cleanups every day. Or jobs that
are finished could launch a cleanup of themselves automatically; if
run on the head node, this would not count towards the queue time.
-erik
--
Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/