[Simfactory] "Cleanup" issues
ian.hinder at aei.mpg.de
Thu Feb 24 08:09:06 CST 2011
On 24 Feb 2011, at 15:03, Erik Schnetter wrote:
> This is almost sufficient; what is missing is an automatic mechanism
> to clean up those simulations that need it. For example, I think that
> archiving a simulation (on systems which delete simulation data
> automatically) should happen automatically.
> (Tape space is cheap, and if the archiving happens automatically and
> in the background, it wouldn't bother anybody. Assuming there is a
> convenient way to see which simulations are archived, it would be easy
> for the user to delete those simulations that are not needed -- and
> this would be much preferable to the current state where the user has
> to take action to save their data. If you really want to delete
> simulations automatically, you can still do that, but you are then
> able to set your own policy for this.)
> What are your ideas for automating cleanup?
> My idea is that SimFactory should check all simulations for cleanup
> whenever it runs a command that modifies simulations (create, submit,
> delete, cleanup, purge, etc.). If SimFactory is still too aggressive
> then we should address that, but I really would like to clean up
> automatically. Michael, how much work would it be to try to clean up
> after SimFactory executed the user's command, and to have the cleanup
> run in the background, ensuring that at most one cleanup command runs
> at a time?
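A minimal sketch of the "at most one cleanup command runs at a time" part of Erik's proposal, assuming a POSIX system where `fcntl.flock` is available. The lock-file path and the `try_cleanup` helper are hypothetical names for illustration and are not part of SimFactory; launching the call in the background is left out.

```python
import fcntl
import os
import tempfile

# Hypothetical lock-file location; SimFactory defines no such path.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "simfactory-cleanup.lock")


def try_cleanup(cleanup, lock_path=LOCK_PATH):
    """Run cleanup() unless another cleanup already holds the lock.

    Returns True if cleanup ran, False if it was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            return False  # someone else is cleaning up; skip quietly
        try:
            cleanup()
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

A skipped cleanup costs nothing here, which is the point: a second invocation returns at once rather than queueing behind the first. Note that `flock` can be unreliable on some network filesystems, so the lock file would probably need to live on a local disk.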
I don't think this is the right solution to the problem. In the absence of a good mechanism for making something happen "automatically", we are making simfactory do something every time it is run. This could be either too infrequent to be useful or too frequent to be feasible (e.g. statting hundreds of simulations on a slow production filesystem). Rather than making every simfactory command run a cleanup, I think we should come up with a better way to automate it. Further, I think that even if we *can't* come up with a better way, the current way is worse than none at all.
Should we ask the TeraGrid helpdesk for their opinions on how to automatically archive a simulation? Or will they just say "don't do it automatically, do it manually"?
> On Thu, Feb 24, 2011 at 8:15 AM, Barry Wardell <barry.wardell at aei.mpg.de> wrote:
>> I understand the need for cleaning up. My concern was mainly that the
>> cleanup is done too aggressively. For example, when I run 'sim help' I
>> don't expect the cleanup to happen, yet that's what currently happens.
>> Likewise for many other commands. Here's my suggestion about how it
>> should work:
>> 1. In general, SimFactory does not clean up restarts when run.
>> 2. For commands which operate on a specific simulation (e.g. submit,
>> run), clean up only that simulation, if necessary for that command to
>> 3. Provide a cleanup command which forces a cleanup of a specific simulation.
>> 4. Provide a cleanup-all command which forces a cleanup of all simulations.
>> Are there cases where this behavior would not be sufficient?
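Barry's four rules can be sketched as a small dispatch function. This is only an illustration: the command sets and the `simulations_to_clean` helper are hypothetical, and the sketch ignores the "if necessary" condition in rule 2, which would need a real check on the simulation's state.

```python
# Hypothetical grouping of commands according to the four rules above.
PER_SIMULATION_COMMANDS = {"submit", "run", "cleanup"}  # rules 2 and 3
ALL_SIMULATION_COMMANDS = {"cleanup-all"}               # rule 4


def simulations_to_clean(command, target, all_simulations):
    """Return the simulations that `command` is allowed to clean up.

    command: name of the command being run
    target: the simulation the command operates on, or None
    all_simulations: iterable of every known simulation name
    """
    if command in ALL_SIMULATION_COMMANDS:
        return list(all_simulations)
    if command in PER_SIMULATION_COMMANDS and target is not None:
        return [target]
    # Rule 1: everything else (help, list, ...) cleans up nothing.
    return []
```

For example, `simulations_to_clean("help", None, ["a", "b"])` returns an empty list, so 'sim help' never touches the filesystem.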
>> On Wed, Feb 23, 2011 at 5:35 PM, Erik Schnetter <schnetter at cct.lsu.edu> wrote:
>>> Cleaning up a simulation is an action that should, ideally, happen
>>> automatically after a restart has finished running. Ideally, this
>>> would happen right after the job finishes, but there are obvious
>>> problems with this -- the job may have run out of time, and we don't
>>> want to waste valuable parallel queue time on cleaning up a
>>> We are thinking of various ways in which the cleanup could happen
>>> automatically. One way could be for the finishing jobs to call a shell
>>> script on the head node, another way could be a cron job. At the
>>> moment, SimFactory checks simulations and cleans them up if necessary
>>> -- as you noticed, this doesn't work well on slow file systems, so
>>> we've made SimFactory less "aggressive" here. Maybe this check should
>>> run in the background instead.
>>> We are open to new ideas!
>>> On Wed, Feb 23, 2011 at 11:10 AM, Michael Thomas <mthomas at cct.lsu.edu> wrote:
>>>> Yes, this is the intended behavior. Simfactory still needs to be able to make intelligent decisions about which simulations are active, and which aren't. It is necessary to at least attempt to perform cleanups each time simfactory is executed. Simfactory now rate limits cleanups, e.g. it will not attempt to clean up a simulation more than once every 30 seconds. It will also only attempt to clean up a simulation that has a restart marked as active, and will clean up only the restart flagged as active. These changes should make the cleanup procedure less intrusive, and hopefully have it do the right thing in cleaning up a simulation.
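The rate limiting described above could look roughly like the following. This is a sketch under assumptions, not the code in lib/sim.py: the `CleanupRateLimiter` class is hypothetical, and it keeps its timestamps in memory, whereas a command-line tool that exits after each command would have to persist them somewhere (for instance in a file inside the simulation directory).

```python
import time


class CleanupRateLimiter:
    """Allow at most one cleanup attempt per simulation per interval."""

    def __init__(self, min_interval=30.0, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between attempts
        self.clock = clock                # injectable for testing
        self._last_attempt = {}           # simulation name -> timestamp

    def allow(self, sim_name):
        """Return True and record the attempt if enough time has passed."""
        now = self.clock()
        last = self._last_attempt.get(sim_name)
        if last is not None and now - last < self.min_interval:
            return False
        self._last_attempt[sim_name] = now
        return True
```

Each simulation is tracked separately, so a recent cleanup of one simulation does not block the cleanup of another.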
>>>> Furthermore, I've added another level of detection into the job status detection code. When simfactory attempts to assess the job status of a given restart, it relies on several regular expressions in the mdb entry for the machine -- submitpattern, holdingpattern, runningpattern, and queuedpattern. If all of these regular expressions fail to produce a match, simfactory assigns the job status of 'U', which means basically 'complete' and available to be cleaned up. Now, if all of these regular expressions fail to match but the raw output contains the job id, I assign a status of 'E', report it as a warning, but don't clean the simulation up. This allows for less of a chance of accidental cleanup of an active simulation, which is the issue all of this is attempting to avoid.
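The detection logic described above can be sketched as follows. Only the 'U' and 'E' statuses come from the message; the `classify_job` and `may_clean_up` helpers, the other status letters, and the example patterns are assumptions made for illustration and are not the actual mdb entries or the code in lib/sim.py.

```python
import re


def classify_job(raw_output, job_id, patterns):
    """Classify a job from the queueing system's raw status output.

    patterns: dict mapping a status letter to a regular expression,
    standing in for the per-machine patterns in the mdb entry.
    """
    for status, pattern in patterns.items():
        if re.search(pattern, raw_output):
            return status
    if job_id in raw_output:
        # The queue still knows the job but no pattern matched:
        # flag it as 'E' and do not clean up.
        return "E"
    # No trace of the job at all: treat it as complete.
    return "U"


def may_clean_up(status):
    """Only jobs the queue no longer mentions are safe to clean up."""
    return status == "U"


def example_patterns(job_id):
    """Hypothetical patterns for a queue that prints '<job id> <state>'."""
    jid = re.escape(job_id)
    return {
        "R": jid + r"\s+R\b",  # running
        "Q": jid + r"\s+Q\b",  # queued
        "H": jid + r"\s+H\b",  # held
    }
```

With these example patterns, an output line such as "1234 C" matches nothing but still contains the job id, so the job is reported as 'E' rather than cleaned up.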
>>>>> Hi Michael,
>>>>> I have not yet had the opportunity to test the new changes. However,
>>>>> looking at lib/sim.py I can see that my main question has not yet been
>>>>> addressed: CleanupRestarts() is called every time simfactory is run. Is
>>>>> this the intended behavior?
>>>>> On 18/02/2011 18:34, Michael Thomas wrote:
>>>>>> The internal structure of simfactory's run/submit/cleanup system has been greatly refactored as of revision r1215. Please retest the issues that have been occurring with cleanups, etc., and let me know if they are still present/go away/now cause your computer monitor to explode. If you submit a job and it does not run and errors with an AssertionError on simrestart.IsActive(), please report this and the machine that it happened on. The run() command now asserts that the simulation is active (i.e. that output-xxxx-active exists) before attempting to run.
>>>>>> Thanks everyone for your patience and help,
>>>>>> SimFactory mailing list
>>>>>> SimFactory at cct.lsu.edu
>>> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
> Erik Schnetter <schnetter at cct.lsu.edu> http://www.cct.lsu.edu/~eschnett/
ian.hinder at aei.mpg.de