Add more checkpoint docs
Chrismarsh committed Jun 24, 2024
1 parent 1cacf72 commit b7e971b
Showing 1 changed file with 37 additions and 0 deletions.
37 changes: 37 additions & 0 deletions docs/configuration.rst
@@ -807,6 +807,13 @@ To enable checkpoints, ``save_checkpoint`` must be enabled and one of the ``on_*
checkpoints every ``on_frequency`` timesteps as well as on the last timestep.


.. warning::

   The configuration file can be modified between checkpoint runs. However, the changes MUST be consistent with
   what is saved in the checkpoint file. For example, adding a new module that has no checkpointed data is
   not allowed.


.. confval:: save_checkpoint

:type: boolean
@@ -841,9 +848,15 @@ checkpoints every ``on_frequency`` timesteps as well as on the last timestep.

.. code:: shell

   # All PBS / SLURM options here
   # [...]

   # Set the env var; keep only the line that matches your scheduler
   # SLURM:
   export CHM_WALLCLOCK_LIMIT=$(squeue -j $SLURM_JOB_ID -h --Format TimeLimit)
   # PBS:
   export CHM_WALLCLOCK_LIMIT=$(qstat -f $PBS_JOBID | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p')

   # Then run CHM
   mpirun [...] CHM -f [...]

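
If the same job script is submitted to both SLURM and PBS systems, the scheduler can be detected at run time. The following is a minimal sketch, not part of CHM: the function name ``wallclock_limit`` is hypothetical, and it assumes that ``SLURM_JOB_ID`` or ``PBS_JOBID`` is set by the scheduler as in the commands above.

```shell
#!/bin/bash
# Sketch only: print the wallclock limit of the current job.
# wallclock_limit is a hypothetical helper name, not something CHM provides.
wallclock_limit() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Running under SLURM
        squeue -j "$SLURM_JOB_ID" -h --Format TimeLimit
    elif [ -n "${PBS_JOBID:-}" ]; then
        # Running under PBS
        qstat -f "$PBS_JOBID" | sed -rn 's/.*Resource_List.walltime = (.*)/\1/p'
    else
        # No known scheduler detected
        return 1
    fi
}

# Example use in a job script:
#   CHM_WALLCLOCK_LIMIT=$(wallclock_limit) && export CHM_WALLCLOCK_LIMIT
```

The function returns a non-zero status when neither variable is set, so the job script can decide whether to run without a wallclock limit or to stop.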
.. confval:: minutes_of_wallclock

@@ -869,6 +882,17 @@ checkpoints every ``on_frequency`` timesteps as well as on the last timestep.
Set to ``true`` to auto-resume from the most recent checkpoint that exists in ``output_folder/checkpoint``.
Doing so allows for easily and repeatedly resuming from a checkpoint file.


.. note::

   Upon successful completion of a CHM simulation, a sentinel file is written to
   ``<output_folder>/clean_exit``. It is not written if CHM suspends due to a wallclock limit. The file is
   therefore intended to allow repeated, automatic requeuing of a job on HPC systems that have short
   wallclock limits.
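
The sentinel file described above could be checked at the end of a job script to decide whether to resubmit. This is a minimal sketch under stated assumptions: ``needs_requeue``, ``chm_job.sh``, and the ``output`` folder name are hypothetical, and the folder passed in must be the same output folder that CHM is configured to use.

```shell
#!/bin/bash
# Sketch only: decide whether a CHM job should be resubmitted.
# needs_requeue is a hypothetical helper name, not something CHM provides.

# Returns 0 (true) when the clean_exit sentinel is missing,
# i.e. the simulation has not yet completed successfully.
needs_requeue() {
    local output_folder="$1"
    [ ! -f "${output_folder}/clean_exit" ]
}

# Example use after the CHM run (chm_job.sh and "output" are hypothetical names):
#   mpirun [...] CHM -f [...]
#   if needs_requeue "output"; then sbatch chm_job.sh; fi
```

Because the check only looks for the sentinel, a run that fails for reasons other than the wallclock limit would also be resubmitted; a real job script may want a retry cap.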


Basic checkpoint example

.. code:: json
"checkpoint":
@@ -880,6 +904,19 @@ checkpoints every ``on_frequency`` timesteps as well as on the last timestep.
}

An example of automatic save and resume. This checkpoints when 5 minutes of wallclock time remain. Then,
as long as the same output directory is used, a subsequent run detects the most recent checkpoint and resumes from
it.

.. code:: json

   "checkpoint":
   {
     "save_checkpoint": true,
     "on_wallclock_limit": true,
     "minutes_of_wallclock": 5,
     "auto_resumed": true
   }
