Merge pull request #6640 from grondo/issue#6606
doc: expand job termination documentation in `flux-config-exec(5)`
mergify[bot] authored Feb 14, 2025
2 parents f6bef37 + bc37f29 commit 934fcd8
Showing 2 changed files with 37 additions and 7 deletions.
41 changes: 34 additions & 7 deletions doc/man5/flux-config-exec.rst
@@ -36,19 +36,26 @@ job-shell

sdexec-properties
(optional) A table of systemd properties to set for all jobs. All values
must be strings. See SDEXEC PROPERTIES below.
must be strings. See :ref:`sdexec_properties` below.

kill-timeout
(optional) The amount of time to wait after ``SIGTERM`` is sent to a job
before sending ``SIGKILL``.
(optional) The amount of time in FSD to wait after ``SIGTERM`` is
sent to a job before sending ``SIGKILL``. The default is "5s". See
:ref:`job_termination` below for details.

max-kill-count
(optional) The maximum number of times a job will be sent ``kill-signal``
before the execution system considers the job unkillable and drains
the node. The default is 8. See :ref:`job_termination` below for details.

term-signal
(optional) Specify an alternate signal to ``SIGTERM`` when terminating
job tasks. Mainly used for testing.
(optional) A string specifying an alternate signal to ``SIGTERM`` when
terminating job tasks. Mainly used for testing.

kill-signal
(optional) Specify an alternate signal to ``SIGKILL`` when killing tasks
and the job shell. Mainly used for testing.
(optional) A string specifying an alternate signal to ``SIGKILL`` when
killing tasks and the job shell. Mainly used for testing.

barrier-timeout
(optional) Specify the default job shell start barrier timeout in FSD.
@@ -65,6 +72,8 @@ testexec
**job-exec** test execution implementation (used mainly for testing).


.. _sdexec_properties:

SDEXEC PROPERTIES
=================

@@ -96,6 +105,24 @@ allow-guests
Boolean value enables access to the testexec implementation from guest
users. By default, guests cannot use this implementation.

.. _job_termination:

JOB TERMINATION
===============

When a job is canceled or gets a fatal exception, it is terminated using
the following sequence:

- The job shells are notified to send ``term-signal`` to job tasks, unless
  the job is being terminated due to a time limit, in which case ``SIGALRM``
  is sent instead.
- After ``kill-timeout``, any remaining shells are sent ``kill-signal``.
- This continues with an exponential backoff, with the timeout doubling
  after each attempt (capped at 300s).
- After a total of ``max-kill-count`` attempts, any nodes still running
  processes are drained with the message: "unkillable user processes for job
  JOBID".
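
The timing implied by this sequence can be sketched in a few lines of Python. This is only an illustration, not the **job-exec** implementation: the function name ``kill_schedule`` is hypothetical, and it assumes that the first ``kill-signal`` is sent ``kill-timeout`` seconds after ``term-signal`` and that the wait before each later attempt is double the previous wait, capped at 300s. The documentation above does not say how long the system waits after the final attempt before draining, so that step is not modeled.

```python
def kill_schedule(kill_timeout=5.0, max_kill_count=8, cap=300.0):
    """Return offsets (seconds after term-signal) of each kill-signal attempt.

    Hypothetical helper for illustration only. Assumes the first attempt
    occurs after kill_timeout and each subsequent wait doubles, up to cap.
    """
    offsets = []
    elapsed = 0.0
    wait = kill_timeout
    for _ in range(max_kill_count):
        elapsed += wait              # wait out the current timeout
        offsets.append(elapsed)      # kill-signal would be sent here
        wait = min(wait * 2, cap)    # exponential backoff with a ceiling
    return offsets


if __name__ == "__main__":
    # Documented defaults: kill-timeout = "5s", max-kill-count = 8
    for attempt, offset in enumerate(kill_schedule(), start=1):
        print(f"attempt {attempt}: kill-signal at t+{offset:g}s")
```

Under these assumptions and the documented defaults, the eight attempts fall at 5, 15, 35, 75, 155, 315, 615, and 915 seconds after ``term-signal``, so a node would be drained roughly 15 minutes after termination begins.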

EXAMPLES
========

3 changes: 3 additions & 0 deletions doc/test/spell.en.pws
@@ -943,3 +943,6 @@ myprogram
unref
sigprocmask
iso
SIGALRM
backoff
unkillable
