-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add a timeout on rapids-conda-retry, document testing in CI #140
Conversation
# | ||
|
||
# This must be set in order for the script to recognize failing exit codes when | ||
# output is piped to tee |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it was a mistake that this got mixed in with the multi-line comment about arguments... seems like it was supposed to specific to the set -o pipefail
line.
if (( needToClean == 1 )); then | ||
rapids-echo-stderr "Cleaning tarball cache before retrying..." | ||
${condaCmd} clean --tarballs -y | ||
if (( needToRetry == 1 )); then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring this because in the previous example, rapids-echo-stderr "${retryingMsg}"
always ran, resulting in 1 empty log line in the case where needToRetry
is not 1
.
exitMsg="Exiting, command exited with status 124 which often indicates a timeout (configured timeout='${timeout_duration}')." | ||
exitMsg+=" To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT." | ||
rapids-echo-stderr "${exitMsg}" | ||
needToRetry=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling your attention to an important design decision.
proposal: timeouts should not be retried
Making a timeout non-retryable makes the timeout setting easier to understand... it means there's no meaningful difference between these:
- "fail if
conda install
runs for more than 45 minutes" - "fail if
rapids-conda-retry install
runs for more than 45 minutes".
And note that by default, all the timeouts around things like SSL verification, network connections, etc. will still end up getting retried, because those raise errors like CondaHTTPError
or Timeout was reached
, with default configurations on the order of seconds and not 10s of minutes.
This blunter mechanism covers other issues that might generally not be retryable, like solver deadlock or waiting indefinitely for the filesystem to respond.
If we were to apply a timeout to the entire rapids-{conda,mamba}-retry
operation instead, it'd be more complicated to get this right. Because you'd have to set a value that covers the total runtime of all retries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, thanks for detailing!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jameslamb , this is great! I'm sure we'll save at least a dozen hours of CI resources a week (probably more) with this!
@@ -67,7 +77,7 @@ condaCmd=${RAPIDS_CONDA_EXE:=conda} | |||
# needToRetry: 1 if the command should be retried, 0 if it should not be | |||
function runConda { | |||
# shellcheck disable=SC2086 | |||
${condaCmd} ${args} 2>&1| tee "${outfile}" | |||
timeout --verbose "${timeout_duration}" ${condaCmd} ${args} 2>&1| tee "${outfile}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know about --verbose
, very interesting!
exitMsg="Exiting, command exited with status 124 which often indicates a timeout (configured timeout='${timeout_duration}')." | ||
exitMsg+=" To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT." | ||
rapids-echo-stderr "${exitMsg}" | ||
needToRetry=0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me, thanks for detailing!
Co-authored-by: Peter Andreas Entschev <[email protected]>
Seeing no other comments here, I'm going to merge this. I'm confident in it, based on the details in the "How I tested this" section of the description. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jameslamb, all looks good to me. Apologies, this PR got lost on my queue. Thanks for merging!
Fixes #129
As described in #129, we have sometimes observed
conda install
,conda env update
, or similar run indefinitely. This is problematic because it means that a doomed-to-fail job can end up occupying a GPU-enabled CI runner for up to 6 hours (the hard limit on runtime for a GitHub Actions job).Proposes changes:
rapids-{conda,mamba}-retry
to runconda
/mamba
with the Unixtimeout
utilityconda install
, 6 hours forconda mambabuild
)RAPIDS_CONDA_RETRY_TIMEOUT
While doing this, I found myself testing in CI, so also:
gha-tools
changes in CINotes for Reviewers
How I tested this (locally)
Tested locally, on a branch with the default timeout set to 1 second, like this:
How I tested this (in CI)
Used a PR into
ucxx
: rapidsai/ucxx#365run 1: timeouts set to small values, to show what failures look like
details (click me)
Saw
rapids-mamba-retry env create
inchecks:
job fail like this:(build link)
And
rapids-conda-retry mambabuild
incpp-build:
job fail like this:(build link)
run 2: all timeouts set to their proposed defaults, to confirm it works normally
details (click me)
Saw
rapids-mamba-retry env create
inchecks:
job succeed:(build link)
And
rapids-conda-retry mambabuild
incpp-build:
job succeed:(build link)
And
rapids-mamba-retry env create
andrapids-mamba-retry install
in conda test jobs succeed:(build link)
run 3: timeouts overridden by setting an environment variable, to confirm that mechanism works
details (click me)
Added the following in CI scripts:
export RAPIDS_CONDA_RETRY_TIMEOUT=120
That timeout was enough for the inexpensive env creation in
checks:
to succeed:(build link)
But as expected, caused the
conda-cpp-build
jobs to fail:(build link)
And repeated that with
RAPIDS_MAMBA_RETRY_TIMEOUT
instead ofRAPIDS_CONDA_RETRY_TIMEOUT
, saw the expected things: https://github.com/rapidsai/ucxx/actions/runs/13143808352?pr=365