add a timeout on rapids-conda-retry, document testing in CI #140

jameslamb · 2025-02-04T17:54:52Z

Fixes #129

As described in #129, we have sometimes observed conda install, conda env update, or similar run indefinitely. This is problematic because it means that a doomed-to-fail job can end up occupying a GPU-enabled CI runner for up to 6 hours (the hard limit on runtime for a GitHub Actions job).

Proposed changes:

modifies rapids-{conda,mamba}-retry to run conda / mamba with the Unix timeout utility
sets default timeouts based on command (45 minutes for conda install, 6 hours for conda mambabuild)
makes timeout configurable via a new env variable RAPIDS_CONDA_RETRY_TIMEOUT
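
The three changes above could be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names `pick_timeout` and `run_with_timeout` are hypothetical, and treating every subcommand other than `mambabuild` as a 45-minute case is an assumption made for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a default timeout from the conda subcommand,
# unless RAPIDS_CONDA_RETRY_TIMEOUT is already set in the environment.
pick_timeout() {
    local subcommand="$1"
    local default_timeout
    case "${subcommand}" in
        mambabuild) default_timeout="6h" ;;   # package builds can be slow
        *)          default_timeout="45m" ;;  # assumption: everything else
    esac
    echo "${RAPIDS_CONDA_RETRY_TIMEOUT:-${default_timeout}}"
}

# Hypothetical wrapper: run any command under the chosen time limit.
# GNU timeout exits with status 124 when the limit is reached.
run_with_timeout() {
    local duration="$1"
    shift
    timeout "${duration}" "$@"
}
```

With these helpers, something like `run_with_timeout "$(pick_timeout install)" conda install ...` would apply the 45-minute default unless the environment variable overrides it.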

While doing this, I found myself testing in CI, so also:

adds docs on how to test gha-tools changes in CI

Notes for Reviewers

How I tested this (locally)

Tested locally, on a branch with the default timeout set to 1 second, like this:

docker run \
    --rm \
    -v $(pwd):/opt/work \
    -it rapidsai/ci-conda:latest \
    bash

export PATH="/opt/work/tools:${PATH}"

rapids-conda-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] conda returned exit code: 124
# [rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.
echo $?
# 124

RAPIDS_CONDA_RETRY_TIMEOUT='5m' \
rapids-conda-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
echo $?
# 0

rapids-mamba-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] conda returned exit code: 124
# [rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.
echo $?
# 124

RAPIDS_CONDA_RETRY_TIMEOUT='5m' \
rapids-mamba-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] timeout for conda operations: '5m'
echo $?
# 0

How I tested this (in CI)

Used a PR into ucxx: rapidsai/ucxx#365

run 1: timeouts set to small values, to show what failures look like

details (click me)

Saw rapids-mamba-retry env create in checks: job fail like this:

[rapids-conda-retry] timeout for conda operations: '1s'
...
Collecting package metadata (repodata.json): ...working... timeout: sending signal TERM to command 'mamba'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

And rapids-conda-retry mambabuild in cpp-build: job fail like this:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '2s'
INFO:conda_index.index.convert_cache:Migrate database
...
WARNING: No numpy version specified in conda_build_config.yaml.  Falling back to default numpy value of 1.22
Updating build index: /tmp/conda-bld-output
...
Copying /__w/ucxx/ucxx to /opt/conda/conda-bld/work/
Adding in variants from internal_defaults
Adding in variants from /__w/ucxx/ucxx/conda/recipes/ucxx/conda_build_config.yaml
...
Attempting to finalize metadata for libucxx
timeout: sending signal TERM to command 'conda'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='2s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

run 2: all timeouts set to their proposed defaults, to confirm it works normally

details (click me)

Saw rapids-mamba-retry env create in checks: job succeed:

[rapids-conda-retry] timeout for conda operations: '45m'
...
 done
...
#
# To activate this environment, use
#
#     $ conda activate checks

(build link)

And rapids-conda-retry mambabuild in cpp-build: job succeed:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '6h'
...
INFO :: The inputs making up the hashes for the built packages are as follows:
{
  "distributed-ucxx-0.43.00a-py3.10_250204_gf9da7f9_12.conda": {
...
  "distributed-ucxx-0.43.00a-py3.11_250204_gf9da7f9_12.conda": {
...
  "distributed-ucxx-0.43.00a-py3.12_250204_gf9da7f9_12.conda": {
...
  "libucxx-0.43.00a-cuda11_250204_gf9da7f9_12.conda": {
...
  "libucxx-examples-0.43.00a-250204_gf9da7f9_12.conda": {
...
  "libucxx-tests-0.43.00a-cuda11_250204_gf9da7f9_12.conda": {
...
  "ucxx-0.43.00a-cuda11_py3.10_250204_gf9da7f9_12.conda": {
...
  "ucxx-0.43.00a-cuda11_py3.11_250204_gf9da7f9_12.conda": {
...
  },
  "ucxx-0.43.00a-cuda11_py3.12_250204_gf9da7f9_12.conda": {
...
}

(build link)

And rapids-mamba-retry env create and rapids-mamba-retry install in conda test jobs succeed:

[rapids-conda-retry] timeout for conda operations: '45m'
...
done
...
#
# To activate this environment, use
#
#     $ conda activate test
...
[rapids-conda-retry] timeout for conda operations: '45m'
...
Looking for: ['libucxx=0.43.00', 'libucxx-examples=0.43.00', 'libucxx-tests=0.43.00']
Downloading and Extracting Packages: ...working... done

(build link)

run 3: timeouts overridden by setting an environment variable, to confirm that mechanism works

details (click me)

Added the following in CI scripts:

export RAPIDS_CONDA_RETRY_TIMEOUT=120

That timeout was enough for the inexpensive env creation in checks: to succeed:

[rapids-conda-retry] timeout for conda operations: '120'
...
 done
...
#
# To activate this environment, use
#
#     $ conda activate checks

(build link)

But as expected, caused the conda-cpp-build jobs to fail:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '120'
INFO:conda_index.index.convert_cache:Migrate database
...
timeout: sending signal TERM to command 'conda'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='120'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

And repeated that with RAPIDS_MAMBA_RETRY_TIMEOUT instead of RAPIDS_CONDA_RETRY_TIMEOUT, saw the expected things: https://github.com/rapidsai/ucxx/actions/runs/13143808352?pr=365

jameslamb · 2025-02-04T21:51:12Z

tools/rapids-conda-retry

 #
+
+# This must be set in order for the script to recognize failing exit codes when
+# output is piped to tee


I think it was a mistake that this got mixed in with the multi-line comment about arguments... seems like it was supposed to be specific to the set -o pipefail line.
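
For readers unfamiliar with why that comment belongs next to set -o pipefail, here is a small stand-alone sketch. The function names are hypothetical, and `false` stands in for a failing conda command.

```shell
#!/usr/bin/env bash
# Without pipefail, a pipeline's exit status is that of its last command
# (tee), so a failure on the left-hand side is reported as success.
status_without_pipefail() {
    (
        set +o pipefail
        rc=0
        false 2>&1 | tee /dev/null || rc=$?
        echo "${rc}"
    )
}

# With pipefail, the pipeline reports the failing command's status instead.
status_with_pipefail() {
    (
        set -o pipefail
        rc=0
        false 2>&1 | tee /dev/null || rc=$?
        echo "${rc}"
    )
}
```

`status_without_pipefail` prints 0 and `status_with_pipefail` prints 1, which is why a script that inspects exit codes after piping to tee needs the option set.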

jameslamb · 2025-02-04T21:52:05Z

tools/rapids-conda-retry

-            if (( needToClean == 1 )); then
-                rapids-echo-stderr "Cleaning tarball cache before retrying..."
-                ${condaCmd} clean --tarballs -y
+        if (( needToRetry == 1 )); then


Refactoring this because in the previous example, rapids-echo-stderr "${retryingMsg}" always ran, resulting in 1 empty log line in the case where needToRetry is not 1.

jameslamb · 2025-02-04T22:03:11Z

tools/rapids-conda-retry

+            exitMsg="Exiting, command exited with status 124 which often indicates a timeout (configured timeout='${timeout_duration}')."
+            exitMsg+=" To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT."
+            rapids-echo-stderr "${exitMsg}"
+            needToRetry=0


Calling your attention to an important design decision.

proposal: timeouts should not be retried

Making a timeout non-retryable makes the timeout setting easier to understand... it means there's no meaningful difference between these:

"fail if conda install runs for more than 45 minutes"

"fail if rapids-conda-retry install runs for more than 45 minutes".

And note that by default, all the timeouts around things like SSL verification, network connections, etc. will still end up getting retried, because those raise errors like CondaHTTPError or Timeout was reached, with default configurations on the order of seconds and not 10s of minutes.

This blunter mechanism covers other issues that might generally not be retryable, like solver deadlock or waiting indefinitely for the filesystem to respond.

If we were to apply a timeout to the entire rapids-{conda,mamba}-retry operation instead, it'd be more complicated to get this right, because you'd have to set a value that covers the total runtime of all retries.
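
To make the proposal concrete, here is a simplified sketch of a retry loop that treats exit code 124 as final. It is not the script in this PR: `retry_unless_timeout` is a hypothetical name, and retrying on every other non-zero exit code is a simplification (the real tool decides whether to retry based on the error it sees).

```shell
#!/usr/bin/env bash
# Hypothetical retry loop: retry ordinary failures, stop at once on a timeout.
# Arguments: <max attempts> <timeout duration> <command...>
retry_unless_timeout() {
    local max_attempts="$1"
    local duration="$2"
    shift 2
    local attempt=1
    local rc=0
    while [ "${attempt}" -le "${max_attempts}" ]; do
        rc=0
        # The time limit applies to each attempt, not to the whole loop.
        timeout "${duration}" "$@" || rc=$?
        if [ "${rc}" -eq 0 ]; then
            return 0
        fi
        if [ "${rc}" -eq 124 ]; then
            echo "attempt ${attempt}: timed out after ${duration}, not retrying" >&2
            return 124
        fi
        echo "attempt ${attempt}: exit code ${rc}, retrying" >&2
        attempt=$((attempt + 1))
    done
    return "${rc}"
}
```

Because a timeout ends the loop on the first attempt, the configured duration bounds the whole invocation whenever a timeout occurs, which is what makes the two statements above equivalent.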

Makes sense to me, thanks for detailing!

pentschev

Thanks @jameslamb , this is great! I'm sure we'll save at least a dozen hours of CI resources a week (probably more) with this!

CONTRIBUTING.md

pentschev · 2025-02-04T22:28:24Z

tools/rapids-conda-retry

@@ -67,7 +77,7 @@ condaCmd=${RAPIDS_CONDA_EXE:=conda}
 #    needToRetry: 1 if the command should be retried, 0 if it should not be
 function runConda {
    # shellcheck disable=SC2086
-    ${condaCmd} ${args} 2>&1| tee "${outfile}"
+    timeout --verbose "${timeout_duration}" ${condaCmd} ${args} 2>&1| tee "${outfile}"


I didn't know about --verbose, very interesting!

pentschev · 2025-02-04T22:29:51Z

tools/rapids-conda-retry

+            exitMsg="Exiting, command exited with status 124 which often indicates a timeout (configured timeout='${timeout_duration}')."
+            exitMsg+=" To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT."
+            rapids-echo-stderr "${exitMsg}"
+            needToRetry=0


Makes sense to me, thanks for detailing!

Co-authored-by: Peter Andreas Entschev <[email protected]>

jameslamb · 2025-02-11T14:44:14Z

Seeing no other comments here, I'm going to merge this. I'm confident in it, based on the details in the "How I tested this" section of the description.

bdice

Thanks @jameslamb, all looks good to me. Apologies, this PR got lost in my queue. Thanks for merging!

jameslamb added 2 commits February 4, 2025 11:47

add a timeout on rapids-conda-retry and rapids-mamba-retry

cde0e7e

clarify comment

c8c5126

jameslamb added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels Feb 4, 2025

jameslamb mentioned this pull request Feb 4, 2025

WIP: [DO NOT MERGE] test gha-tools conda timeout rapidsai/ucxx#365

Closed

jameslamb added 3 commits February 4, 2025 12:26

add docs on testing in CI

0c89444

use defensible default timeouts

e756da0

add mamba equivalent

615a2f5

jameslamb changed the title ~~WIP: add a timeout on rapids-conda-retry and rapids-mamba-retry~~ WIP: add a timeout on rapids-conda-retry, document testing in CI Feb 4, 2025

jameslamb commented Feb 4, 2025

jameslamb added 2 commits February 4, 2025 15:52

Update tools/rapids-conda-retry

461f360

Update tools/rapids-mamba-retry

42f805a

jameslamb commented Feb 4, 2025

jameslamb changed the title ~~WIP: add a timeout on rapids-conda-retry, document testing in CI~~ add a timeout on rapids-conda-retry, document testing in CI Feb 4, 2025

jameslamb requested review from gforsyth and pentschev February 4, 2025 22:04

jameslamb marked this pull request as ready for review February 4, 2025 22:04

jameslamb requested a review from a team as a code owner February 4, 2025 22:04

jameslamb mentioned this pull request Feb 4, 2025

Consider adding a timeout for custom install scripts #129

Closed

pentschev approved these changes Feb 4, 2025

Update CONTRIBUTING.md

38e0ad4

Co-authored-by: Peter Andreas Entschev <[email protected]>

msarahan approved these changes Feb 5, 2025

jameslamb merged commit 1edd35a into rapidsai:main Feb 11, 2025
1 check passed

jameslamb deleted the conda-install-timeout branch February 11, 2025 14:45

bdice reviewed Feb 11, 2025
