add a timeout on rapids-conda-retry, document testing in CI #140

Merged (8 commits) on Feb 11, 2025

@jameslamb jameslamb commented Feb 4, 2025

Fixes #129

As described in #129, we have sometimes observed conda install, conda env update, or similar commands running indefinitely. This is problematic because a doomed-to-fail job can end up occupying a GPU-enabled CI runner for up to 6 hours (the hard limit on runtime for a GitHub Actions job).

Proposed changes:

  • modifies rapids-{conda,mamba}-retry to run conda / mamba with the Unix timeout utility
  • sets default timeouts based on command (45 minutes for conda install, 6 hours for conda mambabuild)
  • makes timeout configurable via a new env variable RAPIDS_CONDA_RETRY_TIMEOUT
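
A minimal sketch of how the default-plus-override selection described in the list above could work. The function name pick_timeout and the way the subcommand is detected are assumptions for illustration; the actual logic in rapids-conda-retry may differ.

```shell
# Hypothetical helper, not part of gha-tools: choose a timeout for one
# conda / mamba invocation based on its arguments.
pick_timeout() {
    local default="45m"
    local arg
    for arg in "$@"; do
        case "${arg}" in
            # Assumption: long-running builds are detected by subcommand name.
            mambabuild)
                default="6h"
                break
                ;;
        esac
    done
    # An explicit override always wins over the per-command default.
    echo "${RAPIDS_CONDA_RETRY_TIMEOUT:-${default}}"
}
```

Calling pick_timeout with the same arguments that would be passed to conda prints the duration to hand to timeout, so the override and the defaults live in one place.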

While doing this, I found myself testing in CI, so also:

  • adds docs on how to test gha-tools changes in CI

Notes for Reviewers

How I tested this (locally)

Tested locally, on a branch with the default timeout set to 1 second, like this:

docker run \
    --rm \
    -v $(pwd):/opt/work \
    -it rapidsai/ci-conda:latest \
    bash

export PATH="/opt/work/tools:${PATH}"

rapids-conda-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] conda returned exit code: 124
# [rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.
echo $?
# 124

RAPIDS_CONDA_RETRY_TIMEOUT='5m' \
rapids-conda-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
echo $?
# 0

rapids-mamba-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] conda returned exit code: 124
# [rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.
echo $?
# 124

RAPIDS_CONDA_RETRY_TIMEOUT='5m' \
rapids-mamba-retry install --dry-run --channel conda-forge pyarrow pandas scikit-learn
# [rapids-conda-retry] timeout for conda operations: '5m'
echo $?
# 0

How I tested this (in CI)

Used a PR into ucxx: rapidsai/ucxx#365

run 1: timeouts set to small values, to show what failures look like

Saw rapids-mamba-retry env create in the checks job fail like this:

[rapids-conda-retry] timeout for conda operations: '1s'
...
Collecting package metadata (repodata.json): ...working... timeout: sending signal TERM to command 'mamba'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='1s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

And rapids-conda-retry mambabuild in the cpp-build job fail like this:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '2s'
INFO:conda_index.index.convert_cache:Migrate database
...
WARNING: No numpy version specified in conda_build_config.yaml.  Falling back to default numpy value of 1.22
Updating build index: /tmp/conda-bld-output
...
Copying /__w/ucxx/ucxx to /opt/conda/conda-bld/work/
Adding in variants from internal_defaults
Adding in variants from /__w/ucxx/ucxx/conda/recipes/ucxx/conda_build_config.yaml
...
Attempting to finalize metadata for libucxx
timeout: sending signal TERM to command 'conda'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='2s'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

run 2: all timeouts set to their proposed defaults, to confirm it works normally

Saw rapids-mamba-retry env create in the checks job succeed:

[rapids-conda-retry] timeout for conda operations: '45m'
...
 done
...
#
# To activate this environment, use
#
#     $ conda activate checks

(build link)

And rapids-conda-retry mambabuild in the cpp-build job succeed:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '6h'
...
INFO :: The inputs making up the hashes for the built packages are as follows:
{
  "distributed-ucxx-0.43.00a-py3.10_250204_gf9da7f9_12.conda": {
...
  "distributed-ucxx-0.43.00a-py3.11_250204_gf9da7f9_12.conda": {
...
  "distributed-ucxx-0.43.00a-py3.12_250204_gf9da7f9_12.conda": {
...
  "libucxx-0.43.00a-cuda11_250204_gf9da7f9_12.conda": {
...
  "libucxx-examples-0.43.00a-250204_gf9da7f9_12.conda": {
...
  "libucxx-tests-0.43.00a-cuda11_250204_gf9da7f9_12.conda": {
...
  "ucxx-0.43.00a-cuda11_py3.10_250204_gf9da7f9_12.conda": {
...
  "ucxx-0.43.00a-cuda11_py3.11_250204_gf9da7f9_12.conda": {
...
  },
  "ucxx-0.43.00a-cuda11_py3.12_250204_gf9da7f9_12.conda": {
...
}

(build link)

And rapids-mamba-retry env create and rapids-mamba-retry install in conda test jobs succeed:

[rapids-conda-retry] timeout for conda operations: '45m'
...
done
...
#
# To activate this environment, use
#
#     $ conda activate test
...
[rapids-conda-retry] timeout for conda operations: '45m'
...
Looking for: ['libucxx=0.43.00', 'libucxx-examples=0.43.00', 'libucxx-tests=0.43.00']
Downloading and Extracting Packages: ...working... done

(build link)

run 3: timeouts overridden by setting an environment variable, to confirm that mechanism works

Added the following in CI scripts:

export RAPIDS_CONDA_RETRY_TIMEOUT=120
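
The value 120 has no unit suffix. GNU timeout treats a bare number as seconds and accepts the suffixes s, m, h, and d, so 120 here means 2 minutes. A small sketch of that conversion, for whole-number durations only (the helper name duration_to_seconds is hypothetical and not part of gha-tools; GNU timeout itself also accepts fractional values, which this sketch does not handle):

```shell
# Hypothetical helper: convert an integer duration with an optional
# s/m/h/d suffix into seconds, mirroring how GNU timeout reads it.
duration_to_seconds() {
    local value="$1"
    # Strip one trailing unit character, if present.
    local number="${value%[smhd]}"
    # Whatever was stripped is the suffix (empty for a bare number).
    local suffix="${value#"${number}"}"
    case "${suffix}" in
        ""|s) echo "${number}" ;;
        m) echo $((number * 60)) ;;
        h) echo $((number * 3600)) ;;
        d) echo $((number * 86400)) ;;
    esac
}

duration_to_seconds 120   # prints 120
duration_to_seconds 45m   # prints 2700
duration_to_seconds 6h    # prints 21600
```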

That timeout was enough for the inexpensive env creation in the checks job to succeed:

[rapids-conda-retry] timeout for conda operations: '120'
...
 done
...
#
# To activate this environment, use
#
#     $ conda activate checks

(build link)

But as expected, caused the conda-cpp-build jobs to fail:

Statistics zeroed.
[rapids-conda-retry] timeout for conda operations: '120'
INFO:conda_index.index.convert_cache:Migrate database
...
timeout: sending signal TERM to command 'conda'
[rapids-conda-retry] conda returned exit code: 124
[rapids-conda-retry] Exiting, command exited with status 124 which often indicates a timeout (configured timeout='120'). To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT.

(build link)

Then repeated that with RAPIDS_MAMBA_RETRY_TIMEOUT instead of RAPIDS_CONDA_RETRY_TIMEOUT and saw the same expected behavior: https://github.com/rapidsai/ucxx/actions/runs/13143808352?pr=365

@jameslamb jameslamb added the breaking (Introduces a breaking change) and improvement (Improves an existing functionality) labels Feb 4, 2025
@jameslamb jameslamb changed the title from "WIP: add a timeout on rapids-conda-retry and rapids-mamba-retry" to "WIP: add a timeout on rapids-conda-retry, document testing in CI" Feb 4, 2025
#

# This must be set in order for the script to recognize failing exit codes when
# output is piped to tee
Member Author

I think it was a mistake that this got mixed in with the multi-line comment about arguments... seems like it was supposed to be specific to the set -o pipefail line.
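
For context on why that comment belongs next to set -o pipefail, here is a small bash sketch of the behavior it describes. It uses false as a stand-in for a failing conda command; nothing here is taken from the script itself.

```shell
# 'false' stands in for a failing conda command; tee always succeeds.

# Without pipefail, the pipeline reports the status of its last command (tee).
set +o pipefail
without=0
false | tee /dev/null || without=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
with=0
false | tee /dev/null || with=$?

echo "without pipefail: ${without}, with pipefail: ${with}"
# prints: without pipefail: 0, with pipefail: 1
```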

if (( needToClean == 1 )); then
rapids-echo-stderr "Cleaning tarball cache before retrying..."
${condaCmd} clean --tarballs -y
if (( needToRetry == 1 )); then
Member Author

@jameslamb jameslamb Feb 4, 2025

Refactoring this because in the previous implementation, rapids-echo-stderr "${retryingMsg}" always ran, resulting in 1 empty log line in the case where needToRetry is not 1.
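
A rough sketch of the pattern, assuming a stand-in logger named log in place of rapids-echo-stderr and a hypothetical wrapper function report_retry. It only illustrates guarding the log call; it is not the code from this diff.

```shell
# Hypothetical stand-in for rapids-echo-stderr.
log() { echo "[sketch] $*" >&2; }

report_retry() {
    local needToRetry="$1"
    local retryingMsg="$2"
    # Only log when a retry will actually happen, so an empty message
    # never produces a blank log line.
    if [ "${needToRetry}" -eq 1 ]; then
        log "${retryingMsg}"
    fi
}
```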

exitMsg="Exiting, command exited with status 124 which often indicates a timeout (configured timeout='${timeout_duration}')."
exitMsg+=" To increase this timeout, set env variable RAPIDS_CONDA_RETRY_TIMEOUT."
rapids-echo-stderr "${exitMsg}"
needToRetry=0
Member Author

Calling your attention to an important design decision.

proposal: timeouts should not be retried

Making a timeout non-retryable makes the timeout setting easier to understand... it means there's no meaningful difference between these:

  • "fail if conda install runs for more than 45 minutes"
  • "fail if rapids-conda-retry install runs for more than 45 minutes".

And note that by default, all the timeouts around things like SSL verification, network connections, etc. will still end up getting retried, because those raise errors like CondaHTTPError or Timeout was reached, with default configurations on the order of seconds, not tens of minutes.

This blunter mechanism covers other issues that might generally not be retryable, like solver deadlock or waiting indefinitely for the filesystem to respond.

If we were to apply a timeout to the entire rapids-{conda,mamba}-retry operation instead, it'd be more complicated to get this right, because you'd have to set a value that covers the total runtime of all retries.
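
To make the proposal concrete, here is a simplified sketch of a retry loop where the timeout applies to each attempt and exit code 124 ends the loop immediately. The function name run_with_retry and its arguments are hypothetical. As a simplifying assumption, it retries every non-timeout failure, whereas the real script decides whether to retry by inspecting the command's output.

```shell
# Hypothetical wrapper: run_with_retry LIMIT MAX_ATTEMPTS COMMAND [ARGS...]
run_with_retry() {
    local limit="$1" max_attempts="$2"
    shift 2
    local attempt=1 status=0
    while [ "${attempt}" -le "${max_attempts}" ]; do
        status=0
        # The limit covers a single attempt, not the whole loop.
        timeout "${limit}" "$@" || status=$?
        if [ "${status}" -eq 0 ]; then
            return 0
        fi
        # GNU timeout exits with 124 when the command timed out.
        if [ "${status}" -eq 124 ]; then
            echo "timed out after ${limit}; not retrying" >&2
            return 124
        fi
        echo "attempt ${attempt} failed with status ${status}" >&2
        attempt=$((attempt + 1))
    done
    return "${status}"
}
```

With this shape, the worst case for a hung command is one LIMIT, while ordinary failures can still use up to MAX_ATTEMPTS attempts.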

Member

Makes sense to me, thanks for detailing!

@jameslamb jameslamb changed the title from "WIP: add a timeout on rapids-conda-retry, document testing in CI" to "add a timeout on rapids-conda-retry, document testing in CI" Feb 4, 2025
@jameslamb jameslamb marked this pull request as ready for review February 4, 2025 22:04
@jameslamb jameslamb requested a review from a team as a code owner February 4, 2025 22:04
Member

@pentschev pentschev left a comment

Thanks @jameslamb, this is great! I'm sure we'll save at least a dozen hours of CI resources a week (probably more) with this!

@@ -67,7 +77,7 @@ condaCmd=${RAPIDS_CONDA_EXE:=conda}
 # needToRetry: 1 if the command should be retried, 0 if it should not be
 function runConda {
 # shellcheck disable=SC2086
-${condaCmd} ${args} 2>&1| tee "${outfile}"
+timeout --verbose "${timeout_duration}" ${condaCmd} ${args} 2>&1| tee "${outfile}"
Member

I didn't know about --verbose, very interesting!
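
For anyone else who hasn't used it: with --verbose, GNU timeout writes a diagnostic to stderr whenever it sends a signal, which is where the "timeout: sending signal TERM to command" lines in the CI logs come from. A small bash sketch, using sleep as a stand-in for a slow conda command and a temporary file in place of the script's outfile (this assumes a GNU coreutils build that supports --verbose):

```shell
set -o pipefail
outfile="$(mktemp)"
status=0
# LC_ALL=C pins the locale so the message text is predictable.
# 'sleep 5' stands in for a conda command that runs too long.
LC_ALL=C timeout --verbose 1s sleep 5 2>&1 | tee "${outfile}" || status=$?
echo "status: ${status}"
```

Because stderr is merged into the pipe, the diagnostic lands in the captured output alongside the command's own logs, and pipefail lets the 124 exit code survive the tee.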

Co-authored-by: Peter Andreas Entschev <[email protected]>
@jameslamb
Member Author

Seeing no other comments here, I'm going to merge this. I'm confident in it, based on the details in the "How I tested this" section of the description.

@jameslamb jameslamb merged commit 1edd35a into rapidsai:main Feb 11, 2025
1 check passed
@jameslamb jameslamb deleted the conda-install-timeout branch February 11, 2025 14:45
Contributor

@bdice bdice left a comment

Thanks @jameslamb, all looks good to me. Apologies, this PR got lost in my queue. Thanks for merging!

Development

Successfully merging this pull request may close these issues.

Consider adding a timeout for custom install scripts