Add time-outs to GPU CI jobs #45
Comments
Sorry about this! I am trying to get a benchmark CI set up like the one in Lux, so there are going to be quite a few bumps and bruises, and I did not realize how much this was taxing the GPU resources. Would it make sense to set a timeout of 1-2 minutes until this is properly set up and then, once I know things are active, set a more realistic timeout of, say, 15 minutes, like the one you suggested?
Yep, that's fine! You could also consider using something like https://github.com/staticfloat/forerunner-buildkite-plugin to only trigger CI when relevant sources have been modified.
I will look into that as well, thank you! Btw, is there a way to cancel a long-running pipeline? I just modified the testing portion of my buildkite pipeline to have a
Only if you're part of the Buildkite org, so not easily. I'd recommend adding a reasonable timeout as quickly as possible to avoid this happening in the future.
I have set the timeouts to 5 min, but it does not seem to matter, as AMDGPU is hanging right now and has been running for ~15 mins. Is there something I am still doing wrong? I do not want to waste resources, since it is very generous to offer this to the community.
Can you link to a build? The only ones I see are waiting for an AMDGPU agent, not actually running. |
Oh yes, those are the ones I was talking about. So it doesn't waste resources while waiting on an agent? |
That's right, so no need to worry about that. |
Several recent CUDA and oneAPI jobs are simply hanging, e.g., https://buildkite.com/julialang/distancetransforms-dot-jl/builds/22
Please add reasonable time-outs, e.g., timeout_in_minutes: 15.
And more generally, please be more mindful of shared compute resources. You pushed 14 commits yesterday, running over 4 hours of GPU CI time for each of those, all for nothing because the jobs fail anyway. This is not a reasonable use of community CI resources. When not changing source files, e.g. your changes to .gitignore, at the very least include [ci skip] in your commit message.
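As a rough sketch of where the requested time-out goes, a GPU test step in a Buildkite pipeline file might look like the following. Only timeout_in_minutes: 15 comes from this issue; the file path, step label, test command, and agent tags are assumptions for illustration and should be adapted to the actual pipeline.

```yaml
# Hypothetical .buildkite/pipeline.yml (path, label, command, and agent tags are assumed)
steps:
  - label: "CUDA tests"          # assumed step name
    agents:
      queue: "juliagpu"          # assumed agent queue tag
      cuda: "*"                  # assumed tag selecting a CUDA-capable agent
    command: julia --project -e 'using Pkg; Pkg.test()'   # assumed test command
    # Cancel the job if it runs longer than 15 minutes.
    # As noted in the comments, time spent waiting for an agent does not
    # occupy GPU resources; the limit matters once the job is running.
    timeout_in_minutes: 15
```

The key is set per step, so each GPU backend's step (CUDA, oneAPI, AMDGPU) would need its own timeout_in_minutes entry. For commits that do not touch source files, adding [ci skip] to the commit message, for example "Update .gitignore [ci skip]", avoids triggering the build at all.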