Add time-outs to GPU CI jobs #45

Open
maleadt opened this issue Sep 11, 2024 · 8 comments

Comments

@maleadt

maleadt commented Sep 11, 2024

Several recent CUDA and oneAPI jobs are simply hanging, e.g., https://buildkite.com/julialang/distancetransforms-dot-jl/builds/22

Please add reasonable time-outs, e.g., timeout_in_minutes: 15.

And more generally, please be more mindful of shared compute resources. You pushed 14 commits yesterday, running over 4 hours of GPU CI time for each of those, all for nothing because the jobs fail anyway. This is not a reasonable use of community CI resources. When not changing source files, e.g. your changes to .gitignore, at the very least include a [ci skip] in your commit message.
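For readers following along, here is a minimal sketch of where such a time-out could go in a Buildkite pipeline file. Only timeout_in_minutes: 15 comes from the comment above; the file path, step label, test command, and agent tags are assumptions for illustration and will differ in the actual repository.

```yaml
# .buildkite/pipeline.yml (hypothetical path and step layout)
steps:
  - label: "CUDA tests"              # assumed label
    command: "julia --project -e 'using Pkg; Pkg.test()'"   # assumed test command
    agents:                          # assumed agent tags; check the org's queue names
      queue: "juliagpu"
      cuda: "*"
    # Cancel the job automatically if it runs longer than 15 minutes.
    timeout_in_minutes: 15
```

The time-out is set per step, so each GPU step (CUDA, oneAPI, AMDGPU, ...) needs its own timeout_in_minutes entry. For commits that do not touch source files, a message such as "Update .gitignore [ci skip]" should keep a build from being triggered at all.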

@Dale-Black
Collaborator

Sorry about this! I am trying to set up a benchmark CI like the one in Lux, so there are going to be quite a few bumps and bruises, and I did not realize how much this was taxing the GPU resources. Would it make sense to set a timeout of 1-2 minutes until this is properly set up, and then, once I know things are active, set a more realistic timeout of, say, 15 minutes, like the one you suggested?

@maleadt
Author

maleadt commented Sep 11, 2024

Yep, that's fine! You could also consider using something like https://github.com/staticfloat/forerunner-buildkite-plugin to only trigger CI when relevant sources have been modified.

@Dale-Black
Collaborator

I will look into that as well, thank you!

Btw, is there a way to cancel a long-running pipeline? I just modified the testing portion of my Buildkite pipeline to have timeout_in_minutes: 0, but it seems like the CUDA test is failing and still running (~25 mins now). I would like to cancel this and just comment out my full pipeline of tests for now while I focus on getting the benchmarks to work properly.

@maleadt
Author

maleadt commented Sep 11, 2024

Only if you're part of the Buildkite org, so not easily. I'd recommend adding a reasonable timeout as quickly as possible to avoid this happening in the future.

@Dale-Black
Collaborator

I have set the timeouts to 5 min, but it does not seem to matter, as AMDGPU is hanging right now and has been running for ~15 mins. Is there something I am still doing wrong? I do not want to waste resources, as it is very generous to offer this to the community.

@maleadt
Author

maleadt commented Sep 12, 2024

Can you link to a build? The only ones I see are waiting for an AMDGPU agent, not actually running.

@Dale-Black
Collaborator

Oh yes, those are the ones I was talking about. So it doesn't waste resources while waiting on an agent?

@maleadt
Author

maleadt commented Sep 13, 2024

> So it doesn't waste resources while waiting on an agent?

That's right, so no need to worry about that.
