Jobs started because opam-repository has been updated should have a low priority #303

Open
kit-ty-kate opened this issue Jan 14, 2021 · 3 comments
Labels
type/enhancement New feature or request

Comments

@kit-ty-kate
Contributor

Feel free to close this issue if I'm wrong, but it seems that the jobs ocaml-ci starts when a dependency has been updated in opam-repository are started with a high priority.

However, I would argue that this should not be the case. For instance, a new release of dune would start (and did start, today) more than 8k jobs at the same time, which blocks regular commits and PRs from being tested for several hours.

@talex5
Contributor

talex5 commented Jan 14, 2021

The jobs will be low priority.

I suspect the jobs are delayed because so many of the builders are broken (7 at the moment, according to the monitoring!). It's probably best to investigate why that is (I can't ssh to them).
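To make the priority point concrete, here is a minimal Python sketch of a two-level job queue. It is an illustration only, not how ocluster's scheduler is implemented: the `JobQueue` class, the `HIGH`/`LOW` levels, and the job names are all hypothetical. It shows that when bulk rebuild jobs are queued as low priority, a PR job submitted later is still handed out first.

```python
import heapq
from itertools import count

# Hypothetical priority levels: the smaller number is served first.
HIGH, LOW = 0, 1


class JobQueue:
    """Toy priority queue that is FIFO within each priority level."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker that preserves submission order

    def submit(self, name, priority):
        heapq.heappush(self._heap, (priority, next(self._seq), name))

    def next_job(self):
        # Pop the entry with the lowest (priority, sequence) pair.
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = JobQueue()

# A dune release triggers thousands of low-priority rebuilds...
for i in range(8000):
    queue.submit(f"rebuild-{i}", LOW)

# ...and a PR check arrives afterwards with high priority.
queue.submit("pr-check", HIGH)

first = queue.next_job()
second = queue.next_job()
print(first, second)
```

The sketch only orders jobs that are still waiting in the queue. It does not model preemption or the number of working builders, so a high-priority job can still be delayed if every available builder is busy or broken.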

@kit-ty-kate
Contributor Author

The thing that makes me think it might be a priority issue is that opam-repo-ci worked fine at the same time and did not seem to be overloaded. In general, I would expect ocaml-ci jobs to be less demanding than opam-repo-ci jobs, so I was rather surprised to see high-priority ocaml-ci jobs stuck for hours.

@talex5
Contributor

talex5 commented Jan 16, 2021

The jobs might have been assigned to a stuck machine. If a machine is completely down, the lost TCP keep-alive messages will cause the connection to be dropped. But if the kernel is still responding while the agent process is stuck then, at the moment, the scheduler will just keep waiting.
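One common way to catch the "kernel alive, agent stuck" case is an application-level heartbeat sent by the agent process itself, so that a stuck agent stops reporting even though its TCP connection stays up. The Python sketch below illustrates that general technique under stated assumptions. It is not ocluster's code and not necessarily what any linked issue proposes: the `WorkerMonitor` class, its method names, the worker names, and the 30-second timeout are all hypothetical.

```python
import time


class WorkerMonitor:
    """Toy scheduler-side tracker for per-worker heartbeats (hypothetical)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock      # injectable so the example can fake time
        self._last_seen = {}     # worker name -> time of last heartbeat
        self._jobs = {}          # worker name -> jobs assigned to it

    def heartbeat(self, worker):
        # Called whenever the agent process (not the kernel) reports in.
        self._last_seen[worker] = self._clock()

    def assign(self, worker, job):
        self._jobs.setdefault(worker, []).append(job)

    def reap_stuck(self):
        """Drop workers whose heartbeat is overdue and return their jobs."""
        now = self._clock()
        orphaned = []
        for worker, seen in list(self._last_seen.items()):
            if now - seen > self.timeout_s:
                orphaned.extend(self._jobs.pop(worker, []))
                del self._last_seen[worker]
        return orphaned


# Usage with a fake clock so the example is deterministic.
fake_now = [0.0]
monitor = WorkerMonitor(timeout_s=30, clock=lambda: fake_now[0])

monitor.heartbeat("worker-a")
monitor.heartbeat("worker-b")
monitor.assign("worker-a", "job-1")
monitor.assign("worker-b", "job-2")

fake_now[0] = 20.0
monitor.heartbeat("worker-b")    # worker-b's agent is still responsive

fake_now[0] = 40.0               # worker-a has been silent for 40 s
requeue = monitor.reap_stuck()
print(requeue)
```

The jobs returned by `reap_stuck` would then be put back on the queue for another worker. The timeout is a trade-off: too short and a heavily loaded but healthy worker loses its jobs, too long and stuck jobs wait for no reason.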

The agent might be stuck because a btrfs deadlock has put it in an uninterruptible wait, or because there's so little RAM that it's swapping too much (e.g. with dune taking all the memory, so that even ssh doesn't respond).

If you have the admin cap for the cluster, you can see the queue with ocluster-admin show. Running ocluster-admin update on a worker will reassign all its queued jobs and ask it to restart, which can be useful as a workaround.

Implementing ocurrent/ocluster#110 would help.

@tmcgilchrist tmcgilchrist added the type/enhancement New feature or request label Feb 17, 2022