Jobs started because opam-repository has been updated should have a low priority #303

kit-ty-kate · 2021-01-14T16:13:06Z

Feel free to close this issue if I'm wrong, but it seems that the jobs started by ocaml-ci when a dependency has been updated in opam-repository are started with a high priority.

However, I would argue that this should not be the case. For instance, a new release of dune would start (and has started, today) more than 8k jobs at the same time, which blocks regular commits and PRs from being tested for several hours.

talex5 · 2021-01-14T17:16:33Z

The jobs will be low priority.

I suspect jobs are delayed because so many of the builders are broken (7 at the moment, according to the monitoring!). Probably best to investigate why that is (I can't ssh to them).

kit-ty-kate · 2021-01-14T17:38:04Z

The thing that makes me think it might be a priority issue is that opam-repo-ci worked fine at the same time and did not seem to be overloaded. In general I would expect ocaml-ci jobs to be less demanding than opam-repo-ci, so I was rather surprised to see high-priority ocaml-ci jobs stuck for hours.

talex5 · 2021-01-16T11:15:05Z

The jobs might have been assigned to a stuck machine. If a machine is completely down, the lost TCP keep-alive messages will cause the connection to be dropped. But if the kernel is still responding while the agent process is stuck, then, at the moment, the scheduler will just keep waiting.
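
To illustrate the gap described above, here is a minimal sketch of an application-level liveness check, where the scheduler reclaims queued jobs from a worker whose agent has stopped sending heartbeats even though its TCP connection is still up. This is not how OCluster is implemented; the `Scheduler` and `Worker` classes, the `reap_stuck` method, and the 30-second timeout are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """Hypothetical record the scheduler keeps for each connected worker."""
    name: str
    last_heartbeat: float
    queued_jobs: list = field(default_factory=list)


class Scheduler:
    """Toy scheduler that tracks agent heartbeats (illustrative only)."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.workers = {}

    def register(self, name: str, now: float) -> None:
        self.workers[name] = Worker(name, now)

    def heartbeat(self, name: str, now: float) -> None:
        # Sent by the agent process itself, so a stuck agent stops
        # refreshing this timestamp even if the kernel still answers
        # TCP keep-alives.
        self.workers[name].last_heartbeat = now

    def assign(self, name: str, job: str) -> None:
        self.workers[name].queued_jobs.append(job)

    def reap_stuck(self, now: float) -> list:
        """Take back jobs from workers whose agent has gone silent."""
        reclaimed = []
        for worker in self.workers.values():
            silent_for = now - worker.last_heartbeat
            if silent_for > self.timeout_s and worker.queued_jobs:
                reclaimed.extend(worker.queued_jobs)
                worker.queued_jobs = []
        return reclaimed


if __name__ == "__main__":
    sched = Scheduler(timeout_s=30)
    sched.register("worker-a", now=0)
    sched.register("worker-b", now=0)
    sched.assign("worker-a", "job1")
    sched.assign("worker-b", "job2")
    sched.heartbeat("worker-b", now=40)  # worker-a's agent stays silent
    print(sched.reap_stuck(now=45))
```

The clock is passed in explicitly so the behaviour is easy to check; a real scheduler would read a monotonic clock and put the reclaimed jobs back on the queue for another worker.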

The agent might be stuck because a btrfs deadlock has put it in an uninterruptible wait, or because there's so little RAM that it's swapping too much (e.g. with dune taking all the memory - even ssh doesn't respond).

If you have the admin cap for the cluster, you can see the queue with ocluster-admin show. Running ocluster-admin update on a worker will reassign all its queued jobs and ask it to restart, which can be useful as a workaround.

Implementing ocurrent/ocluster#110 would help.

tmcgilchrist added the type/enhancement New feature or request label Feb 17, 2022
