Feel free to close this issue if I'm wrong, but it seems that the jobs started by ocaml-ci when a dependency has been updated in opam-repository are started with a high priority.
However, I would argue that this should not be the case: for instance, a new release of dune would start (and has started, today) more than 8k jobs at the same time, which blocks regular commits and PRs from being tested for several hours.
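To illustrate the scheduling behaviour being asked for, here is a minimal sketch of a two-level priority queue in which bulk rebuild jobs are submitted at low priority so that a later PR job still runs first. This is not the actual ocaml-ci or cluster scheduler code; the class `JobQueue`, the `urgent` flag, and the job names (`dune-rebuild-…`, `pr-1234`) are all hypothetical.

```python
import heapq
import itertools

# Assumption: two priority levels; the smaller number is served first.
HIGH, LOW = 0, 1


class JobQueue:
    """Hypothetical queue, for illustration only."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so jobs of equal priority stay in FIFO order.
        self._counter = itertools.count()

    def submit(self, job_id, urgent):
        priority = HIGH if urgent else LOW
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = JobQueue()

# A dependency release triggers thousands of rebuilds, submitted at low priority.
for i in range(8000):
    queue.submit(f"dune-rebuild-{i}", urgent=False)

# A regular PR arrives afterwards and is submitted at high priority.
queue.submit("pr-1234", urgent=True)

first = queue.next_job()   # the PR job, despite being submitted last
second = queue.next_job()  # then the oldest rebuild job
```

With this ordering the 8k rebuilds still run, but only on capacity that commits and PRs are not waiting for.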
I suspect the reason jobs are delayed is that so many of the builders are broken (7 at the moment, according to the monitoring!). Probably best to investigate why that is (I can't ssh to them).
The thing that makes me think it might be a priority issue is that opam-repo-ci worked fine at the same time and did not seem to be overloaded. In general I would expect ocaml-ci jobs to be less demanding than opam-repo-ci, so I was rather surprised to see high-priority ocaml-ci jobs stuck for hours.
The jobs might have been assigned to a stuck machine. If a machine is completely down, the lost TCP keep-alive messages will cause the connection to be dropped. But if the kernel is still responding while the agent process is stuck then, at the moment, the scheduler will just keep waiting.
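One way to catch the second case is an application-level heartbeat: the agent process itself reports in, so a live kernel with a stuck agent still times out. The sketch below only illustrates that idea and is not how the scheduler currently works; `WorkerMonitor`, its methods, the 60-second timeout, and the worker and job names are all hypothetical. Times are passed in explicitly to keep the example deterministic.

```python
class WorkerMonitor:
    """Hypothetical watchdog, for illustration only."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker name -> time of last heartbeat
        self.assigned = {}   # worker name -> list of job ids

    def heartbeat(self, worker, now):
        # Called when the agent process (not just the kernel) reports in.
        self.last_seen[worker] = now

    def assign(self, worker, job_id):
        self.assigned.setdefault(worker, []).append(job_id)

    def reap_stuck(self, now):
        """Drop workers whose agent has gone quiet; return their jobs."""
        orphaned = []
        for worker, seen in list(self.last_seen.items()):
            if now - seen > self.timeout_s:
                orphaned.extend(self.assigned.pop(worker, []))
                del self.last_seen[worker]
        return orphaned


monitor = WorkerMonitor(timeout_s=60)
monitor.heartbeat("worker-a", now=0)
monitor.heartbeat("worker-b", now=0)
monitor.assign("worker-a", "job-1")
monitor.assign("worker-b", "job-2")

# worker-a keeps reporting; worker-b's agent is stuck and stays silent.
monitor.heartbeat("worker-a", now=50)

# At t=90, worker-b has been silent for 90s, so its jobs can be requeued.
orphaned = monitor.reap_stuck(now=90)
```

The jobs returned by `reap_stuck` would then go back into the queue for another worker instead of waiting indefinitely.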
The agent might be stuck because a btrfs deadlock has put it in an uninterruptible wait, or because there's so little RAM that it's swapping too much (e.g. with dune taking all the memory, so that even ssh doesn't respond).
If you have the admin cap for the cluster, you can see the queue with ocluster-admin show. Running ocluster-admin update on a worker will reassign all its queued jobs and ask it to restart, which can be useful as a workaround.