Prevent sending interrupt signals by lost killer tasks by activity #231
The activity timeout quite often stops builds on our servers, although, based on the log messages, it should not. A colleague of ours analyzed the plugin source code and found that the problem is caused by "lost" killer tasks: when multiple threads operate on the same `TimeoutStepExecution` object, killers may not be stopped when they should be. He wanted to fix the current implementation, but it is very confusing. For example, the method interactions ("→" = calls):

`resetTimer` → `setupTimer` → sets timer with: `cancel` → `resetTimer` → `setupTimer` → `cancel`
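To illustrate what a "lost" killer is and one way to rule it out, here is a minimal sketch. It is not the plugin's code: the class `ResettableTimeout` and all of its members are hypothetical. It assumes the killer is a task scheduled on a `ScheduledExecutorService`, and it keeps exactly one live killer by replacing it only while holding a lock and by letting a superseded killer detect that it is stale.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration only; not a class from the plugin. */
final class ResettableTimeout {
    private final ScheduledExecutorService scheduler;
    private final long timeoutMillis;
    private final Runnable onTimeout;

    private ScheduledFuture<?> killer; // guarded by "this"
    private long generation;           // guarded by "this"

    ResettableTimeout(ScheduledExecutorService scheduler, long timeoutMillis, Runnable onTimeout) {
        this.scheduler = scheduler;
        this.timeoutMillis = timeoutMillis;
        this.onTimeout = onTimeout;
    }

    /** Called on every activity: cancel the old killer and schedule a new one atomically. */
    synchronized void reset() {
        if (killer != null) {
            killer.cancel(false);
        }
        final long mine = ++generation;
        killer = scheduler.schedule(() -> fire(mine), timeoutMillis, TimeUnit.MILLISECONDS);
    }

    /** Called when the guarded block finishes: no killer may fire afterwards. */
    synchronized void stop() {
        generation++; // invalidates a killer that has already started running
        if (killer != null) {
            killer.cancel(false);
            killer = null;
        }
    }

    private void fire(long scheduledGeneration) {
        synchronized (this) {
            if (scheduledGeneration != generation) {
                return; // superseded by a later reset() or stop()
            }
            killer = null;
        }
        onTimeout.run();
    }
}
```

Without the lock, two threads calling `reset()` concurrently could each schedule a killer while only one reference is kept, so the other killer can never be cancelled. The generation check covers the remaining case where a killer has already started running when `cancel` is called.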
We are testing the new implementation on our server and we haven't hit any issues yet.
The changes also make the timer more precise: the current implementation could let the timeout be extended by 1/10 of its duration or more.
He was aware that we will have to use a forked version for some time, so instead of overwriting the class he introduced a new one, `TimeoutStepExecutionThreadSafe`. It is used instead of the original one when the `org.jenkinsci.plugins.workflow.steps.TimeoutStep.threadsafe` property is set to `true` (`false` by default). This should also make the first review cycle easier, since the diff is simple and the original class is readily available for comparison with the new one.

We executed the `TimeoutStepTest` tests with the new implementation and all of them passed.

Git message:
When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:

The `Tick` class is replaced by a listener that notifies the step about changes less frequently. The behavior can be controlled by setting the `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio` property. It defines the earliest point at which information about new activities is sent to the timeout (`time * ratio`). If there were no activities in that period, the next activity is announced as soon as it is reported.

There are additional changes introduced in this commit:

`org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision`. It is necessary to avoid aborting the logic because of a delay in the notification process.
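The throttling described for `activityNotifyWaitRatio` can be sketched roughly as follows. This is one reading of the description above, not the code from this commit: the class `ThrottledActivityNotifier`, its constructor parameters, and the injected clock are all hypothetical. It assumes an activity is forwarded to the timeout only if at least `time * ratio` has passed since the last forwarded one, so the first activity after a quiet period goes through immediately.

```java
import java.util.function.LongSupplier;

/** Hypothetical illustration only; not a class from the plugin. */
final class ThrottledActivityNotifier {
    private final long minIntervalMillis;
    private final LongSupplier clock;      // injected so the behavior can be tested
    private final Runnable notifyTimeout;  // e.g. resets the timeout's timer

    private boolean notifiedOnce;          // guarded by "this"
    private long lastNotifiedMillis;       // guarded by "this"

    ThrottledActivityNotifier(long timeoutMillis, double waitRatio,
                              LongSupplier clock, Runnable notifyTimeout) {
        this.minIntervalMillis = (long) (timeoutMillis * waitRatio);
        this.clock = clock;
        this.notifyTimeout = notifyTimeout;
    }

    /** Called for every observed activity, e.g. each log write. */
    void onActivity() {
        boolean forward;
        synchronized (this) {
            long now = clock.getAsLong();
            forward = !notifiedOnce || now - lastNotifiedMillis >= minIntervalMillis;
            if (forward) {
                notifiedOnce = true;
                lastNotifiedMillis = now;
            }
        }
        if (forward) {
            notifyTimeout.run(); // run outside the lock to keep the critical section short
        }
    }
}
```

With a 10 s timeout and a ratio of 0.1, activities arriving less than 1 s after the last forwarded one are dropped, and the next one after that window is forwarded at once.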