Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent sending interrupt signals by lost killer tasks by activity #231

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

dsteiert
Copy link

@dsteiert dsteiert commented Aug 22, 2022

The activity timeout quite often stops the builds on our servers. Based on the log messages, it shouldn't behave like this. A colleague of ours analyzed the plugin source code and figured out that the problem is caused by "lost" killer tasks. When multiple threads operate on the same TimeoutStepExecution object, it is possible that killers are not stopped when they should be. He wanted to fix the current implementation, but it is very confusing, for example method interactions ("→" = calls):
resetTimersetupTimer → sets timer with: cancelresetTimersetupTimercancel

We are testing the new implementation on our server and we haven't hit any issues yet.

The introduced changes also made the timer more precise. The current one allowed extending the time by 1/10 time or more.

He was aware that for some time we will have to use a forked version, so instead of overwriting the class, he introduced a new one - TimeoutStepExecutionThreadSafe. It is used instead of the original one when the org.jenkinsci.plugins.workflow.steps.TimeoutStep.threadsafe is set to true (false by default). It should also make the first review cycle easier, when the diff it simple and the original class is easily available to compare with the new one.

We executed the TimeoutStepTest tests with the new implementation and all tests finished successfully.


git message:

When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:

  • the implementation of the activity and absolute timeouts have been separated to improve the code readability
  • the tasks executed after a delay are always created in synchronized sections, to prevent losing tasks which should be canceled
  • the timer logic of the activity timeout is changed from always stopping to verifying if the logic should be stopped or continued, so the number of timers is always under control (less instances)
  • the Tick class is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting the org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio property. It informs when the earliest the information about new activities should be sent to the timeout (time * ratio). When there were no activities in that time, then the next activity will be announced right after it has been reported

There are additional changes introduced in this commit:

  • the timeout id is always printed in the log messages which improves debugging (e.g. when timeouts are nested)
  • the logic is more precise. The previous implementation allowed exceeding the timeout by 1/10 of the time (JENKINS-63696). The new stops the process right after the timeout is reached. There is a property to allow exceeding it a little bit - org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision. It is necessary to not abort the logic due to delay in the notification process.

  • Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  • Ensure that the pull request title represents the desired changelog entry
  • Please describe what you did
  • Link to relevant issues in GitHub or Jira
  • Link to relevant pull requests, esp. upstream and downstream changes
  • Ensure you have provided tests - that demonstrates feature works or fixes the issue

When many activity timeouts are run at the same time, sometimes the "Sending interrupt signal to process" message appears and the build is aborted (JENKINS-58752). The "Cancelling nested steps due to timeout" message is never printed. The code has been refactored to prevent such issues:
- the implementation of the activity and absolute timeouts have been separated to improve the code readability
- the tasks executed after a delay are always created in synchronized sections, to prevent losing tasks which should be canceled
- the timer logic of the activity timeout is changed from always stopping to verifying if the logic should be stopped or continued, so the number of timers is always under control (less instances)
- the `Tick` class is replaced by a listener which notifies the step less frequently about the changes. The behavior could be controlled by setting the `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityNotifyWaitRatio` property. It informs when the earliest the information about new activities should be sent to the timeout (`time * ratio`). When there were no activities in that time, then the next activity will be announced right after it has been reported

There are additional changes introduced in this commit:
- the timeout id is always printed in the log messages which improves debugging (e.g. when timeouts are nested)
- the logic is more precise. The previous implementation allowed exceeding the timeout by 1/10 of the time (JENKINS-63696). The new stops the process right after the timeout is reached. There is a property to allow exceeding it a little bit - `org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.activityPrecision`. It is necessary to not abort the logic due to delay in the notification process.
@jglick
Copy link
Member

jglick commented Nov 2, 2022

It should also make the first review cycle easier, when the diff it simple and the original class is easily available to compare with the new one.

In fact it makes this PR unreviewable since it is all new code—not visible what you are modifying.

@stevenh
Copy link

stevenh commented Oct 20, 2023

@jglick I've had a look at this PR as we're also seeing random cancellation when there is clear activity as detailed in JENKINS-58752.

As described in the notes above it's quite a major refactor to address a number of issues. I believe I can apply it to the original file but not in a way which would maintain serial compatibility, would that be acceptable?

@jglick
Copy link
Member

jglick commented Oct 21, 2023

Serial compatibility is expected, so if there is no way to retain that on an existing class then the only option is to deprecate the original class (along with any now dead code such as constructors which would never again be called).

There are several examples just in this plugin, such as: c1733a9

@jglick jglick requested a review from a team October 21, 2023 12:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants