Fix https_batch deadlock due to golang timer changes #648

Open
wants to merge 2 commits into
base: main

nicklas-dohrn
Contributor

@nicklas-dohrn nicklas-dohrn commented Dec 13, 2024

What is changed:

This changes the way the syslog batches are triggered.
The new implementation no longer uses timers that need to be reset and checked. Instead, each https_batch drain uses a ticker that fires once every second, so a pending batch is sent at least once a second.
The new logic is as follows:

  • If there is a log present in the queue, send it at most one second later, together with every other log from the same second.
  • If more logs arrive in one second than the 'batchSize' allows, send the batch immediately.
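
The two rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `runBatcher`, its parameters, and the `send` callback are hypothetical names, and plain strings stand in for log envelopes.

```go
package main

import (
	"fmt"
	"time"
)

// runBatcher is a hypothetical helper, not the implementation in this PR.
// It collects messages from in and hands them to send either when
// batchSize messages have accumulated or on the next tick, whichever
// comes first. A tick with an empty batch does nothing.
func runBatcher(in <-chan string, batchSize int, interval time.Duration, send func([]string)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	batch := make([]string, 0, batchSize)
	flush := func() {
		if len(batch) == 0 {
			return // nothing queued, so the tick is a no-op
		}
		send(batch)
		batch = make([]string, 0, batchSize)
	}

	for {
		select {
		case msg, ok := <-in:
			if !ok {
				flush() // input closed: send what is left and stop
				return
			}
			batch = append(batch, msg)
			if len(batch) >= batchSize {
				flush() // batch is full: send immediately
			}
		case <-ticker.C:
			flush() // periodic send of whatever has accumulated
		}
	}
}

func main() {
	in := make(chan string)
	done := make(chan struct{})
	go func() {
		defer close(done)
		runBatcher(in, 2, 50*time.Millisecond, func(b []string) {
			fmt.Println("sending batch of", len(b))
		})
	}()
	for _, m := range []string{"a", "b", "c"} {
		in <- m
	}
	close(in)
	<-done
}
```

Because the ticker is never stopped, drained, or reset inside the loop, there is no channel state to get out of sync. The cost is one wake-up per interval even when the queue is empty.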

The problem fixed by this change:

The old implementation could deadlock because of the way the timer channel was reset. The Go 1.23 update changes timer behaviour further (https://tip.golang.org/doc/go1.23#timer-changes).
-> The logic had to change to prevent further issues on future Go upgrades.
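
For readers unfamiliar with this class of bug, here is a general illustration of how resetting a timer can hang. It is not taken from the old https_batch code, and whether the old code failed in exactly this way is an assumption. `resetTimer` and `fireTwice` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"time"
)

// resetTimer is a hypothetical helper showing a reset that cannot block.
// A widely used form before Go 1.23 was
//
//	if !t.Stop() { <-t.C }
//
// which blocks forever when the expiry value has already been received
// elsewhere. The select with a default branch drains the channel only if
// a value is actually waiting.
func resetTimer(t *time.Timer, d time.Duration) {
	if !t.Stop() {
		select {
		case <-t.C:
		default:
		}
	}
	t.Reset(d)
}

// fireTwice lets a timer expire, consumes the value, resets the timer and
// waits for the second expiry. It reports whether the second expiry arrived.
func fireTwice(d time.Duration) bool {
	t := time.NewTimer(d)
	<-t.C            // the expiry value is consumed here
	resetTimer(t, d) // a blocking drain would hang at this point
	select {
	case <-t.C:
		return true
	case <-time.After(d + time.Second):
		return false
	}
}

func main() {
	fmt.Println("second expiry received:", fireTwice(10*time.Millisecond))
}
```

A ticker-based loop sidesteps the question entirely, since it never needs to stop, drain, or reset anything.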

Impact:

The new implementation will tick more often, but the overhead should be minimal because a tick does nothing for a drain with an empty queue.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Testing performed?

  • Unit tests
  • Tested in dev cluster

Checklist:

  • This PR is being made against the main branch, or relevant version branch
  • I have made corresponding changes to the documentation
  • I have added testing for my changes

@nicklas-dohrn nicklas-dohrn requested a review from a team as a code owner December 13, 2024 13:08
@nicklas-dohrn nicklas-dohrn force-pushed the main branch 2 times, most recently from f2dac59 to 5e8242f Compare December 13, 2024 13:22
@juergen-walter

@ctlong can you check please

Contributor

@chombium chombium left a comment

@nicklas-dohrn Thanks for the fix. Please take a look at the linting errors. There is no error checking after the sendHttpRequest call.

Review thread on src/pkg/egress/syslog/https_batch.go (outdated, resolved)
Member

@ctlong ctlong left a comment

Can you write a test that fails with the old logic and passes with the new logic?

@nicklas-dohrn
Contributor Author

I am currently rewriting the tests anyway, but I will make sure to include a test that proves that the old implementation deadlocks/hangs in certain scenarios.

@nicklas-dohrn nicklas-dohrn force-pushed the main branch 2 times, most recently from 6a6a850 to f0d7e6c Compare December 29, 2024 11:19
@nicklas-dohrn nicklas-dohrn requested a review from ctlong December 29, 2024 11:20
@nicklas-dohrn
Contributor Author

This change now contains a test case that does not work with the old timer-based implementation.
I made the ticker's defaultTime configurable in tests, so the test cases run faster than they would with 1-second intervals.

4 participants