Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use runner pkg in loki source file #2428

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Conversation

wildum
Copy link
Contributor

@wildum wildum commented Jan 16, 2025

PR Description

Why is a rework needed?

The current implementation of the loki.source.file has a bug on Windows: if you rotate a file by deleting it and recreating it with the same name, the file won't be tailed anymore unless a discovery component had time to update that the file was every missing (which does not usually happen because the sync period is usually much higher that the time it takes to rotate the file).

An attempt to fix this bug was done here: #2282, but the fix was not satisfying:

  • it added complexity
  • it added an argument that's only relevant for a Windows edge case

From there it was decided to rework the component to add a proper retry mechanism.

What are the changes?

Previously, all readers were stopped and deleted on every update. The tailing was also started in the Update function.
Now the readers are created unstarted in the Update function and the runner pkg that we use in other components is responsible for running them in the run function.

What are the gains from the rework?
  • The log rotation file bug on Windows is fixed by the backoff retry that is handled by the runners. The test "TestDeleteRecreateFile" fails on Windows on main but passes with this solution.
  • The code is easier to follow because it reuses a generic runner implementation. It's also easier to test because all the goroutines are now started and stopped via the Run function. Previously, if you created the component, it would call Update which would start a bunch of goroutines that would only be cleaned up in the Run function.
  • CPU usage seems to be lower on heavy setups because the existing readers are kept between updates as long as the target is still there.
  • Increased test coverage
  • Cleaned up some old comments that don't apply anymore
What about the performances?

I ran locally in parallel two Alloy (latest main vs rework). They tailed 2000 files each containing a few lines of logs. Every 3 seconds 10 of these files were deleted and 10 new ones were created with different names.

The number of goroutines and the amount of memory are still the same:

  • 12036 vs 12042 (that's a lot of goroutines, further work could be done to reduce it if that's a problem)
  • 44MB vs 46MB

But in terms of CPU, there seems to be a significant improvement (50% according to the flamegraph):

Before

Screenshot 2025-01-16 at 12 13 17

After

Screenshot 2025-01-16 at 12 13 31

Which issue(s) this PR fixes

Fixes #2345

PR Checklist

  • CHANGELOG.md updated
  • Tests updated

@wildum wildum requested a review from a team as a code owner January 16, 2025 11:15
func (c *Component) reportSize(path, labels string) {
// Ask the reader to update the size if a reader exists, this keeps
// position and size metrics in sync.
if reader, ok := c.readers[positions.Entry{Path: path, Labels: labels}]; ok {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic was wrong because the readers were reset at the start of every update and because the readers were dedup before the size report so I just got rid of it

Copy link
Contributor

@thampiotr thampiotr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would ideally like to see it running for some time in a dev cluster so we can verify there are no obvious issues with it. Do you think we could set something up? 🤔

internal/component/loki/source/file/runner.go Outdated Show resolved Hide resolved
internal/component/loki/source/file/tailer.go Outdated Show resolved Hide resolved
internal/component/loki/source/file/decompresser.go Outdated Show resolved Hide resolved
internal/component/loki/source/file/decompresser.go Outdated Show resolved Hide resolved
internal/component/loki/source/file/file.go Outdated Show resolved Hide resolved
return
}

if fi.Size() < pos {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a file is rotated but the new rotated file is larger than the previous, do we miss the first pos lines?

Copy link
Contributor Author

@wildum wildum Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we do. That would be a problem for Windows because on Unix the file rotation does not exit the readlines function. But it also a problem in general: if you stop Alloy and replace the file with another file of the same name but bigger, it will also only start reading at the saved pos.
I suggest to add a comment there and see whether this becomes a problem or not. If it does, saving and comparing the file creation timestamps might be a good approach to solve this

internal/component/loki/source/file/tailer.go Outdated Show resolved Hide resolved
level.Error(t.logger).Log("msg", "error marking file position when stopping tailer", "path", t.path, "error", err)
}
// Save the current position before shutting down tailer
err := t.markPositionAndSize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mark after we wait for readLines to complete?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it for the decompressor but for the tailer we can't do it directly because you cannot get the tail size after you stopped the tail

@wildum
Copy link
Contributor Author

wildum commented Jan 17, 2025

@thampiotr @dehaansa while going over the review from Sam, I noticed that there were some race conditions between the Stop and the Run functions. To address it:

  • I made sure that the Stop() function exits before exiting the Run function of the runner
  • I protected the first part of the Run function and the first part of the Stop function with mutex and added a "stopping" boolean. This handles the scenario where the Stop function is called before the Run function (or in between two Run functions). When this happens, the Run function will check the stopping bool and exit early + cleanup to release the Stop function.

I hope that the logic is still ok to follow. The added complexity is not ideal but all the different go routines that this component is spawning are not making the job easy. LMK what you think

I ran it with 4000 files and did not see any problem + there was still a 50% CPU improvement

@thampiotr thampiotr self-requested a review January 17, 2025 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rework of loki.source.file
3 participants