Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise compaction state machine #410

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sebastianburckhardt
Copy link
Member

This PR addresses several issues and makes some fixes related to the following issues we observed:

Issue 1. The partition load table does not show activity that originates from compactions or checkpoints being performed. This is not ideal because it can mean that (a) issues such as infinite checkpointing or compaction loops easily are not immediately visible (e.g. see #409), and (b) the scale controller is not aware of the true amount of work being performed by partitions.

To fix this we add activity level (L) to the partition load monitor whenever a checkpoint or compaction is in progress. Also, if compactions take particularly long, we add (M) or (H) indicators.

Issue 2. Compactions were only triggered during idle periods, and only after some time delay. This is a problem if there are no idle periods, or if compactions need to run more frequently, such as when we have a continuous influx of requests.

To fix this we check whether a compaction should be performed independently of whether the partition is idle, and how much time has passed.

Issue 3. the new, more efficient compaction algorithm inhttps://github.com//pull/408 warrants different parameter tuning - compaction is now much less expensive compared to the checkpointing that is triggered along with it, so we should do fewer and larger compactions

We adjust the values for the max compaction area size (50000 -> 200000) and the minimum expected reduction at which we start compacting (5000 -> 10000)

Copy link
Member

@davidmrdavid davidmrdavid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost ready to approve, just one question

Comment on lines -576 to -581
// if we are processing events that count as activity, our latency category is at least "low"
if (markPartitionAsActive)
{
this.loadInfo.MarkActive();
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after this deleted piece of code, this method may return if it is shutting down. Could we be losing information by no longer recording the latency right before a shut down?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that would be o.k. since once we shut down we are no longer reporting the partition load anyway. The partition will be started somewhere else and start reporting load from there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants