Revise compaction state machine #410

sebastianburckhardt · 2024-07-25T20:23:17Z

This PR fixes the following issues we observed:

Issue 1. The partition load table does not show activity that originates from compactions or checkpoints. This is a problem because (a) issues such as infinite checkpointing or compaction loops are not immediately visible (e.g. see #409), and (b) the scale controller is not aware of the true amount of work being performed by partitions.

To fix this, we add an activity level (L) to the partition load monitor whenever a checkpoint or compaction is in progress. Also, if a compaction takes particularly long, we add (M) or (H) indicators.
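As a rough illustration of the indicator logic described above, the mapping could look like the sketch below. This is not the code in this PR: the class name `MaintenanceLoad`, the method `Indicator`, and both duration thresholds are hypothetical, since the description does not state what counts as "particularly long".

```csharp
using System;

// Hypothetical sketch; names and thresholds are assumptions, not taken from the PR.
public static class MaintenanceLoad
{
    // Assumed cutoffs for a compaction that takes "particularly long".
    static readonly TimeSpan MediumAfter = TimeSpan.FromSeconds(30);
    static readonly TimeSpan HighAfter = TimeSpan.FromMinutes(2);

    // Returns 'L', 'M', or 'H' while maintenance work is running, or null when
    // neither a checkpoint nor a compaction is in progress.
    // compactionElapsed is null when no compaction is running.
    public static char? Indicator(bool checkpointInProgress, TimeSpan? compactionElapsed)
    {
        if (compactionElapsed is TimeSpan elapsed)
        {
            if (elapsed >= HighAfter) return 'H';
            if (elapsed >= MediumAfter) return 'M';
            return 'L';
        }

        return checkpointInProgress ? 'L' : (char?)null;
    }
}
```

With these assumed thresholds, a checkpoint on its own or a compaction that has just started reports (L), and only a long-running compaction moves the partition to (M) or (H).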

Issue 2. Compactions were only triggered during idle periods, and only after some time delay. This is a problem if there are no idle periods, or if compactions need to run more frequently, such as when we have a continuous influx of requests.

To fix this, we check whether a compaction should be performed regardless of whether the partition is idle and of how much time has passed.

Issue 3. The new, more efficient compaction algorithm in https://github.com//pull/408 warrants different parameter tuning: compaction is now much less expensive compared to the checkpointing that is triggered along with it, so we should do fewer and larger compactions.

We adjust the values for the maximum compaction area size (50000 -> 200000) and the minimum expected reduction at which we start compacting (5000 -> 10000).
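To make the effect of the new values concrete, here is a minimal sketch of a trigger check that uses them. Only the two numbers come from this description. The class `CompactionPolicy`, its members, and the way the expected reduction is supplied by the caller are hypothetical and may not match how Netherite applies these settings.

```csharp
using System;

// Hypothetical sketch; only the two constants come from the PR description.
public static class CompactionPolicy
{
    public const long MaxCompactionAreaSize = 200_000; // previously 50_000
    public const long MinExpectedReduction = 10_000;   // previously 5_000

    // Caps the area handled by a single compaction at the configured maximum.
    public static long AreaToCompact(long eligibleAreaSize) =>
        Math.Min(eligibleAreaSize, MaxCompactionAreaSize);

    // Decides purely on the expected reduction. There is deliberately no
    // "partition is idle" or "time since last compaction" input (see Issue 2).
    public static bool ShouldCompact(long expectedReduction) =>
        expectedReduction >= MinExpectedReduction;
}
```

Under these assumptions, an expected reduction of 7000 would have triggered a compaction with the old threshold but no longer does, which matches the intent of doing fewer and larger compactions.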

davidmrdavid

I'm almost ready to approve, just one question

davidmrdavid · 2024-08-21T20:47:08Z

src/DurableTask.Netherite/StorageLayer/Faster/StoreWorker.cs

-                // if we are processing events that count as activity, our latency category is at least "low"
-                if (markPartitionAsActive)
-                {
-                    this.loadInfo.MarkActive();
-                }
-


After this deleted piece of code, this method may return if it is shutting down. Could we be losing information by no longer recording the latency right before a shutdown?

sebastianburckhardt · 2024-08-21

I think that would be OK, since once we shut down we are no longer reporting the partition load anyway. The partition will be started somewhere else and will start reporting load from there.

revise compaction state machine

d539c2b

davidmrdavid reviewed Aug 21, 2024
