agent: Decrease LFC metrics fetch frequency from 1/15hz -> 1/60hz #1187
I.e., instead of fetching LFC metrics every 15 seconds, we will now fetch them every 60s.
The reason is that 15s intervals cause oscillations due to the 1-minute buckets in the metrics we're fetching. Second-to-second changes in load can cause small, periodic changes as they shift their location within the minute-long buckets. Fetching metrics every minute should fix this, because we get better alignment (although it's still possible for something to straddle the boundary between buckets and cause strange results, that's much less likely than what we're currently seeing in practice).
However, going up to a minute without LFC metrics wouldn't be good, because we require all sources of metrics to be available to downscale. So this commit adds new functionality to the metrics fetching config to guarantee that the first metrics fetch is done within a smaller time interval, even if the rest are evenly distributed over the full range of the refresh period.
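To illustrate the scheduling idea described above, here is a minimal Python sketch. It is not the agent's actual code: the function name, parameters, and the specific choice of a random steady-state phase plus one extra early fetch are all assumptions made for illustration, and they are only one way to read "first fetch within a smaller interval, the rest distributed over the full refresh period".

```python
import random


def fetch_schedule(refresh_s, first_within_s, count, rng):
    """Return `count` fetch times, in seconds since startup.

    Hypothetical sketch: the steady-state phase is drawn uniformly from the
    whole refresh period, but the first fetch is guaranteed to happen within
    `first_within_s` seconds.
    """
    if not 0 < first_within_s <= refresh_s:
        raise ValueError("first_within_s must be in (0, refresh_s]")
    if count < 1:
        raise ValueError("count must be at least 1")

    # Steady-state offset within the refresh period.
    phase = rng.uniform(0, refresh_s)

    times = []
    if phase > first_within_s:
        # The regular schedule would start too late, so add one early fetch
        # (an assumption about how the guarantee could be met).
        times.append(rng.uniform(0, first_within_s))

    # Regular fetches: one per refresh period, at the chosen phase.
    k = 0
    while len(times) < count:
        times.append(phase + k * refresh_s)
        k += 1
    return times


if __name__ == "__main__":
    print(fetch_schedule(60.0, 15.0, 5, random.Random()))
```

With `refresh_s=60` and `first_within_s=15`, the first entry is always at most 15 seconds, and all later fetches are exactly 60 seconds apart.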
Broadly part of neondatabase/cloud#22214.
Another option here is smoothing -- e.g., using the average of the LFC goal sizes from the last minute, rather than just the most recent one. There are trade-offs either way; I'm not sure which one is better.
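For comparison, the smoothing alternative could look roughly like the following Python sketch. The class name, the 60-second default window, and the use of a plain unweighted mean are assumptions for illustration; none of this is implemented in this PR.

```python
from collections import deque


class GoalSmoother:
    """Hypothetical helper: mean of the LFC goal sizes seen in the last window."""

    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, goal_size) pairs, oldest first

    def add(self, t, goal):
        """Record a sample taken at time `t` and return the smoothed goal."""
        self.samples.append((t, goal))
        # Drop samples that are a full window (or more) older than `t`.
        while self.samples[0][0] <= t - self.window_s:
            self.samples.popleft()
        return sum(g for _, g in self.samples) / len(self.samples)
```

With samples every 15s and a goal that alternates between two values, the smoothed result settles at their average once the window holds an equal number of each, at the cost of reacting more slowly to real changes in load.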
I wrote about this a little bit here: https://www.notion.so/neondatabase/162f189e004780baa0f2f2c982735554?pvs=4#167f189e0047801aa2dde11c06c432bb