Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't check every IPAM allocation on every sync. #9654

Merged

Conversation

caseydavenport
Copy link
Member

@caseydavenport caseydavenport commented Dec 30, 2024

Description

Fixes #7841

The IPAM GC controller currently iterates every IP address allocation in the cluster every time an IP address is assigned or released. This can result in excessive CPU usage on larger / more active clusters.

Instead, this PR changes the behavior such that we limit the allocations we check to:

  • Those on nodes whose IPAM state has changed since the last sync.
  • Those on nodes who have had a Pod Deletion Event since the last sync.

The idea is that this set should be much smaller than every node, and will allow us to incrementally scan the cluster instead of checking the entire cluster on every syncIPAM() call.

We will have a periodic (by default every 5m) sync that will check every node, as a backstop, and it's worth noting that this change on its own means we may be relying on the periodic sync slightly more often in order to determine when IPAM allocations have leaked, as it is possible leaks can occur on nodes without the controller receiving the requisite notifications to notice it.

We might be able to enhance this further by tracking when individual allocations are dirty (and, for example, only clearing them after the grace period proves they are OK or leaked). But I think that's a larger change and this one should do the trick I hope.

Related issues/PRs

Todos

  • Tests
  • Documentation
  • Release note

Release Note

Improve CPU efficiency of IP address garbage collection controller

Reminder for the reviewer

Make sure that this PR has the correct labels and milestone set.

Every PR needs one docs-* label.

  • docs-pr-required: This change requires a change to the documentation that has not been completed yet.
  • docs-completed: This change has all necessary documentation completed.
  • docs-not-required: This change has no user-facing impact and requires no docs.

Every PR needs one release-note-* label.

  • release-note-required: This PR has user-facing changes. Most PRs should have this label.
  • release-note-not-required: This PR has no user-facing changes.

Other optional labels:

  • cherry-pick-candidate: This PR should be cherry-picked to an earlier release. For bug fixes only.
  • needs-operator-pr: This PR is related to install and requires a corresponding change to the operator.

@caseydavenport caseydavenport requested a review from a team as a code owner December 30, 2024 19:35
@marvin-tigera marvin-tigera added this to the Calico v3.30.0 milestone Dec 30, 2024
@marvin-tigera marvin-tigera added release-note-required Change has user-facing impact (no matter how small) docs-pr-required Change is not yet documented labels Dec 30, 2024
@caseydavenport
Copy link
Member Author

caseydavenport commented Dec 30, 2024

This isn't perfect - it's possible for example that on larger clusters with lots of churn, this will still end up iterating a LOT of IPAM allocations. We could probably make this even better if we tracked dirty-ness at the allocation level rather than at the node level.

@caseydavenport caseydavenport added the docs-not-required Docs not required for this change label Dec 30, 2024
@marvin-tigera marvin-tigera removed the docs-pr-required Change is not yet documented label Dec 30, 2024
@caseydavenport
Copy link
Member Author

After writing this PR up, I've been thinking a bit more about this and I think I'd like to adjust the strategy here.

Here are the 4 main responsibilities of this IPAM controller:

  1. Periodic resync. This is meant to catch leaks, and so should scan the entire IPAM data set.
  2. Whenever one or more nodes are deleted, we clean up blocks that belonged to the deleted node (by scanning the whole IPAM data set).
  3. Whenever an IPAM block is updated, we scan the whole IPAM data set (indexed in a different way than the above) in order to generate IPAM metrics for Prometheus.
  4. When we see that a Node has more blocks than it needs, we clean them up.

For (1), we need to scan the whole IPAM data set. But we can do this relatively infrequently.

For (2), we don't necessarily need to scan the entire IPAM data set - we could just clean up the known affine blocks. Scanning the whole set has some advantages - we might catch a leak slightly sooner if it was in a borrowed IPAM block, but the full scan should catch that eventually anyway.

For (3), we do need to scan the entire IPAM data set, but we can probably stop doing it so frequently. Right now we do it every batch of updates, but Prometheus is polling anyway so there's really no value in doing this more frequently than the Prometheus poll timer. We could probably change this to be every 15s (and only if something changed) without any real UX problem.

For (4), deleting unused IPAM blocks that are affine to existing nodes can be (and already are) performed without needing to scan the entire IPAM data set. We track empty blocks as updates come in, so we know exactly what the set of empty blocks is on demand. We can do an empty block check every sync without much cause for concern, so no need to change this.

I think I would propose:

  1. Scan the full IPAM data set to check for leaks on a periodic resync (i.e., every 5 minutes).
  2. Scan the full IPAM data set to check blocks for deleted nodes whenever one or more nodes are deleted (could be further optimized to check for particular nodes, but not necessary IMO).
  3. Only perform a metrics scan of the dataset if IPAM data has changed, and no more frequently than every 15s.

@fasaxc
Copy link
Member

fasaxc commented Jan 3, 2025

@caseydavenport for (3) do we scan an in-memory cache, or do we refresh the whole list from the API server? Seems like we could certainly use a cache for that less critical work.

@caseydavenport
Copy link
Member Author

@fasaxc we use an in-memory cache for all of the above! So at least we have that going 😅

@caseydavenport
Copy link
Member Author

caseydavenport commented Jan 3, 2025

The problem with (3) I just realized is that we rely on the full IPAM scan to determine which IPs are GC candidates / leaks, which then feeds into the metrics.. so some metrics will end up being gated by the full IPAM scan interval anyway!

That's not really a problem per-se (it's still accurate and reflects our slower time to detection for IPAM GC) but it does break some assumptions in our tests.

I do think this is a good reason to keep the "only scan nodes that have changed" logic though. It maintains quick updates for our metrics, while also reducing the amount of work that happens on each event.

@caseydavenport caseydavenport force-pushed the casey-ipam-gc-efficiency branch from 278f85f to 1036335 Compare January 3, 2025 22:30
@caseydavenport
Copy link
Member Author

My most recent changes keep the "immediate" sync behavior, where we trigger a sync for each block update, each node deletion update, and each pod deletion update (in order to maintain quick metrics updates). What I've done instead is back to the original approach:

  • Only iterate over nodes that have changed since the last sync.
  • When we receive an update, wait a little bit longer to see if any more updates can be added to the batch so that we don't call syncIPAM so much.

Copy link
Member

@fasaxc fasaxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of thoughts but perhaps we should bank it!

@@ -270,7 +353,9 @@ func (c *ipamController) acceptScheduleRequests(stopCh <-chan struct{}) {
log.WithField("batchSize", i).Debug("Triggering sync after batch of updates")
kick(c.syncChan)
case <-t.C:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could postpone the periodic sync if you've already done a full sync recently.

@@ -250,18 +293,58 @@ func (c *ipamController) acceptScheduleRequests(stopCh <-chan struct{}) {
for {
// Wait until something wakes us up, or we are stopped.
select {
case <-c.nodeDeletionChan:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to service both channels during the consolidation window? (But this is likely a lot better than it was!)

@caseydavenport caseydavenport merged commit d87fb98 into projectcalico:master Jan 21, 2025
3 checks passed
@caseydavenport caseydavenport deleted the casey-ipam-gc-efficiency branch January 21, 2025 17:51
tomastigera pushed a commit to ioworker0/calico that referenced this pull request Jan 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required Docs not required for this change release-note-required Change has user-facing impact (no matter how small)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

calico-kube-controller 3.26 high cpu usage frequent ipam syncs
3 participants