Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calico-kube-controller 3.26 high cpu usage frequent ipam syncs #7841

Closed
juliantaylor opened this issue Jul 5, 2023 · 4 comments · Fixed by #9654
Closed

calico-kube-controller 3.26 high cpu usage frequent ipam syncs #7841

juliantaylor opened this issue Jul 5, 2023 · 4 comments · Fixed by #9654

Comments

@juliantaylor
Copy link

juliantaylor commented Jul 5, 2023

We updated calico in one of our larger clusters from 3.24.5 to 3.26.1 and since then the kube-controller has a comparably high cpu usage and seems to perform many ipam syncs.
3.24 in an cluster of about 200 nodes and 10000 pods uses about 200m cpu while 3.26.1 uses a full cpu.

The underlying cause appears to be a change probably added in 3.24.6 which made the allocationIsValid function actually use the cache which reduced the cache sync time from a few minutes to a couple seconds.
So essentially a problem was fixed but maybe the high cpu usage is a fixable issue or the sync frequency is too high.

a pprofile showed it spends lots of time in defaultWorkloadEndpointConverter in checkAllocations

         0     0%     0%    159.53s 67.54%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).acceptScheduleRequests
     0.64s  0.27%  0.27%    153.88s 65.15%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).checkAllocations
         0     0%  0.27%    153.88s 65.15%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).syncIPAM
     1.44s  0.61%  0.88%    129.67s 54.90%  github.com/projectcalico/calico/kube-controllers/pkg/controllers/node.(*ipamController).allocationIsValid
         0     0%  0.88%     99.52s 42.13%  github.com/projectcalico/calico/libcalico-go/lib/backend/k8s/conversion.defaultWorkloadEndpointConverter.PodToWorkloadEndpoints
     8.49s  3.59%  4.48%     99.52s 42.13%  github.com/projectcalico/calico/libcalico-go/lib/backend/k8s/conversion.defaultWorkloadEndpointConverter.podToDefaultWorkloadEndpoint

Current Behavior

grep -E "Synchronizing IPAM|sync com|Triggering sync after batch of updates "
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:54.246 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:57.021 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:46:57.026 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:08.677 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=1
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:08.677 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.352 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:11.357 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.949 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:13.954 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.794 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:48.801 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.709 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:51.716 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.466 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=13
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:47:54.471 [DEBUG][1] ipam.go 968: Synchronizing IPAM data
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.954 [DEBUG][1] ipam.go 1009: IPAM sync completed
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 285: Triggered IPAM sync complete
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 262: Triggering sync after batch of updates batchSize=14
calico-kube-controllers-7cc79d76fd-n5526 calico-kube-controllers 2023-07-05 15:48:28.960 [DEBUG][1] ipam.go 968: Synchronizing IPAM data

The difference to 3.24 seems to be that it uses the podcache which was changed in 3.24.6 #7503

3.24.5 instead logs the queries for each of the 9000 ipamhandles which takes several minutes:

calico-kube-controllers-bc987cddc-vr4gk calico-kube-controllers 2023-07-05 16:20:28.566 [DEBUG][1] ipam.go 1008: Synchronizing IPAM data
calico-kube-controllers-bc987cddc-vr4gk calico-kube-controllers 2023-07-05 16:20:28.572 [DEBUG][1] ipam.go 920: Querying Kubernetes API for pod handle="k8s-pod-network.9...
.

So the cache is now effective which reduces the load on the apiserver.

  • Calico version 3.26.1
  • Orchestrator version (e.g. kubernetes, mesos, rkt): kubernetes
  • Operating System and version: flatcar linux
@caseydavenport
Copy link
Member

Sounds like we pushed one bottleneck somewhere else 😅

perhaps we need some better internal ratelimiting / batching to make sure we're not eating up too much resource at once.

@RaSerge
Copy link

RaSerge commented Dec 12, 2024

Any progress here? This issue was fixed?

We hit same issue with high cpu usage on 3.26.4

@caseydavenport
Copy link
Member

I think this TODO is essentially the problem here:

// TODO: We're effectively iterating every allocation in the cluster on every execution. Can we optimize? Or at least rate-limit?

We previously had unintentional rate limiting, but now we're iterating everything for every sync. We do some batching, but it's clearly not enough.

I think we want to introduce "dirty" tracking here, so that we can reduce this loop to only checking the allocations that we know have changed since the last sync.

@caseydavenport
Copy link
Member

This is a candidate PR that might reduce the CPU usage: #9654

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants