
neon-cluster-operator: performance issues after cluster runs for a while #1844

Closed
jefflill opened this issue Aug 14, 2023 · 4 comments

Labels: bug (Identifies a bug or other failure) · cluster-operators (Related to one of our cluster operators) · perf (Performance related)

@jefflill (Collaborator)
neon-cluster-operator has high CPU utilization and looks like it's also slamming the API server. The combination is close to pegging the CPUs on a single node cluster. This happens after gracefully restarting the cluster and then waiting a while (like overnight).

[image attachment]

We've been fighting this one for a while now without actually posting an issue. We first noticed that desktop clusters come close to pegging their assigned CPUs, while a fresh single-node (non-desktop) cluster has low CPU utilization. We realized this was happening because the desktop cluster is shut down before the desktop node image is created and then restarted when deployed. We were able to replicate this on a fresh single-node (non-desktop) cluster by restarting it, and @marcusbooyah resolved a perf issue.

Unfortunately, we're seeing this behavior again after running the cluster overnight. I did restart the cluster gracefully, although I'm not sure that's required to reproduce this.

@jefflill jefflill self-assigned this Aug 14, 2023
@jefflill jefflill added bug Identifies a bug or other failure perf Performance related cluster-operators Related to one of our cluster operators labels Aug 14, 2023
@jefflill (Collaborator, Author)

I'm going to try out Marcus' fancy Operator SDK F5-to-debug feature!

@jefflill (Collaborator, Author)

jefflill commented Aug 15, 2023

I ran this overnight in the debugger on a neon-desktop cluster (which I also restarted for good measure) and I'm not seeing the performance problem. I set the log level to debug, but all I'm seeing are periodic Prometheus health checks every 20 seconds or so, each taking less than 10 ms to run.

@jefflill (Collaborator, Author)

I had another desktop cluster running on my Windows Home box with the cluster operator running in-cluster, and that cluster is struggling. Looking at the operator logs, I'm seeing this webhook ping about every 10 seconds:

{"tsNs":1692110208024873600,"severity":"Information","body":"Request starting HTTP/1.1 POST https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s application/json 22016","categoryName":"Microsoft.AspNetCore.Hosting.Diagnostics","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Microsoft.AspNetCore.Hosting.Diagnostics","Protocol":"HTTP/1.1","Method":"POST","ContentType":"application/json","ContentLength":22016,"Scheme":"https","Host":"neon-cluster-operator.neon-system.svc:443","PathBase":"","Path":"/apps/v1/deployments/deploymentwebhook/mutate","QueryString":"?timeout=5s"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027119400,"severity":"Information","body":"Admission with method \"UPDATE\".","categoryName":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","neon.index":46427,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"1c1665ef3ebf19e5"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027194400,"severity":"Information","body":"Received request for deployment neon-monitor/grafana-deployment","categoryName":"Neon.Kube.Operator.Webhook.IMutatingWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IMutatingWebhook","neon.index":46428,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"a5d1840296ed887f"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"a5d1840296ed887f","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027222600,"severity":"Information","body":"AdmissionHook \"neonclusteroperator.v1deployment.deploymentwebhook\" did return \"True\" for \"UPDATE\".","categoryName":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","neon.index":46429,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"1c1665ef3ebf19e5"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208037529400,"severity":"Information","body":"Request finished HTTP/1.1 POST https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s application/json 22016 - 200 - application/json;+charset=utf-8 12.6546ms","categoryName":"Microsoft.AspNetCore.Hosting.Diagnostics","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Microsoft.AspNetCore.Hosting.Diagnostics","ElapsedMilliseconds":12.6546,"StatusCode":200,"ContentType":"application/json; charset=utf-8","ContentLength":null,"Protocol":"HTTP/1.1","Method":"POST","Scheme":"https","Host":"neon-cluster-operator.neon-system.svc:443","PathBase":"","Path":"/apps/v1/deployments/deploymentwebhook/mutate","QueryString":"?timeout=5s"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}

Note that I'm not seeing this in the debugger for the other cluster.
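For context on why this kind of traffic adds up: a mutating admission webhook registered without any selectors receives an admission request from the API server for every matching operation cluster-wide, so frequent Deployment status updates (like the grafana-deployment ones above) each trigger a round trip through the operator. A standard Kubernetes mitigation is to scope the registration with an `objectSelector` or `namespaceSelector`. This is a general sketch, not this repo's actual manifest; the webhook name, service, and path are taken from the logs above, while the `neonkube.io/managed` label is purely an assumed example:

```yaml
# Hypothetical sketch: scoping a mutating webhook so the API server only
# sends admission requests for labeled objects, instead of for every
# Deployment UPDATE in the cluster.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deploymentwebhook            # name assumed for illustration
webhooks:
  - name: neonclusteroperator.v1deployment.deploymentwebhook
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        resources: ["deployments"]
        operations: ["UPDATE"]
    # Only objects carrying this (assumed) label reach the webhook.
    objectSelector:
      matchLabels:
        neonkube.io/managed: "true"
    clientConfig:
      service:
        name: neon-cluster-operator
        namespace: neon-system
        path: /apps/v1/deployments/deploymentwebhook/mutate
```

Without such a selector, the per-request cost (~12 ms here) is multiplied across every Deployment churned by any controller in the cluster.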

@jefflill (Collaborator, Author)

CLOSING: It looks like the performance problem is related to CertManager/ACME rather than the cluster operator. I've opened a new issue to track that: #1847
