
neon-cluster-operator: performance issues after cluster runs for a while #1844

Closed
jefflill opened this issue Aug 14, 2023 · 4 comments

Labels: bug (Identifies a bug or other failure) · cluster-operators (Related to one of our cluster operators) · perf (Performance related)

@jefflill (Collaborator)
neon-cluster-operator has high CPU utilization and looks like it's also slamming the API server. The combination is close to pegging the CPUs on a single node cluster. This happens after gracefully restarting the cluster and then waiting a while (like overnight).

[image attachment]

We've been fighting this one for a while now without actually posting an issue. We first noticed that desktop clusters come close to pegging their assigned CPUs, while a fresh single-node (non-desktop) cluster has low CPU utilization. We realized this was happening because the desktop cluster is shut down before the desktop node image is created and then restarted when deployed. We were able to replicate this on a fresh single-node (non-desktop) cluster by restarting it, and @marcusbooyah resolved a perf issue.

Unfortunately, we're seeing this behavior again after running the cluster overnight. I did restart the cluster gracefully, although I'm not sure that's required to reproduce this.

@jefflill jefflill self-assigned this Aug 14, 2023
@jefflill jefflill added bug Identifies a bug or other failure perf Performance related cluster-operators Related to one of our cluster operators labels Aug 14, 2023
@jefflill (Collaborator, Author)

I'm going to try out Marcus' fancy Operator SDK F5-to-debug feature!

@jefflill (Collaborator, Author)

jefflill commented Aug 15, 2023

I ran this overnight in the debugger on a neon-desktop cluster (which I also restarted for good measure) and I'm not seeing the performance problem. I set the log level to debug, but all I'm seeing are periodic Prometheus health checks every 20 seconds or so, each taking less than 10 ms to run.

@jefflill (Collaborator, Author)

I had another desktop cluster running on my Windows Home box with the cluster operator running in-cluster, and that cluster is struggling. Looking at the operator logs, I'm seeing this webhook ping about every 10 seconds:

{"tsNs":1692110208024873600,"severity":"Information","body":"Request starting HTTP/1.1 POST https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s application/json 22016","categoryName":"Microsoft.AspNetCore.Hosting.Diagnostics","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Microsoft.AspNetCore.Hosting.Diagnostics","Protocol":"HTTP/1.1","Method":"POST","ContentType":"application/json","ContentLength":22016,"Scheme":"https","Host":"neon-cluster-operator.neon-system.svc:443","PathBase":"","Path":"/apps/v1/deployments/deploymentwebhook/mutate","QueryString":"?timeout=5s"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027119400,"severity":"Information","body":"Admission with method \"UPDATE\".","categoryName":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","neon.index":46427,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"1c1665ef3ebf19e5"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027194400,"severity":"Information","body":"Received request for deployment neon-monitor/grafana-deployment","categoryName":"Neon.Kube.Operator.Webhook.IMutatingWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IMutatingWebhook","neon.index":46428,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"a5d1840296ed887f"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"a5d1840296ed887f","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208027222600,"severity":"Information","body":"AdmissionHook \"neonclusteroperator.v1deployment.deploymentwebhook\" did return \"True\" for \"UPDATE\".","categoryName":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Neon.Kube.Operator.Webhook.IAdmissionWebhook","neon.index":46429,"traceid":"78cf8349f7ae456f1385cf695a64a8ee","spanid":"1c1665ef3ebf19e5"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}
{"tsNs":1692110208037529400,"severity":"Information","body":"Request finished HTTP/1.1 POST https://neon-cluster-operator.neon-system.svc:443/apps/v1/deployments/deploymentwebhook/mutate?timeout=5s application/json 22016 - 200 - application/json;+charset=utf-8 12.6546ms","categoryName":"Microsoft.AspNetCore.Hosting.Diagnostics","severityNumber":9,"attributes":{"dotnet.ilogger.category":"Microsoft.AspNetCore.Hosting.Diagnostics","ElapsedMilliseconds":12.6546,"StatusCode":200,"ContentType":"application/json; charset=utf-8","ContentLength":null,"Protocol":"HTTP/1.1","Method":"POST","Scheme":"https","Host":"neon-cluster-operator.neon-system.svc:443","PathBase":"","Path":"/apps/v1/deployments/deploymentwebhook/mutate","QueryString":"?timeout=5s"},"resources":{"service.name":"neon-cluster-operator","service.version":"0.10.0-beta.3+master.bb317d91","service.instance.id":"9c016084-42c0-4582-a525-a460e2e1374d"},"spanId":"1c1665ef3ebf19e5","traceId":"78cf8349f7ae456f1385cf695a64a8ee"}

Note that I'm not seeing this in the debugger for the other cluster.
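For context on why this kind of traffic adds up: a mutating admission webhook registered without any selectors receives an admission request from the API server for every matching operation cluster-wide, so frequent Deployment status updates (like the grafana-deployment ones above) each trigger a round trip through the operator. A standard Kubernetes mitigation is to scope the registration with an `objectSelector` or `namespaceSelector`. This is a general sketch, not this repo's actual manifest; the webhook name, service, and path are taken from the logs above, while the `neonkube.io/managed` label is purely an assumed example:

```yaml
# Hypothetical sketch: scoping a mutating webhook so the API server only
# sends admission requests for labeled objects, instead of for every
# Deployment UPDATE in the cluster.
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: deploymentwebhook            # name assumed for illustration
webhooks:
  - name: neonclusteroperator.v1deployment.deploymentwebhook
    admissionReviewVersions: ["v1"]
    sideEffects: None
    rules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        resources: ["deployments"]
        operations: ["UPDATE"]
    # Only objects carrying this (assumed) label reach the webhook.
    objectSelector:
      matchLabels:
        neonkube.io/managed: "true"
    clientConfig:
      service:
        name: neon-cluster-operator
        namespace: neon-system
        path: /apps/v1/deployments/deploymentwebhook/mutate
```

Without such a selector, the per-request cost (~12 ms here) is multiplied across every Deployment churned by any controller in the cluster.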

@jefflill (Collaborator, Author)

CLOSING: It looks like the performance problem is related to CertManager/ACME rather than the cluster operator. I've opened a new issue to track that: #1847
