neon-cluster-operator: performance issues after cluster runs for a while #1844
Comments
I'm going to try out Marcus' fancy Operator SDK F5-to-debug feature!
I ran this overnight in the debugger on a neon-desktop cluster (which I also restarted for good measure) and I'm not seeing this performance problem. I set the log level to debug, but all I'm seeing are periodic Prometheus health checks every 20 seconds or so, each taking less than 10 ms to run.
I had another desktop cluster running on my Windows Home box with the cluster operator running in-cluster, and that cluster is struggling. Looking at the operator logs, I'm seeing a webhook ping about every 10 seconds. Note that I'm not seeing this in the debugger for the other cluster.
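For reference, here's a minimal sketch of one way to confirm that cadence from outside the debugger: pull the operator's recent log through the Kubernetes API and measure the spacing between matching lines. The neon-system namespace, the app=neon-cluster-operator label selector, and the "webhook" match string are all illustrative assumptions, not confirmed values for this operator.

```python
from datetime import datetime

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Find the operator pod; namespace and label selector are assumptions.
pods = v1.list_namespaced_pod(
    "neon-system", label_selector="app=neon-cluster-operator")
pod_name = pods.items[0].metadata.name

# timestamps=True prefixes every log line with an RFC3339 timestamp.
log = v1.read_namespaced_pod_log(
    pod_name, "neon-system", since_seconds=600, timestamps=True)

stamps = []
for line in log.splitlines():
    if "webhook" in line.lower():  # illustrative match string
        # Trim nanoseconds so datetime.fromisoformat() can parse it.
        iso = line.split(" ", 1)[0].rstrip("Z")[:26]
        stamps.append(datetime.fromisoformat(iso))

gaps = [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
if gaps:
    print(f"{len(stamps)} matching lines, average gap {sum(gaps) / len(gaps):.1f}s")
```

The same loop would work for the 20-second Prometheus health checks above by swapping the match string.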
CLOSING: It looks like the performance problem is related to CertManager/ACME rather than the cluster operator. I've opened a new issue to track that: #1847
neon-cluster-operator has high CPU utilization and it looks like it's also slamming the API server. The combination comes close to pegging the CPUs on a single-node cluster. This happens after gracefully restarting the cluster and then waiting a while (like overnight).
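One way to put numbers on the CPU claim is to read per-pod usage from the metrics.k8s.io API, which is what `kubectl top pods` reports. A minimal sketch, assuming metrics-server is installed and that the operator runs in a hypothetical neon-system namespace:

```python
from kubernetes import client, config

config.load_kube_config()

# PodMetrics live behind the metrics.k8s.io aggregated API, reachable
# through the generic CustomObjectsApi. Namespace is an assumption.
metrics = client.CustomObjectsApi().list_namespaced_custom_object(
    group="metrics.k8s.io", version="v1beta1",
    namespace="neon-system", plural="pods")

def to_millicores(cpu: str) -> float:
    """Convert a Kubernetes CPU quantity ("250m", "1", "123456789n") to millicores."""
    if cpu.endswith("n"):
        return int(cpu[:-1]) / 1_000_000
    if cpu.endswith("u"):
        return int(cpu[:-1]) / 1_000
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return float(cpu) * 1000

for item in metrics["items"]:
    total = sum(to_millicores(c["usage"]["cpu"]) for c in item["containers"])
    print(f'{item["metadata"]["name"]}: {total:.0f}m')
```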
We've been fighting this one for a while now without actually posting an issue. We first noticed that desktop clusters come close to pegging their assigned CPUs, while a fresh single-node (non-desktop) cluster has low CPU utilization. We realized this was happening because the desktop cluster was shut down before creating the desktop node image and then restarted when deployed. We were able to replicate this on a fresh single-node (non-desktop) cluster by restarting it, and @marcusbooyah resolved a perf issue.
Unfortunately, we're seeing this behavior again after running the cluster overnight. I did restart the cluster gracefully, although I'm not sure that's required to reproduce this.