fix(perf-test): restore reachable backends, update node logic, and improve observability setup #262

Open: wants to merge 11 commits into base: main


bartsmykla (Contributor) commented Jan 28, 2025

This PR fixes the performance tests that were broken in #224.

Changes:

  • Restored reachable backends in the service graph to fix failing tests.
  • Updated logic to allocate enough nodes for 1000 services with 2 instances each.
  • Added a new "observability" node group in EKS to keep Prometheus and other tools separate.
    • Added an 80GB PersistentVolumeClaim for Prometheus to avoid storage issues when there are many workloads.
    • Ensured observability components run on the right node group using tolerations and nodeSelector.
  • Increased timeout in the test that checks certificate distribution to 360s, as generating certificates for 2000 services takes longer than before.
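
The node-count change above can be illustrated with a rough capacity calculation. This is only a sketch under assumed numbers: the `Resources` type, the `nodes_needed` helper, the per-pod requests, the node allocatable values, and the per-node pod limit are all hypothetical and are not taken from this PR.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """CPU and memory amounts, used for both pod requests and node capacity."""
    cpu_millicores: int
    memory_mib: int


def nodes_needed(services, instances_per_service, pod_request,
                 node_allocatable, max_pods_per_node):
    """Return how many nodes are needed to fit every service instance.

    A node can hold as many pods as its tightest limit allows:
    CPU, memory, or the per-node pod cap.
    """
    if pod_request.cpu_millicores <= 0 or pod_request.memory_mib <= 0:
        raise ValueError("pod requests must be positive")

    total_pods = services * instances_per_service
    pods_by_cpu = node_allocatable.cpu_millicores // pod_request.cpu_millicores
    pods_by_memory = node_allocatable.memory_mib // pod_request.memory_mib
    pods_per_node = min(pods_by_cpu, pods_by_memory, max_pods_per_node)
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on a node")

    # Round up so the last, partially filled node is counted.
    return math.ceil(total_pods / pods_per_node)


# Placeholder values for illustration only.
count = nodes_needed(
    services=1000,
    instances_per_service=2,
    pod_request=Resources(cpu_millicores=100, memory_mib=128),
    node_allocatable=Resources(cpu_millicores=4000, memory_mib=8192),
    max_pods_per_node=58,
)
print(count)
```

With these placeholder values CPU is the binding constraint (40 pods per node), so 2000 pods need 50 nodes. A calculation that only counts services, and not instances or resource requests, would come out lower.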

- Added back reachable backends in the service graph to fix failing tests.
- Updated node count logic to handle resource requests for 1000 services
  with 2 instances each. The old logic didn't provide enough nodes.

Signed-off-by: Bart Smykla <[email protected]>
lukidzi (Contributor) commented Jan 28, 2025

Increased timeout in the test that checks certificate distribution to 360s, as generating certificates for 2000 services takes longer than before.

Should this be investigated in Kuma?

Added a new "observability" node group in EKS to keep Prometheus and other
monitoring tools separate from other workloads. This helps ensure Prometheus
has enough resources, especially when monitoring many services.

Updated Prometheus setup to:
- Use an 80GB PersistentVolumeClaim to avoid running out of space when monitoring
  large workloads.
- Add tolerations and nodeSelector to make sure observability components run on
  the right node group.
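
As a rough illustration of how a nodeSelector and tolerations together pin workloads to a dedicated node group, the sketch below models a simplified version of the Kubernetes scheduling checks. The label key `node-group`, the taint key `dedicated`, and the helper names are hypothetical and are not the values used in this change.

```python
def tolerates(toleration, taint):
    """Simplified Kubernetes rule for whether a toleration matches a taint."""
    effect = toleration.get("effect", "")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key", "")
    if toleration.get("operator", "Equal") == "Exists":
        # An empty key with Exists tolerates every taint.
        return key == "" or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def can_schedule(pod_spec, node):
    """True if the node matches the nodeSelector and all blocking taints are tolerated."""
    selector = pod_spec.get("nodeSelector", {})
    labels_ok = all(node["labels"].get(k) == v for k, v in selector.items())
    blocking = [t for t in node.get("taints", [])
                if t["effect"] in ("NoSchedule", "NoExecute")]
    tolerations = pod_spec.get("tolerations", [])
    taints_ok = all(any(tolerates(tol, taint) for tol in tolerations)
                    for taint in blocking)
    return labels_ok and taints_ok


# Hypothetical scheduling fields for an observability pod.
prometheus_spec = {
    "nodeSelector": {"node-group": "observability"},
    "tolerations": [{"key": "dedicated", "operator": "Equal",
                     "value": "observability", "effect": "NoSchedule"}],
}
observability_node = {
    "labels": {"node-group": "observability"},
    "taints": [{"key": "dedicated", "value": "observability", "effect": "NoSchedule"}],
}
default_node = {"labels": {"node-group": "default"}, "taints": []}
test_workload_spec = {}  # no nodeSelector, no tolerations

print(can_schedule(prometheus_spec, observability_node))
print(can_schedule(prometheus_spec, default_node))
print(can_schedule(test_workload_spec, observability_node))
```

The two mechanisms do different jobs: the nodeSelector keeps the observability pods off the other nodes, while the taint and matching toleration keep other workloads off the observability nodes.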

Increased timeout in a test from 60s to 600s, as generating certificates for
2000 services takes significantly more time.
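
For readers unfamiliar with how such a timeout typically behaves in a test, here is a minimal polling sketch. It is not the test code from this repository: `wait_until`, its parameters, and the simulated 120-second completion time are assumptions made for illustration.

```python
import time


def wait_until(condition, timeout_s, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `condition` until it returns True or `timeout_s` seconds elapse.

    Returns True on success and False on timeout. `clock` and `sleep`
    are injectable so the helper can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


class FakeClock:
    """Simulated clock: sleeping advances time instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


# Simulate certificates that are all distributed after 120 seconds.
fake = FakeClock()
done = wait_until(lambda: fake.now >= 120, timeout_s=60, interval_s=5,
                  clock=fake.time, sleep=fake.sleep)
print(done)
```

In this simulation a 60-second timeout gives up before the condition becomes true, which is the kind of failure a longer timeout avoids. A longer timeout does not slow the test down when the condition is met early, since polling stops as soon as it succeeds.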

Signed-off-by: Bart Smykla <[email protected]>
bartsmykla force-pushed the fix/configure-reachable-backends branch from ce73ed2 to 53bdca4 on January 28, 2025 13:23
This logic is not used, but was helpful when I was making sure locally
that there is no huge difference between reachable services with legacy
`kuma.io/service` labels and reachable backends with `MeshServices`

Signed-off-by: Bart Smykla <[email protected]>