compute: metrics for stuck/failing getpage requests to alert on pageserver unavailability #10327

Open
jcsp opened this issue Jan 9, 2025 · 0 comments
Labels
c/compute Component: compute, excluding postgres itself c/storage Component: storage


jcsp commented Jan 9, 2025

We should have a general alert that fires when a compute can't reach a pageserver for too long. This would catch issues like #10309, where the pageserver is hung internally and, without knowing which clients couldn't get what they wanted, it's not obvious exactly what's wrong or which endpoint is affected.

Ideas:

  • metric for the longest-waiting request currently in flight
  • metrics for ratio of failed/succeeded requests in recent history
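
A rough sketch of how the two ideas above could be tracked together, written in Python purely for illustration. The real implementation would live in the compute code, and every name here (`GetPageRequestTracker`, `start`, `finish`, the 60-second default window) is hypothetical rather than an existing API.

```python
import time
from collections import deque


class GetPageRequestTracker:
    """Hypothetical tracker for getpage requests issued by one compute."""

    def __init__(self, window_seconds=60.0, clock=time.monotonic):
        self._clock = clock            # injectable so the logic can be tested
        self._window = window_seconds  # how much "recent history" to keep
        self._in_flight = {}           # request id -> start time
        self._completed = deque()      # (finish time, succeeded) in time order
        self._next_id = 0

    def start(self):
        """Record that a request was sent; returns an id to pass to finish()."""
        req_id = self._next_id
        self._next_id += 1
        self._in_flight[req_id] = self._clock()
        return req_id

    def finish(self, req_id, succeeded):
        """Record that a request completed, successfully or not."""
        self._in_flight.pop(req_id)
        self._completed.append((self._clock(), succeeded))

    def longest_in_flight_seconds(self):
        """Age of the oldest request still waiting, or 0.0 if none are."""
        if not self._in_flight:
            return 0.0
        return self._clock() - min(self._in_flight.values())

    def failure_ratio(self):
        """Fraction of requests completed within the window that failed."""
        cutoff = self._clock() - self._window
        # Drop completions that have aged out of the window.
        while self._completed and self._completed[0][0] < cutoff:
            self._completed.popleft()
        if not self._completed:
            return 0.0
        failed = sum(1 for _, ok in self._completed if not ok)
        return failed / len(self._completed)
```

The two values cover different cases: a hung pageserver never completes requests, so only the in-flight age would move, while a pageserver that returns errors promptly would show up in the failure ratio instead. Both could be exported as gauges and alerted on with thresholds.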

@ololobus we might need your team's help/advice to build this
