DM-47889: Prevent DB connection pool exhaustion in Butler server #1124
Conversation
Force-pushed from 4ac7a64 to c85c332
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.

@@           Coverage Diff           @@
##             main    #1124   +/-   ##
=======================================
  Coverage   89.44%   89.45%
=======================================
  Files         366      366
  Lines       48614    48684    +70
  Branches     5890     5897     +7
=======================================
+ Hits        43485    43548    +63
- Misses       3717     3721     +4
- Partials     1412     1415     +3

☔ View full report in Codecov by Sentry.
# How long we ask callers to wait before trying their query again.
# The hope is that they will bounce to a less busy replica, so we don't want
# them to wait too long.
_QUERY_RETRY_SECONDS = 5
Are we planning to do randomized exponential backoff at some point in the future? Can we tell if this is the 100th time a client has attempted a retry?
We don't really want to go exponential (at least by default), since most consumers of the Butler can't tolerate a request hanging out indefinitely. This is just supposed to buy us enough time for auto-scaling to get some more replicas booted up. When this case is triggered the scarce resource is threads and database connections, so I'm not really worried about the minimal amount of CPU and network needed to tell the caller to go away once every 5 seconds.
I added a cap on the retry time on the client side. Currently we don't have a way to track what a client is up to on the server side, but we will eventually have a Redis instance (or Russ will) for more elaborate rate limiting.
Set Postgres connection pool size to match the FastAPI thread pool size. Testing showed that performance degraded horrendously once we got into the "overflow" connections that were allowed by the previous configuration.
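As a rough sketch of the kind of pool configuration described here (the connection string, the thread-pool size of 40, and the timeout are illustrative assumptions, not the Butler server's actual settings), SQLAlchemy lets the pool size be pinned to the worker thread count with overflow disabled:

from sqlalchemy import create_engine

# Illustrative: FastAPI runs sync endpoints on an AnyIO thread pool that
# defaults to 40 worker threads; the real deployment may configure this.
THREAD_POOL_SIZE = 40

engine = create_engine(
    "postgresql+psycopg2://butler@db-host/registry",  # hypothetical DSN
    pool_size=THREAD_POOL_SIZE,  # one pooled connection per worker thread
    max_overflow=0,    # never open "overflow" connections beyond the pool
    pool_timeout=30,   # seconds to wait for a free connection before failing
)

With max_overflow=0, a burst of requests waits briefly for a pooled connection instead of opening the extra connections that degraded database performance in testing.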
The server now tracks the number of streaming queries in progress and rejects query requests that would exceed a limit of 25 concurrent queries, returning an HTTP 503. The client now retries when it receives a 503 or 429 response that includes a Retry-After header. This allows the server to avoid exhausting its thread pool and database connection pool with long-running queries.
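A minimal sketch of that kind of guard in a FastAPI endpoint, assuming the 25-query limit and 5-second retry hint quoted above (the route, helper names, and response format are hypothetical, not the Butler server's actual implementation):

import threading

from fastapi import FastAPI, Response
from fastapi.responses import StreamingResponse

app = FastAPI()

_MAX_STREAMING_QUERIES = 25  # concurrent long-running queries allowed
_QUERY_RETRY_SECONDS = 5     # retry hint sent back to rejected callers
_active_queries = 0
_counter_lock = threading.Lock()


def _try_acquire_query_slot() -> bool:
    # Reserve a slot for a streaming query, or report that the server is full.
    global _active_queries
    with _counter_lock:
        if _active_queries >= _MAX_STREAMING_QUERIES:
            return False
        _active_queries += 1
        return True


def _release_query_slot() -> None:
    global _active_queries
    with _counter_lock:
        _active_queries -= 1


def _execute_query():
    # Stand-in for the real streaming query; yields serialized result rows.
    yield b'{"row": 1}\n'


@app.get("/query")  # hypothetical route
def run_query(response: Response):
    if not _try_acquire_query_slot():
        # Too many long-running queries in flight: reject with 503 plus a
        # Retry-After header so the caller can bounce to a less busy replica.
        response.status_code = 503
        response.headers["Retry-After"] = str(_QUERY_RETRY_SECONDS)
        return {"error": "server is busy, try again later"}

    def stream():
        try:
            yield from _execute_query()
        finally:
            # The slot is held for the lifetime of the stream, not just the
            # handler call, so release it when the generator finishes.
            _release_query_slot()

    return StreamingResponse(stream(), media_type="application/x-ndjson")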
If after a couple of minutes the server is still not able to handle our query, we should just bail so that a Butler server failure doesn't cascade indefinitely into other services trying to use the Butler.
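And a sketch of the client side, assuming a 5-second fallback wait and a roughly two-minute overall cap before giving up (the constants and the use of the requests library are illustrative, not the actual Butler client code):

import time

import requests

_RETRYABLE_STATUSES = {429, 503}
_DEFAULT_RETRY_SECONDS = 5      # fallback when Retry-After is not a number
_MAX_TOTAL_RETRY_SECONDS = 120  # bail out after ~2 minutes


def get_with_retries(url: str) -> requests.Response:
    # Fetch ``url``, retrying on 429/503 responses that carry Retry-After,
    # but give up once the total time spent waiting would exceed the cap.
    deadline = time.monotonic() + _MAX_TOTAL_RETRY_SECONDS
    while True:
        response = requests.get(url)
        if response.status_code not in _RETRYABLE_STATUSES:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is None:
            # Only retry when the server explicitly asks us to.
            return response
        wait = float(retry_after) if retry_after.isdigit() else _DEFAULT_RETRY_SECONDS
        if time.monotonic() + wait > deadline:
            # The server has been overloaded for too long; surface the error
            # rather than letting other services hang on the Butler indefinitely.
            response.raise_for_status()
        time.sleep(wait)

Keeping the retry interval short and fixed (rather than exponential) matches the intent described above: buy enough time for auto-scaling to add replicas without leaving callers hanging indefinitely.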
Force-pushed from d433a5d to 8ea635f
Adjusted the database connection pool parameters and added a limit to the number of long-running queries that the server will execute simultaneously.
If a client tries to run a long-running query while the server is overloaded, the request will be rejected with an HTTP 503 and the client will retry later. This also sets up the retry logic for HTTP 429, which will be used in the future for other kinds of rate limiting.
Checklist
- doc/changes
- configs/old_dimensions