
DM-47889: Prevent DB connection pool exhaustion in Butler server #1124

Merged · 3 commits from tickets/DM-47889 into main on Dec 5, 2024

Conversation

dhirving (Contributor) commented Dec 3, 2024

Adjusted the database connection pool parameters and added a limit to the number of long-running queries that the server will execute simultaneously.

If a client tries to run a long-running query while the server is overloaded, the request will be rejected with an HTTP 503 and the client will retry later. This also sets up the retry logic for HTTP 429, which will be used in the future for other kinds of rate limiting.
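
For illustration only, here is a minimal sketch of the kind of server-side limiting described above. The helper name `streaming_query_slot`, the counter, and the constant values are assumptions for this sketch, not the PR's actual identifiers (the real logic lives in `_query_streaming.py`):

```python
# Illustrative sketch only -- names like MAX_STREAMING_QUERIES and
# _active_queries are hypothetical, not copied from the PR.
import threading
from contextlib import contextmanager

from fastapi import HTTPException

MAX_STREAMING_QUERIES = 25   # limit mentioned in the PR commits
_QUERY_RETRY_SECONDS = 5     # how long we ask the client to wait before retrying
_active_queries = 0
_lock = threading.Lock()


@contextmanager
def streaming_query_slot():
    """Reserve one of the limited streaming-query slots, or reject with 503."""
    global _active_queries
    with _lock:
        if _active_queries >= MAX_STREAMING_QUERIES:
            # Tell the client to come back later; ideally its retry lands on a
            # less busy replica.
            raise HTTPException(
                status_code=503,
                headers={"Retry-After": str(_QUERY_RETRY_SECONDS)},
            )
        _active_queries += 1
    try:
        yield
    finally:
        with _lock:
            _active_queries -= 1
```

A query handler would wrap its long-running work in `with streaming_query_slot():`, so the counter is released even if the query fails.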

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

dhirving force-pushed the tickets/DM-47889 branch 2 times, most recently from 4ac7a64 to c85c332, on December 3, 2024 18:22
codecov bot commented Dec 3, 2024

Codecov Report

Attention: Patch coverage is 92.13483% with 7 lines in your changes missing coverage. Please review.

Project coverage is 89.45%. Comparing base (f8303a5) to head (8ea635f).
Report is 4 commits behind head on main.

✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| .../lsst/daf/butler/remote_butler/_http_connection.py | 86.66% | 2 Missing and 2 partials ⚠️ |
| tests/test_server.py | 92.85% | 1 Missing and 1 partial ⚠️ |
| .../remote_butler/server/handlers/_query_streaming.py | 95.23% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1124   +/-   ##
=======================================
  Coverage   89.44%   89.45%           
=======================================
  Files         366      366           
  Lines       48614    48684   +70     
  Branches     5890     5897    +7     
=======================================
+ Hits        43485    43548   +63     
- Misses       3717     3721    +4     
- Partials     1412     1415    +3     


dhirving marked this pull request as ready for review on December 4, 2024 00:13
# How long we ask callers to wait before trying their query again.
# The hope is that they will bounce to a less busy replica, so we don't want
# them to wait too long.
_QUERY_RETRY_SECONDS = 5
Member commented:
Are we planning to do randomized exponential backoff at some point in the future? Can we tell if this is the 100th time a client has attempted a retry?

dhirving (Contributor, Author) replied:
We don't really want to go exponential (at least by default), since most consumers of the Butler can't tolerate a request hanging out indefinitely. This is just supposed to buy us enough time for auto-scaling to get some more replicas booted up. When this case is triggered the scarce resource is threads and database connections, so I'm not really worried about the minimal amount of CPU and network needed to tell the caller to go away once every 5 seconds.

I added a cap on the retry time on the client side. Currently we don't have a way to track what a client is up to on the server side, but we will eventually have a redis instance (or Russ will) for more elaborate rate limiting.
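
A rough sketch of the client-side policy discussed in this thread (retry on 429/503 when a Retry-After header is present, and cap the total wait). The function name, constants, and use of httpx are illustrative assumptions, not the PR's actual retry code in `_http_connection.py`:

```python
# Hypothetical sketch of the retry policy described above; the names and the
# total-wait cap are illustrative, not values taken from the PR.
import time

import httpx

_MAX_TOTAL_RETRY_SECONDS = 120  # "a couple of minutes", then bail out


def send_with_retry(client: httpx.Client, request: httpx.Request) -> httpx.Response:
    """Send a request, retrying on 429/503 responses that carry Retry-After."""
    waited = 0.0
    while True:
        response = client.send(request)
        if response.status_code not in (429, 503):
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is None or waited >= _MAX_TOTAL_RETRY_SECONDS:
            # No hint from the server, or we have waited long enough: give up,
            # so a Butler server outage does not cascade into callers.
            return response
        # Assumes Retry-After is a delay in seconds (it can also be an HTTP date).
        delay = float(retry_after)
        time.sleep(delay)
        waited += delay
```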

Set Postgres connection pool size to match the FastAPI thread pool size.  Testing showed that performance degraded horrendously once we got into the "overflow" connections that were allowed by the previous configuration.
The server now tracks the number of streaming queries in progress, and rejects query requests that exceed a limit of 25 with an HTTP 503.  The client now retries on receiving a 503 or 429 with a Retry-After header.

This will allow the server to avoid exhausting its thread pool and database connection pool with long-running queries.
If after a couple of minutes the server is still not able to handle our query, we should just bail so that a Butler server failure doesn't cascade indefinitely into other services trying to use the Butler.
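
The commit messages above describe the pool-size change; roughly, sizing the connection pool to the thread pool and disabling overflow could look like the sketch below. The constant, DSN, and engine options here are assumptions for illustration, not the server's real configuration:

```python
# Illustrative only: the thread-pool size and engine options are assumptions,
# not values copied from the PR.
from sqlalchemy import create_engine

# FastAPI runs sync endpoints on a fixed-size worker-thread pool; giving the
# connection pool the same size means a worker thread never has to fall back
# to the slow "overflow" connections mentioned in the commit message.
FASTAPI_THREAD_POOL_SIZE = 40  # hypothetical value

engine = create_engine(
    "postgresql+psycopg2://butler@db/registry",  # placeholder DSN
    pool_size=FASTAPI_THREAD_POOL_SIZE,
    max_overflow=0,   # never exceed the sized pool
    pool_timeout=30,  # seconds to wait for a free connection before erroring
)
```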
dhirving merged commit d632886 into main on Dec 5, 2024
19 checks passed
dhirving deleted the tickets/DM-47889 branch on December 5, 2024 17:30