DM-47889: Prevent DB connection pool exhaustion in Butler server #1124
Conversation
Force-pushed from 4ac7a64 to c85c332
Codecov Report
Attention: Patch coverage is
✅ All tests successful. No failed tests found.

@@           Coverage Diff           @@
##             main    #1124   +/-   ##
=======================================
  Coverage   89.44%   89.45%
=======================================
  Files         366      366
  Lines       48614    48684    +70
  Branches     5890     5897     +7
=======================================
+ Hits        43485    43548    +63
- Misses       3717     3721     +4
- Partials     1412     1415     +3

☔ View full report in Codecov by Sentry.
# How long we ask callers to wait before trying their query again.
# The hope is that they will bounce to a less busy replica, so we don't want
# them to wait too long.
_QUERY_RETRY_SECONDS = 5
Are we planning to do randomized exponential backoff at some point in the future? Can we tell if this is the 100th time a client has attempted a retry?
We don't really want to go exponential (at least by default), since most consumers of the Butler can't tolerate a request hanging out indefinitely. This is just supposed to buy us enough time for auto-scaling to get some more replicas booted up. When this case is triggered the scarce resource is threads and database connections, so I'm not really worried about the minimal amount of CPU and network needed to tell the caller to go away once every 5 seconds.
I added a cap on the retry time on the client side. Currently we don't have a way to track what a client is up to on the server side, but we will eventually have a Redis instance (or Russ will) for more elaborate rate limiting.
Set Postgres connection pool size to match the FastAPI thread pool size. Testing showed that performance degraded horrendously once we got into the "overflow" connections that were allowed by the previous configuration.
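As a rough sketch of the kind of pool configuration described here (the connection string, the thread-pool size of 40, and the timeout are illustrative assumptions, not the Butler server's actual settings), SQLAlchemy lets the pool size be pinned to the worker thread count with overflow disabled:

from sqlalchemy import create_engine

# Illustrative: FastAPI runs sync endpoints on an AnyIO thread pool that
# defaults to 40 worker threads; the real deployment may configure this.
THREAD_POOL_SIZE = 40

engine = create_engine(
    "postgresql+psycopg2://butler@db-host/registry",  # hypothetical DSN
    pool_size=THREAD_POOL_SIZE,  # one pooled connection per worker thread
    max_overflow=0,    # never open "overflow" connections beyond the pool
    pool_timeout=30,   # seconds to wait for a free connection before failing
)

With max_overflow=0, a burst of requests waits briefly for a pooled connection instead of opening the extra connections that degraded database performance in testing.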
The server now tracks the number of streaming queries in progress and rejects query requests that would exceed a limit of 25 concurrent queries, returning an HTTP 503. The client now retries when it receives a 503 or 429 response that includes a Retry-After header. This allows the server to avoid exhausting its thread pool and database connection pool with long-running queries.
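A minimal sketch of that kind of guard in a FastAPI endpoint, assuming the 25-query limit and 5-second retry hint quoted above (the route, helper names, and response format are hypothetical, not the Butler server's actual implementation):

import threading

from fastapi import FastAPI, Response
from fastapi.responses import StreamingResponse

app = FastAPI()

_MAX_STREAMING_QUERIES = 25  # concurrent long-running queries allowed
_QUERY_RETRY_SECONDS = 5     # retry hint sent back to rejected callers
_active_queries = 0
_counter_lock = threading.Lock()


def _try_acquire_query_slot() -> bool:
    # Reserve a slot for a streaming query, or report that the server is full.
    global _active_queries
    with _counter_lock:
        if _active_queries >= _MAX_STREAMING_QUERIES:
            return False
        _active_queries += 1
        return True


def _release_query_slot() -> None:
    global _active_queries
    with _counter_lock:
        _active_queries -= 1


def _execute_query():
    # Stand-in for the real streaming query; yields serialized result rows.
    yield b'{"row": 1}\n'


@app.get("/query")  # hypothetical route
def run_query(response: Response):
    if not _try_acquire_query_slot():
        # Too many long-running queries in flight: reject with 503 plus a
        # Retry-After header so the caller can bounce to a less busy replica.
        response.status_code = 503
        response.headers["Retry-After"] = str(_QUERY_RETRY_SECONDS)
        return {"error": "server is busy, try again later"}

    def stream():
        try:
            yield from _execute_query()
        finally:
            # The slot is held for the lifetime of the stream, not just the
            # handler call, so release it when the generator finishes.
            _release_query_slot()

    return StreamingResponse(stream(), media_type="application/x-ndjson")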
If after a couple of minutes the server is still not able to handle our query, we should just bail so that a Butler server failure doesn't cascade indefinitely into other services trying to use the Butler.
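And a sketch of the client side, assuming a 5-second fallback wait and a roughly two-minute overall cap before giving up (the constants and the use of the requests library are illustrative, not the actual Butler client code):

import time

import requests

_RETRYABLE_STATUSES = {429, 503}
_DEFAULT_RETRY_SECONDS = 5      # fallback when Retry-After is not a number
_MAX_TOTAL_RETRY_SECONDS = 120  # bail out after ~2 minutes


def get_with_retries(url: str) -> requests.Response:
    # Fetch ``url``, retrying on 429/503 responses that carry Retry-After,
    # but give up once the total time spent waiting would exceed the cap.
    deadline = time.monotonic() + _MAX_TOTAL_RETRY_SECONDS
    while True:
        response = requests.get(url)
        if response.status_code not in _RETRYABLE_STATUSES:
            return response
        retry_after = response.headers.get("Retry-After")
        if retry_after is None:
            # Only retry when the server explicitly asks us to.
            return response
        wait = float(retry_after) if retry_after.isdigit() else _DEFAULT_RETRY_SECONDS
        if time.monotonic() + wait > deadline:
            # The server has been overloaded for too long; surface the error
            # rather than letting other services hang on the Butler indefinitely.
            response.raise_for_status()
        time.sleep(wait)

Keeping the retry interval short and fixed (rather than exponential) matches the intent described above: buy enough time for auto-scaling to add replicas without leaving callers hanging indefinitely.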
Force-pushed from d433a5d to 8ea635f
Adjusted the database connection pool parameters and added a limit to the number of long-running queries that the server will execute simultaneously.
If a client tries to run a long-running query while the server is overloaded, the request will be rejected with an HTTP 503 and the client will retry later. This also sets up the retry logic for HTTP 429, which will be used in the future for other kinds of rate limiting.
Checklist
- doc/changes
- configs/old_dimensions