pageservers: suspected GetPage pipelining deadlock with timeline shutdown #10309

Open
erikgrinaker opened this issue Jan 8, 2025 · 0 comments

erikgrinaker commented Jan 8, 2025

In regions where GetPage pipelining is enabled, we've seen timeline shutdown (e.g. during migrations) get stuck for several minutes, while GetPage requests get stuck waiting for the timeline to activate. When the shutdown eventually completes, the logs show that the gate was held by a GetPage request. We suspect a deadlock between GetPage request handling and timeline shutdown when pipelining is enabled.

See also internal Slack thread.

2025-01-07T10:46:26.719372Z  INFO request{method=PUT path=/v1/tenant/foo/location_config request_id=8ddcb778-53bb-4ce8-829a-be98d58d6dac}:upsert_location{tenant_id=foo shard_id=0000}:timeline_shutdown{timeline_id=bar shutdown_mode=Hard}: close completed gate=0xffff4ee50950 elapsed_ms=328752
2025-01-07T10:46:26.719317Z  INFO page_service_conn_main{peer_addr=10.9.67.122:41228}:process_query{tenant_id=foo timeline_id=bar}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: kept the gate from closing gate=0xffff4ee50950   
2025-01-07T10:46:26.719354Z  INFO page_service_conn_main{peer_addr=10.9.67.122:41228}:process_query{tenant_id=foo timeline_id=bar}:handle_pagerequests:request:handle_get_page_at_lsn_request_batched{req_lsn=FFFFFFFF/FFFFFFFF}: dropping connection due to shutdown
2025-01-07T10:46:25.848986Z  INFO page_service_conn_main{peer_addr=10.9.67.122:54466}: query handler for 'pagestream_v2 foo bar' failed with expected io error: Timed out waiting 30s for tenant active state. Latest state: None
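
For illustration, here is a minimal tokio sketch of the general shape of such a deadlock: a request task holds the gate open across an await that can only resolve after shutdown, while shutdown waits for the gate to close. All names are hypothetical and the gate is modeled with a plain `Semaphore`; this is not the pageserver's actual `Gate`/`Timeline` code, and not a claim about the concrete root cause.

```rust
// Simplified model of a gate-vs-shutdown circular wait (hypothetical names;
// the "gate" is modeled as a Semaphore, not the pageserver's actual Gate type).
use std::sync::Arc;
use tokio::sync::{oneshot, Semaphore};
use tokio::time::{timeout, Duration};

#[tokio::main]
async fn main() {
    const MAX_REQUESTS: u32 = 16;
    // Each in-flight request holds one permit; "closing the gate" means
    // acquiring all permits, i.e. waiting for all requests to finish.
    let gate = Arc::new(Semaphore::new(MAX_REQUESTS as usize));
    // The request below waits for this signal, which is only sent *after*
    // shutdown completes -- that is the circular dependency.
    let (after_shutdown_tx, after_shutdown_rx) = oneshot::channel::<()>();

    // GetPage-like request handler: holds a gate permit across an await point.
    let request = tokio::spawn({
        let gate = Arc::clone(&gate);
        async move {
            let _guard = gate.acquire().await.unwrap(); // enter the gate
            // Stuck: this only resolves after shutdown, but shutdown is
            // waiting for us to drop `_guard`.
            let _ = after_shutdown_rx.await;
        }
    });

    // Timeline-shutdown-like task: close the gate by acquiring every permit.
    let shutdown = async {
        let _closed = gate.acquire_many(MAX_REQUESTS).await.unwrap();
        // Only once the gate is closed would we signal the request.
        let _ = after_shutdown_tx.send(());
    };

    // Neither side can make progress, so the timeout fires.
    match timeout(Duration::from_secs(1), shutdown).await {
        Ok(()) => println!("gate closed (cannot happen in this model)"),
        Err(_) => println!("shutdown stuck: gate held by an in-flight request"),
    }

    request.abort();
}
```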
@erikgrinaker erikgrinaker added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Jan 8, 2025
@erikgrinaker erikgrinaker added the t/incident Issue type: incident in our service label Jan 8, 2025
problame added a commit that referenced this issue Jan 8, 2025
When we moved throttling up from Timeline::get into page_service,
the throttle wait stopped being sensitive to `Timeline::cancel`, even
though we're holding a Handle and thus a guard that keeps the
`Timeline::gate` open.

This PR rectifies the situation.

Refs

- Found while investigating #10309 (hung detach because gate kept open),
  but not expected to be the root cause of that issue because the
  affected tenants are not being throttled according to their metrics.
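
As a sketch of the pattern this commit describes (hypothetical function and names, not the actual page_service code, and assuming `tokio_util` for the cancellation token), a throttle wait held under the gate can be made cancellation-sensitive by racing it against the timeline's cancel token:

```rust
// Hypothetical sketch: race the throttle delay against a cancellation token,
// so a handler holding the gate does not sit out the full delay while the
// timeline is shutting down.
use tokio::time::{sleep, Duration};
use tokio_util::sync::CancellationToken;

enum Throttled {
    /// The throttle delay elapsed; proceed with the request.
    Proceed,
    /// The timeline is shutting down; abort the request (and release the gate).
    Cancelled,
}

async fn throttle_sensitive_to_cancel(
    delay: Duration,
    cancel: &CancellationToken,
) -> Throttled {
    tokio::select! {
        _ = sleep(delay) => Throttled::Proceed,
        _ = cancel.cancelled() => Throttled::Cancelled,
    }
}
```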
problame added a commit that referenced this issue Jan 9, 2025
…enderWaitsForReceiverToConsume`

Before this PR, there were cases where send() in state
SenderWaitsForReceiverToConsume would never be woken up
by the receiver, because it never registered with `wake_sender`.

Example Scenario 1: we stop polling a send() future A that was waiting
for the receiver to consume. We drop A and create a new send() future B.
B would return Poll::Pending and never register a waker.

Example Scenario 2: a send() future A transitions from HasData
to SenderWaitsForReceiverToConsume. This registers the context X
with `wake_sender`. But before the Receiver consumes the data,
we poll A from a different context Y.
The state is still SenderWaitsForReceiverToConsume, but we wouldn't
register the new context with `wake_sender`.
When the Receiver comes around to consume and `wake_sender.notify()`s,
it wakes the old context X instead of Y.

I found this bug while investigating #10309.
There (in page_service), Scenario 1 does not apply because
we poll the send() future to completion.
I am not certain about Scenario 2; we use tokio::join!. We need to
investigate whether it (or tokio in general) can decide to poll,
possibly spuriously, without an explicit wakeup _AND_ with a changed
context.
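
For illustration, the waker rule behind both scenarios, as a simplified hand-written future (hypothetical types, not the actual duplex-channel code): whenever poll() returns Poll::Pending, it must store the waker of the context polling it *now*, even if the state did not change since the last poll.

```rust
// Simplified illustration of the waker-registration rule behind both
// scenarios (hypothetical types; not the actual duplex-channel code).
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

struct Shared {
    receiver_consumed: bool,
    wake_sender: Option<Waker>,
}

impl Shared {
    /// Receiver side: consume the data and wake whoever last registered
    /// as the sender.
    fn consume(&mut self) {
        self.receiver_consumed = true;
        if let Some(waker) = self.wake_sender.take() {
            waker.wake();
        }
    }
}

/// A send()-like future waiting for the receiver to consume.
struct SendFuture {
    shared: Arc<Mutex<Shared>>,
}

impl Future for SendFuture {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let mut shared = self.shared.lock().unwrap();
        if shared.receiver_consumed {
            return Poll::Ready(());
        }
        // The crucial part: (re-)register the waker of whoever polls us *now*,
        // unconditionally. A buggy version that skips this when the state is
        // already "waiting for receiver" leaves a stale waker behind, so a new
        // send() future (Scenario 1) or a poll from a different context
        // (Scenario 2) is never woken.
        shared.wake_sender = Some(cx.waker().clone());
        Poll::Pending
    }
}
```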
problame added a commit that referenced this issue Jan 9, 2025
Wondered about the case covered here while investigating #10309.