-
Notifications
You must be signed in to change notification settings - Fork 467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pageservers: suspected GetPage pipelining deadlock with timeline shutdown #10309
Labels
c/storage/pageserver
Component: storage: pageserver
t/bug
Issue Type: Bug
t/incident
Issue type: incident in our service
Comments
erikgrinaker
added
c/storage/pageserver
Component: storage: pageserver
t/bug
Issue Type: Bug
labels
Jan 8, 2025
problame
added a commit
that referenced
this issue
Jan 8, 2025
When we moved throttling up from Timeline::get into page_service, we stopped being sensitive to `Timeline::cancel`, even though we're holding a Handle and thus a guard on the `Timeline::gate` open. This PR rectifies the situation. Refs - Found while investigating #10309 (hung detach because gate kept open), but not expected to be the root cause of that issue because the affected tenants are not being throttled according to their metrics.
problame
added a commit
that referenced
this issue
Jan 8, 2025
problame
added a commit
that referenced
this issue
Jan 9, 2025
…enderWaitsForReceiverToConsume` Before this PR, there were cases where send() in state SenderWaitsForReceiverToConsume would never be woken up by the receiver, because it never registered with `wake_sender`. Example Scenario 1: we stop polling a send() future A that was waiting for the receiver to consume. We drop A and create a new send() future B. B would return Poll::Pending and never regsister a waker. Example Scenario 2: a send() future A transitions from HasData to SenderWaitsForReceiverToConsume. This registers the context X with `wake_sender`. But before the Receiver consumes the data, we poll A from a different context Y. The state is still SenderWaitsForReceiverToConsume, but we wouldn't register the new context with `wake_sender`. When the Receiver comes around to consume and `wake_sender.notify()`s, it wakes the old context X instead of Y. I found this bug while investigating #10309. There (in page_service), Scenario 1 does not apply because we poll the send() future to completion. I am not certain about Scenario 2; we use tokio::join!. Need to investigate whether it (or tokio, generally), can decide to poll, possibly spuriously, without an explicit wakeup _AND_ a changed context.
problame
added a commit
that referenced
this issue
Jan 9, 2025
Wondered about the case covered here while investigating #10309.
problame
added a commit
that referenced
this issue
Jan 9, 2025
Wondered about the case covered here while investigating #10309.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
c/storage/pageserver
Component: storage: pageserver
t/bug
Issue Type: Bug
t/incident
Issue type: incident in our service
In regions where GetPage pipelining is enabled, we've seen timeline shutdown (e.g. used during migrations) get stuck for several minutes, while GetPage requests get stuck waiting for the timeline to activate. When the shutdown succeeds, we see that the gate was held by a GetPage request. We suspect that there is a deadlock between GetPage requests and timeline shutdown when pipelining is enabled.
See also internal Slack thread.
The text was updated successfully, but these errors were encountered: