pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220
Comments
I don't think moving the timeline back out of the Stopping state would be good, i.e. Stopping should be a one-way path. Otherwise we'd have to implement an "initialization light" operation and think about how all the different timeline components deal with initialization. So, more or less, once the timeline is in the Stopping state we should complete the offload operation, and then unoffload the timeline when the archival config request is retried. As the offload operation is triggered by the compaction task, we'll need some mechanism that continues the offload if it errors out: it's untenable to fail all archival config requests until compaction gets to the timeline again. Actionable items:
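The one-way lifecycle described above can be sketched as a small state machine. This is only an illustration: `Lifecycle`, `ArchivalOutcome`, and the method names are hypothetical and are not the pageserver's actual `TimelineState` API. The assumption is that the only way out of `Stopping` is forward to `Offloaded`, and that a retried archival config request unoffloads from there.

```rust
/// Hypothetical, simplified lifecycle; not the pageserver's real `TimelineState`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lifecycle {
    Active,
    Stopping,
    Offloaded,
}

/// Hypothetical result of an "unarchive" archival config request.
#[derive(Debug, PartialEq, Eq)]
enum ArchivalOutcome {
    /// The timeline is loaded and active; nothing to undo.
    AlreadyActive,
    /// An offload is in flight; the caller should retry once it completes.
    RetryLater,
    /// The timeline had been offloaded and was loaded again.
    Unoffloaded,
}

impl Lifecycle {
    /// Start shutting down. Has no effect unless the timeline is active.
    fn begin_stopping(&mut self) {
        if *self == Lifecycle::Active {
            *self = Lifecycle::Stopping;
        }
    }

    /// Complete the offload. This is the only transition out of `Stopping`.
    /// Returns false if there was no offload in progress.
    fn finish_offload(&mut self) -> bool {
        if *self == Lifecycle::Stopping {
            *self = Lifecycle::Offloaded;
            true
        } else {
            false
        }
    }

    /// Handle an unarchive request without ever reversing `Stopping`.
    fn on_unarchive_request(&mut self) -> ArchivalOutcome {
        match *self {
            Lifecycle::Active => ArchivalOutcome::AlreadyActive,
            // Do not back out of the shutdown; ask the caller to retry.
            Lifecycle::Stopping => ArchivalOutcome::RetryLater,
            Lifecycle::Offloaded => {
                // In a real system this would build a fresh timeline object
                // from the offloaded state rather than revive the old one.
                *self = Lifecycle::Active;
                ArchivalOutcome::Unoffloaded
            }
        }
    }
}
```

In this sketch, a request that arrives while the timeline is `Stopping` is never answered by reversing the shutdown, which is why something still has to drive the offload to completion for the retry to succeed.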
As the compaction loop is per-tenant, it's probably not a good idea to block compaction of other timelines on this. However, compaction doesn't sleep if there is still work left to do, so if there is an error during offloading, maybe we could piggyback on that mechanism. Of course, we should make sure that actual compaction doesn't get in the way.
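A minimal sketch of that piggybacking idea, under stated assumptions: `TimelineWork`, `NextStep`, and `compaction_pass` are hypothetical names, and `try_offload` is a stand-in closure for the real offload operation. A failed offload is simply reported as "work left", so the per-tenant loop runs again without sleeping, while the other timelines are still processed in the same pass.

```rust
use std::collections::BTreeMap;
use std::time::Duration;

/// Hypothetical per-timeline bookkeeping for one tenant.
struct TimelineWork {
    compaction_pending: bool,
    offload_pending: bool,
}

/// What the per-tenant loop should do after a pass.
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    RunAgainImmediately,
    Sleep(Duration),
}

/// One pass over all timelines of a tenant.
/// `try_offload` stands in for the real offload and returns `Err` on failure.
fn compaction_pass<F>(
    timelines: &mut BTreeMap<String, TimelineWork>,
    mut try_offload: F,
    period: Duration,
) -> NextStep
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut work_left = false;
    for (id, t) in timelines.iter_mut() {
        // Actual compaction goes first so offload retries do not get in its way.
        if t.compaction_pending {
            t.compaction_pending = false; // placeholder for real compaction
        }
        if t.offload_pending {
            match try_offload(id.as_str()) {
                Ok(()) => t.offload_pending = false,
                // A failed offload counts as remaining work. We keep going,
                // so other timelines are not blocked by this one.
                Err(_) => work_left = true,
            }
        }
    }
    if work_left {
        NextStep::RunAgainImmediately
    } else {
        NextStep::Sleep(period)
    }
}
```

A real loop would likely want a backoff or retry cap so that a persistently failing offload does not turn into a busy loop; that is left out here.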
Hmm, originally I was leaning towards adding a new ...
I am hesitant to layer more locks on here, but the most direct solution to this particular bug might be an "offloading lock" that is taken by the part of compaction that considers offloading a timeline, and that is also taken during archival config changes. I agree that shutting down a timeline should be a one-way street: we definitely don't want to try backing out of a shutdown when we receive an archival request. Holding a TimelineState in a watch<> is a bit of a doomed approach, as it will always be racy with respect to the state of other parts of the Timeline. The main utility of the TimelineState is in waiting for a timeline to become active; we could probably narrow it to just a condition variable for any code that waits for activation.
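To make the "offloading lock" idea concrete, here is a rough sketch. It assumes a single mutex that guards both the archived flag and whether the timeline is loaded; `TimelineSlot`, `Residency`, `maybe_offload`, and `set_archived` are hypothetical names. It uses `std::sync::Mutex` for brevity, whereas async code would more likely need an async-aware lock.

```rust
use std::sync::Mutex;

/// Whether the timeline is loaded in memory or offloaded (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Residency {
    Loaded,
    Offloaded,
}

struct Inner {
    archived: bool,
    residency: Residency,
}

/// Hypothetical holder for a timeline plus its "offloading lock".
struct TimelineSlot {
    inner: Mutex<Inner>,
}

impl TimelineSlot {
    fn new(archived: bool) -> Self {
        TimelineSlot {
            inner: Mutex::new(Inner { archived, residency: Residency::Loaded }),
        }
    }

    /// Called from compaction: decide and perform the offload under the lock.
    /// Returns true if the timeline was offloaded by this call.
    fn maybe_offload(&self) -> bool {
        let mut guard = self.inner.lock().unwrap();
        if guard.archived && guard.residency == Residency::Loaded {
            // Shutdown and offload would happen here, still under the lock.
            guard.residency = Residency::Offloaded;
            true
        } else {
            false
        }
    }

    /// Called for archival config changes: takes the same lock, so it sees
    /// the timeline either before or after an offload, never mid-shutdown.
    fn set_archived(&self, archived: bool) -> Residency {
        let mut guard = self.inner.lock().unwrap();
        guard.archived = archived;
        if !archived && guard.residency == Residency::Offloaded {
            // Unoffload: load the timeline again from its offloaded state.
            guard.residency = Residency::Loaded;
        }
        guard.residency
    }
}
```

The cost of this design is that an archival config request waits for the whole offload to finish, which is the extra lock layering the comment is wary of.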
via INC-357.
https://neondb.slack.com/archives/C085L8N9B4P