Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220

Open
jcsp opened this issue Dec 20, 2024 · 4 comments · May be fixed by #10305
Open

pageserver: compaction racing with archival config can leave Timeline stuck in Stopping. #10220

jcsp opened this issue Dec 20, 2024 · 4 comments · May be fixed by #10305
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@jcsp
Copy link
Collaborator

jcsp commented Dec 20, 2024

via INC-357.

Compaction races with modifying the timeline's archival config.
offload_timeline gets called for a timeline that was recently archived, errors out if the timeline isn't archived. However, by that point it has already called DeleteTimelineFlow::prepare, which puts the timeline into Stopping state.
So that's how it's stuck right now: the Timeline exists but is in Stopping state.

https://neondb.slack.com/archives/C085L8N9B4P

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Dec 20, 2024
@arpad-m
Copy link
Member

arpad-m commented Dec 20, 2024

I think that moving the timeline out from a Stopping state would not be good. I.e. Stopping should be an irreversible path to walk. Otherwise we'd have to implement a "initialization light" operation, and think about how all the different timeline components deal with initialization.

So more or less, what should happen is that once the timeline is in Stopping state, we should be completing the offload operation, and then unoffload it when the archival config request is retried. As the offload operation is triggered by the compaction task, we'll need some mechanism that continues the offload once that errors out: it's untenable to error on all archival config requests until compaction gets to the timeline again.

Actionable items:

  • We need some thread-safe switch that both timeline offloading and timeline need to flip to "their" direction so that there is no race going on. This would fix the bug at hand. For offloading, it could be a flag in the Stopping state maybe, as well as a new IsUnarchiving state, idk. Or we add it somewhere to the upload queue and have it live under its locks, idk.
  • We should also think about retries of offloading (issued by compaction). if it fails, it might leave the timeline in a stopping state, so waiting until the compaction loop gets to it again isn't good. Maybe we should spawn a task that retries indefinitely? Or have the compaction loop just have a sub-loop that retries offloading until completion? What if it errors each time?
  • We should also think about retries of archival config: say a timeline gets put into IsUnarchiving state: is there some running task that ensures this runs to completion? Or do we want to be dependent on users to retry unarchival, and not just giving up eventually? I think latter is where we want to move deletion eventually, so we can probably just demand this for unarchival as well: if an unarchive operation is started, we expect it to be retried until completion.

@arpad-m
Copy link
Member

arpad-m commented Dec 20, 2024

think about retries of offloading (issued by compaction)

as the compaction loop is per-tenant, it's probably not a good idea to block compactions of other timelines on this. however, compaction doesn't sleep if there is still work left to do. so maybe if there is an error during offloading, we could make it piggy back on that mechanism. of course, we should make sure that actual compaction doesn't get into the way.

@arpad-m
Copy link
Member

arpad-m commented Jan 7, 2025

Hmm originally I was leaning towards adding a new TimelineState, but then discovered that the way we prevent illegal transitions is still racy: #10297. So either we fix the raciness, or do the upload queue fix.

@jcsp
Copy link
Collaborator Author

jcsp commented Jan 7, 2025

I am hesitant to layer more locks on here, but the most direct solution to this particular bug might be to have an "offloading lock" that is taken by the part of compaction that considers offloading a timeline, and then also to take this lock during archival config changes.

I agree that shutting down a timeline should be a one-way street: we definitely don't want to try backing out from a shutdown when we receive an archival request.

Holding a TimelineState in a watch<> is a bit of a doomed approach when it will always be racy wrt the state of other parts of the Timeline. The main utility of the TimelineState is in waiting for a timeline to be active, we could probably narrow it to just a condition variable for any code that waits for activation.

@arpad-m arpad-m linked a pull request Jan 8, 2025 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants