REP-5470 Fix hang when handler thread hits an error. #86

Merged: 3 commits on Jan 28, 2025

Collaborator

@FGasper FGasper commented Jan 23, 2025

This fixes a hang that previously happened if migration-verifier received an event that it couldn’t handle.

Specific changes:

  • The event optype and namespace presence are checked right after decoding. Previously this happened in the handler, which was the immediate cause of the hang that this changeset fixes. Doing this check makes the failure faster and more prominent.
  • If the handler fails, it stores its error, and the reader checks for that error.
  • Many debug-level logs are added.
  • Context cancellations/timeouts are reported with their Cause.
  • WritesOff() is rewritten to make it easier to reason about the mutex handling’s correctness.
  • The generation-end error check no longer mistakenly assumes that, if the worker-controller thread finished without error, then all is well.

This also removes a flapping assertion in TestStartAtTimeNoChanges.

@FGasper FGasper requested review from tdq45gj and autarch January 23, 2025 03:31
Collaborator

@tdq45gj tdq45gj left a comment

LGTM % two small things.

if err = verifier.waitForChangeStream(ctx, csr); err != nil {
	return errors.Wrapf(
		err,
		"failed to close %s",
Collaborator

How is the error related to closing a change stream reader?

Collaborator Author

I’ll reword.

verifier.SetSrcNamespaces([]string{db.Name() + ".mycoll"})
verifier.SetDstNamespaces([]string{db.Name() + ".mycoll"})
verifier.SetNamespaceMap()
// verifier.SetVerifyAll(true)
Collaborator

why is this line commented out?

Collaborator Author

Cruft. Removing.

Collaborator

@autarch autarch left a comment

This is a fairly large and complex diff for a codebase I don't understand very well. It looks okay, but there's a lot going on here. It seems like there's a lot more changes than the bare minimum needed to prevent the hang. Did you do some significant refactoring as well?

Would it be possible to split this up into multiple PRs?

//
// NB: The returned error wraps both the context’s original error
// *and* the error’s cause.
func WrapCtxErrWithCause(ctx context.Context) error {
Collaborator

Is there any reason this isn't the same code as ctxutil.Err from mongo-go/ctxutil?

Collaborator Author

None in particular; I can copy that over.

}
}

// This will prevent the reader from hanging.
Collaborator

I think it'd be good to say why it will prevent it from hanging.

Collaborator Author

done

internal/verifier/change_stream.go (resolved)
Collaborator Author

FGasper commented Jan 24, 2025

> It seems like there's a lot more changes than the bare minimum needed to prevent the hang.

I left a lot of debug logging in because it seems potentially useful and unlikely to pollute logs (excessively, anyhow).

> Did you do some significant refactoring as well?

No, actually. There’s some renaming (e.g., error to readerError), and maybe what you’re seeing is the moving of the optype validation to sooner in the change stream workflow, but I wouldn’t call anything here a significant refactor.

> Would it be possible to split this up into multiple PRs?

It seems a mite excessive for a <1k-line PR, most of whose changes are logging, but of course I wrote the changes. :)

I can go back over this and make discrete commits.

@FGasper FGasper requested a review from tdq45gj January 24, 2025 21:20
@FGasper FGasper requested a review from autarch January 24, 2025 21:43
Collaborator Author

FGasper commented Jan 24, 2025

@autarch I’ve separated this into discrete commits.

Collaborator

@tdq45gj tdq45gj left a comment

LGTM

Collaborator

autarch commented Jan 24, 2025

> > Would it be possible to split this up into multiple PRs?
>
> It seems a mite excessive for a <1k-line PR, most of whose changes are logging, but of course I wrote the changes. :)
>
> I can go back over this and make discrete commits.

I think the reason I'm struggling with this is that I don't particularly understand the state of the code before the PR. So this seems like a fairly large change to review in that context.

I think discrete commits would make this a lot easier for me to review. I did take a look at the commit history but it didn't look amenable to a commit-by-commit review, which is part of why I asked if this could be broken up differently.

Collaborator

@autarch autarch left a comment

LGTM!

@FGasper FGasper merged commit 1512de0 into main Jan 28, 2025
98 checks passed
@FGasper FGasper deleted the REP-5470-ci-hangs branch January 28, 2025 16:33