[BugFix] abort compaction task properly during shutdown #55503

Open: wants to merge 1 commit into main


kevincai
Contributor

Why I'm doing:

What I'm doing:

Fixes #issue

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This pr needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport pr

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

@kevincai kevincai requested a review from a team as a code owner January 28, 2025 08:49
@kevincai kevincai force-pushed the abort-compaction-during-shutdown branch from 0dac0f6 to 1f09a1b Compare January 28, 2025 14:02
    }
}

void CompactionScheduler::compact(::google::protobuf::RpcController* controller, const CompactRequest* request,
                                  CompactResponse* response, ::google::protobuf::Closure* done) {
    if (_stopped) {
        brpc::ClosureGuard guard(done);
        auto st = Status::Aborted("Compaction shutdown in progress!");
Contributor

BE/CN shutdown

Contributor Author

done

@@ -197,6 +206,9 @@ void CompactionScheduler::compact(::google::protobuf::RpcController* controller,
    }
    // initialize last check time, compact request is received right after FE sends it, so consider it valid now
    cb->set_last_check_time(time(nullptr));
    // FIXME: be noticed that even the `_stopped` is checked in the beginning, it is still possible that the scheduler
    // turns to stop status after the check and before this point. In case of the conflict happens, the contexts_vec
    // may never be processed which leads to rpc hanging without response.
Contributor

Shouldn't this be handled by abort_all()?

Contributor

So the comment can be adjusted.

Contributor Author

No problem. There is a gap between abort_all() and compact(). Consider threads A and B running in the following sequence:

1. thread-A: compact() executes up to line 183
2. thread-B: stop()
3. thread-A: runs to line 212

So abort_all() has already finished, yet new contexts are still pushed into the task queue after it.
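
The three-step interleaving above can be replayed deterministically in a single thread. The following is only an illustrative sketch: `ToyScheduler`, its members, and `replay_race` are hypothetical names, not the actual `CompactionScheduler` code, and the queue holds plain task ids instead of compaction contexts.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical, simplified stand-in for the scheduler (not the StarRocks class).
struct ToyScheduler {
    bool stopped = false;
    std::deque<int> queue;     // pending task ids
    std::vector<int> aborted;  // ids that were answered with "aborted"

    // Early check done at the top of compact().
    bool passes_stop_check() const { return !stopped; }

    // Enqueue step done near the end of compact().
    void enqueue(int task_id) { queue.push_back(task_id); }

    // stop(): mark stopped, then abort whatever is queued at this moment.
    void stop() {
        stopped = true;
        while (!queue.empty()) {
            aborted.push_back(queue.front());
            queue.pop_front();
        }
    }
};

// Replays the interleaving and returns how many tasks remain queued
// after the abort pass, i.e. tasks nobody will ever respond to.
inline std::size_t replay_race() {
    ToyScheduler s;
    bool ok = s.passes_stop_check();  // 1. thread A passes the stop check
    s.stop();                         // 2. thread B stops and aborts the (empty) queue
    if (ok) s.enqueue(42);            // 3. thread A enqueues after the abort pass
    return s.queue.size();
}
```

In this toy model `replay_race()` leaves one task in the queue, which corresponds to the hanging RPC described in the FIXME comment.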

Contributor

I think this can be handled by adding a mutex in the compaction scheduler?

Contributor Author

done
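
One way to read the mutex suggestion in this thread is to make the stop check and the enqueue a single critical section, so that stop() cannot slip in between them. The sketch below assumes that reading; `GuardedScheduler`, `submit`, and `pending` are hypothetical names, and the change actually made in the PR may differ.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Hypothetical sketch; names are illustrative and do not come from the PR.
class GuardedScheduler {
public:
    // Checks the stop flag and enqueues under one lock.
    // Returns false when shutdown has started; the caller would then
    // answer the RPC with an Aborted status instead of queueing work.
    bool submit(int task_id) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_stopped) return false;
        _queue.push_back(task_id);
        return true;
    }

    // Sets the stop flag and drains the queue under the same lock.
    // Returns the ids of the tasks that were aborted.
    std::vector<int> stop() {
        std::lock_guard<std::mutex> lock(_mutex);
        _stopped = true;
        std::vector<int> aborted(_queue.begin(), _queue.end());
        _queue.clear();
        return aborted;
    }

    std::size_t pending() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _queue.size();
    }

private:
    mutable std::mutex _mutex;
    bool _stopped = false;
    std::deque<int> _queue;
};
```

With this shape, a submit() that runs before stop() is drained by stop(), and a submit() that runs after it is rejected, so no task can be queued without a later response.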

@@ -432,6 +459,21 @@ Status CompactionScheduler::abort(int64_t txn_id) {
    return Status::NotFound(fmt::format("no compaction task with txn id {}", txn_id));
}

void CompactionScheduler::abort_all() {
    for (int i = 0; i < _task_queues.task_queue_size(); ++i) {
Contributor

There might be a concurrency problem here: task_queue_size might change during the loop.

Contributor Author

In theory, abort_all is expected to be called only during the shutdown procedure, after all the services have already shut down. Nevertheless, the case is worth handling.

Contributor Author

On a second look at the implementation, it is safe even if task_queue_size changes, because the try_get() interface handles an out-of-range index i if the number of queues shrinks. As for expansion of the queues, it should be fine as long as no tasks can be queued after the scheduler is stopped (a separate issue to address).
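
The argument about a shrinking queue count can be illustrated with a bounds-tolerant accessor. This is a minimal sketch under assumptions: `ToyTaskQueues`, its `try_get` signature, and `drain_all` are hypothetical and only mimic the behavior described in the comment, not the real task queue interface.

```cpp
#include <cstddef>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical container; try_get here is illustrative, not the real interface.
class ToyTaskQueues {
public:
    explicit ToyTaskQueues(std::size_t n) : _queues(n) {}

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _queues.size();
    }

    void resize(std::size_t n) {
        std::lock_guard<std::mutex> lock(_mutex);
        _queues.resize(n);
    }

    // Returns false if queue i does not exist.
    bool push(std::size_t i, int task_id) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (i >= _queues.size()) return false;
        _queues[i].push_back(task_id);
        return true;
    }

    // Returns a task from queue i, or nullopt when i is out of range
    // (for example after a concurrent shrink) or the queue is empty.
    std::optional<int> try_get(std::size_t i) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (i >= _queues.size() || _queues[i].empty()) return std::nullopt;
        int id = _queues[i].back();
        _queues[i].pop_back();
        return id;
    }

private:
    mutable std::mutex _mutex;
    std::vector<std::vector<int>> _queues;
};

// abort_all-style loop: the size is re-read on every iteration and
// try_get tolerates a stale index, so a shrink cannot cause an
// out-of-bounds access.
inline std::vector<int> drain_all(ToyTaskQueues& q) {
    std::vector<int> aborted;
    for (std::size_t i = 0; i < q.size(); ++i) {
        while (auto id = q.try_get(i)) aborted.push_back(*id);
    }
    return aborted;
}
```

The sketch does not cover the expansion case; as noted in the comment, that relies on nothing being queued once the scheduler is stopped.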

@kevincai kevincai force-pushed the abort-compaction-during-shutdown branch from 1f09a1b to 367a64f Compare February 5, 2025 06:23
@kevincai kevincai force-pushed the abort-compaction-during-shutdown branch from 367a64f to ff35b32 Compare February 7, 2025 03:00

github-actions bot commented Feb 7, 2025

[Java-Extensions Incremental Coverage Report]

pass : 0 / 0 (0%)


github-actions bot commented Feb 7, 2025

[FE Incremental Coverage Report]

pass : 0 / 0 (0%)


github-actions bot commented Feb 7, 2025

[BE Incremental Coverage Report]

pass : 44 / 46 (95.65%)

file detail

path | covered_line | new_line | coverage | not_covered_line_detail
🔵 be/src/storage/lake/compaction_scheduler.cpp | 44 | 46 | 95.65% | [273, 274]


2 participants