[BugFix] abort compaction task properly during shutdown #55503
base: main
Conversation
Force-pushed from 0dac0f6 to 1f09a1b
}
}

void CompactionScheduler::compact(::google::protobuf::RpcController* controller, const CompactRequest* request,
                                  CompactResponse* response, ::google::protobuf::Closure* done) {
    if (_stopped) {
        brpc::ClosureGuard guard(done);
        auto st = Status::Aborted("Compaction shutdown in progress!");
BE/CN shutdown
done
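As a side note for readers of this thread, a minimal sketch of the brpc early-abort pattern the hunk above appears to follow. The `to_protobuf` call and the `status` field on CompactResponse are assumptions for illustration, not a claim about the actual diff:

```cpp
// Sketch only: the early-return arm taken when the scheduler is already stopped.
// brpc::ClosureGuard invokes done->Run() when it leaves scope, so the rpc still
// gets a response even though no compaction task is queued.
void CompactionScheduler::compact(::google::protobuf::RpcController* /*controller*/, const CompactRequest* /*request*/,
                                  CompactResponse* response, ::google::protobuf::Closure* done) {
    if (_stopped) {
        brpc::ClosureGuard guard(done);
        auto st = Status::Aborted("BE/CN shutdown in progress!");      // message wording per the review suggestion above
        st.to_protobuf(response->mutable_status());  // assumption: CompactResponse carries a StatusPB `status` field
        return;  // the guard runs done->Run() here, completing the rpc with the abort status
    }
    // ... normal path: build task contexts and push them onto the task queues ...
}
```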
@@ -197,6 +206,9 @@ void CompactionScheduler::compact(::google::protobuf::RpcController* controller,
    }
    // initialize last check time, compact request is received right after FE sends it, so consider it valid now
    cb->set_last_check_time(time(nullptr));
    // FIXME: note that even though `_stopped` is checked at the beginning, the scheduler may still switch to the
    // stopped state between that check and this point. If that race happens, the contexts_vec may never be
    // processed, which leaves the rpc hanging without a response.
should this be handled by abort_all?
so the comment can be adjusted
np. There is a gap between `abort_all()` and `compact()`. Consider threads A and B running in the following sequence:
1. thread-A: `compact()` executes up to line 183
2. thread-B: `stop()`
3. thread-A: runs on to line 212
So `abort_all()` has already finished, yet new contexts are still pushed into the task queue after abort_all.
I think it can be handled by adding a mutex in the compaction scheduler?
done
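Not the actual diff, but a minimal sketch of how a mutex can close the gap described in the sequence above: the stopped-check and the enqueue happen under the same lock that stop()/abort_all() takes, so stop() can no longer slip in between them. All names below (`_mutex`, `try_enqueue`, `stop_and_abort_all`) are illustrative:

```cpp
#include <deque>
#include <mutex>

// Sketch only: serialize the _stopped check with the enqueue so a concurrent
// stop() cannot land between steps 1 and 3 of the sequence above.
class CompactionSchedulerSketch {
public:
    // Returns false if shutdown has already started; the caller should then
    // abort the rpc instead of queuing a new context.
    bool try_enqueue(int ctx) {  // `int` stands in for a task context
        std::lock_guard<std::mutex> lock(_mutex);
        if (_stopped) {
            return false;  // refuse new work once shutdown has begun
        }
        _queue.push_back(ctx);  // enqueue while still holding the lock
        return true;
    }

    void stop_and_abort_all() {
        std::lock_guard<std::mutex> lock(_mutex);
        _stopped = true;
        // drain and abort every queued context under the same lock, so nothing
        // can be added after this point
        _queue.clear();
    }

private:
    std::mutex _mutex;
    bool _stopped = false;
    std::deque<int> _queue;
};
```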
@@ -432,6 +459,21 @@ Status CompactionScheduler::abort(int64_t txn_id) {
    return Status::NotFound(fmt::format("no compaction task with txn id {}", txn_id));
}

void CompactionScheduler::abort_all() {
    for (int i = 0; i < _task_queues.task_queue_size(); ++i) {
there might be a concurrency problem here: task_queue_size might change
Theoretically, `abort_all()` is expected to be called only during the shutdown procedure, after all services have already shut down. Nevertheless, the case is worth handling.
Taking a second look at the implementation, it is safe even if the task_queue_size changes, because the try_get() interface handles an out-of-range index `i` when the queue size shrinks. As for expansion of the queue, it should be fine as long as no tasks can be queued after the scheduler is stopped (a separate topic to take care of).
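For illustration, a minimal sketch of why a shrinking queue count is harmless under that reasoning: a per-index try_get() that simply returns nothing for an out-of-range index makes the abort_all() drain loop a no-op for stale indexes. The names and semantics here are assumptions based on this comment, not the actual StarRocks implementation:

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Sketch only: try_get(idx) returns std::nullopt both when the queue at idx is
// empty and when idx is out of range, so a concurrently shrinking queue list
// cannot make the drain loop below misbehave.
struct TaskQueuesSketch {
    std::mutex mtx;
    std::vector<std::deque<int>> queues;  // `int` stands in for a task context

    size_t task_queue_size() {
        std::lock_guard<std::mutex> lock(mtx);
        return queues.size();
    }

    std::optional<int> try_get(size_t idx) {
        std::lock_guard<std::mutex> lock(mtx);
        if (idx >= queues.size() || queues[idx].empty()) {
            return std::nullopt;  // out-of-range or empty: nothing to hand out
        }
        int ctx = queues[idx].front();
        queues[idx].pop_front();
        return ctx;
    }
};

// Drain every queue; if task_queue_size() shrinks while this runs, try_get()
// on a stale index is simply a no-op. Expansion is only safe if nothing can be
// queued after the scheduler has stopped, as noted above.
inline void abort_all_sketch(TaskQueuesSketch& q) {
    for (size_t i = 0; i < q.task_queue_size(); ++i) {
        while (auto ctx = q.try_get(i)) {
            // mark *ctx as aborted / respond to its rpc here
            (void)ctx;
        }
    }
}
```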
Force-pushed from 1f09a1b to 367a64f
Signed-off-by: Kevin Xiaohua Cai <[email protected]>
Force-pushed from 367a64f to ff35b32
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[BE Incremental Coverage Report] ✅ pass : 44 / 46 (95.65%) file detail
Why I'm doing:
What I'm doing:
Fixes #issue
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: