kernel: Add support for stopping workqueues #82614
Conversation
Force-pushed from 63fe953 to 761384d.
LGTM, but I'll wait until others have reviewed too.
The feature in general looks clean and useful.
There is one bug in the test, and some criticism of what is IMHO an inappropriately aggressive implementation of "stop()".
And just as general hygiene, can you split this up into three patches:
- Add the feature
- Add tracing hooks
- Add tests
I'll split them up if you want, but could I ask why you prefer it that way?
@Mattemagikern - separate commits make it easier to review and maintain in many cases. As a general rule, if a code change needs to be backported, smaller commits have fewer side effects. If a regression occurs, it's easier to narrow down the root cause. We usually insist on separate code and test commits because it makes it easier to test bug fixes. Generally good hygiene.
Force-pushed from 761384d to 5032425.
Once upon a time on Discord:
"Help help my very important code stopped working!"
"Maybe it's the new feature that just got added. Can you try reverting this patch and see if that fixes it?"
"Help help! I tried git revert and now it's giving me a bunch of collisions in the tracing code! Who can help me fix this? My code is very important!"
Small patches that do one thing in an isolated way make it easier to figure out what broke when things inevitably break. Maintainers are lazy and like to toss off one-line suggestions vs. addressing rebase failures in their spare time.
Force-pushed from f079b61 to ecf9d13.
Another newbie question:
Separate commits within the same pull request please.
@npitre thanks! ✔️
A few more nitpicks that occur to me, but this looks good.
include/zephyr/kernel.h (Outdated)
 * @retval -ENODEV if the work queue was not started
 * @retval -EBUSY if the work queue is actively processing work items
 */
int k_work_queue_stop(struct k_work_q *queue);
Note that this is a blocking call (because of k_thread_join()) and so per "Standard Zephyr API Idiom" should probably take a k_timeout to allow the app to control wakeup behavior. In this particular case that's probably not a big deal because the only way this will get stuck is if a workq handler refuses to return, which is already a bug elsewhere.
I'm worried about side effects if a user sets a very short timeout (e.g. K_NO_WAIT) and/or unfavourable thread priorities, where we get a false negative.
The case I'm worried about is:
int k_work_queue_stop(struct k_work_q *queue, k_timeout_t timeout)
{
	...
	flag_set(&queue->flags, K_WORK_QUEUE_STOP_BIT);
	notify_queue_locked(queue);
	k_spin_unlock(&lock, key);
	if (k_thread_join(&queue->thread, timeout)) { /* <--- sleeping here */
		key = k_spin_lock(&lock);
		flag_clear(&queue->flags, K_WORK_QUEUE_STOP_BIT);
		k_spin_unlock(&lock, key);
		return -ETIMEDOUT;
	}
	return 0;
}
...
static void work_queue_main(void *workq_ptr, void *p2, void *p3)
{
	....
	k_spinlock_key_t key = k_spin_lock(&lock); /* <-- past this point */
	....
}

k_work_queue_stop(queue, K_NO_WAIT); /* returns -ETIMEDOUT */
/* Is the thread on its way to die, or is it still alive? */
It may still be OK? Subsequent calls to the workqueue would be rejected (-ENODEV), since the thread resets the flags.
I may be overthinking this. What do you think?
If k_work_queue_stop() is guaranteed to set the STOP bit, then the work queue is guaranteed to stop in finite time (absent app bugs that stall the handlers). I think that's acceptable. Apps that care about synchronization can always call it again and take care to avoid racing with their own handlers until they get a success.
Again, I don't think this rises to a -1 given the specifics; the timeout isn't likely to be very useful in practice. Nonetheless timeout arguments for blocking calls are The Zephyr Way, so it should be pointed out.
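To illustrate the "call it again" point, here is a minimal usage sketch, assuming the (queue, timeout) signature and the -ETIMEDOUT return from the sketch above; my_queue and the 10 ms budget are placeholders, not part of the patch:

#include <zephyr/kernel.h>

extern struct k_work_q my_queue; /* placeholder: owned by the application */

static void stop_queue_synchronously(void)
{
	int rc;

	/* The queue is assumed to already be plugged via
	 * k_work_queue_drain(&my_queue, true), so no new items can arrive
	 * between attempts.
	 */
	do {
		rc = k_work_queue_stop(&my_queue, K_MSEC(10));
	} while (rc == -ETIMEDOUT);
}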
	}
}

ZTEST_SUITE(workqueue_reuse, NULL, NULL, NULL, NULL, NULL); |
Ideally, rather than adding a whole new testsuite effectively for a single ZTEST testing a single API call, this should be added to an existing testsuite that is more like a traditional unit test for the work queue API.
In the build project there already exist:
workqueue_triggered
workqueue_delayed
They seem more specific than my test. Should I rename my suite to workqueue_api for future use instead, or should I try to integrate my test into one of them or another test suite?
"Succeeded to submit work item to non-initialized work queue"); | ||
} | ||
|
||
ZTEST(workqueue_reuse, test_submit_to_queue) |
Let's narrow the scope of the code we are testing to k_work_queue_stop() and add it to some other testsuite, since it is unlikely that this test is adding new coverage for work queue execution of handling multiple work items.
It doesn't really make sense to create a new suite for effectively testing one API call.
If we don't have an existing unit-style testsuite (e.g. workqueue_api), maybe this could be it.
- ZTEST(workqueue_reuse, test_submit_to_queue)
+ ZTEST(workqueue_api, test_k_work_queue_stop)
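For illustration only, a unit-style test along those lines might look roughly like this; the workqueue_api suite name follows the suggestion above, the stack size and priority are arbitrary, and k_work_queue_stop() is assumed to take the (queue, timeout) signature discussed in this thread:

#include <zephyr/kernel.h>
#include <zephyr/ztest.h>

#define STOP_TEST_STACK_SIZE 1024

K_THREAD_STACK_DEFINE(stop_test_stack, STOP_TEST_STACK_SIZE);
static struct k_work_q stop_test_queue;

static void dummy_handler(struct k_work *work)
{
	ARG_UNUSED(work);
}

ZTEST(workqueue_api, test_k_work_queue_stop)
{
	struct k_work work;

	k_work_init(&work, dummy_handler);
	k_work_queue_start(&stop_test_queue, stop_test_stack,
			   K_THREAD_STACK_SIZEOF(stop_test_stack),
			   K_PRIO_PREEMPT(4), NULL);

	/* Plug the queue, let it run dry, then ask it to stop. */
	k_work_queue_drain(&stop_test_queue, true);
	zassert_ok(k_work_queue_stop(&stop_test_queue, K_FOREVER),
		   "failed to stop an idle, plugged queue");

	/* Per the queue_submit_locked() return values quoted later in this
	 * thread, a queue that isn't running rejects submissions with -ENODEV.
	 */
	zassert_equal(k_work_submit_to_queue(&stop_test_queue, &work), -ENODEV,
		      "submitted work to a stopped queue");
}

ZTEST_SUITE(workqueue_api, NULL, NULL, NULL, NULL, NULL);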
I'll wait with these changes until we've decided on the correct placement of the tests 👍
This patch adds support for stopping workqueues. This is useful for freeing resources from workqueues when subsystems/modules are deactivated, or for cleaning up the system between tests in ztest to reach a fully normalized state. The patch adds a new function k_work_queue_stop() that releases the workqueue's thread and stack when a workqueue is unwanted. k_work_queue_stop(...) should be viewed as a counterpart to k_work_queue_start(...). This would allow: k_work_queue_start(...); k_work_drain(..., true); k_work_queue_stop(...); Signed-off-by: Måns Ansgariusson <[email protected]>
Adds tracing support for stopping workqueues. Signed-off-by: Måns Ansgariusson <[email protected]>
This patch adds tests for stopping workqueues. Signed-off-by: Måns Ansgariusson <[email protected]>
Force-pushed from 726569b to 2c3bf28.
} else if (flag_test(&queue->flags, K_WORK_QUEUE_STOP_BIT)) {
	/* User has requested that the queue stop. Clear the status flags and exit. */
	flags_set(&queue->flags, 0);
	k_spin_unlock(&lock, key);
	return;
Pardon me.
If we allow tasks to append to the queue continuously after K_WORK_QUEUE_STOP_BIT is set, this block will never be executed.
Is that expected?
A pre-condition to the _stop(..) function is that the queue has already been plugged.
/* queue_submit_locked() return values:
 * * -ENODEV if the queue isn't running.
 * * -EBUSY if draining and not chained
 * * -EBUSY if plugged and not draining
 * * otherwise OK
 */
static inline int queue_submit_locked(struct k_work_q *queue,
				      struct k_work *work)
So the case you're thinking of is:

Thread a {
	while (true) {
		queue_submit_locked(....);
	}
}

Thread b {
	k_work_queue_drain(..., true);
	k_work_queue_stop(..., K_FOREVER);
}

Would thread b freeze indefinitely?
Initially, maybe 👍
The question is whether we should also disallow appending work if K_WORK_QUEUE_STOP_BIT is set.
I'm not sure. How do you reason?
Oh, I missed that we need to call k_work_queue_drain before stopping the workqueue, but the design makes me think: could this scenario happen?

Thread a {
	while (true) {
		queue_submit_locked(....);
	}
}

Thread b: k_work_queue_drain(..., true);
Thread c: k_work_queue_drain(..., true);
Thread b: k_work_queue_stop(..., K_FOREVER);
Thread c: k_work_queue_unplug(...);

Based on the docs:
The workqueue API is designed to be safe when invoked from multiple threads and interrupts.
I think it is possible, so we need to care about this problem. WDYT?
In general, I prefer:
1. Making K_WORK_QUEUE_STOP_BIT isolated instead of depending on K_WORK_QUEUE_PLUGGED_BIT.
2. Disallowing appends if K_WORK_QUEUE_STOP_BIT is set.
3. Never clearing K_WORK_QUEUE_STOP_BIT, even on timeout, to avoid appending work to the workqueue, and documenting it.
What do you and others think about this idea?
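For point 2, a rough sketch, reusing the flag_test() helper that kernel/work.c already has; returning -ENODEV here is an assumption chosen to match the "queue isn't running" case quoted above, not a settled design decision:

/* Inside queue_submit_locked(), before the existing plugged/draining
 * checks: refuse new work once a stop has been requested (sketch only).
 */
if (flag_test(&queue->flags, K_WORK_QUEUE_STOP_BIT)) {
	return -ENODEV;
}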
Sounds like a great opportunity for additional test cases
Another option is to forbid the unplug function if K_WORK_QUEUE_STOP_BIT is set.

Thread a {
	while (true) {
		queue_submit_locked(....); /* <- Error once K_WORK_QUEUE_PLUGGED_BIT is set & not draining */
	}
}

Thread b: k_work_queue_drain(..., true);
Thread c: k_work_queue_drain(..., true);
Thread b: k_work_queue_stop(..., K_FOREVER); /* <- Would exit in finite time */
Thread c: k_work_queue_unplug(...); /* <- Would yield an error & not unplug */

Either thread B or C would return an error, depending on which accessed the queue first.
I believe this is the smallest change we could make to implement this feature. It shouldn't affect current use cases; what do you think?
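A rough sketch of that option, again reusing the flag helpers and the static lock from kernel/work.c; the -EBUSY return for a pending stop is an assumption, not something settled in this thread:

int k_work_queue_unplug(struct k_work_q *queue)
{
	int ret = -EALREADY;
	k_spinlock_key_t key = k_spin_lock(&lock);

	if (flag_test(&queue->flags, K_WORK_QUEUE_STOP_BIT)) {
		/* A stop is pending; keep the queue plugged so it can
		 * actually terminate (sketch only).
		 */
		ret = -EBUSY;
	} else if (flag_test(&queue->flags, K_WORK_QUEUE_PLUGGED_BIT)) {
		flag_clear(&queue->flags, K_WORK_QUEUE_PLUGGED_BIT);
		ret = 0;
	}

	k_spin_unlock(&lock, key);
	return ret;
}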
@cfriedt Agreed!
The case I'm describing could still deadlock thread b: if the queue never drains, it will continue to accept work. So we probably need point 2 that you outlined earlier as well:
- Disallow appends if K_WORK_QUEUE_STOP_BIT is set.
But that changes the behaviour of the queue & drain system slightly; is that ok?
Another option is to forbid the unplug function if K_WORK_QUEUE_STOP_BIT is set.
Sounds great.
But that changes the behaviour of the queue & drain system slightly, is that ok?
It is ok for me, but we still need @andyross or other maintainers to confirm, since the caller is time-sensitive.
By the way, I am not sure whether requiring users to call k_work_queue_drain before k_work_queue_stop is good API design or not.
By the way, I am not sure whether requiring users to call k_work_queue_drain before k_work_queue_stop is good API design or not.
Myself, being biased (haha), I think it is neat :P The queue_drain(..., true) gives the user the knowledge that the work scheduled up to that point will be executed, letting free-routines kick in so as not to cause memory leaks in the application. The k_work_queue_stop(...) stops the thread as soon as the last job has been processed.
I'll commit the changes that we outlined here today :)
I think we need to call queue_drain(..., true) internally in k_work_queue_stop, and set K_WORK_QUEUE_STOP_BIT before queue_drain(..., true). Is it possible in the current design?

TA: queue_drain(..., true)
TB: k_work_queue_stop, but the flag is not set yet
TA: queue_unplug
TB: flag is set

Additionally, the System Workqueue should never stop, I guess.
Additionally, the System Workqueue should never stop, I guess.
The patch specifically addresses scenarios where stopping a workqueue becomes interesting, like in a couple of tests I have in mind.
I believe the patch aligns with the broader system goals. If this approach conflicts with project guidelines or expectations, I expect that the maintainers will let me know shortly haha 🥲
I think we need to call queue_drain(..., true) internally in k_work_queue_stop and set K_WORK_QUEUE_STOP_BIT before queue_drain(..., true). Is it possible in the current design?
I've just looked through the kernel/work.c file; it would be a neat addition, one less function call for users to achieve the same effect. What's stopping me is k_work_queue_drain(...), since the current implementation is blocking (K_FOREVER). To resolve that I would need to rework that function to include a timeout, and then split the timeout budget between the draining and the stopping of the thread (the stopping of the queue would be almost instantaneous at that point, if we've passed the drain stage).
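For what it's worth, a rough sketch of how a single timeout budget could be split between the drain and the join, assuming the sys_timepoint_*() helpers and a hypothetical k_work_queue_drain_timeout() variant (which does not exist today); this is not the current API:

int k_work_queue_stop(struct k_work_q *queue, k_timeout_t timeout)
{
	/* Capture an absolute deadline so the remaining budget can be
	 * recomputed after the drain phase.
	 */
	k_timepoint_t end = sys_timepoint_calc(timeout);
	int rc;

	/* Hypothetical: drain-and-plug with a timeout instead of K_FOREVER. */
	rc = k_work_queue_drain_timeout(queue, true, timeout);
	if (rc < 0) {
		return rc;
	}

	/* Spend whatever time is left waiting for the queue thread to exit. */
	if (k_thread_join(&queue->thread, sys_timepoint_timeout(end)) != 0) {
		return -ETIMEDOUT;
	}

	return 0;
}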
@cfriedt @andyross How would you like me to proceed:
1. Rework _drain(..) to include a timeout and incorporate parts of that function into _stop(...), or
2. Keep it as is, with the addition of blocking _unplug(..) and preventing new work from being scheduled if the stop bit is set?
Regarding the race condition you're outlining (@TaiJuWu), I don't think it is a problem, since getting & setting the flags is protected by the spinlock; the winner of the spinlock dictates the outcome.
FWIW:
I think that's more of a policy decision than a bug. I can see arguments either way: you could force the queue to reject new submissions when a STOP is pending, or you could allow it and document it as "the queue will stop processing the next time it is idle". Either one seems like a reasonable choice.
Forgot to refresh +1. I'm fine with this as-is. Most of the suggestions seem acceptable too.
Probably would be best to rename to workq/workq_api, but I'm not blocking.
I don't want to block this feature either.