Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kernel: Add support for stopping workqueues #82614

Merged
merged 3 commits into from
Dec 12, 2024

Conversation

Mattemagikern
Copy link
Contributor

This patch adds support for stopping workqueues. This is useful for freeing resources from workqueues when subsystems/modules is deactivated or cleaning up the system between tests in ztest to reach a fully normalized state.

The patch adds a new function k_work_queue_stop() that releases the workqueues thread and stack when a workqueue is unwanted. k_work_queue_stop(...) should be viewed as a counterpart to k_work_queue_start(...).

This would allow to:
k_work_queue_start(...);

k_work_drain(..., true);
k_work_queue_stop(...);

@Mattemagikern Mattemagikern force-pushed the k_work_queue_stop branch 3 times, most recently from 63fe953 to 761384d Compare December 6, 2024 13:02
@Mattemagikern Mattemagikern marked this pull request as ready for review December 6, 2024 13:08
@cfriedt
Copy link
Member

cfriedt commented Dec 6, 2024

LGTM, but I'll wait until others have reviewed too.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feature in general looks clean and useful.

One bug in the test. Some criticism of IMHO an inappropriately aggressive implementation of "stop()".

And just as general hygiene, can you split this up into three patches:

  1. Add the feature
  2. Add tracing hooks
  3. Add tests

kernel/work.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/prj.conf Outdated Show resolved Hide resolved
@Mattemagikern
Copy link
Contributor Author

Mattemagikern commented Dec 6, 2024

@andyross

And just as general hygiene, can you split this up into three patches:

  1. Add the feature
  2. Add tracing hooks
  3. Add tests

I'll split them up if you want, but could I ask why you prefere it that way?
Just for my own curiosity.

@cfriedt
Copy link
Member

cfriedt commented Dec 6, 2024

@Mattemagikern - separate commits make it easier to review and maintain in many cases. As general rules, to ensure if a code change needs to be backported, there are fewer side effects with smaller commits. If a regression occurs, it's easier to narrow down the root cause. We usually insist on separate code and test commits because it makes it easier to test bug fixes.

Generally good hygiene

@andyross
Copy link
Contributor

andyross commented Dec 6, 2024

I'll split them up if you want, but could I ask why you prefere it that way? Just for my own curiosity.

Once upon a time on Discord:

"Help help my very important code stopped working!"

"Maybe it's the new feature that just got added. Can you try reverting this patch and see if that fixes it?"

"Help help! I tried git revert and now it's giving me a bunch of collisions in the tracing code! Who can help me fix this? My code is very important!"

Small patches that do one thing in an isolated way makes it easier to figure out what broke when things inevitably break. Maintainers are lazy and like to toss off one-line suggestions vs. addressing rebase failures in their spare time.

@Mattemagikern Mattemagikern force-pushed the k_work_queue_stop branch 2 times, most recently from f079b61 to ecf9d13 Compare December 6, 2024 15:54
@Mattemagikern
Copy link
Contributor Author

@cfriedt @andyross Thanks for the explination!

The patch has been updated to only contain the functionallity.
I've confirmed the changes with my testcases seperately.

@Mattemagikern
Copy link
Contributor Author

Another newbi question:
When you say you want the patches in [feature, tracing, tests], are these separate pull-requests or should I just split it up in commits on this pull-request?

@npitre
Copy link
Collaborator

npitre commented Dec 6, 2024 via email

@Mattemagikern
Copy link
Contributor Author

@npitre thanks! ✔️

andyross
andyross previously approved these changes Dec 6, 2024
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more nitpicks that occur to me, but this looks good.

* @retval -ENODEV if the work queue was not started
* @retval -EBUSY if the work queue is actively processing work items
*/
int k_work_queue_stop(struct k_work_q *queue);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that this is a blocking call (because of k_thread_join()) and so per "Standard Zephyr API Idiom" should probably take a k_timeout to allow the app to control wakeup behavior. In this particular case that's probably not a big deal because the only way this will get stuck is if a workq handler refuses to return, which is already a bug elsewhere.

Copy link
Contributor Author

@Mattemagikern Mattemagikern Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried about side effects if an user sets a very short timeout, e.g. K_NO_WAIT and/or thread priorities.
Where we get a false negative.

The case I'm worried about is:

int k_work_queue_stop(struct k_work_q *queue, k_timeout_t timeout)
{
...
	flag_set(&queue->flags, K_WORK_QUEUE_STOP_BIT);
	notify_queue_locked(queue);
	k_spin_unlock(&lock, key);
	if (k_thread_join(&queue->thread, timeout)) { <--- sleeping here
		key = k_spin_lock(&lock);
		flag_clear(&queue->flags, K_WORK_QUEUE_STOP_BIT);
		k_spin_unlock(&lock, key);
		return -ETIMEOUT;
	}

	return 0;

...

static void work_queue_main(void *workq_ptr, void *p2, void *p3)
{
....
		k_spinlock_key_t key = k_spin_lock(&lock); <-- past this point
....		
}

		
k_work_queue_stop(queue, K_NO_WAIT); // returns -ETIMEOUT
// Is the thread on it's way to die or is it alive? 

It may still be ok?
Subsequent calls to the workqueue will be handled (-ENODEV) due to the thread resets the flags.
I may be overthinking this. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If k_work_queue_stop() is guaranteed to set the STOP bit, then the work queue is guaranteed to stop in finite time (absent app bugs that stall the handlers). I think that's acceptable. Apps that care about synchronization can always call it again and take care to avoid racing with their own handlers until they get a success.

Again, I don't think this rises to a -1 given the specifics; the timeout isn't likely to be very useful in practice. Nonetheless timeout arguments for blocking calls are The Zephyr Way, so it should be pointed out.

include/zephyr/kernel.h Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
tests/kernel/workq/work_queue/src/start_stop.c Outdated Show resolved Hide resolved
}
}

ZTEST_SUITE(workqueue_reuse, NULL, NULL, NULL, NULL, NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, rather than adding a whole new testsuite effectively for a single ZTEST testing a single API call, this should be added to an existing testsuite that are more like traditional unit tests for the work queue API.

Copy link
Contributor Author

@Mattemagikern Mattemagikern Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the build project there exists the:

workqueue_triggered
workqueue_delayed

They seem more specifc than my test, should I rename my suit to workqueue_api for future use instead?

or should I try to integrate my test into one of them or another test suite?

"Succeeded to submit work item to non-initialized work queue");
}

ZTEST(workqueue_reuse, test_submit_to_queue)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's narrow the scope of the code we are testing to k_work_queue_stop() and add it to some other testsuite, since it is unlikely that this test is adding new coverage for work queue execution of handling multiple work items.

It doesn't really make sense to create a new suite for effectively testing one API call.

If we don't have an existing unit-style testsuite (e.g. workqueue_api), maybe this could be it.

Suggested change
ZTEST(workqueue_reuse, test_submit_to_queue)
ZTEST(workqueue_api, test_k_work_queue_stop)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll wait with these changes until we've decided on the correct placement of the tests 👍

include/zephyr/kernel.h Outdated Show resolved Hide resolved
This patch adds support for stopping workqueues. This is useful for freeing
resources from workqueues when subsystems/modules is deactivated or
cleaning up the system between tests in ztest to reach a fully normalized
state.

The patch adds a new function k_work_queue_stop() that releases the
workqueues thread and stack when a workqueue is unwanted.
k_work_queue_stop(...) should be viewed as a counterpart to
k_work_queue_start(...).

This would allow to:
k_work_queue_start(...);

k_work_drain(..., true);
k_work_queue_stop(...);

Signed-off-by: Måns Ansgariusson <[email protected]>
Adds tracing support for stopping workqueues.

Signed-off-by: Måns Ansgariusson <[email protected]>
This patch adds tests for stopping workqueues.

Signed-off-by: Måns Ansgariusson <[email protected]>
Comment on lines +656 to +661
} else if (flag_test(&queue->flags, K_WORK_QUEUE_STOP_BIT)) {
/* User has requested that the queue stop. Clear the status flags and exit.
*/
flags_set(&queue->flags, 0);
k_spin_unlock(&lock, key);
return;
Copy link
Member

@TaiJuWu TaiJuWu Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pardon me.
If we allow task append to the queue continuously after setting K_WORK_QUEUE_STOP_BIT, the block will never be executed.
Is it excepted?

Copy link
Contributor Author

@Mattemagikern Mattemagikern Dec 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A pre-condition to the _stop(..) function is that the queue has already been plugged.

static inline int queue_submit_locked(struct k_work_q *queue,
				      struct k_work *work)
{
	 * * -ENODEV if the queue isn't running.
	 * * -EBUSY if draining and not chained
	 * * -EBUSY if plugged and not draining
	 * * otherwise OK

So the case you're thinking of is:

Thread a {
          while (true) {
                   queue_submit_locked(....);
         }
}

Thread b {
    k_work_queue_drain(..., true);
    k_work_queue_stop(..., K_FOREVER);
}

Would thread b. freeze indefenately?
Initially, maybe 👍

Question is if we should disallow that append of work if the K_WORK_QUEUE_STOP_BIT is set as well.
I'm not sure. How do you reason?

Copy link
Member

@TaiJuWu TaiJuWu Dec 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed we need to call k_work_queue_drain before stopping workqueue, but the design make me thinks about this scenario will happen?

Thread a {
          while (true) {
                   queue_submit_locked(....);
         }
}

Thread b    k_work_queue_drain(..., true);
Thread c    k_work_queue_drain(..., true);
Thread b    k_work_queue_stop(..., K_FOREVER);
Thread c    k_work_queue_unplug(...);

Base on doc

The workqueue API is designed to be safe when invoked from multiple threads and interrupts. 

I think it is possible so we need to care about this problem. WDYT?

In general, I prefer

  1. Making K_WORK_QUEUE_STOP_BIT being isolated instead depend on K_WORK_QUEUE_PLUGGED_BIT.
  2. Disallow append if K_WORK_QUEUE_STOP_BIT is set.
  3. We don't clear K_WORK_QUEUE_STOP_BIT forever even if timeout avoid appending work to workqueue and documented it.

What do you and others think about this idea?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds like a great opportunity for additional test cases

Copy link
Contributor Author

@Mattemagikern Mattemagikern Dec 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TaiJuWu

Another option is to forbid the unplug function if K_WORK_QUEUE_STOP_BIT is set.

Thread a {
          while (true) {
                   queue_submit_locked(....); <- Error once K_WORK_QUEUE_PLUGGED_BIT is set & not draining
         }
}

Thread b    k_work_queue_drain(..., true);
Thread c    k_work_queue_drain(..., true);
Thread b    k_work_queue_stop(..., K_FOREVER); <- Would exit in finite time
Thread c    k_work_queue_unplug(...); <- Would yield error & not unplug. 

Either thread B or C would return an error, depending on which accessed the queue first.
I believe this is the smallest change we could make to implement this feature. It shouldn't affect current use cases, what do you think?

@cfriedt Agreed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The case I'm making could still deadlock thread b, if the queue never drains the queue will continue to accept works.

So we probbably need the 2. You outlined earlier as well.

  1. Disallow append if K_WORK_QUEUE_STOP_BIT is set.

But that changes the behaviour of the queue & drain system slightly, is that ok?

Copy link
Member

@TaiJuWu TaiJuWu Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to forbid the unplug function if K_WORK_QUEUE_STOP_BIT is set.

Sound great.

But that changes the behaviour of the queue & drain system slightly, is that ok?

It is ok for me but we still need @andyross or other maintainers comfirm since the caller is time-sensitive.

By the way, I am not sure we need to call k_work_queue_drain before k_work_queue_stop by users is good api design or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I am not sure we need to call k_work_queue_drain before k_work_queue_stop by users is good api design or not.

Myself, being biased (haha), think it is neat :P The queue_drain(..., true) gives the user the knowledge that the works scheduled up to that point will be executed and in so doing letting free-routines kick in as to not casue memory leaks in the application.
The k_work_queue_stop(...) stopps the thread as soon as the last job has been processed.

I'll commit the changes that we outlined here today :)

Copy link
Member

@TaiJuWu TaiJuWu Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to call queue_drain(..., true) in k_work_queue_stop internally and set K_WORK_QUEUE_STOP_BIT berfore queue_drain(..., true)
Is it possible in current design?

TA                                     TB
queue_drain(..., true)
                                    k_work_queue_stop but flag is not set
queue_unplug
                                   flag is set

Additionally, System Workqueue should never stop I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, System Workqueue should never stop I guess.

The patch specifically addresses scenarios where stopping a workqueue becomes interesting, like in a couple of tests I have in mind.
I believe the patch aligns with the broader system goals. If this approach conflicts with project guidelines or expectations, I expect that the maintainers wil let me know shortly haha 🥲

I think we need to call queue_drain(..., true) in k_work_queue_stop internally and set K_WORK_QUEUE_STOP_BIT berfore queue_drain(..., true)
Is it possible in current design?

I've just looked through the kernel/work.c file, it would be a neat addition, one less function call from the users to achive the same effect. What's stopping me is the k_work_queue_drain(...) since the current implementation is blocking ( K_FOREVER ) and to resolve that I would need to rework that function to include a timeout and then manage to make the timeout time be split between the draining & the stopping of the thread (the stopping of the queue would be, almost, instantanious at that point if we've passed the drain stage).
@cfriedt @andyross How would you like me to proceede with:

  • to sort out the_drain(..) to include a timeout and incorperate parts of that function in the _stop(...)

or

  • keep it as is with the addition of blocking _unplug(..) and stop new works to be scheduled if the stop bit is set?

Regarding the race-condition you're (@TaiJuWu) outlining I don't think it is a problem since the getting & setting of the flags are protected by the spin_lock, the winner of the spin_lock would dictate the outcome.

@andyross
Copy link
Contributor

FWIW:

If we allow task append to the queue continuously after setting K_WORK_QUEUE_STOP_BIT, the block will never be executed.

I think that's more of a policy decision than a bug. I can see arguments either way: you could force the queue to reject new submissions when a STOP is pending, or you could allow it and document it as "the queue will stop processing the next time it is idle". Either one seems like a reasonable choice.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot to refresh +1. I'm fine with this as-is. Most of the suggestions seem acceptable too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably would be best to rename to workq/worq_api, but I'm not blocking.

Copy link
Member

@TaiJuWu TaiJuWu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t want to block this feature too.

@kartben kartben merged commit 8934b81 into zephyrproject-rtos:main Dec 12, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants