Missed wake call using .shared() #2706

Open
steffahn opened this issue Feb 7, 2023 · 5 comments · May be fixed by #2899
Labels
A-future Area: futures::future bug

Comments

@steffahn
Member

steffahn commented Feb 7, 2023

This is the reproduction:

use std::{
    io::Write,
    pin::Pin,
    task::{Context, Poll},
};

use futures::prelude::*;

pub async fn yield_now() {
    /// Yield implementation
    struct YieldNow {
        yielded: bool,
    }

    impl Future for YieldNow {
        type Output = ();

        fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
            if self.yielded {
                return Poll::Ready(());
            }

            self.yielded = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }

    YieldNow { yielded: false }.await
}

#[tokio::main]
async fn main() {
    for _ in 0..200 {
        print!(".");
        std::io::stdout().flush().unwrap();
        for _ in 0..10000 {
            test().await;
        }
    }
    println!();
}

async fn test() {
    let f1 = yield_now().shared();
    let f2 = f1.clone();
    // Two tasks race to poll the same shared future concurrently.
    let x1 = tokio::spawn(async move {
        f1.now_or_never();
    });
    let x2 = tokio::spawn(async move {
        f2.await;
    });
    x1.await.ok();
    x2.await.ok();
}

It manages to hang eventually on most runs, both in debug and release mode.

The example is crafted to expose a weakness spotted during code review of .shared(). The problematic part is here:

POLLING => {
    // Another task is currently polling, at this point we just want
    // to ensure that the waker for this task is registered
    this.inner = Some(inner);
    return Poll::Pending;
}

This code is simply insufficient to ensure that the current context’s waker is ever actually woken up. The current context’s waker was stored into the inner slab earlier, on this line:

inner.record_waker(&mut this.waker_key, cx);

However, a concurrently running poll could already have been inside the call to the inner future’s poll before the new waker got registered, i.e. here:

let poll_result = future.poll(&mut cx);

As soon as the call to poll on the inner future starts, it may already be calling the waker, i.e. doing the waking before the new waker that we care about was registered and put into the slab in the first place, because the waking can happen before the state is put back to IDLE.

In particular, the waking can happen before the state is put back to IDLE because it is fine for a .poll call to do the waking immediately, before it returns, the way the yield_now function in the example always does (copied from tokio’s old yield_now implementation).
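
For concreteness, one interleaving of the reproduction above that produces the hang looks roughly like this (a sketch; the exact ordering of record_waker and the IDLE -> POLLING transition inside Shared::poll is simplified):

task A (x1): Shared::poll records A's waker and moves the state from IDLE to POLLING
task A (x1): calls future.poll(&mut cx) on the inner yield_now future
task A (x1):   the inner future calls cx.waker().wake_by_ref(),
               which wakes every waker recorded so far (only A's)
task B (x2): Shared::poll: inner.record_waker(&mut this.waker_key, cx) stores B's waker
task B (x2): observes POLLING and returns Poll::Pending, relying on a later wake
task A (x1): the inner poll returns Pending, the state goes back from POLLING to IDLE
task A (x1): now_or_never() returns None and f1 is dropped

Nothing ever wakes task B again: the only wake happened before B's waker was recorded.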

@taiki-e taiki-e added bug A-future Area: futures::future labels Feb 7, 2023
@steffahn
Member Author

steffahn commented Feb 7, 2023

Some initial thoughts on a potential fix: add an additional three-state enum next to the Slab, inside the Mutex in the wakers field.

Right before calling the inner future’s poll method, set this state from the “default” state to “inside inner poll”. The Notifier::wake_by_ref implementation would, in addition to operating as it already does, inspect this state and, if it is “inside inner poll”, change it to “woken while inside inner poll”.

Then, after the call to the inner future’s poll method, if the poll result is Pending:

  • first lock the Mutex
  • next change the notifier state from POLLING to IDLE
  • then look into the Mutex contents and change the state back to “default”
    • if the state was “woken while inside inner poll”, wake all (new) wakers immediately (only the wakers that arrived during the poll operation will be left anyway)
  • finally, unlock the Mutex

This way, a thread that observes the POLLING state and returns can know that its record_waker call accessed the Mutex before the currently polling thread re-locks the Mutex to handle wakes-during-polling, so its registered waker is never ignored.
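
A minimal sketch of that bookkeeping (all names are made up for illustration and are not the actual futures-rs internals; the real Notifier keeps a Slab of Option<Waker> rather than a Vec):

use std::sync::atomic::{AtomicUsize, Ordering::SeqCst};
use std::sync::Mutex;
use std::task::Waker;

const IDLE: usize = 0;

// Hypothetical extra state, stored next to the waker collection behind the same Mutex.
#[derive(PartialEq)]
enum InnerPollState {
    Default,              // no inner poll in progress
    InsideInnerPoll,      // the inner future's poll is currently running
    WokenWhileInsidePoll, // wake_by_ref fired while the inner poll was running
}

struct Notifier {
    state: AtomicUsize,                          // IDLE / POLLING / ..., as today
    wakers: Mutex<(InnerPollState, Vec<Waker>)>, // simplified stand-in for the Slab
}

impl Notifier {
    // Called right before `future.poll(&mut cx)`.
    fn enter_inner_poll(&self) {
        self.wakers.lock().unwrap().0 = InnerPollState::InsideInnerPoll;
    }

    // The additional work `Notifier::wake_by_ref` would do
    // (besides, or instead of, waking immediately; see the leeway point further down).
    fn note_wake(&self) {
        let mut guard = self.wakers.lock().unwrap();
        if guard.0 == InnerPollState::InsideInnerPoll {
            guard.0 = InnerPollState::WokenWhileInsidePoll;
        }
    }

    // Called after `future.poll(&mut cx)` returned Pending.
    fn exit_inner_poll(&self) {
        let mut guard = self.wakers.lock().unwrap();
        // POLLING -> IDLE happens while the Mutex is held, so any record_waker call
        // that went on to observe POLLING has already put its waker into the collection.
        self.state.store(IDLE, SeqCst);
        if guard.0 == InnerPollState::WokenWhileInsidePoll {
            // Only wakers that arrived during the inner poll are still registered here.
            for waker in guard.1.drain(..) {
                waker.wake();
            }
        }
        guard.0 = InnerPollState::Default;
    }
}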


In more detail:

The previously problematic situation arises when a record_waker call happens while the inner .poll call is already in progress and after that call has already executed its guaranteed eventual wakeup, and when, after such a record_waker call, the POLLING state belonging to the same poll of the inner future is still observed. Otherwise there is no problem, because either there would already have been a subsequent poll that does consider the registered waker, or we would be doing such a poll ourselves (or the future could be finished).

This means that first the polling thread records the “inside inner poll” state, and then the wake happening during execution of the inner poll puts it into the “woken while inside inner poll” state. Then the problematic thread calls record_waker, adding a new waker to the wakers alongside the “woken while inside inner poll” state. Afterwards this thread observes POLLING, so this record_waker is known to happen before the change from POLLING to IDLE in the polling thread. That POLLING-to-IDLE change happens while the same Mutex that record_waker accessed is being held, so that mutex acquisition also happened after the call to record_waker. Hence the polling thread will observe both the “woken while inside inner poll” state and a list of wakers including the newly registered one, which can then be woken.


I guess there’s some implementation leeway on the point of whether a Notifier::wake_by_ref while in (one of) the “inside inner poll” state(s) should perhaps only update the state (to “woken while inside inner poll”) and not do any waking yet, as any awoken task would run into a currently-POLLING state anyway and would then need to be re-awoken soon after.

Perhaps more importantly, I have not thought about panic case(s) yet. Those would presumably also need to be handled somewhat properly.

@steffahn
Member Author

steffahn commented Feb 11, 2023

Trying to start an implementation of a fix, I came across a design question where I’m not sure what the best solution should be.

If the poll method panics, then the caller of the Shared instance currently being polled will get that panic propagated properly, but all other clones of the Shared will potentially never learn about the panic, since – as far as I’m aware – a panicking poll gives no promises as to whether or not the waker in the context you gave it will ever actually be woken.

A possible solution is to make sure to wake all other wakers in case of a panic, so they have a chance to learn of the panicking poll themselves. (From their point of view, they get woken, then call poll, and then get a panic themselves, which is the desired behavior, AFAICT, unlike the current approach of potentially leaving them unwoken indefinitely without ever experiencing a panicking poll call themselves.)

Now the problem with this approach: calling a bunch of wakers in the panic case could be done through code in a Drop implementation, with the effect that if any of the wakers panics on the .wake call, that’s a double panic, i.e. an abort. So the question is whether we can simply assume that reasonable wakers never panic, so the possibility of an abort would be considered “fine”. Alternatively, it could be a reasonable approach to use catch_unwind instead of destructors, so that a panicking waker would not result in an immediate abort; but if panicking wakers are a thing to consider, there are follow-up questions such as: if one waker panics, do we still need to wake all subsequent wakers in the Slab? If yes, that’s a bunch of code and a lot of catch_unwind invocations, though…
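
For the Drop-based variant, the shape would be roughly this (a sketch; the names are made up and this is not the actual futures-rs code):

use std::task::Waker;

// Guard placed around the inner `future.poll(&mut cx)` call.
struct WakeOnUnwind {
    wakers: Vec<Waker>,
}

impl WakeOnUnwind {
    // Called on the non-panicking path; takes the wakers back out so they can be
    // handled normally, leaving nothing for the Drop impl to wake.
    fn disarm(mut self) -> Vec<Waker> {
        std::mem::take(&mut self.wakers)
    }
}

impl Drop for WakeOnUnwind {
    fn drop(&mut self) {
        // If the inner poll panicked, `disarm` was never reached and the wakers are
        // still here: wake them all so their tasks poll again and observe the panic.
        // If one of these wakes itself panics while we are already unwinding, that is
        // a double panic, i.e. an abort: exactly the trade-off discussed above.
        for waker in self.wakers.drain(..) {
            waker.wake();
        }
    }
}

The catch_unwind alternative mentioned above would instead loop over the wakers explicitly, wrapping each .wake() call in catch_unwind.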

So really TL;DR, how bad are missed wake-ups in cases where panics happened somewhere? (I suppose bad enough that they do warrant a fix.) And how seriously should the case of panicking wakers be considered? (I’m unsure on this, and interested in hearing about guidelines or prior art on the question of panicking wakers.)

@LPGhatguy

Wanted to pop in and say thank you for working on this issue. I went down a rabbit hole today debugging an application that uses Shared and was dropping futures on the ground; this appears to be the root cause of my bug.

@taiki-e
Member

taiki-e commented Feb 28, 2023

Thanks for working on this!

how bad are missed wake-ups in cases where panics happened somewhere?

It is not ideal, but fine. (Shared is neither UnwindSafe nor RefUnwindSafe. And IMO, a potential abort is bad anyway.)

guidelines or prior art on the question of panicking wakers

I do not know of any prior art or guidelines for this particular case, but tokio-rs/tokio#3689 is somewhat relevant.

@zekun000

any update or workaround for this issue?

@zekun000 linked a pull request Nov 27, 2024 that will close this issue