refactor: use listen and trigger universally #164

Conversation

enigbe
Contributor

@enigbe enigbe commented Jan 27, 2024

What this PR does

  • Listens for shutdown trigger across all channels
  • Trigger shutdown when tasks exit with error
  • Documentation update

Related Issue(s)

Notes

@enigbe enigbe requested review from sr-gi, okjodom and carlaKC January 27, 2024 07:47
@enigbe enigbe self-assigned this Jan 27, 2024
Contributor

@carlaKC carlaKC left a comment

Thanks for the PR! A few structural/stylistic comments on first pass.

One high level question that I'm wondering about is whether we can simplify this by always triggering shutdown whenever we break out of one of the loop/select patterns in the code. Right now we have to trigger shutdown inline, but if the expected behavior is to break the loop on error, we could just have a single trigger on break? Perhaps with the exception of listener triggering, because we know we can just return there.

Update: see comment below.

I think that could lead to simpler understanding of the shutdown logic (which is a nightmare right now), but I haven't looked at whether it works in every instance across the codebase. Interested to hear thoughts on this from others!

@carlaKC
Contributor

carlaKC commented Jan 31, 2024

Discussed this PR a little more offline. Another option for a cleaner/more testable solution would be:

  • Refactor to return Result from functions that can trigger shutdown
  • Call trigger at spawn site rather than inside of the function

This saves us from having to pass shutdown all the way down to every task, and cuts down on the number of places where we need to call trigger. Also has the benefit of making some of these functions more testable, because we can assert on return values.

Eg, for our simulation results task:

        tasks.spawn(async move {
            if let Err(e) =
                produce_simulation_results(nodes, output_receiver, results_sender, listener_results)
                    .await
            {
                shutdown.trigger();
                log::error!("produce simulation results exited with error: {e:?}.");
            }
        });
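
For readers without the codebase at hand, the same idea can be sketched with only the standard library. This is an illustration under assumptions, not the simulator's code: `Shutdown` is a hypothetical stand-in for the `Trigger` from the triggered crate, threads stand in for tokio tasks, and `produce_results` and `spawn_with_shutdown` are made-up names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the `Trigger` half of the `triggered` crate.
#[derive(Clone, Default)]
pub struct Shutdown(Arc<AtomicBool>);

impl Shutdown {
    pub fn trigger(&self) {
        self.0.store(true, Ordering::SeqCst);
    }

    pub fn is_triggered(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Hypothetical task body: it knows nothing about shutdown and only
/// reports success or failure through its return value.
pub fn produce_results(fail: bool) -> Result<usize, String> {
    if fail {
        Err("results channel closed".to_string())
    } else {
        Ok(3)
    }
}

/// Spawn site: the only place that triggers shutdown and logs the error.
pub fn spawn_with_shutdown<F>(shutdown: Shutdown, task: F) -> JoinHandle<()>
where
    F: FnOnce() -> Result<usize, String> + Send + 'static,
{
    thread::spawn(move || {
        if let Err(e) = task() {
            shutdown.trigger();
            eprintln!("task exited with error: {e:?}.");
        }
    })
}
```

Because `produce_results` returns a plain `Result`, a test can assert on it directly without constructing any shutdown machinery.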

@carlaKC
Contributor

carlaKC commented Feb 2, 2024

Can be rebased on #160, and sincerest apologies in advance for all the rebase conflicts :')

@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 26a6378 to a86e3f6 Compare February 5, 2024 13:05
@enigbe
Contributor Author

enigbe commented Feb 5, 2024

Discussed this PR a little more offline. Another option for a cleaner/more testable solution would be:

* Refactor to return `Result` from functions that can trigger shutdown

* Call `trigger` at spawn site rather than inside of the function

This saves us from having to pass shutdown all the way down to every task, and cuts down on the number of places where we need to call trigger. Also has the benefit of making some of these functions more testable, because we can assert on return values.

Eg, for our simulation results task:

        tasks.spawn(async move {
            if let Err(e) =
                produce_simulation_results(nodes, output_receiver, results_sender, listener_results)
                    .await
            {
                shutdown.trigger();
                log::error!("produce simulation results exited with error: {e:?}.");
            }
        });

I have refactored the handling of errors across all tasks. This reduces the number of places where trigger() can be called and streamlines the logic. However, I think we can reduce the calls to trigger() further, to a single location.

In the snippet from sim-lib/lib.rs::run below, we await the completion of all tasks in the join set tasks. The error from the first task failure can be propagated up to this point, where we call trigger() (and break out of the loop) to shut down all listening tasks in the set.

        while let Some(res) = tasks.join_next().await {
            if let Err(e) = res {
                // log::error!("Task exited with error: {e}.");
                // success = false;
                self.shutdown();
                break;
            }
        }

This would mean we need a second loop that waits for all tasks to exit after trigger(), so that shutdown is graceful. I'm uncertain whether this is a good approach and happy to get thoughts on it.
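
A rough, std-only sketch of the single-trigger idea follows. It assumes a variation on the snippet above: rather than breaking and then looping again, it keeps receiving after the first error so the remaining tasks get a chance to exit. Threads replace tokio tasks, an mpsc completion channel stands in for `join_next()`, and `run_all` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Runs `task_count` hypothetical workers; the worker whose index equals
/// `failing` (if any) returns an error. Returns (success, tasks exited).
pub fn run_all(task_count: usize, failing: Option<usize>) -> (bool, usize) {
    let shutdown = Arc::new(AtomicBool::new(false));
    // Completion channel: a stand-in for `JoinSet::join_next`.
    let (done_tx, done_rx) = mpsc::channel::<Result<(), String>>();

    let mut handles = Vec::new();
    for id in 0..task_count {
        let shutdown = Arc::clone(&shutdown);
        let done_tx = done_tx.clone();
        handles.push(thread::spawn(move || {
            let result = if Some(id) == failing {
                Err(format!("task {id} failed"))
            } else {
                // Healthy workers do bounded work and stop early on shutdown.
                for _ in 0..20 {
                    if shutdown.load(Ordering::SeqCst) {
                        break;
                    }
                    thread::sleep(Duration::from_millis(1));
                }
                Ok(())
            };
            let _ = done_tx.send(result);
        }));
    }
    // Drop our own sender so `recv` ends once every worker has reported.
    drop(done_tx);

    let mut success = true;
    let mut exited = 0;
    while let Ok(result) = done_rx.recv() {
        exited += 1;
        if let Err(e) = result {
            eprintln!("Task exited with error: {e}.");
            success = false;
            // The single trigger location for the whole run.
            shutdown.store(true, Ordering::SeqCst);
        }
    }
    for handle in handles {
        let _ = handle.join();
    }
    (success, exited)
}
```

The trade-off is that every task must still observe the shutdown signal itself; the loop only decides when to raise it.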

@enigbe enigbe marked this pull request as ready for review February 5, 2024 13:41
@enigbe enigbe requested a review from carlaKC February 5, 2024 13:41
Contributor

@carlaKC carlaKC left a comment

vrrynice. Just some annoying nitpicking about logging and breaking from me, I think this is almost there!

Only major comment is about cleaning up the result writer functions a bit, but that's pre-existing.

1. [Triggered](https://docs.rs/triggered/latest/triggered): a `Trigger`
that can be used to inform threads that it's time to shut down, and
a `Listener` that propagates this signal.
2. The (`Trigger`, `Listener`) pair are used with channels: if a channel errors out across `send()` or `receive()`, shutdown is triggered. There is no reliance on channel mechanics, i.e. errors generated when all senders are and/or a receiver is dropped.
Contributor

I would clarify this to note that our channels have very small buffers (size 1), so we have to select on the shutdown signal on send/receive to make sure that a sender doesn't hang indefinitely when its receiver exits first.

Ie, this scenario:

Task 1: Sending into sender
Task 2: Receiving on receiver

  • Task 1 sends into sender, unblocks due to buffer size of 1 that we use everywhere
  • Task 2 errors out before consuming from receiver
  • Task 1 wants to send into sender again, but can't because the receiver has shut down

If we always select on exit, then we don't run into this.
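
A minimal std-only sketch of the scenario, under stated assumptions: `sync_channel(1)` stands in for the bounded tokio channel, polling with `try_send` stands in for `select!` on the listener, and `SendOutcome`, `send_or_shutdown` and `demo_stalled_receiver` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum SendOutcome {
    Sent,
    Shutdown,
    ReceiverDropped,
}

/// Tries to send, but gives up as soon as shutdown has been signalled.
pub fn send_or_shutdown<T>(
    sender: &SyncSender<T>,
    mut value: T,
    shutdown: &AtomicBool,
) -> SendOutcome {
    loop {
        if shutdown.load(Ordering::SeqCst) {
            return SendOutcome::Shutdown;
        }
        match sender.try_send(value) {
            Ok(()) => return SendOutcome::Sent,
            Err(TrySendError::Disconnected(_)) => return SendOutcome::ReceiverDropped,
            Err(TrySendError::Full(returned)) => {
                // Buffer is full: keep the value and poll again shortly.
                value = returned;
                thread::sleep(Duration::from_millis(1));
            }
        }
    }
}

/// Task 2 stops consuming without dropping its receiver; Task 1 sends twice.
pub fn demo_stalled_receiver() -> (SendOutcome, SendOutcome) {
    let (sender, _stalled_receiver) = sync_channel::<u32>(1);
    let shutdown = Arc::new(AtomicBool::new(false));

    // First send fits in the buffer of size 1.
    let first = send_or_shutdown(&sender, 1, &shutdown);

    // Something else triggers shutdown a little later.
    let trigger = Arc::clone(&shutdown);
    let handle = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        trigger.store(true, Ordering::SeqCst);
    });

    // A plain blocking `send` would wait forever here; this one returns.
    let second = send_or_shutdown(&sender, 2, &shutdown);
    handle.join().unwrap();
    (first, second)
}
```

In the async code the polling loop would not be needed, since `select!` can wait on the send and the listener at the same time.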

@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 452d1da to d8ff458 Compare February 24, 2024 07:58
@enigbe enigbe requested a review from carlaKC February 27, 2024 14:06
Contributor

@carlaKC carlaKC left a comment

Thanks for addressing comments, think this only needs one more round!

Remaining comments are really about logging consistency: if we're logging an exit error at a function's call site, there's no need to also log on error return (we'd double log). Would also like to have all the starting/stopping logs moved into the spawn as well.

Comment on lines +1065 to +1112
set.spawn(track_payment_result(
source_node.clone(), results.clone(), payment, listener.clone()
));
Contributor

Should we handle errors returned by track_payment_result and trigger shutdown here?

Contributor Author

Yes, we should. I refactored produce_simulation_results to include an additional branch to wait on concurrently. Within this branch, we propagate any track_payment_result error to produce_simulation_results and trigger shutdown at the latter's spawn site.

Contributor

Nice, realize that we weren't actually waiting on that set at all before this ☠️

This method is an interesting one (/different to our other ones) because it has its own set of tasks that it should wait on. As is, if we get the shutdown listener signal we won't wait for all the spawned payment tracking tasks to complete (which is messy).

Don't need to update in this PR, let's gettit in, but note to self to create an issue/fix this up in future!

Contributor Author

I've taken note of this and will create an issue for better handling of exits on all tasks spawned in set.

@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 9a8a3c8 to b84ce8d Compare February 29, 2024 11:33
@enigbe enigbe requested a review from carlaKC March 1, 2024 10:36
Contributor

@carlaKC carlaKC left a comment

alrightalrightalright!

Actually, actually last comments:

  1. We still need to flush our batched writer to disk on shutdown
  2. Take a look at line wrapping at 120 (I think that a few places are over; feel free to shorten error messages)

Otherwise, you can go ahead and squash the fixups and we'll merge. Nice stuff 🏅

@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 3ddaef0 to d0be6ec Compare March 11, 2024 16:36
@enigbe enigbe requested a review from carlaKC March 11, 2024 16:36
@carlaKC
Contributor

carlaKC commented Mar 11, 2024

Testing this and it looks like TrackPayment doesn't run anymore and the simulator hangs on shutdown.

Will take a look at the code, but iirc this is a regression since my last review (I tested last time and this was fine).

@enigbe
Contributor Author

enigbe commented Mar 11, 2024

That's weird. Taking another look right now.

log::error!("Event consumer exited with error: {e:?}.");
},
let consume_event_node = ce_node.clone();
let node_guard = ce_node.lock().await;
Contributor

This is deadlocking with the lock in consume_events!

Contributor

@carlaKC carlaKC Mar 12, 2024

lol I hate programming, can fix this with:

tasks.spawn(async move {
    let node_info = ce_node.lock().await.get_info().clone();

    log::debug!("Starting events consumer for {}.", node_info);
    if let Err(e) =
        consume_events(ce_node, receiver, ce_output_sender, ce_listener).await
    {
        ce_shutdown.trigger();
        log::error!("Event consumer exited with error: {e:?}.");
    } else {
        log::debug!("Event consumer for node {node_info} received shutdown signal.");
    }
});

Problem is that the node_guard lock doesn't get dropped until after consume_events (because we're still borrowing node_info in the log). Had to crack out le old rust book for that one.
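
The guard-lifetime difference can be shown with a small std-only sketch. It rests on assumptions: `std::sync::Mutex` replaces the async mutex, `try_lock` is used so the blocked case returns `false` instead of hanging, and `Node`, `can_lock`, `guard_held_across_call` and `clone_then_call` are hypothetical names.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical node type; only `info` matters for the sketch.
pub struct Node {
    pub info: String,
}

/// Stand-in for `consume_events`, which needs to take the node lock itself.
/// Returns whether the lock could be acquired.
pub fn can_lock(node: &Arc<Mutex<Node>>) -> bool {
    node.try_lock().is_ok()
}

/// The guard is bound to a name, so it is only released when it goes out of
/// scope. The callee therefore cannot take the lock.
pub fn guard_held_across_call(node: &Arc<Mutex<Node>>) -> bool {
    let node_guard = node.lock().unwrap();
    let callee_got_lock = can_lock(node);
    println!("Finished consumer for {}.", node_guard.info);
    callee_got_lock
}

/// The guard is a temporary that is dropped at the end of the statement,
/// leaving only an owned clone of the info. The callee can take the lock.
pub fn clone_then_call(node: &Arc<Mutex<Node>>) -> bool {
    let node_info = node.lock().unwrap().info.clone();
    let callee_got_lock = can_lock(node);
    println!("Finished consumer for {node_info}.");
    callee_got_lock
}
```

With a blocking or awaited lock in the callee, the first variant is the one that would hang.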

@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 581eab9 to 46b4713 Compare March 13, 2024 17:26
enigbe added 3 commits March 14, 2024 08:52
- additionally, remove every `unwrap()` call that
could panic, replacing with error propagation
and/or context with `expect()`

- return Result<(), SimulationError> for all
spawned tasks

- handles triggering shutdown at call site for
spawned tasks

- move starting/stopping logs to spawn site
@enigbe enigbe force-pushed the refactor-use-listen-and-trigger-universally branch from 4b08725 to 5aa5f65 Compare March 14, 2024 07:58
@enigbe enigbe requested a review from carlaKC March 14, 2024 08:02
Contributor

@carlaKC carlaKC left a comment

tACK at 5aa5f65.

Some minor comments that can be addressed in a followup if desired.

}

log::trace!("Payment result tracker exiting.");
log::trace!("Result tracking complete. Payment result tracker exiting.");
Contributor

nit: logging at call site, not in function as with others.

@@ -985,12 +1045,14 @@ impl Display for PaymentResultLogger {
}
}

/// Reports a summary of payment results at a duration specified by `interval`
/// Note that `run_results_logger` does not error in any way, thus it has no
/// trigger. It listens for triggers to ensure clean exit.
Contributor

nit: line wrap at 120

Comment on lines +979 to +982
writer.map_or(Ok(()), |(ref mut w, _)| w.flush().map_err(|_| {
SimulationError::FileError
}))?;
return Ok(());
Contributor

Don't need ? + return Ok(()) -> can just return writer.map_or...
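
Roughly, the suggestion amounts to the shape below. This is a sketch under assumptions: the writer is taken to be an `Option` of a writer paired with some second field (typed as `usize` here purely as a guess), `SimulationError` is redefined locally with only the `FileError` variant, and `flush_on_shutdown` and `BrokenWriter` are hypothetical names.

```rust
use std::io::{self, Write};

/// Local stand-in for the simulator's error type; only one variant is needed.
#[derive(Debug, PartialEq)]
pub enum SimulationError {
    FileError,
}

/// Flushes the optional writer and returns the result directly,
/// with no `?` followed by a separate `return Ok(())`.
pub fn flush_on_shutdown<W: Write>(writer: Option<(W, usize)>) -> Result<(), SimulationError> {
    writer.map_or(Ok(()), |(mut w, _)| {
        w.flush().map_err(|_| SimulationError::FileError)
    })
}

/// Hypothetical writer whose flush always fails, to exercise the error path.
pub struct BrokenWriter;

impl Write for BrokenWriter {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Err(io::Error::new(io::ErrorKind::Other, "disk unavailable"))
    }
}
```

Both forms behave the same; the direct return just removes one early-exit path to read past.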

/results
Contributor

nit: newline still here - might be editor automatically adding it

Comment on lines +802 to +806
let source = executor.source_info.clone();

log::info!(
"Starting activity producer for {}: {}.",
source,
Contributor

nit: can just inline + clone because we don't have any locks here

None => return Ok(())
}
},
track_payment = set.join_next() => {
Contributor

nit: add a TODO explaining that we're not going to wait for all tasks to exit, just so we don't forget

Development

Successfully merging this pull request may close these issues.

Robustness: Use Listen and Shutdown Trigger Universally
3 participants