
Add transform hierarchy propagation benchmark #9442

Open · wants to merge 2 commits into base: main

Conversation

@hesiod (Contributor) commented Aug 14, 2023

Objective

  • In one of my projects, a lot of time was wasted performing hierarchy propagation (both transforms and visibility) even when there were no changes at all. After digging into the code, I found that the hierarchy propagation systems unconditionally traverse the entire hierarchy every frame, which results in serious performance degradation as the number of entities in the hierarchy goes up. I found only limited prior discussion on this (e.g. Propagation requires full hierarchy traversal #7840).
  • In order to enable better discussion and comparison of any improvements to the hierarchy propagation systems, benchmarks or other ways to measure propagation performance are required.

Solution

  • This PR adds a new benchmark that tests the transform propagation performance for various hierarchies.
  • The test data generation is essentially copied from examples/stress_tests/transform_hierarchy.rs, with the notable difference that the benchmark uses a fixed seed and a (hopefully) deterministic RNG seeding path, so that benchmarks are comparable from run to run (see the sketch below). Also, setup uses a &mut World directly instead of running as a Startup system.
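
For illustration, a fixed-seed setup might look like this minimal sketch (assuming the rand and rand_chacha crates; the exact seeding path used in the PR may differ):

use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

// A fixed seed makes the generated hierarchy identical across runs,
// so benchmark results are directly comparable from run to run.
let mut rng = ChaCha8Rng::seed_from_u64(42);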

Review Considerations

  • In theory, code could be reused between the benchmark and the stress test example; however, due to the current workspace layout (benchmarks are completely separate from the rest of the project), I don't think this is possible without a lot of work, or alternatively some ugly hack based on include! or similar methods.
  • I spent minimal time thinking about names. If you can come up with a better name for something, please share it.
  • One benchmark uses UnsafeWorldCell. I think it should be fine, but someone with a better understanding of UnsafeWorldCell might want to take a look at it.
  • Should this replace the stress test entirely? I'm not sure. To me personally, the stress test is useless for testing propagation performance because of the run-to-run variations and the fact that it simply runs endlessly, but perhaps others use it differently.

@github-actions (Contributor)

Example alien_cake_addict failed to run, please try running it locally and check the result.

@nicopap self-requested a review August 15, 2023 05:18
@nicopap added labels on Aug 15, 2023: C-Performance (a change motivated by improving speed, memory usage or compile times), A-Transform (translations, rotations and scales), A-Hierarchy (parent-child entity hierarchies)
@nicopap (Contributor) left a comment

So. What do you see when you run the benchmarks?

If you run the benchmark several times without changing anything, how much difference is there between runs? How noisy is it?

Comment on lines 197 to 198
group.warm_up_time(std::time::Duration::from_secs(2));
group.measurement_time(std::time::Duration::from_secs(10));
Contributor:

Nit: could be extracted into constants.
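
For instance, a hypothetical sketch of that extraction:

use std::time::Duration;

const WARM_UP_TIME: Duration = Duration::from_secs(2);
const MEASUREMENT_TIME: Duration = Duration::from_secs(10);

group.warm_up_time(WARM_UP_TIME);
group.measurement_time(MEASUREMENT_TIME);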

group.bench_function("transform_init", |b| {
    // Building the World (in setup) takes a lot of time, so we shouldn't do that on every
    // iteration.
    // Unfortunately, we can't re-use an App directly in iter(); the World would no longer be
    // in its initial state.
Contributor:

I can confirm there is no better way of setting things up yet.

benches/benches/bevy_transform/transform_hierarchy.rs (outdated comment, resolved)
Comment on lines 224 to 229
    unsafe { cell.world_mut() }.run_schedule(ResetSchedule);

    cell
},
|cell| {
    unsafe { cell.world_mut() }.run_schedule(bevy_app::Main);
Contributor:

Could you add safety comments here? This requires both closures to never run concurrently. Do you know this for a fact?

hesiod (Contributor Author):

This should be the case. However, I've thought this over a little and simply replaced the version using iter_batched/UnsafeWorldCell with iter_custom, which doesn't require using UnsafeWorldCell here. The only downside is that it's harder to use custom Measurement types with iter_custom, but they aren't currently used in any bevy benchmark.
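
A rough sketch of the iter_custom version (ResetSchedule comes from the PR; the timing-loop structure is an assumption based on criterion's Bencher::iter_custom API, which expects the closure to return the measured Duration):

use std::time::{Duration, Instant};

b.iter_custom(|iters| {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // Reset outside the timed region; no UnsafeWorldCell is needed
        // because we have exclusive access to the World here.
        app.world.run_schedule(ResetSchedule);
        let start = Instant::now();
        app.world.run_schedule(bevy_app::Main);
        total += start.elapsed();
    }
    total
});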

benches/benches/bevy_transform/transform_hierarchy.rs (outdated comment, resolved)
if enable_update {
    app
        .add_plugins(TimePlugin)
        // Updating transforms *must* be done before `CoreSet::PostUpdate`
Contributor:

There is no such thing as a CoreSet anymore. Where does this come from?

hesiod (Contributor Author):

This is part of the code I simply copied over from the stress test, so it's probably an oversight from that change:

// Updating transforms *must* be done before `CoreSet::PostUpdate`
// or the hierarchy will momentarily be in an invalid state.

I removed the comment in the benchmark (but not in the example).

// Run Main schedule once to ensure initial updates are done
app.update();

b.iter(move || { app.update(); });
Contributor:

This also measures the update system runtime, right?

hesiod (Contributor Author):

No, the names are a little misleading here: App::update is simply what usually gets called by the app.runner (the closure set by the ScheduleRunnerPlugin/winit), and in this case it simply runs the Main schedule, since the benchmark app doesn't have any subapps.
The initial updates I'm referring to in the first comment are both whatever happens in the Startup schedule and the first time the propagations run in PostUpdate. I didn't specify that in the comment as I'm trying to keep the benchmark implementation-agnostic; perhaps the propagation systems will run in a different Schedule in the future.

hesiod (Contributor Author):

Sorry, just realized I misread your comment: Yes, the update system runtime is included here.

@nicopap (Contributor) commented Aug 15, 2023

IMO we should get rid of the stress_test/transform_propagation.rs in favor of this.

@hesiod marked this pull request as ready for review August 15, 2023 11:16
@github-actions (Contributor)

Example alien_cake_addict failed to run, please try running it locally and check the result.

@nicopap self-requested a review August 15, 2023 11:30
@nicopap (Contributor) left a comment

It's a good idea to benchmark the transform propagation. The stress_test example didn't help at all. I ran the benchmark locally and found it very slow, but not noisy at all, so that's also pretty good.

What I'd like to see:

  • Do not rely on App::update for updates, but rather build a schedule with the two transform systems and measure their aggregate runtime (see the sketch below). Note that this requires manually advancing the world tick (outside of measurement).
  • Make the update system use set_changed() instead of changing the value.
  • Remove the reference bench.

I would really like to keep the benchmarks focused. It both makes the benchmarks faster and easier to interpret.

Does this make sense?
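
For illustration, a minimal sketch of the schedule-based setup from the first bullet above (assuming bevy 0.11-era names, where sync_simple_transforms and propagate_transforms are the two transform systems in bevy_transform, and world stands for the benchmark's pre-generated World):

use bevy_ecs::schedule::Schedule;
use bevy_transform::systems::{propagate_transforms, sync_simple_transforms};

let mut schedule = Schedule::default();
schedule.add_systems((sync_simple_transforms, propagate_transforms));

// Clear change trackers outside the measured region so change
// detection sees realistic ticks on each run.
world.clear_trackers();

// Measured: one run of just the two propagation systems.
schedule.run(&mut world);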

"chain",
Cfg {
test_case: TestCase::Tree {
depth: 2500,
Contributor:

Fairly certain this needs to be multiplied by 5. Ideally we'd have about the same number of Entity per test case, in order to compare the transform propagation behavior more easily.

hesiod (Contributor Author):

I had that idea as well, but due to the recursive implementation of propagation updates, setting the depth too high will exhaust the stack.
On my system, the benchmark simply crashes when setting depth to 50000. 5000 works, but I haven't tested any further depth values yet.


/// This benchmark tries to measure the cost of the initial transform propagation,
/// i.e. the first time transform propagation runs after we just added all our entities.
fn transform_init(c: &mut Criterion) {
Contributor:

Regarding naming: I think it's a bit misleading to call it "initial propagation". I'd call it "full propagation", e.g. transform_complete_propagation.

Comment on lines 345 to 355
/// update component with some per-component value
#[derive(Component)]
struct UpdateValue(f32);

/// update positions system
fn update(time: Res<Time>, mut query: Query<(&mut Transform, &mut UpdateValue)>) {
    for (mut t, mut u) in &mut query {
        u.0 += time.delta_seconds() * 0.1;
        set_translation(&mut t.translation, u.0);
    }
}
Contributor:

UpdateValue could be a simple marker component, and update could just call t.set_changed() (see the sketch below). IMO this is better, as it avoids the overhead of the trigonometric functions, setting the change-detection flag on UpdateValue, and loading/storing values to table storage.

We are only interested in how transform propagation behaves. Any other code adding to the runtime is noise.
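
Something like this sketch of the suggested change (hypothetical; With and the set_changed method come from bevy_ecs change detection):

use bevy_ecs::prelude::*;
use bevy_transform::prelude::Transform;

/// Plain marker: nothing to load from or store to table storage.
#[derive(Component)]
struct UpdateValue;

/// Flags every marked Transform as changed without mutating it,
/// so only the propagation cost shows up in the measurement.
fn update(mut query: Query<&mut Transform, With<UpdateValue>>) {
    for mut t in &mut query {
        t.set_changed();
    }
}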

@hesiod (Contributor Author) commented Aug 20, 2023

@nicopap Thanks for all your feedback, it's been very helpful. I've refactored the overall benchmark structure once more in order to hopefully make the benchmarks more useful. I didn't follow all your suggestions precisely, but I think I addressed most of the issues. Please let me know what you think.

Overview of the changes:

  • I've split the code into multiple modules:
    • hierarchy::init contains the "initialization" benchmarks (proper name TBD)
    • hierarchy::propagation contains the "update" benchmarks (again, proper name TBD)
    • world_gen contains the World generation and updating machinery
  • I've tried to separate the actual transform propagation runtime in the PostUpdate schedule from the simulated updates in the Update schedule and all the other schedules. The implementation is in the update_bench_postupdate_only function. Essentially, the function removes the PostUpdate and Last schedules from the Main schedule so that they're no longer available in Schedules, resulting in Main only running the schedules up to Update. It then runs PostUpdate manually and measures only the time spent in PostUpdate, ignoring the time spent in the other schedules (see the sketch after this list).
    Since the PostUpdate schedule should more or less only contain the relevant propagation systems, this should be equivalent to creating a custom schedule with just these two systems, with the benefit of not coupling the benchmarks tightly to the exact propagation systems.
  • I've added another group of benchmarks, transform_hierarchy_sizes, that measure the same basic configuration with either increasing depth (large/deep configuration) or increasing branch width (wide configuration). This allows Criterion to generate nice plots showing how the runtime depends on the number of nodes (which is currently mostly linear).
    The other benchmark group, transform_hierarchy_configurations, goes through all configurations that were previously present in the stress test without changing any sizes (same as in the initial version of the PR).
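
As a rough sketch of the schedule-splitting approach described above (the real update_bench_postupdate_only is in the PR; the exact API calls here are assumptions based on bevy 0.11, where Main runs its sub-schedules via try_run_schedule and silently skips missing ones):

use std::time::{Duration, Instant};
use bevy_app::{Last, Main, PostUpdate};
use bevy_ecs::schedule::Schedules;

// Take PostUpdate and Last out of the Schedules resource; running
// Main afterwards then only executes the schedules up to Update.
let mut schedules = app.world.resource_mut::<Schedules>();
let mut post_update = schedules.remove(&PostUpdate).unwrap();
schedules.remove(&Last);

b.iter_custom(|iters| {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // Untimed: simulated transform updates in Update and earlier.
        app.world.run_schedule(Main);
        // Timed: only the propagation systems in PostUpdate.
        let start = Instant::now();
        post_update.run(&mut app.world);
        total += start.elapsed();
        app.world.clear_trackers();
    }
    total
});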

Some open questions:

  • I've retained the prior, simple version of the propagation benchmark (function update_bench_reference) that doesn't mess with schedules and simply runs app.update. I think this might be helpful in the future to determine whether the more complicated update_bench_postupdate_only misbehaves (e.g. if the Main schedule definition changes in the future). On the other hand, it's not truly necessary and could also be removed.
  • Some conceivable configurations aren't benchmarked currently, e.g. the humanoids configurations aren't included in transform_hierarchy_sizes. I've decided against this as the number of benchmarks is already quite large, meaning the benchmark time is also quite long.

@nicopap If I understood you correctly, you're suggesting that transform_init can be removed, right? I'm actually wondering whether the transform_init benchmarks could be removed entirely. They're not really measuring a very critical or specific (and thus actionable) operation.

@hesiod (Contributor Author) commented Aug 20, 2023

Here are some example results from transform_propagation_large (from a slightly older version, when the benchmark prefix was still transform_propagation_sizes):

  • Violin plot for different node counts (violin plot image)
  • Comparison of noop vs updates (line plot image)

Note: These benchmarks were done on a potato CPU compared to current models (an Intel i5-7200U), which I manually downclocked a bit to ensure thermal throttling doesn't influence the results too much. As a result, the absolute numbers might look a little worse than you'd expect.

@nicopap (Contributor) left a comment

This looks good to me. Nice job. To me the only change necessary is getting rid of UPDATE_BENCH_POSTUPDATE_ONLY.

Also, we run cargo fmt on source code to keep a consistent style.

Comment on lines +122 to +127
// Measures hierarchy propagation systems when some transforms are updated.
group.bench_with_input(id("updates"), &(cfg, TransformUpdates::Enabled), inner_update_bench);

// Measures hierarchy propagation systems when there are no changes
// during the Update schedule.
group.bench_with_input(id("noop"), &(cfg, TransformUpdates::Disabled), inner_update_bench);
Contributor:

I suggest replacing:

  • "updates" with "transform_updates_enabled"
  • "noop" with "transform_updates_disabled"

So the relationship between the benchmark result and the benchmark source code is a bit more evident.
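
Applied to the snippet above, the suggestion would read:

// Measures hierarchy propagation systems when some transforms are updated.
group.bench_with_input(id("transform_updates_enabled"), &(cfg, TransformUpdates::Enabled), inner_update_bench);

// Measures hierarchy propagation systems when there are no changes
// during the Update schedule.
group.bench_with_input(id("transform_updates_disabled"), &(cfg, TransformUpdates::Disabled), inner_update_bench);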

@@ -0,0 +1,4 @@
pub mod init;

Contributor:
Suggested change

/// since the benchmark implementation is a little fragile and rather slow (see comments below).
/// They're included here nevertheless in case they're useful.
fn transform_init(c: &mut Criterion) {
    let mut group = c.benchmark_group("transform_init");
Contributor:

I still think this benchmark is useful. It gives an idea of the behavior in worst-case situations (which do exist in games, e.g. when spawning a new level or complex models).

The computational cost of removing many entities was a factor in reverting a change (see #5423 (comment)). This means we care about the computational cost of this sort of thing.

I think that "full recomputation" or something similarly descriptive could be a better name, though.

Comment on lines +62 to +65
std::hint::black_box({
    last.run(&mut app.world);
    app.world.clear_trackers();
});
Contributor:

Not sure the black_box is necessary here.

Comment on lines +90 to +91
fn inner_update_bench(b: &mut Bencher<WallTime>, bench_cfg: &(&Cfg, TransformUpdates)) {
    const UPDATE_BENCH_POSTUPDATE_ONLY: bool = false;
Contributor:

It's a weird way of selecting benchmarks, and it's inconsistent with how the transform_init benchmark works.

In my opinion "reference" benchmarks should be removed from this PR, since (from my testing) they only add a constant overhead.

But it's OK if they stay in the PR as long as they are consistently declared :P

@rparrett (Contributor) commented Sep 11, 2023

Should this replace the stress test entirely? I'm not sure. To me personally, the stress test is useless for testing propagation performance because of the run-to-run variations and the fact that it simply runs endlessly, but perhaps others use it differently.

In my opinion, yes. See #7433 where I was unable to discover how that example is meant to be useful.

Doesn't need to be done here, just wanted to link the issue up.

@james7132 self-requested a review October 2, 2023 01:15
@BenjaminBrienen added labels on Jan 23, 2025: D-Modest (a "normal" level of difficulty; suitable for simple features or challenging fixes), S-Waiting-on-Author (the author needs to make changes or address concerns before this can be merged)