
Add transform hierarchy propagation benchmark #9442

Open · wants to merge 2 commits into base: main

Conversation

@hesiod (Contributor) commented Aug 14, 2023

Objective

  • In one of my projects, a lot of time was wasted performing hierarchy propagation (both transforms and visibility) even when there were no changes at all. After digging into the code, I found that the hierarchy propagation systems unconditionally traverse the entire hierarchy every frame, which results in serious performance degradation as the number of entities in the hierarchy goes up. I found only limited prior discussion on this (e.g. Propagation requires full hierarchy traversal #7840).
  • In order to enable better discussion and comparison of any improvements to the hierarchy propagation systems, benchmarks or other ways to measure propagation performance are required.

Solution

  • This PR adds a new benchmark that tests the transform propagation performance for various hierarchies.
  • The test data generation is essentially copied from examples/stress_tests/transform_hierarchy.rs, with the notable difference that the benchmark uses a fixed seed and a (hopefully) deterministic RNG seeding path, so that benchmarks are comparable from run to run (see the sketch below). Also, setup uses a &mut World directly instead of running as a Startup system.
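
For illustration, a fixed-seed setup might look like this minimal sketch (assuming the rand and rand_chacha crates; the exact seeding path used in the PR may differ):

use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

// A fixed seed makes the generated hierarchy identical across runs,
// so benchmark results are directly comparable from run to run.
let mut rng = ChaCha8Rng::seed_from_u64(42);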

Review Considerations

  • In theory, code could be reused between the benchmark and the stress test example; however, due to the current workspace layout (benchmarks are completely separate from the rest of the project), I don't think this is possible without a lot of work, or alternatively some ugly hack based on include! or similar methods.
  • I spent minimal time thinking about names. If you can come up with a better name for something, please share it.
  • One benchmark uses UnsafeWorldCell. I think it should be fine, but someone with a better understanding of UnsafeWorldCell might want to take a look at it.
  • Should this replace the stress test entirely? I'm not sure. To me personally, the stress test is useless for testing propagation performance because of the run-to-run variations and the fact that it simply runs endlessly, but perhaps others use it differently.

@github-actions (Contributor)

Example alien_cake_addict failed to run, please try running it locally and check the result.

@nicopap self-requested a review August 15, 2023 05:18
@nicopap added labels on Aug 15, 2023: C-Performance (a change motivated by improving speed, memory usage or compile times), A-Transform (translations, rotations and scales), A-Hierarchy (parent-child entity hierarchies)
@nicopap (Contributor) left a comment

So. What do you see when you run the benchmarks?

If you run the benchmark several times without changing anything, how much difference is there between runs? How noisy is it?

Comment on lines 197 to 198
group.warm_up_time(std::time::Duration::from_secs(2));
group.measurement_time(std::time::Duration::from_secs(10));
Contributor:

Nit: could be extracted into constants.
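
For instance, a hypothetical sketch of that extraction:

use std::time::Duration;

const WARM_UP_TIME: Duration = Duration::from_secs(2);
const MEASUREMENT_TIME: Duration = Duration::from_secs(10);

group.warm_up_time(WARM_UP_TIME);
group.measurement_time(MEASUREMENT_TIME);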

group.bench_function("transform_init", |b| {
    // Building the World (in setup) takes a lot of time, so we shouldn't do that on every
    // iteration.
    // Unfortunately, we can't re-use an App directly in iter(); the World would no longer be
    // in its initial state.
Contributor:

I can confirm there is no better way of setting things up yet.

benches/benches/bevy_transform/transform_hierarchy.rs (outdated comment, resolved)
Comment on lines 224 to 229
    unsafe { cell.world_mut() }.run_schedule(ResetSchedule);

    cell
},
|cell| {
    unsafe { cell.world_mut() }.run_schedule(bevy_app::Main);
Contributor:

Could you add safety comments here? This requires both closures to never run concurrently. Do you know this for a fact?

hesiod (Contributor Author):

This should be the case. However, I've thought this over a little and simply replaced the version using iter_batched/UnsafeWorldCell with iter_custom, which doesn't require using UnsafeWorldCell here. The only downside is that it's harder to use custom Measurement types with iter_custom, but they aren't currently used in any bevy benchmark.
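
A rough sketch of the iter_custom version (ResetSchedule comes from the PR; the timing-loop structure is an assumption based on criterion's Bencher::iter_custom API, which expects the closure to return the measured Duration):

use std::time::{Duration, Instant};

b.iter_custom(|iters| {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // Reset outside the timed region; no UnsafeWorldCell is needed
        // because we have exclusive access to the World here.
        app.world.run_schedule(ResetSchedule);
        let start = Instant::now();
        app.world.run_schedule(bevy_app::Main);
        total += start.elapsed();
    }
    total
});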

benches/benches/bevy_transform/transform_hierarchy.rs (outdated comment, resolved)
if enable_update {
    app
        .add_plugins(TimePlugin)
        // Updating transforms *must* be done before `CoreSet::PostUpdate`
Contributor:

There is no such thing as a CoreSet anymore. Where does this come from?

hesiod (Contributor Author):

This is part of the code I simply copied over from the stress test, so it's probably an oversight from that change:

// Updating transforms *must* be done before `CoreSet::PostUpdate`
// or the hierarchy will momentarily be in an invalid state.

I removed the comment in the benchmark (but not in the example).

// Run Main schedule once to ensure initial updates are done
app.update();

b.iter(move || { app.update(); });
Contributor:

This also measures the update system runtime, right?

hesiod (Contributor Author):

No, the names are a little misleading here: App::update is simply what usually gets called by the app.runner (the closure set by the ScheduleRunnerPlugin/winit), and in this case it simply runs the Main schedule, since the benchmark app doesn't have any subapps.
The initial updates I'm referring to in the first comment are both whatever happens in the Startup schedule and the first time the propagations run in PostUpdate. I didn't specify that in the comment as I'm trying to keep the benchmark implementation-agnostic; perhaps the propagation systems will run in a different Schedule in the future.

hesiod (Contributor Author):

Sorry, just realized I misread your comment: Yes, the update system runtime is included here.

@nicopap (Contributor) commented Aug 15, 2023

IMO we should get rid of the stress_test/transform_propagation.rs in favor of this.

@hesiod marked this pull request as ready for review August 15, 2023 11:16
@github-actions (Contributor)

Example alien_cake_addict failed to run, please try running it locally and check the result.

@nicopap self-requested a review August 15, 2023 11:30
@nicopap (Contributor) left a comment

It's a good idea to benchmark the transform propagation. The stress_test example didn't help at all. I ran the benchmark locally and found it very slow, but not noisy at all, so that's also pretty good.

What I'd like to see:

  • Do not rely on App::update for updates, but rather build a schedule with the two transform systems and measure their aggregate runtime (see the sketch below). Note that this requires manually advancing the world tick (outside of measurement).
  • Make the update system use set_changed() instead of changing the value.
  • Remove the reference bench.

I would really like to keep the benchmarks focused. It both makes the benchmarks faster and easier to interpret.

Does this make sense?
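
For illustration, a minimal sketch of the schedule-based setup from the first bullet above (assuming bevy 0.11-era names, where sync_simple_transforms and propagate_transforms are the two transform systems in bevy_transform, and world stands for the benchmark's pre-generated World):

use bevy_ecs::schedule::Schedule;
use bevy_transform::systems::{propagate_transforms, sync_simple_transforms};

let mut schedule = Schedule::default();
schedule.add_systems((sync_simple_transforms, propagate_transforms));

// Clear change trackers outside the measured region so change
// detection sees realistic ticks on each run.
world.clear_trackers();

// Measured: one run of just the two propagation systems.
schedule.run(&mut world);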

"chain",
Cfg {
test_case: TestCase::Tree {
depth: 2500,
Contributor:

Fairly certain this needs to be multiplied by 5. Ideally we'd have about the same number of Entity per test case, in order to compare the transform propagation behavior more easily.

hesiod (Contributor Author):

I had that idea as well, but due to the recursive implementation of propagation updates, setting the depth too high will exhaust the stack.
On my system, the benchmark simply crashes when setting depth to 50000. 5000 works, but I haven't tested any further depth values yet.


/// This benchmark tries to measure the cost of the initial transform propagation,
/// i.e. the first time transform propagation runs after we just added all our entities.
fn transform_init(c: &mut Criterion) {
Contributor:

Regarding naming: I think it's a bit misleading to call it "initial propagation". I'd call it "full propagation", e.g. transform_complete_propagation.

Comment on lines 345 to 355
/// update component with some per-component value
#[derive(Component)]
struct UpdateValue(f32);

/// update positions system
fn update(time: Res<Time>, mut query: Query<(&mut Transform, &mut UpdateValue)>) {
    for (mut t, mut u) in &mut query {
        u.0 += time.delta_seconds() * 0.1;
        set_translation(&mut t.translation, u.0);
    }
}
Contributor:

UpdateValue could be a simple marker component, and update could just call t.set_changed() (see the sketch below). IMO this is better, as it avoids the overhead of the trigonometric functions, setting the change-detection flag on UpdateValue, and loading/storing values to table storage.

We are only interested in how transform propagation behaves. Any other code adding to the runtime is noise.
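
Something like this sketch of the suggested change (hypothetical; With and the set_changed method come from bevy_ecs change detection):

use bevy_ecs::prelude::*;
use bevy_transform::prelude::Transform;

/// Plain marker: nothing to load from or store to table storage.
#[derive(Component)]
struct UpdateValue;

/// Flags every marked Transform as changed without mutating it,
/// so only the propagation cost shows up in the measurement.
fn update(mut query: Query<&mut Transform, With<UpdateValue>>) {
    for mut t in &mut query {
        t.set_changed();
    }
}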

@hesiod (Contributor Author) commented Aug 20, 2023

@nicopap Thanks for all your feedback, it's been very helpful. I've refactored the overall benchmark structure once more in order to hopefully make the benchmarks more useful. I didn't follow all your suggestions precisely, but I think I addressed most of the issues. Please let me know what you think.

Overview of the changes:

  • I've split the code into multiple modules:
    • hierarchy::init contains the "initialization" benchmarks (proper name TBD)
    • hierarchy::propagation contains the "update" benchmarks (again, proper name TBD)
    • world_gen contains the World generation and updating machinery
  • I've tried to separate the actual transform propagation runtime in the PostUpdate schedule from the simulated updates in the Update schedule and all the other schedules. The implementation is in the update_bench_postupdate_only function. Essentially, the function removes the PostUpdate and Last schedules from the Main schedule so that they're no longer available in Schedules, resulting in Main only running the schedules up to Update. It then runs PostUpdate manually and measures only the time spent in PostUpdate, ignoring the time spent in the other schedules (see the sketch after this list).
    Since the PostUpdate schedule should more or less only contain the relevant propagation systems, this should be equivalent to creating a custom schedule with just these two systems, with the benefit of not coupling the benchmarks tightly to the exact propagation systems.
  • I've added another group of benchmarks, transform_hierarchy_sizes, that measure the same basic configuration with either increasing depth (large/deep configuration) or increasing branch width (wide configuration). This allows Criterion to generate nice plots showing how the runtime depends on the number of nodes (which is currently mostly linear).
    The other benchmark group, transform_hierarchy_configurations, goes through all configurations that were previously present in the stress test without changing any sizes (same as in the initial version of the PR).
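
As a rough sketch of the schedule-splitting approach described above (the real update_bench_postupdate_only is in the PR; the exact API calls here are assumptions based on bevy 0.11, where Main runs its sub-schedules via try_run_schedule and silently skips missing ones):

use std::time::{Duration, Instant};
use bevy_app::{Last, Main, PostUpdate};
use bevy_ecs::schedule::Schedules;

// Take PostUpdate and Last out of the Schedules resource; running
// Main afterwards then only executes the schedules up to Update.
let mut schedules = app.world.resource_mut::<Schedules>();
let mut post_update = schedules.remove(&PostUpdate).unwrap();
schedules.remove(&Last);

b.iter_custom(|iters| {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        // Untimed: simulated transform updates in Update and earlier.
        app.world.run_schedule(Main);
        // Timed: only the propagation systems in PostUpdate.
        let start = Instant::now();
        post_update.run(&mut app.world);
        total += start.elapsed();
        app.world.clear_trackers();
    }
    total
});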

Some open questions:

  • I've retained the prior, simple version of the propagation benchmark (function update_bench_reference) that doesn't mess with schedules and simply runs app.update. I think this might be helpful in the future to determine whether the more complicated update_bench_postupdate_only misbehaves (e.g. if the Main schedule definition changes in the future). On the other hand, it's not truly necessary and could also be removed.
  • Some conceivable configurations aren't benchmarked currently, e.g. the humanoids configurations aren't included in transform_hierarchy_sizes. I've decided against this as the number of benchmarks is already quite large, meaning the benchmark time is also quite long.

@nicopap If I understood you correctly, you're suggesting that transform_init can be removed, right? I'm actually wondering whether the transform_init benchmarks could be removed entirely. They're not really measuring a very critical or specific (and thus actionable) operation.

@hesiod (Contributor Author) commented Aug 20, 2023

Here are some example results from transform_propagation_large (from a slightly older version, when the benchmark prefix was still transform_propagation_sizes):

  • Violin plot for different node counts (violin plot image)
  • Comparison of noop vs updates (line plot image)

Note: These benchmarks were done on a potato CPU compared to current models (an Intel i5-7200U), which I manually downclocked a bit to ensure thermal throttling doesn't influence the results too much. As a result, the absolute numbers might look a little worse than you'd expect.

@nicopap (Contributor) left a comment

This looks good to me. Nice job. To me the only change necessary is getting rid of UPDATE_BENCH_POSTUPDATE_ONLY.

Also, we run cargo fmt on source code to keep a consistent style.

Comment on lines +122 to +127
// Measures hierarchy propagation systems when some transforms are updated.
group.bench_with_input(id("updates"), &(cfg, TransformUpdates::Enabled), inner_update_bench);

// Measures hierarchy propagation systems when there are no changes
// during the Update schedule.
group.bench_with_input(id("noop"), &(cfg, TransformUpdates::Disabled), inner_update_bench);
Contributor:

I suggest replacing:

  • "updates" with "transform_updates_enabled"
  • "noop" with "transform_updates_disabled"

So the relationship between the benchmark result and the benchmark source code is a bit more evident.
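
Applied to the snippet above, the suggestion would read:

// Measures hierarchy propagation systems when some transforms are updated.
group.bench_with_input(id("transform_updates_enabled"), &(cfg, TransformUpdates::Enabled), inner_update_bench);

// Measures hierarchy propagation systems when there are no changes
// during the Update schedule.
group.bench_with_input(id("transform_updates_disabled"), &(cfg, TransformUpdates::Disabled), inner_update_bench);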

@@ -0,0 +1,4 @@
pub mod init;

Contributor:
Suggested change

/// since the benchmark implementation is a little fragile and rather slow (see comments below).
/// They're included here nevertheless in case they're useful.
fn transform_init(c: &mut Criterion) {
    let mut group = c.benchmark_group("transform_init");
Contributor:

I still think this benchmark is useful. It gives an idea of the behavior in worst-case situations (which do exist in games, e.g. when spawning a new level or complex models).

The computational cost of removing many entities was a factor in reverting a change (see #5423 (comment)). This means we care about the computational cost of this sort of thing.

I think that "full recomputation" or something similarly descriptive could be a better name, though.

Comment on lines +62 to +65
std::hint::black_box({
    last.run(&mut app.world);
    app.world.clear_trackers();
});
Contributor:

Not sure the black_box is necessary here.

Comment on lines +90 to +91
fn inner_update_bench(b: &mut Bencher<WallTime>, bench_cfg: &(&Cfg, TransformUpdates)) {
    const UPDATE_BENCH_POSTUPDATE_ONLY: bool = false;
Contributor:

It's a weird way of selecting benchmarks, and it's inconsistent with how the transform_init benchmark works.

In my opinion "reference" benchmarks should be removed from this PR, since (from my testing) they only add a constant overhead.

But it's OK if they stay in the PR as long as they are consistently declared :P

@rparrett (Contributor) commented Sep 11, 2023

Should this replace the stress test entirely? I'm not sure. To me personally, the stress test is useless for testing propagation performance because of the run-to-run variations and the fact that it simply runs endlessly, but perhaps others use it differently.

In my opinion, yes. See #7433 where I was unable to discover how that example is meant to be useful.

Doesn't need to be done here, just wanted to link the issue up.

@james7132 self-requested a review October 2, 2023 01:15
@BenjaminBrienen added labels on Jan 23, 2025: D-Modest (a "normal" level of difficulty; suitable for simple features or challenging fixes), S-Waiting-on-Author (the author needs to make changes or address concerns before this can be merged)