
Parallelized transform propagation #11760

Closed

Conversation

NthTensor
Contributor

Objective

Currently, we traverse every tree in the entity hierarchy top-to-bottom every frame to update transforms. We iterate separate trees in parallel, but in practice (and especially when loading complex glTF scenes) the world tends to consist of a few very deep trees, and several users have reported performance problems (see for example #9442).

Fixes #7840 (partially).

Solution

This is the first of two PRs I am publishing to address this problem. In this PR, I try to start updating transforms as deep in the tree as possible, while also running as much of the traversal in parallel as possible.

propagate_transforms now performs two passes of parallel iteration. The first pass updates all root and orphaned nodes (just as before), but now we skip any trees without top-level changes instead of immediately doing a depth-first traversal. The second pass then walks up the tree from every changed non-root node to find the minimal disjoint sub-trees containing changed transforms, and propagates transforms down each sub-tree in parallel.
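To make the two passes concrete, here is a minimal single-threaded sketch in plain Rust. The `Hierarchy` struct, the flat index-based representation, and the `f32` stand-in for a transform are all hypothetical simplifications; the actual PR operates on Bevy's `Transform`/`GlobalTransform` components and runs both passes with parallel iteration.

```rust
// Hypothetical flat hierarchy: node i's parent is parent[i] (None = root).
struct Hierarchy {
    parent: Vec<Option<usize>>,
    children: Vec<Vec<usize>>,
    local: Vec<f32>,  // stand-in for Transform
    global: Vec<f32>, // stand-in for GlobalTransform
    changed: Vec<bool>,
}

impl Hierarchy {
    // A changed non-root node is an "entry point" if none of its ancestors
    // changed; entry points root the minimal disjoint sub-trees of pass 2.
    fn is_entry_point(&self, node: usize) -> bool {
        let mut cur = node;
        while let Some(p) = self.parent[cur] {
            if self.changed[p] {
                return false;
            }
            cur = p;
        }
        true
    }

    // Depth-first propagation from `root`, assuming the parent's global
    // transform (if any) is already up to date.
    fn propagate_down(&mut self, root: usize) {
        let base = self.parent[root].map_or(0.0, |p| self.global[p]);
        let mut stack = vec![(root, base)];
        while let Some((node, parent_global)) = stack.pop() {
            self.global[node] = parent_global + self.local[node];
            for &child in &self.children[node] {
                stack.push((child, self.global[node]));
            }
        }
    }

    fn update(&mut self) {
        // Pass 1: visit roots; trees without a top-level change are skipped
        // here instead of being traversed depth-first right away.
        for root in 0..self.parent.len() {
            if self.parent[root].is_none() && self.changed[root] {
                self.propagate_down(root);
            }
        }
        // Pass 2: walk up from every changed non-root node to find the
        // minimal disjoint sub-trees, then propagate each one.
        let entries: Vec<usize> = (0..self.parent.len())
            .filter(|&n| self.changed[n] && self.parent[n].is_some() && self.is_entry_point(n))
            .collect();
        for e in entries {
            self.propagate_down(e);
        }
        self.changed.iter_mut().for_each(|c| *c = false);
    }
}

// Small demo tree: 0 -> {1 -> 2, 3}, with made-up local offsets.
fn build_demo() -> Hierarchy {
    Hierarchy {
        parent: vec![None, Some(0), Some(1), Some(0)],
        children: vec![vec![1, 3], vec![2], vec![], vec![]],
        local: vec![1.0, 2.0, 3.0, 4.0],
        global: vec![0.0; 4],
        changed: vec![true, false, false, false],
    }
}

fn main() {
    let mut h = build_demo();
    h.update(); // root changed: pass 1 propagates the whole tree
    assert_eq!(h.global, vec![1.0, 3.0, 6.0, 5.0]);

    h.local[1] = 10.0;
    h.changed[1] = true;
    h.update(); // only node 1's sub-tree is re-propagated, in pass 2
    assert_eq!(h.global, vec![1.0, 11.0, 14.0, 5.0]);
}
```

In the second `update`, node 3 and the root are never revisited; that skipped work is where the leaf-heavy wins below come from.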

These optimizations should significantly improve the case where most transform changes occur near the leaves of the entity hierarchy. The best-case performance (all changes are leaves) should be much better than the current implementation. The worst case (all changes are in roots) should be approximately as fast as the current implementation.

This is just a proposal. Please evaluate it for efficiency and safety. While this algorithm seems intuitively correct to me, and I have done my best to justify the new unsafe block, I am not super happy with the logic in the second safety comment.

@alice-i-cecile added the C-Performance (A change motivated by improving speed, memory usage or compile times), A-Transform (Translations, rotations and scales), and D-Complex (Quite challenging from either a design or technical perspective. Ask for help!) labels on Feb 7, 2024
@mockersf
Member

mockersf commented Feb 7, 2024

the stress test example transform_hierarchy should have plenty of cases to test the perf changes

}
// SAFETY:
// Define any changed entity that is not a descendant of another changed entity to be an 'entry point'.
// - Since the hierarchy has a forest structure, two distinct entry points cannot have shared descendants.
Contributor

FYI this is not entirely true because you could have cycles, but luckily they get filtered out by the loop above that looks for a changed ancestor.

In general though I feel like some of these safety comments are a bit generous with the assumptions.
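The cycle observation can be illustrated with a small sketch (hypothetical names and a flat index-based hierarchy, not the PR's actual code). Walking up from a changed entity terminates even on a malformed cycle through that entity, because the walk eventually comes back around to the changed entity itself and bails out. A cycle containing no changed entity would still hang this naive version, so it only illustrates the filtering described above:

```rust
/// Returns true if `node` (assumed to be marked changed) has no changed
/// ancestor, i.e. it can root one of the disjoint propagated sub-trees.
fn is_disjoint_entry(parent: &[Option<usize>], changed: &[bool], node: usize) -> bool {
    let mut cur = node;
    while let Some(p) = parent[cur] {
        // A changed ancestor's sub-tree already covers `node`, so bail.
        // On a cycle through `node`, the walk reaches `node` itself,
        // which is changed, so the cycle is filtered out here too.
        if changed[p] {
            return false;
        }
        cur = p;
    }
    // Reached a root: no changed ancestor, so `node` is an entry point.
    true
}

fn main() {
    // Well-formed chain 0 -> 1 -> 2, only the leaf changed: it is an entry.
    assert!(is_disjoint_entry(
        &[None, Some(0), Some(1)],
        &[false, false, true],
        2
    ));
    // Malformed cycle 0 -> 1 -> 2 -> 0 with node 0 changed: the walk from 0
    // loops around, hits changed[0], and filters the node out.
    assert!(!is_disjoint_entry(
        &[Some(2), Some(0), Some(1)],
        &[true, false, false],
        0
    ));
}
```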

Contributor Author

NthTensor commented Feb 7, 2024

Maybe I should have marked this as a draft. I know I was a bit too informal with these, but I'm still getting used to the Rust safety-comment proof style, so I figured I could use the feedback.

I will try to shore these up.

@NthTensor
Contributor Author

> the stress test example transform_hierarchy should have plenty of cases to test the perf changes

On the off chance that someone tries this for themselves: the stress test examples are currently broken. You have to switch out MinimalPlugins for DefaultPlugins, or they won't profile properly. See the tracking issue here: #7433.

@james7132 added the S-Needs-Benchmarking (This set of changes needs performance benchmarking to double-check that they help) label on Feb 7, 2024
@pcwalton
Contributor

pcwalton commented Feb 7, 2024

Interesting, would be curious to see performance results.

Ultimately, if we go down this route I think we want something like Servo had, where every node enqueues its descendants and work stealing distributes the load across all the threads. This would also allow us to encapsulate the "parallel tree traversals" logic into a single piece of generic infrastructure, eliminating the unsafe code in the systems themselves. However, I'm not sure how much work-stealing infrastructure Bevy has, and just throwing Rayon in there would result in a lot of thread bloat.
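As a rough illustration of the enqueue-your-descendants idea (not Servo's or Bevy's actual code), here is a std-only sketch where worker threads share a single queue. A real work-stealing scheduler would use per-thread deques with stealing instead of one contended lock, but the scheduling structure is the same. The function name and flat hierarchy representation are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical flattened hierarchy: children[i] lists the child indices of
// node i; the f32 values stand in for transforms.
fn propagate_work_stealing(
    children: &[Vec<usize>],
    local: &[f32],
    roots: &[usize],
    num_threads: usize,
) -> Vec<f32> {
    let global = Arc::new(Mutex::new(vec![0.0f32; local.len()]));
    // Work items are (node, parent_global) pairs, seeded with the roots.
    let queue = Arc::new(Mutex::new(
        roots.iter().map(|&r| (r, 0.0f32)).collect::<Vec<_>>(),
    ));
    // Nodes enqueued but not yet fully processed; workers exit at zero.
    let pending = Arc::new(AtomicUsize::new(roots.len()));

    thread::scope(|s| {
        for _ in 0..num_threads {
            let (global, queue, pending) = (global.clone(), queue.clone(), pending.clone());
            s.spawn(move || loop {
                if pending.load(Ordering::Acquire) == 0 {
                    break;
                }
                let item = queue.lock().unwrap().pop();
                let Some((node, parent_global)) = item else { continue };
                // A node is only enqueued after its parent's global is final.
                let g = parent_global + local[node];
                global.lock().unwrap()[node] = g;
                // Enqueue descendants before marking this node done, so
                // `pending` can never hit zero while work remains.
                pending.fetch_add(children[node].len(), Ordering::AcqRel);
                {
                    let mut q = queue.lock().unwrap();
                    for &c in &children[node] {
                        q.push((c, g));
                    }
                }
                pending.fetch_sub(1, Ordering::AcqRel);
            });
        }
    });

    Arc::try_unwrap(global).unwrap().into_inner().unwrap()
}

fn main() {
    // Chain 0 -> 1 -> 2 with local offsets 1, 2, 3.
    let children = vec![vec![1], vec![2], vec![]];
    let local = vec![1.0, 2.0, 3.0];
    let global = propagate_work_stealing(&children, &local, &[0], 4);
    assert_eq!(global, vec![1.0, 3.0, 6.0]);
}
```

Because parents always finish before their children are enqueued, no per-node synchronization on the transform data itself is needed beyond the queue, which is what makes the pattern attractive as generic traversal infrastructure.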

@NthTensor
Contributor Author

NthTensor commented Feb 7, 2024

I profiled all of the transform_hierarchy benchmarks. All of these were run for 25 seconds on a 11th gen i7-1165G7 (2.80GHz) cpu running Linux, and built using --release. Red is main, yellow is this PR.

Keep in mind that propagation happens both at startup and every frame. The startup tends to take longer, so the distribution is usually bimodal.

Large Tree

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707337744]

Wide Tree

A shallow but very wide tree, with a depth of 3 and 500 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707338163]

Deep Tree

A deep but not very wide tree, with a depth of 25 and 2 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707338516]

Chain

A chain 2500 levels deep. Entities transform with a probability of 0.5 each frame.

[Link broken, I am fixing]

Interesting performance regression here, but I don't think this is something we should be optimizing for.

Update Leaves

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Leaf nodes transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707339035]

This is a much more serious regression. I believe it comes from the added cost of walking up the tree from every changed leaf node. It may be possible to optimize this case further, but introducing a Static marker will also help in practice.

Update Shallow

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities in the bottom half transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707339248]

Humanoids Active

4000 human-rig hierarchies. Every single entity moves every single frame.

[benchmark screenshot: sc_1707339439]

Another regression. The added overhead of finding disjoint sub-trees is useless when every entity moves every frame.

Humanoids Inactive

4000 human-rig hierarchies. All but 10 are static.

[benchmark screenshot: sc_1707339561]

An easy performance win coming from just ignoring all the trees that don't move.

Humanoids Mixed

4000 human-rig hierarchies. Half are static, half move every frame.

[benchmark screenshot: sc_1707339769]

Very slight regression. I am going to treat this as statistically insignificant.

Analysis

This technique performs very well when few entities are moving, and roughly on par with main (which is to say, poorly) when more than half of the entities move in a single frame. It may be possible to combine it with other techniques, such as a Static marker (or work stealing, or simply switching between the two methods), to get more consistent performance benefits.

TL;DR It's a mixed bag, depends on what we want to optimize for.


Oh, and we should probably get some tests on real, practical test scenes, not just benchmarks, but I don't really have anything applicable around.

@JMS55 added this to the 0.14 milestone on Feb 7, 2024
@james7132
Member

One potential alternative, if #11995 gets merged, is an (unsafe) variant of the current par_iter that supports splitting tasks, enabling work stealing over sub-hierarchies. par_iter should trivially be able to take advantage of it by subdividing archetype/table slices; a hierarchy traversal may be more difficult, but should be possible to do safely.

@NthTensor
Contributor Author

I think this sort of optimization is worth looking at, but to really see performance improvements across the board it will need to be coupled with a "two-pass" propagation algorithm, similar to what is done in bigspace.

@NthTensor closed this Aug 8, 2024
Development

Successfully merging this pull request may close these issues.

Propagation requires full hierarchy traversal
7 participants