
Parallelized transform propagation #11760

Closed

Conversation

NthTensor
Contributor

Objective

Currently, we traverse every tree in the entity hierarchy top-to-bottom every frame to update transforms. We iterate separate trees in parallel, but in practice (and especially when loading complex glTF scenes) the world tends to consist of a few very deep trees, and several users have reported performance problems (see for example #9442).

Fixes #7840 (partially).

Solution

This is the first of two PRs I am publishing to address this problem. In this PR, I try to start updating transforms as deep in the tree as possible, while also running as much of the traversal in parallel as possible.

propagate_transforms now performs two passes of parallel iteration. The first pass updates all root and orphaned nodes (just as before), but now we skip any trees without top-level changes instead of immediately doing a depth-first traversal. The second pass then walks up the tree from every changed non-root node to find the minimal disjoint sub-trees containing changed transforms, and propagates transforms down each sub-tree in parallel.
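To make the two passes concrete, here is a minimal single-threaded sketch in plain Rust. The `Hierarchy` struct, the flat index-based representation, and the `f32` stand-in for a transform are all hypothetical simplifications; the actual PR operates on Bevy's `Transform`/`GlobalTransform` components and runs both passes with parallel iteration.

```rust
// Hypothetical flat hierarchy: node i's parent is parent[i] (None = root).
struct Hierarchy {
    parent: Vec<Option<usize>>,
    children: Vec<Vec<usize>>,
    local: Vec<f32>,  // stand-in for Transform
    global: Vec<f32>, // stand-in for GlobalTransform
    changed: Vec<bool>,
}

impl Hierarchy {
    // A changed non-root node is an "entry point" if none of its ancestors
    // changed; entry points root the minimal disjoint sub-trees of pass 2.
    fn is_entry_point(&self, node: usize) -> bool {
        let mut cur = node;
        while let Some(p) = self.parent[cur] {
            if self.changed[p] {
                return false;
            }
            cur = p;
        }
        true
    }

    // Depth-first propagation from `root`, assuming the parent's global
    // transform (if any) is already up to date.
    fn propagate_down(&mut self, root: usize) {
        let base = self.parent[root].map_or(0.0, |p| self.global[p]);
        let mut stack = vec![(root, base)];
        while let Some((node, parent_global)) = stack.pop() {
            self.global[node] = parent_global + self.local[node];
            for &child in &self.children[node] {
                stack.push((child, self.global[node]));
            }
        }
    }

    fn update(&mut self) {
        // Pass 1: visit roots; trees without a top-level change are skipped
        // here instead of being traversed depth-first right away.
        for root in 0..self.parent.len() {
            if self.parent[root].is_none() && self.changed[root] {
                self.propagate_down(root);
            }
        }
        // Pass 2: walk up from every changed non-root node to find the
        // minimal disjoint sub-trees, then propagate each one.
        let entries: Vec<usize> = (0..self.parent.len())
            .filter(|&n| self.changed[n] && self.parent[n].is_some() && self.is_entry_point(n))
            .collect();
        for e in entries {
            self.propagate_down(e);
        }
        self.changed.iter_mut().for_each(|c| *c = false);
    }
}

// Small demo tree: 0 -> {1 -> 2, 3}, with made-up local offsets.
fn build_demo() -> Hierarchy {
    Hierarchy {
        parent: vec![None, Some(0), Some(1), Some(0)],
        children: vec![vec![1, 3], vec![2], vec![], vec![]],
        local: vec![1.0, 2.0, 3.0, 4.0],
        global: vec![0.0; 4],
        changed: vec![true, false, false, false],
    }
}

fn main() {
    let mut h = build_demo();
    h.update(); // root changed: pass 1 propagates the whole tree
    assert_eq!(h.global, vec![1.0, 3.0, 6.0, 5.0]);

    h.local[1] = 10.0;
    h.changed[1] = true;
    h.update(); // only node 1's sub-tree is re-propagated, in pass 2
    assert_eq!(h.global, vec![1.0, 11.0, 14.0, 5.0]);
}
```

In the second `update`, node 3 and the root are never revisited; that skipped work is where the leaf-heavy wins below come from.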

These optimizations should significantly improve the case where most transform changes occur near the leaves of the entity hierarchy. The best-case performance (all changes are leaves) should be much better than the current implementation. The worst case (all changes are in roots) should be approximately as fast as the current implementation.

This is just a proposal. Please evaluate it for efficiency and safety. While this algorithm seems intuitively correct to me, and I have done my best to justify the new unsafe block, I am not super happy with the logic in the second safety comment.

@alice-i-cecile added the C-Performance (A change motivated by improving speed, memory usage or compile times), A-Transform (Translations, rotations and scales), and D-Complex (Quite challenging from either a design or technical perspective. Ask for help!) labels on Feb 7, 2024
@mockersf
Member

mockersf commented Feb 7, 2024

the stress test example transform_hierarchy should have plenty of cases to test the perf changes

}
// SAFETY:
// Define any changed entity that is not a descendant of another changed entity to be an 'entry point'.
// - Since the hierarchy has a forest structure, two distinct entry points cannot have shared descendants.
Contributor

FYI this is not entirely true because you could have cycles, but luckily they get filtered out by the loop above that looks for a changed ancestor.

In general though I feel like some of these safety comments are a bit generous with the assumptions.
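The cycle observation can be illustrated with a small sketch (hypothetical names and a flat index-based hierarchy, not the PR's actual code). Walking up from a changed entity terminates even on a malformed cycle through that entity, because the walk eventually comes back around to the changed entity itself and bails out. A cycle containing no changed entity would still hang this naive version, so it only illustrates the filtering described above:

```rust
/// Returns true if `node` (assumed to be marked changed) has no changed
/// ancestor, i.e. it can root one of the disjoint propagated sub-trees.
fn is_disjoint_entry(parent: &[Option<usize>], changed: &[bool], node: usize) -> bool {
    let mut cur = node;
    while let Some(p) = parent[cur] {
        // A changed ancestor's sub-tree already covers `node`, so bail.
        // On a cycle through `node`, the walk reaches `node` itself,
        // which is changed, so the cycle is filtered out here too.
        if changed[p] {
            return false;
        }
        cur = p;
    }
    // Reached a root: no changed ancestor, so `node` is an entry point.
    true
}

fn main() {
    // Well-formed chain 0 -> 1 -> 2, only the leaf changed: it is an entry.
    assert!(is_disjoint_entry(
        &[None, Some(0), Some(1)],
        &[false, false, true],
        2
    ));
    // Malformed cycle 0 -> 1 -> 2 -> 0 with node 0 changed: the walk from 0
    // loops around, hits changed[0], and filters the node out.
    assert!(!is_disjoint_entry(
        &[Some(2), Some(0), Some(1)],
        &[true, false, false],
        0
    ));
}
```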

Contributor Author

NthTensor commented Feb 7, 2024

Maybe I should have marked this as a draft. I know I was a bit too informal with these, but I'm still getting used to the Rust safety-comment proof style, so I figured I could use the feedback.

I will try to shore these up.

@NthTensor
Contributor Author

> the stress test example transform_hierarchy should have plenty of cases to test the perf changes

On the off chance that someone tries this for themselves: the stress test examples are currently broken. You have to switch out MinimalPlugins for DefaultPlugins, or they won't profile properly. See the tracking issue here: #7433.

@james7132 added the S-Needs-Benchmarking (This set of changes needs performance benchmarking to double-check that they help) label on Feb 7, 2024
@pcwalton
Contributor

pcwalton commented Feb 7, 2024

Interesting, would be curious to see performance results.

Ultimately, if we go down this route I think we want something like Servo had, where every node enqueues its descendants and work stealing distributes the load across all the threads. This would also allow us to encapsulate the "parallel tree traversals" logic into a single piece of generic infrastructure, eliminating the unsafe code in the systems themselves. However, I'm not sure how much work-stealing infrastructure Bevy has, and just throwing Rayon in there would result in a lot of thread bloat.
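As a rough illustration of the enqueue-your-descendants idea (not Servo's or Bevy's actual code), here is a std-only sketch where worker threads share a single queue. A real work-stealing scheduler would use per-thread deques with stealing instead of one contended lock, but the scheduling structure is the same. The function name and flat hierarchy representation are hypothetical:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical flattened hierarchy: children[i] lists the child indices of
// node i; the f32 values stand in for transforms.
fn propagate_work_stealing(
    children: &[Vec<usize>],
    local: &[f32],
    roots: &[usize],
    num_threads: usize,
) -> Vec<f32> {
    let global = Arc::new(Mutex::new(vec![0.0f32; local.len()]));
    // Work items are (node, parent_global) pairs, seeded with the roots.
    let queue = Arc::new(Mutex::new(
        roots.iter().map(|&r| (r, 0.0f32)).collect::<Vec<_>>(),
    ));
    // Nodes enqueued but not yet fully processed; workers exit at zero.
    let pending = Arc::new(AtomicUsize::new(roots.len()));

    thread::scope(|s| {
        for _ in 0..num_threads {
            let (global, queue, pending) = (global.clone(), queue.clone(), pending.clone());
            s.spawn(move || loop {
                if pending.load(Ordering::Acquire) == 0 {
                    break;
                }
                let item = queue.lock().unwrap().pop();
                let Some((node, parent_global)) = item else { continue };
                // A node is only enqueued after its parent's global is final.
                let g = parent_global + local[node];
                global.lock().unwrap()[node] = g;
                // Enqueue descendants before marking this node done, so
                // `pending` can never hit zero while work remains.
                pending.fetch_add(children[node].len(), Ordering::AcqRel);
                {
                    let mut q = queue.lock().unwrap();
                    for &c in &children[node] {
                        q.push((c, g));
                    }
                }
                pending.fetch_sub(1, Ordering::AcqRel);
            });
        }
    });

    Arc::try_unwrap(global).unwrap().into_inner().unwrap()
}

fn main() {
    // Chain 0 -> 1 -> 2 with local offsets 1, 2, 3.
    let children = vec![vec![1], vec![2], vec![]];
    let local = vec![1.0, 2.0, 3.0];
    let global = propagate_work_stealing(&children, &local, &[0], 4);
    assert_eq!(global, vec![1.0, 3.0, 6.0]);
}
```

Because parents always finish before their children are enqueued, no per-node synchronization on the transform data itself is needed beyond the queue, which is what makes the pattern attractive as generic traversal infrastructure.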

@NthTensor
Contributor Author

NthTensor commented Feb 7, 2024

I profiled all of the transform_hierarchy benchmarks. All of these were run for 25 seconds on a 11th gen i7-1165G7 (2.80GHz) cpu running Linux, and built using --release. Red is main, yellow is this PR.

Keep in mind that propagation happens both at startup and every frame. The startup tends to take longer, so the distribution is usually bimodal.

Large Tree

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707337744]

Wide Tree

A shallow but very wide tree, with a depth of 3 and 500 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707338163]

Deep Tree

A deep but not very wide tree, with a depth of 25 and 2 children per node. Entities transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707338516]

Chain

A chain 2500 levels deep. Entities transform with a probability of 0.5 each frame.

[Link broken, I am fixing]

Interesting performance regression here, but I don't think this is something we should be optimizing for.

Update Leaves

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Leaf nodes transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707339035]

This is a much more serious regression. I believe it comes from the added cost of walking up the tree from every changed leaf node. It may be possible to optimize this case further, but introducing a Static marker will also help in practice.

Update Shallow

A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities in the bottom half transform with a probability of 0.5 each frame.

[benchmark screenshot: sc_1707339248]

Humanoids Active

4000 human-rig hierarchies. Every single entity moves every single frame.

[benchmark screenshot: sc_1707339439]

Another regression. The added overhead of finding disjoint sub-trees is useless when every entity moves every frame.

Humanoids Inactive

4000 human-rig hierarchies. All but 10 are static.

[benchmark screenshot: sc_1707339561]

An easy performance win coming from just ignoring all the trees that don't move.

Humanoids Mixed

4000 human-rig hierarchies. Half are static, half move every frame.

[benchmark screenshot: sc_1707339769]

Very slight regression. I am going to treat this as statistically insignificant.

Analysis

This technique performs very well when few entities are moving, and roughly on par with main (which is to say, poorly) when more than half of the entities move in a single frame. It may be possible to combine it with other techniques, such as a Static marker (or work stealing, or simply switching between the two methods), to get more consistent performance benefits.

TL;DR It's a mixed bag, depends on what we want to optimize for.


Oh, and we should probably get some tests on real, practical test scenes, not just benchmarks, but I don't really have anything applicable around.

@JMS55 added this to the 0.14 milestone on Feb 7, 2024
@james7132
Member

One potential alternative, if #11995 gets merged, is an (unsafe) variant of the current par_iter that supports splitting tasks, enabling work stealing over sub-hierarchies. par_iter should trivially be able to take advantage of it by subdividing archetype/table slices; a hierarchy traversal may be more difficult, but should be possible to do safely.

@NthTensor
Contributor Author

I think this sort of optimization is worth looking at, but to really see performance improvements across the board it will need to be coupled with a "two-pass" propagation algorithm, similar to what is done in bigspace.

@NthTensor closed this Aug 8, 2024
Development

Successfully merging this pull request may close these issues.

Propagation requires full hierarchy traversal
7 participants