Parallelized transform propagation #11760
Conversation
the stress test example

}
// SAFETY:
// Define any changed entity that is not a descendant of another changed entity to be an 'entry point'.
// - Since the hierarchy has forest structure, two distinct entry points cannot have shared descendants.
FYI this is not entirely true because you could have cycles, but luckily they get filtered out by the loop above that looks for a changed ancestor.
In general though I feel like some of these safety comments are a bit generous with the assumptions.
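For reference, here is a minimal sketch of the shape of that loop (hypothetical names, not the PR's actual code). Since the walk only ever starts from an entity whose own transform changed, following parent links around a cycle must eventually land back on a changed entity, so cycles are rejected along with ordinary non-minimal nodes:

```rust
use bevy::prelude::{Changed, Entity, Parent, Query, Transform};

// Hypothetical helper: is `entity` (known to be changed) an 'entry point',
// i.e. not a descendant of another changed entity?
fn is_entry_point(
    entity: Entity,
    parents: &Query<&Parent>,
    changed: &Query<(), Changed<Transform>>,
) -> bool {
    let mut current = entity;
    // Walk up toward the root looking for another changed entity.
    while let Ok(parent) = parents.get(current) {
        current = parent.get();
        if changed.contains(current) {
            // Either a genuinely changed ancestor, or (in a malformed
            // hierarchy) the cycle led back to `entity` itself, which is
            // changed by assumption. Either way: not an entry point.
            return false;
        }
    }
    // Reached a root without meeting another changed entity.
    true
}
```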
Maybe I should have marked this as a draft. I know I was a bit too informal with these, but I'm still getting used to the Rust safety-comment proof style, so I figured I could use the feedback.
I will try to shore these up.
On the off chance that someone tries this for themselves, these are broken. You have to switch out
Interesting, would be curious to see performance results. Ultimately, if we go down this route I think we want something like what Servo had, where every node enqueues its descendants and work stealing distributes the load across all the threads. This would also allow us to encapsulate the "parallel tree traversals" logic into a single piece of generic infrastructure, eliminating the unsafe code in the systems themselves. However, I'm not sure how much work-stealing infrastructure Bevy has, and just throwing Rayon in there would result in a lot of thread bloat.
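For illustration, a minimal sketch of that Servo-style pattern using Rayon's work stealing. `Node` and `traverse` are stand-ins rather than Bevy types, and as noted, actually pulling Rayon into Bevy would bring a second thread pool along:

```rust
use rayon::prelude::*;

// Stand-in tree node; in Bevy this would be an Entity in a hierarchy.
struct Node {
    value: f32,
    children: Vec<Node>,
}

// Each node does its own work, then hands every child sub-tree to the
// pool as an independently stealable task, so idle threads can pick up
// work from deep or unbalanced trees.
fn traverse(node: &mut Node, parent_value: f32) {
    node.value += parent_value; // per-node work, e.g. transform propagation
    let value = node.value;
    node.children
        .par_iter_mut()
        .for_each(|child| traverse(child, value));
}
```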
I profiled all of the transform_hierarchy configurations. Keep in mind that propagation happens both at startup and every frame. The startup tends to take longer, so the distribution is usually bimodal.

Large Tree
A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities transform with a probability of 0.5 each frame.

Wide Tree
A shallow but very wide tree, with a depth of 3 and 500 children per node. Entities transform with a probability of 0.5 each frame.

Deep Tree
A deep but not very wide tree, with a depth of 25 and 2 children per node. Entities transform with a probability of 0.5 each frame.

Chain
A chain 2500 levels deep. Entities transform with a probability of 0.5 each frame. [Link broken, I am fixing] Interesting performance regression here, but I don't think this is something we should be optimizing for.

Update Leaves
A fairly wide and deep tree, with a depth of 18 and 8 children per node. Leaf nodes transform with a probability of 0.5 each frame. This is a much more serious regression. I believe it comes from the added cost of walking up the tree from every changed leaf node. It may be possible to optimize this case further, but introducing a

Update Shallow
A fairly wide and deep tree, with a depth of 18 and 8 children per node. Entities in the bottom half transform with a probability of 0.5 each frame.

Humanoids Active
4000 human-rig hierarchies. Every single entity moves every single frame. Another regression. The added overhead of finding disjoint sub-trees is useless when every entity moves every frame.

Humanoids Inactive
4000 human-rig hierarchies. All but 10 are static. An easy performance win coming from just ignoring all the trees that don't move.

Humanoids Mixed
4000 human-rig hierarchies. Half are static, half move every frame. Very slight regression; I am going to treat it as statistically insignificant.

Analysis
This technique performs very well when few entities are moving, and somewhat worse than the current implementation when more than half of the entities move in a single frame. It may be possible to combine this with other techniques, such as a

TL;DR
It's a mixed bag; it depends on what we want to optimize for. We should also get some tests on real, practical scenes, not just benchmarks, but I don't really have anything applicable around.
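For concreteness, the "transform with probability 0.5 each frame" drivers in these scenarios amount to something like the following sketch (illustrative only; `jiggle` is a made-up name and the actual stress-test code may differ):

```rust
use bevy::prelude::*;
use rand::Rng;

// Each frame, nudge roughly half the transforms. Mutating a Transform
// marks it Changed, which is exactly what the propagation passes key on.
fn jiggle(mut query: Query<&mut Transform>) {
    let mut rng = rand::thread_rng();
    for mut transform in &mut query {
        if rng.gen_bool(0.5) {
            transform.translation.x += 0.1;
        }
    }
}
```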
One potential alternative, if #11995 gets merged, is an (unsafe) variant of the current par_iter that supports splitting tasks so that sub-hierarchies can be work-stolen.
I think this sort of optimization is worth looking at, but to really see performance improvements across the board it will need to be coupled with a "two-pass" propagation algorithm, similar to what is done in
Objective
Currently we traverse every tree in the entity hierarchy top-to-bottom every frame to update transforms. We iterate separate trees in parallel, but in practice (and especially when loading complex glTF scenes) the world tends to consist of a few very deep trees, and several users have reported performance problems (see for example #9442).
Fixes #7840 (partially).
Solution
This is the first of two PRs I am publishing to address this problem. In this PR, I try to start updating transforms as deep in the tree as possible, while also running as much of the traversal in parallel as possible.
propagate_transforms now performs two passes of parallel iteration. The first pass updates all root and orphaned nodes (just as before), but now we skip any trees without top-level changes instead of immediately doing a depth-first traversal. The second pass then walks up the tree from every changed non-root node to find minimal disjoint sub-trees containing changed transforms, and then propagates transforms down each sub-tree in parallel.

These optimizations should significantly improve the case where most transform changes occur near the leaves of the entity hierarchy. The best case (all changes are leaves) should be much faster than the current implementation. The worst case (all changes are in roots) should be approximately as fast as the current implementation.
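To make the control flow concrete, here is a self-contained toy model of the two passes, using plain indices and sequential loops in place of ECS queries and parallel iteration (all names are illustrative, not the PR's actual code):

```rust
// Toy hierarchy node: indices stand in for entities, an f32 for a transform.
struct Node {
    parent: Option<usize>,
    children: Vec<usize>,
    changed: bool,
    local: f32,
    global: f32,
}

// Recompute `global` for `index` and everything below it. In the PR, this
// downward walk is what runs in parallel per sub-tree.
fn propagate_down(nodes: &mut [Node], index: usize) {
    let parent_global = match nodes[index].parent {
        Some(p) => nodes[p].global,
        None => 0.0,
    };
    nodes[index].global = parent_global + nodes[index].local;
    let children = nodes[index].children.clone();
    for child in children {
        propagate_down(nodes, child);
    }
}

// The pass-2 filter: does any ancestor of `index` also have a change?
fn has_changed_ancestor(nodes: &[Node], mut index: usize) -> bool {
    while let Some(parent) = nodes[index].parent {
        if nodes[parent].changed {
            return true;
        }
        index = parent;
    }
    false
}

fn propagate_transforms(nodes: &mut [Node]) {
    // Pass 1: changed roots start a full downward propagation; unchanged
    // roots are skipped entirely instead of being traversed.
    for i in 0..nodes.len() {
        if nodes[i].parent.is_none() && nodes[i].changed {
            propagate_down(nodes, i);
        }
    }
    // Pass 2: every changed non-root with no changed ancestor is the top
    // of a minimal disjoint sub-tree, so propagation can start right there.
    for i in 0..nodes.len() {
        if nodes[i].parent.is_some()
            && nodes[i].changed
            && !has_changed_ancestor(nodes, i)
        {
            propagate_down(nodes, i);
        }
    }
}
```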
This is just a proposal; please evaluate it for efficiency and safety. While this algorithm seems intuitively correct to me, and I have done my best to justify the new unsafe block, I am not super happy with the logic in the second safety comment.