zzre: Allocation-less colliders #371
Conversation
First benchmark results:

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
Job=LongRun  IterationCount=100  LaunchCount=3
```
Of course not depicted in the performance benchmarks is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers. Also the allocation-lessing is obviously not complete: the split stacks are still allocated per query and should either be fixed-size for a ridiculous tree size or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.
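The amortization mentioned above could look roughly like this: keep the split stack as a reusable buffer on the collider instead of allocating one per query. This is an illustrative sketch, not the actual zzre API; `SplitEntry`, `TreeCollider`, and the growth policy are assumptions.

```csharp
using System;

// Hypothetical entry type for the traversal stack.
public struct SplitEntry
{
    public int NodeIndex;
}

public sealed class TreeCollider
{
    // Reused across queries; grown on demand, never shrunk.
    // This trades a little persistent memory for zero per-query allocations.
    private SplitEntry[] splitStack = new SplitEntry[64];

    public void Intersections(/* query parameters */)
    {
        int stackTop = 0;
        splitStack[stackTop++] = new SplitEntry { NodeIndex = 0 }; // push root
        while (stackTop > 0)
        {
            var entry = splitStack[--stackTop];
            // ...test the node, then push any children that may intersect.
            // Grow the buffer only in the rare case it is exhausted:
            if (stackTop + 2 > splitStack.Length)
                Array.Resize(ref splitStack, splitStack.Length * 2);
            // splitStack[stackTop++] = ...;
        }
    }
}
```

Since the buffer persists on the collider, repeated queries allocate nothing after the first resize steady-state is reached.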
Results with amortized split stacks
Still a bit curious why …
With the baseline corrected and the power of just removing coarse intersection tests entirely (let's just not care about out-of-bounds, right?), all three variants that we would expect to be allocation-free indeed show no allocations (minus amortization). And we are still spending 20% less runtime.
Now we fix the baseline as a separate assembly; because I want to tackle some more shared code within zzre.core, the benchmark results warrant taking that risk.
And now with the KD optimization
Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again. Not much, but it is there.
Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves and we no longer need to run them every time.
All benchmarks should have the KD optimization. Also I checked the differences, which seem to be just between Baseline and Current due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old one, so I will let that slide. And the benchmark results:
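For context on the triangle-sphere test mentioned above: a common, robust formulation (a sketch; not necessarily the zzre implementation) finds the closest point on the triangle to the sphere center and compares squared distances. The closest-point routine below follows the well-known barycentric region decomposition.

```csharp
using System.Numerics;

static bool TriangleIntersectsSphere(Vector3 a, Vector3 b, Vector3 c, Vector3 center, float radius)
{
    Vector3 closest = ClosestPointOnTriangle(a, b, c, center);
    return Vector3.DistanceSquared(closest, center) <= radius * radius;
}

// Closest point on triangle (a, b, c) to point p, by checking each
// Voronoi region (vertices, edges, face) of the triangle in turn.
static Vector3 ClosestPointOnTriangle(Vector3 a, Vector3 b, Vector3 c, Vector3 p)
{
    Vector3 ab = b - a, ac = c - a, ap = p - a;
    float d1 = Vector3.Dot(ab, ap), d2 = Vector3.Dot(ac, ap);
    if (d1 <= 0f && d2 <= 0f) return a;                       // vertex region a

    Vector3 bp = p - b;
    float d3 = Vector3.Dot(ab, bp), d4 = Vector3.Dot(ac, bp);
    if (d3 >= 0f && d4 <= d3) return b;                       // vertex region b

    float vc = d1 * d4 - d3 * d2;
    if (vc <= 0f && d1 >= 0f && d3 <= 0f)
        return a + ab * (d1 / (d1 - d3));                     // edge region ab

    Vector3 cp = p - c;
    float d5 = Vector3.Dot(ab, cp), d6 = Vector3.Dot(ac, cp);
    if (d6 >= 0f && d5 <= d6) return c;                       // vertex region c

    float vb = d5 * d2 - d1 * d6;
    if (vb <= 0f && d2 >= 0f && d6 <= 0f)
        return a + ac * (d2 / (d2 - d6));                     // edge region ac

    float va = d3 * d6 - d5 * d4;
    if (va <= 0f && (d4 - d3) >= 0f && (d5 - d6) >= 0f)
        return b + (c - b) * ((d4 - d3) / ((d4 - d3) + (d5 - d6))); // edge bc

    float denom = 1f / (va + vb + vc);
    return a + ab * (vb * denom) + ac * (vc * denom);         // face region
}
```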
While writing the … So enjoy this one-off benchmark.
The answer: not really, any difference here is pretty near the threshold of error... So let's go with the simplest one.
Finally the SIMD (two-split) benchmarks are in with three-split being scribbled up. Let's see how it turned out
Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful, and if it does help performance, it is a minuscule benefit.
And here are the results for the SIMD512 three-split collider. Because we have more loop iterations I also re-added the less-branching variant for the new benchmark. I usually look at the results only after posting them here, so I cannot tell what hides under here...
The answer: not very much. We again have a minimal performance benefit for the SIMD128 two-split collider, but anything higher performs worse and is naturally more complex. Just as a text note without further benchmark results: I tested an SOA variant of the SIMD128 with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf triangle-sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.
I am almost at the end of the Intersections method, with the winner being the MergedCollider and some of the simpler variants like List, Struct or TaggedUnion. I had yet to benchmark the latter two in the merged collider, so here are the numbers for that:
These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual use case in the game, where a line intersection is equivalent to a raycast. But for the other use cases (especially physics) we would need to incorporate more use-case-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that. I still would like to roughly benchmark raycasts, making sure they do not allocate, and maybe try out a couple of variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winner variants to the productive game.
Initial benchmark for raycasts; the results are worrying: a lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.
Let's get unsurprised here with an easy one first: line intersections are not faster than raycasts.
This can be attributed to intersection queries having to always return all intersections, while raycasts can exit as soon as there cannot be a closer hit.
A one-off benchmark before I have to use a profiler again: at some point I added an SSE 4.1 version of … I should have benchmarked it earlier.
We also have multi-modal distributions and vastly different results in MediumRun benchmarks, so in summary I would say: no use for either implementation, just scalar should be fine. EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneration test. We can do that during merging and save the test for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.
Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better, and here are the results for that:
But the merged one is still slower than baseline. The profiler did unfortunately not tell me much, so a bit of guesswork it is: I am working on an iterative version.
Oh well. The first iterative raycast and... at least it reaches parity in performance with baseline and in allocations with merged? Overall still bad.
The allocations are due to the coarse check; in particular, casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.
Now we are getting somewhere. A new iterative variant using additional subtree elimination reaches allllllmost the WorldCollider+Recursive cast. Why it is not faster is still beyond me. The results:
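The combination of iterative traversal and subtree elimination can be sketched as follows. This is a hypothetical outline, not the zzre implementation: the node layout, `StackEntry` fields and the fixed stack depth are assumptions.

```csharp
using System;

// Entry/exit parametric distances of the ray through a node's slab.
struct StackEntry
{
    public int NodeIndex;
    public float MinT, MaxT;
}

static float? Cast(/* ray origin/direction, node array, triangles */)
{
    // Explicit traversal stack on the stack frame itself: no heap allocation.
    Span<StackEntry> stack = stackalloc StackEntry[64];
    int top = 0;
    float bestT = float.PositiveInfinity;
    stack[top++] = new StackEntry { NodeIndex = 0, MinT = 0f, MaxT = float.PositiveInfinity };

    while (top > 0)
    {
        var entry = stack[--top];
        // Subtree elimination: if the subtree starts beyond the best hit
        // found so far, nothing inside it can be closer. Skip it entirely.
        if (entry.MinT >= bestT)
            continue;

        // If leaf: test triangles, updating bestT on closer hits.
        // Otherwise: intersect the split plane, then push the far child
        // first and the near child last, so the near side is visited first
        // and prunes the far side as often as possible.
    }
    return float.IsPositiveInfinity(bestT) ? null : bestT;
}
```

Visiting the near child first is what makes the `MinT >= bestT` check effective: close hits are found early, so far subtrees are discarded before their leaves are ever touched.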
I omitted the break-even and just continued. In instrumentation profiles I saw … It worked.
I highly suspect we can also push this further. In the same profiles there were also …
I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt). The current changes prepare for the alternative ray-triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might want to even move the intersection into the … Another millisecond gone:
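The NaN invariant mentioned above works because every comparison involving NaN evaluates to false. A minimal sketch (illustrative names, not the exact zzre code):

```csharp
// Intersection helpers return float.NaN instead of a nullable on a miss.
// "t < bestT" is false for NaN, so misses are rejected without a separate
// "did we hit?" branch or null check.
float bestT = float.PositiveInfinity;

float t = RayTriangle();
if (t < bestT)   // false for NaN misses: no extra comparison needed
    bestT = t;

// Stand-in for the real intersection routine: NaN means "no hit".
static float RayTriangle(/* ... */) => float.NaN;
```

Compared to `float?`, this removes both the nullable wrapper and the `HasValue` branch from the hot loop.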
Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull back-facing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well... The deeds:
As I suspected, there is an intermediate in Möller-Trumbore that we can use to cull back-faces (it's just the sign of the determinant). Also I finally removed degenerate triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the … This might be the end for optimizations.
If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a path I want to go down; I already use more memory for the merged trees.
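For reference, the determinant-sign cull in Möller-Trumbore looks like this (a textbook sketch using the NaN-miss convention; not copied from the zzre source):

```csharp
using System.Numerics;

// Möller-Trumbore ray-triangle intersection with back-face culling.
// Returns the distance t along the ray, or float.NaN on a miss.
static float RayTriangleMT(Vector3 origin, Vector3 dir, Vector3 a, Vector3 b, Vector3 c)
{
    Vector3 edge1 = b - a, edge2 = c - a;
    Vector3 pvec = Vector3.Cross(dir, edge2);
    float det = Vector3.Dot(edge1, pvec);

    // The sign of the determinant encodes facing: det < 0 is a back-face,
    // det near 0 is a degenerate or edge-on triangle. One comparison culls both.
    if (det < 1e-7f)
        return float.NaN;

    float invDet = 1f / det;
    Vector3 tvec = origin - a;
    float u = Vector3.Dot(tvec, pvec) * invDet;
    if (u < 0f || u > 1f) return float.NaN;

    Vector3 qvec = Vector3.Cross(tvec, edge1);
    float v = Vector3.Dot(dir, qvec) * invDet;
    if (v < 0f || u + v > 1f) return float.NaN;

    return Vector3.Dot(edge2, qvec) * invDet; // distance t along the ray
}
```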
As discovered in #368 (and #313) the colliders are heavily allocating components due to many uses of LINQ and generator methods. Unfortunately it does not seem like we will get value generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
For sorting intersections we might also want to look into cached lists as well as cached stacks inside the enumerators, or have intersections (instead of raycasts) always write into a sorted list.
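A manual enumerator struct as a generator-method replacement can be as small as this (a minimal sketch with illustrative names; the real zzre enumerators yield intersections rather than integers):

```csharp
// foreach binds to GetEnumerator/MoveNext/Current by pattern matching,
// so a struct enumerator is never boxed into IEnumerator<T>: iterating
// this type allocates nothing on the heap.
public readonly struct IntersectionEnumerable
{
    private readonly int count;
    public IntersectionEnumerable(int count) => this.count = count;
    public Enumerator GetEnumerator() => new(count);

    public struct Enumerator
    {
        private readonly int count;
        private int index;
        internal Enumerator(int count) { this.count = count; index = -1; }
        public int Current => index;              // real code yields intersections
        public bool MoveNext() => ++index < count;
    }
}
```

Usage stays identical to the generator version, e.g. `foreach (var i in new IntersectionEnumerable(3)) { ... }`; the allocation savings come purely from the struct-based duck-typed pattern.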
For testing we can use the `TestRaycaster`, but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

Review todos:

- `PrefixSums` instead of `SumSums`
- `StackOverSpan`
- `ListOverSpan`, maybe `PooledList` (`TemporaryList`? `FixedTemporaryList`? Inline-array?)
- Use same collider for geometry instances (some other time)
- `IEnumerable Intersections` harder to call
- `Triangle.ClosestPoint`
Final benchmark results
GC Profile comparison
The GC profiler shows that `TreeCollider` was the main cause of per-frame allocations, but also that we have quite a way to go for zero allocations per frame.