Excessive data retention on fast benchmarks #61
Because the summary is calculated after teardown, I don't even have the option to force this behavior in my own tests.
I see, thank you so much for submitting an issue.
That's what I'm using as the workaround. I'm definitely still having some sort of issue with ensuring V8 is completely warm, but then that's a very old benchmarking problem.
Do you suppose it would break anyone if you calculated the performance numbers prior to teardown? I have no way of knowing if anyone but you is depending on the telemetry data still existing at the end of a run. I could expunge the samples during teardown; it's not awesome, but as a documented workaround I've encountered worse. Right now, though, teardown seems to run first, which scuttles that idea. My most problematic library has a couple dozen benchmarks, so clearing the data between tasks would make more than an order of magnitude of difference in memory utilization.
I don't think so, we can try and see.
The samples are the problem here?
That's where all the data goes, yes. But teardown is the last event I can see between tasks, and it is called immediately before the P## calculations are done. I could cheese something with the setup event, but that involves figuring out which task we are on and looking up the previous task, which seems problematic. For me, the order of the calculations and the event doesn't matter, except for whether I can manually prune data prior to the table() call. But I don't know enough people using tinybench in enough ways to offer any sort of suggestion on whether changing the order would break their use cases. I hate adding flags to fix user-specific problems, so a programmatic solution seems slightly more palatable IMO. But bumping the minor version number might be called for if you decide on that option.
I've gone down a rabbit hole. Send help. https://github.com/tdunning/t-digest Unfortunately the whole tinybench library is about the size of this algorithm (although it's in Java so maybe divide that by 2?). So while this is academically cool, I expect this is A Bridge Too Far territory.
Haha, I see, it's getting a bit complicated.
This one doesn't seem too bad. The actual logic is about two computer screens, and it's based on a refinement of t-digest. It's also the only one that seems to be moderately actively maintained: https://github.com/hville/sample-distribution The algorithm does have fairly enticing amortized space and computation overheads, less than the sort() call. But really, just inverting the calculation and event-emit phases is probably going to solve most people's problems. If your test can't manage 50 MB of telemetry per app, you've got bigger fish to fry. If you tried to scale this up into some sort of replacement for Gatling or BlazeMeter, then you would really want to look at this.
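For illustration only (this is not t-digest or sample-distribution, and not proposed tinybench code): a Welford-style accumulator shows the general shape of a constant-memory running summary, where each sample updates the aggregates and is then discarded. Percentiles would still need a quantile sketch like the libraries linked above.

```ts
// Minimal sketch of a constant-memory running summary (Welford's algorithm).
// Each sample updates the aggregates and can then be thrown away, so memory
// use stays O(1) regardless of how many samples a fast task produces.
class RunningStats {
  private n = 0;
  private mean = 0;
  private m2 = 0;

  push(sample: number): void {
    this.n += 1;
    const delta = sample - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (sample - this.mean);
  }

  get count(): number {
    return this.n;
  }

  get average(): number {
    return this.mean;
  }

  // Sample variance; returns 0 for fewer than two samples.
  get variance(): number {
    return this.n > 1 ? this.m2 / (this.n - 1) : 0;
  }
}
```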
I have had vitest crash while running benchmarks, and I am pretty sure this was the reason why: the crash was caused by heap memory filling up. On subsequent runs with lower benchmark times I am easily seeing over 5 GB used by the process, which seems to be mostly the samples. I had to add a loop inside the test cases to artificially slow them down to get the run time I wanted without crashing.
Workaround: It looks like you could clear the extra statistics in the 'cycle' event (though to me 'teardown' sounds like the spot where one would expect to manipulate system state). |
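A rough sketch of that workaround, assuming the tinybench version under discussion, where the 'cycle' event fires once per completed task after its summary has been computed and the task result exposes a raw samples array (property names may differ between versions):

```ts
import { Bench } from 'tinybench';

const bench = new Bench({ time: 2000 });

bench.add('fast leaf task', () => {
  // ...code under test...
});

// After each task completes, drop its raw per-iteration samples so they
// don't accumulate across the whole run.
bench.addEventListener('cycle', (evt: any) => {
  const samples = evt.task?.result?.samples;
  if (Array.isArray(samples)) {
    // The computed statistics survive in the result; anything derived from
    // the raw samples (e.g. a sample-count column) may not.
    samples.length = 0;
  }
});

await bench.run();
console.table(bench.table());
```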
@polak-jan could you explain your workaround a bit more? |
@psirenny I just added a loop around the entire benchmark, so each run of the benchmark is actually X number of runs. This isn't really ideal as it can interfere with things like cache and JIT states. But I don't think there is any other reasonable way to get around the limitation right now. And assuming all your benchmarks use the same workaround, they should all be offset by a similar amount, so comparisons should still be mostly accurate.
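A hypothetical sketch of that pattern (the batching constant and the function under test are placeholders): each recorded sample now covers many executions, so tinybench retains far fewer samples for very fast code, at the cost of the cache/JIT distortion mentioned above.

```ts
import { Bench } from 'tinybench';

// Placeholder for the real code under test.
function fastFunctionUnderTest(): number {
  return Math.sqrt(42);
}

const INNER_RUNS = 1_000; // arbitrary batching factor

const bench = new Bench({ time: 2000 });

// Each tinybench sample now times INNER_RUNS executions instead of one,
// cutting the number of retained samples by roughly that factor.
bench.add(`fast function x${INNER_RUNS}`, () => {
  for (let i = 0; i < INNER_RUNS; i++) {
    fastFunctionUnderTest();
  }
});

await bench.run();
console.table(bench.table());
```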
I don't think the warmup time for tinybench is anywhere near adequate for getting V8 to settle the code. 'Solving' that problem by deoptimizing tinybench is fighting the wrong fight.
In an attempt to get more accurate data out of tinybench, I've been using time: 2000 or in some cases 6000 for some test suites.
The problem here is that given some leaf node tests are running 2, 3, even 4 orders of magnitude faster than the more complex tests, some of my tests are generating thousands of samples, and others are generating several million. This extra memory pressure is causing some weird race conditions with the code under test.
Now, while switching the telemetry calculation to an algorithm with O(k) or O(log n) memory complexity for running totals is probably a lot to ask, retaining samples for task A while running task C is a waste of resources.
Expected behavior: Disabling all but one task should result in the same success or failure modes for the code under test. The summaries should be calculated at the end of a task and the working data discarded.
Actual behavior: all statistics are kept for the entire run.
Workaround: use the iterations option instead of time.
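A sketch of that workaround, assuming the Bench options documented for tinybench at the time (iterations, time, warmupIterations); the exact interplay between the time budget and the iteration count may vary by version, and the values here are illustrative.

```ts
import { Bench } from 'tinybench';

// Bound the number of samples directly instead of letting a time budget
// collect millions of samples for very fast tasks.
const bench = new Bench({
  iterations: 10_000,      // target number of recorded samples per task
  time: 0,                 // don't keep sampling just to fill a time budget
  warmupIterations: 1_000, // still give V8 a chance to optimize first
});
```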