Discussion: On achieving accurate benchmarks #65

Open
jdmarshall opened this issue Nov 7, 2023 · 11 comments

Comments

@jdmarshall

I notice there have been several commits that de-optimize parts of tinybench because someone saw that it was fouling their benchmarks.

I feel like the NodeJS situation is tricky enough that it might warrant a broader conversation on how to get consistent benchmarks, at the end of which the salient points should go at the bottom of the README.md file.

@jdmarshall
Author

jdmarshall commented Nov 7, 2023

JITs mean that a function may or may not get optimized based on how big it is and how often it gets called. Benchmarks without a substantial warming time may find that the first task being run is slower than the third, especially if the benchmarks delegate to functions that are common between them.

You might be able to spot this by reordering the benchmarks. And as Feynman said about the scientific method: "The first principle is that you must not fool yourself, and you are the easiest person to fool." Benchmarks that seem to tell us what we want to hear will result in PRs that may actually regress the system instead of improving it.

Doing different test runs in different orders can help with this. If your benchmarks are not coupled, this could be automated. But mostly what you want is to use a longer warmup time than tinybench's default. I have a couple modules where warmup is longer than the test run (because of the OOM issue which I have an open PR to ease). Because warmup doesn't gather telemetry it can run as long as necessary to heat up V8.
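To make the reordering idea concrete, here is a minimal sketch of a seeded shuffle for task order. It is not part of tinybench; the function names, the seed, and the task names are all made up for illustration. The point of the seed is that a surprising run can be replayed in the same order.

```javascript
// Small deterministic PRNG (mulberry32 variant) so a given seed
// always reproduces the same sequence of numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle that returns a new array and leaves the input untouched.
function shuffledOrder(tasks, seed) {
  const rand = mulberry32(seed);
  const out = tasks.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Hypothetical task names. Log the seed so the order can be reproduced later.
const seed = 20231107;
const order = shuffledOrder(['parse', 'merge', 'serialize'], seed);
console.log(`seed=${seed} order=${order.join(',')}`);
```

If the results change materially between seeds, the benchmarks are probably coupled or under-warmed.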

@jdmarshall
Author

jdmarshall commented Nov 7, 2023

Thermal Throttling

This one has been hitting me big time. A lot of us these days are developing on a laptop. If the fan kicks in halfway through your run, it may be enough to keep your machine running at max CPU frequency. But you may still get thermal throttling even with the fan running full blast.

For local benchmarking I often give myself a break between every few changes to allow everything to settle back down to unthrottled. But what I probably need is some sort of monitor that tells me when throttling is happening. I end up relying on the CI/CD pipeline for consistent results, and I hate to say it but the branch-and-PR process is substantially what saves me from putting in regressions that only work on my machine, one time, three days ago.
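A rough sketch of the kind of monitor I mean, assuming that a fixed CPU-bound loop is a usable proxy for clock speed: time the loop before the run to get a baseline, time it again between tasks, and warn when it gets noticeably slower. The function names and the 15% tolerance are arbitrary choices, not anything tinybench provides, and a slow probe can also mean a noisy neighbor rather than throttling.

```javascript
// Time a fixed, allocation-free workload and return elapsed milliseconds.
// `performance` is a global in current Node.js versions.
function timeCalibrationLoop(iterations = 2_000_000) {
  const start = performance.now();
  let acc = 0;
  for (let i = 0; i < iterations; i++) {
    acc = (acc + i * i) % 1000003;
  }
  const elapsed = performance.now() - start;
  // Reference acc so the loop has an observable result.
  return acc >= 0 ? elapsed : NaN;
}

// Pure check: is the probe slower than the baseline by more than `tolerance`?
function looksThrottled(baselineMs, probeMs, tolerance = 0.15) {
  return probeMs > baselineMs * (1 + tolerance);
}

// Take the fastest of a few runs as the baseline so JIT warmup does not inflate it.
const baseline = Math.min(
  timeCalibrationLoop(),
  timeCalibrationLoop(),
  timeCalibrationLoop()
);
const probe = timeCalibrationLoop();
if (looksThrottled(baseline, probe)) {
  console.warn(`calibration loop slowed from ${baseline.toFixed(2)}ms to ${probe.toFixed(2)}ms`);
}
```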

@jdmarshall
Author

jdmarshall commented Nov 7, 2023

Memory Pressure

One of my favorite Secret Optimization Tricks is the Misattributed Cost. You can have an inconsequential task that slowly smashes your L2 and L3 cache, or allocates lots of garbage to be GCed. Then a periodic high priority task comes along and your flame graph shows that it is spending a bunch of time in data retrieval or memory allocation, and the real problem is that the cadence of the app is such that this task always triggers cache misses or heavy GC - which it is not responsible for causing in the first place. It looks like a tall tent pole because it's being blamed for someone else's problems.

In some domains this problem might be construed as Priority Inversion - a low priority task is stealing resources from a high priority one.

I just landed a change that avoids defensive copies during a tree merge operation, in the cases where I could prove there are no aliasing problems. The effect on my benchmark times was negligible. The mean response time was basically unchanged. P95 might have gone down a little but is within the noise floor. However I did reduce the daily average CPU usage by about 2%. That's more headroom during peak traffic for sure, and if your cluster size is 25 boxes, that's half a machine you either didn't need to turn on, or autoscaling that kicks in a little later and shuts off a little sooner.

@jdmarshall
Author

jdmarshall commented Nov 7, 2023

The common thread I'm detecting in my own observations is that I think tinybench might need a CLI that can be used to override the default behaviors of your benchmarks. Being able to adjust max memory to create pressure, run the tasks in random or reverse order, adjust the iterations or timings, are all useful things. Particularly in a CI/CD environment.

I am running benchmarks on every build, and sending the telemetry from main-line builds to Grafana to chart. These are used as an asynchronous performance gate by auditing them a couple of times a week, and using the timestamps to narrow down where a regression entered the system. I could one day see running benchmarks at a stress-test level once per day or once per shore (before start-of-business or end-of-business for local and remote workers). Or more simply, running a debounce trigger that schedules a run X minutes after the last commit. Having a build step that jacks up the settings for all tests would be very useful there.
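For the debounce trigger, a minimal sketch could look like the following. Everything here is hypothetical (the function name, the commit id argument, the injectable `timers` object); the timers are injectable only so the logic can be exercised without waiting for real time to pass.

```javascript
// Returns a function to call on every commit. `run` fires once, with the most
// recent commit id, after `quietMs` milliseconds without another commit.
function createDebouncedTrigger(run, quietMs, timers = { setTimeout, clearTimeout }) {
  let pending = null;
  return function onCommit(commitId) {
    // A new commit cancels the run scheduled for the previous one.
    if (pending !== null) timers.clearTimeout(pending);
    pending = timers.setTimeout(() => {
      pending = null;
      run(commitId);
    }, quietMs);
  };
}

// Usage sketch (not invoked here, because it would keep the process alive):
//   const onCommit = createDebouncedTrigger(
//     (id) => console.log(`starting stress benchmarks at ${id}`),
//     15 * 60 * 1000
//   );
//   onCommit('abc123');
```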

@Aslemammad
Member

Yo, you're hitting on some knowledge that I've never heard before. Great job.

For now I don't think we can have a cli and we should try to do things internally as much as possible.

@jdmarshall
Author

jdmarshall commented Nov 7, 2023

A hypothetical CLI would probably set up some global state that gets read back when you call 'new Bench'. It could also handle some lifecycle issues that I've been having to deal with manually, like 'what does CTRL-C do when you realize one minute into a run that your new benchmark is incorrect?' (currently: not much). It could also sequence through several benchmark files, like Mocha, using a glob pattern.

The global shared state I will probably fake with some environment variables. But neither of these scales. I've added benchmarks to the 10% of our modules where I suspect problems lie, and a couple of modules that have a lot of dependencies to see if I can smoke out a few more. But at a minimum I need to do 20% and ideally I would do 80%. I'm not maintaining that many, the way they are currently written. And I'm definitely not maintaining that many in a way that covers all of the boundary conditions I feel are worth covering. It's too much tedium.
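The environment-variable approach might look roughly like this. The variable names (`BENCH_TIME_MS`, `BENCH_WARMUP_MS`, `BENCH_ORDER`) are invented for this sketch and tinybench does not read any of them. The `time` and `warmupTime` keys are modeled on tinybench's Bench options, so check them against the version you use; `order` is purely hypothetical and would have to be handled by your own runner code.

```javascript
// Parse a positive number from an env value; undefined means "no override".
function positiveNumber(raw) {
  if (raw === undefined || raw === '') return undefined;
  const n = Number(raw);
  return Number.isFinite(n) && n > 0 ? n : undefined;
}

// Collect overrides from hypothetical environment variables.
// Invalid or missing values are ignored rather than treated as errors.
function readBenchOverrides(env = process.env) {
  const overrides = {};
  const time = positiveNumber(env.BENCH_TIME_MS);
  if (time !== undefined) overrides.time = time;
  const warmupTime = positiveNumber(env.BENCH_WARMUP_MS);
  if (warmupTime !== undefined) overrides.warmupTime = warmupTime;
  if (['declared', 'reverse', 'random'].includes(env.BENCH_ORDER)) {
    overrides.order = env.BENCH_ORDER;
  }
  return overrides;
}

// Per-file defaults (made-up values) merged with whatever the CI job exported.
const defaults = { time: 500, warmupTime: 2000 };
const options = { ...defaults, ...readBenchOverrides() };
console.log(options);
```

Every benchmark file still has to opt in by merging the overrides into its own options, which is exactly the part that doesn't scale.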

@kurtextrem
Contributor

Also something to keep in mind: eval has a cache. So doing something like .add('eval', () => { (0,eval)('('+str+')') }) will skew the result, as long as str is the same across all invocations.
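One possible way to keep that cache out of the measurement, assuming it is keyed on the exact source text, is to make the source differ on every call, for example by appending a throwaway comment. This is a sketch, not a verified recipe: the helper name is made up, and the extra comment adds a small parsing cost of its own.

```javascript
// Sample payload; any expression that can be wrapped in parentheses works.
const str = '{ "answer": 42 }';

let counter = 0;

// Indirect eval with a unique trailing comment, so no two calls
// pass an identical source string to the engine.
function evalUncached(source) {
  counter += 1;
  return (0, eval)('(' + source + ')/*' + counter + '*/');
}

console.log(evalUncached(str).answer);
```

Whether you want the cached or the uncached number depends on what the real code does; the important part is knowing which one you are measuring.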

@leeoniya

had a similar convo about how my Benchmark.js results were showing big GC pressure slowdowns due to the too-tight bench loop, despite all my attempts to wait between cycles:

leeoniya/uDSV#2 (comment)

ended up trying tinybench and writing my own bench runner which was more representative of real world lib invocation patterns rather than a hot-allocating bench loop with no GC breathing room.

@jdmarshall
Author

Sometimes GC pressure is the bottleneck (probably more than most people know) but it's important to be deliberate in what you are modelling instead of picking up sampling artifacts.

@etki

etki commented Aug 24, 2024

I've left a lot here: #46 (comment)

I guess my main suggestion would be "just take BenchmarkDotNet or jmh and dissect it".

You need to run benchmarks in separate processes. You have to let users specify options to control the heap or whatever other flags are there. You have to spin it for at least several minutes and not just half a second, because there is always Chrome in the background that wakes up at an unfortunate moment. You need to warn the user that they must run it on a server with no desktop environment at all. To excel, you also need to get to the level of dumping bytecode and assembly, profiling the run and polling CPU performance counters.

The timers should be called as rarely as possible. The results array must be pre-allocated (Array(N)), and as accounting for every run is virtually impossible, results should be bucketed (and there is pooled variance, which I believe can help with finding a somewhat correct stddev), because right now you grow the list on every iteration, which also takes time. Even though it's outside of the measured region, infrastructure code must still perform in a constant or at least linear fashion, because otherwise it will put pressure on the benchmark one way or another.

You also need infrastructure for users to provide different parameters for benchmarks. Otherwise you get optimizations for a single input, and in real life the profile will catch the fact that it is impossible to even account for all incoming values.

But I guess the main thing is to question every move and how it will affect the measurements.
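A small sketch of the pre-allocation, fewer-timer-calls and pooled-variance points, under my own assumptions about how they fit together (this is not how tinybench works today, and all names are made up). Each slot in the pre-allocated buffer holds the mean time per call over a group of calls, so the timer is read twice per group rather than twice per call. The pooled variance then combines the per-batch variances weighted by their degrees of freedom.

```javascript
// Fill a pre-allocated buffer; the array is sized once and never grows.
// Each slot is the mean time per call over `callsPerSlot` calls, so the
// resulting variance describes slot means, not individual calls.
function measure(fn, slots, callsPerSlot) {
  const out = new Float64Array(slots);
  for (let s = 0; s < slots; s++) {
    const start = performance.now();
    for (let c = 0; c < callsPerSlot; c++) fn();
    out[s] = (performance.now() - start) / callsPerSlot;
  }
  return out;
}

// Welford's online algorithm: one pass, constant extra memory.
function summarize(samples) {
  let n = 0;
  let mean = 0;
  let m2 = 0;
  for (const x of samples) {
    n += 1;
    const delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }
  return { n, mean, variance: n > 1 ? m2 / (n - 1) : 0 };
}

// Pooled variance: sum((n_i - 1) * s_i^2) / sum(n_i - 1).
function pooledVariance(batches) {
  let numerator = 0;
  let degreesOfFreedom = 0;
  for (const { n, variance } of batches) {
    if (n < 2) continue; // a single sample carries no variance information
    numerator += (n - 1) * variance;
    degreesOfFreedom += n - 1;
  }
  return degreesOfFreedom > 0 ? numerator / degreesOfFreedom : 0;
}

// Three short batches of a trivial workload, kept only as summaries.
const batches = [];
for (let b = 0; b < 3; b++) {
  batches.push(summarize(measure(() => Math.sqrt(b + 2), 50, 1000)));
}
console.log('pooled stddev (ms):', Math.sqrt(pooledVariance(batches)));
```

Only the per-batch summaries are retained, so memory use stays constant no matter how long the run is.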

@jdmarshall
Author

jdmarshall commented Sep 5, 2024

@etki

I was a Java dev until about 10 years ago, and in some respects all JITs have the same set of repeatability problems. I think I agree with everything you said there.

I wish I understood V8 Isolates well enough to know for sure if tinybench would benefit from running each benchmark in a worker thread, or whether a child fork is really necessary. But that would be a major version change with heavy documentation.

Why? Because in general you absolutely should not share global state between individual benchmarks as it can cause problems with unordered runs or for instance disabling all but one to test a theory. However people are people, and if you run each benchmark in a worker thread you will have no global shared state to exploit, and that'll break more than zero existing tests. Potentially a lot more.

I'm also wondering if running the tests in a docker image pinned to a particular cpuset would reduce the background noise from noisy neighbors or just muddy the waters further.


5 participants