Additional benchmarks #62
…e charting; introduce Set vs Fill benchmarks; report bytes/entry and blocks/entry.
Do flush explicitly; it is necessary in this library's API.

Started measuring byte size via another, more direct mechanism. There's a significant difference between the two forms of measurement (the checkSize method gives much flatter results than looking at the blockstore does!); I don't yet understand why this is or what significance it might (or might not) have.

BenchmarkSet now has two variants: bulk flush and individual flush (see the sketch below). The behavior in terms of how many new blocks are created varies markedly, as you'd expect.

Include more bitsizes. I want to gather enough datapoints to draw interesting charts. Surely we'll expect to see some predictable change in the scaling curves across this dimension?

Histograms of the sizes of blocks that appear in storage are now available. I've commented them back out after making some observations; they produce quite a lot of noise in the output.
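To make the two variants concrete, here is a minimal sketch of the flush patterns being compared. The `node` interface is a hypothetical stand-in for this library's node type, and the real benchmark code in the diff may be shaped differently.

```go
package hamtbench

import "context"

// node is a hypothetical stand-in for the HAMT node type; only the two
// methods relevant to the flush patterns are modeled here.
type node interface {
	Set(ctx context.Context, key string, value []byte) error
	Flush(ctx context.Context) error
}

// setBulk inserts all entries, then flushes once at the end.
func setBulk(ctx context.Context, n node, keys []string) error {
	for _, k := range keys {
		if err := n.Set(ctx, k, []byte("value")); err != nil {
			return err
		}
	}
	return n.Flush(ctx) // one flush for the whole batch
}

// setIndividual flushes after every insertion, so each Set pays the full
// serialize-and-store cost on its own.
func setIndividual(ctx context.Context, n node, keys []string) error {
	for _, k := range keys {
		if err := n.Set(ctx, k, []byte("value")); err != nil {
			return err
		}
		if err := n.Flush(ctx); err != nil {
			return err
		}
	}
	return nil
}
```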
…arting. The largest scale entry in the table is commented out because it's an OOM-causer on some of my hardware, and I'd expect to get some impression of progression from these four orders of magnitude (base 10) anyway. I added (roughly) halfway points on the scale table just to try to fill things out a bit more. (You'll see why in the results discussion; some things are noisier than expected.)

I'm leaving the charting tools to be handled out of band; I don't think adding them as transitive dependencies of this repo would be good.

Results? I'm a little perplexed, actually.

- BenchmarkFill-blocks-per-entry-vs-scale.svg ...
  - is completely all over the map. There appears to be *no* correlation between the bitwidth parameter and the growth rate for block counts.
- BenchmarkFill-totalBytes-per-entry-vs-scale.svg ...
  - is also completely all over. No correlations.
- BenchmarkFind-speed-vs-scale.svg ...
  - does sort of trend up as one would expect;
  - the noise is almost bigger than the signal, which is certainly interesting (but not wrong).
  - There's oddness for bitwidth=8: unlike all the others, it's actually slower at the smallest scales. Maybe worth questioning?
- BenchmarkSetBulk-addntlBlocks-per-addntlEntry-vs-scale.svg and BenchmarkSetIndividual-addntlBlocks-per-addntlEntry-vs-scale.svg ...
  - these both have reasonable up-and-to-the-right trends.
  - With batched flush, results are somewhat noisier; per-insertion flush is fairly smooth.
  - It's a clear progression: higher bitwidth -> fewer new blocks created per insertion.

Note that due to the way our randomization is seeded (a small sketch of this follows below), even though all these keys and values are "random", they're deterministic between benchmark runs, and they're also the same on each benchmark between each parameter set. (Larger b.N adds noise, but I've been running with fixed iteration counts, e.g. -benchtime=10x or -benchtime=30x, because using time as the stopping condition causes the larger values of "n" for scale to get only a single run.)

I'm very curious about the blocks and bytes per entry being apparently completely unaffected by scale across multiple orders of magnitude. Certainly we want those things to scale "well", for some definition of "well", but... complete noise is... not what I expected. Wondering if there's a measurement bug here.

These are some interesting observations, but more numbers are needed; I'm not certain these are the right ones, and they certainly aren't the only interesting ones. In particular, I'm not sure there's enough probing of the layers of caching and flushing and how they affect observations; there may be a lot more to do in order to produce useful insights thereabout.
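For illustration, fixed-seed key generation of the kind described above might look like this; the function name and key format are made up, and the benchmarks' actual helper may differ.

```go
package hamtbench

import (
	"math/rand"
	"strconv"
)

// deterministicKeys generates pseudo-random keys from a constant seed, so
// the "random" keys are identical on every benchmark run and across
// parameter sets.
func deterministicKeys(n int) []string {
	rng := rand.New(rand.NewSource(42)) // fixed seed
	keys := make([]string, n)
	for i := range keys {
		keys[i] = strconv.FormatUint(rng.Uint64(), 16)
	}
	return keys
}
```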
These are both reported as metrics, and also actively logged if repeated puts are detected in any number. (It is necessary to do both, because repeated puts could otherwise end up rounded down to zero, and I'd in fact like to see whether the number is at all nonzero.)

I also wanted to try to make some use of these numbers within the addAndRemoveKeys function... however, at the moment, that goes over like a lead balloon. Immediately after the first Set operations, and before the Flush, there may already be some Put operations! (In practice, I saw both 0 and 1 in the tests that already call the addAndRemoveKeys test helper function; just enough variance to make asserting on it not fly.)

The reason for these Puts-without-Flush is that `modifyValue` does a `store.Put`... but only in the path where the KV array is full and it creates a new subshard: it then puts the new subshard (and not, arguably surprisingly, itself). (A rough illustration of this path follows below.) I'm not sure I entirely grok the high-level reasoning behind this, and these Puts are a tad difficult to predict, short of evaluating the full hamt algorithm itself.

These Puts-without-Flush also explain why the graphs for BenchmarkFill-blocks-per-entry-vs-scale.svg were so uncorrelated, as remarked on in the previous commit: the number of blocks generated is largely affected by how many changes are accumulated before Put starts to occur, and if things aren't *actually* being buffered up until Flush is called, then of course we get plenty of Put operations no matter what.

This may indicate a problem: if someone was expecting batching to really work for garbage-avoidance reasons, it... doesn't look like it really does. (It's possible this hasn't actually been evident in any of the workflows people are using this hamt for, though; if the workload is doing various point changes, the effectiveness of this batching doesn't matter very much.)

The countSize method gains some docs, because in the first writing of this, I reported these blockstore event stats *after* countSize... and yeah, oof, that's a measurement error alright. (You'll see many, many, many duplicate puts if you do things in this order.)
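Here is that rough illustration. This is not the library's actual `modifyValue` code; all the type and helper names are invented for the sketch, which only aims to show why Puts can land in the blockstore before Flush is ever called.

```go
package hamtsketch

import "context"

// putter is a hypothetical stand-in for the blockstore: Put serializes a
// node and returns some reference to it.
type putter interface {
	Put(ctx context.Context, node interface{}) (ref string, err error)
}

type kv struct {
	key string
	val []byte
}

// shard models a node pointer that either holds an inline bucket of KVs
// or a reference to a child shard.
type shard struct {
	kvs  []kv
	link string
}

// insert shows the asymmetry: the common path only mutates the in-memory
// node (persisted later by Flush), but the bucket-overflow path writes the
// new subshard to the store immediately.
func insert(ctx context.Context, store putter, s *shard, item kv, bucketMax int) error {
	if len(s.kvs) < bucketMax {
		s.kvs = append(s.kvs, item) // in-memory only; no blockstore traffic yet
		return nil
	}
	// Bucket is full: spill it (plus the new entry) into a new subshard...
	sub := &shard{kvs: append(append([]kv{}, s.kvs...), item)}
	// ...and Put that subshard right away, without waiting for Flush.
	ref, err := store.Put(ctx, sub)
	if err != nil {
		return err
	}
	s.kvs, s.link = nil, ref
	return nil
}
```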
This new graph tracks BenchmarkSetIndividual-addntlBlocks-per-addntlEntry-vs-scale.svg pretty darn closely, just with a scaling of the Y axis (it more or less shifts up by one! Which I suppose would make sense).
I belatedly noticed that https://github.com/ipfs/go-hamt-ipld/pull/33 also adds more benchmarks, in a PR that has gone unmerged :( There may be some duplicate work between the two that might need manual resolution; I have not yet inspected it deeply.
Nice. I don't see a comparison that includes
Here's the visualization for the bulk-set: it's the same general shape as the individual-sets graph, but the Y axis does have a different scale -- individual setting causes about twice as many garbage blocks. (It's also just a bit noisier of a graph.) I think these would be drastically more distinct from each other if it weren't for the somewhat weird behavior of Sets that cause a new subshard immediately issuing a Put to the blockstore -- that causes a lot of writes that would otherwise be avoidable, I suspect.
IIRC, the failure reported by CI here is just that
5,
6,
7,
8,
@warpfork do you know why we don't try things out at higher bitwidths? For context I'm working on fitting particular HAMT sizes and workloads to optimal bit widths.
It has no effect! There's a fence around the config initialization that just brings anything larger back down to 8.

There are also casts from an `int` down into a `byte` in some of the internal logic, which effectively limits things to 8 bits. I think these can all be changed to `int` fairly trivially though (I did this briefly out of curiosity, but then dropped the diff again because it didn't seem like a priority), and then presumably we could also drop the config range fence, and then larger bitwidth numbers would work.
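A hypothetical sketch of the two limits just described; the names and exact shape here are illustrative rather than the repo's actual code.

```go
package hamtsketch

// effectiveBitWidth illustrates the config "fence": a requested bitwidth
// above 8 is quietly brought back down, so larger values have no effect.
func effectiveBitWidth(requested int) int {
	if requested > 8 {
		return 8
	}
	return requested
}

// narrowIndex illustrates the second limit: index math that narrows an int
// down to a byte can only represent 0..255, i.e. at most 2^8 child slots,
// so a bitwidth above 8 would silently truncate indexes here.
func narrowIndex(idx int) byte {
	return byte(idx)
}
```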
Also relevant: larger bitwidths mean a larger bitfield byte sequence in the serial format (currently this is somewhat dynamic, but not necessarily usefully so; #54 / #63 discusses changing some of the serial structures to use a fixed number of bytes). So, you will probably also want to watch the serial size graphs when increasing bitwidths to see if the size of this field becomes impactful.
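To put a rough number on that: the per-node bitfield has one bit per possible child slot, i.e. 2^bitwidth bits, so its maximum serialized size grows exponentially with bitwidth. A small back-of-the-envelope helper (illustrative only; the actual encoding is the somewhat-dynamic one mentioned above):

```go
package hamtsketch

// bitfieldBytes returns how many bytes a fully-populated bitmap needs for a
// given bitwidth: 2^bitWidth bits, rounded up to whole bytes.
// e.g. bitwidth 5 -> 4 bytes, 8 -> 32 bytes, 10 -> 128 bytes.
func bitfieldBytes(bitWidth int) int {
	bits := 1 << uint(bitWidth)
	return (bits + 7) / 8
}
```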
This PR contains a variety of new and extended benchmarks.
Existing benchmarks were moved to standardized benchmark names and standardized parameter tables -- this is useful if we want to do charting (which will indeed be discussed later in this PR comment! Just hang on).
Set vs Fill benchmarks are introduced. Fill benchmarks (which already existed, but are renamed) measure how long it takes to create an entire hamt of some size; the Set benchmarks do an additional 1000 operations on top of an already-created hamt, which offers more insight into the costs of operations at particular scales.
Many new stats are tracked, using the now-standard ReportMetric feature.

The `mockBlocks` type was extended to add accounting for many of these statistics. (We can imagine using this in the future to make more tests asserting, for example, the number of get and put events expected for some operation. That's left for future work, however.)
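As a rough sketch of what the per-entry reporting looks like (the helper and metric names below are invented; the real benchmarks in the diff read their counters off the accounting blockstore):

```go
package hamtbench

import "testing"

// runOneFill is a hypothetical stand-in for building a HAMT of some scale
// against an accounting blockstore and returning what was observed.
func runOneFill() (blocksCreated, bytesWritten, entries int64) {
	// ... build the hamt, flush, and read counters off the blockstore ...
	return 0, 0, 1
}

func BenchmarkFillSketch(b *testing.B) {
	var blocks, byteCount, entries int64
	for i := 0; i < b.N; i++ {
		bl, by, en := runOneFill()
		blocks, byteCount, entries = blocks+bl, byteCount+by, entries+en
	}
	// ReportMetric attaches custom units to the benchmark output, which is
	// what makes per-entry charting across the parameter tables possible.
	b.ReportMetric(float64(blocks)/float64(entries), "blocks/entry")
	b.ReportMetric(float64(byteCount)/float64(entries), "bytes/entry")
}
```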
Now, for some selected results!
I charted these using https://github.com/cep21/benchdraw. Instructions for reproducing these can be found in the comments in the diff body.
First of all: the number of blocks created in the blockstore per entry placed into a hamt during the Fill test is... random?
The reason for this appears to be that creation of subshards during value insertion will unconditionally and immediately Put to the blockstore. (This might be a bit surprising, since almost every other logical branch that value insertion and update can take will mutate cache nodes, and not immediately do a Put.)
This seems like it could represent a lot of unnecessary garbage block creation. However, this may vary by workloads: if a workload can't bulk its writes anyway, this may simply not matter in practice. (I don't have a lot of information about workloads right now.)
Get events to perform a lookup: larger bitwidth parameters need fewer get events. Larger hamts need slowly more gets, of course. No big surprises here. (I like no surprises.)
The number of additional blocks created while inserting entries is also higher for smaller bitwidths. Larger hamts also need slowly more blocks. This is also simply good confirmatory news: it scales reasonably nicely.
More benchmarks could be useful in the future. In particular, I'm very aware that these are benchmarking synthetic operations; more information about actual workloads could result in much more interesting benchmarks. But hopefully these are useful, and offer a workable outline of how to do yet more in the future.