
Introduce Bitmap type, use fixed size byte array in serialization, tighter bitWidth validation #63

Closed

Conversation

@rvagg (Member) commented Aug 13, 2020

Builds on previous PRs that are not yet merged (#57, #59 and #60). Addresses #54 and the solution discussed there.

Only one commit; see the last in the branch, currently 7eb4d1d (previously 091854c).

This change takes control of the bitmap with a new Bitmap type that does the internal work. This lets us abstract away most of the details of the bitmap arrangement and avoid the specific layout of big.Int. It now works by bitWidth and validates that the byte length in the serialization format is exactly correct for that bitWidth.

The algorithm is the same and it'll output the same bytes as the old format, except not truncated, always exactly the right number of bytes for the bits we need to work with. I retained an old test in uhamt_test.go that validates against the original ordering and it's all good.
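For illustration, a minimal sketch of the exact-size rule described above; the function names here are made up for the example and are not the PR's actual API:

```go
package main

import "fmt"

// expectedBitmapSize returns the number of bytes needed to hold 2^bitWidth
// bits, which is the only length the serialization should accept.
func expectedBitmapSize(bitWidth int) int {
	return ((1 << uint(bitWidth)) + 7) / 8
}

// validateBitmapSize rejects truncated or padded bitmaps.
func validateBitmapSize(b []byte, bitWidth int) error {
	if want := expectedBitmapSize(bitWidth); len(b) != want {
		return fmt.Errorf("bitmap is %d bytes, expected exactly %d for bitWidth %d",
			len(b), want, bitWidth)
	}
	return nil
}

func main() {
	fmt.Println(validateBitmapSize(make([]byte, 4), 5)) // <nil>: 2^5 bits = 4 bytes
	fmt.Println(validateBitmapSize(make([]byte, 3), 5)) // error: truncated bitmap
}
```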

Further proposed change

BUT, I'd like to switch the ordering of the bytes around before this lands, to match what would be more natural in most languages when you're dealing with bits within byte arrays. So rather than the BE-style format that big.Int gives us, go with something that looks like an LE-style format (we're not dealing with number representations so it's not strictly BE or LE, just ordered differently).

Currently, if we set bits 1, 3, 5, 10, 20, 31 of a 4-byte array (bitWidth 5, i.e. 32-bits) we would be setting them like so:

10000000 00010000  00000100 00101010

I'd like to reverse that so the ordering is more natural:

00101010 00000100 00010000 10000000 

i.e. bits 1, 3 and 5 are all within the first byte, not the last.

This is how the JavaScript HAMT implementation works, and it's how @ianopolous' one in Peergos works (thanks to java.util.BitSet, mentioned here). It's a more natural way of addressing bits.

The implementation in here has a Bitmap#bindex() method that does the flip of the index, and the tests rely on a rev() function to reverse inputs (test inputs I pulled from my JavaScript tests). So it'd be simple to reverse it in these two places; see the sketch below.
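To make the two orderings concrete, here's a hedged sketch (hypothetical helper names, not the PR's bindex() code) showing where bit i lands under each layout, using the example values above:

```go
package main

import "fmt"

// setBitCurrent follows the big.Int-style layout used today: bit 0 is the
// least significant bit of the *last* byte, so the byte index is flipped.
func setBitCurrent(b []byte, i int) {
	b[len(b)-1-i/8] |= 1 << uint(i%8)
}

// setBitProposed follows the proposed BitSet-style layout: bit 0 is the
// least significant bit of the *first* byte, so bits 1, 3 and 5 all land
// in the first byte.
func setBitProposed(b []byte, i int) {
	b[i/8] |= 1 << uint(i%8)
}

func main() {
	cur, prop := make([]byte, 4), make([]byte, 4)
	for _, i := range []int{1, 3, 5, 10, 20, 31} {
		setBitCurrent(cur, i)
		setBitProposed(prop, i)
	}
	fmt.Printf("%08b\n", cur)  // [10000000 00010000 00000100 00101010]
	fmt.Printf("%08b\n", prop) // [00101010 00000100 00010000 10000000]
}
```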

It's a breaking change, but so is the serialization format introduced here with the bitmap being fully present and not truncated.

/cc @anorth

@rvagg (Member, Author) commented Aug 13, 2020

CI is failing because I'm using binary literals for the tests, which require Go 1.13+, and CI is on 1.11. I'll swap them out for something else tomorrow; too late tonight.
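For context, the swap being described is just a spelling change (example value only, not one of the actual test constants):

```go
package hamt

// Two spellings of the same constant: the first needs Go 1.13+,
// the second compiles on Go 1.11.
const maskNew = 0b00101010
const maskOld = 0x2a
```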

@warpfork (Contributor)

We should bump the Go version in CI on master and just get that in. I found that ReportMetric was also missing when trying to get a green light in #62.

// individual bits and perform limited popcount for a given index to calculate
// the position in the associated compacted array.
type Bitmap struct {
	Bytes []byte
Contributor

I'll overthink this real quick just for completeness:

There are three (or maybe four) ways I could imagine implementing this:

  • like this: a struct wrapping a byte slice
  • just a typedef of a byte slice (type Bitmap []byte)
  • a struct with a fixed size array (and just panic on bitWidths that are too big to fit in it)
  • a struct with a slice and also a fixed size array, and the slice initializes to the address of the fixed size array if it's big enough (like what bytes.Buffer does).

The second avoids an alloc if we assume that Bitmap is usually also used as a pointer that incurred its own heap alloc.

The third and fourth avoid an alloc by virtue of using fixed size arrays.

I don't know if any of this matters however; I think in context of all the other sources of cpu cost in evaluating a hamt, this probably just isn't very important. At least, I'd assume it isn't until a benchmark or profiler report told me it is. And the current way of writing it is probably the simplest and clearest.
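For illustration only, a hedged sketch of the fourth option (hypothetical names, not code from this PR):

```go
package hamt

// bitmapArr keeps a small fixed-size array inline and points the slice at
// it when the bitmap fits, avoiding a separate heap allocation (similar in
// spirit to what bytes.Buffer does with an internal bootstrap array).
type bitmapArr struct {
	bytes []byte   // active view; aliases arr when the bitmap fits
	arr   [32]byte // enough for bitWidth <= 8 (2^8 bits = 32 bytes)
}

func newBitmapArr(bitWidth int) *bitmapArr {
	size := ((1 << uint(bitWidth)) + 7) / 8 // bytes needed for 2^bitWidth bits
	bm := &bitmapArr{}
	if size <= len(bm.arr) {
		bm.bytes = bm.arr[:size]
	} else {
		bm.bytes = make([]byte, size)
	}
	return bm
}
```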


// Copy creates a clone of the Bitmap, creating a new byte array with the same
// contents as the original.
func (bm *Bitmap) Copy() *Bitmap {
Contributor

I'd consider having this return a Bitmap (no pointer), because the whole Bitmap type is only a word or two in size itself, so passing it around by value is pretty harmless, and might result in fewer heap-escaping allocs in some patterns of usage.

But, as with the comment about the struct design itself and various ways to make it have fewer pointers... I would not actually worry about this very much unless a benchmark or profiler report pointed at it. (I'm only making these comments for completeness because you asked for them, @rvagg ;))
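A hedged sketch of that suggestion (hypothetical method name, not the PR's code), assuming the Bitmap struct shown in the diff above:

```go
// CopyValue is a by-value variant of Copy: the returned Bitmap is just a
// slice header, so the copy is cheap and may avoid a heap-escaping
// allocation in some call patterns.
func (bm *Bitmap) CopyValue() Bitmap {
	b := make([]byte, len(bm.Bytes))
	copy(b, bm.Bytes)
	return Bitmap{Bytes: b}
}
```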

@rvagg (Member, Author)

It seems that it's reasonable to be a pointer here because Copy() is a parallel to New*() which return pointers. So the downstream use-case is most likely going to be as a pointer (specifically in the HAMT it is) because it's really just another form of New*(). Right? Not that this is how I conceived of it when I wrote this, it just seemed more straightforward to consume a pointer from here than doing the silly foo := bar.Copy(); thing.foo = &foo dance forced by Go.

@warpfork (Contributor) left a comment

This lgtm.

I skimmed most of the actual bitmath... but still feel confident in doing so, because the tests look pretty comprehensive, and are easy to read and seem to indicate logical things.

@rvagg (Member, Author) commented Aug 14, 2020

Excellent, thanks for the feedback @warpfork, very helpful! I never considered type Bitmap []byte but that just shows my lack of Go-thinking. I mostly fall back to classic-OO style thinking on some of these things (e.g. it was hard for me to just expose Bytes and not put a Bytes() to return a copy, but that's just too Java I suppose and goes beyond the kind of data-hiding that's implemented throughout this library).

@rvagg force-pushed the rvagg/bitfield-fmt branch 7 times, most recently from a9a0caf to 7eb4d1d on August 14, 2020 at 03:49
@rvagg (Member, Author) commented Aug 14, 2020

Made CI happy and did a tiny bit of refactoring in light of some comments from @warpfork. The new HEAD to review is squashed as 7eb4d1d.

@anorth (Member) commented Nov 29, 2020

@rvagg could you rebase this now that some of the upstream PRs are merged?

@rvagg force-pushed the rvagg/bitfield-fmt branch from 7eb4d1d to f5ad4b9 on December 2, 2020 at 02:11
@rvagg (Member, Author) commented Dec 2, 2020

A bit of a brutal rebase, but I removed the inclusion of #60 while I was at it, so this stands alone as a single commit.

Still pending the ordering flip mentioned in the OP, which would mean the removal of rev() in bitmap_test.go and a simplification (or removal) of bindex() in bitmap.go. Feedback on that would be good, but as long as we're breaking the byte layout here it's not much of a stretch to pull it into line with the other HAMT implementations.

/cc @austinabell probably for Rust feedback?

@rvagg (Member, Author) commented Dec 2, 2020

Also, the comment from @ZenGround0 at #27 (comment) is relevant regarding whether this change should be adopted at all. The use of big.Int is ubiquitous throughout the chain already; the use here isn't an outlier.

Personally this still bugs me: it's not even being used as an integer, it's just being used for the bytes, it's not really saving many bytes overall anyway, and it makes cross-language implementations of the HAMT annoying. But that cost already mostly exists regardless of this change.

@rvagg (Member, Author) commented Dec 2, 2020

^ To nuance that last comment: I believe this is the only case in the chain where big.Int is used in this way, as a bitfield; the rest are used as big integers (TokenAmount, StoragePower, DealWeight, Spacetime, and whatever FilterEstimate is doing). It seems to me that it was used here as a convenience, since its Go API has the ability to get and set individual bits and there is no native BitSet type in the standard library. So this PR introduces such a type for that purpose, and a whole lot of tests that demonstrate how it should work for any alt implementation.
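For illustration of the convenience being described, a minimal, self-contained example using the standard math/big API (not code from this repository):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	bf := new(big.Int)
	bf.SetBit(bf, 10, 1)           // set bit 10
	fmt.Println(bf.Bit(10) == 1)   // true: query bit 10
	fmt.Printf("%x\n", bf.Bytes()) // big-endian bytes, leading zeros truncated
}
```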

@austinabell
> Also, the comment from @ZenGround0 at #27 (comment) is relevant regarding whether this change should be adopted at all. The use of big.Int is ubiquitous throughout the chain already; the use here isn't an outlier.
>
> Personally this still bugs me: it's not even being used as an integer, it's just being used for the bytes, it's not really saving many bytes overall anyway, and it makes cross-language implementations of the HAMT annoying. But that cost already mostly exists regardless of this change.

Honestly, I don't see the point. I don't see what this is more idiomatic toward. Switching from a bigint doesn't affect us (we don't use a BigInt on our end for the bitfield) and the order of the bytes isn't a big deal, so this should be fine on our end. I would just say make sure you have a good rationale for switching, because it sounds somewhat like we are just switching other implementations to match the JavaScript one; there doesn't seem to be a clear functional or readability benefit.

@ZenGround0 (Contributor)

@austinabell thanks for voicing this. It caused me to think through the motivation further. Since you and others have implemented the semantics in other languages already, the main benefit I was seeing was restricting the bitfield serialization so that many different byte sequences can't encode the same HAMT node. I've since come to recognize that this is a much more widespread issue, so @rvagg and I are in agreement that we will not merge the serialization change here, though we will try to salvage some tests and maybe some refactoring.

@austinabell
> the main benefit I was seeing was restricting the bitfield serialization so that many different byte sequences can't encode the same HAMT node.

I'm sorry, can you explain what you mean here? Am I missing a detail of this change? Also what is the benefit to not truncating? Skimming the OP, that isn't clear to me yet.

@ZenGround0 (Contributor)

> what is the benefit to not truncating?

The motivation isn't clear from the PR but is in the original issue. Summarizing the relevant parts:

  1. This representation is simpler and easier to implement in languages that don't have a built-in big.Int that does the truncation for you.
  2. Without truncation, bitfields have exactly one valid serialization. Currently this is not the case, because you can pad the front of the byte string with zero bytes and the bitfield will still decode: https://play.golang.org/p/r9RYt-AGfJf. This means HAMT nodes also have more than one valid serialization (see the sketch after this list).
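A minimal sketch of point 2, along the lines of the playground link above (my own example values, not the exact playground code):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Two different byte strings decode to the same bitfield value,
	// because SetBytes ignores leading zero bytes.
	a := new(big.Int).SetBytes([]byte{0x2a})
	b := new(big.Int).SetBytes([]byte{0x00, 0x00, 0x2a})
	fmt.Println(a.Cmp(b) == 0) // true: both decode to 42
}
```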

I'm deciding not to pursue this further because we still won't have one valid serialization of most other nodes in the filecoin state tree (or even HAMT nodes IIUC) because of other uses of big.Int and details of cbor-gen.

@austinabell
> > what is the benefit to not truncating?
>
> The motivation isn't clear from the PR but is in the original issue. Summarizing the relevant parts:
>
> 1. This representation is simpler and easier to implement in languages that don't have a built-in big.Int that does the truncation for you.
> 2. Without truncation, bitfields have exactly one valid serialization. Currently this is not the case, because you can pad the front of the byte string with zero bytes and the bitfield will still decode: https://play.golang.org/p/r9RYt-AGfJf. This means HAMT nodes also have more than one valid serialization.
>
> I'm deciding not to pursue this further because we still won't have one valid serialization of most other nodes in the filecoin state tree (or even HAMT nodes IIUC) because of other uses of big.Int and details of cbor-gen.

Ah I see what you're saying. Yeah just to be clear this isn't an issue for us to change, I just wasn't clear on the motivation. Thanks for the details.

Doesn't the fact that it doesn't truncate just increase the size of the nodes, though? And since the bytes are still being used as a slice, not an array, don't you still get the same performance? I just don't see a clear win here, esp since encoding should always be consistent, so the ambiguous decoding shouldn't matter.

@ZenGround0 (Contributor)

> Doesn't the fact that it doesn't truncate just increase the size of the nodes, though?

Yes, but the increases will only be very small on average, so performance isn't a good reason not to make the change.

> I just don't see a clear win here, esp since encoding should always be consistent

Yup we are in agreement. This is why I removed this from the HAMT FIP.

@rvagg (Member, Author) commented Dec 7, 2020

For the record, this change was proposed prior to mainnet, when it was theoretically easier to justify something like this; at this point the rationale for a change whose disruption-to-benefit ratio isn't particularly strong is weaker, so I'm fine with this being dropped.
