-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce Bitmap type, use fixed size byte array in serialization, tighter bitWidth validation #63
Conversation
CI is failing because I'm using binary literals for tests and they are Go 1.13+ and CI is 1.11. I'll swap them out for something else tomorrow, too late tonight. |
We should bump the go version in CI on master and just get that in. I found that ReportMetric is absent for getting a green light in #62 too. |
// individual bits and perform limited popcount for a given index to calculate | ||
// the position in the associated compacted array. | ||
type Bitmap struct { | ||
Bytes []byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll overthink this real quick just for completeness:
There are three (or maybe four) ways I could imagine implementing this:
- like this: a struct wrapping a byte slice
- just a typedef of a byte slice (
type Bitmap []byte
) - a struct with a fixed size array (and just panic on bitwidths that are too bit to fit in it)
- a struct with a slice and also a fixed size array, and the slice initializes to the address of the fixed size array if it's big enough (like what
bytes.Buffer
does).
The second avoids an alloc if we assume that Bitmap is usually also used as a pointer that incurred its own heap alloc.
The third and forth avoid an alloc by virtue of using fixed size arrays.
I don't know if any of this matters however; I think in context of all the other sources of cpu cost in evaluating a hamt, this probably just isn't very important. At least, I'd assume it isn't until a benchmark or profiler report told me it is. And the current way of writing it is probably the simplest and clearest.
|
||
// Copy creates a clone of the Bitmap, creating a new byte array with the same | ||
// contents as the original. | ||
func (bm *Bitmap) Copy() *Bitmap { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd consider having this return a Bitmap
(no pointer), because the whole Bitmap type is only a word or two in size itself, so passing it around by value is pretty harmless, and might result in fewer heap-escaping allocs in some patterns of usage.
But, as with the comment about the struct design itself and various ways to make it have fewer pointers... I would not actually worry about this very much unless a benchmark or profiler report pointed at it. (I'm only making these comments for completeness because you asked for them, @rvagg ;))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that it's reasonable to be a pointer here because Copy()
is a parallel to New*()
which return pointers. So the downstream use-case is most likely going to be as a pointer (specifically in the HAMT it is) because its really just another form of New*()
. Right? Not that this is how I conceived of it when I wrote this, it just seemed more straightforward to consume a pointer from here than doing the silly foo := bar.Copy(); thing.foo = &foo
dance forced by Go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lgtm.
I skimmed most of the actual bitmath... but still feel confident in doing so, because the tests look pretty comprehensive, and are easy to read and seem to indicate logical things.
Excellent, thanks for the feedback @warpfork, very helpful! I never considered |
a9a0caf
to
7eb4d1d
Compare
@rvagg could you rebase this now that some of the upstream PRs are merged? |
7eb4d1d
to
f5ad4b9
Compare
A bit of a brutal rebase, but I removed the inclusion of #60 while I was at it so this stands alone as a single commit Still pending the ordering flip mentioned in OP, that would mean the removal of /cc @austinabell probably for Rust feedback? |
Also comment from @ZenGround0 @ #27 (comment) is relevant regarding whether this change should be adopted at all. The use of Personally this still bugs me, it's not even being used as an integer, it's just being used for the bytes, it's not really saving many bytes overall anyway and it makes cross-lang implementations of the HAMT annoying. But that cost already mostly exists regardless of this change. |
^ to nuance that last comment, I believe this is the only case in the chain where |
Honestly, I don't see the point. I don't see what this is more idiomatic toward. Switching from a bigint doesn't affect us (we don't use a BigInt on our end for the bitfield) and the order of the bytes isn't a big deal, so should be fine on our end. I would just say make sure you have a good rationale for switching, because it sounds somewhat like we are just switching other implementations to match the JavaScript one, there doesn't seem to be a clear functional or readability benefit |
@austinabell thanks for voicing this. It caused me to think through the motivation further. Since you and others have implemented the semantics in other languages already the main benefit I was seeing was restricting the bitfield serialization so that many different byte sequences can't encode the same HAMT node. I've since come to recognize that this is a much more widespread so @rvagg and I are in agreement that we will not merge the serialization change here though we will try to salvage some tests and maybe refactoring. |
I'm sorry, can you explain what you mean here? Am I missing a detail of this change? Also what is the benefit to not truncating? Skimming the OP, that isn't clear to me yet. |
The motivations isn't clear from PR but is in the the original issue. Summarizing the relevant parts:
I'm deciding not to pursue this further because we still won't have one valid serialization of most other nodes in the filecoin state tree (or even HAMT nodes IIUC) because of other uses of big.Int and details of cbor-gen. |
Ah I see what you're saying. Yeah just to be clear this isn't an issue for us to change, I just wasn't clear on the motivation. Thanks for the details. Isn't the fact that it doesn't truncate just increase the size of the nodes though? And since the bytes are still being used as slice not an array you still get the same performance? I just don't see a clear win here, esp since encoding should always be consistent so the ambiguous decoding shouldn't matter. |
Yes but the increases will only be very small on average so the performance isn't a good reason not to make the change
Yup we are in agreement. This is why I removed this from the HAMT FIP. |
for the record, this change was proposed prior to mainnet when it was theoretically easier to justify something like this; the rationale for something where the disruption to benefit ratio isn't particularly strong is weaker at this point, however, so I'm fine with this being dropped. |
Builds on previous PRs that are not yet merged, #57, #59 and #60. Addresses #54 and the discussed solution in there.
Only one commit, see the last in the branch, currently
091854c7eb4d1d.This change takes control of the bitmap with a new
Bitmap
type that does the internal work. This lets us abstract away most of the details of the bitmap arrangement and avoid the specific layout ofbig.Int
. It now works bybitWidth
and validates that thebitWidth
is the correct for the bytes in the serialization format, exactly.The algorithm is the same and it'll output the same bytes as the old format, except not truncated, always exactly the right number of bytes for the bits we need to work with. I retained an old test in uhamt_test.go that validates against the original ordering and it's all good.
Further proposed change
BUT, I'd like to switch the ordering of the bytes around before this lands to match what would be more natural in most languages when you're dealing with bits within byte arrays. So rather than the BE format the
big.Int
gives us, go with something that looks like a LE format (we're not dealing with number representations so it's not strictly BE or LE, just ordered differently).Currently, if we set bits 1, 3, 5, 10, 20, 31 of a 4-byte array (bitWidth 5, i.e. 32-bits) we would be setting them like so:
I'd like to reverse that so the ordering is more natural:
i.e. bits 1, 3 and 5 are all within the first byte, not the last.
This is how the JavaScript HAMT implementation works, and it's how @ianopolous' one in Peergos works (thanks to java.util.BitSet, mentioned here). It's a more natural way of addressing bits.
The implementation in here has a
Bitmap#bindex()
method that does the flip of the index, and the tests rely on arev()
function to reverse inputs (test inputs I pulled from my JavaScript tests). So it'd be a simple to reverse it in these two places.It's a breaking change, but so is the serialization format introduced here with the bitmap being fully present and not truncated.
/cc @anorth