
Introduce Bitmap type, use fixed size byte array in serialization, tighter bitWidth validation #63

Closed

Conversation

@rvagg (Member) commented Aug 13, 2020

Builds on previous PRs that are not yet merged (#57, #59 and #60). Addresses #54 and the solution discussed there.

Only one commit; see the last in the branch, currently 7eb4d1d (previously 091854c).

This change takes control of the bitmap with a new Bitmap type that does the internal work. This lets us abstract away most of the details of the bitmap arrangement and avoid the specific layout of big.Int. It now works by bitWidth and validates that the byte length in the serialization format is exactly correct for that bitWidth.

The algorithm is the same and it'll output the same bytes as the old format, except not truncated, always exactly the right number of bytes for the bits we need to work with. I retained an old test in uhamt_test.go that validates against the original ordering and it's all good.
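For illustration, a minimal sketch of the exact-size rule described above; the function names here are made up for the example and are not the PR's actual API:

```go
package main

import "fmt"

// expectedBitmapSize returns the number of bytes needed to hold 2^bitWidth
// bits, which is the only length the serialization should accept.
func expectedBitmapSize(bitWidth int) int {
	return ((1 << uint(bitWidth)) + 7) / 8
}

// validateBitmapSize rejects truncated or padded bitmaps.
func validateBitmapSize(b []byte, bitWidth int) error {
	if want := expectedBitmapSize(bitWidth); len(b) != want {
		return fmt.Errorf("bitmap is %d bytes, expected exactly %d for bitWidth %d",
			len(b), want, bitWidth)
	}
	return nil
}

func main() {
	fmt.Println(validateBitmapSize(make([]byte, 4), 5)) // <nil>: 2^5 bits = 4 bytes
	fmt.Println(validateBitmapSize(make([]byte, 3), 5)) // error: truncated bitmap
}
```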

Further proposed change

BUT, I'd like to switch the ordering of the bytes around before this lands, to match what would be more natural in most languages when you're dealing with bits within byte arrays. So rather than the BE-style format that big.Int gives us, go with something that looks like an LE-style format (we're not dealing with number representations so it's not strictly BE or LE, just ordered differently).

Currently, if we set bits 1, 3, 5, 10, 20, 31 of a 4-byte array (bitWidth 5, i.e. 32-bits) we would be setting them like so:

10000000 00010000  00000100 00101010

I'd like to reverse that so the ordering is more natural:

00101010 00000100 00010000 10000000 

i.e. bits 1, 3 and 5 are all within the first byte, not the last.

This is how the JavaScript HAMT implementation works, and it's how @ianopolous' one in Peergos works (thanks to java.util.BitSet, mentioned here). It's a more natural way of addressing bits.

The implementation in here has a Bitmap#bindex() method that does the flip of the index, and the tests rely on a rev() function to reverse inputs (test inputs I pulled from my JavaScript tests). So it'd be simple to reverse it in these two places; see the sketch below.
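To make the two orderings concrete, here's a hedged sketch (hypothetical helper names, not the PR's bindex() code) showing where bit i lands under each layout, using the example values above:

```go
package main

import "fmt"

// setBitCurrent follows the big.Int-style layout used today: bit 0 is the
// least significant bit of the *last* byte, so the byte index is flipped.
func setBitCurrent(b []byte, i int) {
	b[len(b)-1-i/8] |= 1 << uint(i%8)
}

// setBitProposed follows the proposed BitSet-style layout: bit 0 is the
// least significant bit of the *first* byte, so bits 1, 3 and 5 all land
// in the first byte.
func setBitProposed(b []byte, i int) {
	b[i/8] |= 1 << uint(i%8)
}

func main() {
	cur, prop := make([]byte, 4), make([]byte, 4)
	for _, i := range []int{1, 3, 5, 10, 20, 31} {
		setBitCurrent(cur, i)
		setBitProposed(prop, i)
	}
	fmt.Printf("%08b\n", cur)  // [10000000 00010000 00000100 00101010]
	fmt.Printf("%08b\n", prop) // [00101010 00000100 00010000 10000000]
}
```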

It's a breaking change, but so is the serialization format introduced here with the bitmap being fully present and not truncated.

/cc @anorth

@rvagg (Member, Author) commented Aug 13, 2020

CI is failing because I'm using binary literals for the tests, which require Go 1.13+, and CI is on 1.11. I'll swap them out for something else tomorrow; too late tonight.
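For context, the swap being described is just a spelling change (example value only, not one of the actual test constants):

```go
package hamt

// Two spellings of the same constant: the first needs Go 1.13+,
// the second compiles on Go 1.11.
const maskNew = 0b00101010
const maskOld = 0x2a
```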

@warpfork (Contributor)

We should bump the Go version in CI on master and just get that in. I found that ReportMetric was also missing when trying to get a green light in #62.

// individual bits and perform limited popcount for a given index to calculate
// the position in the associated compacted array.
type Bitmap struct {
	Bytes []byte
Contributor

I'll overthink this real quick just for completeness:

There are three (or maybe four) ways I could imagine implementing this:

  • like this: a struct wrapping a byte slice
  • just a typedef of a byte slice (type Bitmap []byte)
  • a struct with a fixed size array (and just panic on bitWidths that are too big to fit in it)
  • a struct with a slice and also a fixed size array, and the slice initializes to the address of the fixed size array if it's big enough (like what bytes.Buffer does).

The second avoids an alloc if we assume that Bitmap is usually also used as a pointer that incurred its own heap alloc.

The third and fourth avoid an alloc by virtue of using fixed size arrays.

I don't know if any of this matters however; I think in context of all the other sources of cpu cost in evaluating a hamt, this probably just isn't very important. At least, I'd assume it isn't until a benchmark or profiler report told me it is. And the current way of writing it is probably the simplest and clearest.
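For illustration only, a hedged sketch of the fourth option (hypothetical names, not code from this PR):

```go
package hamt

// bitmapArr keeps a small fixed-size array inline and points the slice at
// it when the bitmap fits, avoiding a separate heap allocation (similar in
// spirit to what bytes.Buffer does with an internal bootstrap array).
type bitmapArr struct {
	bytes []byte   // active view; aliases arr when the bitmap fits
	arr   [32]byte // enough for bitWidth <= 8 (2^8 bits = 32 bytes)
}

func newBitmapArr(bitWidth int) *bitmapArr {
	size := ((1 << uint(bitWidth)) + 7) / 8 // bytes needed for 2^bitWidth bits
	bm := &bitmapArr{}
	if size <= len(bm.arr) {
		bm.bytes = bm.arr[:size]
	} else {
		bm.bytes = make([]byte, size)
	}
	return bm
}
```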


// Copy creates a clone of the Bitmap, creating a new byte array with the same
// contents as the original.
func (bm *Bitmap) Copy() *Bitmap {
Contributor

I'd consider having this return a Bitmap (no pointer), because the whole Bitmap type is only a word or two in size itself, so passing it around by value is pretty harmless, and might result in fewer heap-escaping allocs in some patterns of usage.

But, as with the comment about the struct design itself and various ways to make it have fewer pointers... I would not actually worry about this very much unless a benchmark or profiler report pointed at it. (I'm only making these comments for completeness because you asked for them, @rvagg ;))
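A hedged sketch of that suggestion (hypothetical method name, not the PR's code), assuming the Bitmap struct shown in the diff above:

```go
// CopyValue is a by-value variant of Copy: the returned Bitmap is just a
// slice header, so the copy is cheap and may avoid a heap-escaping
// allocation in some call patterns.
func (bm *Bitmap) CopyValue() Bitmap {
	b := make([]byte, len(bm.Bytes))
	copy(b, bm.Bytes)
	return Bitmap{Bytes: b}
}
```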

@rvagg (Member, Author)

It seems that it's reasonable to be a pointer here because Copy() is a parallel to New*() which return pointers. So the downstream use-case is most likely going to be as a pointer (specifically in the HAMT it is) because it's really just another form of New*(). Right? Not that this is how I conceived of it when I wrote this, it just seemed more straightforward to consume a pointer from here than doing the silly foo := bar.Copy(); thing.foo = &foo dance forced by Go.

@warpfork (Contributor) left a comment

This lgtm.

I skimmed most of the actual bitmath... but still feel confident in doing so, because the tests look pretty comprehensive, and are easy to read and seem to indicate logical things.

@rvagg (Member, Author) commented Aug 14, 2020

Excellent, thanks for the feedback @warpfork, very helpful! I never considered type Bitmap []byte but that just shows my lack of Go-thinking. I mostly fall back to classic-OO style thinking on some of these things (e.g. it was hard for me to just expose Bytes and not put a Bytes() to return a copy, but that's just too Java I suppose and goes beyond the kind of data-hiding that's implemented throughout this library).

@rvagg force-pushed the rvagg/bitfield-fmt branch 7 times, most recently from a9a0caf to 7eb4d1d on August 14, 2020 at 03:49
@rvagg (Member, Author) commented Aug 14, 2020

Made CI happy and did a tiny bit of refactoring in light of some comments from @warpfork. The new HEAD to review is squashed as 7eb4d1d.

@anorth (Member) commented Nov 29, 2020

@rvagg could you rebase this now that some of the upstream PRs are merged?

@rvagg force-pushed the rvagg/bitfield-fmt branch from 7eb4d1d to f5ad4b9 on December 2, 2020 at 02:11
@rvagg (Member, Author) commented Dec 2, 2020

A bit of a brutal rebase, but I removed the inclusion of #60 while I was at it, so this stands alone as a single commit.

Still pending the ordering flip mentioned in the OP, which would mean the removal of rev() in bitmap_test.go and a simplification (or removal) of bindex() in bitmap.go. Feedback on that would be good, but as long as we're breaking the byte layout here it's not much of a stretch to pull it into line with the other HAMT implementations.

/cc @austinabell probably for Rust feedback?

@rvagg (Member, Author) commented Dec 2, 2020

Also, the comment from @ZenGround0 at #27 (comment) is relevant regarding whether this change should be adopted at all. The use of big.Int is ubiquitous throughout the chain already; the use here isn't an outlier.

Personally this still bugs me: it's not even being used as an integer, it's just being used for the bytes, it's not really saving many bytes overall anyway, and it makes cross-language implementations of the HAMT annoying. But that cost already mostly exists regardless of this change.

@rvagg (Member, Author) commented Dec 2, 2020

^ To nuance that last comment: I believe this is the only case in the chain where big.Int is used in this way, as a bitfield; the rest are used as big integers (TokenAmount, StoragePower, DealWeight, Spacetime, and whatever FilterEstimate is doing). It seems to me that it was used here as a convenience, since its Go API has the ability to get and set individual bits and there is no native BitSet type in the standard library. So this PR introduces such a type for that purpose, and a whole lot of tests that demonstrate how it should work for any alt implementation.
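For illustration of the convenience being described, a minimal, self-contained example using the standard math/big API (not code from this repository):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	bf := new(big.Int)
	bf.SetBit(bf, 10, 1)           // set bit 10
	fmt.Println(bf.Bit(10) == 1)   // true: query bit 10
	fmt.Printf("%x\n", bf.Bytes()) // big-endian bytes, leading zeros truncated
}
```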

@austinabell
> Also, the comment from @ZenGround0 at #27 (comment) is relevant regarding whether this change should be adopted at all. The use of big.Int is ubiquitous throughout the chain already; the use here isn't an outlier.
>
> Personally this still bugs me: it's not even being used as an integer, it's just being used for the bytes, it's not really saving many bytes overall anyway, and it makes cross-language implementations of the HAMT annoying. But that cost already mostly exists regardless of this change.

Honestly, I don't see the point. I don't see what this is more idiomatic toward. Switching from a bigint doesn't affect us (we don't use a BigInt on our end for the bitfield) and the order of the bytes isn't a big deal, so this should be fine on our end. I would just say make sure you have a good rationale for switching, because it sounds somewhat like we are just switching other implementations to match the JavaScript one; there doesn't seem to be a clear functional or readability benefit.

@ZenGround0 (Contributor)

@austinabell thanks for voicing this. It caused me to think through the motivation further. Since you and others have implemented the semantics in other languages already, the main benefit I was seeing was restricting the bitfield serialization so that many different byte sequences can't encode the same HAMT node. I've since come to recognize that this is a much more widespread issue, so @rvagg and I are in agreement that we will not merge the serialization change here, though we will try to salvage some tests and maybe some refactoring.

@austinabell
> the main benefit I was seeing was restricting the bitfield serialization so that many different byte sequences can't encode the same HAMT node.

I'm sorry, can you explain what you mean here? Am I missing a detail of this change? Also what is the benefit to not truncating? Skimming the OP, that isn't clear to me yet.

@ZenGround0 (Contributor)

> what is the benefit to not truncating?

The motivation isn't clear from the PR but is in the original issue. Summarizing the relevant parts:

  1. This representation is simpler and easier to implement in languages that don't have a built-in big.Int that does the truncation for you.
  2. Without truncation, bitfields have exactly one valid serialization. Currently this is not the case, because you can pad the front of the byte string with zero bytes and the bitfield will still decode: https://play.golang.org/p/r9RYt-AGfJf. This means HAMT nodes also have more than one valid serialization (see the sketch after this list).
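A minimal sketch of point 2, along the lines of the playground link above (my own example values, not the exact playground code):

```go
package main

import (
	"fmt"
	"math/big"
)

func main() {
	// Two different byte strings decode to the same bitfield value,
	// because SetBytes ignores leading zero bytes.
	a := new(big.Int).SetBytes([]byte{0x2a})
	b := new(big.Int).SetBytes([]byte{0x00, 0x00, 0x2a})
	fmt.Println(a.Cmp(b) == 0) // true: both decode to 42
}
```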

I'm deciding not to pursue this further because we still won't have one valid serialization of most other nodes in the filecoin state tree (or even HAMT nodes IIUC) because of other uses of big.Int and details of cbor-gen.

@austinabell
> > what is the benefit to not truncating?
>
> The motivation isn't clear from the PR but is in the original issue. Summarizing the relevant parts:
>
> 1. This representation is simpler and easier to implement in languages that don't have a built-in big.Int that does the truncation for you.
> 2. Without truncation, bitfields have exactly one valid serialization. Currently this is not the case, because you can pad the front of the byte string with zero bytes and the bitfield will still decode: https://play.golang.org/p/r9RYt-AGfJf. This means HAMT nodes also have more than one valid serialization.
>
> I'm deciding not to pursue this further because we still won't have one valid serialization of most other nodes in the filecoin state tree (or even HAMT nodes IIUC) because of other uses of big.Int and details of cbor-gen.

Ah I see what you're saying. Yeah just to be clear this isn't an issue for us to change, I just wasn't clear on the motivation. Thanks for the details.

Doesn't the fact that it doesn't truncate just increase the size of the nodes, though? And since the bytes are still being used as a slice, not an array, don't you still get the same performance? I just don't see a clear win here, esp since encoding should always be consistent, so the ambiguous decoding shouldn't matter.

@ZenGround0 (Contributor)

> Doesn't the fact that it doesn't truncate just increase the size of the nodes, though?

Yes, but the increases will only be very small on average, so performance isn't a good reason not to make the change.

> I just don't see a clear win here, esp since encoding should always be consistent

Yup we are in agreement. This is why I removed this from the HAMT FIP.

@rvagg (Member, Author) commented Dec 7, 2020

For the record, this change was proposed prior to mainnet, when it was theoretically easier to justify something like this; at this point the rationale for a change whose disruption-to-benefit ratio isn't particularly strong is weaker, so I'm fine with this being dropped.
