
7582: Add naive RLP caching for BlockHeader, Transaction, and Withdrawal #7988

Open
wants to merge 10 commits into base: main

Conversation

Matilda-Clerke
Contributor

PR description

Add naive RLP caching for BlockHeader, Transaction, and Withdrawal. This PR currently ignores Block, BlockBody, and TransactionReceipt due to complexity.

Issue

#7582

…ng-during-sync

# Conflicts:
#	ethereum/core/src/main/java/org/hyperledger/besu/ethereum/core/BlockHeader.java
Signed-off-by: Matilda Clerke <[email protected]>
@fab-10
Contributor

fab-10 commented Dec 5, 2024

I see the potential benefit of the caching, but since this is a tradeoff between compute and memory, have you done benchmarks to quantify the performance benefit of the change, and the possible impact on memory?

@Matilda-Clerke
Contributor Author

Matilda-Clerke commented Dec 8, 2024

I see the potential benefit of the caching, but since this is a tradeoff between compute and memory, have you done benchmarks to quantify the performance benefit of the change, and the possible impact on memory?

Fair point. I modified the BlockHeadersMessageTest locally to produce 10000 headers and timed the looped writeTo calls in BlockHeadersMessage. The original code performed the 10000 writeTo calls in 18 to 23ms, while the updated code performed the 10000 writeTo calls in 13 to 18ms. In terms of memory, it's actually just using indexes to the original underlying array, so it should only be a small amount of extra memory used.
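
The idea being timed can be sketched as a memoised encode (hypothetical, simplified names, and a byte-returning method rather than Besu's actual stream-based writeTo): the first call pays for encoding, every later call reuses the cached bytes.

```java
import java.util.Optional;

// Sketch only: hypothetical Header class, not Besu's BlockHeader.
public class CachedRlpExample {
    static int encodeCalls = 0; // counts how often the expensive encoding runs

    static final class Header {
        // rawRlp holds a previously computed encoding, if we have one
        private Optional<byte[]> rawRlp = Optional.empty();

        byte[] writeTo() {
            // reuse the cached encoding when present, otherwise encode and cache
            if (rawRlp.isEmpty()) {
                rawRlp = Optional.of(encode());
            }
            return rawRlp.get();
        }

        private byte[] encode() {
            encodeCalls++;
            return new byte[] {0x01, 0x02}; // stand-in for the real RLP encoding
        }
    }

    public static void main(String[] args) {
        Header h = new Header();
        h.writeTo();
        h.writeTo();
        h.writeTo();
        System.out.println(encodeCalls); // the expensive encode ran only once
    }
}
```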

@fab-10
Contributor

fab-10 commented Dec 9, 2024

So the micro-benchmark confirms there is a gain on the compute side, which is worth exploring further. For that, I suggest running a real sync on mainnet with this change and comparing the results against some control instances.

@ahamlat
Contributor

ahamlat commented Dec 9, 2024

it's actually just using indexes to the original underlying array, so it should only be a small amount of extra memory used.

Agreed, but those references will keep strong links to the underlying byte arrays, preventing them from being garbage collected. I would like a rationale for what this feature is going to improve. It is an interesting improvement, but is there a real use case for it? If so, we need to evaluate how much latency improvement we get for how much memory overhead.
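
A minimal sketch of that concern (SliceView is a hypothetical stand-in, not Besu code): a small cached view keeps the entire backing array strongly reachable even after every other reference to it is dropped.

```java
// Hypothetical view class: a strong reference to a shared backing array
// plus an offset and a length.
final class SliceView {
    final byte[] backing; // strong reference to the WHOLE array, not just the slice
    final int offset;
    final int length;

    SliceView(byte[] backing, int offset, int length) {
        this.backing = backing;
        this.offset = offset;
        this.length = length;
    }
}

public class GcPinningExample {
    public static void main(String[] args) {
        byte[] blockRlp = new byte[2 * 1024 * 1024]; // e.g. a ~2 MB block RLP
        // A 500-byte header view still pins the full 2 MB array in memory:
        SliceView headerRlp = new SliceView(blockRlp, 0, 500);
        blockRlp = null; // dropping the local does not free the array...
        System.out.println(headerRlp.backing.length); // ...the view still reaches all 2 MB
    }
}
```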

@fab-10
Contributor

fab-10 commented Dec 9, 2024

On the implementation side, have you thought about different approaches to the caching?
I am thinking in particular of a generic, flexible approach with an external cache that can keep the RLP-encoded form of many different types, which we could tag as RLP-cacheable. This probably requires a certain amount of refactoring, but it could be the base for a reusable framework in the long term.

@Matilda-Clerke
Contributor Author

From the micro benchmark, we'd expect a time saving of around 10 seconds for 10 million block headers. Realistically, not a noticeable gain on our current 20+ hour sync. However, it should give us a small saving of CPU utilisation.

Memory breakdown is as follows:
4 bytes to reference the Optional
4 bytes to reference ArrayWrappingBytes
4 bytes to reference the array (reference, not a copy)
4 bytes to store the offset into the array at which the ArrayWrappingBytes view starts
4 bytes to store the length of the region the ArrayWrappingBytes view covers
20 bytes per header total additional memory

Regarding garbage collection: If the headers are currently getting garbage collected (I'd expect they are, but not sure), these references won't prevent the underlying array from being garbage collected.
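
To make the "just indexes into the original array" point concrete, here is a minimal sketch (BytesView is a hypothetical stand-in for an ArrayWrappingBytes-style class): the cached rawRlp view and any other view share one backing array, so caching adds a few references and two ints, not a copy of the bytes.

```java
// Hypothetical ArrayWrappingBytes-style view: a reference to a shared
// backing array plus an offset and a length, with no copying.
final class BytesView {
    final byte[] backing;
    final int offset;
    final int length;

    BytesView(byte[] backing, int offset, int length) {
        this.backing = backing;
        this.offset = offset;
        this.length = length;
    }
}

public class SharedBackingExample {
    public static void main(String[] args) {
        byte[] headerRlp = new byte[600];                       // the header's encoded bytes
        BytesView rawRlp = new BytesView(headerRlp, 0, 600);    // cached whole encoding
        BytesView someField = new BytesView(headerRlp, 4, 32);  // a field's sub-range
        // Both views point at the very same array object; no bytes were duplicated:
        System.out.println(rawRlp.backing == someField.backing); // true
    }
}
```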

For this PR, I was looking just for some potential quick wins while continuing to focus mainly on my refactoring of peer tasks. @pinges is working on a more complicated scheme which avoids decoding the RLP for large portions of the block body, in addition to avoiding re-encoding as we're doing here.

@Matilda-Clerke
Contributor Author

Backed out the Withdrawal changes, as we discussed, since we don't believe those changes will ever be used.

@ahamlat
Contributor

ahamlat commented Dec 13, 2024

4 bytes to reference the Optional

Consider using JOL (Java Object Layout) or a heap dump to have more accurate numbers. With CompressedOOPs enabled, each Java object has a header of 12 Bytes if the JVM heap < 32 GiB or 16 Bytes if JVM heap >= 32 GiB. Depending on the total number of bytes for each object, there is internal and external padding to make the size a multiple of 8.
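
As a rough back-of-the-envelope version of that rule (assuming compressed oops, a 12-byte object header, 4-byte references, and 8-byte alignment; JOL would give exact, JVM-verified numbers):

```java
// Shallow-size arithmetic under assumed layout rules: 12-byte object header
// (compressed oops, heap < 32 GiB), then pad the total to a multiple of 8.
public class ShallowSizeExample {
    static long align8(long size) {
        return (size + 7) & ~7L; // round up to the next multiple of 8
    }

    public static void main(String[] args) {
        long header = 12;                            // object header with compressed oops
        long smallFields = 4 + 4 + 4;                // e.g. one reference + two ints
        System.out.println(align8(header + smallFields)); // 24
        // The 20 bytes of fields estimated earlier in the thread would shallow-size to:
        System.out.println(align8(header + 20));          // 32
    }
}
```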

20 bytes per header total additional memory

I think we should focus on the RLP raw data, the underlying byte array, not the headers. The current calculation focuses more on the shallow size because it is deterministic, and takes into account only the headers and the references. With the RLP stored in the transaction, we need to evaluate the impact on the transactionPool in terms of memory usage and garbage collection.

@Matilda-Clerke
Contributor Author

Matilda-Clerke commented Dec 19, 2024

Here's a BlockHeader from a heap dump
[screenshot: a BlockHeader instance from the heap dump]
I'm not sure exactly how to read this (e.g. the integers don't have a size listed, are they included in their parent instance's size?), but it seems like double digits of extra bytes per header or transaction. What do you think?

@Matilda-Clerke
Contributor Author

We can see from the references on the underlying byte array that it is being reused across many ArrayWrappingBytes instances, including the new rawRlp references
[screenshot: references on the underlying byte array]

@ahamlat
Contributor

ahamlat commented Dec 19, 2024

I'm not sure exactly how to read this (e.g. the integers don't have a size listed, are they included in their parent instance's size?), but it seems like double digits of extra bytes per header or transaction. What do you think?

In the screenshot below, we can see that the block header holds a reference to a ~2 MB byte array (rlp), which I guess is the RLP of the block, though that would be a big block; we can see the offset and the length inside that ArrayWrappingBytes.

[screenshot: block header referencing a ~2 MB rlp byte array]

We can see from the references on the underlying byte array that it is being reused across many ArrayWrappingBytes instances, including the new rawRlp references

I guess all the transactions of that specific block, and the header, are referencing the same underlying byte array. Now we need to evaluate the real memory overhead. To do that, we need the number of live blocks in memory during sync and the average size of the RLP. If you can share a heap dump taken during sync with your PR, I will extract the numbers.

Contributor

@ahamlat ahamlat left a comment


I have a suggestion: initialise the raw RLP while decoding the transaction from the RLP. Overall, I think there is not much overhead when I look at the heap dump, as the underlying byte array is already referenced by the transaction's payload and to address fields.

return Transaction.builder()
.copiedFrom(transaction)
.rawRlp(Optional.of(transactionRlp.raw()))
.build();
Contributor


Rather than calling a builder again just to add the RLP, I would do it directly in each transaction decoder, just after the payload, as they're very similar fields:

Contributor


Using the data I collected from the shared heap dump, the payload field is a subset of the rawRlp field.
[heap dump screenshots: the RawRlp field and the Payload field]

Contributor


At the transaction level, as the payload already references the same underlying byte array, the only overhead is the rawRlp reference.

Contributor Author


My only concern with applying this in the transaction encoder/decoder classes is that they seem to populate only a subset of the Transaction fields, so if we instead supply the full original RLP in the encoder, the output may be significantly larger than it is currently.

What do you think?

Contributor


There will be no difference, as rawRlp and payload are just references to the same byte array with an offset and a length; it doesn't change the size of the transaction object. With copiedFrom, we lose the first reference to the transaction as we're creating another one.
The question I didn't investigate is whether we have the (whole) rawRlp of the transaction at this level.

@Matilda-Clerke
Contributor Author

It seems these latest changes aren't quite working right. In particular, a block encoded and written to blockchainStorage is causing RLP errors when read back and decoded. I'm parking this briefly for now to progress some other issues.
