Added range proof pseudocode to algorithm document
mappum committed Aug 27, 2020
1 parent 88fd7f6 commit e53af5c
Showing 1 changed file with 70 additions and 34 deletions.
104 changes: 70 additions & 34 deletions docs/algorithms.md
Expand Up @@ -2,13 +2,13 @@

**Matt Bell ([@mappum](https://twitter.com/mappum))**[Nomic Hodlings, Inc.](https://nomic.io)

v0.0.3 - *August 28, 2019*
v0.0.4 - _August 5, 2020_

## Introduction

Merk is a Merkle AVL tree designed for performance, running on top of a backing key/value store such as RocksDB. Notable features include concurrent operations for higher throughput, an optimized key/value layout for performant usage of the backing store, and efficient proof generation to enable bulk tree replication.

_Note that this document is meant to be a way to grok how Merk works, rather than an authoritative specification._

## Algorithm Overview

Expand All @@ -21,10 +21,12 @@ The Merk tree was inspired by [`tendermint/iavl`](https://github.com/tendermint/
In many Merkle tree designs, only leaf nodes contain key/value pairs (inner nodes only contain child hashes). In contrast, every node in a Merk tree contains a key and a value, including inner nodes.

Each node contains a "kv hash", which is the hash of its key/value pair, in addition to its child hashes. The hash of the node is just the hash of the concatenation of these three hashes:

```
kv_hash = H(key, value)
node_hash = H(kv_hash, left_child_hash, right_child_hash)
```

Note that the `left_child_hash` and/or `right_child_hash` values may be null since it is possible for the node to have no children or only one child.

In our implementation, the hash function used is Blake2b (chosen for its performance characteristics) but this choice is trivially swappable.
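
As a rough illustration of the two formulas above, here is a minimal Python sketch. The 20-byte digest size follows the hash length used in the binary proof format; the length-prefixed framing of the inputs and the all-zero placeholder for a missing child are assumptions made for this example, not necessarily Merk's actual encoding.

```python
import hashlib
from typing import Optional

HASH_LEN = 20                # assumption: same length as the hashes in the proof format
NULL_HASH = bytes(HASH_LEN)  # assumption: a missing child is hashed as all zero bytes


def H(*parts: bytes) -> bytes:
    """Blake2b over the parts, each prefixed by its length (framing is an assumption)."""
    h = hashlib.blake2b(digest_size=HASH_LEN)
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


def kv_hash(key: bytes, value: bytes) -> bytes:
    # kv_hash = H(key, value)
    return H(key, value)


def node_hash(kv: bytes, left: Optional[bytes], right: Optional[bytes]) -> bytes:
    # node_hash = H(kv_hash, left_child_hash, right_child_hash)
    return H(kv, left or NULL_HASH, right or NULL_HASH)
```

Because a node's hash commits to both child hashes, changing any key or value changes every hash on the path up to the root.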
Expand All @@ -47,9 +49,10 @@ To mutate the tree, we apply batches of operations, each of which can either be

Batches of operations are expected to be sorted by key, with every key appearing only once. Our implementation provides an `apply` method which sorts the batch and checks for duplicate keys, and an `apply_unchecked` method which skips the sorting/checking step for performance reasons when the caller has already ensured the batch is sorted.

The algorithm to apply these operations to the tree is called recursively on each relevant node.

_Simplified pseudocode for the operation algorithm:_

- Given a node and a batch of operations:
- Binary search for the current node's key in the batch:
- If this node's key is found in the batch at index `i`:
Expand All @@ -70,7 +73,7 @@ The algorithm to apply these operations to the tree is called recursively on eac
- If after recursing the left and right subtrees are unbalanced (their heights differ by more than 1), perform an AVL tree rotation (possibly more than one)
- Recompute node's hash based on hash of its updated children and `kv_hash`, then return

This batch application of operations can happen concurrently - recursing into the left and right subtrees of a node are two fully independent operations (operations on one subtree will never involve reading or writing to/from any of the nodes on the other subtree). This means we have an _implicit lock_ - we don't need to coordinate with mutexes but only need to wait for both the left side and right side to finish their operations.

### Proofs

Expand All @@ -81,18 +84,20 @@ Merk was designed with efficient proofs in mind, both for application queries (e
Merk proofs are a list of stack-based operators and node data, with 3 possible operators: `Push(node)`, `Parent`, and `Child`. A stream of these operators can be processed by a verifier in order to reconstruct a sparse representation of part of the tree, in a way where the data can be verified against a known root hash.

The value of `node` in a `Push` operation can be one of three types:

- `Hash(hash)` - The hash of a node
- `KVHash(hash)` - The key/value hash of a node
- `KV(key, value)` - The key and value of a node

This proof format can be encoded in a binary format and has negligible space overhead for efficient transport over the network.

#### Verification

A verifier can process a proof by maintaining a stack of connected tree nodes, and executing the operators in order:

- `Push(node)` - Push some node data onto the stack.
- `Child` - Pop a value from the stack, `child`. Pop another value from the stack, `parent`. Set `child` as the right child of `parent`, and push the combined result back on the stack.
- `Parent` - Pop a value from the stack, `parent`. Pop another value from the stack, `child`. Set `child` as the left child of `parent`, and push the combined result back on the stack.

Proof verification will fail if e.g. `Child` or `Parent` try to pop a value from the stack but the stack is empty, `Child` or `Parent` try to overwrite an existing child, or the proof does not result in exactly one stack item.
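
The stack machine described above can be sketched in Python as follows. This is an illustrative sketch rather than Merk's implementation: the tuple encoding of operators, the `ProofNode` class, the length-prefixed Blake2b framing, the all-zero hash for a missing child, and the rule that a `Hash` node cannot receive children are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple

HASH_LEN = 20
NULL_HASH = bytes(HASH_LEN)  # assumption: placeholder for a missing child


def H(*parts: bytes) -> bytes:
    """Blake2b over length-prefixed parts (framing is an assumption)."""
    h = hashlib.blake2b(digest_size=HASH_LEN)
    for part in parts:
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()


@dataclass
class ProofNode:
    kv_hash: Optional[bytes] = None   # set for KV and KVHash nodes
    hash: Optional[bytes] = None      # set for Hash nodes (opaque subtrees)
    left: Optional["ProofNode"] = None
    right: Optional["ProofNode"] = None


def node_hash(node: ProofNode) -> bytes:
    if node.hash is not None:
        return node.hash
    left = node_hash(node.left) if node.left is not None else NULL_HASH
    right = node_hash(node.right) if node.right is not None else NULL_HASH
    return H(node.kv_hash, left, right)


def execute(ops: List[Tuple]) -> bytes:
    """Run the proof operators and return the reconstructed root hash."""
    stack: List[ProofNode] = []
    for op in ops:
        if op[0] == "Push":
            kind, *args = op[1]
            if kind == "Hash":
                stack.append(ProofNode(hash=args[0]))
            elif kind == "KVHash":
                stack.append(ProofNode(kv_hash=args[0]))
            elif kind == "KV":
                stack.append(ProofNode(kv_hash=H(args[0], args[1])))
            else:
                raise ValueError(f"unknown node type {kind!r}")
        elif op[0] in ("Parent", "Child"):
            if len(stack) < 2:
                raise ValueError("stack underflow")
            if op[0] == "Parent":
                parent, child = stack.pop(), stack.pop()  # parent is on top
                side = "left"
            else:
                child, parent = stack.pop(), stack.pop()  # child is on top
                side = "right"
            if parent.hash is not None or getattr(parent, side) is not None:
                raise ValueError(f"cannot attach {side} child")
            setattr(parent, side, child)
            stack.append(parent)
        else:
            raise ValueError(f"unknown operator {op[0]!r}")
    if len(stack) != 1:
        raise ValueError("proof must leave exactly one item on the stack")
    return node_hash(stack[0])
```

A verifier would compare the value returned by `execute` against the root hash it already trusts, and reject the proof on any mismatch or raised error.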

Expand All @@ -106,32 +111,33 @@ Efficient proof generation is important since nodes will likely receive a high v

Nodes can generate proofs for a set of keys by traversing through the tree from the root and building up the required proof branches. Much like the batch operator algorithm, this algorithm takes a batch of sorted, unique keys as input.

_Simplified pseudocode for proof generation (based on an in-order traversal):_

- Given a node and a batch of keys to include in the proof:
  - If the batch is empty, append `Push(Hash(node_hash))` to the proof and return
  - Binary search for the current node's key in the batch:
    - If this node's key is found in the batch at index `i`:
      - Partition the batch into left and right sub-batches at index `i` (excluding index `i`)
    - If this node's key is not found in the batch, but could be inserted at index `i` maintaining sorted order:
      - Partition the batch into left and right sub-batches at index `i`
  - **Recurse left:** If there is a left child:
    - If the left sub-batch is not empty, query the left child (appending operators to the proof)
    - If the left sub-batch is empty, append `Push(Hash(left_child_hash))` to the proof
  - Append proof operator:
    - If this node's key is in the batch, or if the left sub-batch is not empty and no left child exists, or if the right sub-batch is not empty and no right child exists, or if the left child's right edge queried a non-existent key, or if the right child's left edge queried a non-existent key, append `Push(KV(key, value))` to the proof
    - Otherwise, append `Push(KVHash(kv_hash))` to the proof
  - If the left child exists, append `Parent` to the proof
  - **Recurse right:** If there is a right child:
    - If the right sub-batch is not empty, query the right child (appending operators to the proof)
    - If the right sub-batch is empty, append `Push(Hash(right_child_hash))` to the proof
    - Append `Child` to the proof

Since RocksDB allows concurrent reading from a consistent snapshot/checkpoint, nodes can concurrently generate proofs on all cores to service a higher volume of queries, even if our algorithm isn't designed for concurrency.
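
To make the traversal more concrete, here is a simplified Python sketch of the proof generation pseudocode. It is not Merk's implementation: the `Node` class and the tuple encoding of operators are hypothetical, hashes are represented by placeholder strings instead of real digests, and the two "edge queried a non-existent key" conditions are deliberately left out, so absence proofs are only partially covered.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:  # hypothetical in-memory node, for illustration only
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def prove(node: Node, batch: List[int], proof: List[Tuple]) -> None:
    """Append proof operators for the sorted, unique keys in `batch`."""
    if not batch:
        proof.append(("Push", ("Hash", f"hash({node.key})")))
        return

    # Binary search for this node's key and partition the batch around it.
    i = bisect_left(batch, node.key)
    found = i < len(batch) and batch[i] == node.key
    left_batch = batch[:i]
    right_batch = batch[i + 1:] if found else batch[i:]

    # Recurse left.
    if node.left is not None:
        if left_batch:
            prove(node.left, left_batch, proof)
        else:
            proof.append(("Push", ("Hash", f"hash({node.left.key})")))

    # Simplification: the conditions about a child's edge querying a
    # non-existent key are omitted from this sketch.
    needs_kv = (
        found
        or (bool(left_batch) and node.left is None)
        or (bool(right_batch) and node.right is None)
    )
    if needs_kv:
        proof.append(("Push", ("KV", node.key, node.value)))
    else:
        proof.append(("Push", ("KVHash", f"kv_hash({node.key})")))
    if node.left is not None:
        proof.append(("Parent",))

    # Recurse right.
    if node.right is not None:
        if right_batch:
            prove(node.right, right_batch, proof)
        else:
            proof.append(("Push", ("Hash", f"hash({node.right.key})")))
        proof.append(("Child",))
```

Calling `prove(root, sorted_keys, proof)` with an empty `proof` list fills it with operators in the order a verifier is expected to consume them.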

#### Binary Format

We can efficiently encode these proofs by encoding each operator as follows:

```
Push(Hash(hash)) => 0x01 <20-byte hash>
Push(KVHash(hash)) => 0x02 <20-byte hash>
Expand Down Expand Up @@ -161,7 +167,7 @@ Due to the tree structure we already use, streaming the entries in key-order giv
1 3 5 7
```

Our algorithm builds verifiable chunks by first constructing a chunk of the upper levels of the tree, called the _trunk chunk_, plus each subtree below that (each of which is called a _leaf chunk_).

The number of levels to include in the trunk can be chosen to control the size of the leaf chunks. For example, a tree of height 10 should have approximately 1,023 nodes. If the trunk contains the top 5 levels, the trunk and the 32 resulting leaf chunks will each contain ~31 nodes. We can even prove to the verifier that the trunk size was chosen correctly by including an approximate tree height proof: we include the branch all the way to the leftmost node of the tree (node `1` in the figure) and use its height as the basis for selecting the number of trunk levels.
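
The arithmetic in this example can be checked with a few lines of Python. The helper below is hypothetical and assumes a perfectly balanced (full) binary tree, which a real AVL tree only approximates.

```python
def chunk_sizes(height: int, trunk_levels: int):
    """Node counts for a full binary tree split into a trunk and leaf chunks."""
    total_nodes = 2 ** height - 1
    trunk_nodes = 2 ** trunk_levels - 1
    leaf_chunks = 2 ** trunk_levels               # one subtree under each bottom trunk edge
    nodes_per_leaf_chunk = 2 ** (height - trunk_levels) - 1
    return total_nodes, trunk_nodes, leaf_chunks, nodes_per_leaf_chunk


total, trunk, chunks, per_chunk = chunk_sizes(height=10, trunk_levels=5)
# The trunk plus all leaf chunks account for every node in the tree.
assert trunk + chunks * per_chunk == total
```

For a height of 10 and 5 trunk levels this gives 1,023 nodes in total, a 31-node trunk, and 32 leaf chunks of 31 nodes each.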

Expand All @@ -171,6 +177,19 @@ The generated proofs can be efficiently encoded into the same proof format descr

Note that this algorithm produces proofs with very low memory requirements and adds little overhead to the sequential read from disk. In a proof-of-concept benchmark, proof generation was measured at ~750 MB/s on a modern solid-state drive and processor, meaning a 4 GB state tree (the size of the Cosmos Hub state at the time of writing) could be fully proven in ~5 seconds (without considering parallelization). In conjunction with the RocksDB checkpoint feature, this process can happen in the background without blocking the node from executing later blocks.

_Pseudocode for the range proof generation algorithm:_

- Given a tree and a range of keys to prove:
  - Create a stack of keys (initially empty)
  - **Range iteration:** for every key/value entry within the query range in the backing store:
    - Append `Push(KV(key, value))` to the proof
    - If the current node has a left child, append `Parent` to the proof
    - If the current node has a right child, push the right child's key onto the key stack
    - If the current node does not have a right child:
      - While the key stack is not empty and the current node's key is greater than or equal to the key at the top of the key stack, append `Child` to the proof and pop from the key stack

Note that this algorithm produces the proof in a streaming fashion and has very low memory requirements (the only overhead is the key stack, which stays small even for extremely large trees since its length is bounded by the height of the tree, which is `O(log N)`).
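
The range proof pseudocode translates fairly directly into Python. In the sketch below, the `Entry` record (one key-ordered entry from the backing store, carrying just enough child information) and the tuple encoding of operators are hypothetical, and the range is assumed to cover a whole subtree so that every `Parent` and `Child` has something to attach to.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class Entry:  # hypothetical view of one stored node, in key order
    key: int
    value: str
    has_left: bool
    right_key: Optional[int]  # key of the right child, or None


def range_proof(entries: Iterable[Entry]) -> List[Tuple]:
    proof: List[Tuple] = []
    key_stack: List[int] = []  # right children whose subtrees are still open
    for entry in entries:
        proof.append(("Push", ("KV", entry.key, entry.value)))
        if entry.has_left:
            # The left subtree was emitted just before this entry.
            proof.append(("Parent",))
        if entry.right_key is not None:
            key_stack.append(entry.right_key)
        else:
            # Reaching or passing the key on top of the stack means the
            # right subtree rooted at that key is complete.
            while key_stack and entry.key >= key_stack[-1]:
                proof.append(("Child",))
                key_stack.pop()
    return proof
```

Since the function only looks at one entry at a time plus the key stack, it can consume a database iterator directly without loading the tree into memory.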

#### Example Proofs

Let's walk through a concrete proof example. Consider the following tree:
Expand All @@ -186,11 +205,12 @@ Let's walk through a concrete proof example. Consider the following tree:
3 6 8 10
```

_Small proof:_

First, let's create a proof for a small part of the tree. Let's say the user makes a query for keys `1, 2, 3, 4`.

If we follow our proof generation algorithm, we should get a proof that looks like this:

```
Push(KV(1, <value of 1>)),
Push(KV(2, <value of 2>)),
Expand All @@ -210,21 +230,26 @@ Let's step through verification to show that this proof works. We'll create a ve
```
Stack: (empty)
```

We will push a key/value pair on the stack, creating a node. However, note that for verification purposes this node will only need to contain the kv_hash which we will compute at this step.

```
Operator: Push(KV(1, <value of 1>))
Stack:
1
```

```
Operator: Push(KV(2, <value of 2>))
Stack:
1
2
```

Now we connect nodes 1 and 2, with 2 as the parent.

```
Operator: Parent
Expand All @@ -233,6 +258,7 @@ Stack:
/
1
```

```
Operator: Push(KV(3, <value of 3>))
Expand All @@ -242,6 +268,7 @@ Stack:
1
3
```

```
Operator: Push(KV(4, <value of 4>))
Expand All @@ -252,6 +279,7 @@ Stack:
3
4
```

```
Operator: Parent
Expand All @@ -263,7 +291,9 @@ Stack:
/
3
```

Now connect these two graphs with 4 as the child of 2.

```
Operator: Child
Expand All @@ -274,7 +304,9 @@ Stack:
/
3
```

Since the user isn't querying the data from node 5, we only need its kv_hash.

```
Operator: Push(KVHash(<kv_hash of 5>))
Expand All @@ -286,6 +318,7 @@ Stack:
3
5
```

```
Operator: Parent
Expand All @@ -298,7 +331,9 @@ Stack:
/
3
```

We only need the hash of node 9.

```
Operator: Push(Hash(<hash of 9>))
Expand All @@ -312,6 +347,7 @@ Stack:
3
9
```

```
Operator: Child
Expand All @@ -325,4 +361,4 @@ Stack:
3
```

Now after going through all these steps, we have sufficient knowledge of the tree's structure and data to compute node hashes in order to verify. At the end, we will have computed a hash for node 5 (the root), and we verify by comparing this hash to the one we expected.
