
Element-addressable memory follow-up #1614

Open
1 of 3 tasks
plafer opened this issue Jan 8, 2025 · 11 comments
@plafer
Contributor

plafer commented Jan 8, 2025

This is an issue covering all the things left to do following #1598.

  • Fix the book
    • i.e. fix all the places that refer to word-addressable memory
  • Add a 32-bit range-check on the memory address being accessed
    • This was previously never implemented
  • Add a 32-bit range-check on the word_addr column
    • See context for why this is needed here (note: batch has since been renamed to word_addr)

Ideally, we'd only need a single 32-bit range-check, but having both is at least sufficient for correctness.

@bobbinth bobbinth modified the milestones: v0.13.0, v0.12.0 Jan 8, 2025
@bobbinth
Contributor

Do we have a sketch for the 32-bit range checks? It seems to me that we need these for soundness (and we also need to make sure that the address is a multiple of 4 for word-addressable operations). The naive approach would require 2 more columns in the main trace and 1 more column in the auxiliary trace. This would bring up the cost of element-addressable memory to 3 extra columns in the main trace and 1 extra column in the auxiliary trace, making it somewhat less attractive.

Could we do anything better than the naive approach?

@Al-Kindi-0
Collaborator

Maybe we can avoid the 3 extra columns in the main trace (but not the one in the auxiliary trace) by range checking the address on the decoder side only.
I will open an issue soon to discuss this further.

@bobbinth
Contributor

bobbinth commented Jan 12, 2025

I think we can probably discuss this in this issue.

One potential solution which almost works is to add the 010 operation prefix as a flag for range-check operations on the stack side (similar to how we do this for operations with the 100 prefix). Basically, any operation whose opcode starts with 010 or 100 would have 4 range checks available to it. This would require:

  1. Making a minor modification to the EQ operation (so that it uses a different helper register).
  2. Moving FRIE2F4 operation into a different bucket (e.g., maybe into "high-degree operations").

This would help us with the MLOADW, MSTOREW, and MSTORE operations. The unsolved bit is the MLOAD operation, which has a different prefix because we don't shift the stack to the left on MLOAD.

We could change MLOAD to shift the stack to the left as well, and then to execute it, we'd do something like PAD MLOAD. But this feels a bit hacky.

Oh - one other operation I forgot about is MSTREAM which would also need to be handled on the stack side.

@Al-Kindi-0
Collaborator

I would like to pinpoint the problem(s) we are solving. The way I understand it, the main question is:

Do we need to range check both the memory address of the op and the memory address of the word? Or is range checking only one of the two enough, and if so, which one?

I think the answer is that we are going to need two u32 range checks; however, we can do both range checks on the decoder side. The way to go about it is:

  1. The prover sets the helper registers h0 and h1 to two 16-bit limbs, say a_l and a_h such that $a_{addr} = a_h 2^{16} + a_l$
  2. The prover sets the helper registers h2 and h3 to two 16-bit limbs, say b_l and b_h such that $a_{word-addr} = 4 (b_h 2^{16} + b_l)$
  3. The prover sets the helper registers h4 and h5 to $idx0$ and $idx1$.

We then enforce the following constraint $a_{addr} = 4 (b_h 2^{16} + b_l) + 2 \cdot idx1 + idx0$ and do 16-bit range checks on $(a_l, a_h, b_l, b_h)$. Note that

  1. $(idx0, idx1)$ will be included in the bus request and hence checking that they are binary on the chiplet side should be enough.
  2. The above should in principle guarantee the by-4 alignment.
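The decomposition above can be sketched in plain Python. This is a hypothetical model, not VM code: the function and variable names are mine, and I'm assuming the word address equals $4 (b_h 2^{16} + b_l)$ as in step 2.

```python
# Hypothetical sketch of the limb-decomposition scheme described above.
# Names (prover_helpers, constraint_holds, a_l, b_h, ...) are illustrative.

def decompose_16bit(x: int) -> tuple[int, int]:
    """Split a u32 value into low/high 16-bit limbs."""
    assert 0 <= x < 2**32
    return x & 0xFFFF, x >> 16

def prover_helpers(addr: int) -> dict:
    """What the prover would place in helper registers h0..h5."""
    idx0, idx1 = addr & 1, (addr >> 1) & 1  # element index within the word
    q = addr >> 2                           # word address = 4 * q
    a_l, a_h = decompose_16bit(addr)
    b_l, b_h = decompose_16bit(q)
    return {"a_l": a_l, "a_h": a_h, "b_l": b_l, "b_h": b_h,
            "idx0": idx0, "idx1": idx1}

def constraint_holds(h: dict, addr: int) -> bool:
    """Check addr = 4*(b_h*2^16 + b_l) + 2*idx1 + idx0, the 16-bit range
    checks on (a_l, a_h, b_l, b_h), and that idx0/idx1 are binary."""
    limbs_ok = all(0 <= h[k] < 2**16 for k in ("a_l", "a_h", "b_l", "b_h"))
    binary_ok = h["idx0"] in (0, 1) and h["idx1"] in (0, 1)
    eq_ok = addr == 4 * (h["b_h"] * 2**16 + h["b_l"]) + 2 * h["idx1"] + h["idx0"]
    addr_ok = addr == h["a_h"] * 2**16 + h["a_l"]
    return limbs_ok and binary_ok and eq_ok and addr_ok
```

Since the word address is reconstructed as $4 (b_h 2^{16} + b_l)$ from range-checked limbs, the by-4 alignment comes for free once the limbs pass their 16-bit checks.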

As @bobbinth mentioned in his previous comment, we are going to shuffle around and change some of the opcodes to make this work.

@plafer
Contributor Author

plafer commented Jan 14, 2025

As discussed offline, whether we perform the range checks on the decoder side or the chiplet side will depend notably on #1610.

In short, the downside of doing it on the chiplet side is that it adds more columns to the chiplet. However, if with #1610 we decide to stack the circuit evaluation chiplet with the other chiplets, and this results in more columns being added to all chiplets, then we would perform the range-checks in the memory chiplet as well. Otherwise, we would probably prefer to do them on the decoder side.

Another consideration is that the circuit evaluation chiplet would also need to perform memory accesses, but wouldn't benefit from the do_range_checks column discussed below (since it is in the decoder part of the main trace only, and thus not accessible to chiplets).

> As @bobbinth mentioned in his previous comment, we are going to shuffle around and change some of the opcodes to make this work.

Also as discussed offline, we could avoid shuffling the opcodes around by adding a do_range_checks column, which would perform 16-bit range-checks on the first 4 helper registers.

@bobbinth
Contributor

One thing I was thinking about: if there is a way to somehow enforce sorting without relying on $d_0$ and $d_1$ columns in the memory chiplet (i.e., the columns that perform range checks), we could use these 2 range checks for the address. Then, we should be able to validate the address on the chiplet side, which has lots of nice properties.

I know there are some memory designs which do not rely on range checks. @Al-Kindi-0 @plafer - do you know how these work? Basically, curious how costly these would be as compared to the alternatives.

@Al-Kindi-0
Collaborator

The $d_0$ and $d_1$ columns are there first and foremost to sort by clock cycle. In other words, one can have a memory argument without sorting by addresses, but cannot, as far as I know, have a memory argument without sorting by clock cycle.
Hence $d_0$ and $d_1$ are fundamental, as far as I can see, due to the need for clock-cycle ordering.
The next question is then if we can avoid ordering by address, and here there are two ways, as far as I know, to do this:

  1. The contiguity argument, where the main idea is to first ensure that all memory records are grouped by (unique) addresses. Once we have this grouping, and assuming that within each group the records are ordered by clock cycle, we can check the consistency of memory records (at each address). The contiguity argument comes at a significant cost, though: by my count from the document, 3 main-trace columns and 4 extension-trace columns.
  2. The offline memory argument: this is quite different from our current design, and in my opinion it would require significant changes in order to fully utilize its properties (assuming those properties are what we want).

@bobbinth
Contributor

> In other words, one can have a memory argument without sorting by addresses but cannot, as far as I know, have a memory argument without sorting by clock cycle.

Totally agree that we need to sort by clock cycle - but I guess the question is do other constructions rely on range checks to do sorting, or do they use something else for this?

@Al-Kindi-0
Collaborator

I'm not familiar with other ways to do the sorting without range checking the differences, and from the docs referenced above it seems that they use range checks to do the ordering.

@plafer
Contributor Author

plafer commented Jan 28, 2025

In this comment, I will summarize the state of the discussion regarding the options for implementing the missing range checks.

To recap, the problem we're trying to solve is guaranteeing that the values of the w_addr column in the memory chiplet are multiples of 4. We will implement this constraint using 2 range-checks due to the following observation:

Note: idx = 2*idx1 + idx0

if

  1. addr = 4*w_addr + idx
  2. addr < 2^32
  3. w_addr < 2^32
  4. idx < 4 (which is already enforced by decomposing idx = 2*idx1 + idx0)

then w_addr % 4 == 0 (or in words, w_addr is a multiple of 4). Thus, if we run a 32-bit range-check on addr and on w_addr, then w_addr is guaranteed to be a multiple of 4.

Note: the VM only has 16-bit range checks. We emulate a 32-bit range-check on some value x by decomposing x into x1 and x0 such that x = 2^16 * x1 + x0, and running two 16-bit range checks, one on x1 and one on x0.
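As a quick illustration, the emulated 32-bit range-check and the no-wrap-around part of the alignment argument might look like this in Python. The helper names are hypothetical, and `range_check_16` merely stands in for the VM's 16-bit range checker.

```python
# Illustrative model of the 32-bit range-check emulation described above.

GOLDILOCKS = 2**64 - 2**32 + 1  # the field modulus used by Miden VM

def range_check_16(v: int) -> bool:
    """Stand-in for the VM's 16-bit range checker."""
    return 0 <= v < 2**16

def range_check_32(x1: int, x0: int) -> bool:
    """x = 2^16 * x1 + x0 is < 2^32 iff both limbs pass the 16-bit check."""
    return range_check_16(x1) and range_check_16(x0)

def split_u32(x: int) -> tuple[int, int]:
    """Decompose x into (x1, x0) with x = 2^16 * x1 + x0."""
    return x >> 16, x & 0xFFFF

def no_wraparound(w_addr: int, idx: int) -> bool:
    """With w_addr < 2^32 and idx < 4, the value 4*w_addr + idx is at most
    2^34 + 2, far below the modulus, so the constraint addr = 4*w_addr + idx
    holds over the integers whenever it holds in the field."""
    return 4 * w_addr + idx < GOLDILOCKS
```

This is why the two 32-bit range checks (on addr and on w_addr) suffice: they rule out field wrap-around in the constraint relating the two.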

We are still exploring two possible approaches: either we run the range-checks in the Memory chiplet itself, or on the "caller" side (e.g., when an mload or mstore instruction is executed). Which of the two approaches is more desirable also depends on choices made for the new ACE chiplet (#1636).

Both approaches will require adding an extra auxiliary column for the range-checks, since the current one is already "full" (i.e. the logup constraints are already degree 9). However, @Al-Kindi-0 mentioned that we might be able to optimize the current one such that we wouldn't need this new column.

Next, I will describe how both approaches work in more detail.

Run the range-checks in the Memory chiplet

With this approach, we replace the w_addr column with 4 columns: addr1, addr0, w_addr1, w_addr0, where addr = 2^16 * addr1 + addr0 and w_addr = 2^16 * w_addr1 + w_addr0. We run the range-checks on these 4 new columns. This requires no changes to the messages currently sent on the bus.

Run the range-checks on the caller side

The two currently known "callers" are the decoder (with instructions mload{w}, mstore{w}, mstream, etc.) and the ACE chiplet. We will cover each independently.

decoder-side

With this approach, we add a run_range_checks column which, when set to 1, runs 16-bit range checks on the first 4 helper registers. All instructions that access memory will put addr1, addr0, w_addr1 and w_addr0 in those registers.

For those range-checks to carry over to the Memory chiplet, we need to use those range-checked values in the bus message itself. Thus, the addr variable in the bus message would be changed to $4 * (2^{16} \cdot w_{addr1} + w_{addr0}) + 2\cdot idx1 + idx0$. We'd also constrain $2^{16} \cdot addr_1 + addr_0 = 4 * (2^{16} \cdot w_{addr1} + w_{addr0})$.
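A small sketch of this bus-message construction (my own naming; assuming, as above, that w_addr1/w_addr0 are the 16-bit limbs of the word index and idx1/idx0 are binary):

```python
# Hypothetical model of the decoder-side bus message; not VM code.

def bus_message_addr(w_addr1: int, w_addr0: int, idx1: int, idx0: int) -> int:
    """The addr variable as it would appear in the bus message, rebuilt
    entirely from range-checked limbs and binary index bits."""
    return 4 * (2**16 * w_addr1 + w_addr0) + 2 * idx1 + idx0

def addr_limb_consistency(addr1: int, addr0: int,
                          w_addr1: int, w_addr0: int) -> bool:
    """The extra constraint 2^16*addr1 + addr0 = 4*(2^16*w_addr1 + w_addr0)."""
    return 2**16 * addr1 + addr0 == 4 * (2**16 * w_addr1 + w_addr0)
```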

Thus, I believe we also need to commit to idx1 and idx0 in helper registers for this to work, although I am not sure if there's a workaround that avoids it. This would bring up the number of helper registers needed to serve a memory access to 6.

So far we've only discussed instructions that access a single memory address. However, instructions like MSTREAM and RCOMBASE make 2 memory requests. It is currently unclear to me how we would handle that, as the naive approach would bring up the number of helper registers needed to 8 or 12. We would also need 8 range-checks (which would probably mean adding a second run_range_checks column). Also, RCOMBASE already uses all helper registers, so we would need another 6 registers for that instruction. Although this instruction is being replaced in #1633, this highlights challenges we might face with a future instruction.

ACE chiplet side

Note: once @Al-Kindi-0 posts a summary of the current design of the ACE chiplet, I will update this discussion.

Summary

Below we summarize the pros and cons of each approach:

  • Memory-side
    • Pros
      • simple
      • Requires no change to any new instruction or chiplet that wants to access the memory
    • Cons
      • This adds 3 columns to the chiplets, which will be padding for all other chiplets as they are today.
        • However, note that the number of columns of the ACE chiplet is still TBD. If it turns out to require those 3 columns, then we clearly want to go with this approach and implement the range-checks in the memory chiplet.
  • Caller-side
    • Pros
      • We don't need to add the 3 columns in the Memory chiplet
    • Cons
      • Need to add 8 new columns in the decoder
        • That is, 2 run_range_checks columns, and 6 additional helper columns
        • (although I'm probably missing something here which would bring this number down)
      • Any new instruction or chiplet needs to run range-checks, which could be inconvenient, as is the case of the RCOMBASE instruction or the ACE chiplet

@Al-Kindi-0
Collaborator

Suppose we stack the chiplets as follows:

Hasher -> Bitwise -> ACE -> Memory -> ROM

Then we could reuse the wiring bus for the ACE chiplet for the range checks performed by the memory chiplet. The way we can do this is by defining the bus equation as
$$b^{'} = b + s_{\textsf{ACE}} \cdot \left(\frac{m}{\alpha -(id_{\textsf{out}} + \beta\cdot v_{\textsf{out}})} - \frac{1}{\alpha -(id_{\textsf{l-in}} + \beta\cdot v_{\textsf{l-in}})} - \frac{1}{\alpha -(id_{\textsf{r-in}} + \beta\cdot v_{\textsf{r-in}})} \right) - s_{\textsf{MEM}} \cdot \left(\frac{1}{\alpha -v_0} + \frac{1}{\alpha -v_1} + \frac{1}{\alpha -v_2} + \frac{1}{\alpha -v_3} \right)$$

Here:

  1. $s_{\textsf{ACE}}$ is an ACE chiplet selector.
  2. $s_{\textsf{MEM}}$ is a Memory chiplet selector.
  3. $id_{\textsf{out}}$, $id_{\textsf{l-in}}$ and $id_{\textsf{r-in}}$ are respectively the ids of the output, left input and right input wires.
  4. $v_{\textsf{out}}$, $v_{\textsf{l-in}}$ and $v_{\textsf{r-in}}$ are respectively the values of the output, left input and right input wires.
  5. $m$ is the output arity of the gate.
  6. $v_i$ is the value of the $i$-th 16-bit limb range checked by the memory chiplet.

The constraints on the above bus will then be:

  1. For rows of the ACE chiplet:
    1. At the beginning of each circuit evaluation cycle, the bus has to start at value $0$.
    2. At the end of each circuit evaluation cycle, the bus has to end with the value $0$.
  2. For rows of the Memory chiplet:
    1. The bus has to start with the value $0$ at the first row of the Memory chiplet.
    2. The terminal value of the current bus when added to the terminal value of the other range checker bus should sum to $0$.
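For intuition, the cancellation underlying this shared bus is the usual LogUp one: responses contribute $m/(\alpha - v)$, requests contribute $-1/(\alpha - v)$, and the terminal is $0$ exactly when every request is matched by a response with the right multiplicity. A toy model over the Goldilocks field (function and variable names are mine):

```python
# Toy LogUp-style bus over the Goldilocks field; illustrative only.

P = 2**64 - 2**32 + 1  # Goldilocks prime used by Miden VM

def inv(x: int) -> int:
    """Modular inverse via Fermat's little theorem."""
    return pow(x, P - 2, P)

def bus_terminal(alpha: int,
                 responses: list[tuple[int, int]],
                 requests: list[int]) -> int:
    """responses: (value, multiplicity) pairs adding m/(alpha - v);
    requests: looked-up values subtracting 1/(alpha - v)."""
    total = 0
    for v, m in responses:
        total = (total + m * inv((alpha - v) % P)) % P
    for v in requests:
        total = (total - inv((alpha - v) % P)) % P
    return total
```

In the proposed design, the Memory chiplet rows would play the role of the requests and the range checker the role of the responses, which is why the two terminal values must sum to $0$.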

This brings us to two related questions:

  1. The degree of the above expression defining the bus, and
  2. How to do the stacking optimally?

The answer to the first question is implied by that to the second, and it will involve adding degree-reduction columns to reduce the degree of the selectors.
The answer to the second question will depend on the ongoing work on the horner_eval_* instructions. If our calculations are not far off, I believe that the stack/decoder trace length will end up being very close to the length of the chiplet trace, not including the ACE chiplet. This means that we will need to flatten the Hasher and Bitwise chiplets and reduce their cycles from $8$ to $4$. Focusing only on the Hasher chiplet, as it is the widest, this implies that its width will go from $17$ columns to $29$ columns.

With such a wide chiplet trace, we can afford the degree reduction columns without any issue. This will also mean that adding the extra columns to the memory chiplet should not be an issue.
Given this, the main trace width will go from $71$ to $82$, which is a bit unfortunate: if we went from $71$ to $80$ instead, the cost of the above solution would be no more than a $15$% degradation in VM performance.

Hence, if we can find a way to somehow remove $2$ columns from the main trace, then I believe that with the above solution we will end up with a main trace that is $80$ columns wide and an auxiliary column that is $16$ columns wide. As a bonus, the perfect $8$ field element alignment will mean a faster recursive verifier.
