Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(topdown): pull effects up until committed finality. #887

Merged
merged 3 commits into from
Sep 25, 2024

Conversation

raulk
Copy link
Contributor

@raulk raulk commented Apr 19, 2024

Fendermint was pulling parent effects only until committed_finality-1.

This choice assumed that chains like Filecoin leaked deferred execution mechanics through the query mechanism we use (Eth JSON-RPC API). Therefore it was deemed that the safe choice was to lag behind by one epoch. However, we can safely assume that Eth JSON-RPC APIs will only expose blocks that have been executed and whose side-effects are available (unless "pending" is requested). Filecoin reconciles its deferred execution model with Eth client expectations of immediate execution, alas with some known rough edges [1].

Furthermore, this choice assumed that our chain_head_delay can be 0, i.e. we can consume the exposed HEAD of the parent in real time. In practice, that's not a good idea, ever.

All in all, this is problematic because it's a hidden gotcha. Users inspecting the committed parent finality on-chain and seeing N now need to understand that events emitted in N are (counterintuitively) not imported yet, and they need to wait longer.

This PR:

  • Rectifies the queried range so that we import up to, and including, the committed parent finality.
  • Disallows a zero chain_head_delay.

Note: I'm clarifying with @aakoshh the mechanics of Fendermint's Eth JSON-RPC API; concretely, if it renconciles deferred execution or not. There are some off-by-one shifts assumed in a number of places, so unclear. However, by disallowing a chain_head_delay of 0, we are able to work with APIs that expose heights before their results are available.

Note 2: when we adopt light clients to watch the parent, some assumptions may change.

[1] filecoin-project/lotus#11205

Fendermint was pulling parent effects only until committed_finality-1.

This choice assumed that chains like Filecoin leaked deferred execution
mechanics through the query mechanism we use (Eth JSON-RPC API). Therefore
it was deemed that the safe choice was to lag behind by one epoch. However,
we can safely assume that Eth JSON-RPC APIs will only expose blocks that
have been executed and whose side-effects are available (unless "pending"
is requested). Filecoin reconciles its deferred execution model with
Eth client expectations of immediate execution, alas with some known rough
edges [1].

Furthermore, this choice assumed that our chain_head_delay can be 0, i.e.
we can consume the exposed HEAD of the parent in real time. In practice,
that's not a good idea, ever.

All in all, this is problematic because it's a hidden. Users inspecting
the committed parent finality and seeing N now need to understand that
events emitted in N are (counterintuitively) not imported yet, and they
need to wait longer.

This PR:
- Rectifies the queried range so that we import up to, and including,
  the committed parent finality.
- Disallows a zero chain_head_delay.

Note: I'm clarifying with @aakoshh the mechanics of Fendermint's Eth
JSON-RPC API; concretely, if it renconciles deferred execution or not.
There are some off-by-one shifts assumed in a number of places, so
unclear. However, by disallowing a chain_head_delay of 0, we are able
to work with APIs that expose heights before their results are available.

Note 2: when we adopt light clients to watch the parent, some assumptions
may change.

[1] filecoin-project/lotus#11205
@raulk raulk requested review from aakoshh and cryptoAtwill April 19, 2024 12:10
Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to have a problem where the data pre-fetched by the sycner into the cache, and the data fetched in-sync during catch-up, were different because of which height the query was executed at. It seems to make sense that if we rely on the Ethereum API then you can get the results of a final height at that height, so hopefully this is not an issue here.

Worth noting that this isn't just a fix, it's a breaking change that would cause consensus failure if applied by anyone with an existing chain. That is, anyone trying to sync with the chain from genesis would apply different top-down effects. The disciplined way to roll this out for such chains would be an upgrade that enables this behaviour from a certain height onward.

Regarding the Fendermint Ethereum API itself: the ethapi pretends that the latest height of the chain is 1 less than what it really is in CometBFT, to make sure it only shows executed blocks. It also adjusts the query height by +1 to execute queries at the height the results are stored at.

What I'm not sure about is the event log. The eth_getLogs API goes block by block, looking at the results of the execution, so if you go [990-1000] you will see the the effects of block 1000. The end of the range is also explicitly restricted to the latest height (which is -1 adjusted, so results should be available). If the results still weren't available (e.g. because someone queries by block hash and it's not executed yet, for whatever reason), the results would be empty for that block, with no error returned. If someone were to use the CometBFT search API to look for transaction effects with certain topics, they would see them at however CometBFT chooses the block index, which I believe is by where the transaction was included.

@aakoshh
Copy link
Contributor

aakoshh commented Apr 19, 2024

It's also worth noting that if somebody just applied this on a running chain, they could accidentally skip the effects of a block:

  1. Say we are running a fendermint without this change.
  2. Say prev_height = 100 and finalized.height = 105.
  3. The execution grabs the topdown effects between 100 and 104 (inclusive)
  4. Now we restart fendermint with this change
  5. The next block is executing, we have a new parent finality at 110, ie. prev_height = 105, finalized.height = 110
  6. The execution grabs the topdown effects between 106 and 110

105 ends up not being applied.

@aakoshh
Copy link
Contributor

aakoshh commented Apr 19, 2024

Another reason to roll this out as an upgrade, or with a condition of which height to apply this logic from, is that unless all in-sync nodes are restarted right at the same height, they can experience consensus failure too, not just the ones syncing from genesis.

@aakoshh
Copy link
Contributor

aakoshh commented Apr 19, 2024

So the way to roll this would be something like:

let (execution_fr, execution_to) = 
  match state.block_height() {
    h if h <  self.finality_exec_switch_height => (prev_height, finality.height - 1), 
    h if h == self.finality_exec_switch_height => (prev_height, finality.height),
    _                                          => (prev_height + 1, finality.height),
  };    

And to add a finality_exec_switch_height setting with a default value of 0.

@raulk
Copy link
Contributor Author

raulk commented May 7, 2024

@aakoshh This is consensus breaking indeed, with the mitigating factor that that the fork would only manifest if the events being pulled top-down are in a boundary height. The fork would not manifest if the events are in the inner portion of the parent block range being pulled.

Prior to submitting this PR, Fluence had already applied a height condition in their IPC fork. I am not aware of any other live networks that would be impacted by this change, so taking a pragmatic stance, I'm reluctant to add the proposed switch flag in our codebase as it'll quickly turn into technical debt without sound justification. I think it's safe to assume that —today— all IPC adopters are running forks, and reviewing incoming changes carefully before merging from upstream.

Taking a step back, what's missing here is proper versioning, a changelog, and explicit breaking change signalling. We do expect a large amount of consensus-breaking changes in the next months as we refactor things and build out missing features (as IPC is really in its infancy), and it won't scale well to put each one behind a fork flag.

A more manageable solution for the time being is to treat this repo as experimental (through v0 versioning), flag consensus-breaking changes in changelogs, and work directly with impacted adopters to evaluate the condition they'd need to include. This is more reactive, but right now it's a tradeoff I'd strongly advise us to accept.

This situation will change when we modularise IPC, and package it behind neat modules and APIs that can be composed to build a customized client without forking the whole repo. Fluence really wants this and I think it makes a ton of sense.

Furthermore, I'd like to note that we've been making contract and Solidity and Rust compiler changes without much sympathy in that they affect the genesis payload and break syncing from genesis.

@cryptoAtwill cryptoAtwill requested a review from a team as a code owner September 25, 2024 09:33
Copy link
Contributor

@cryptoAtwill cryptoAtwill left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be fine, LGTM.

Anyways, once the cert and commitment is onchain, then there is no need for this from and to anymore.

@raulk raulk merged commit ad43416 into main Sep 25, 2024
15 checks passed
@raulk raulk deleted the raulk/fix-top-down-finality-range-read branch September 25, 2024 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants