Fixes Issues w/ Stream Crashing Over Large Replay #996

Open · wants to merge 43 commits into main from l-monninger/stream-size-fix
Conversation

@l-monninger (Collaborator) commented on Jan 13, 2025

Summary

Adds fixes and features for debugging issues with the Movement DA Light Node stream.

  • Problem: over large replays, the stream from Celestia was occasionally dropped, owing to (a) overfetching against the Celestia API, which (b) could result in errors that were incorrectly propagated and so closed the stream.
  • Solution: this PR addresses the streaming issues by softening error propagation. It also adds debugging tools.
  • Other Common Problems: when debugging in a production context, transport issues may occasionally arise. gRPC streams over HTTP/2 do not always play nicely with Cloudflare and AWS ALB in particular. The updated production setup avoids routing HTTP/2 traffic through either of these, preferring a direct connection instead.

Review the change log below for useful debugging features!

Change Log

Usage: `movement-full-node da stream-blocks --movement-path <MOVEMENT_PATH> <LIGHT_NODE_URL> <FROM_HEIGHT>`
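
For example, replaying from height 0 against the Bardock testnet light node mentioned in the logs below (the path and height here are illustrative):

`movement-full-node da stream-blocks --movement-path ~/.movement https://movement-celestia-da-light-node.testnet.bardock.movementlabs.xyz:443 0`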

Tip

As a rule of thumb, insofar as tests against the generalized version of the stream with the mock pass, issues with stream disconnects are either (a) transport issues or (b) implementation issues stemming from a case where an error in the stream should have been treated as non-fatal.
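
To illustrate (b), here is a minimal sketch of the softening pattern, with a hypothetical blob stream and error type (all names here are illustrative, not the PR's actual code): transient faults are logged and skipped so one failed fetch cannot close the stream, while genuinely fatal faults still propagate.

```rust
use futures::stream::{Stream, StreamExt};

/// Hypothetical error type for the blob stream.
enum StreamError {
    /// Transient faults, e.g. a failed fetch against the Celestia API.
    Transient(String),
    /// Faults that genuinely invalidate the stream.
    Fatal(String),
}

/// Drains the stream, logging transient errors instead of propagating
/// them, so a single overfetch cannot close the stream.
async fn drain<S>(mut blobs: S) -> Result<(), StreamError>
where
    S: Stream<Item = Result<Vec<u8>, StreamError>> + Unpin,
{
    while let Some(next) = blobs.next().await {
        match next {
            Ok(blob) => {
                // Process the blob here.
                let _ = blob;
            }
            // Non-fatal: log and keep the stream open.
            Err(StreamError::Transient(e)) => eprintln!("transient stream error: {e}"),
            // Fatal: propagate, which closes the stream.
            Err(e @ StreamError::Fatal(_)) => return Err(e),
        }
    }
    Ok(())
}
```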

  • Introduces Digest Store: this digests blobs before sending them to the DA, e.g. Celestia, and looks them up when rebuilding the stream. Importantly, this removes the need for the blob submission heuristics, as all submitted blobs are now a fixed size (see the sketch below).
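
A minimal sketch of the digest idea, assuming SHA-256 via the sha2 crate and an in-memory map standing in for the store; the actual store and its API may differ:

```rust
use sha2::{Digest, Sha256};
use std::collections::HashMap;

/// Hypothetical digest store: maps a fixed-size digest to the full blob.
#[derive(Default)]
struct DigestStore {
    blobs: HashMap<[u8; 32], Vec<u8>>,
}

impl DigestStore {
    /// Stores the blob and returns its 32-byte digest; the digest is what
    /// gets submitted to the DA instead of the variable-size blob.
    fn put(&mut self, blob: Vec<u8>) -> [u8; 32] {
        let digest: [u8; 32] = Sha256::digest(&blob).into();
        self.blobs.insert(digest, blob);
        digest
    }

    /// Looks a blob back up by its digest when rebuilding the stream.
    fn get(&self, digest: &[u8; 32]) -> Option<&[u8]> {
        self.blobs.get(digest).map(|b| b.as_slice())
    }
}
```

Because every submission is then exactly 32 bytes, size-based submission heuristics become unnecessary.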

Tip

If additional issues are encountered specifically with high-contention Celestia streams, use the disk-fifo provider as a temporary patch.

Testing

Outstanding issues

@0xmovses (Contributor) left a comment:


Need to update for governed-gas-pool check also.

@@ -17,6 +17,7 @@ message BlobResponse {
Blob passed_through_blob = 1;
Blob sequenced_blob_intent = 2;
Blob sequenced_blob_block = 3;
Blob heartbeat_blob = 4;
@l-monninger (Collaborator, Author) replied:

This is a backwards-compatible change, but we are now at the stage where the versioning starts to matter. I think you should break this out into a v1beta2.proto.

@l-monninger (Collaborator, Author) replied:

I would also not make this a Blob type.

@@ -130,6 +130,22 @@ pub trait DaOperations: Send + Sync {

last_height = height;
}
// Heights that have already been executed are used to send a heartbeat.
Ok(Certificate::Height(height)) => {
// Old certificate; used to send a heartbeat block.
@l-monninger (Collaborator, Author) replied:

There's no need to stream blobs from the DA here. You can just yield something that will be interpreted as a heartbeat. That is, this should basically just be `yield DaBlob::heartbeat()` or similar.
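
A sketch of what that could look like, using the async-stream crate and hypothetical DaBlob and Certificate types modeled on the diff above; only the shape matters here, not the names:

```rust
use async_stream::try_stream;
use futures::stream::Stream;

/// Hypothetical blob type with a heartbeat constructor, as suggested above.
enum DaBlob {
    Data(Vec<u8>),
    Heartbeat,
}

impl DaBlob {
    fn heartbeat() -> Self {
        DaBlob::Heartbeat
    }
}

/// Hypothetical certificate type modeled on the diff above.
enum Certificate {
    Height(u64),
}

/// Turns certificates into a blob stream, yielding a synthetic heartbeat
/// for already-executed heights instead of re-fetching blobs from the DA.
fn blob_stream(
    certificates: Vec<Result<Certificate, std::io::Error>>,
) -> impl Stream<Item = Result<DaBlob, std::io::Error>> {
    try_stream! {
        let mut last_height = 0u64;
        for certificate in certificates {
            match certificate {
                // New height: fetch and yield the real blobs (elided here).
                Ok(Certificate::Height(height)) if height > last_height => {
                    last_height = height;
                    yield DaBlob::Data(vec![]); // placeholder for the real blobs
                }
                // Already-executed height: just yield a heartbeat.
                Ok(Certificate::Height(_)) => {
                    yield DaBlob::heartbeat();
                }
                // Soft error handling: log and keep streaming.
                Err(e) => eprintln!("non-fatal certificate error: {e}"),
            }
        }
    }
}
```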

Contributor replied:

OK, I thought you wanted to yield a blob to keep the same type of data in BlobType. I'll change it to something simpler.

Contributor replied:

I've pushed an update with a v1beta2.proto file and a heartbeat represented as a bool.

@musitdev force-pushed the l-monninger/stream-size-fix branch from 1473fcf to 9dd4c62 on January 22, 2025, 16:55
@andygolay (Contributor) commented:

I tried syncing a follower node against the most recent commit, 057e22de350d51b237b48edf77cc35e96cf8590f. I wanted to report a couple of errors, to document them.

  • First, from the movement-full-follower logs, a repeated `Error: Failed to create the executor`. I believe this may be a firewall issue:
ubuntu@ip-172-31-28-150:~$ docker logs 66a07f2f4c78 --tail 10
   5: tokio::runtime::park::CachedParkThread::block_on
   6: tokio::runtime::runtime::Runtime::block_on
   7: movement_full_node::main
   8: std::sys_common::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: __libc_start_call_main
  13: __libc_start_main_alias_2
  14: _start
ubuntu@ip-172-31-28-150:~$ docker logs 66a07f2f4c78 --tail 100
    0: Failed to connect to light node
    1: transport error
    2: dns error: failed to lookup address information: Name or service not known
    3: dns error: failed to lookup address information: Name or service not known
    4: failed to lookup address information: Name or service not known

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: movement_da_light_node_client::MovementDaLightNodeClient::try_http2::{{closure}}
   2: movement_full_node::MovementFullNode::execute::{{closure}}
   3: <core::pin::Pin<P> as core::future::future::Future>::poll
   4: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   5: tokio::runtime::park::CachedParkThread::block_on
   6: tokio::runtime::runtime::Runtime::block_on
   7: movement_full_node::main
   8: std::sys_common::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: __libc_start_call_main
  13: __libc_start_main_alias_2
  14: _start
2025-01-29T23:21:38.376469Z  INFO movement_full_node::node::partial: Creating the http2 client https://movement-celestia-da-light-node.testnet.bardock.movementlabs.xyz:443
Error: Failed to create the executor

Caused by:
    0: Failed to connect to light node
    1: transport error
    2: dns error: failed to lookup address information: Name or service not known
    3: dns error: failed to lookup address information: Name or service not known
    4: failed to lookup address information: Name or service not known

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: movement_da_light_node_client::MovementDaLightNodeClient::try_http2::{{closure}}
   2: movement_full_node::MovementFullNode::execute::{{closure}}
   3: <core::pin::Pin<P> as core::future::future::Future>::poll
   4: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
   5: tokio::runtime::park::CachedParkThread::block_on
   6: tokio::runtime::runtime::Runtime::block_on
   7: movement_full_node::main
   8: std::sys_common::backtrace::__rust_begin_short_backtrace
   9: std::rt::lang_start::{{closure}}
  10: std::rt::lang_start_internal
  11: main
  12: __libc_start_call_main
  13: __libc_start_main_alias_2
  14: _start
2025-01-29T23:21:39.430921Z  INFO movement_full_node::node::partial: Creating the http2 client https://movement-celestia-da-light-node.testnet.bardock.movementlabs.xyz:443
Error: Failed to create the executor
...

This was in an EC2 instance without any special firewall permissions. The movement-full-follower container exits almost immediately whenever I attempt to restart it.

  • Also, in the instance I'm using, the ~/.movement/config.json file was empty after I ran through the steps in the follower node runbook. If anyone has an idea of what causes an empty config.json, it would help me make further testing progress.
