
Lucy Local Framework extension #14648

Merged: 32 commits into develop on Jan 11, 2024

Conversation

@MartinMinkov (Contributor) commented Nov 30, 2023

Summary

Implements o1-labs/rfcs#40 🗒️

This PR implements the local extension of Lucy, which uses Docker Swarm to run integration tests on a single host. Extensive testing confirms that all existing integration tests run successfully on the local version. CI integration is not included in this PR; it will follow in a separate PR where we can spin up a single machine and run the tests locally.

Changes from the RFC

This PR changes implementation details outlined in the RFC regarding how Lucy communicates with nodes and how logs are gathered. In the RFC, the original plan was to mount a Unix pipe to ingest all logs from each container, which the Lucy process would read from and turn into events for test conditions. Instead, this PR implements the same log-gathering process as the cloud integration, in which we use the GraphQL interface of each node in the integration test.

In more detail, the cloud integration uses a GraphQL interface to tell nodes to start collecting structured logs and to expose those logs via a GraphQL query. When nodes are first launched, Lucy issues a StartFilteredLog mutation, which tells the node to start collecting any structured logs it generates. We can then query for those structured logs over the same GraphQL interface and use them as test conditions.
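As a rough illustration of this flow, here is a minimal OCaml sketch; the helper functions are hypothetical stand-ins for the real GraphQL client calls, not the actual engine code:

(* Sketch of the log-gathering flow, assuming hypothetical helpers
   [send_start_filtered_log] (wrapping the StartFilteredLog mutation) and
   [fetch_filtered_log_entries] (wrapping the corresponding query).
   Requires the unix library for Unix.sleepf. *)

let send_start_filtered_log ~graphql_uri:_ : (unit, string) result = Ok ()

let fetch_filtered_log_entries ~graphql_uri:_ : string list = []

(* Poll a node's GraphQL endpoint and hand each structured log entry to the
   test-condition handler. *)
let rec poll_structured_logs ~graphql_uri ~handle_event =
  List.iter handle_event (fetch_filtered_log_entries ~graphql_uri) ;
  Unix.sleepf 10.0 ;
  poll_structured_logs ~graphql_uri ~handle_event

let run_log_engine ~graphql_uri ~handle_event =
  match send_start_filtered_log ~graphql_uri with
  | Ok () -> poll_structured_logs ~graphql_uri ~handle_event
  | Error err -> prerr_endline ("StartFilteredLog failed: " ^ err)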

This approach is cleaner than the original pipe-based approach because we can leverage the same architecture for both backends. It also makes communication with the nodes easier to handle, since mounting a pipe and reading logs from it can suffer from performance issues and blocked log writes. Finally, the pipe-based approach would not let us scale to multiple Docker Swarm nodes if we ever want to extend this backend to use additional machines beyond a single host.

In summary, we do not use a pipe-based approach for logs because we already have existing infrastructure, via GraphQL polling, that works well and can be extended if we want to add more features.

Details on machine performance

My machine runs an AMD Ryzen 9 5900X 12-core processor with 64GB of RAM. While running tests, the heaviest load occurs during the bootstrap phase, when nodes initialize. While a node is initializing, it uses about 6 CPU threads and around 4-5GB of memory, and resource usage flattens once the node is synced. The next largest load comes from SNARK workers, which take up 6-8 CPU threads. I haven't had any issues with tests failing due to limited resources, even when spinning up 8+ nodes during integration tests. These loads are burst-oriented: when nodes initialize or SNARK workers are computing, a burst of resources is used, and usage flattens out once that work is done.

Running the tests

If you would like to run an integration test, you can use the following script as an example:

#!/bin/bash
set -e
# Build the test executive
make build_intgtest
# Daemon and archive images used for the test
MINA_IMAGE=gcr.io/o1labs-192920/mina-daemon:2.0.0rampup5-develop-1e360f7-bullseye-berkeley
MINA_ARCHIVE=gcr.io/o1labs-192920/mina-archive:2.0.0rampup5-develop-1e360f7-bullseye
# Name of the integration test to run
TEST=zkapps
# Raise the open-file descriptor limit
ulimit -n 65532
# Run the test with the local engine and capture output to test.log
script -f -c "./_build/default/src/app/test_executive/test_executive.exe local $TEST --mina-image=$MINA_IMAGE --archive-image=$MINA_ARCHIVE --debug" test.log

MartinMinkov and others added 19 commits June 8, 2021 12:06
Import the engine interface that will be implemented for the local
engine. Additionally add the local engine as a CLI flag to the
test_executive
This adds the functionality to create a base docker-compose file from
the specified test configs in the integration framework. Each node will
be assigned to a service that will be running a container for each node
in the network.

This implementation currently supports creating a docker-compose file,
creating a docker stack to run each container, and basic monitoring
before continuing the rest of the test_executive execution.

Lots of code in this PR has been stubbed and copied to make the compiler
happy. Much of this code will be refactored to support the changes
needed for the local engine later.
Adds support for a snark coordinator and workers based on the test
configurations passed into the test executive. Right now, we only
implement one snark coordinator shared by all snark workers.

Additionally added a new file for all the default Mina node commands and
environment variables.
Addressed the feedback given in the Network_Config PR. Made the
docker_compose.ml file simpler to read, with additional refactors to
mina_docker.ml. Additionally removed a lot of the copied-over code in
swarm_network to clean up the file.
…Protocol/mina into feature/local-engine-network-config
Added support for archive nodes in the local engine. If archive nodes
are specified in the test plan, the test executive will download the
archive schema to the working directory to use the schema as a bind
mount for the postgres container. The archive node will then connect to
the postgres container as usual.
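For illustration only, a minimal OCaml sketch of the bind mount described above; the record type, container path, and schema filename are hypothetical assumptions, not the actual docker_compose.ml representation:

(* Hypothetical sketch of a bind-mount entry for the postgres service;
   the container path and schema filename below are assumptions. *)

type bind_mount = { host_path : string; container_path : string }

(* The archive schema is downloaded into the working directory and mounted
   into the postgres container so it can initialize the archive database. *)
let archive_schema_mount ~working_dir =
  { host_path = Filename.concat working_dir "create_schema.sql"
  ; container_path = "/docker-entrypoint-initdb.d/create_schema.sql"
  }

let () =
  let m = archive_schema_mount ~working_dir:"/tmp/lucy-local" in
  Printf.printf "%s:%s\n" m.host_path m.container_path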
Renamed node_config to docker_node_config and refactored some of the
modules to have more static references baked in. Additionally refactored
mina_docker to use these static references instead of string values in
many places.
Added functionality for GraphQL networking to the docker containers.
This work was mostly just adding the '--insecure-rest-server' flag to
the nodes and opening up the corresponding rest port within the
docker-compose file.

To allow communication with the docker network from localhost, we map
host ports starting at 7000 to each node's port 3085, which is used for
GraphQL communication (see the sketch below).
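For illustration, a minimal OCaml sketch of this port-mapping scheme; the names below are hypothetical and not the actual docker_compose.ml code:

(* Hypothetical sketch of assigning host ports to node GraphQL ports when
   generating docker-compose services; not the actual docker_compose.ml code. *)

let node_graphql_port = 3085

let host_port_base = 7000

(* Produce a docker-compose style "HOST:CONTAINER" mapping for the
   [index]-th node in the network. *)
let graphql_port_mapping ~index =
  Printf.sprintf "%d:%d" (host_port_base + index) node_graphql_port

let () =
  (* For a three-node network this prints 7000:3085, 7001:3085, 7002:3085. *)
  List.iter
    (fun index -> print_endline (graphql_port_mapping ~index))
    [ 0; 1; 2 ]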
@MartinMinkov force-pushed the feature/local-engine-network-config-develop branch 6 times, most recently from bfe4719 to 5012192, on December 23, 2023 03:13
@MartinMinkov force-pushed the feature/local-engine-network-config-develop branch from 5012192 to 4c3cca3 on December 23, 2023 04:04
These changes make the graphql_polling_log_engine more reusable by turning it into a functor, which allows it to support different networks, such as the docker network. The old files were deleted as they are no longer needed. error_json was added to the dune file in integration_test_lib for better error handling.
Changed the Make_GraphQL_Polling_log_engine to accept a polling interval
parameter. This is important because inside the start_filtered_log
function, we want to call it as soon as the graphql API is available
for a node in the docker network. If we wait longer, we have the
potential to miss the `Node_initialization` event which will lead to a
test failure. Additionally exposed the retry_delay_sec value as a
function parameter to start_filtered_log.
@MartinMinkov force-pushed the feature/local-engine-network-config-develop branch from 1acd4e2 to 38050eb on December 23, 2023 17:28
@MartinMinkov force-pushed the feature/local-engine-network-config-develop branch from 38050eb to 4c2d451 on December 23, 2023 17:46
@MartinMinkov marked this pull request as ready for review on January 2, 2024 17:46
@MartinMinkov requested review from a team as code owners on January 2, 2024 17:46
@MartinMinkov changed the title from "[WIP] Lucy Local Framework extension" to "Lucy Local Framework extension" on Jan 2, 2024
…gration_test_cloud_engine.ml, graphql_polling_log_engine.ml, and integration_test_local_engine.ml for better clarity and understanding of its purpose

(** This implements Log_engine_intf for integration tests, by creating a simple system that polls a mina daemon's graphql endpoint for fetching logs*)

module Make_GraphQL_polling_log_engine
@MartinMinkov (Contributor, Author) commented Jan 2, 2024:

I made the GraphQL polling log engine a functor so that both the cloud and the local integrations can utilize it. In addition to making it a functor, I added a parameter to specify an interval to issue GraphQL requests to start collecting events, which is vitally important for the local version.

It's important because Lucy waits for all nodes to initialize as a test condition. These conditions are checked by issuing GraphQL requests to the nodes to start collecting their events and make them available via a query. There is a race condition where a node can initialize before it has been told to start collecting its logs, which means the node initialization event is missed and the test fails. To fix this, the local version makes these GraphQL requests much more frequently, so nodes start collecting logs immediately after the GraphQL endpoint becomes available.

I thought about making GraphQL availability its own structured event that we listen to, so that we could avoid this interval parameter and just issue a request to collect events as soon as GraphQL is available. I decided not to do that, since we would need to land that change in different branches so that the daemon emits the event, and use newly built Docker images instead of any old ones for testing; this is an easier way to unblock the issue.

The local version issues these requests to nodes every 0.25 seconds, while the cloud version keeps the existing 10-second interval. This interval only kicks off event collection; normal polling queries still use the same 10-second interval as before.
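As a rough sketch of what this parameterization could look like in OCaml (module and value names here are hypothetical, not the exact signatures in graphql_polling_log_engine.ml):

(* Hypothetical sketch of a polling log engine parameterized by its polling
   intervals; not the actual Make_GraphQL_polling_log_engine code. *)

module type Polling_config = sig
  (* How often to retry the StartFilteredLog mutation until it succeeds. *)
  val start_filtered_logs_interval : float

  (* How often to poll for new structured log entries once collection starts. *)
  val log_polling_interval : float
end

(* Example instantiations mirroring the values described above. *)
module Local_config : Polling_config = struct
  let start_filtered_logs_interval = 0.25

  let log_polling_interval = 10.0
end

module Cloud_config : Polling_config = struct
  let start_filtered_logs_interval = 10.0

  let log_polling_interval = 10.0
end

module Make_polling_log_engine (Config : Polling_config) = struct
  (* Keep retrying the mutation quickly so a node starts collecting logs as
     soon as its GraphQL endpoint is available, avoiding the missed
     Node_initialization race described above. *)
  let rec start_filtered_log ~send_mutation =
    match send_mutation () with
    | Ok () -> ()
    | Error _ ->
        Unix.sleepf Config.start_filtered_logs_interval ;
        start_filtered_log ~send_mutation
end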

LIBP2P_KEY_PATH="|}
^ container_libp2p_key_path
^ {|"
# Generate keypair and set permissions if libp2p_key does not exist
@MartinMinkov (Contributor, Author) commented:

I define custom entry points for the daemon, postgres, and archive node containers. Would these be better served in a directory and removed from the inline definition? Where is the best location for these to live if we wanted to move them?

A Contributor replied:

Viewing the infra code, the current standard appears to be to extract this and move it to mina/dockerfiles/puppeteer-context/.

Our src dockerfiles would then be updated to include the files within the docker container image (example can be seen here).

Those files inside the container image could then be referenced by the integration test engine by passing the --entrypoint value with the docker run command.

If I understand correctly, a version of this can be seen in kubernetes_network.ml line 117.

For now, I would leave the format as you have it (to avoid being blocked) and add an issue to the platform team zenhub, to have the additional entrypoints included in the standard docker builds.

(Edit: If you are familiar with the docker build flow in CI, you can attempt adding it as well. I or others on the platform team can pair with you to get it shipped, but I assume this would slow down other ITN/hardfork work and may not be helpful at the moment.)

@stevenplatt (Contributor) commented:

!ci-build-me

@dkijania (Member) commented:

!ci-build-me

@MartinMinkov merged commit 5dacd1e into develop on Jan 11, 2024
38 checks passed
@MartinMinkov deleted the feature/local-engine-network-config-develop branch on January 11, 2024 17:53