Lucy Local Framework extension #14648
Conversation
Import the engine interface that will be implemented for the local engine. Additionally, add the local engine as a CLI flag to the test_executive.
This adds the functionality to create a base docker-compose file from the specified test configs in the integration framework. Each node in the network is assigned to a compose service that runs its own container. This implementation currently supports creating a docker-compose file, creating a docker stack to run each container, and basic monitoring before continuing the rest of the test_executive execution. Lots of code in this PR has been stubbed and copied to make the compiler happy. Much of this code will be refactored to support the changes needed for the local engine later.
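Roughly, the generation step amounts to mapping each configured node to a compose service. A minimal sketch of the idea (the record fields, image name, and YAML rendering below are assumptions, not the PR's actual mina_docker/docker_compose code):

```ocaml
(* Illustrative only: map nodes from a test config to docker-compose services. *)
type service =
  { name : string          (* compose service name, one per node *)
  ; image : string         (* Mina daemon image to run *)
  ; command : string list  (* daemon arguments for this node *)
  }

let render_service s =
  Printf.sprintf "  %s:\n    image: %s\n    command: [%s]\n" s.name s.image
    (String.concat ", " (List.map (Printf.sprintf "%S") s.command))

let render_compose services =
  "version: '3.8'\nservices:\n" ^ String.concat "" (List.map render_service services)

let () =
  print_string
    (render_compose
       [ { name = "block-producer-1"
         ; image = "minaprotocol/mina-daemon"
         ; command = [ "daemon" ] } ])
```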
Adds support for a snark coordinator and snark workers based on the test configurations passed into the test executive. Right now, we only implement one snark coordinator for all snark workers. Additionally, added a new file for all the default Mina node commands and environment variables.
Addressed the feedback given in the Network_Config PR. Made the docker_compose.ml file simpler to read, with additional refactors to mina_docker.ml. Additionally, removed a lot of the copied-over code in swarm_network to clean up the file.
Added support for archive nodes in the local engine. If archive nodes are specified in the test plan, the test executive will download the archive schema to the working directory to use the schema as a bind mount for the postgres container. The archive node will then connect to the postgres container as usual.
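As a rough illustration of the bind-mount idea, the downloaded schema could be rendered into the postgres service's volume list like this sketch (both paths and the helper name are assumptions, not the PR's actual code):

```ocaml
(* Illustrative only: bind-mount the downloaded archive schema into the
   postgres container so it is applied on startup. *)
let postgres_schema_mount ~working_dir =
  let host_schema = Filename.concat working_dir "create_schema.sql" in
  let container_schema = "/docker-entrypoint-initdb.d/create_schema.sql" in
  Printf.sprintf "      - %s:%s" host_schema container_schema

let () = print_endline (postgres_schema_mount ~working_dir:"/tmp/lucy-working-dir")
```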
Renamed node_config to docker_node_config and refactored some of the modules to have more static references baked in. Additionally refactored mina_docker to use these static references instead of string values in many places.
Added functionality for GraphQL networking to the docker containers. This work was mostly just adding the '--insecure-rest-server' flag to the nodes and opening up the corresponding REST port within the docker-compose file. To allow communication with the docker network from localhost, we map host ports starting at 7000 to each node's port 3085, which is used for GraphQL communication.
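A minimal sketch of that host-port assignment (helper and constant names are illustrative; 3085 and the 7000 base come from the description above):

```ocaml
(* Illustrative only: each node's GraphQL/REST port (3085) is published on the
   host starting at 7000, so localhost can reach every container. *)
let graphql_container_port = 3085
let graphql_host_port_base = 7000

let graphql_port_mapping ~node_index =
  Printf.sprintf "%d:%d" (graphql_host_port_base + node_index) graphql_container_port

let () =
  (* e.g. the third node in the network is reachable at localhost:7002 *)
  print_endline (graphql_port_mapping ~node_index:2)
```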
These changes make the graphql_polling_log_engine more reusable by turning it into a functor, which allows it to support different networks, such as the docker network. The old files were deleted as they are no longer needed. error_json was added to the dune file in integration_test_lib for better error handling.
Changed Make_GraphQL_Polling_log_engine to accept a polling interval parameter. This is important because we want to call start_filtered_log as soon as the GraphQL API is available for a node in the docker network. If we wait longer, we have the potential to miss the `Node_initialization` event, which will lead to a test failure. Additionally, exposed the retry_delay_sec value as a function parameter to start_filtered_log.
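A minimal sketch of what exposing that delay looks like (the helper below is a placeholder for the real GraphQL mutation call, not the actual graphql_polling_log_engine code):

```ocaml
(* Illustrative only: retry_delay_sec is a parameter so the local engine can
   retry StartFilteredLog very frequently and begin collecting events as soon
   as a node's GraphQL endpoint responds. *)
let try_start_filtered_log (_graphql_uri : string) : (unit, string) result =
  Ok () (* placeholder for the real StartFilteredLog mutation *)

let rec start_filtered_log ~retry_delay_sec graphql_uri =
  match try_start_filtered_log graphql_uri with
  | Ok () -> ()
  | Error _ ->
      (* endpoint not ready yet: wait and try again *)
      Unix.sleepf retry_delay_sec ;
      start_filtered_log ~retry_delay_sec graphql_uri
```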
This reverts commit b0ca83b.
This reverts commit 8613893.
…gration_test_cloud_engine.ml, graphql_polling_log_engine.ml, and integration_test_local_engine.ml for better clarity and understanding of its purpose
```ocaml
(** This implements Log_engine_intf for integration tests, by creating a simple
    system that polls a mina daemon's graphql endpoint for fetching logs *)
module Make_GraphQL_polling_log_engine
```
I made the GraphQL polling log engine a functor so that both the cloud and the local integrations can utilize it. In addition to making it a functor, I added a parameter to specify an interval to issue GraphQL requests to start collecting events, which is vitally important for the local version.
It's important because Lucy will wait for all nodes to initialize as a test condition. These conditions are checked by issuing GraphQL requests to the nodes to start collecting their events and make them available in a query. There is a race condition where a node can initialize before it has been told to start collecting its logs, which means the node initialization event will be missed and the test fails. To fix this, we make GraphQL requests much more frequently in the local version so that nodes start collecting logs immediately after the GraphQL endpoint is available.
I thought about making the GraphQL availability its own structured event that we listen to, so that we can avoid this interval parameter and just issue a request to collect events right when GraphQL is available. I decided not to do that since we would need to land that change into different branches so that the daemon emits that event, and use newly built Docker images instead of any old ones for testing; I figured this is an easier way to unblock this issue.
The local version issues requests to nodes every 0.25 seconds, and the cloud version keeps the same 10 seconds. This is just to kick off the event collection; normal polling queries still use the same 10-second interval as before.
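In other words, both backends instantiate the same functor and differ only in this interval. A self-contained sketch of the shape, with illustrative module names rather than the PR's exact ones:

```ocaml
(* Illustrative only: the shared polling log engine is a functor over the
   interval used for the initial StartFilteredLog requests. *)
module type Polling_config = sig
  val start_filtered_logs_interval : float (* seconds *)
end

module Make_graphql_polling_log_engine (Config : Polling_config) = struct
  let start_filtered_logs_interval = Config.start_filtered_logs_interval
end

(* Local engine: poll every 0.25 s so Node_initialization is not missed. *)
module Local_config = struct
  let start_filtered_logs_interval = 0.25
end

(* Cloud engine: keep the existing 10 s interval. *)
module Cloud_config = struct
  let start_filtered_logs_interval = 10.0
end

module Local_log_engine = Make_graphql_polling_log_engine (Local_config)
module Cloud_log_engine = Make_graphql_polling_log_engine (Cloud_config)
```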
```ocaml
LIBP2P_KEY_PATH="|}
^ container_libp2p_key_path
^ {|"
# Generate keypair and set permissions if libp2p_key does not exist
```
I define custom entry points for the daemon, postgres, and archive node containers. Would these be better served in a directory and removed from the inline definition? Where is the best location for these to live if we wanted to move them?
Viewing the infra code, the current standard appears to be to extract this and move it to mina/dockerfiles/puppeteer-context/. Our src dockerfiles would then be updated to include the files within the docker container image (example can be seen here). Those files inside the container image could then be referenced by the integration test engine by passing the --entrypoint value with the docker run command.
If I understand correctly, a version of this can be seen in kubernetes_network.ml line 117.
For now, I would leave the format as you have it (to avoid being blocked) and add an issue to the platform team zenhub, to have the additional entrypoints included in the standard docker builds.
(Edit: If you are familiar with the docker build flow in CI, you can attempt adding it as well. Myself or others in the platform team can pair with you to get it shipped, but I am making an assumption this would slow down other ITN/hardfork work and may not be helpful at the moment.)
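For illustration, once such a script ships inside the image, the generated service definition could simply point at it instead of inlining the script body; the path and field rendering below are assumptions, not the repository's actual layout:

```ocaml
(* Illustrative only: reference an entrypoint script baked into the image
   rather than generating it inline in mina_docker. *)
let daemon_entrypoint = "/root/daemon-entrypoint.sh" (* hypothetical path *)

(* Rendered into the service definition of the generated docker-compose file. *)
let entrypoint_line = Printf.sprintf "    entrypoint: [%S]" daemon_entrypoint

let () = print_endline entrypoint_line
```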
!ci-build-me
!ci-build-me
Summary
Implements o1-labs/rfcs#40 🗒️
This PR implements the local extension of Lucy, which uses Docker Swarm to run integration tests on a single host. Extensive testing confirms that all existing integration tests run successfully on the local version. The work to add this to CI is not included here; it will come in a follow-up PR where we can spin up a single machine and run the tests locally.
Changes from the RFC
This PR changes implementation details outlined in the RFC regarding how Lucy communicates with nodes and how logs are gathered. In the RFC, the original plan was to mount a Unix pipe to ingest all logs from each container, which the Lucy process would read from and process into events for test conditions. Instead of that approach, this PR implements the same log-gathering process as the cloud integration, in which we utilize the GraphQL interface of each node in an integration test.
For more details: the cloud integration uses a GraphQL interface to tell nodes to start collecting structured logs and to expose those structured logs via a GraphQL query. When nodes are first launched, Lucy makes a `StartFilteredLog` mutation, which tells the node to start collecting any structured logs it generates. Then, we can query for those structured logs and use them as test conditions over the same GraphQL interface.
This approach is cleaner than the original pipe-based approach because we can leverage the same architecture for both backends. It also makes communication with nodes easier to handle, as mounting a pipe and reading logs can run into performance issues and blocked log writes. Additionally, the pipe-based approach does not let us scale to multiple Docker Swarm nodes if we want to extend this backend to use additional machines beyond a single host.
So, in summary, we do not use a pipe-based approach for logs since we already have GraphQL polling infrastructure that works well, and it can be extended further if we want to add additional features.
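As a minimal sketch of that flow per node (the two helpers stand in for the real StartFilteredLog mutation and log query; this is not the actual engine code):

```ocaml
(* Illustrative only: tell the node to start filtering structured logs, then
   poll its GraphQL endpoint for new entries and feed them to the event
   handler used for test conditions. *)
let start_filtered_log (_graphql_uri : string) : (unit, string) result =
  Ok () (* placeholder for the StartFilteredLog mutation *)

let fetch_filtered_logs (_graphql_uri : string) : string list =
  [] (* placeholder for the structured-log query *)

let poll_node ~graphql_uri ~polling_interval_sec ~handle_event =
  match start_filtered_log graphql_uri with
  | Error e -> Error e
  | Ok () ->
      (* In the real engine this loop runs until the test finishes. *)
      for _i = 1 to 3 do
        List.iter handle_event (fetch_filtered_logs graphql_uri) ;
        Unix.sleepf polling_interval_sec
      done ;
      Ok ()
```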
Details on machine performance
My machine has an AMD Ryzen 9 5900X 12-core processor and 64GB of RAM. While running tests, the most significant strain on my machine occurs during the bootstrap phase, when nodes initialize. While a node is initializing, it uses about 6 CPU threads and around 4-5GB of memory, and resource usage then flattens once the node is synced. The next largest strain comes from SNARK workers, which take up 6-8 CPU threads. I haven't had any issues with tests failing due to limited resources, even when spinning up 8+ nodes during integration tests. These strains are burst-oriented: when nodes initialize or SNARK workers are computing, a burst of resources is used, and usage flatlines when that work is done.
Running the tests
If you would like to run an integration test, you can use the following script as an example: