Lucy Local Framework extension #14648
Conversation
Import the engine interface that will be implemented for the local engine. Additionally, add the local engine as a CLI flag to the test_executive.
This adds the functionality to create a base docker-compose file from the specified test configs in the integration framework. Each node in the network is assigned to a compose service that runs its own container. This implementation currently supports creating a docker-compose file, creating a docker stack to run each container, and basic monitoring before continuing the rest of the test_executive execution. Lots of code in this PR has been stubbed and copied to make the compiler happy. Much of this code will be refactored to support the changes needed for the local engine later.
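Roughly, the generation step amounts to mapping each configured node to a compose service. A minimal sketch of the idea (the record fields, image name, and YAML rendering below are assumptions, not the PR's actual mina_docker/docker_compose code):

```ocaml
(* Illustrative only: map nodes from a test config to docker-compose services. *)
type service =
  { name : string          (* compose service name, one per node *)
  ; image : string         (* Mina daemon image to run *)
  ; command : string list  (* daemon arguments for this node *)
  }

let render_service s =
  Printf.sprintf "  %s:\n    image: %s\n    command: [%s]\n" s.name s.image
    (String.concat ", " (List.map (Printf.sprintf "%S") s.command))

let render_compose services =
  "version: '3.8'\nservices:\n" ^ String.concat "" (List.map render_service services)

let () =
  print_string
    (render_compose
       [ { name = "block-producer-1"
         ; image = "minaprotocol/mina-daemon"
         ; command = [ "daemon" ] } ])
```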
Adds support for a snark coordinator and snark workers based on the test configurations passed into the test executive. Right now, we only implement one snark coordinator for all snark workers. Additionally, added a new file for all the default Mina node commands and environment variables.
Addressed the feedback given in the Network_Config PR. Made the docker_compose.ml file simpler to read, with additional refactors to mina_docker.ml. Additionally, removed a lot of the copied-over code in swarm_network to clean up the file.
Added support for archive nodes in the local engine. If archive nodes are specified in the test plan, the test executive will download the archive schema to the working directory to use the schema as a bind mount for the postgres container. The archive node will then connect to the postgres container as usual.
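As a rough illustration of the bind-mount idea, the downloaded schema could be rendered into the postgres service's volume list like this sketch (both paths and the helper name are assumptions, not the PR's actual code):

```ocaml
(* Illustrative only: bind-mount the downloaded archive schema into the
   postgres container so it is applied on startup. *)
let postgres_schema_mount ~working_dir =
  let host_schema = Filename.concat working_dir "create_schema.sql" in
  let container_schema = "/docker-entrypoint-initdb.d/create_schema.sql" in
  Printf.sprintf "      - %s:%s" host_schema container_schema

let () = print_endline (postgres_schema_mount ~working_dir:"/tmp/lucy-working-dir")
```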
Renamed node_config to docker_node_config and refactored some of the modules to have more static references baked in. Additionally refactored mina_docker to use these static references instead of string values in many places.
Added functionality for GraphQL networking to the docker containers. This work was mostly just adding the '--insecure-rest-server' flag to the nodes and opening up the corresponding REST port within the docker-compose file. To allow communication with the docker network from localhost, we map host ports starting at 7000 to each node's port 3085, which is used for GraphQL communication.
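A minimal sketch of that host-port assignment (helper and constant names are illustrative; 3085 and the 7000 base come from the description above):

```ocaml
(* Illustrative only: each node's GraphQL/REST port (3085) is published on the
   host starting at 7000, so localhost can reach every container. *)
let graphql_container_port = 3085
let graphql_host_port_base = 7000

let graphql_port_mapping ~node_index =
  Printf.sprintf "%d:%d" (graphql_host_port_base + node_index) graphql_container_port

let () =
  (* e.g. the third node in the network is reachable at localhost:7002 *)
  print_endline (graphql_port_mapping ~node_index:2)
```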
These changes make the graphql_polling_log_engine more reusable by turning it into a functor, which allows it to support different networks, such as the docker network. The old files were deleted as they are no longer needed. error_json was added to the dune file in integration_test_lib for better error handling.
Changed Make_GraphQL_Polling_log_engine to accept a polling interval parameter. This is important because we want to call start_filtered_log as soon as the GraphQL API is available for a node in the docker network. If we wait longer, we have the potential to miss the `Node_initialization` event, which will lead to a test failure. Additionally, exposed the retry_delay_sec value as a function parameter to start_filtered_log.
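A minimal sketch of what exposing that delay looks like (the helper below is a placeholder for the real GraphQL mutation call, not the actual graphql_polling_log_engine code):

```ocaml
(* Illustrative only: retry_delay_sec is a parameter so the local engine can
   retry StartFilteredLog very frequently and begin collecting events as soon
   as a node's GraphQL endpoint responds. *)
let try_start_filtered_log (_graphql_uri : string) : (unit, string) result =
  Ok () (* placeholder for the real StartFilteredLog mutation *)

let rec start_filtered_log ~retry_delay_sec graphql_uri =
  match try_start_filtered_log graphql_uri with
  | Ok () -> ()
  | Error _ ->
      (* endpoint not ready yet: wait and try again *)
      Unix.sleepf retry_delay_sec ;
      start_filtered_log ~retry_delay_sec graphql_uri
```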
This reverts commit b0ca83b.
This reverts commit 8613893.
…gration_test_cloud_engine.ml, graphql_polling_log_engine.ml, and integration_test_local_engine.ml for better clarity and understanding of its purpose
```ocaml
(** This implements Log_engine_intf for integration tests, by creating a simple
    system that polls a mina daemon's graphql endpoint for fetching logs *)
module Make_GraphQL_polling_log_engine
```
I made the GraphQL polling log engine a functor so that both the cloud and the local integrations can utilize it. In addition to making it a functor, I added a parameter to specify an interval to issue GraphQL requests to start collecting events, which is vitally important for the local version.
It's important because Lucy will wait for all nodes to initialize as a test condition. These conditions are checked by issuing GraphQL requests to the nodes to start collecting their events and make them available in a query. There is a race condition where a node can initialize before it has been told to start collecting its logs, which means the node initialization event will be missed and the test fails. To fix this, we make GraphQL requests much more frequently in the local version so that nodes start collecting logs immediately after the GraphQL endpoint is available.
I thought about making the GraphQL availability its own structured event that we listen to, so that we can avoid this interval parameter and just issue a request to collect events right when GraphQL is available. I decided not to do that since we would need to land that change into different branches so that the daemon emits that event, and use newly built Docker images instead of any old ones for testing; I figured this is an easier way to unblock this issue.
The local version issues requests to nodes every 0.25 seconds, and the cloud version keeps the same 10 seconds. This is just to kick off the event collection; normal polling queries still use the same 10-second interval as before.
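In other words, both backends instantiate the same functor and differ only in this interval. A self-contained sketch of the shape, with illustrative module names rather than the PR's exact ones:

```ocaml
(* Illustrative only: the shared polling log engine is a functor over the
   interval used for the initial StartFilteredLog requests. *)
module type Polling_config = sig
  val start_filtered_logs_interval : float (* seconds *)
end

module Make_graphql_polling_log_engine (Config : Polling_config) = struct
  let start_filtered_logs_interval = Config.start_filtered_logs_interval
end

(* Local engine: poll every 0.25 s so Node_initialization is not missed. *)
module Local_config = struct
  let start_filtered_logs_interval = 0.25
end

(* Cloud engine: keep the existing 10 s interval. *)
module Cloud_config = struct
  let start_filtered_logs_interval = 10.0
end

module Local_log_engine = Make_graphql_polling_log_engine (Local_config)
module Cloud_log_engine = Make_graphql_polling_log_engine (Cloud_config)
```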
```ocaml
LIBP2P_KEY_PATH="|}
^ container_libp2p_key_path
^ {|"
# Generate keypair and set permissions if libp2p_key does not exist
```
I define custom entry points for the daemon, postgres, and archive node containers. Would these be better served in a directory and removed from the inline definition? Where is the best location for these to live if we wanted to move them?
Viewing the infra code, the current standard appears to be to extract this and move it to mina/dockerfiles/puppeteer-context/. Our src dockerfiles would then be updated to include the files within the docker container image (example can be seen here). Those files inside the container image could then be referenced by the integration test engine by passing the --entrypoint value with the docker run command.
If I understand correctly, a version of this can be seen in kubernetes_network.ml line 117.
For now, I would leave the format as you have it (to avoid being blocked) and add an issue to the platform team zenhub, to have the additional entrypoints included in the standard docker builds.
(Edit: If you are familiar with the docker build flow in CI, you can attempt adding it as well. Myself or others in the platform team can pair with you to get it shipped, but I am making an assumption this would slow down other ITN/hardfork work and may not be helpful at the moment.)
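For illustration, once such a script ships inside the image, the generated service definition could simply point at it instead of inlining the script body; the path and field rendering below are assumptions, not the repository's actual layout:

```ocaml
(* Illustrative only: reference an entrypoint script baked into the image
   rather than generating it inline in mina_docker. *)
let daemon_entrypoint = "/root/daemon-entrypoint.sh" (* hypothetical path *)

(* Rendered into the service definition of the generated docker-compose file. *)
let entrypoint_line = Printf.sprintf "    entrypoint: [%S]" daemon_entrypoint

let () = print_endline entrypoint_line
```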
!ci-build-me
!ci-build-me
Summary
Implements o1-labs/rfcs#40 🗒️
This PR implements the local extension of Lucy, which uses Docker Swarm to run integration tests on a single host. Extensive testing confirms that all existing integration tests run successfully on the local version. The work to add this to CI is not included here; it will come in a follow-up PR where we can spin up a single machine and run the tests locally.
Changes from the RFC
This PR changes implementation details outlined in the RFC regarding how Lucy communicates with nodes and how logs are gathered. In the RFC, the original plan was to mount a Unix pipe to ingest all logs from each container, which the Lucy process would read from and process into events for test conditions. Instead of that approach, this PR implements the same log-gathering process as the cloud integration, in which we utilize the GraphQL interface of each node in an integration test.
For more details: the cloud integration uses a GraphQL interface to tell nodes to start collecting structured logs and to expose those structured logs via a GraphQL query. When nodes are first launched, Lucy makes a `StartFilteredLog` mutation, which tells the node to start collecting any structured logs it generates. Then, we can query for those structured logs and use them as test conditions over the same GraphQL interface.
This approach is cleaner than the original pipe-based approach because we can leverage the same architecture for both backends. It also makes communication with nodes easier to handle, as mounting a pipe and reading logs can run into performance issues and blocked log writes. Additionally, the pipe-based approach does not let us scale to multiple Docker Swarm nodes if we want to extend this backend to use additional machines beyond a single host.
So, in summary, we do not use a pipe-based approach for logs since we already have GraphQL polling infrastructure that works well, and it can be extended further if we want to add additional features.
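As a minimal sketch of that flow per node (the two helpers stand in for the real StartFilteredLog mutation and log query; this is not the actual engine code):

```ocaml
(* Illustrative only: tell the node to start filtering structured logs, then
   poll its GraphQL endpoint for new entries and feed them to the event
   handler used for test conditions. *)
let start_filtered_log (_graphql_uri : string) : (unit, string) result =
  Ok () (* placeholder for the StartFilteredLog mutation *)

let fetch_filtered_logs (_graphql_uri : string) : string list =
  [] (* placeholder for the structured-log query *)

let poll_node ~graphql_uri ~polling_interval_sec ~handle_event =
  match start_filtered_log graphql_uri with
  | Error e -> Error e
  | Ok () ->
      (* In the real engine this loop runs until the test finishes. *)
      for _i = 1 to 3 do
        List.iter handle_event (fetch_filtered_logs graphql_uri) ;
        Unix.sleepf polling_interval_sec
      done ;
      Ok ()
```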
Details on machine performance
My machine has an AMD Ryzen 9 5900X 12-core processor and 64GB of RAM. While running tests, the most significant strain on my machine occurs during the bootstrap phase, when nodes initialize. While a node is initializing, it uses about 6 CPU threads and around 4-5GB of memory, and resource usage then flattens once the node is synced. The next largest strain comes from SNARK workers, which take up 6-8 CPU threads. I haven't had any issues with tests failing due to limited resources, even when spinning up 8+ nodes during integration tests. These strains are burst-oriented: when nodes initialize or SNARK workers are computing, a burst of resources is used, and usage flatlines when that work is done.
Running the tests
If you would like to run an integration test, you can use the following script as an example: