[0011] Local Integration Extension for Lucy #40

# Local Integration Extension to Lucy

## Summary

[Lucy](https://github.com/MinaProtocol/mina/blob/compatible/src/app/test_executive/README.md) is an end-to-end integration testing framework for the Mina protocol. It is designed to test the protocol under configurable network topologies and node configurations.

This RFC proposes an extension to Lucy to allow for local integration testing. This will make development of the protocol easier and allow contributors to run integration tests on their local machines without having to set up a cloud environment.

## Motivation

Lucy has an extensible architecture that allows for the addition of different networking and deployment backends. Currently, Lucy runs on a Kubernetes cloud backend (specifically GKE), using Terraform to deploy a Kubernetes cluster and Helm to deploy the network and the specified nodes. While this method works for cloud environments, we would like a more lightweight solution: a local backend that can run the integration tests on a local machine.

Instead of using a Kubernetes backend to run the integration tests in the cloud, we can use containers to run and deploy the network configuration on a local machine. The aim is for these containers to be lighter-weight than a full Kubernetes cluster and to allow for easier development of the protocol.

## Detailed Design

### Functional Requirements

To implement a local backend for Lucy, we will use Docker containers to run the network configuration on a local machine. Docker containers are a lightweight way to run applications in an isolated, consistent environment. By using them, we can run the network configuration locally without having to worry about the specifics of each developer's machine.

We want the following properties for the local backend:

- **Lightweight**: The local backend should not require a lot of resources, so the network can run on an ordinary developer machine.
- **Easy to use**: The local backend should require minimal setup, so developers do not have to think about the details of network configuration.
- **Fast**: The local backend should start and run quickly, so developers get a fast feedback loop.

### Orchestration

To run containers locally, we will use [Docker Swarm](https://docs.docker.com/engine/swarm/) as the main driver. Docker Swarm is a container orchestration tool that can run containers on a local machine or on a cluster of machines. It is built into the Docker Engine, so no separate installation is needed to use it. The main reason to use it over [Docker Compose](https://docs.docker.com/compose/) is that Docker Swarm has multi-host networking built in, so if we later want to extend Lucy to run on a cluster of machines with Docker, we can do so easily. Docker Swarm takes a swarm file as input, which specifies all the containers to run and their configurations. To create a correct swarm file for our integration tests, we define a mapping from the network configuration used to specify a test to a swarm file: the network configuration is defined at the beginning of each integration test and is translated into a corresponding swarm file with the specified network topology, which we then use to run the network on a local machine.
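
As a rough illustration of that mapping, here is a minimal sketch of a simplified network-config type and a single-pass function that renders it as a compose-format swarm file. The type, field names, and rendering details are hypothetical; the real network config in Lucy carries far more information.

```ocaml
(* Hypothetical sketch: a simplified network-config type and a single-pass
   renderer that emits a compose-format swarm file. Illustrative only. *)

type node_config =
  { name : string                (* service name, e.g. "mina-node-1" *)
  ; image : string               (* daemon image to run *)
  ; ports : (int * int) list }   (* (host, container) port mappings *)

type network_config = { version : string; nodes : node_config list }

(* Render one service entry of the swarm file. *)
let render_node buf { name; image; ports } =
  Buffer.add_string buf (Printf.sprintf "  %s:\n    image: %s\n" name image) ;
  if ports <> [] then (
    Buffer.add_string buf "    ports:\n" ;
    List.iter
      (fun (host, cont) ->
        Buffer.add_string buf (Printf.sprintf "      - \"%d:%d\"\n" host cont))
      ports )

(* Single pass over the config, producing the swarm file contents. *)
let to_swarm_file (cfg : network_config) : string =
  let buf = Buffer.create 256 in
  Buffer.add_string buf (Printf.sprintf "version: \"%s\"\nservices:\n" cfg.version) ;
  List.iter (render_node buf) cfg.nodes ;
  Buffer.contents buf
```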

Using Docker Swarm allows us to handle the orchestration of containers without having to worry about the specifics of running containers on a local machine. Docker Swarm will handle the networking, log collection, and resource management for us.

### Extending Lucy

Lucy is designed with an extensible architecture that allows for the addition of different networking and deployment backends. Lucy defines an [OCaml module interface (named `Engine`)](https://github.com/MinaProtocol/mina/blob/compatible/src/lib/integration_test_lib/intf.ml), which offers an abstract way to specify different testing engines when running the test executive. It defines several important behaviours that must be implemented by any backend engine, and is the contract between Lucy and the backend engine for how the network is deployed and managed. The `Engine` module interface defines the following (a simplified sketch follows the list):

- [`Network_config_intf`](https://github.com/MinaProtocol/mina/blob/c396e82af6da7f69817b8885e46b4da94715e27e/src/lib/integration_test_lib/intf.ml#L20):
  - Takes as input a user-defined network topology configuration and generates a corresponding file that specifies how the backend engine should deploy the network. The current Kubernetes backend generates a Terraform file that specifies the Kubernetes cluster configuration; the local backend will generate a swarm file that specifies the network topology configuration.
- [`Network_intf`](https://github.com/MinaProtocol/mina/blob/c396e82af6da7f69817b8885e46b4da94715e27e/src/lib/integration_test_lib/intf.ml#L39):
  - Specifies how to interact with the deployed network: starting/stopping nodes, checking the status of nodes, querying information about the network itself, etc. The Kubernetes backend uses the Kubernetes API to interact with the deployed network, issuing commands like `kubectl exec ...`; the local backend will operate in the same way but use the Docker API instead (e.g. `docker exec ...`).
- [`Network_manager_intf`](https://github.com/MinaProtocol/mina/blob/c396e82af6da7f69817b8885e46b4da94715e27e/src/lib/integration_test_lib/intf.ml#L99):
  - Specifies how to deploy the network, with functions like `create`, `destroy`, `deploy`, and `cleanup`. For the Kubernetes backend, this means deploying a Kubernetes cluster to GKE and initializing the network; for the local backend, it means running the swarm file and initializing the network.
- [`Log_engine_intf`](https://github.com/MinaProtocol/mina/blob/c396e82af6da7f69817b8885e46b4da94715e27e/src/lib/integration_test_lib/intf.ml#L115):
  - Specifies how to gather logs from the network. This is important because it is how Lucy gathers network information and confirms whether integration test conditions pass or fail. The Kubernetes backend polls for network state via the Mina GraphQL API and stores those logs in Google Stackdriver; the local backend will gather logs from the network through a pipe in a similar fashion.
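
To make the shape of this contract concrete, here is a heavily simplified, hypothetical sketch that collapses the four interfaces into one OCaml module type. The real signatures in `intf.ml` are split across several module types and use Async, so this is only an orientation aid:

```ocaml
(* Heavily simplified, hypothetical sketch of the Engine contract.
   Everything here is illustrative; see intf.ml for the real interfaces. *)
module type Engine_sketch = sig
  type network_config
  type network

  (* Network_config_intf: turn a test's topology spec into a deployment
     artifact (a Terraform file for GKE, a swarm file locally). *)
  val render_config : network_config -> string

  (* Network_manager_intf: bring the network up and tear it down. *)
  val deploy : network_config -> network
  val destroy : network -> unit

  (* Network_intf: interact with a running network,
     e.g. via `kubectl exec` or `docker exec`. *)
  val run_on_node : network -> node:string -> cmd:string list -> string

  (* Log_engine_intf: stream log lines back to the test executive. *)
  val next_log_line : network -> string option
end
```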

### Node Communication/Logging

Node logging and information gathering is done via [GraphQL queries](https://github.com/MinaProtocol/mina/blob/compatible/src/lib/integration_test_lib/graphql_requests.ml) to the Mina daemon. These queries are what allow Lucy to gather information about the network and confirm whether integration test conditions pass or fail. When the Kubernetes backend is deployed, nodes are created from the user-defined network configuration, started, and then [polled for logs](https://github.com/MinaProtocol/mina/blob/78535ae3a73e0e90c5f66155365a934a15535779/src/lib/integration_test_cloud_engine/graphql_polling_log_engine.ml#L122) and network information. This polling approach works for the Kubernetes backend because the logs are pre-filtered (so the volume of logs is very low) and log parsing is fast enough in that case. However, polling can become a performance bottleneck for the local backend: SNARKless networks operate much faster, so they produce a massive log volume per second by comparison, and a 5-10 second delay in log gathering could break a running test. For this reason, polling is not the best approach for the local backend.

Instead, the local backend will use a push-based approach, where all logs from nodes are forwarded to Lucy. This will be implemented using a [named pipe](https://man7.org/linux/man-pages/man3/mkfifo.3.html) shared between all containers in the network. On startup, Lucy will create the pipe in the current directory and include it as a bind mount for each container in the swarm file, so the pipe acts as the communication channel between all container logs and Lucy. Furthermore, we tag all container logs with a unique identifier so that Lucy can filter logs by container. This lets us deal with scenarios where we read duplicate logs and allows for easier filtering.
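
As a minimal sketch of Lucy's side of this mechanism (the pipe path, the permissions, and the `<container-id>: <log line>` tag format are assumptions drawn from this RFC, not existing Lucy code):

```ocaml
(* Minimal sketch of Lucy's side of the shared pipe (requires the unix
   library). Paths, permissions, and tag format are assumptions. *)

let pipe_path = "./pipe"

(* Create the FIFO that every container bind-mounts. *)
let create_pipe () =
  if not (Sys.file_exists pipe_path) then Unix.mkfifo pipe_path 0o640

(* Block on the pipe and hand each tagged line to a callback, splitting
   off the container-id prefix added by the entrypoint script. *)
let consume_logs handle_line =
  let ic = open_in pipe_path in (* blocks until a writer connects *)
  (try
     while true do
       let line = input_line ic in
       match String.index_opt line ':' with
       | Some i ->
           let container = String.sub line 0 i in
           let msg = String.sub line (i + 1) (String.length line - i - 1) in
           handle_line ~container ~msg:(String.trim msg)
       | None -> handle_line ~container:"unknown" ~msg:line
     done
   with End_of_file -> ()) ;
  close_in ic
```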

A further optimization is to apply [`logproc`](https://github.com/MinaProtocol/mina/blob/compatible/src/app/logproc/logproc.ml) to all container output before it is written to the pipe. `logproc` can filter logs by log level, which reduces both the volume of logs written to the pipe and the number of logs that Lucy has to parse.

For example, we will write an entrypoint script for each container that redirects stdout and stderr to the named pipe, prefixing each line with the container name. This allows us to filter logs by container name and to deal with duplicate logs.

> **Review comment:** The last 3 sections describe what you *may* do. Can you be more specific about what you will do rather than what you may do? Either say "this will happen", or group the optional items into a section of "Future optimizations" and remove the language. It's okay to include optimizations that are optional based on criteria such as "if when I run test X, it takes longer than Y, then we will do Z".
>
> **Author reply:** Sounds good! I clarified the text to be more clear about what we will be building in this section. These were intended to be built. 👍

```bash
#!/bin/bash
# Entrypoint log processor: tag each line with the container id, filter
# out low-level logs with logproc, and append the result to the shared pipe.
NAMED_PIPE="/path/to/named_pipe"
CONTAINER_ID=${CONTAINER_ID:-"unknown_container"}

prepend_container_id() {
  while IFS= read -r line; do
    echo "$CONTAINER_ID: $line"
  done
}

cat - | prepend_container_id | logproc -i inline -f '!(.level in ["Spam", "Debug"])' | tee -a "$NAMED_PIPE"
```

Then, inside the swarm file, we can specify the entrypoint script to use for each container:

```yaml
version: "3.8"
services:
  node:
    image: minaprotocol/mina-daemon:...
    entrypoint: /bin/bash -c "/path/to/puppeteer-context/start.sh $$@ | /usr/local/bin/log_processor.sh"
    environment:
      - CONTAINER_ID=mina-node-1
    volumes:
      - ./pipe:/path/to/pipe
      - ./log_processor.sh:/usr/local/bin/log_processor.sh
```

## Drawbacks

Because we are running all Mina daemon nodes on a single local machine, we will be limited by the resources of that machine, which means we will not be able to run large networks locally. This is an acceptable trade-off: smaller networks can run on a single machine, and larger networks can run in a cloud environment.

## Rationale and alternatives

By using Docker containers, we can run the network configuration on a local machine without having to worry about the specifics of running the network. Containers are also portable and consistent across machines, which makes it easy to run the network anywhere (e.g. inside CI).

One could argue that we could leverage the existing Kubernetes infrastructure and run Kubernetes locally instead of targeting the cloud. This is a valid alternative, but the existing Kubernetes backend is tightly coupled to GKE, which makes it difficult to port to a local cluster. Furthermore, using containers for local network deployment offers a simpler mental model for how the network is deployed and managed than Kubernetes does.

> **Review comment (Platform Engineering perspective):** While I fully support the idea that developers should use testing tools closer to their domain, I wonder if the ease of configuration offered by docker-compose may lure us away from providing a test environment closer to what is used in a real production/testing environment. In summary, investing in an efficient local Kubernetes setup leveraging a Helm infrastructure backend would produce desirable by-products, including: optimization of existing infra components for local tests (e.g., Helm Charts), and easing the transition to CD by introducing it early in the development process (i.e., shift-left). If this seems interesting, I can help identify (some of) the requirement gaps.
>
> **Author reply:** Fair points! The idea of adopting Kubernetes or Docker was discussed early on when this tool was thought about. To answer the question of what technology we need, I think it's important to establish what we want this tool really to do. To me (and I could be wrong here, this is why we need this RFC!), this tool is meant to be a solution that engineers can run on their local machine to do a "quick-pass" on features that are written, with a quick feedback loop. It's not my current understanding that this is intended to be run in a cluster on CI (or if it is, just on one machine); I was assuming that is what the cloud integration of Lucy is attempting to solve. If we do want to run this in CI, is it intended to run on a single server or on a cluster? If we start zooming out and thinking about using this local integration across a cluster, it feels close to what we do in GKE currently. Maybe this is something that @bkase, @nholland94, and @deepthiskumar can shed some light on?
>
> Assuming this tool is intended to run on a developer's machine (and/or a single server in CI), I think we should optimize for developer experience over similarity of deployment environments. Using a Kubernetes backend to deploy a network locally and run tests on it makes things more complicated for the developer trying to understand where tests fail: we would have to introduce more pieces like Helm charts and deal with Kubernetes networking, which in my opinion leads to a worse developer experience. Also, if we plan to use this in CI on just one machine, I'm not sure the choice between Kubernetes and Docker matters too much; what are your thoughts? If we use Docker containers, things are kept lean and simple: a simple compose file is used as the deployment, which is easy to understand, and there isn't much additional tooling needed. Things will be faster and easier to install and set up, with much less overhead to think about. What functional requirements are important for this tool to achieve, in your opinion? That is really the thing we want consensus on. Another question: given your experience deploying our nodes in a Kubernetes environment, are there implementation details we have to worry about regarding network differences? I was assuming Docker's local network will just work without issues; I could be wrong here. 😅
>
> **Reviewer reply:** I was thinking that even if the introduction of docker-compose deployments could be very simple, it will end up adding something more to maintain and be aware of. I think that with fairly few fixes the current Helm Charts could be made as generic as needed to be deployed on local Kubernetes instances on the dev computer (e.g., KinD, K3s, Minikube, or other alternatives will do). Anyway, this is an interesting discussion as it deals with a tradeoff, as you have identified: dev experience vs. deployment reliability. The former aims at offering simplicity as a heavy abstraction for fast feature tests, while the latter attempts to extend and use existing infra artifacts. Would love to spark a discussion here, and, relevantly, to derive more specific requirements about this tool and its introduction as a test tool in CI. Any insight @stevenplatt @dkijania?
>
> **Reply:** Good call out @SanabriaRusso. From a higher level, I think this brings us back to the question of boundaries. In a perfect world, when platform-eng isn't busy with other important work, we could argue this "execution layer" should be owned by platform-eng, though even then it's not clear. In practice with this particular project, an ownership boundary that was too high up has caused the integration testing framework to not evolve as quickly as we would have liked. Anyway, this is why we're experimenting with pushing the boundary between team ownership slightly downwards from where it was before. Also, pragmatically, @MartinMinkov has a PR that is 85% complete which implements the docker backend! So we can bring this to production within a couple of weeks rather than taking on the much larger task of starting from "scratch". This doesn't mean we can't throw it out later if the Helm Charts get to the point where they run successfully locally. So I'd suggest: let's finish this project that's 85% done, start relying on running the tests locally and on a single CI server, and then separately prioritize cleaning up the Helm Charts. Whenever that is done, works, and is cleaner than the docker-compose solution (i.e., behaves at least as well, without confusing devs when they're iterating locally), we can evaluate throwing out the docker-compose backend at that point.

If we do not implement this RFC, contributors will remain unable to run integration tests on their local machines, which makes local development of the protocol difficult.

## Unresolved questions

- What are the limits of the local backend? For a given hardware configuration, it is not clear how many nodes we can run on a single machine. This is something we will have to test and figure out.

- Are there differences in how the chosen tools operate between operating systems? Could there be performance or stability issues on certain operating systems? This also needs testing.

- Are containers still built with [puppeteer](https://github.com/MinaProtocol/mina/tree/develop/dockerfiles/puppeteer-context) to allow for dynamic control of starting/stopping nodes? It seems like they are [here](https://github.com/MinaProtocol/mina/blob/78535ae3a73e0e90c5f66155365a934a15535779/dockerfiles/Dockerfile-mina-daemon#L90), but confirmation would be appreciated. If not, what are the alternatives for defining node behaviour?

> **Review comment:** Can you add one more layer of detail here? Since you've already done a good chunk of the work, it should be fairly easy for you to figure out how it works. I.e., turn the current description into something like: "The `Local_network_config` will have a type like ... and will include logic for encoding it into a docker-swarm file by doing a single pass and pretty-printing the YAML using library X. Here's a code snippet of one such function..." And can you do the same treatment for how the pipe will be shared for the logging?
>
> **Author reply:** Sure thing! I added a note about the config type here. The section for the pipe is already included; it will be shared using volumes across each service configuration.