Initial draft of local test framework RFC #8896

272 changes: 272 additions & 0 deletions rfcs/0042-local-test-framework.md
# Summary

Currently, the integration test framework can deploy and test a network in a cloud environment (specifically GCP); in addition, we want the ability to deploy and test a network on a user's local machine. The cloud integration uses Terraform and Helm to deploy the network and its specified nodes to a GKE environment. While this method works well for cloud environments, we would like a more lightweight solution that runs locally. Thus, we have chosen `Docker Swarm` as the tool of choice for container orchestration.

Docker Swarm can configure a "swarm" on a local machine and deploy and manage containers on it. Docker Swarm takes as input a `docker-compose` file that specifies all container information and handles deploying those containers onto the local swarm. When we want to run a network test locally, we can create a swarm and deploy all containers from a `docker-compose.json` file built from the specified network configuration. Docker Swarm also aggregates the logs of all running containers, so we do not have to query individual containers for their logs. This gives us a way to apply event filters to specific node types (block producers, snark workers, seed nodes, etc.) and check for test success/failure in a portable way.

# Requirements

The new local testing framework should run on a user's local system, using Docker as its main engine to create a network and spawn nodes. This feature will be built on top of the existing Test Executive, which runs our cloud integration tests. By implementing the interface specified in `src/lib/integration_test_lib/intf.ml`, we will have an abstract way to select different testing engines when running the Test Executive.

The specific interface to implement would be:

```ocaml
(** The signature of integration test engines. An integration test engine
 *  provides the core functionality for deploying, monitoring, and
 *  interacting with networks.
 *)
module type S = sig
  (* unique name identifying the engine (used in test executive cli) *)
  val name : string

  module Network_config : Network_config_intf

  module Network : Network_intf

  module Network_manager :
    Network_manager_intf
    with module Network_config := Network_config
     and module Network := Network

  module Log_engine : Log_engine_intf with module Network := Network
end
```

To implement this interface, a new subdirectory will be created in `src/lib` named `integration_local_engine` to hold all implementation details for the local engine.

The new local testing engine must implement all existing features, which include:

- Starting/Stopping nodes dynamically (see the sketch following this list)

  > **Reviewer comment (Member):** Just wanted to mention as a note here: If you continue using the puppeteer container (which I recommend), then this is very easy to do. You can peek at how this works in the cloud engine, but starting/stopping a puppeteered node is as simple as running a shell script in the container which communicates with the puppeteer process.

- Sending GraphQL queries to running nodes
- Streaming event logs from nodes for further processing
- Spawning nodes based on a test configuration
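
Following the reviewer's note on the first item above, starting/stopping a puppeteered node could amount to exec-ing a control script inside the node's container. The sketch below illustrates this; the script paths `/start.sh` and `/stop.sh` are placeholder names, not the actual scripts shipped in the puppeteered image:

```ocaml
(* A minimal sketch of puppeteer-style node control via docker exec. The script
   paths are hypothetical placeholders for whatever the puppeteer container exposes. *)
open Core
open Async

let exec_in_container ~container ~script =
  let%map (_ : string) =
    Process.run_exn ~prog:"docker" ~args:[ "exec"; container; script ] ()
  in
  ()

let start_node ~container = exec_in_container ~container ~script:"/start.sh"

let stop_node ~container = exec_in_container ~container ~script:"/stop.sh"
```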

Additionally, the test engine should take a Docker image as input in the CLI.

An example command using the local testing framework could look like this:

```bash
$ test_executive local send-payment --mina-image codaprotocol/coda-daemon-puppeteered:1.1.5-compatible --debug | tee test.log | logproc -i inline -f '!(.level in ["Spam", "Debug"])'
```

Note that this is very similar to the current command for invoking the cloud testing framework.

# Detailed Design

## Orchestration:

To handle container orchestration, we will use `Docker Swarm` to spawn and manage containers. Docker Swarm lets us create a cluster and run containers on it while managing their availability. We have opted for Docker Swarm over other orchestration tools like Kubernetes because Docker is much easier to run on a local machine while still giving us many of the same benefits. Kubernetes is more complex and somewhat overkill for what we are trying to achieve with the local testing framework; both tools can handle container orchestration, but the added complexity of Kubernetes does not pay off here. Additionally, if we want community members to use this tool as well, setting up Kubernetes on end-user systems would be even more of a hassle.

Docker Swarm takes a `docker-compose` file from which it generates the desired network state. A cluster is created by issuing `docker swarm init`, which sets up the environment in which all containers will be orchestrated. In the context of our system, we do not need to spread containers across different machines; rather, we will run all containers on the local system. The end result is that all containers run locally while Docker Swarm provides availability and other resource-management options.
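
For illustration, a minimal sketch of how the local engine could drive this lifecycle by shelling out to the Docker CLI is shown below; `Process.run_exn` comes from Async, and error handling and idempotency checks (e.g. detecting an already-initialized swarm) are intentionally omitted:

```ocaml
(* A minimal sketch (not the final implementation) of creating and tearing down
   a single-node local swarm by shelling out to the docker CLI. *)
open Core
open Async

let run_docker args = Process.run_exn ~prog:"docker" ~args ()

(* Creates a swarm with the local machine as its only manager node. *)
let init_local_swarm () =
  let%map (_ : string) = run_docker [ "swarm"; "init" ] in
  ()

(* Leaves (and therefore destroys) the single-node swarm. *)
let teardown_local_swarm () =
  let%map (_ : string) = run_docker [ "swarm"; "leave"; "--force" ] in
  ()
```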

## Creating a docker-compose file for the local engine instead of Terraform for the cloud

In the current cloud architecture, we launch a given network with `Terraform`. We specify a `Network_config.t` data structure, which holds all information necessary to create the network, and it is then transformed into a Terraform configuration like so:

```ocaml
type terraform_config =
  { k8s_context: string
  ; cluster_name: string
  ; cluster_region: string
  ; aws_route53_zone_id: string
  ; testnet_name: string
  ; deploy_graphql_ingress: bool
  ; coda_image: string
  ; coda_agent_image: string
  ; coda_bots_image: string
  ; coda_points_image: string
  ; coda_archive_image: string
        (* this field needs to be sent as a string to terraform, even though it's a json encoded value *)
  ; runtime_config: Yojson.Safe.t
        [@to_yojson fun j -> `String (Yojson.Safe.to_string j)]
  ; block_producer_configs: block_producer_config list
  ; log_precomputed_blocks: bool
  ; archive_node_count: int
  ; mina_archive_schema: string
  ; snark_worker_replicas: int
  ; snark_worker_fee: string
  ; snark_worker_public_key: string }
[@@deriving to_yojson]

type t =
  { coda_automation_location: string
  ; debug_arg: bool
  ; keypairs: Network_keypair.t list
  ; constants: Test_config.constants
  ; terraform: terraform_config }
[@@deriving to_yojson]
```

[https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/coda_automation.ml#L35](https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/coda_automation.ml#L35)

After all configuration has been generated, we launch the network by running `terraform apply`.

We can leverage some of this existing work by specifying a config for Docker Swarm instead. Docker Swarm can consume a `docker-compose` file (which can be written as a `.json` file: [https://docs.docker.com/compose/faq/#can-i-use-json-instead-of-yaml-for-my-compose-file](https://docs.docker.com/compose/faq/#can-i-use-json-instead-of-yaml-for-my-compose-file)) to launch containers in a given swarm environment. The interface can look mostly the same while cutting out much of the information specific to Terraform.

```ocaml
type docker_compose_config =
  { coda_image: string
  ; coda_agent_image: string
  ; coda_bots_image: string
  ; coda_points_image: string
  ; coda_archive_image: string
  ; runtime_config: Yojson.Safe.t
        [@to_yojson fun j -> `String (Yojson.Safe.to_string j)]
  ; block_producer_configs: block_producer_config list
  ; log_precomputed_blocks: bool
  ; archive_node_count: int
  ; mina_archive_schema: string
  ; snark_worker_replicas: int
  ; snark_worker_fee: string
  ; snark_worker_public_key: string }
[@@deriving to_yojson]

type t =
  { coda_automation_location: string
  ; debug_arg: bool
  ; keypairs: Network_keypair.t list
  ; constants: Test_config.constants
  ; docker_compose: docker_compose_config }
[@@deriving to_yojson]
```

By taking a `Network_config.t` value, we can transform the data structure into a corresponding `docker-compose` file that specifies all containers to run, as well as any other configuration.
After generating the `docker-compose` file, we can simply call `docker stack deploy -c local-docker-compose.json testnet_name`.
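
As a sketch of that step, assuming a hypothetical `to_docker_compose_json` helper that performs the `Network_config.t` → compose-document transformation described above, the deploy call could look like this:

```ocaml
(* A minimal sketch. [to_docker_compose_json] is a hypothetical helper that turns a
   [Network_config.t] into the compose document shown later in this section. *)
open Core
open Async

let deploy_network ~(network_config : Network_config.t) ~stack_name =
  let compose_file = "local-docker-compose.json" in
  (* Write the generated compose document to disk... *)
  Yojson.Safe.to_file compose_file (to_docker_compose_json network_config) ;
  (* ...and hand it to docker swarm as a stack named after the testnet. *)
  let%map (_ : string) =
    Process.run_exn ~prog:"docker"
      ~args:[ "stack"; "deploy"; "-c"; compose_file; stack_name ]
      ()
  in
  ()
```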

<img src="./res/local-test-integration-docker-compose.png" alt="drawing" width="500"/>

The resulting `docker-compose.json` file can have a service for each type of node that we want to spawn. Services in Docker Swarm are similar to pods in Kubernetes in that they schedule containers onto nodes to run specified tasks.

A very generic example of what the `docker-compose.json` could look like follows:

```json
{
  "version": "3",
  "services": {
    "block-producer": {
      "image": "codaprotocol/coda-daemon-puppeteered",
      "entrypoint": "/mina-entrypoint.sh",
      "networks": [
        "mina_local_test_network"
      ],
      "deploy": {
        "replicas": 2,
        "restart_policy": {
          "condition": "on-failure"
        }
      }
    },
    "seed-node": {
      "image": "codaprotocol/coda-daemon-puppeteered",
      "entrypoint": "/mina-entrypoint.sh",
      "networks": [
        "mina_local_test_network"
      ],
      "deploy": {
        "replicas": 1,
        "restart_policy": {
          "condition": "on-failure"
        }
      }
    },
    "snark-worker": {
      "image": "codaprotocol/coda-daemon-puppeteered",
      "entrypoint": "/mina-entrypoint.sh",
      "networks": [
        "mina_local_test_network"
      ],
      "deploy": {
        "replicas": 3,
        "restart_policy": {
          "condition": "on-failure"
        }
      }
    }
  },
  "networks": {
    "mina_local_test_network": null
  }
}
```

## Logging:

Docker Swarm aggregates container logs per running service. This makes it easy for us to collect all logs at the service level without addressing individual containers.

The following is an example of the logs aggregated by Docker Swarm with 2 containers running the ping command.

```bash
$ docker service create --name ping --replicas 2 alpine ping 8.8.8.8

$ docker service logs ping
ping.2.odlt7ajje64e@node1 | PING 8.8.8.8 (8.8.8.8): 56 data bytes
...
ping.1.egjtdoz7tvkt@node1 | PING 8.8.8.8 (8.8.8.8): 56 data bytes
...
```

For our use case, we can make each node type a separate service. For example, in our docker-compose configuration we could specify a service for seed nodes, one for block producers, and one for snark workers, and parse the logs for each service individually. We can additionally post-process the logs to determine which container emitted each line, for a more granular view.

These logs can be polled on an interval and processed by a filter as they come in.
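
As an illustration of such a filter, the sketch below drops `Spam` and `Debug` entries before they reach the event parser; it assumes each log line is one of Mina's JSON-formatted log objects with a `level` field:

```ocaml
(* A minimal filtering sketch over a pipe of raw log lines. It assumes each line
   is a JSON object with a "level" field, as in Mina's structured logs. *)
open Core
open Async

let filter_noise (lines : string Pipe.Reader.t) : string Pipe.Reader.t =
  Pipe.filter_map lines ~f:(fun line ->
      match Yojson.Safe.from_string line with
      | exception _ -> None (* drop lines that are not valid JSON *)
      | `Assoc fields -> (
        match List.Assoc.find fields ~equal:String.equal "level" with
        | Some (`String ("Spam" | "Debug")) -> None (* drop noisy log levels *)
        | _ -> Some line )
      | _ -> None )
```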

## Interface To Develop:

The current logging for the cloud framework is done by creating a Google Stackdriver subscription and issuing poll requests for logs while doing some pre-defined filtering.

An example of this is shown below:

```ocaml
let rec pull_subscription_in_background ~logger ~network ~event_writer
    ~subscription =
  if not (Pipe.is_closed event_writer) then (
    [%log spam] "Pulling StackDriver subscription" ;
    let%bind log_entries =
      Deferred.map (Subscription.pull ~logger subscription) ~f:Or_error.ok_exn
    in
    if List.length log_entries > 0 then
      [%log spam] "Parsing events from $n logs"
        ~metadata:[("n", `Int (List.length log_entries))]
    else [%log spam] "No logs were pulled" ;
    let%bind () =
      Deferred.List.iter ~how:`Sequential log_entries ~f:(fun log_entry ->
          log_entry
          |> parse_event_from_log_entry ~network
          |> Or_error.ok_exn
          |> Pipe.write_without_pushback_if_open event_writer ;
          Deferred.unit )
    in
    let%bind () = after (Time.Span.of_ms 10000.0) in
    pull_subscription_in_background ~logger ~network ~event_writer
      ~subscription )
  else Deferred.unit
```

[https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/stack_driver_log_engine.ml#L269](https://github.com/MinaProtocol/mina/blob/67cc4205cc95138cf729a2f14b57b754f9e9204e/src/lib/integration_test_cloud_engine/stack_driver_log_engine.ml#L269)

A similar interface can be written for Docker Swarm. By defining a `Service.pull` function that takes the same logger, we can reuse much of the existing work, modifying only the parts of the code where the log formats diverge. All logs can be directed to an output stream such as stdout, or to a file specified by the user on their local system.
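
A rough sketch of that pulling loop against Docker Swarm, mirroring the Stackdriver version above, might look as follows; `parse_event_from_log_line` is a hypothetical counterpart to `parse_event_from_log_entry`, and `services` is assumed to be the list of swarm service names (one per node type). Note that the review discussion below argues for replacing this polling loop with a push-based pipe.

```ocaml
(* A minimal sketch, not the final Log_engine. [parse_event_from_log_line] is a
   hypothetical helper and [services] the list of swarm service names. *)
open Core
open Async

let rec pull_service_logs_in_background ~logger ~network ~event_writer ~services =
  if not (Pipe.is_closed event_writer) then (
    [%log spam] "Pulling docker service logs" ;
    let%bind () =
      Deferred.List.iter ~how:`Sequential services ~f:(fun service ->
          (* Only request log lines emitted since the previous poll. *)
          let%map output =
            Process.run_exn ~prog:"docker"
              ~args:[ "service"; "logs"; "--raw"; "--since"; "10s"; service ]
              ()
          in
          String.split_lines output
          |> List.iter ~f:(fun line ->
                 line
                 |> parse_event_from_log_line ~network
                 |> Or_error.ok_exn
                 |> Pipe.write_without_pushback_if_open event_writer ) )
    in
    let%bind () = after (Time.Span.of_ms 10000.0) in
    pull_service_logs_in_background ~logger ~network ~event_writer ~services )
  else Deferred.unit
```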

<img src="./res/local-test-integration-logging.png" alt="drawing" width="500"/>

# Work Breakdown/Prio

The following is a breakdown of the work needed to bring this feature to completion:

1. Implement the `Network_config` interface to accept a network configuration and create a corresponding `docker-compose.json` file.
2. Implement the `Network_manager` interface to take the generated `docker-compose.json` file and create a local swarm with the specified container configuration.
3. Implement functionality to collect all container logs into a single stream (stdout or a file; perhaps this can be specified at startup?).
4. Implement event-filtering functionality.
5. Ensure that the current integration test specs run successfully on the local framework.


# Unresolved Questions

- Is generating a docker-compose file the right approach for scheduling the containers? The nice thing about using a docker-compose file is that all network management should be automatic.

  > **Reviewer comment (Member):** Yeah, I think this is the right way to go, especially for the network management bit.


- Is using a different service for each type of node the most effective approach? Would it be better to launch all nodes under the same service in the docker-compose file?

- Is polling each service and then aggregating those logs the best approach? Would it be better to do filtering before aggregating?

  > **Reviewer comment (Member):** I think polling logs in the local integration test engine is not the correct approach.
  >
  > We do a pull-based approach for the cloud engine, and that works out fine for us right now because we pre-filter the logs (so the volume we handle per poll is very low), but more importantly, because we are only running networks with snarks enabled. When snarks are enabled, the networks move pretty slowly, so small delays on the order of 5-10 seconds are not a problem.
  >
  > However, with the local engine, the goal is to run snarkless networks, which move extremely quickly by comparison. In that world, a 5-10 second delay in logs could break the test. Furthermore, the log volume per second of a snarkless network is massive by comparison due to the speed at which the network operates. Because of this, I think we both need to pre-filter the logs in the docker containers (to bring the log volume the single-threaded test_executive needs to handle down to a fraction of the size), and we need to be reading those logs off of a pipe in a more push-based fashion.
  >
  > Another advantage of using a push-based pipe approach here is that you don't need to deal with re-reading logs you already handled, which appears to be a problem you would need to solve with the docker swarm logging interface. You might be able to get by with asking docker swarm to only give you logs recorded since the last time you asked, but you have some risk of missing logs in that approach and are relying on docker swarm's performance to quickly perform that query on its log cache.

  > **Author reply (Contributor):** Followed up with a restructuring of the proposed polling solution to use the push-based pipe approach. 👍


- Does this plan capture the overall direction we want the local testing framework to go?

  > **Reviewer comment (Member):** 👍

3 changes: 3 additions & 0 deletions rfcs/res/local-test-integration-docker-compose.png
3 changes: 3 additions & 0 deletions rfcs/res/local-test-integration-logging.png