
[0011] Local Integration Extension for Lucy #40

Open · wants to merge 13 commits into main
Conversation


@MartinMinkov MartinMinkov commented Nov 20, 2023

This RFC introduces an extension to the Lucy integration testing framework for the Mina protocol, enabling local integration testing using Docker containers. This new approach is designed to be lightweight, easy to use, and fast, allowing developers and contributors to run integration tests on their local machines without needing a cloud environment.

@MartinMinkov MartinMinkov marked this pull request as ready for review November 21, 2023 17:04

By using Docker containers, we can run the network configuration on a local machine without having to worry about the specifics of running the network. Containers are also portable and consistent across all machines, which makes it easy to run the network on any machine (e.g. inside CI).

One could argue that we could leverage the existing Kubernetes infrastructure and use Kubernetes locally instead of targeting the cloud. This is a valid alternative, but the existing Kubernetes backend is tightly coupled to GKE, which makes it difficult to port the existing backend to target a local cluster instead. Furthermore, using containers for local network deployment offers a simpler mental model for how the network is deployed and managed compared to Kubernetes.

@SanabriaRusso

I would like to introduce a Platform Engineering perspective regarding this point 👀 :

While I fully support the idea that developers should use testing tools closer to their domain, I wonder if the ease of configuration offered by docker-compose may lure us away from providing a test environment closer to what is used in a real production/testing environment 🤔.

For instance, creating a local Kubernetes environment and deploying networks with a Helm infrastructure backend would:

  • (bad) increase the resource footprint of the environment, but
  • (nice) provide an environment leveraging the existing Charts used in main/test-nets, that is, workflows closer to what is actually done for deployments.
  • (bad?) require teams to get involved in the design of Helm Charts (with Plat. Eng. Team support) for what is to be deployed, but...
  • (nice!) ...that's the point! IMHO this is key for enabling efficient Continuous Deployment (CD) workflows, with added benefits like idempotent deployments, Git as the source of truth and GitOps, ownership, same tooling, etc.

In summary, the goal of this perspective is to highlight that investing in an efficient local Kubernetes setup leveraging a Helm infrastructure backend would produce desirable by-products: it would optimize existing infra components (e.g., Helm Charts) for local tests, and it would ease the transition to CD by introducing it early in the development process (i.e., shift-left).

If this seems interesting, I can help identify some of the requirement gaps 💯.


@MartinMinkov MartinMinkov Nov 23, 2023


Fair points!

The idea of adopting Kubernetes or Docker was discussed early on when this tool was first conceived. To answer the question of which technology we need, I think it's important to establish what we really want this tool to do. To me (and I could be wrong here, this is why we need this RFC!), this tool is meant to be something engineers can run on their local machine to do a "quick pass" on features as they're written, with a fast feedback loop. My current understanding is that it's not intended to run in a cluster on CI (or if it is, only on one machine); I was assuming that's what the cloud integration of Lucy is attempting to solve. If we do want to run this in CI, is it intended to run on a single server or on a cluster? If we start zooming out and thinking about using this local integration across a cluster, it gets close to what we already do in GKE. Maybe this is something that @bkase @nholland94 and @deepthiskumar can shed some light on?

Assuming this tool is intended to run on a developer's machine (and/or a single server in CI), I think we should optimize for developer experience over similarity to the deployment environment. Using a Kubernetes backend to deploy a network locally and run tests against it makes things more complicated for the developer, including understanding where tests fail. We would have to introduce more moving parts, like Helm charts and Kubernetes networking, which in my opinion leads to a worse developer experience. Also, if we plan to use this in CI on just one machine, I'm not sure the choice between Kubernetes and Docker matters much; what are your thoughts?

If we use Docker containers, things stay lean and simple: a single docker compose file serves as the deployment, which is easy to understand, and little additional tooling is needed. Things will be faster, easier to install and set up, and carry much less overhead to think about.

What sorts of functional requirements are important for this tool to achieve, in your opinion? That is really the thing we want to reach consensus on.

Another question I have: given your experience deploying our nodes in a Kubernetes environment, are there implementation details we have to worry about due to networking differences? I was going under the assumption that Docker's local networking will just work without issues, but I could be wrong here 😅


@SanabriaRusso SanabriaRusso Nov 24, 2023


I was thinking that even if the introduction of docker-compose deployments could be very simple, it would still end up adding, well, something more to maintain and be aware of.

I think that with fairly few fixes the current Helm Charts could be made as generic as needed to be deployable on local Kubernetes instances on a dev computer (e.g., KiND, K3s, Minikube, or other alternatives would do).

Anyway, this is an interesting discussion because it deals with a tradeoff, as you have identified: dev experience vs. deployment reliability. The former aims at offering simplicity via heavy abstraction for fast feature tests, while the latter attempts to extend and reuse existing infra artifacts.

Would love to spark a discussion here 😄; in particular, to derive more specific requirements about this tool and its introduction as a test tool in CI. Any insight @stevenplatt @dkijania ?


Good call out @SanabriaRusso

From a higher level, I think this brings us back to the question of boundaries. In a perfect world, when platform-eng isn't busy with other important work, we could argue this "execution layer" should be owned by platform-eng; though even then it's not clear: in practice on this particular project, an ownership boundary that sat too high up caused the integration testing framework to evolve more slowly than we would have liked. Anyway, this is why we're experimenting with pushing the ownership boundary between teams slightly downwards from where it was before.

Also, pragmatically, @MartinMinkov has a PR that is 85% complete implementing the docker backend! So we can bring this to production within a couple of weeks rather than taking on the much larger task of starting from "scratch". This doesn't mean we can't throw it out later if the Helm Charts get to the point where they run successfully locally.

So I'd suggest: let's finish this project that's 85% done and start relying on running the tests locally and on a single CI server. Then, separately, prioritize cleaning up the Helm Charts. Once that work is done and is cleaner than the docker-compose solution (i.e. behaves at least as well, without confusing devs as they iterate locally), we can evaluate throwing out the docker-compose backend.


A further optimization we can do is apply [`logproc`](https://github.com/MinaProtocol/mina/blob/compatible/src/app/logproc/logproc.ml) to all container output before it's written to the pipe. `logproc` can filter logs by log level, reducing both the volume of logs written to the pipe and the number of logs Lucy has to parse.

For example, we could write an entrypoint script for each container that redirects stdout and stderr to the named pipe, prefixing each line with the container name. This will let us filter logs by container name and deal with duplicate logs.
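
As a concrete sketch (not the final design), the Lucy-side reader could tail the pipe and route lines by their prefix; the `name| message` format and the pipe path below are illustrative assumptions:

```ocaml
(* Sketch only: tail the shared named pipe and route each line by its
   container-name prefix. The "name| message" format and the pipe path
   are illustrative assumptions. *)
let read_pipe ~pipe_path ~handle_line =
  let ic = open_in pipe_path in
  try
    while true do
      let line = input_line ic in
      match String.index_opt line '|' with
      | Some i ->
          let container = String.sub line 0 i in
          let message =
            String.sub line (i + 1) (String.length line - i - 1)
          in
          handle_line ~container ~message:(String.trim message)
      | None -> () (* skip lines without a container prefix *)
    done
  with End_of_file -> close_in ic
```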

The last three sections have:

  1. "Furthermore, we can tag all container logs..."
  2. "A further optimization we can do is..."
  3. "For example, we could write..."

Can you be more specific about what you will do rather than what you may do? Either say "this will happen" or, for the ones we are not scoping in, group them in a "Future optimizations" section and drop the hedging language. It's okay to include optimizations that are conditional on criteria such as "if, when I run test X, it takes longer than Y, then we will do Z".


Sounds good! I rewrote the text to be clearer about what we will be building in this section. These were all intended to be built 👍



Using Docker Swarm allows us to handle the orchestration of containers without having to worry about the specifics of running containers on a local machine. Docker Swarm will handle all the networking, log collection, and resource management for us.
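
As an illustration of how little glue this requires, deploying the generated file could shell out to the standard Docker CLI; the function and argument names below are placeholders:

```ocaml
(* Sketch: hand the generated file to Docker Swarm via the standard CLI.
   [stack_name] and [compose_file] are placeholders. *)
let deploy_stack ~stack_name ~compose_file =
  let cmd =
    Printf.sprintf "docker stack deploy --compose-file %s %s"
      compose_file stack_name
  in
  match Sys.command cmd with
  | 0 -> Ok ()
  | code -> Error (Printf.sprintf "docker stack deploy exited with %d" code)
```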

### Extending Lucy

Can you add one more layer of detail here? Since you've already done a good chunk of the work, it should be fairly easy for you to figure out how it works.

i.e. turn

For the local backend, it will generate a docker swarm file that specifies the network topology configuration.

into something like:

"The Local_network_config will have a type like

module Config_record = struct
  type t = { foo : int; bar : int }
end

and will include logic for encoding it into a docker-swarm file by doing a single pass and pretty-printing the YAML using library X. Here's a code snippet of one such function... etc.

And can you do the same treatment for how the pipe will be shared for the logging?


Sure thing! I added a note about the config type here

The section for the pipe is already included; it will be shared using volumes across each service configuration.
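
For illustration, here is a minimal sketch of what the config record and the single-pass YAML emission could look like; the field names, image name, and YAML shape are placeholders rather than the final `Local_network_config`:

```ocaml
(* Illustrative sketch only: field names, the image name, and the YAML
   shape are placeholders, not the final Local_network_config. *)
module Config_record = struct
  type t =
    { network_name : string
    ; block_producers : int
    ; snark_workers : int
    }
end

(* Single pass over the record, pretty-printing a compose/swarm-style
   file with Buffer/Printf; a YAML library could be swapped in later. *)
let to_swarm_yaml (config : Config_record.t) : string =
  let open Config_record in
  let buf = Buffer.create 256 in
  Buffer.add_string buf "version: \"3.8\"\nservices:\n" ;
  for i = 1 to config.block_producers do
    Printf.bprintf buf
      "  block-producer-%d:\n    image: mina-daemon\n    networks: [%s]\n"
      i config.network_name
  done ;
  for i = 1 to config.snark_workers do
    Printf.bprintf buf
      "  snark-worker-%d:\n    image: mina-daemon\n    networks: [%s]\n"
      i config.network_name
  done ;
  Printf.bprintf buf "networks:\n  %s:\n" config.network_name ;
  Buffer.contents buf
```

Generating the file in a single pass keeps the topology encoding easy to inspect: a developer can read the emitted YAML directly when a test misbehaves.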
