
WIP: Provide basis for new API; move to environment configuration style #179

Open
wants to merge 19 commits into master

Conversation

Mierdin
Member

@Mierdin Mierdin commented Dec 6, 2018

  • Move to environment variables instead of config files. This enables us to more easily run ToDD in containerized environments, and since all of the external dependencies have been stripped out, the config file approach really isn't needed anymore.

  • Implement new gRPC-based API. Uses grpc-gateway to provide REST capabilities, and the swagger plugin for swagger docs. Limiting this PR to the "groups" resource, so we can make sure things are done properly end-to-end, from the client to the server, including tests, before adding more endpoints.

  • Implementing only agents and groups in the API for now. Going to do it end-to-end, and with tests before moving on. Agents and groups are related so it makes sense to start here.

@Mierdin Mierdin changed the title WIP: start new API base WIP: Provide basis for new API; move to environment configuration style Dec 7, 2018
@Mierdin Mierdin added the ToDD Rebuild Tasks related to the effort to rebuild ToDD the right way label Dec 7, 2018
@Mierdin Mierdin mentioned this pull request Dec 7, 2018
@bigmstone

I'm a little fearful about the move to gRPC, but that would involve a deeper discussion. My main concern is two-way communication, which seems like it would be a big requirement for future uses of todd. gRPC primarily requires the client to initiate to the server (the only exception I know of is the client requesting streams, but it's still limited). So, assuming I'm right there, either you'll back yourself into a non-event-driven architecture where agents periodically check for new tasks to run, or you'll require your clients to be gRPC servers so that the coordinator can connect to the agent to request an RPC. The second has obvious flaws (among them NAT traversal).

Perhaps the REST gateway gets around some of this, but then there's the question of whether REST is the best mechanism for agents that need to pick up jobs as they are requested, since you'll still be forced to periodically check for new executions. My primary issue with checking for new executions periodically is the natural wait loop it will make you write into the clients and the needless load on the servers serving a ton of "nothing to see here" requests.

I'm certainly +1 to modeling it, both with OpenAPI and protobuf, so maybe don't abandon either but just change the transport? Keep modeling your north-bound API in OpenAPI and your inter-client comms in protobuf, but maybe send the protobuf over a ZeroMQ socket? If you scale out the controlling nodes (not sure what you call them in todd) you could have the servers themselves coordinate, or even act like a broker if needed.

Anyway, food for thought. Let me know if you want to chat more or if I'm totally off base.

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

@bigmstone There's definitely two-way communication here. See https://github.com/toddproject/todd/pull/179/files#diff-ee511bec1607ffbef277e4fb70499275R11. And I can definitely call you soon to go over more details

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

@bigmstone Okay I have a little more time in the airport now to elaborate.

The idea I'm playing with is using the AgentRegister RPC as a representation of the full lifecycle of the agent communication process, meaning the RPC is called on the agent's initial connection to the server, and the RPC doesn't actually return until the agent disconnects from the server.

Once this RPC function determines the agent is valid and assigns it a group, it stores a pointer to the agent's stream in memory (see the s.agents map). This map will be made available to all other goroutines in the todd server.

The RPC then hands off receipt of additional messages to a goroutine, and blocks on the channel returned by stream.Context().Done(), which is closed only when the stream is closed. So, all code after the line that blocks on this channel can run under the assumption that the agent has disconnected, meaning we can clean up the in-memory map and return the RPC.
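Roughly, the server side looks something like the following sketch (a minimal illustration only: the generated names like pb.Agents_AgentRegisterServer, the message fields, and the handleAgentMessage helper are placeholders, not the actual generated code, and imports are omitted):

```go
type server struct {
	mu     sync.Mutex
	agents map[string]pb.Agents_AgentRegisterServer // live agent streams, keyed by agent UUID
}

func (s *server) AgentRegister(stream pb.Agents_AgentRegisterServer) error {
	// The first message on the stream identifies the agent; validate it and
	// assign it to a group here.
	reg, err := stream.Recv()
	if err != nil {
		return err
	}

	// Store a pointer to this agent's stream so any other goroutine on the
	// server can push messages to it for as long as the agent is connected.
	s.mu.Lock()
	s.agents[reg.Uuid] = stream
	s.mu.Unlock()

	// Hand off receipt of any further messages from the agent to a goroutine.
	go func() {
		for {
			msg, err := stream.Recv()
			if err != nil {
				return // stream closed or errored
			}
			s.handleAgentMessage(reg.Uuid, msg)
		}
	}()

	// Block until the stream's context is done, i.e. the agent disconnected.
	<-stream.Context().Done()

	// Anything below this line runs only after a disconnect: clean up the
	// in-memory map and return the RPC.
	s.mu.Lock()
	delete(s.agents, reg.Uuid)
	s.mu.Unlock()
	return nil
}
```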

This is really nice because we don't have to fuss with keepalives. Both the agent's details and its bidirectional stream are maintained in memory exactly as long as that agent is alive. The agent also maintains its own stream in the other direction, constantly holding the stream open to receive messages from the server. No checking, no polling, no problems with NAT traversal. The agent is responsible for initiating the connection, but once established, stream communication can be initiated in either direction.

This is all done with native gRPC. On top of this, I'm considering adding the grpc-gateway functionality so that 3rd-party software can use ToDD without having to use gRPC. However, the AgentRegister, or any other endpoint for working with agents, will not be made available via REST, as it would be pointless. Agents only speak gRPC.

Hope that clarifies. Let me know if I can clarify anything else.

@bigmstone

So you're accomplishing this with a long-running connection (gRPC stream?) between client and server. What happens if the client disconnects because of network-related issues, etc.? Does gRPC handle the reconnect semantics? It's been a while since I worked with it, but IIRC you have to build in the reconnect logic, right?

Also, on a long-standing stream, what's the plan for test results? Will you have an unmodeled results field? I don't think there's a way to have that kind of dynamic modeling on a single open stream. Regardless of how it's accomplished, I think all messaging should be modeled, including results. Maybe tests can define their own model (proto file)? I still don't think it would be modeled on the wire, though, since the open stream will require you to communicate via the stream message type.

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

What happens if the client disconnects because of network-related issues, etc.? Does gRPC handle the reconnect semantics? It's been a while since I worked with it, but IIRC you have to build in the reconnect logic, right?

You're correct. The stream won't be automatically re-established, so that's why, in the agent, I'm connecting in a loop (which, again, won't iterate all the time, only when there's a disconnect event). This uses the same kind of disconnect-detection logic on the agent side that the server is using (which is why some of those comments are copypasta right now). In addition (this isn't done yet), the agent will have a local in-memory store of metrics, similar to how they're stored to disk in the "old" version, that will be uploaded when it reconnects.
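For illustration, the agent side of that loop would be roughly the sketch below (same caveats as the earlier sketch: the pb.* names, message fields, and handleServerMessage are placeholders, imports are omitted, and the backoff is simplified):

```go
func runAgent(serverAddr string) {
	for {
		conn, err := grpc.Dial(serverAddr, grpc.WithInsecure())
		if err != nil {
			time.Sleep(2 * time.Second)
			continue
		}

		stream, err := pb.NewAgentsClient(conn).AgentRegister(context.Background())
		if err == nil {
			// Identify ourselves, then hold the stream open and handle
			// server-initiated messages until the stream breaks.
			_ = stream.Send(&pb.AgentMessage{Uuid: "agent-01"})
			for {
				msg, err := stream.Recv()
				if err != nil {
					break // disconnect event; fall through and re-dial
				}
				handleServerMessage(msg)
			}
		}

		conn.Close()

		// We only get here on a disconnect, so the loop doesn't spin while
		// the stream is healthy; back off briefly before reconnecting.
		time.Sleep(2 * time.Second)
	}
}
```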

Also, on a long-standing stream, what's the plan for test results? Will you have an unmodeled results field? I don't think there's a way to have that kind of dynamic modeling on a single open stream. Regardless of how it's accomplished, I think all messaging should be modeled, including results. Maybe tests can define their own model (proto file)? I still don't think it would be modeled on the wire, though, since the open stream will require you to communicate via the stream message type.

This won't be part of this PR, but I have some plans here. I can tell you that everything will be modeled and driven via schema. I want to do some basic aggregation at the server so I can make assertions about test data, and the only way to do that is if the server can get a much tighter handle on types than it has in the past (it was really just concatenating JSON objects and printing them to the screen, which was wildly uninspiring).

It's clear to me that I should stop working on code until I have a proper design doc published 😄

@bigmstone

You're correct. The stream won't be automatically re-established, so that's why, in the agent, I'm connecting in a loop (which, again, won't iterate all the time, only when there's a disconnect event). This uses the same kind of disconnect-detection logic on the agent side that the server is using (which is why some of those comments are copypasta right now). In addition (this isn't done yet), the agent will have a local in-memory store of metrics, similar to how they're stored to disk in the "old" version, that will be uploaded when it reconnects.

Where possible, my opinion is to throw as much of that "pig over the fence" as we can. Rabbit offers more in this area, and if you don't want a broker, 0mq does. I am personally partial to 0mq, but that's a broader discussion because it comes with its own warts... and your instinct to draw up the actual requirements is good; it will help drive the discussion. I just think with todd we're solving a specific problem, and if we can get world-class connection and message handling via import thingy, we should. A quick "for instance": does the current retry-logic plan account for job completion while the server is disconnected, or does that data get lost? These are questions we don't have to think about if using a properly configured 0mq socket.

I can tell you that everything will be modeled and driven via schema.

Good, but I don't see a solid path to do that "on the wire" while using gRPC. To explain better: I actually do really like gRPC (and, more specifically, protobuf), but I think it might not be a fantastic choice here, since a long-lived stream of data will lock us into one message type. So we either have the nightmare of a massive agent message that tries to capture every possible data element, or we end up with a single unmodeled field that we either a) treat as unmodeled the entire time and pray we can unmarshal at some point in the future, or b) model in a different schema or with protobuf but with our own handling logic, since it will be outside the gRPC message at that point.

So yeah, with my current understanding I lean towards a more appropriate client/server primitive that can send various different, but modeled, message types. This is what I was initially referring to with two-way communication. Not to imply gRPC isn't two-way, but that it's initiated in only one direction and with a specific message type for that connection.

We should drink moar beer and discuss in a better forum once you solidify the basic requirements.

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

I am a firm 👎 against continuing with an external broker like RMQ. IMO, that was one of the biggest PITA aspects of ToDD to date. It sounds like you're really only talking about the queueing functionality, and not explicitly an external broker, so I won't belabor this further.

Let's assume for the time being that it would be needlessly complex to try to do BOTH gRPC and ZMQ. In that case, it seems there are two main problems with the current approach.

  1. No guarantees for message delivery over a stream (noting that regular non-stream RPCs are still reliable).
  2. Streams are single-type, so if we had multiple message types, your predictions would be correct: either use a massive message type to include all use cases, or use a payload that we marshal/unmarshal outside of gRPC.

Is that about the size of it? Let me know if I missed anything. I think I have ideas to address both of these concerns but want to make sure I have it right.

@bigmstone

  1. Yes, and the heart of that is not that it can't be worked around, but rather whether a project should re-solve that problem when it's been so well solved by other technologies/libraries. At the risk of over-tooting the 0mq horn: you create a socket and start sending to the server, and it's all pretty much taken care of for you.

  2. Yes, and again this isn't a limitation of protobuf, but rather of how gRPC leverages it within this type of use case. I'd still be +1 to protobuf over 0mq, and have used that architecture before with success.

On really any simple socket we treat as a queue, you can label each message with the type of message being returned and unmarshal it into a struct for handling via a callback. We'll have to build a lot less scaffolding than what would eventually be required with gRPC.
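Something like this transport-agnostic sketch, for instance (the frame layout, type names, and JSON encoding are purely illustrative; in practice the payload could just as well be protobuf and the socket a 0mq one):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// frame is what actually travels over the socket: a type label plus a raw payload.
type frame struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// testResult is one concrete, modeled message type.
type testResult struct {
	Agent string  `json:"agent"`
	Value float64 `json:"value"`
}

// handlers maps a message-type label to the callback that unmarshals and handles it.
var handlers = map[string]func(json.RawMessage) error{
	"test_result": func(raw json.RawMessage) error {
		var r testResult
		if err := json.Unmarshal(raw, &r); err != nil {
			return err
		}
		fmt.Printf("result from %s: %v\n", r.Agent, r.Value)
		return nil
	},
}

// dispatch is what the receive loop would call for each frame read off the socket.
func dispatch(raw []byte) error {
	var f frame
	if err := json.Unmarshal(raw, &f); err != nil {
		return err
	}
	h, ok := handlers[f.Type]
	if !ok {
		return fmt.Errorf("no handler for message type %q", f.Type)
	}
	return h(f.Payload)
}

func main() {
	// Simulate one frame arriving off the queue.
	msg := []byte(`{"type":"test_result","payload":{"agent":"agent-01","value":12.5}}`)
	if err := dispatch(msg); err != nil {
		fmt.Println("dispatch error:", err)
	}
}
```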

Ultimately ToDD is a bit reversed from normal gRPC applications. The clients are the things with the resources and the server aggregates that and provides command and control to them. So it needs a more robust communications structure than a server that exclusively provides a resource.

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

Hm. In your opinion, would it be feasible to use ZMQ for agent comms but gRPC+gateway/REST for everything else? I have other ideas for non-agent clients that might make use of gRPC streaming, and it would be more like the canonical gRPC use case where the server is streaming data to an external entity.

Assuming that both gRPC and ZMQ would be model-driven via protobuf, would this be needlessly complex, or not a big deal?

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

That is to say, use ZMQ "southbound" to the agents, and grpc/REST northbound to everything else.
