WIP: Provide basis for new API; move to environment configuration style #179
Conversation
I'm a little fearful about the move to gRPC, but that would involve a deeper discussion. My main concern is two-way communication, which seems like a big requirement for future uses of ToDD. gRPC primarily requires the client to initiate to the server (the only exception I know of is the client requesting streams, but even that is limited). So, assuming I'm right there, either you'll back yourself into a non-event-driven architecture where agents periodically check for new tasks to run, or you'll require your clients to be gRPC servers so that the coordinator can connect to an agent to request an RPC. The second has obvious flaws (NAT traversal among them).

Perhaps the REST gateway gets around some of this, but then there's the question of whether REST is the best mechanism for agents that need to pick up jobs as they are requested, since you'll still be forced to periodically check for new executions. My primary issue with checking periodically is the natural wait loop it will make you write into the clients, and the needless load on the servers serving a ton of "nothing to see here" requests.

I'm certainly +1 on modeling it, both with OpenAPI and protobuf, so maybe don't abandon either, but just change the transport? Still model your north-bound API in OpenAPI, and model your inter-client comms in protobuf, but maybe send the protobuf over a ZeroMQ socket? If you scale out the controlling nodes (not sure what you call them in ToDD), you could have the servers themselves coordinate, or even act like a broker if needed. Anyway, food for thought. Let me know if you want to chat more or if I'm totally off base.
@bigmstone There's definitely two-way communication here. See https://github.com/toddproject/todd/pull/179/files#diff-ee511bec1607ffbef277e4fb70499275R11. And I can definitely call you soon to go over more details.
@bigmstone Okay, I have a little more time in the airport now to elaborate. The idea I'm playing with is using the `AgentRegister` RPC. Once this RPC function determines the agent is valid, and assigns a group, it stores a pointer to the agent's stream in memory. The RPC then hands the receipt of additional messages off to a goroutine, and blocks on the channel created from the stream's context, so it returns only when the agent disconnects.

This is really nice because we don't have to fuss with keepalives. Both the agent's details and its bidirectional stream are maintained in memory exactly as long as that agent is alive. The agent will also maintain its own end of the stream, constantly holding it open to receive messages from the server. No checking, no polling, no problems with NAT traversal. The agent is responsible for initiating the connection, but once established, stream communication can be initiated in either direction. This is all done with native gRPC.

On top of this, I'm considering adding grpc-gateway functionality so that 3rd-party software can use ToDD without having to speak gRPC. However, `AgentRegister`, or any other endpoint for working with agents, will not be made available via REST, as it would be pointless. Agents only speak gRPC.

Hope that clarifies. Let me know if I can clarify anything else.
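A minimal server-side sketch of what that pattern could look like, assuming a hypothetical `Agents` service with a bidirectional `AgentRegister` stream (the service, message, and field names here are illustrative, not the PR's actual generated code):

```go
type server struct {
	mu     sync.Mutex
	agents map[string]pb.Agents_AgentRegisterServer
}

// AgentRegister is the long-lived bidirectional stream each agent opens.
// The server validates the agent, stores a handle to its stream, and then
// blocks on the stream's context so the handle lives exactly as long as
// the agent's connection does.
func (s *server) AgentRegister(stream pb.Agents_AgentRegisterServer) error {
	// First message identifies the agent; validate and assign a group here.
	reg, err := stream.Recv()
	if err != nil {
		return err
	}
	agentID := reg.GetAgentId() // hypothetical field

	s.mu.Lock()
	s.agents[agentID] = stream // keep a pointer so we can push messages later
	s.mu.Unlock()

	// Receive further agent messages off the main path.
	go func() {
		for {
			msg, err := stream.Recv()
			if err != nil {
				return // stream closed or broken
			}
			s.handleAgentMessage(agentID, msg)
		}
	}()

	// Block until the agent disconnects; the context's Done channel fires
	// when the underlying connection goes away, so no keepalive bookkeeping.
	<-stream.Context().Done()

	s.mu.Lock()
	delete(s.agents, agentID)
	s.mu.Unlock()
	return stream.Context().Err()
}
```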
So you're accomplishing this with a long-running connection (a gRPC stream?) between client and server. What happens if the client disconnects because of network-related issues, etc.? Does gRPC handle the reconnect semantics? It's been a while since I worked with it, but IIRC you have to build in the reconnect logic yourself, right?

Also, on a long-standing stream, what's the plan for test results? Will you have an unmodeled results field? I don't think there's a way to have that kind of dynamic modeling on a single open stream. Regardless of how it's accomplished, I think all messaging should be modeled, including results. Maybe tests can define their own model (proto file)? I still don't think it would be modeled on the wire, though, since the open stream will require you to communicate via the stream's message type.
You're correct. The stream won't be automatically re-established, so in the agent I'm connecting in a loop (which, again, won't iterate constantly, only when there's a disconnect event). This uses the same kind of logic the server uses to detect a disconnect on the agent side (which is why some of those comments are copypasta right now). In addition, and this isn't done yet, the agent will have a local in-memory store of metrics, similar to how they're being stored to disk in the "old" version, that will be uploaded when the agent reconnects.
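A rough sketch of what that agent-side loop might look like, assuming the same hypothetical `Agents` service as above (the dial options and backoff interval are made up for illustration):

```go
// runAgent keeps a registration stream open to the server, re-dialing
// whenever the stream dies. The loop only spins on disconnect events;
// while the stream is healthy, it blocks inside Recv.
func runAgent(addr string) {
	for {
		conn, err := grpc.Dial(addr, grpc.WithInsecure())
		if err != nil {
			time.Sleep(5 * time.Second) // illustrative backoff
			continue
		}
		stream, err := pb.NewAgentsClient(conn).AgentRegister(context.Background())
		if err == nil {
			// Block here, handling server-pushed messages, until the
			// stream breaks (Recv returns an error on disconnect).
			for {
				msg, err := stream.Recv()
				if err != nil {
					break
				}
				handleServerMessage(msg)
			}
		}
		conn.Close()
		time.Sleep(5 * time.Second) // wait before reconnecting
	}
}
```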
This won't be part of this PR, but I have some plans here. I can tell you that everything will be modeled and driven via schema. I want to do some basic aggregation at the server so I can make assertions about test data, and the only way to do that is if the server can get a much tighter handle on types than it has in the past (previously it was really just concatenating JSON objects and printing them to the screen, which was wildly uninspiring). It's clear to me that I should stop working on code until I have a proper design doc published 😄
Where possible, my opinion is to throw as much of that "pig over the fence" as possible. Rabbit offers more in this area, and if you don't want a broker, 0mq. I am personally partial to 0mq, but that's a broader discussion because it comes with its own warts... and your instinct to draw up the actual requirements is good; it will help drive the discussion. I just think with
Good, but I don't see a solid path to do that "on the wire" while using gRPC. To explain better: I actually do really like gRPC (and, more specifically, protobuf), but I think it might not be a fantastic choice here, since a long-lived stream of data will lock us into one message type. So either we have the nightmare of a massive agent message that tries to capture every possible data element, or we end up with a single unmodeled field that we either a) treat as unmodeled the entire time and pray we can unmarshal it at some point in the future, or b) model in a different schema, or with protobuf but with our own handling logic, since it will be outside the gRPC message at that point.

So yeah, with my current understanding, I lean towards a more appropriate client/server primitive that can send various different, but modeled, message types. This is what I was initially referring to with two-way communication. Not to imply gRPC isn't two-way, but that it is initiated in only one direction, and with a specific message type for that connection. We should drink moar beer and discuss in a better forum once you solidify the basic requirements.
I am a firm 👎 against continuing with an external broker like RMQ. IMO, that was one of the biggest PITA aspects of ToDD to date. It sounds like you're really only talking about the queueing functionality, and not explicitly an external broker, so I won't belabor this further. Let's assume for the time being that it would be needlessly complex to try to do BOTH gRPC and ZMQ. In that case, it seems there are two main problems with the current approach:

1. A long-lived gRPC stream locks us into a single message type, so every payload either bloats one giant message or rides in an unmodeled field.
2. Connections can only be initiated from agent to server, which is backwards for ToDD, where the agents hold the resources and the server provides command and control.
Is that about the size of it? Let me know if I missed anything. I think I have ideas to address both of these concerns but want to make sure I have it right.
With really any simple socket we treat as a queue, you can label each message with its type and unmarshal it into a struct for handling via a callback. We'd have to build a lot less scaffolding than what would eventually be required with gRPC. Ultimately ToDD is a bit reversed from normal gRPC applications: the clients are the things with the resources, and the server aggregates them and provides command and control. So it needs a more robust communications structure than a server that exclusively provides a resource.
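A transport-agnostic sketch of that labeling idea (the envelope shape, message names, and JSON encoding here are all assumptions for illustration; in practice the payloads would presumably be protobuf over a 0mq socket):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Envelope tags every message on the wire with its type, so the receiver
// knows which struct to unmarshal the payload into.
type Envelope struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// TestRun is one hypothetical message type.
type TestRun struct {
	TestID string `json:"test_id"`
}

// handlers maps a message type label to a callback for that type.
var handlers = map[string]func(json.RawMessage) error{
	"test_run": func(raw json.RawMessage) error {
		var tr TestRun
		if err := json.Unmarshal(raw, &tr); err != nil {
			return err
		}
		fmt.Println("running test", tr.TestID)
		return nil
	},
}

// dispatch routes a raw wire message to the handler registered for its type.
func dispatch(wire []byte) error {
	var env Envelope
	if err := json.Unmarshal(wire, &env); err != nil {
		return err
	}
	h, ok := handlers[env.Type]
	if !ok {
		return fmt.Errorf("no handler for message type %q", env.Type)
	}
	return h(env.Payload)
}

func main() {
	msg := []byte(`{"type":"test_run","payload":{"test_id":"abc123"}}`)
	if err := dispatch(msg); err != nil {
		fmt.Println("dispatch error:", err)
	}
}
```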
Hm. In your opinion, would it be feasible to use ZMQ for agent comms but gRPC+gateway/REST for everything else? I have other ideas for non-agent clients that might make use of gRPC streaming, and that would be more the canonical gRPC use case, where the server is streaming data to an external entity. Assuming that both gRPC and ZMQ would be model-driven via protobuf, would this be needlessly complex, or not a big deal?
That is to say: use ZMQ "southbound" to the agents, and gRPC/REST "northbound" to everything else.
Move to environment variables instead of config files. This enables us to more easily run ToDD in containerized environments, and with all of the external dependencies stripped out, the config-file approach is really not needed anymore.
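As a sketch of that configuration style (the variable names and defaults here are hypothetical, not the PR's actual settings):

```go
// getEnv returns the value of an environment variable, falling back to a
// default when unset, so containers can override only what they need.
func getEnv(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return fallback
}

type Config struct {
	GRPCListenAddr string // e.g. ":50051"
	HTTPListenAddr string // e.g. ":8080" for the grpc-gateway
}

func LoadConfig() Config {
	return Config{
		GRPCListenAddr: getEnv("TODD_GRPC_ADDR", ":50051"),
		HTTPListenAddr: getEnv("TODD_HTTP_ADDR", ":8080"),
	}
}
```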
Implement new gRPC-based API. Uses grpc-gateway to provide REST capabilities, and the swagger plugin for Swagger docs. Limiting this PR to the "groups" resource, so we can make sure things are done properly end-to-end, from the client to the server, including tests, before adding more endpoints.
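For reference, wiring up grpc-gateway typically looks something like the following. The `RegisterGroupsHandlerFromEndpoint` name assumes a generated handler for a hypothetical `Groups` service; only the grpc-gateway `runtime` calls are the library's real API:

```go
// serveGateway starts the REST reverse proxy that translates HTTP/JSON
// calls into gRPC calls against the main API server.
func serveGateway(grpcAddr, httpAddr string) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	mux := runtime.NewServeMux()
	opts := []grpc.DialOption{grpc.WithInsecure()}

	// Generated by protoc-gen-grpc-gateway for the (hypothetical) Groups service.
	if err := pb.RegisterGroupsHandlerFromEndpoint(ctx, mux, grpcAddr, opts); err != nil {
		return err
	}
	return http.ListenAndServe(httpAddr, mux)
}
```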
Implementing only agents and groups in the API for now. Going to do it end-to-end, and with tests, before moving on. Agents and groups are related, so it makes sense to start here.