
WIP: Provide basis for new API; move to environment configuration style #179

Open
wants to merge 19 commits into master

Conversation

Mierdin
Member

@Mierdin Mierdin commented Dec 6, 2018

  • Move to environment variables instead of config files. This enables us to more easily run ToDD in containerized environments, and since all of the external dependencies have been stripped out, the config file approach really isn't needed anymore.

  • Implement new gRPC-based API. Uses grpc-gateway to provide REST capabilities, and the swagger plugin for swagger docs. Limiting this PR to the "groups" resource, so we can make sure things are done properly end-to-end, from the client to the server, including tests, before adding more endpoints.

  • Implementing only agents and groups in the API for now. Going to do it end-to-end, and with tests before moving on. Agents and groups are related so it makes sense to start here.

@Mierdin Mierdin changed the title WIP: start new API base WIP: Provide basis for new API; move to environment configuration style Dec 7, 2018
@Mierdin Mierdin added the ToDD Rebuild Tasks related to the effort to rebuild ToDD the right way label Dec 7, 2018
@Mierdin Mierdin mentioned this pull request Dec 7, 2018
@bigmstone

I'm a little fearful about the move to gRPC, but that would involve a deeper discussion. My main concern is two-way communication, which seems like it would be a big requirement for future uses of todd. gRPC primarily requires the client to initiate to the server (the only exception I know of is the client requesting streams, but it's still limited). So, assuming I'm right there, either you'll back yourself into a non-event-driven architecture where agents periodically check for new tasks to run, or you'll require your clients to be gRPC servers so that the coordinator can connect to the agent to request an RPC. The second has obvious flaws (among them NAT traversal).

Perhaps the REST gateway gets around some of this, but then there's the question of whether REST is the best mechanism for agents that need to pick up jobs as they are requested, since you'll still be forced to periodically check for new executions. My primary issue with checking for new executions periodically is the natural wait loop it will make you write into the clients and the needless load on the servers serving a ton of "nothing to see here" requests.

I'm certainly +1 to modeling it, both with OpenAPI and protobuf, so maybe don't abandon either but just change the transport? Keep modeling your north-bound API in OpenAPI and your inter-client comms in protobuf, but maybe send the protobuf over a ZeroMQ socket? If you scale out the controlling nodes (not sure what you call them in todd) you could have the servers themselves coordinate, or even act like a broker if needed.

Anyway, food for thought. Let me know if you want to chat more or if I'm totally off base.

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

@bigmstone There's definitely two-way communication here. See https://github.com/toddproject/todd/pull/179/files#diff-ee511bec1607ffbef277e4fb70499275R11. And I can definitely call you soon to go over more details

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

@bigmstone Okay I have a little more time in the airport now to elaborate.

The idea I'm playing with is using the AgentRegister RPC as a representation of the full lifecycle of the agent communication process, meaning the RPC is called on the agent's initial connection to the server, and the RPC doesn't actually return until the agent disconnects from the server.

Once this RPC function determines the agent is valid and assigns it a group, it stores a pointer to the agent's stream in memory (see the s.agents map). This map will be made available to all other goroutines in the todd server.

The RPC then hands off receipt of additional messages to a goroutine, and blocks on the channel returned by stream.Context().Done(), which is closed only when the stream is closed. So, all code after the line that blocks on this channel can run under the assumption that the agent has disconnected, meaning we can clean up the in-memory map and return the RPC.
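Roughly, the server side looks something like the following sketch (a minimal illustration only: the generated names like pb.Agents_AgentRegisterServer, the message fields, and the handleAgentMessage helper are placeholders, not the actual generated code, and imports are omitted):

```go
type server struct {
	mu     sync.Mutex
	agents map[string]pb.Agents_AgentRegisterServer // live agent streams, keyed by agent UUID
}

func (s *server) AgentRegister(stream pb.Agents_AgentRegisterServer) error {
	// The first message on the stream identifies the agent; validate it and
	// assign it to a group here.
	reg, err := stream.Recv()
	if err != nil {
		return err
	}

	// Store a pointer to this agent's stream so any other goroutine on the
	// server can push messages to it for as long as the agent is connected.
	s.mu.Lock()
	s.agents[reg.Uuid] = stream
	s.mu.Unlock()

	// Hand off receipt of any further messages from the agent to a goroutine.
	go func() {
		for {
			msg, err := stream.Recv()
			if err != nil {
				return // stream closed or errored
			}
			s.handleAgentMessage(reg.Uuid, msg)
		}
	}()

	// Block until the stream's context is done, i.e. the agent disconnected.
	<-stream.Context().Done()

	// Anything below this line runs only after a disconnect: clean up the
	// in-memory map and return the RPC.
	s.mu.Lock()
	delete(s.agents, reg.Uuid)
	s.mu.Unlock()
	return nil
}
```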

This is really nice because we don't have to fuss with keepalives. Both the agent's details and its bidirectional stream are maintained in memory exactly as long as that agent is alive. The agent also maintains its own stream in the other direction, constantly holding the stream open to receive messages from the server. No checking, no polling, no problems with NAT traversal. The agent is responsible for initiating the connection, but once established, stream communication can be initiated in either direction.

This is all done with native gRPC. On top of this, I'm considering adding the grpc-gateway functionality so that 3rd-party software can use ToDD without having to use gRPC. However, the AgentRegister, or any other endpoint for working with agents, will not be made available via REST, as it would be pointless. Agents only speak gRPC.

Hope that clarifies. Let me know if I can clarify anything else.

@bigmstone

So you're accomplishing this with a long-running connection (gRPC stream?) between client and server. What happens if the client disconnects because of network-related issues, etc.? Does gRPC handle the reconnect semantics? It's been a while since I worked with it, but IIRC you have to build in the reconnect logic, right?

Also, on a long-standing stream, what's the plan for test results? Will you have an unmodeled results field? I don't think there's a way to have that kind of dynamic modeling on a single open stream. Regardless of how it's accomplished, I think all messaging should be modeled, including results. Maybe tests can define their own model (proto file)? I still don't think it would be modeled on the wire, though, since the open stream will require you to communicate via the stream message type.

@Mierdin
Member Author

Mierdin commented Dec 14, 2018

What happens if the client disconnects because of network-related issues, etc.? Does gRPC handle the reconnect semantics? It's been a while since I worked with it, but IIRC you have to build in the reconnect logic, right?

You're correct. The stream won't be automatically re-established, so that's why, in the agent, I'm connecting in a loop (which, again, won't iterate all the time, only when there's a disconnect event). This uses the same kind of disconnect-detection logic on the agent side that the server is using (which is why some of those comments are copypasta right now). In addition (this isn't done yet), the agent will have a local in-memory store of metrics, similar to how they're stored to disk in the "old" version, that will be uploaded when it reconnects.
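For illustration, the agent side of that loop would be roughly the sketch below (same caveats as the earlier sketch: the pb.* names, message fields, and handleServerMessage are placeholders, imports are omitted, and the backoff is simplified):

```go
func runAgent(serverAddr string) {
	for {
		conn, err := grpc.Dial(serverAddr, grpc.WithInsecure())
		if err != nil {
			time.Sleep(2 * time.Second)
			continue
		}

		stream, err := pb.NewAgentsClient(conn).AgentRegister(context.Background())
		if err == nil {
			// Identify ourselves, then hold the stream open and handle
			// server-initiated messages until the stream breaks.
			_ = stream.Send(&pb.AgentMessage{Uuid: "agent-01"})
			for {
				msg, err := stream.Recv()
				if err != nil {
					break // disconnect event; fall through and re-dial
				}
				handleServerMessage(msg)
			}
		}

		conn.Close()

		// We only get here on a disconnect, so the loop doesn't spin while
		// the stream is healthy; back off briefly before reconnecting.
		time.Sleep(2 * time.Second)
	}
}
```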

Also, on a long-standing stream, what's the plan for test results? Will you have an unmodeled results field? I don't think there's a way to have that kind of dynamic modeling on a single open stream. Regardless of how it's accomplished, I think all messaging should be modeled, including results. Maybe tests can define their own model (proto file)? I still don't think it would be modeled on the wire, though, since the open stream will require you to communicate via the stream message type.

This won't be part of this PR, but I have some plans here. I can tell you that everything will be modeled and driven via schema. I want to do some basic aggregation at the server so I can make assertions about test data, and the only way to do that is if the server can get a much tighter handle on types than it has in the past (it was really just concatenating JSON objects and printing them to the screen, which was wildly uninspiring).

It's clear to me that I should stop working on code until I have a proper design doc published 😄

@bigmstone

You're correct. The stream won't be automatically re-established, so that's why, in the agent, I'm connecting in a loop (which, again, won't iterate all the time, only when there's a disconnect event). This uses the same kind of disconnect-detection logic on the agent side that the server is using (which is why some of those comments are copypasta right now). In addition (this isn't done yet), the agent will have a local in-memory store of metrics, similar to how they're stored to disk in the "old" version, that will be uploaded when it reconnects.

Where possible, my opinion is to throw as much of that "pig over the fence" as we can. Rabbit offers more in this area, and if you don't want a broker, 0mq does. I am personally partial to 0mq, but that's a broader discussion because it comes with its own warts... and your instinct to draw up the actual requirements is good; it will help drive the discussion. I just think with todd we're solving a specific problem, and if we can get world-class connection and message handling via import thingy, we should. A quick "for instance": does the current retry-logic plan account for job completion while the server is disconnected, or does that data get lost? These are questions we don't have to think about if using a properly configured 0mq socket.

I can tell you that everything will be modeled and driven via schema.

Good, but I don't see a solid path to do that "on the wire" while using gRPC. To explain better: I actually do really like gRPC (and, more specifically, protobuf), but I think it might not be a fantastic choice here, since a long-lived stream of data will lock us into one message type. So we either have the nightmare of a massive agent message that tries to capture every possible data element, or we end up with a single unmodeled field that we either a) treat as unmodeled the entire time and pray we can unmarshal at some point in the future, or b) model in a different schema or with protobuf but with our own handling logic, since it will be outside the gRPC message at that point.

So yeah, with my current understanding I lean towards a more appropriate client/server primitive that can send various different, but modeled, message types. This is what I was initially referring to with two-way communication. Not to imply gRPC isn't two-way, but that it's initiated in only one direction and with a specific message type for that connection.

We should drink moar beer and discuss in a better forum once you solidify the basic requirements.

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

I am a firm 👎 against continuing with an external broker like RMQ. IMO, that was one of the biggest PITA aspects of ToDD to date. It sounds like you're really only talking about the queueing functionality, and not explicitly an external broker, so I won't belabor this further.

Let's assume for the time being that it would be needlessly complex to try to do BOTH gRPC and ZMQ. In that case, it seems there are two main problems with the current approach.

  1. No guarantees for message delivery over a stream (noting that regular non-stream RPCs are still reliable).
  2. Streams are single-type, so if we had multiple message types, your predictions would be correct: either use a massive message type to include all use cases, or use a payload that we marshal/unmarshal outside of gRPC.

Is that about the size of it? Let me know if I missed anything. I think I have ideas to address both of these concerns but want to make sure I have it right.

@bigmstone

  1. Yes, and the heart of that is not that it can't be worked around, but rather whether a project should re-solve that problem when it's been so well solved by other technologies/libraries. At the risk of over-tooting the 0mq horn: you create a socket and start sending to the server, and it's all pretty much taken care of for you.

  2. Yes, and again this isn't a limitation of protobuf, but rather of how gRPC leverages it within this type of use case. I'd still be +1 to protobuf over 0mq, and have used that architecture before with success.

On really any simple socket we treat as a queue, you can label each message with the type of message being returned and unmarshal it into a struct for handling via a callback. We'll have to build a lot less scaffolding than what would eventually be required with gRPC.
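Something like this transport-agnostic sketch, for instance (the frame layout, type names, and JSON encoding are purely illustrative; in practice the payload could just as well be protobuf and the socket a 0mq one):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// frame is what actually travels over the socket: a type label plus a raw payload.
type frame struct {
	Type    string          `json:"type"`
	Payload json.RawMessage `json:"payload"`
}

// testResult is one concrete, modeled message type.
type testResult struct {
	Agent string  `json:"agent"`
	Value float64 `json:"value"`
}

// handlers maps a message-type label to the callback that unmarshals and handles it.
var handlers = map[string]func(json.RawMessage) error{
	"test_result": func(raw json.RawMessage) error {
		var r testResult
		if err := json.Unmarshal(raw, &r); err != nil {
			return err
		}
		fmt.Printf("result from %s: %v\n", r.Agent, r.Value)
		return nil
	},
}

// dispatch is what the receive loop would call for each frame read off the socket.
func dispatch(raw []byte) error {
	var f frame
	if err := json.Unmarshal(raw, &f); err != nil {
		return err
	}
	h, ok := handlers[f.Type]
	if !ok {
		return fmt.Errorf("no handler for message type %q", f.Type)
	}
	return h(f.Payload)
}

func main() {
	// Simulate one frame arriving off the queue.
	msg := []byte(`{"type":"test_result","payload":{"agent":"agent-01","value":12.5}}`)
	if err := dispatch(msg); err != nil {
		fmt.Println("dispatch error:", err)
	}
}
```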

Ultimately ToDD is a bit reversed from normal gRPC applications. The clients are the things with the resources and the server aggregates that and provides command and control to them. So it needs a more robust communications structure than a server that exclusively provides a resource.

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

Hm. In your opinion, would it be feasible to use ZMQ for agent comms but gRPC+gateway/REST for everything else? I have other ideas for non-agent clients that might make use of gRPC streaming, and it would be more like the canonical gRPC use case where the server is streaming data to an external entity.

Assuming that both gRPC and ZMQ would be model-driven via protobuf, would this be needlessly complex, or not a big deal?

@Mierdin
Member Author

Mierdin commented Dec 15, 2018

That is to say, use ZMQ "southbound" to the agents, and grpc/REST northbound to everything else.
