
Monitoring Mediator Service #748

Closed
cvarjao opened this issue Nov 24, 2022 · 5 comments

@cvarjao (Member) commented Nov 24, 2022

When we turned on HPA in OpenShift, something broke and we didn't get any bells going off.

@WadeBarnes (Member) commented Nov 29, 2022

Bells did not go off because there wasn't an issue with the health endpoint for the mediator agent; switching to HA did not adversely impact the HTTP endpoints, only the WebSocket connections. We are currently monitoring the agent's admin health endpoints, in particular the /status/ready endpoint.
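
For context, the existing HTTP check amounts to something like the minimal sketch below. The admin URL is hypothetical, and the `ready` field check assumes the usual shape of ACA-Py's /status/ready response; this is not the actual monitoring code.

```python
# Minimal sketch of the existing style of HTTP health check.
# The admin URL is hypothetical; the "ready" field assumes the usual
# shape of ACA-Py's /status/ready response.
import requests

ADMIN_URL = "https://mediator-admin.example.com"  # hypothetical admin endpoint


def mediator_ready(timeout: float = 5.0) -> bool:
    """Return True if the agent's admin /status/ready endpoint reports ready."""
    try:
        resp = requests.get(f"{ADMIN_URL}/status/ready", timeout=timeout)
        return resp.ok and resp.json().get("ready", False)
    except (requests.RequestException, ValueError):
        return False
```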

Mobile apps communicate with the mediator over WebSockets rather than HTTP. WebSockets are stateful and were being impacted by the default routing scheme of the OpenShift Routes and Services: traffic was not always routed over the same path between client and server, so the connections were breaking.

There are services available that can monitor WebSockets (see the resources section below); however, they appear to only test the initial connection, not long-term communication over the socket. In the case of the mediators, the initial connection is not the issue; the long-term communication path over the connection is. As mentioned above, the route the communications take is affected by the OpenShift resource routing policies.
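
For reference, a check that exercises the communication path over time, rather than just the handshake, could look roughly like the minimal sketch below. The wss:// URL is hypothetical, and plain WebSocket pings only exercise the transport; a true end-to-end check would need DIDComm messaging through another agent, as discussed in the later comments.

```python
# Minimal sketch of a long-lived WebSocket stability check.
# The URL is hypothetical; plain pings only exercise the transport path,
# not DIDComm messaging through the mediator.
import asyncio
import websockets

MEDIATOR_WS_URL = "wss://mediator.example.com/ws"  # hypothetical endpoint


async def check_ws_stability(duration_s: int = 300, interval_s: int = 15) -> bool:
    """Hold one connection open and ping periodically; fail if the path breaks mid-session."""
    try:
        async with websockets.connect(MEDIATOR_WS_URL) as ws:
            for _ in range(duration_s // interval_s):
                pong_waiter = await ws.ping()
                await asyncio.wait_for(pong_waiter, timeout=10)
                await asyncio.sleep(interval_s)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print("stable" if asyncio.run(check_ws_stability()) else "unstable")
```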

We are still testing updates to the mediators' HA configuration. Since WebSockets are stateful, the routing of the communications also needs to be stateful: the communication channel must take the same path back and forth between client and server. To accomplish this we have set the Routes to use the source load-balancing scheme and set session affinity on the Services to ClientIP. The trade-off is that this can affect scaling.

Listed in the resources section below are some articles on how to horizontally scale WebSocket applications based on the number of WebSocket connections rather than relying on CPU and memory metrics. The one describing how to accomplish this using Prometheus seems the most practical (though it would require the server, ACA-Py in this case, to support collection of the metric); the other discusses a custom HPA, which may not be feasible in our OCP environment without the support of the Platform services team. When it comes to scaling WebSockets, scaling down is the most sensitive part: you do not want to scale down prematurely and disconnect clients unexpectedly in the middle of an operation.
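
For illustration, the Prometheus approach means the server has to export a gauge of its open WebSocket connections, which an HPA could then consume through a custom-metrics adapter instead of CPU/memory. Below is a rough, generic sketch of exposing such a gauge; this is not ACA-Py code, and the metric name is made up.

```python
# Generic sketch of exposing a WebSocket connection-count gauge for Prometheus.
# Not ACA-Py code; the metric name and wiring are illustrative only.
import time

from prometheus_client import Gauge, start_http_server

WS_CONNECTIONS = Gauge(
    "mediator_websocket_connections",  # hypothetical metric name
    "Number of currently open WebSocket connections",
)


def on_ws_connect() -> None:
    """Call when a client WebSocket session is established."""
    WS_CONNECTIONS.inc()


def on_ws_disconnect() -> None:
    """Call when a client WebSocket session closes."""
    WS_CONNECTIONS.dec()


if __name__ == "__main__":
    start_http_server(9100)  # serve /metrics for Prometheus to scrape
    while True:
        time.sleep(60)  # keep the process alive; the real server would run here
```

A conservative scale-down policy keyed off that gauge, rather than CPU or memory, would then avoid dropping clients mid-operation.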

So we need two things: a reliable way to monitor the stability of the WebSocket communication channel(s), and a way to reliably scale up and down based on the number of WebSocket connections.

Resources:

@WadeBarnes (Member) commented:

Adding some additional comments I had regarding monitoring while we were troubleshooting the issues. The ways to monitor the ability to establish a WebSocket connection with the mediator, mentioned in the comments below, refer to the monitoring services linked in the resources section above.


Circling back to the service monitoring conversation. There are ways to monitor the ability to establish a WebSocket connection with the mediator; however, I don't think that is going to give us anything we don't already have with the HTTP-based monitoring. The current HTTP-based monitoring reaches through the proxy to the agent's health endpoints (specifically /status/ready). To adequately detect the issues seen yesterday with the multi-instance proxy, I would think we'd need to establish a connection and perform some messaging. I'm under the impression the issues seen yesterday were not with the initial connection, but with the connection being lost at some point during communications. Is that a correct analysis?

If so, does anyone have any thoughts on how we could test the WebSocket communications without a lot of overhead on the agent? I'm assuming we'd have to set up another agent in order to accomplish this goal: an agent monitoring service.

We could then interface that with standard uptime monitoring tools via an API (a rough sketch of what that could look like follows below).

Does this sound too complicated? Is there a simple way to implement this, or an even simpler way to accomplish the monitoring?
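
One hypothetical shape for that agent monitoring service: a small wrapper that runs the connect-and-message check on a schedule and exposes the latest result over HTTP so standard uptime tools can poll it. Everything below (the port, the response shape, and the stubbed check) is made up for illustration.

```python
# Hypothetical sketch of an "agent monitoring service": run a mediator check on a
# schedule and expose the latest result over HTTP for standard uptime tools to poll.
# The check itself is stubbed; a real one would message through a companion agent.
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE = {"healthy": False, "checked_at": 0.0}


def run_mediator_check() -> bool:
    """Placeholder: establish a connection and exchange a message via a companion agent."""
    return True  # stub


def check_loop(interval_s: int = 60) -> None:
    while True:
        STATE["healthy"] = run_mediator_check()
        STATE["checked_at"] = time.time()
        time.sleep(interval_s)


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 when the last check passed, 503 otherwise, so uptime tools can alert on it.
        code = 200 if STATE["healthy"] else 503
        body = json.dumps(STATE).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    threading.Thread(target=check_loop, daemon=True).start()
    HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()
```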

@cvarjao added the risk label Oct 30, 2023
@cvarjao (Member, Author) commented Dec 5, 2023

@esune, is there a ticket somewhere else so we can close this one?

@jeffaudette commented:
@cvarjao ask Emiliano if they have it; if so, we can close.

@esune (Member) commented Dec 8, 2023

I created bcgov/DITP-DevOps#145 to replace this issue. An alternative would be to move this issue to the DITP-DevOps repo if we want to keep the thread as-is rather than referencing it - either way works for me.

@cvarjao closed this as not planned Dec 12, 2023