
Monitoring Mediator Service #748

Closed
cvarjao opened this issue Nov 24, 2022 · 5 comments

@cvarjao (Member) commented Nov 24, 2022

When we turned on HPA in OpenShift, something broke and we didn't get any bells going off.

@WadeBarnes (Member) commented Nov 29, 2022

Bells did not go off because there wasn't an issue with the health endpoint for the mediator agent; switching to HA did not adversely impact the HTTP endpoints, only the WebSocket connections. We are currently monitoring the agent's admin health endpoints, in particular the /status/ready endpoint.
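
For context, the existing HTTP check amounts to something like the minimal sketch below. The admin URL is hypothetical, and the `ready` field check assumes the usual shape of ACA-Py's /status/ready response; this is not the actual monitoring code.

```python
# Minimal sketch of the existing style of HTTP health check.
# The admin URL is hypothetical; the "ready" field assumes the usual
# shape of ACA-Py's /status/ready response.
import requests

ADMIN_URL = "https://mediator-admin.example.com"  # hypothetical admin endpoint


def mediator_ready(timeout: float = 5.0) -> bool:
    """Return True if the agent's admin /status/ready endpoint reports ready."""
    try:
        resp = requests.get(f"{ADMIN_URL}/status/ready", timeout=timeout)
        return resp.ok and resp.json().get("ready", False)
    except (requests.RequestException, ValueError):
        return False
```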

Mobile apps communicate with the mediator over WebSockets rather than HTTP. WebSockets are stateful and were being impacted by the default routing scheme of the OpenShift Routes and Services: traffic was not always routed over the same path between client and server, so the connections were breaking.

There are services available that can monitor WebSockets (see the resources section below); however, they appear to only test the initial connection, not long-term communication over the socket. In the case of the mediators, the initial connection is not the issue; the long-term communication path over the connection is. As mentioned above, the route the communications take is affected by the OpenShift resource routing policies.
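
For reference, a check that exercises the communication path over time, rather than just the handshake, could look roughly like the minimal sketch below. The wss:// URL is hypothetical, and plain WebSocket pings only exercise the transport; a true end-to-end check would need DIDComm messaging through another agent, as discussed in the later comments.

```python
# Minimal sketch of a long-lived WebSocket stability check.
# The URL is hypothetical; plain pings only exercise the transport path,
# not DIDComm messaging through the mediator.
import asyncio
import websockets

MEDIATOR_WS_URL = "wss://mediator.example.com/ws"  # hypothetical endpoint


async def check_ws_stability(duration_s: int = 300, interval_s: int = 15) -> bool:
    """Hold one connection open and ping periodically; fail if the path breaks mid-session."""
    try:
        async with websockets.connect(MEDIATOR_WS_URL) as ws:
            for _ in range(duration_s // interval_s):
                pong_waiter = await ws.ping()
                await asyncio.wait_for(pong_waiter, timeout=10)
                await asyncio.sleep(interval_s)
        return True
    except Exception:
        return False


if __name__ == "__main__":
    print("stable" if asyncio.run(check_ws_stability()) else "unstable")
```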

We are still testing updates to the mediators' HA configuration. Since WebSockets are stateful, the routing of the communications also needs to be stateful: the communication channel must take the same path back and forth between client and server. To accomplish this we have set the Routes to use the source load-balancing scheme and set session affinity on the Services to ClientIP. The trade-off is that this can affect scaling.

Listed in the resources section below are some articles on how to horizontally scale WebSocket applications based on the number of WebSocket connections rather than relying on CPU and memory metrics. The one describing how to accomplish this using Prometheus seems the most practical (though it would require the server, ACA-Py in this case, to support collection of the metric); the other discusses a custom HPA, which may not be feasible in our OCP environment without the support of the Platform services team. When it comes to scaling WebSockets, scaling down is the most sensitive part: you do not want to scale down prematurely and disconnect clients unexpectedly in the middle of an operation.
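
For illustration, the Prometheus approach means the server has to export a gauge of its open WebSocket connections, which an HPA could then consume through a custom-metrics adapter instead of CPU/memory. Below is a rough, generic sketch of exposing such a gauge; this is not ACA-Py code, and the metric name is made up.

```python
# Generic sketch of exposing a WebSocket connection-count gauge for Prometheus.
# Not ACA-Py code; the metric name and wiring are illustrative only.
import time

from prometheus_client import Gauge, start_http_server

WS_CONNECTIONS = Gauge(
    "mediator_websocket_connections",  # hypothetical metric name
    "Number of currently open WebSocket connections",
)


def on_ws_connect() -> None:
    """Call when a client WebSocket session is established."""
    WS_CONNECTIONS.inc()


def on_ws_disconnect() -> None:
    """Call when a client WebSocket session closes."""
    WS_CONNECTIONS.dec()


if __name__ == "__main__":
    start_http_server(9100)  # serve /metrics for Prometheus to scrape
    while True:
        time.sleep(60)  # keep the process alive; the real server would run here
```

A conservative scale-down policy keyed off that gauge, rather than CPU or memory, would then avoid dropping clients mid-operation.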

So we need two things: a reliable way to monitor the stability of the WebSocket communication channel(s), and a way to reliably scale up and down based on the number of WebSocket connections.

Resources:

@WadeBarnes (Member) commented:

Adding some additional comments I had regarding monitoring while we were troubleshooting the issues. The ways to monitor the ability to establish a WebSocket connection with the mediator, mentioned in the comments below, refer to the monitoring services linked in the resources section above.


Circling back to the service monitoring conversation. There are ways to monitor the ability to establish a WebSocket connection with the mediator; however, I don't think that is going to give us anything we don't already have with the HTTP-based monitoring. The current HTTP-based monitoring reaches through the proxy to the agent's health endpoints (specifically /status/ready). To adequately detect the issues seen yesterday with the multi-instance proxy, I would think we'd need to establish a connection and perform some messaging. I'm under the impression the issues seen yesterday were not with the initial connection, but with the connection being lost at some point during communications. Is that a correct analysis?

If so, does anyone have any thoughts on how we could test the WebSocket communications without a lot of overhead on the agent? I'm assuming we'd have to set up another agent in order to accomplish this goal: an agent monitoring service.

We could then interface that with standard uptime monitoring tools via an API (a rough sketch of what that could look like follows below).

Does this sound too complicated? Is there a simple way to implement this, or an even simpler way to accomplish the monitoring?
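
One hypothetical shape for that agent monitoring service: a small wrapper that runs the connect-and-message check on a schedule and exposes the latest result over HTTP so standard uptime tools can poll it. Everything below (the port, the response shape, and the stubbed check) is made up for illustration.

```python
# Hypothetical sketch of an "agent monitoring service": run a mediator check on a
# schedule and expose the latest result over HTTP for standard uptime tools to poll.
# The check itself is stubbed; a real one would message through a companion agent.
import json
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

STATE = {"healthy": False, "checked_at": 0.0}


def run_mediator_check() -> bool:
    """Placeholder: establish a connection and exchange a message via a companion agent."""
    return True  # stub


def check_loop(interval_s: int = 60) -> None:
    while True:
        STATE["healthy"] = run_mediator_check()
        STATE["checked_at"] = time.time()
        time.sleep(interval_s)


class StatusHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 200 when the last check passed, 503 otherwise, so uptime tools can alert on it.
        code = 200 if STATE["healthy"] else 503
        body = json.dumps(STATE).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    threading.Thread(target=check_loop, daemon=True).start()
    HTTPServer(("0.0.0.0", 8080), StatusHandler).serve_forever()
```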

@cvarjao added the risk label Oct 30, 2023
@cvarjao (Member, Author) commented Dec 5, 2023

@esune, is there a ticket somewhere else so we can close this one?

@jeffaudette commented:
@cvarjao ask Emiliano if they have it; if so, we can close.

@esune (Member) commented Dec 8, 2023

I created bcgov/DITP-DevOps#145 to replace this issue. An alternative would be to move this issue to the DITP-DevOps repo if we want to keep the thread as-is rather than referencing it - either way works for me.

@cvarjao closed this as not planned Dec 12, 2023