Architecture suggestion for publishing metrics to Prometheus from many ephemeral workers #484
Replies: 2 comments
-
This smells like the "distributed counter" use case; see the non-goals section in the README.md. Quoting from there: "If you need distributed counting, you could either use the actual statsd in combination with the Prometheus statsd exporter, or have a look at Weavework's aggregation gateway."

One could also argue that the workers aren't actually short-lived: "autoscaled" doesn't mean "short-lived" per se. IIUC, the workers aren't terminating themselves after each processed workload; they are only shut down when the queue doesn't hold enough work for the current number of workers. Therefore, you could just scrape the workers normally and accept that you'll underreport a bit whenever a downscaling happens.

If you need exact reporting, one might argue that Prometheus isn't the right system for that. You would rather need an event-processing story with some guarantees of completeness (which could then be monitored by Prometheus, in turn). Another way would be to instrument the binaries constituting the queueing service themselves. In any case, the Pushgateway as it is is not really made for this kind of use case.
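For illustration, a minimal sketch of the "scrape the workers normally" approach, assuming the workers are written in Go with client_golang (the metric names, the port, and the queue loop are hypothetical):

```go
// A worker that exposes its own /metrics endpoint instead of pushing.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "worker_jobs_processed_total",
		Help: "Jobs this worker has processed.",
	})
	jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "worker_job_duration_seconds",
		Help: "Time spent processing a single job.",
	})
)

func main() {
	// Serve metrics in the background; Prometheus (e.g. via a PodMonitor)
	// can then discover and scrape each worker pod on this port.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		http.ListenAndServe(":2112", nil)
	}()

	for {
		start := time.Now()
		processNextJob() // hypothetical: pop one message from the queue and run it
		jobDuration.Observe(time.Since(start).Seconds())
		jobsProcessed.Inc()
	}
}

// processNextJob stands in for the real queue pop + work execution.
func processNextJob() {
	time.Sleep(100 * time.Millisecond)
}
```

Each worker then carries its own counters for its whole lifetime; counter resets on pod restarts are handled by Prometheus's usual rate() semantics, and only the samples since the last scrape of a downscaled pod are lost.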
-
Thank you for pointing towards the "non-goals" section. I will also look into the Weaveworks aggregation gateway and check it for correctness with distributed metric reporting; we do not want to introduce another system. What I also understand from your comment is that we could run an additional HTTP server on an async consumer (a reasonably long-living process) to expose a metrics endpoint to be scraped by Prometheus. The discovery of the async consumer pods can be facilitated using the PodMonitor feature of the Prometheus Operator.

In the meanwhile, we have chosen to run a cronjob that deletes groups that have not been updated within some threshold time, based on push_time_seconds. This helps us avoid building up a large number of stale metric groups over time.
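For reference, a sketch of such a cleanup job in Go: it reads push_time_seconds from the Pushgateway's own /metrics endpoint and issues a DELETE against the documented `/metrics/job/<job>{/<label>/<value>}` endpoint for each group older than a threshold. The Pushgateway address and the threshold are assumptions.

```go
// Cleanup cronjob: delete Pushgateway groups whose last push is too old.
package main

import (
	"log"
	"net/http"
	"net/url"
	"time"

	"github.com/prometheus/common/expfmt"
)

const (
	pushgateway = "http://pushgateway:9091" // assumed address
	maxAge      = 30 * time.Minute          // assumed staleness threshold
)

func main() {
	resp, err := http.Get(pushgateway + "/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	pt, ok := families["push_time_seconds"]
	if !ok {
		return // nothing has been pushed yet
	}
	for _, m := range pt.GetMetric() {
		last := time.Unix(int64(m.GetGauge().GetValue()), 0)
		if time.Since(last) < maxAge {
			continue
		}
		// Rebuild the grouping-key URL path: /metrics/job/<job>/<label>/<value>...
		jobPart, labelParts := "", ""
		for _, lp := range m.GetLabel() {
			if lp.GetName() == "job" {
				jobPart = "/job/" + url.PathEscape(lp.GetValue())
			} else {
				labelParts += "/" + url.PathEscape(lp.GetName()) + "/" + url.PathEscape(lp.GetValue())
			}
		}
		req, err := http.NewRequest(http.MethodDelete, pushgateway+"/metrics"+jobPart+labelParts, nil)
		if err != nil {
			log.Printf("building delete request: %v", err)
			continue
		}
		delResp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("deleting stale group %s%s: %v", jobPart, labelParts, err)
			continue
		}
		delResp.Body.Close()
		log.Printf("deleted stale group %s%s", jobPart, labelParts)
	}
}
```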
-
Hello. We have an execution model that is typical of a producer-consumer pattern with a queue/topic in the middle. Currently, the queue holds work of the same type from multiple customers/tenants. The consumers/workers are non-HTTP applications that pop a message from the queue and execute the work. These consumers are Kubernetes pods spun up by a Deployment and configured to autoscale based on the work available on the queue. We would like to track at least two metrics about the performance of the workers and the backlog burn-up.
We were trying to send these through the Prometheus Pushgateway, but it looks like it has a design philosophy which, for us,
a) can result in metric overwrites from multiple workers/pods if they try to use the same job/group name, or
b) can result in garbage build-up if the job/group name is based on the instance/pod name, as pods come and go over longer periods of time (both pitfalls are sketched below).
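For concreteness, a minimal sketch of the two failure modes, assuming Go and client_golang's push package; the Pushgateway URL and the counter are hypothetical.

```go
package main

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

var jobsProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "worker_jobs_processed_total",
	Help: "Jobs this worker has processed.",
})

func main() {
	jobsProcessed.Inc()

	// (a) Shared group {job="queue-worker"}: every pod pushes to the same
	// group, so each Push replaces the values pushed by the other pods.
	push.New("http://pushgateway:9091", "queue-worker").
		Collector(jobsProcessed).
		Push()

	// (b) Per-pod group (in Kubernetes, HOSTNAME is the pod name): no
	// overwrites, but the groups of terminated pods linger on the
	// Pushgateway until something deletes them.
	push.New("http://pushgateway:9091", "queue-worker").
		Grouping("pod", os.Getenv("HOSTNAME")).
		Collector(jobsProcessed).
		Push()
}
```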
What is the best way to publish metrics from these ephemeral workers, or from this kind of offline-processing pattern in general? We could not find an architecture/design pattern documented for this problem.
Possible options:
- push_time_seconds
- gauge metric