Modernizing probes
Probes are typically small snippets or scripts that run in the Rucio environment to determine some metric or count, which is then pushed to our monitoring. CMS runs them with jobber, a lightweight cron-like program.
The Rucio probes are one of the oldest parts of the system and they typically have three problems.
- They may use very out-of-date Python constructs. They should be Python 3 compatible at this point, but often the code needs to be improved.
- They may use bare SQL to extract information from the Rucio database. A better approach is to use the DB models from Rucio itself to extract information.
- They may use a mix of old monitoring code based on statsd, while we now want everything to use Prometheus, at least as an option.
Let's take a look at a few examples of probes which illustrate these problems (and solutions):
https://github.com/rucio/probes/blob/528aaf376dfa3e46bed4ade5c80bfa436f89d40a/common/check_expired_rules This is part of a pull request and shows how to solve the third problem. Using the PrometheusPusher context manager, we can ensure that all the metrics generated in this context end up in our monitoring. However, this code still uses bare SQL as a base.
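To make the context-manager idea concrete, here is a minimal, standard-library-only sketch of the pattern. It is not the Rucio PrometheusPusher and does not reproduce its API: the class `MetricCollector`, its `gauge` method, the `fake_push` callable, and the metric name are all hypothetical names chosen for illustration. The sketch also assumes that nothing should be pushed if the probe body raises, which is a design choice of this example rather than a documented behavior of the real class.

```python
from contextlib import AbstractContextManager


class MetricCollector(AbstractContextManager):
    """Hypothetical illustration of the pattern; NOT the Rucio PrometheusPusher."""

    def __init__(self, push, job):
        self._push = push    # callable that ships the metrics (e.g. to a push gateway)
        self._job = job      # job name under which the metrics are reported
        self._gauges = {}

    def gauge(self, name, value, labels=None):
        # Key each gauge by name plus sorted labels so repeated sets overwrite.
        key = (name, tuple(sorted((labels or {}).items())))
        self._gauges[key] = value

    def __exit__(self, exc_type, exc, tb):
        # Assumption of this sketch: push only when the probe body succeeded.
        if exc_type is None:
            self._push(self._job, dict(self._gauges))
        return False  # never swallow exceptions from the probe body


pushed = []


def fake_push(job, metrics):
    """Stand-in for a real push; it only records what would be sent."""
    pushed.append((job, metrics))


# Every gauge set inside the block is pushed once, when the block exits.
with MetricCollector(fake_push, job="check_expired_rules") as metrics:
    metrics.gauge("judge.expired_rules", 42)  # hypothetical metric name and value
```

The point of the pattern is that the probe body only records values; the single push on exit keeps the delivery logic in one place instead of repeating it for every metric.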
https://github.com/rucio/probes/blob/master/cms/check_rule_counts on the other hand makes very nice use of the underlying database model in Rucio (see line 123 or so) but uses an older way (with the `PROM_SERVERS` and duplicate `Gauge`s) to get our metrics to Prometheus. This could be vastly simplified by making the Prometheus portion look like the first example.
Worst of all are probes like https://github.com/rucio/probes/blob/master/cms/check_missing_data_at_rse which have no way at all of getting data to Prometheus and use the older SQL syntax.
So the task in modernizing the probes is to bring them all up to the same standard.
And then we come to the problem of visualization. Changes to these probes will change the metrics displayed by the dashboards. All CMS dashboards are here: https://monit-grafana.cern.ch/d/000000530/cms-monitoring-project?orgId=11 (available with a CERN CMS login). The relevant ones have "Rucio" in the name and can be found under the DM Ops tab, the Development tab, or both. You will notice these are in sorry shape; many are already broken for unknown reasons.
For each changed probe, we would need to find the relevant dashboards and update them for possible new metric names, etc. Because of the way the new Rucio monitoring framework works, some metrics will be getting new names. That's unavoidable, but should leave us in a better place because the metrics will be more consistently named.
One last problem is that not every probe comes from the Rucio core code anymore, since we carry partial fixes for some of these problems and for others. https://github.com/dmwm/CMSRucio/blob/master/docker/rucio-probes/Dockerfile shows how we actually build our probes image. You can see there are many probes coming from pull requests, some of them dating back quite far.
So to change a probe, please start with the version in the Dockerfile, not necessarily the version in the main repository.
Perhaps the easiest way to test a new probe is to run it in the integration environment. First, open a shell in the probe container with `kubectl exec -it [probe container] -- bash`
In some way, get your new probe code into the container (git, cut-and-paste, etc.) and run it at the command line. To inspect the SQL that the model generates, you can either use the SQLAlchemy Engine mode where all queries are `echo`ed, or print the SQL for a specific query with something like this: http://xion.io/post/code/sqlalchemy-query-to-sql.html That will help you make sure the SQL you are getting from the model is correct.
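Both ways of inspecting the generated SQL can be sketched as follows. This is a self-contained illustration, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database. The `Rule` model, its table, and the helper functions are hypothetical stand-ins, not the real Rucio models; in an actual probe you would build the query from Rucio's own models and session.

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Rule(Base):
    """Hypothetical stand-in model; not the real Rucio rules table."""
    __tablename__ = "rules"
    id = Column(Integer, primary_key=True)
    state = Column(String(16))


def rule_counts_query(session):
    """Build (but do not run) a per-state count from the model, not bare SQL."""
    return session.query(Rule.state, func.count(Rule.id)).group_by(Rule.state)


def render_sql(query):
    """Return the SQL text of a query with bound parameters inlined."""
    return str(query.statement.compile(compile_kwargs={"literal_binds": True}))


# Option 1: echo=True makes the engine log every statement it executes.
engine = create_engine("sqlite:///:memory:", echo=True)
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all([Rule(state="OK"), Rule(state="OK"), Rule(state="STUCK")])
session.commit()

# Option 2: render the SQL for one specific query before running it.
query = rule_counts_query(session)
sql_text = render_sql(query)
print(sql_text)

# Run the query and collect the counts per state.
counts = dict(query.all())
print(counts)
```

Echo mode is convenient for a quick look at everything a probe sends to the database, while rendering a single query is less noisy when only one statement is in question.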