Analysis Facility metrics collector

AF Metrics Collector collects various metrics about user jobs from Kubernetes and batch systems and sends the collected metrics as JSON documents to an HTTPS endpoint.



  • Install the python3 and pip3 system packages

  • Install some python dependencies for afmetrics_collector: pip3 install -U setuptools setuptools_scm wheel importlib_metadata

  • Create a directory for the log files: mkdir /var/log/afmetrics

Install afmetrics_collector:

Clone the git repo onto your server

Navigate to the project directory

(optional) Make changes to setup.cfg

Build with python3: python3 setup.py bdist_wheel

Install with pip3: pip3 install dist/afmetrics_collector-0.0*.whl


pip3 uninstall afmetrics-collector

Sending documents

All documents should be POSTed to


Each document must have a token field. To get a token for your AF, please contact Ilija Vukotic. All documents are timestamped on the receiving end, so there is no need to send a timestamp. Here is an example of each metric.
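
As a rough illustration of the requirements above, the sketch below builds (but does not send) a POST request carrying one JSON document, using only Python's standard library. The endpoint URL, the `build_request` helper, and the document values are placeholders assumed for this example; they are not taken from afmetrics_collector itself.

```python
import json
import urllib.request

# Placeholder endpoint (assumption): substitute the real collection URL.
ENDPOINT = "https://collector.example.invalid/"


def build_request(document, token, url=ENDPOINT):
    """Return a POST request carrying `document` as JSON, with the token added."""
    payload = dict(document, token=token)  # every document needs a token field
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request({"kind": "CPU", "cluster": "UC-AF", "load": 0.01}, "<token>")
# urllib.request.urlopen(req) would send it; no timestamp field is included
# because the receiving end adds one.
```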

Batch queue status

This is obtained by parsing the output of the condor_q command.

    {
        "kind": "condorqueue",
        "cluster": "UC-AF",
        "queue": "all",
        "idle": 1995,
        "running": 1022,
        "held": 3,
        "allocated_cores": 966,
        "allocated_mem": 4145152
    }

SSH users

This is obtained by parsing the output of the who command.

    "kind": "ssh",
    "ssh_user_count": 4,
    "cluster": "UC-AF",
    "login_node": "",
    "users": [

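One way to derive the SSH fields from `who` output is sketched below. The `parse_who` helper and the sample output are hypothetical, and counting each user once (rather than once per session) is an assumption; the collector's own parsing may differ.

```python
def parse_who(output):
    """Return the sorted, de-duplicated login names found in `who` output."""
    users = set()
    for line in output.splitlines():
        fields = line.split()
        if fields:  # skip blank lines
            users.add(fields[0])  # the first column of `who` is the login name
    return sorted(users)


# Made-up sample output; real input would come from running `who`.
sample = (
    "alice    pts/0        2024-01-01 10:00 (10.0.0.1)\n"
    "bob      pts/1        2024-01-01 10:05 (10.0.0.2)\n"
    "alice    pts/2        2024-01-01 10:07 (10.0.0.1)\n"
)
users = parse_who(sample)
document = {
    "kind": "ssh",
    "cluster": "UC-AF",
    "ssh_user_count": len(users),
    "users": users,
}
```
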
Condor users

Obtained by parsing the output of the condor_q command.

    {
        "kind": "condorjob",
        "cluster": "UC-AF",
        "users": "brosser",
        "state": "finished",
        "Id": "174100.63",
        "Runtime": 177
    }

Jupyter users

    "kind": "jupyter-ml",
    "cluster": "UC-AF",
    "jupyter_user_count": 4,
    "users": [


One document per disk, reporting used, total, and free bytes as seen from /proc/diskstats.

    {
        "kind": "DISK",
        "cluster": "UC-AF",
        "used": 92355981312,
        "total": 1883690205184,
        "free": 1695620259840,
        "login_node": "",
        "utilization": 0.04902928361459448,
        "mount": "/scratch"
    }

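A document of the same shape as the disk example can be approximated with the standard library, as in the sketch below. It uses `shutil.disk_usage` as a stand-in rather than reading /proc/diskstats, and the `disk_document` helper is hypothetical. In the example values, `utilization` matches `used / total`, which is what the sketch computes.

```python
import shutil


def disk_document(mount, cluster, login_node=""):
    """Build a DISK-style document for one mount point (illustrative only)."""
    usage = shutil.disk_usage(mount)  # named tuple: total, used, free (bytes)
    utilization = usage.used / usage.total if usage.total else 0.0
    return {
        "kind": "DISK",
        "cluster": cluster,
        "used": usage.used,
        "total": usage.total,
        "free": usage.free,
        "login_node": login_node,
        "utilization": utilization,
        "mount": mount,
    }


doc = disk_document("/", "UC-AF")
```
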

The one-minute load average, as reported by os.getloadavg().

    {
        "kind": "CPU",
        "cluster": "UC-AF",
        "login_node": "",
        "load": 0.01
    }

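For reference, `os.getloadavg()` returns the 1-, 5-, and 15-minute load averages as a tuple, so only the first element is needed for the load document. A minimal sketch, with a hypothetical `cpu_document` helper:

```python
import os


def cpu_document(cluster, login_node=""):
    """Build a CPU-style document from the one-minute load average."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return {
        "kind": "CPU",
        "cluster": cluster,
        "login_node": login_node,
        "load": one_minute,
    }


doc = cpu_document("UC-AF")
```
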

The change in bytes_sent and bytes_recv since the previous measurement, together with the length of the time interval, as reported by psutil.net_io_counters().

    {
        "kind": "NETWORK",
        "cluster": "UC-AF",
        "login_node": "",
        "network": {
            "interval": 300.064,
            "sent": 6606414,
            "received": 11383868
        }
    }

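The differential can be sketched as a subtraction of two counter snapshots. To keep the example free of third-party imports, a small `Counters` named tuple stands in for the object returned by `psutil.net_io_counters()`, which exposes `bytes_sent` and `bytes_recv` attributes. The `network_delta` helper and the snapshot values are hypothetical, and how the collector stores the previous snapshot between runs is not described here.

```python
from collections import namedtuple

# Stand-in for the psutil.net_io_counters() result (only the fields used here).
Counters = namedtuple("Counters", ["bytes_sent", "bytes_recv"])


def network_delta(previous, current, interval):
    """Return bytes sent/received between two snapshots taken `interval` apart."""
    return {
        "interval": interval,
        "sent": current.bytes_sent - previous.bytes_sent,
        "received": current.bytes_recv - previous.bytes_recv,
    }


before = Counters(bytes_sent=1_000_000, bytes_recv=2_000_000)
after = Counters(bytes_sent=7_606_414, bytes_recv=13_383_868)
delta = network_delta(before, after, 300.064)
```
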

Host total and available memory, in bytes, as reported by psutil.virtual_memory().

    {
        "kind": "MEM",
        "cluster": "UC-AF",
        "login_node": "",
        "available": 265041854464,
        "total": 269903060992
    }

Usage and examples

In the following, replace all instances of <token> with your actual token, and likewise for <cluster>, <domain>, <salt>, and so on. These are placeholders for your real values.

A typical full installation will collect ssh, batch (condor), jupyter, and host metrics and the command to run may look something like this:

afmetrics_collector -v -sjb --host -t "<token>" -c "<cluster>"

The associated cron job to run this every 5 minutes (the default and recommended interval) may look like this:

*/5 * * * * root (KUBECONFIG=/etc/kubernetes/admin.conf /usr/local/bin/afmetrics_collector -v -sjb --host -t "<token>" -c "<cluster>") >> /var/log/afmetrics/afmetrics.log 2>&1

Advanced Usage

Batch queue status reporting

By default, no batch queue status is reported. To add queue status reporting, pass each queue with -q in the format "queuename:condor query constraint expression". For example, the following command collects the status of two queues: the all-inclusive queue "all", which has an empty constraint, and the "short" queue, which can be queried with the constraint queue=="short".

afmetrics_collector -vv -b -t "<token>" -c "<cluster>" -q all: -q 'short:queue=="short"'
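
The "queuename:constraint" format can be split on the first colon, so that colons or quotes inside the constraint expression are preserved. The sketch below shows one way to do that; `parse_queue_spec` is a hypothetical helper, not a function from afmetrics_collector.

```python
def parse_queue_spec(spec):
    """Split 'queuename:constraint' on the first colon only."""
    name, _separator, constraint = spec.partition(":")
    return name, constraint


all_queue = parse_queue_spec("all:")  # empty constraint: matches every job
short_queue = parse_queue_spec('short:queue=="short"')
```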

SSH history

Only usable on systems whose version of the last command includes the -s option.

In addition to the ssh users currently logged in, the -S flag includes users who logged in within the last 5 minutes. This covers the edge case of users who log in and out again between two collection runs.

afmetrics_collector -vv -sjb -S --host -t "<token>" -c "<cluster>"

Group Filtering

This is useful on systems that serve as login nodes for many users unaffiliated with what you are interested in. Add the -g or --group flag to filter for a specific group.

For example, if you are only interested in ssh logins, jupyter, and batch jobs of users in group 'atlas', the command may look like the following:

afmetrics_collector -vv -sjb --host -t "<token>" -c "<cluster>" -g "atlas"


For debugging, you can opt to output everything to a local file instead of sending it to the logstash server with the -d flag:

afmetrics_collector -d -vv -sjb --host -t "<token>" -c "<cluster>"

This writes .json files to your current directory and very verbose (-vv) logs to /var/log/afmetrics/afmetrics.log. Running it from within the /var/log/afmetrics directory keeps all the output in one place. A token is not necessary for debugging, so you can use -d before you have one.

Data Obfuscation and security

For sites that wish to share usage metrics, but not info such as usernames and hostnames, data obfuscation flags -o, -O, and -z have been added:

-o : user name obfuscation

-O : host name obfuscation, followed by a string domain name, ex.: -O ''

-z : (optional) salt to make user obfuscation more secure, ex.: -o -z '5tKC%>f&%#hg'

afmetrics_collector can be run as a user other than root. If you wish to do this, make sure the ownership and permissions of the /var/log/afmetrics directory allow the desired user to write to it.

A full example using all of the obfuscation and a local debug running as user 'nobody', along with a group filter might look something like this:

su -s /bin/bash -c '(/usr/local/bin/afmetrics_collector -d -vv -sbj --host -t "<token>" -o -O "<domain>" -c "<cluster>" -z "<salt>" -g "<group>") 2>&1' nobody

The associated cron file /etc/cron.d/afmetrics.cron running all of the above in non-debug mode may look like this:

### Afmetrics Collector ###
*/5 * * * * nobody (/usr/local/bin/afmetrics_collector -vv -sbj --host  -t "<token>" -o -O "<domain>" -c "<cluster>" -z "<salt>" -g "<group>") >> /var/log/afmetrics/afmetrics.log 2>&1

How it works:

Username obfuscation simply MD5-hashes the username and truncates the hash to its last 8 characters. A salt can be added to the username hash to strengthen it against rainbow table attacks. If a salt is used, make sure to use the same salt value across all your login nodes; otherwise a user who logs in on several nodes will be counted as a different user on each one.
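
The scheme described above can be sketched as follows. How the salt is combined with the username (here: simply prepended) and the use of the hexadecimal digest are assumptions made for illustration; `obfuscate_username` is a hypothetical helper, so check the project source for the exact behaviour.

```python
import hashlib


def obfuscate_username(username, salt=""):
    """Return the last 8 hex characters of the MD5 hash of salt + username."""
    digest = hashlib.md5((salt + username).encode("utf-8")).hexdigest()
    return digest[-8:]


unsalted = obfuscate_username("brosser")
salted_node_a = obfuscate_username("brosser", salt="<salt>")
salted_node_b = obfuscate_username("brosser", salt="<salt>")
```

Because the hash is deterministic, two login nodes configured with the same salt produce the same obfuscated name for a given user, which is why the salt must match across nodes.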

Hostname obfuscation is very basic so may need to be modified to suit your facility. It simply takes your hostname, strips off everything except the numbers, and prepends 'atlas' and appends your provided domain name string.
For example if your host is called and you call the hostname obfuscation flag with -O "" you will get as your obfuscated domain name.
Provided the numbers from all your login hosts are different you should end up with no collisions. Modify in src/afmetrics_collector/ to suit your needs
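
A minimal sketch of the hostname scheme described above is shown below. The `obfuscate_hostname` helper, the example host and domain names, and the "." used to join the prefix to the domain are all assumptions for illustration; the actual function in the project may join the parts differently.

```python
import re


def obfuscate_hostname(hostname, domain, prefix="atlas"):
    """Keep only the digits of `hostname`, then add the prefix and the domain."""
    digits = re.sub(r"\D", "", hostname)  # strip everything except the numbers
    return f"{prefix}{digits}.{domain}"  # the "." separator is an assumption


obfuscated = obfuscate_hostname("login07.example.org", "example.net")
```

Two hosts whose names contain the same digits would map to the same obfuscated name, which is why the numbers on your login hosts need to differ.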


This project has been set up using PyScaffold 4.2.1. For details and usage information on PyScaffold see