Reference Metrics List

This section lists the metrics collected by the exporters used in the Chef Managed Automate HA implementation. Similar metrics may be collected from AWS-hosted deployments.

Disclaimer

The following metrics are recommended for monitoring a Chef Automate HA implementation. Use them as a guide for building monitoring rules and dashboards. Actual usage and adoption depend on each organization's infrastructure monitoring policy.

System Metrics

  1. Refer to the following exporters for the metric details.

  2. The following metrics are configured to generate alerts.

    | Component | Metrics Expr |
    |---|---|
    | CPU Usage | `100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance,job) * 100) > 95` |
    | CPU Steal | `(avg(irate(node_cpu_seconds_total{mode="steal"}[5m]) * 100) by(instance,job))> 20` |
    | System Memory Usage | `100 - (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) > 95` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 85` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 90` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/hab"}/node_filesystem_size_bytes{mountpoint="/hab"}*100) > 85` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/hab"}/node_filesystem_size_bytes{mountpoint="/hab"}*100) > 90` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/tmp"}/node_filesystem_size_bytes{mountpoint="/tmp"}*100) > 85` |
    | Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/tmp"}/node_filesystem_size_bytes{mountpoint="/tmp"}*100) > 90` |
    | Host Monitoring | `up == 0` |
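
The system expressions listed here can be wrapped in a Prometheus alerting rule file. The following is a minimal sketch, not the rule set shipped with Chef Automate HA: the group name, alert names, `for` durations, severity labels, and annotation text are assumptions chosen for illustration, as is the mapping of the 85% and 90% disk thresholds to warning and critical.

```yaml
# Hypothetical rule file; only the expr values come from the metrics list.
groups:
  - name: system-alerts            # assumed group name
    rules:
      - alert: HighCpuUsage        # assumed alert name
        expr: '100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance,job) * 100) > 95'
        for: 5m                    # assumed hold time before firing
        labels:
          severity: critical       # assumed severity
        annotations:
          summary: "CPU usage above 95% on {{ $labels.instance }}"

      - alert: RootDiskUtilizationWarning
        expr: '100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 85'
        for: 5m
        labels:
          severity: warning        # assumed: lower threshold treated as warning
        annotations:
          summary: "Root filesystem above 85% on {{ $labels.instance }}"

      - alert: RootDiskUtilizationCritical
        expr: '100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 90'
        for: 5m
        labels:
          severity: critical       # assumed: higher threshold treated as critical
        annotations:
          summary: "Root filesystem above 90% on {{ $labels.instance }}"

      - alert: HostDown
        expr: 'up == 0'
        for: 1m                    # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```

A file like this can be validated with `promtool check rules <file>` before it is loaded. The `/hab` and `/tmp` rules would follow the same pattern with the mountpoint changed.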

Chef Automate Health Metrics

  1. Refer to the following exporters for the metric details.

  2. The following metrics are configured to generate alerts.

    | Component | Metrics Expr |
    |---|---|
    | Hab Service Status | `probe_http_status_code{job=~"chef-server-services.*` |
    | Hab Service Status | `probe_http_status_code{job=~"chef-server-services.*` |
    | Automate LB 5XX Alert | `probe_http_status_code{job=~"chef-server-url` |
    | Chef-Server LB 5XX Alert | `probe_http_status_code{job=~"chef-server-url` |

OpenSearch Metrics

  1. Refer to the following OpenSearch plugin for the metric details.

  2. The following metrics are configured to generate alerts.

    | Component | Metrics Expr |
    |---|---|
    | ES Cluster Health Check | `opensearch_cluster_nodes_number < 2` |
    | ES Heap Usage Factor | `opensearch_jvm_mem_heap_used_percent > 95` |
    | ES Performance Alert | `opensearch_index_search_fetch_time_seconds > 30` |
    | ES Performance Alert | `opensearch_index_search_fetch_time_seconds > 60` |
    | ES Indexing Latency Alert | `opensearch_index_indexing_index_time_seconds > 500` |
    | Elasticsearch Search Latency Alert | `opensearch_index_search_query_time_seconds > 60` |
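
As a rough illustration of how the OpenSearch expressions could be turned into alerting rules, the sketch below covers the cluster-size and heap checks. Only the `expr` values are taken from the list; the group name, alert names, `for` durations, severity labels, and annotations are hypothetical.

```yaml
# Hypothetical rule file; only the expr values come from the metrics list.
groups:
  - name: opensearch-alerts            # assumed group name
    rules:
      - alert: OpenSearchClusterNodesLow   # assumed alert name
        expr: 'opensearch_cluster_nodes_number < 2'
        for: 5m                        # assumed hold time before firing
        labels:
          severity: critical           # assumed severity
        annotations:
          summary: "OpenSearch cluster reports fewer than 2 nodes"

      - alert: OpenSearchHeapUsageHigh
        expr: 'opensearch_jvm_mem_heap_used_percent > 95'
        for: 10m                       # assumed; tolerates short heap spikes before GC
        labels:
          severity: warning            # assumed severity
        annotations:
          summary: "JVM heap usage above 95% on {{ $labels.instance }}"
```

The `for` clause is the main tuning knob here: a longer window reduces noise from short-lived heap spikes at the cost of slower notification.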

PostgreSQL Metrics

  1. Refer to the following PostgreSQL exporter for the metric details.

  2. The following metrics are configured to generate alerts.

    | Component | Metrics Expr |
    |---|---|
    | PG Can Connect | `pg_up != 1` |
    | Connection Exhaustion | `(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance,job))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance,job)) * 100 > 90` |
    | Connection Exhaustion | `(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance)) * 100 > 95` |
    | Managed PostgreSQL Write Latency | `irate(node_disk_write_time_seconds_total{instance=~".*pg.*"}[5m]) / irate(node_disk_writes_completed_total{instance=~".*pg.*"}[5m]) > 300` |
    | Managed PostgreSQL Read Latency | `irate(node_disk_read_time_seconds_total{instance=~".*pg.*"}[5m]) / irate(node_disk_reads_completed_total{instance=~".*pg.*"}[5m]) > 300` |
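
The two Connection Exhaustion expressions differ only in threshold and grouping, which suggests a two-tier alert. The sketch below treats 90% as a warning and 95% as critical; that tiering, along with the group name, alert names, `for` durations, and annotations, is an assumption rather than something the list states. The `server` label value is copied from the list and would need to match the target in your own environment.

```yaml
# Hypothetical rule file; only the expr values come from the metrics list.
groups:
  - name: postgresql-alerts              # assumed group name
    rules:
      - alert: PostgresDown              # assumed alert name
        expr: 'pg_up != 1'
        for: 1m                          # assumed hold time before firing
        labels:
          severity: critical             # assumed severity
        annotations:
          summary: "PostgreSQL exporter cannot connect on {{ $labels.instance }}"

      - alert: PostgresConnectionsWarning
        expr: '(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance,job))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance,job)) * 100 > 90'
        for: 5m
        labels:
          severity: warning              # assumed: 90% treated as warning
        annotations:
          summary: "PostgreSQL connections above 90% of max_connections"

      - alert: PostgresConnectionsCritical
        expr: '(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance)) * 100 > 95'
        for: 5m
        labels:
          severity: critical             # assumed: 95% treated as critical
        annotations:
          summary: "PostgreSQL connections above 95% of max_connections"
```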