This section documents the metrics collected by the various exporters used in a Chef Managed Automate HA implementation. Similar metrics may be collected from AWS-hosted deployments.
The following metrics are recommended for monitoring a Chef Automate HA implementation. Use them as a guide when building monitoring rules and dashboards. Actual usage and adoption of the metrics depend on each organization's infrastructure monitoring policy.
-
Refer to the following exporters for the metric details.
-
The following metrics are configured to generate alerts.
| Component Metrics | Expr |
| --- | --- |
| CPU Usage | `100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance,job) * 100) > 95` |
| CPU Steal | `(avg(irate(node_cpu_seconds_total{mode="steal"}[5m]) * 100) by(instance,job))> 20` |
| System Memory Usage | `100 - (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100) > 95` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 85` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 90` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/hab"}/node_filesystem_size_bytes{mountpoint="/hab"}*100) > 85` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/hab"}/node_filesystem_size_bytes{mountpoint="/hab"}*100) > 90` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/tmp"}/node_filesystem_size_bytes{mountpoint="/tmp"}*100) > 85` |
| Disk Utilization | `100 - (node_filesystem_avail_bytes{mountpoint="/tmp"}/node_filesystem_size_bytes{mountpoint="/tmp"}*100) > 90` |
| Host Monitoring | `up == 0` |
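To illustrate how expressions like these could be wired into Prometheus, the following is a minimal sketch of an alerting rule file covering the two `/` disk utilization thresholds. Only the `expr` values come from the list above. The group name, alert names, `for` durations, severity labels, and annotation text are assumptions made for this example and do not describe the actual Chef Automate HA configuration.

```yaml
# Hypothetical rule file; names, durations, and severities are illustrative.
groups:
  - name: node-disk-alerts            # assumed group name
    rules:
      - alert: RootDiskUtilizationWarning     # assumed alert name
        expr: '100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 85'
        for: 10m                      # assumed: must hold for 10 minutes before firing
        labels:
          severity: warning           # assumed mapping of the 85 threshold
        annotations:
          summary: "Root filesystem above 85% on {{ $labels.instance }}"
      - alert: RootDiskUtilizationCritical    # assumed alert name
        expr: '100 - (node_filesystem_avail_bytes{mountpoint="/"}/node_filesystem_size_bytes{mountpoint="/"}*100) > 90'
        for: 5m                       # assumed
        labels:
          severity: critical          # assumed mapping of the 90 threshold
        annotations:
          summary: "Root filesystem above 90% on {{ $labels.instance }}"
```

With this layout both rules fire once utilization passes 90. An Alertmanager inhibit rule can suppress the lower-severity alert while the higher one is active, if that behavior is wanted.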
-
Refer to the following exporters for the metric details.
-
The following metrics are configured to generate alerts.
| Component Metrics | Expr |
| --- | --- |
| Hab Service Status | `probe_http_status_code{job=~"chef-server-services.*` |
| Hab Service Status | `probe_http_status_code{job=~"chef-server-services.*` |
| Automate LB 5XX Alert | `probe_http_status_code{job=~"chef-server-url` |
| Chef-Server LB 5XX Alert | `probe_http_status_code{job=~"chef-server-url` |
-
Refer to the following OpenSearch plugin for the metric details.
-
The following metrics are configured to generate alerts.
| Component Metrics | Expr |
| --- | --- |
| ES Cluster Health Check | `opensearch_cluster_nodes_number < 2` |
| ES Heap Usage Factor | `opensearch_jvm_mem_heap_used_percent > 95` |
| ES Performance Alert | `opensearch_index_search_fetch_time_seconds > 30` |
| ES Performance Alert | `opensearch_index_search_fetch_time_seconds > 60` |
| ES Indexing latency Alert | `opensearch_index_indexing_index_time_seconds > 500` |
| Elasticsearch Search latency Alert | `opensearch_index_search_query_time_seconds > 60` |
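A threshold such as heap usage above 95 can be crossed briefly during normal garbage collection cycles, so a rule built on it usually benefits from a `for` clause that requires the condition to persist. The sketch below shows one way this might look for the cluster health and heap usage expressions listed above. The `expr` values are taken from that list. The group name, alert names, `for` durations, severity labels, and annotations are assumptions for illustration only.

```yaml
# Hypothetical rule file; names, durations, and severities are illustrative.
groups:
  - name: opensearch-alerts           # assumed group name
    rules:
      - alert: OpenSearchClusterNodesLow      # assumed alert name
        expr: 'opensearch_cluster_nodes_number < 2'
        for: 2m                       # assumed: tolerate a short node restart
        labels:
          severity: critical          # assumed
        annotations:
          summary: "OpenSearch cluster reports fewer than 2 nodes"
      - alert: OpenSearchHeapUsageHigh        # assumed alert name
        expr: 'opensearch_jvm_mem_heap_used_percent > 95'
        for: 10m                      # assumed: ignore short spikes between GC runs
        labels:
          severity: warning           # assumed
        annotations:
          summary: "JVM heap above 95% on {{ $labels.instance }}"
```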
-
Refer to the following exporter for the metric details.
-
The following metrics are configured to generate alerts:
| Component Metrics | Expr |
| --- | --- |
| PG Can Connect | `pg_up != 1` |
| Connection Exhaustion | `(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance,job))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance,job)) * 100 > 90` |
| Connection Exhaustion | `(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance)) * 100 > 95` |
| Managed PostgreSQL Write Latency | `irate(node_disk_write_time_seconds_total{instance=~".*pg.*"}[5m]) / irate(node_disk_writes_completed_total{instance=~".*pg.*"}[5m]) > 300` |
| Managed PostgreSQL Read Latency | `irate(node_disk_read_time_seconds_total{instance=~".*pg.*"}[5m]) / irate(node_disk_reads_completed_total{instance=~".*pg.*"}[5m]) > 300` |
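The Connection Exhaustion expression divides the number of open backends, summed across all databases, by the configured `max_connections` and scales the result to a percentage. For example, 180 backends against a `max_connections` of 200 gives 180 / 200 * 100 = 90, which does not exceed the 90 threshold, while 181 backends gives 90.5 and would. The following sketch shows how that expression, together with `pg_up`, might be placed in a rule file. The `expr` values, including the `server` label value, are copied from the list above and are specific to that environment. The group name, alert names, `for` durations, severity labels, and annotations are assumptions.

```yaml
# Hypothetical rule file; names, durations, and severities are illustrative.
groups:
  - name: postgresql-alerts           # assumed group name
    rules:
      - alert: PostgresDown           # assumed alert name
        expr: 'pg_up != 1'
        for: 1m                       # assumed
        labels:
          severity: critical          # assumed
        annotations:
          summary: "Exporter cannot connect to PostgreSQL on {{ $labels.instance }}"
      - alert: PostgresConnectionExhaustion   # assumed alert name
        # The server label value is environment-specific; replace it with your own.
        expr: '(sum(pg_stat_database_numbackends{server="10.100.12.36:5432"}) by(instance,job))/(avg(pg_settings_max_connections{server="10.100.12.36:5432"}) by(instance,job)) * 100 > 90'
        for: 5m                       # assumed
        labels:
          severity: warning           # assumed mapping of the 90 threshold
        annotations:
          summary: "Connections at {{ $value }}% of max_connections on {{ $labels.instance }}"
```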