Chef Automate HA delivers reliability, efficiency, and productivity through redundancy and failover. It helps address significant issues such as service failure and zone failure. Refer to the public Automate HA documentation for more information.
This document provides guided steps for building and integrating monitoring, alerting, and centralized logging tools with Chef Automate HA. Based on our analysis, we recommend the following tools:
Monitoring:
- Datadog
- Prometheus
- AWS CloudWatch
Alerting:
- Pager Duty
- Slack
- Microsoft Teams
Centralized logging:
- ELK (Elasticsearch, Logstash, and Kibana)
- Datadog
- AWS CloudWatch
The Chef engineering team has comprehensively documented the recommended monitoring metrics to offer visibility into the operational health of the Chef Automate HA solution.
The guided integration steps for the tools listed above cover the following use cases:
This use case covers the steps to download and configure the tool's agent, which runs on the nodes of the Automate HA infrastructure and is responsible for scraping metrics and logs from those nodes. It also covers the configurations that need to be set up to scrape the various kinds of component-level metrics.
This use case covers the steps to install the agent and any extra setup required to ensure that metrics and logs are collected from each node and each component of Automate HA.
This use case covers the server installation and setup steps, along with configuration recommendations.
This use case covers the list of recommended dashboards and how to set them up in each tool. It also covers the configuration required to bring up each dashboard.
This use case covers the list of recommended metrics for the Automate HA system and the recommended rules, at various levels, for building monitors on top of these metrics. These are recommendations only; depending on its requirements, an organization can add more rules, update these rules, and change the alerting mechanisms.
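To make the idea of threshold-based monitoring rules concrete, the following is a minimal Python sketch of how a rule with warning and critical levels might be evaluated against a metrics sample. The `ThresholdRule` class, the metric names, and the threshold values are all hypothetical illustrations; they are not taken from the Chef documentation or from any of the tools listed above, each of which has its own rule syntax.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical rule: alert when a metric rises to or above a threshold."""
    metric: str
    warning: float
    critical: float

    def evaluate(self, value: float) -> Optional[str]:
        # Check critical first so the most severe level wins.
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return None


def evaluate_rules(rules: List[ThresholdRule],
                   sample: Dict[str, float]) -> Dict[str, str]:
    """Return {metric: severity} for every rule that fires on this sample."""
    alerts: Dict[str, str] = {}
    for rule in rules:
        if rule.metric not in sample:
            continue  # metric not reported in this sample; skip the rule
        severity = rule.evaluate(sample[rule.metric])
        if severity is not None:
            alerts[rule.metric] = severity
    return alerts


# Illustrative metric names and thresholds only (assumptions for this sketch).
RULES = [
    ThresholdRule("cpu_usage_percent", warning=80.0, critical=95.0),
    ThresholdRule("disk_usage_percent", warning=75.0, critical=90.0),
]

sample = {"cpu_usage_percent": 97.2, "disk_usage_percent": 60.0}
alerts = evaluate_rules(RULES, sample)
print(alerts)  # {'cpu_usage_percent': 'critical'}
```

In practice these rules would be defined inside the monitoring tool itself; the sketch only shows the two-level threshold logic that an organization would tune to its own requirements.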
This use case covers the permissions and configuration required to allow Slack to connect to the tool. It also covers the step-by-step setup of alerting groups/channels under monitoring rules so that alerts are delivered based on the threshold logic.
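As a rough illustration of what a threshold alert delivered to Slack can look like, the sketch below builds a request for a Slack incoming webhook, which accepts a JSON body with a `text` field. The monitoring tools covered here normally provide their own Slack integration, so this is only a sketch of the message shape under stated assumptions: the `build_slack_request` helper, the webhook URL, and the metric name are hypothetical placeholders.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, metric: str, value: float,
                        threshold: float, severity: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    text = f"[{severity.upper()}] {metric} is {value} (threshold {threshold})"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL (assumption); a real one is issued by Slack per channel.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/PLACEHOLDER"

request = build_slack_request(
    WEBHOOK_URL, "cpu_usage_percent", 97.2, 95.0, "critical"
)
print(request.data.decode("utf-8"))
# {"text": "[CRITICAL] cpu_usage_percent is 97.2 (threshold 95.0)"}

# To actually deliver the alert, with a valid webhook URL:
# urllib.request.urlopen(request, timeout=10)
```

The request is built without being sent so the payload can be inspected before any channel is wired up.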
- Datadog Agent Configuration and Installation for Chef Managed nodes
- Datadog Metrics configuration and Integration with AWS for AWS Managed services
- Datadog Centralized Logs Management