Monitoring, Alerting, and Centralized Logging Integration Support with Chef Automate HA

Chef Automate HA (High Availability) is built on redundancy and failover to deliver reliability, efficiency, and productivity. It helps address significant issues such as service failure and zone failure. Refer to the public Automate HA documentation for more information.

This document provides guided steps for building and integrating monitoring, alerting, and centralized logging tools with Chef Automate HA. Based on our analysis, we recommend the tools listed below.

Abstract:

Monitoring Recommendations

Tools for Monitoring

  1. Datadog

  2. Prometheus

  3. AWS CloudWatch

Tools for Alerting

  1. PagerDuty

  2. Slack

  3. Microsoft Teams

Tools for Centralized Logging

  1. ELK (Elasticsearch, Logstash, and Kibana)

  2. Datadog

  3. AWS CloudWatch

Introduction

The Chef engineering team has comprehensively documented the recommended monitoring metrics to offer visibility into the operational health of the Chef Automate HA solution.

The guided integration steps for the tools listed above cover the following use cases:

Agent download and configuration

This use case covers the steps to download and configure the tool's agent, which runs on the nodes of the Automate HA infrastructure and is responsible for scraping metrics and logs from those nodes. It also covers the configuration needed to scrape the various kinds of component-level metrics.

Agent Installation

This use case covers the steps to install the agent and any extra setup required to ensure that metrics and logs are collected from each node and component of Automate HA.

Server setup and configuration

This use case covers the steps for server setup and installation, along with configuration recommendations.

Dashboard Setup and Configuration

This use case covers the list of recommended dashboards and the steps to set them up in each tool. It also covers the configuration required to bring up each dashboard.

Metrics Configuration and Monitoring Rules Setup

This use case covers the list of recommended metrics for the Automate HA system and the recommended rules, at various levels, for building monitors on those metrics. These are recommendations only: organizations can add rules, update these rules, and adjust the alerting mechanisms to suit their own requirements.
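To make the idea of threshold-based monitoring rules concrete, here is a minimal Python sketch of how a rule set might be evaluated against one metrics sample. The metric names (`cpu_usage_percent`, `disk_free_percent`), the threshold values, and the `Rule` and `evaluate` helpers are all hypothetical and chosen only for illustration. They are not the recommended Automate HA rules and not the API of any monitoring tool.

```python
from dataclasses import dataclass
import operator

# Comparison operators a rule may use.
_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass(frozen=True)
class Rule:
    metric: str       # hypothetical metric name
    op: str           # one of the keys in _OPS
    threshold: float  # illustrative value, not a Chef recommendation
    severity: str     # e.g. "warning" or "critical"


def evaluate(rules, sample):
    """Return the rules breached by one sample (a metric-name -> value mapping)."""
    breached = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None:
            continue  # metric not reported in this sample; skip it
        if _OPS[rule.op](value, rule.threshold):
            breached.append(rule)
    return breached


# Assumed example rules with two severity levels for the same metric.
RULES = [
    Rule("cpu_usage_percent", ">", 80.0, "warning"),
    Rule("cpu_usage_percent", ">", 95.0, "critical"),
    Rule("disk_free_percent", "<", 10.0, "critical"),
]

sample = {"cpu_usage_percent": 91.5, "disk_free_percent": 42.0}
alerts = evaluate(RULES, sample)
```

With this sample, only the CPU warning rule is breached. In a real deployment the same logic would live in the tool's own rule definitions (for example, a Datadog monitor, a Prometheus alerting rule, or a CloudWatch alarm) rather than in custom code.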

Slack Integration with the tool

This use case covers permissions and configuration required for allowing Slack to connect with the tool. This also covers the step-wise setup of alerting groups/channels under monitoring rules to receive alerts based on the threshold logic.
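As an illustration of what an alert delivered to a Slack channel can look like on the wire, the sketch below builds (but does not send) a request for a Slack incoming webhook, which accepts a JSON body with a `text` field. The helper name `build_slack_request`, the message wording, and the placeholder URL are assumptions made for this example. The tools covered in this document normally send such notifications through their own built-in Slack integrations.

```python
import json
import urllib.request


def build_slack_request(webhook_url, metric, value, threshold):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    text = f"ALERT: {metric} is {value}, threshold {threshold}"
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL; a real one comes from the Slack app's incoming-webhook settings.
request = build_slack_request(
    "https://hooks.slack.com/services/PLACEHOLDER", "cpu_usage_percent", 91.5, 80.0
)
# Passing `request` to urllib.request.urlopen would deliver the message.
```

Keeping request construction separate from sending makes the payload easy to inspect and test without network access.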

PagerDuty Integration with the tool

This use case covers permissions and configuration required for allowing PagerDuty to connect with the tool. This also covers the step-wise setup of alerting groups/channels under monitoring rules to receive alerts based on the threshold logic.

Datadog integration with Automate HA - Monitoring

  1. Datadog Agent Configuration and Installation for Chef Managed nodes

  2. Datadog Metrics configuration and Integration with AWS for AWS Managed services

  3. Metrics Monitor Configuration and Monitoring Rules Setup

  4. Dashboard Setup and Configuration

Datadog integration with Automate HA - Alerting

  1. Slack Integration

  2. PagerDuty Integration

  3. MS Teams Integration

Datadog integration with Automate HA - Centralized Logging

Datadog Centralized Logs Management

Prometheus integration with Automate HA - Monitoring

  1. Prometheus Server Configuration and Installation

  2. Prometheus Agent Configuration and Installation

  3. Prometheus Metrics and Alertmanager configuration

  4. Dashboard Setup and Configuration

Prometheus integration with Automate HA - Alerting

  1. Slack Integration

  2. PagerDuty Integration

  3. MS Teams Integration

ELK integration with Automate HA - Centralized Logging

  1. ELK - Configuration and Installation

  2. ELK Agent - Filebeat Configuration, Installation, and Logging

CloudWatch integration with Automate HA - Monitoring

  1. Metrics Monitor Configuration and Monitoring Rules Setup

  2. Dashboard Setup and Configuration

CloudWatch integration with Automate HA - Alerting

  1. Slack Integration

  2. PagerDuty Integration

CloudWatch integration with Automate HA - Centralized Logging

  1. AWS CloudWatch Centralized Logs Management