diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 16fd7c6..15fd26a 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -95,6 +95,5 @@ $ sudo docker-compose up --build mkdocs To wipe and reset the `docker-compose` environment simply run the following. ```console -$ sudo docker-compose kill automatron redis -$ sudo docker-compose rm automatron redis tests mkdocs +$ sudo docker-compose down ``` diff --git a/README.md b/README.md index e8892b6..c5b2b29 100644 --- a/README.md +++ b/README.md @@ -1,80 +1,70 @@ [![Build Status](https://travis-ci.org/madflojo/automatron.svg?branch=master)](https://travis-ci.org/madflojo/automatron) [![Coverage Status](https://coveralls.io/repos/github/madflojo/automatron/badge.svg?branch=master)](https://coveralls.io/github/madflojo/automatron?branch=master) + ![Automatron](https://raw.githubusercontent.com/madflojo/automatron/master/docs/img/logo_huge.png) -Automatron **(Ah-Tom-a-tron)** is an open source framework designed to detect and remediate IT systems issues. Meaning, it can be used to monitor systems and when it detects issues; correct them. +Automatron is a framework for creating self-healing infrastructure. Simply put, it detects system events & takes action to correct them. + +The goal of Automatron is to allow users to automate the execution of common tasks performed during system events. These tasks can be as simple as **sending an email** to as complicated as **restarting services across multiple hosts**. 
## Features -* Automatically detect and add new systems to monitor -* Monitoring is executed over SSH and completely agent-less -* Policy based Runbooks allow for monitoring policies rather than server specific configurations -* Supports Nagios compliant health check scripts -* Allows arbitrary shell commands for both checks and actions -* Runbook flexibility with **Jinja2** templating support -* Pluggable Architecture that simplifies customization + * Automatically detect and add new systems to monitor + * Monitoring is executed over SSH and completely **agent-less** + * Policy based Runbooks allow for monitoring policies rather than server specific configurations + * Supports Nagios compliant health check scripts + * Allows dead simple **arbitrary shell commands** for both checks and actions + * Runbook flexibility with **Jinja2** templating support + * Pluggable Architecture that simplifies customization ## Runbooks -Automatron's actions are driven by policies called **Runbooks**. These runbooks are used to define what health checks should be executed on a target host and what to do about those health checks when they fail. +The core of Automatron is based around **Runbooks**. Runbooks are policies that define health checks and actions. You can think of them in the same way you would think of a printed runbook. Except with Automatron, the actions are automated. -### A simple Runbook +### A simple Runbook example -The below example is a Runbook that will execute a monitoring plugin to determine the amount of free space on `/var/log` and based on the results execute a corrective action. +The below runbook is a very basic example, it will check if NGINX is running (every 2 minutes) and restart it after 2 unsuccessful checks. 
-```yaml -name: Verify /var/log -schedule: "*/5 * * * *" -nodes: - - "*" +```yaml+jinja +name: Check NGINX +schedule: "*/2 * * * *" checks: - mem_free: - # Check for the % of disk free create warning with 20% free and critical for 10% free + nginx_is_running: execute_from: target - type: plugin - plugin: systems/disk_free.py - args: --warn=20 --critical=10 --filesystem=/var/log + type: cmd + cmd: service nginx status actions: - logrotate_nicely: + restart_nginx: execute_from: target - trigger: 0 + trigger: 2 frequency: 300 call_on: - WARNING - type: cmd - cmd: bash /etc/cron.daily/logrotate - logrotate_forced: - execute_from: target - trigger: 5 - frequency: 300 - call_on: - CRITICAL + - UNKNOWN type: cmd - cmd: bash /etc/cron.daily/logrotate --force + cmd: service nginx restart ``` -### A Runbook with Jinja2 +The above actions will be performed every 300 seconds (5 minutes) until the health check returns an OK status. This delay allows time for NGINX to restart after each execution. -Jinja2 support was added to Runbooks to allow for extensive customization. The below example shows using Jinja2 to determine which `cmd` to execute based on Automatron's **facts** system. +### A complex Runbook with Jinja2 -This example will detect if `nginx` is running and if not, restart it. +This next runbook example is a more complex version of the above. In this example we will use Jinja2 and Automatron's Facts to enhance our runbook further. 
-```yaml -name: Verify nginx is running +```yaml+jinja +name: Check NGINX +{% if "prod" in facts['hostname'] %} schedule: - second: "*/30" -nodes: - - "*web*" + second: "*/20" +{% else %} +schedule: "*/2 * * * *" +{% endif %} checks: nginx_is_running: - # Check if nginx is running execute_from: target type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx status - {% else %} - cmd: /usr/local/etc/rc.d/nginx status - {% endif %} actions: restart_nginx: execute_from: target @@ -83,46 +73,46 @@ actions: call_on: - WARNING - CRITICAL + - UNKNOWN type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx restart - {% else %} - cmd: /usr/local/etc/rc.d/nginx restart - {% endif %} + remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content {{ facts['network']['eth0']['v4'][0] }} ``` -For more examples and information on getting started checkout the Automatron [wiki](https://github.com/madflojo/automatron/wiki). +The above example uses **Jinja2** and **Facts** to create a conditional schedule. If our target server has a hostname that contains the word "prod", the schedule for the health check will be every 20 seconds. If not, it will be every 2 minutes. -## Deploying with Docker +Another addition is the `remove_from_dns` action, which will remove the target server's DNS entry using the **CloudFlare DNS** plugin. -Deploying Automatron within Docker is quick and easy. Since Automatron by default uses `redis` as a datastore we must first start a `redis` instance. - -```console -$ sudo docker run -d --restart=always --name redis redis -``` - -Once `redis` is up and running you can start an Automatron instance. 
- -```console -$ sudo docker run -d --link redis:redis -v /path/to/config:/config --restart=always --name automatron madflojo/automatron -``` +By using **Facts** and **Jinja2** together you can customize a single runbook to cover unique actions for multiple hosts and environments. ## Stay in the loop -Follow [@Automatronio on Twitter](https://twitter.com/automatronio) for the latest Automatron news and join the community in [#Automatron on Gitter](https://gitter.im/madflojo/automatron). +[![Twitter Follow](https://img.shields.io/twitter/follow/automatronio.svg?style=flat-square)](https://twitter.com/automatronio) [![Gitter](https://badges.gitter.im/madflojo/automatron.svg)](https://gitter.im/madflojo/automatron?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) ## License - Copyright 2016 Benjamin Cane +``` + Copyright 2016 Benjamin Cane - Licensed under the Apache License, Version 2.0 (the "License"); - you may not use this file except in compliance with the License. - You may obtain a copy of the License at + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at - http://www.apache.org/licenses/LICENSE-2.0 + http://www.apache.org/licenses/LICENSE-2.0 - Unless required by applicable law or agreed to in writing, software - distributed under the License is distributed on an "AS IS" BASIS, - WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. - See the License for the specific language governing permissions and - limitations under the License. + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+``` diff --git a/config/config.yml.example b/config/config.yml.example index 3fece6e..3bd8b9c 100644 --- a/config/config.yml.example +++ b/config/config.yml.example @@ -2,7 +2,7 @@ config_path: config runbook_path: config/runbooks plugin_path: plugins/ -ssh: # SSH Configuration +ssh: user: root gateway: False key: | @@ -10,35 +10,43 @@ ssh: # SSH Configuration fdlkfjasldjfsaldkjflkasjflkjaflsdlkfjs -----END RSA PRIVATE KEY----- -monitoring: # Monitoring configuration +## Checks +monitoring: upload_path: /tmp -actioning: # Actioning configuration +## Actions +actioning: upload_path: /tmp -logging: # Logging Configurations +## Logging +logging: debug: True plugins: console: True syslog: facility: local0 -discovery: # Discovery Configurations +## Host Discovery +discovery: upload_path: /tmp/ vetting_interval: 30 plugins: - webping: # Web Service for HTTP GET or POST requests + # Web Service for HTTP PINGs + webping: ip: 0.0.0.0 - port: 20000 -# nmap: # NMAP Scanning for new hosts + port: 8000 +# nmap: +# # NMAP Scanning for new hosts # target: 10.0.0.1/8 # flags: -sP # interval: 40 -# digitalocean: # Query DO's API +# digitalocean: +# # Query DO's API # url: https://api.digitalocean.com/v2 # api_key: example # interval: 60 -# aws: # Query AWS' API +# aws: +# # Query AWS' API # aws_access_key_id: example # aws_secret_access_key: example # interval: 60 @@ -46,17 +54,18 @@ discovery: # Discovery Configurations # - PublicIpAddress # - PrivateIpAddress # linode: +# # Query Linode's API # url: https://api.linode.com # api_key: example # interval: 60 - -datastore: # Datastore Configurations -## Default Datastore Engine +## Datastore +datastore: + ## Default Datastore Engine engine: redis ## Datastore Specific configuration plugins: - ## Redis + # Redis redis: db: 0 host: redis diff --git a/config/runbooks/examples/disk_free/init.yml b/config/runbooks/examples/disk_free/init.yml index 9af74de..d02df2d 100644 --- a/config/runbooks/examples/disk_free/init.yml +++ 
b/config/runbooks/examples/disk_free/init.yml @@ -1,7 +1,5 @@ name: Verify /var/log schedule: "*/2 * * * *" -nodes: - - "*" checks: disk_free: # Check for the % of disk free create warning with 20% free and critical for 10% free diff --git a/config/runbooks/examples/docker/clear_dangling_images.yml b/config/runbooks/examples/docker/clear_dangling_images.yml index 9cbd875..ae2528b 100644 --- a/config/runbooks/examples/docker/clear_dangling_images.yml +++ b/config/runbooks/examples/docker/clear_dangling_images.yml @@ -1,7 +1,5 @@ name: Clean up dangling images if they are found schedule: "*/2 * * * *" -nodes: - - "*" checks: find_danglers: execute_from: ontarget diff --git a/config/runbooks/examples/docker/clear_dead_containers.yml b/config/runbooks/examples/docker/clear_dead_containers.yml index 0185f86..d7857e4 100644 --- a/config/runbooks/examples/docker/clear_dead_containers.yml +++ b/config/runbooks/examples/docker/clear_dead_containers.yml @@ -1,7 +1,5 @@ name: Clean up dead containers if they are found schedule: "*/2 * * * *" -nodes: - - "*" checks: find_dead_containers: execute_from: ontarget diff --git a/config/runbooks/examples/docker/init.yml b/config/runbooks/examples/docker/init.yml index 9b50f38..2a1277e 100644 --- a/config/runbooks/examples/docker/init.yml +++ b/config/runbooks/examples/docker/init.yml @@ -1,7 +1,5 @@ name: Verify Docker is running schedule: "* * * * *" -nodes: - - "*" checks: docker_running: execute_from: ontarget diff --git a/config/runbooks/examples/nginx/init.yml b/config/runbooks/examples/nginx/init.yml index aa393c6..c087a8c 100644 --- a/config/runbooks/examples/nginx/init.yml +++ b/config/runbooks/examples/nginx/init.yml @@ -1,7 +1,5 @@ name: Verify nginx is running schedule: "* * * * *" -nodes: - - "*" checks: nginx_is_running: # Check if nginx is running diff --git a/docker-compose.yml b/docker-compose.yml index 868f70c..fe22868 100644 --- a/docker-compose.yml +++ b/docker-compose.yml @@ -10,12 +10,13 @@ services: build: . 
command: python /tests.py mkdocs: - image: thinkcube/mkdocs + build: + context: . + dockerfile: docs/Dockerfile volumes: - - ./:/automatron + - ./:/tmp/mkdocs ports: - 8000:8000 - working_dir: /automatron command: mkdocs serve -a 0.0.0.0:8000 coverage: build: . diff --git a/docs/Dockerfile b/docs/Dockerfile new file mode 100644 index 0000000..9523eda --- /dev/null +++ b/docs/Dockerfile @@ -0,0 +1,4 @@ +FROM thinkcube/mkdocs +RUN pip install mkdocs-material 'pygments>=2.2' 'pymdown-extensions>=2.0' +RUN mkdir -p /tmp/mkdocs +WORKDIR /tmp/mkdocs diff --git a/docs/automatron-in-10-minutes.md b/docs/automatron-in-10-minutes.md deleted file mode 100644 index 7719f05..0000000 --- a/docs/automatron-in-10-minutes.md +++ /dev/null @@ -1,129 +0,0 @@ -With Automatron and a few minutes you can setup a fully autonomous monitoring and remediation system. The below steps will show how to install and configure Automatron to monitor Nginx on all servers with a hostname that matches `*web*` and restart the service if it is not running. - -## Install and Configure Automatron - -Automatron is currently available by cloning the [GitHub Repository](https://github.com/madflojo/automatron/). With the first release candidate it will also be available via a Docker image. - -### Prerequisites - -The below list is a set of base requirements for a running Automatron instance. - - * Python 2.7 or higher - * Redis - * nmap - -### Clone from Github - -First, clone the current repository from GitHub. - -```sh -$ git clone https://github.com/madflojo/automatron.git -$ cd automatron -``` - -### Install required python modules - -Second, install any required python modules. - -```sh -$ sudo pip install -r requirements.txt -$ sudo pip install honcho -``` - -### Setup a base configuration - -Third, create a configuration file using the `config/config.yml.example` file as a base. 
- -```sh -$ cp config/config.yml.example config/config.yml -$ vi config/config.yml -``` - -##### Defining an SSH Key - -Automatron relies on SSH to perform both monitoring and actioning. To enable this a public SSH key must be deployed on all target servers and the private key stored within the `ssh` section of the configuration file. - -```yaml -ssh: # SSH Configuration - user: root - gateway: False - key: | - -----BEGIN RSA PRIVATE KEY----- - fdlkfjasldjfsaldkjflkasjflkjaflsdlkfjs - -----END RSA PRIVATE KEY----- -``` - -The `gateway` setting can be used to specify a "jump server" for Automatron to connect to. If left as `False` Automatron will simply login to each target host directly. - -##### Setup the nmap Discovery plugin - -Automatron discovers new hosts via two default methods, the first is a web "ping" which can be any HTTP request to the port specified within the configuration file. - -The second method is a `nmap` scan. Within the config file you can specify a custom network subnet for Automatron to scan. - -```yaml -## Use NMAP to find new hosts -nmap: - target: 10.0.0.1/8 - flags: -sP - interval: 40 -``` - -The `flags` configuration is used to pass command line arguments to `nmap`. - -## Writing our first Runbook - -A Runbook is a policy that defines health checks and automated actions to be performed when those health checks return specified states. - -For this example we will create a new Runbook. - -```sh -$ mkdir -p config/runbooks/base/check_nginx -$ vi config/runbooks/base/check_nginx/init.yml -``` - -Once the file is open simply paste the following Runbook policy. 
- -```yaml -name: Verify nginx is running -schedule: "*/5 * * * *" -nodes: - - "*web*" -checks: - nginx_is_running: - # Check if nginx is running - execute_from: target - type: cmd - cmd: service nginx status -actions: - restart_nginx: - execute_from: target - trigger: 2 - frequency: 300 - call_on: - - WARNING - - CRITICAL - type: cmd - cmd: service nginx restart -``` - -The above policy will run the `service nginx status` command every 5 minutes on any target that has a hostname that matches `*web*`. If that command fails after 2 occurrences the `restart_nginx` action will be "triggered" and executed on the target server. - -## Applying Runbooks to Target hosts - -Within the Runbook above we specified the target nodes that the runbook applies to. There is another level of targeting available within the `config/runbooks/init.yml` file. This provides additional granularity to the application of Runbooks. - -To get started we will replace the contents of this file with settings specific to our current task. - -```yaml -'*': - - base/check_nginx -``` - -## Starting Automatron - -Once our configuration and runbook is defined we can startup Automatron and watch as our webservers are discovered and monitored autonamously. - -```sh -$ honcho start -``` diff --git a/docs/configure.md b/docs/configure.md new file mode 100644 index 0000000..6721c94 --- /dev/null +++ b/docs/configure.md @@ -0,0 +1,75 @@ +Configuration of Automatron is fairly simple and contained within a single file; `config/config.yml`. + +This guide will walk through configuring a basic Automatron instance. + +## Copying the `config.yml.example` file + +The fastest method to configure Automatron is to start with the example configuration file `config/config.yml.example`. This configuration file contains basic default values which can be used in most implementations of Automatron. To use this file we can simply rename it to the default Automatron configuration file `config/config.yml`. 
+ +```sh +$ cp config/config.yml.example config/config.yml +``` + +Once complete, we can now start customizing our configuration file. + +## SSH Details + +Automatron relies on SSH to perform both health checks and actions. Within `config.yml` there is an SSH section which will allow us to define the necessary SSH details such as; `user` to authenticate as, a `gateway` or "jump server" for SSH connections and a Private SSH `key`. + +```yaml+jinja +ssh: # SSH Configuration + user: root + gateway: False + key: | + -----BEGIN RSA PRIVATE KEY----- + this is an example + -----END RSA PRIVATE KEY----- +``` + +If the `gateway` setting is left as `False` Automatron will login to each host directly. To specify a "jump server" simply specify the DNS or IP address of the desired server. + +```yaml+jinja + gateway: 10.0.0.1 +``` + +!!! info + At this time Automatron does not support using sudo or other privilege escalation tools. Any checks or actions will be performed via the user privileges specified in `user`. + +## Enable Auto Discovery + +By default, Automatron will listen on port `8000` for any HTTP requests. When an HTTP request is made to Automatron the IP will be captured and that server will then be identified as a monitoring target. + +There are several plugins that enable other methods for host discovery, in this section we will enable the `nmap` discovery plugin. This configuration is within the `discovery` section of the `config.yml` file. + +```yaml+jinja +discovery: + upload_path: /tmp/ + vetting_interval: 30 + plugins: + # Web Service for HTTP PINGs + webping: + ip: 0.0.0.0 + port: 8000 +``` + +To enable the `nmap` plugin we simply need to append the `nmap` configuration within the `plugins` key. 
+ +```yaml+jinja +discovery: + upload_path: /tmp/ + vetting_interval: 30 + plugins: + # Web Service for HTTP PINGs + webping: + ip: 0.0.0.0 + port: 8000 + # NMAP Scanning + nmap: + target: 10.0.0.1/8 + flags: -sP + interval: 40 +``` + +Each plugin has unique configuration details; the specifics of these plugins can be found in the [plugin](plugins/index.md) documentation. + +At this point Automatron has been configured. We can now move on to creating our own [Runbooks](runbooks/index.md). diff --git a/docs/deploying-with-docker.md b/docs/deploying-with-docker.md deleted file mode 100644 index 1384b51..0000000 --- a/docs/deploying-with-docker.md +++ /dev/null @@ -1,11 +0,0 @@ -Deploying Automatron within Docker is quick and easy. Since Automatron by default uses `redis` as a datastore we must first start a `redis` instance. - -```console -$ sudo docker run -d --restart=always --name redis redis -``` - -Once `redis` is up and running you can start an Automatron instance. - -```console -$ sudo docker run -d --link redis:redis -v /path/to/config:/config --restart=always --name automatron madflojo/automatron -``` \ No newline at end of file diff --git a/docs/facts.md b/docs/facts.md index a7ae366..c5497f0 100644 --- a/docs/facts.md +++ b/docs/facts.md @@ -1,22 +1,23 @@ -Automatron provides the ability to use the Jinja templating language with [Runbooks](runbooks). To support this ability to use templates Automatron also has a **facts** facility. Facts are simply information that has been gathered from the target system. This includes information such as Hostname, OS, Services Running and Network information. - -The below is an example runbook that utilizes the Automatron facts system. +Automatron leverages the power of [Jinja2](http://jinja.pocoo.org/docs/2.9/), a popular Python-based templating language, to enhance how runbooks can be used. The below example is a runbook that leverages Jinja2. 
```yaml+jinja -name: Verify nginx is running -schedule: "*/5 * * * *" -nodes: - - "*web*" +name: Check NGINX +{% if "prod" in facts['hostname'] %} +schedule: + second: "*/20" +{% else %} +schedule: "*/2 * * * *" +{% endif %} checks: nginx_is_running: - # Check if nginx is running execute_from: target type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx status - {% else %} - cmd: /usr/local/etc/rc.d/nginx status - {% endif %} + port_443_is_up: + execute_from: target + type: plugin + plugin: network/tcp_connect.py + args: --host={{ facts['network']['eth0']['v4'][0] }} --port 443 actions: restart_nginx: execute_from: target @@ -25,12 +26,49 @@ actions: call_on: - WARNING - CRITICAL + - UNKNOWN type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx restart - {% else %} - cmd: /usr/local/etc/rc.d/nginx restart - {% endif %} + remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content {{ facts['network']['eth0']['v4'][0] }} +``` + +The above runbook leverages both Jinja2 and Automatron's internal **Facts**. Facts are variables that Automatron has collected during the Vetting process of each monitored system. + +When Automatron discovers a new host it executes [Vetting Plugins](plugins/#Vetting) on the host. Some plugins are executed remotely, others are executed on the host itself. These plugins return information unique to each host. + +An example of the type of information can be seen in the `ontarget/system_info.py` vetting plugin. This plugin creates facts for OS Distribution, Hostname, Kernel version and Network Information. + +## A Deeper Look + +To get a better understanding of facts and how they can be used, let's look at the facts used in the above example. The below example uses the `hostname` fact to determine if the target is a "production" hostname or not. 
+ +```yaml+jinja +{% if "prod" in facts['hostname'] %} +schedule: + second: "*/20" +{% else %} +schedule: "*/2 * * * *" +{% endif %} +``` + +This next example uses another fact to determine the IPv4 address of the monitored host. This address is then used as an argument for the `tcp_connect.py` plugin. + +```yaml+jinja +port_443_is_up: + execute_from: target + type: plugin + plugin: network/tcp_connect.py + args: --host={{ facts['network']['eth0']['v4'][0] }} --port 443 ``` -In the example above the `facts['os']` value is checked to determine if the target system is Linux or not. For a full list of available facts please reference the [Vetting Plugins](plugins/#Vetting) documentation. +The above are simple examples of how Jinja and Facts used together can enable the creation of runbooks that can span multiple hosts and use cases. diff --git a/docs/img/automatron.png b/docs/img/automatron.png new file mode 100644 index 0000000..c948156 Binary files /dev/null and b/docs/img/automatron.png differ diff --git a/docs/index.md b/docs/index.md index 9851f73..35a5b1e 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,77 +1,67 @@ ![Automatron](https://raw.githubusercontent.com/madflojo/automatron/master/docs/img/logo_huge.png) -Automatron **(Ah-Tom-a-tron)** is an open source framework designed to detect and remediate IT systems issues. Meaning, it can be used to monitor systems and when it detects issues; correct them. +Automatron is a framework for creating self-healing infrastructure. Simply put, it detects system events & takes action to correct them. + +The goal of Automatron is to allow users to automate the execution of common tasks performed during system events. These tasks can be as simple as **sending an email** to as complicated as **restarting services across multiple hosts**. 
## Features -* Automatically detect and add new systems to monitor -* Monitoring is executed over SSH and completely agent-less -* Policy based Runbooks allow for monitoring policies rather than server specific configurations -* Supports Nagios compliant health check scripts -* Allows arbitrary shell commands for both checks and actions -* Runbook flexibility with **Jinja2** templating support -* Pluggable Architecture that simplifies customization + * Automatically detect and add new systems to monitor + * Monitoring is executed over SSH and completely **agent-less** + * Policy based [Runbooks](runbooks/index.md) allow for monitoring policies rather than server specific configurations + * Supports Nagios compliant health check scripts + * Allows dead simple **arbitrary shell commands** for both [checks](runbooks/checks.md) and [actions](runbooks/actions.md) + * Runbook flexibility with **Jinja2** templating support + * Pluggable Architecture that simplifies customization ## Runbooks -Automatron's actions are driven by policies called **Runbooks**. These runbooks are used to define what health checks should be executed on a target host and what to do about those health checks when they fail. +The core of Automatron is based around **Runbooks**. Runbooks are policies that define health checks and actions. You can think of them in the same way you would think of a printed runbook. Except with Automatron, the actions are automated. -### A simple Runbook +### A simple Runbook example -The below example is a Runbook that will execute a monitoring plugin to determine the amount of free space on `/var/log` and based on the results execute a corrective action. +The below runbook is a very basic example, it will check if NGINX is running (every 2 minutes) and restart it after 2 unsuccessful checks. 
-```yaml -name: Verify /var/log +```yaml+jinja +name: Check NGINX schedule: "*/2 * * * *" -nodes: - - "*" checks: - mem_free: - # Check for the % of disk free create warning with 20% free and critical for 10% free + nginx_is_running: execute_from: target - type: plugin - plugin: systems/disk_free.py - args: --warn=20 --critical=10 --filesystem=/var/log + type: cmd + cmd: service nginx status actions: - logrotate_nicely: + restart_nginx: execute_from: target - trigger: 0 + trigger: 2 frequency: 300 call_on: - WARNING - type: cmd - cmd: bash /etc/cron.daily/logrotate - logrotate_forced: - execute_from: target - trigger: 5 - frequency: 300 - call_on: - CRITICAL + - UNKNOWN type: cmd - cmd: bash /etc/cron.daily/logrotate --force + cmd: service nginx restart ``` -### A Runbook with Jinja2 +The above actions will be performed every 300 seconds (5 minutes) until the health check returns an OK status. This delay allows time for NGINX to restart after each execution. -Jinja2 support was added to Runbooks to allow for extensive customization. The below example shows using Jinja2 to determine which `cmd` to execute based on Automatron's **facts** system. +### A complex Runbook with Jinja2 -This example will detect if `nginx` is running and if not, restart it. +This next runbook example is a more complex version of the above. In this example we will use Jinja2 and Automatron's Facts to enhance our runbook further. 
```yaml+jinja -name: Verify nginx is running -schedule: "*/5 * * * *" -nodes: - - "*web*" +name: Check NGINX +{% if "prod" in facts['hostname'] %} +schedule: + second: "*/20" +{% else %} +schedule: "*/2 * * * *" +{% endif %} checks: nginx_is_running: - # Check if nginx is running execute_from: target type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx status - {% else %} - cmd: /usr/local/etc/rc.d/nginx status - {% endif %} actions: restart_nginx: execute_from: target @@ -80,19 +70,28 @@ actions: call_on: - WARNING - CRITICAL + - UNKNOWN type: cmd - {% if "Linux" in facts['os'] %} cmd: service nginx restart - {% else %} - cmd: /usr/local/etc/rc.d/nginx restart - {% endif %} + remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content {{ facts['network']['eth0']['v4'][0] }} ``` -## Next Steps +The above example uses **Jinja2** and **Facts** to create a conditional schedule. If our target server has a hostname that contains the word "prod", the schedule for the health check will be every 20 seconds. If not, it will be every 2 minutes. + +Another addition is the `remove_from_dns` action, which will remove the target server's DNS entry using the **CloudFlare DNS** plugin. + +By using **Facts** and **Jinja2** together you can customize a single runbook to cover unique actions for multiple hosts and environments. 
+ +## Follow Automatron -* Follow our quick start guide: [Automatron in 10 minutes](automatron-in-10-minutes) -* Check out [example Runbooks](https://github.com/madflojo/automatron/tree/master/config/runbooks/examples) for automating common tasks -* Read our [Runbook Reference](Runbooks) documentation to better understand the anatomy of a Runbook -* Deploy a [Docker container](https://hub.docker.com/r/madflojo/automatron/) instance of Automatron -* Follow [@Automatronio on Twitter](https://twitter.com/automatronio) to keep up to date -* Join [#Automatron on Gitter](https://gitter.im/madflojo/automatron) for help or just to hang out +[![Twitter Follow](https://img.shields.io/twitter/follow/automatronio.svg?style=flat-square)](https://twitter.com/automatronio) [![Gitter](https://badges.gitter.im/madflojo/automatron.svg)](https://gitter.im/madflojo/automatron?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge) [![GitHub forks](https://img.shields.io/github/forks/madflojo/automatron.svg?style=social&label=Fork)](https://github.com/madflojo/automatron) [![GitHub stars](https://img.shields.io/github/stars/madflojo/automatron.svg?style=social&label=Star)](https://github.com/madflojo/automatron) diff --git a/docs/install/docker.md b/docs/install/docker.md new file mode 100644 index 0000000..76dab8a --- /dev/null +++ b/docs/install/docker.md @@ -0,0 +1,26 @@ +Deploying Automatron within Docker is quick and easy and can be done with two simple `docker` commands. + +## Starting a Redis container + +Since Automatron by default uses `redis` as a datastore we must first start a `redis` container. + +```sh +$ sudo docker run -d --restart=always --name redis redis +``` + +The above `redis` instance will be used as a default datastore for Automatron. + +## Starting the Automatron container + +Once the `redis` instance is up and running we can start an Automatron instance. 
+ +```sh +$ sudo docker run -d --link redis:redis -p 8000:8000 -v /path/to/config:/config --restart=always --name automatron madflojo/automatron +``` + +In the above `docker run` command we are using `-v` to mount a directory from the host to the container as `/config`. This `/config` directory will be the home to Automatron's configuration files and Runbooks. + +With these steps complete, we can now move to [Configuring](/configure.md) Automatron. + +!!! tip + A `docker-compose.yml` file is included in the base repository which can be used to quickly stand up environments using `docker-compose up automatron`. diff --git a/docs/install/index.md b/docs/install/index.md new file mode 100644 index 0000000..c97decc --- /dev/null +++ b/docs/install/index.md @@ -0,0 +1,64 @@ +In this page you will be guided through a basic installation of Automatron. If you wish to deploy Automatron within a container, you can skip this guide and follow the [Docker deployment](docker.md) instructions. + +## Basic Installation + +Currently, the simplest method of installing Automatron is by either cloning the [GitHub Repository](https://github.com/madflojo/automatron/) or [downloading](https://github.com/madflojo/automatron/releases) a specific release and installing dependencies. + +This guide will walk through cloning the GitHub repository and starting an Automatron instance. + +### Prerequisites + +The below list is a set of base requirements for installing and running an Automatron instance. + + * Python 2.7 or higher + * Python-dev Package + * Pip + * Redis + * nmap + * git + * libffi-dev + * libssl-dev + * build-essential + +On Ubuntu systems these can be installed with the following command. 
+ +```sh +$ sudo apt-get install python2.7 python-dev \ + python-pip redis-server \ + nmap git libffi-dev \ + build-essential libssl-dev +``` + +Once installed we can proceed to Automatron's installation. + +### Clone from GitHub + +The first installation step is to clone the current repository from GitHub using `git` and change to the newly created directory. + +```sh +$ git clone https://github.com/madflojo/automatron.git +$ cd automatron +``` + +This will place the latest `master` (production ready) branch into the `automatron` directory. + +### Install required Python modules + +The second installation step is to install the required Python modules using the `pip` command. + +```sh +$ sudo pip install -r requirements.txt +$ sudo pip install honcho +``` + +With the above two steps complete, we can now move to [Configuration](/configure.md). + +## Starting Automatron + +In order to start Automatron you can simply execute the command below. + +```sh +$ honcho start +``` + +To shut down Automatron you can use the `kill` command to send the `SIGTERM` signal to the running processes. diff --git a/docs/plugins.md b/docs/plugins.md deleted file mode 100644 index 34c701c..0000000 --- a/docs/plugins.md +++ /dev/null @@ -1,50 +0,0 @@ -## Discovery - -Plugins for discovering new targets to monitor. - -* [nmap](plugins/discovery/nmap) - Nmap wrapper for scanning Networks -* [Web Ping](plugins/discovery/webping) - Listen for POST or GET requests to identify new targets -* [DigitalOcean](plugins/discovery/digitalocean) - Query DigitalOcean's API - - -## Checks - -Plugins used to perform health checks. - - * Systems - * [Disk Free](plugins/checks/systems/disk_free) - Check file system disk space utilization - * [Memory Free](plugins/checks/systems/mem_free) - Check memory utilization - -## Actions - -Plugins used for Automatron actions.
- - * CloudFlare - * [DNS](plugins/actions/cloudflare/dns) - Modify, Add or Delete CloudFlare hosted DNS entries - * Docker - * [Docker Clean](plugins/actions/docker/clean) - Remove all Docker containers and images - * Systems - * [Services](plugins/actions/systems/services) - Perform an action on a specified services - -## Vetting - -Gather information from targets to populate `facts` values. - - * On Target - * [Services](plugins/vetting/ontarget/services) - Identify system services and their current state - * [System Info](plugins/vetting/ontarget/system-info) - Identify basic system information (i.e. Hostname, IP Address, etc.) - * Remote - * [Ping](plugins/vetting/remote/ping) - True or False value to determine if system is ping-able - -## Datastores - -Use custom datastores to store Automatrons internal data. - - * [Redis](plugins/datastores/redis) - Redis data storage and retrieval - -## Logging - -Use custom logging modules for Automatron - - * [Syslog](plugins/logging/syslog) - Log to custom Syslog end points - * [Console](plugins/logging/console) - Log to the executors console diff --git a/docs/plugins/index.md b/docs/plugins/index.md new file mode 100644 index 0000000..080b7c7 --- /dev/null +++ b/docs/plugins/index.md @@ -0,0 +1,52 @@ +## Discovery + +Plugins for discovering new targets to monitor. + +* [nmap](discovery/nmap) - Nmap wrapper for scanning Networks +* [Web Ping](discovery/webping) - Listen for POST or GET requests to identify new targets +* [DigitalOcean](discovery/digitalocean) - Query DigitalOcean's API + + +## Checks + +Plugins used to perform health checks. 
+ + * Systems + * [Disk Free](checks/systems/disk_free) - Check file system disk space utilization + * [Memory Free](checks/systems/mem_free) - Check memory utilization + +It is also possible to import plugins from the [Monitoring Plugins Project](https://www.monitoring-plugins.org/) or from [Nagios Exchange](https://exchange.nagios.org/) into Automatron by copying them into the `plugins/checks` directory. Once included, the executables can be referenced by Automatron's runbooks. + +## Actions + +Plugins used for Automatron actions. + + * CloudFlare + * [DNS](actions/cloudflare/dns) - Modify, Add or Delete CloudFlare hosted DNS entries + * Docker + * [Docker Clean](actions/docker/clean) - Remove all Docker containers and images + * Systems + * [Services](actions/systems/services) - Perform an action on a specified service + +## Vetting + +Gather information from targets to populate `facts` values. + + * On Target + * [Services](vetting/ontarget/services) - Identify system services and their current state + * [System Info](vetting/ontarget/system-info) - Identify basic system information (e.g. Hostname, IP Address, etc.) + * Remote + * [Ping](vetting/remote/ping) - True or False value to determine if system is ping-able + +## Datastores + +Use custom datastores to store Automatron's internal data. + + * Redis - Redis data storage and retrieval + +## Logging + +Use custom logging modules for Automatron. + + * [Syslog](logging/syslog) - Log to custom Syslog endpoints + * [Console](logging/console) - Log to the executor's console diff --git a/docs/runbooks.md b/docs/runbooks.md deleted file mode 100644 index 127fe87..0000000 --- a/docs/runbooks.md +++ /dev/null @@ -1,426 +0,0 @@ -Runbooks within Automatron are used to define the health checks to run on target nodes and the actions to perform based on those health checks. This guide is a reference for the various options used when defining a runbook.
- -## Basic Runbook Example - -The below example is a basic Runbook that monitors the status of **nginx** by executing the `service nginx status` command and restarts **nginx** if that command is not successful. - -```yaml -name: Verify nginx is running -schedule: "*/5 * * * *" -nodes: - - "*web*" -checks: - nginx_is_running: - # Check if nginx is running - execute_from: target - type: cmd - cmd: service nginx status -actions: - restart_nginx: - execute_from: target - trigger: 2 - frequency: 300 - call_on: - - WARNING - - CRITICAL - type: cmd - cmd: service nginx restart -``` - -The format of a Runbook is written in [YAML](http://yaml.org/). This format allows for simple parsing of Runbooks while retaining a human friendly format. YAML is also a very common configuration language used by many automation tools which should reduce the ramp up time for those who are experienced with other infrastructure automation tools. - -## Runbook Reference - -The below is a detailed reference for the available options within a Runbook. At this moment all fields specified within this reference are required fields. - -### `name` - -The `name` field is used to provide an arbitrary name to the Runbook. This field is a required field and must have some value. It is suggested that this value be unique and not re-used by other Runbooks. - -### `schedule` - -The `schedule` field is used to provide a cron formatted schedule for health check execution. The specified cron schedule will be used to establish the frequency to execute the health checks defined within the Runbook. - -It is also possible to define the schedule in a key/value based format as well as a cron based format. - -```yaml -schedule: - second: '*/15' - minute: '*' - hour: '*' - day: '*' - month: '*' - day_of_week: '*' -``` - -For key/value based schedules you may omit fields for default values ('*'); however, cron based schedules requires all 5 columns(`* * * * *`). 
- - -### `nodes` - -The `nodes` field is a YAML list used to specify which target nodes this runbook should be applied to. This list is based on the hostname value of the nodes. - -In the example above the value is `*web*`, since Automatron supports globbing for the `nodes` field this would mean this runbook is applied to any hosts that have the word "web" in their hostname. - -Hostnames are obtained during the target vetting process from the systems itself. If the hostname changes from vetting time to execution time that change will not be reflected in runbook processing. - -As previously mentioned this field is a YAML list which allows for multiple values to be added. The example below shows how to add multiple node targets. - -```yaml -nodes: - - "*web*" - - "*caching*" -``` - -### `checks` - -The `checks` field is a YAML dictionary that contains the health checks to be executed against the specified nodes in the `nodes` list. - -The format of `checks` is as follows. - -```yaml -checks: - name_of_check: - # health check options - another_check: - # health check options -``` - -The name key for each check is arbitrary and mainly used for logging, however it should be unique within the same Runbook. - -#### Health Check Types - -Automatron supports 2 types of health checks; `cmd` and `plugin`. The `cmd` type is used to execute arbitrary shell commands and the `plugin` type is used to upload and execute an executable. - -Depending on the type used, the health check has different required options. - -#### Command Execution - -The `cmd` Check type is used to execute arbitrary shell commands as a health check. The results of this check is determined based on the exit code of the last command executed. - -Below is an example of a `cmd` based health check. 
- -```yaml -checks: - nginx_is_running: - # Check if nginx is running - execute_from: target - type: cmd - cmd: service nginx status -``` - -This health check will execute the `service nginx status` command and validate the exit code returned to determine the status of the health check. - -With the `cmd` type health check there are 3 main options; `execute_from`, `type` and `cmd`. - -##### `type` - -The `type` field is used to specify what type of health check this check is. Acceptable values are `cmd` or `plugin`. This field is required for all health checks. - -##### `execute_from` - -The `execute_from` field is used to specify where to run the health check. Acceptable values for this field are `target` which is used to execute the health check on the node itself and `remote`. The `remote` setting will tell Automatron to execute the health check from the system running Automatron's `monitoring.py` service. - -##### `cmd` - -The `cmd` field is used to specify the shell command to execute. In the example above the command is simply `service nginx status` however, this field support much more complicated commands such as the below example. - -```yaml -checks: - http_is_accessible: - execute_from: remote - type: cmd - cmd: /usr/bin/curl -Lw "Response %{http_code}\\n" http://10.0.0.1 -o /dev/null | egrep "Response [200|301]" -``` - -#### Plugin Execution - -The `plugin` health check is used to copy a health check plugin from Automatron's plugin path to the target and execute that plugin. - -The below example shows a `plugin` based health check. 
- -```yaml -checks: - disk_free: - # Check for the % of disk free create warning with 20% free and critical for 10% free - execute_from: target - type: plugin - plugin: systems/disk_free.py - args: --warn=20 --critical=10 --filesystem=/var/log -``` - -This health check will login to the target node, upload the `systems/disk_free.py` file to a temporary location and execute it providing the arguments specified in the the `args` field. - -With `plugin` health checks there are 4 parameters to be set; `execute_from`, `type`, `plugin` and `args`. - -##### `type` - -The `type` field is used to specify what type of health check this check is. Acceptable values are `cmd` or `plugin`. This field is required for all health checks. - -##### `execute_from` - -The `execute_from` field is used to specify where to run the health check. Acceptable values for this field are `target` which is used to execute the health check on the node itself and `remote`. The `remote` setting will tell Automatron to execute the health check from the system running the `monitoring.py` service of Automatron. - -##### `plugin` - -The `plugin` field is used to specify the location of the plugin file. This is a relative file path starting from the value of the `plugin_path` parameter. - -For example, a plugin located at `/path/to/plugins/checks/mycheck/mycheck.pl` would require the value of `mycheck/mycheck.pl`. - -##### `args` - -The `args` field is used to specify the arguments to provide the plugin executable. In the example above the plugin will be executed as follows by Automatron - -```console -$ /path/to/plugin/disk_free.py --warn=20 --critical=10 --filesystem=/var/log -``` - -#### Exit Codes - -Automatron follows the **Nagios** model for health check exit codes. When a health check is executed the exit code is used to inform Automatron of the results. The below list is a map of acceptable exit codes and how they relate to Automatron health check status. 
- - * `OK`: Requires a successful exit code of `0` - * `WARNING`: Is indicated by an exit code of `1` - * `CRITICAL`: Is indicated by an exit code of `2` - * `UNKNOWN`: Is indicated by any other exit code - -### `actions` - -Like `checks` the `actions` field is a YAML dictionary that contains actions to be executed based on health check status. The `actions` field also follows a similar format to the `checks` field. - -```yaml -actions: - name_of_action: - # Action options - another_action: - # Action options -``` - -Also like health checks the name of actions should be unique within a runbook but otherwise arbitrary. - -#### Action Types - -Actions also have two types, `cmd` and `plugin`. A `cmd` action is used to execute a shell command and `plugin` actions are used to execute an executable plugin file. Both `cmd` and `plugin` have unique options as well as common options. The below section covers the two action types and the options for those types. - -#### Command Execution - -The `cmd` action type is designed to execute an arbitrary shell command. The below is an example of a `cmd` action. - -```yaml -actions: - restart_nginx: - execute_from: target - trigger: 2 - frequency: 300 - call_on: - - WARNING - - CRITICAL - type: cmd - cmd: service nginx restart -``` - -##### `execute_from` - -The `execute_from` field is used to specify where to run the action. Acceptable values for this field are `target`, `remote` and `host`. - - * `target` - The `target` value will specify that the action should be executed on the monitored host. - * `remote` - This value will specify that the action is executed from the Automatron server running the actioning process. - * `host` - This value will specify that the action is executed from another specified host. - -The alternative host can be specified via a key named `host`. Below is an example of a `host` based action. 
- -```yaml -actions: - restart_mysql: - execute_from: host - host: 10.0.0.1 - trigger: 0 - frequency: 300 - call_on: - - WARNING - - CRITICAL - type: cmd - cmd: service mysql restart -``` - -##### `trigger` - -The `trigger` field is used to specify the number of times a health check returns the state specified within `call_on`. This number must be reached consecutively. If for example, the health check returns `WARNING` and then `OK`; Automatron's internal counter will be reset. - -##### `frequency` - -The `frequency` field is used to specify the time (in seconds) between action execution. In the above example the action will be executed every `300` seconds until either the `call_on` or `trigger` conditions are no longer met. - -If you wish to execute an action every time, simply set this value to `0` seconds. - -##### `call_on` - -The `call_on` field is a YAML list which is used to list the states that should trigger this action. Valid options are `OK`, `WARNING`, `CRITICAL` & `UNKNOWN`. - -##### `type` - -The `type` field is used to specify what type of action this action is. Acceptable values are `cmd` or `plugin`. This field is required for all actions. - -##### `cmd` - -The `cmd` field is used to specify the shell command to execute as part of this action. In the example above the command is simply `service nginx restart`. - -#### Plugin Execution - -A `plugin` action type is used to execute Automatron actioning plugins. These plugins are simply executables that are copied to a temporary location and then executed with the specified arguments. - -Below is an example of a plugin action that adds a domain's DNS record to CloudFlare. - -```yaml -actions: - add_dns_record: - execute_from: remote - trigger: 0 - frequency: 300 - call_on: - - OK - type: plugin - plugin: cloudflare/dns.py - args: add email@example.com api_key example.com www.example.com A 10.0.0.1 -``` - -##### `execute_from` - -The `execute_from` field is used to specify where to run the action. 
Acceptable values for this field are `target`, `remote` and `host`. - - * `target` - The `target` value will specify that the action should be executed on the monitored host. - * `remote` - This value will specify that the action is executed from the Automatron server running the actioning process. - * `host` - This value will specify that the action is executed from another specified host. - -The alternative host can be specified via a key named `host`. Below is an example of a `host` based action. - -```yaml -actions: - restart_mysql: - execute_from: host - host: 10.0.0.1 - trigger: 0 - frequency: 300 - call_on: - - WARNING - - CRITICAL - type: cmd - cmd: service mysql restart -``` - -##### `trigger` - -The `trigger` field is used to specify the number of times a health check returns the state specified within `call_on`. This number must be reached consecutively. If for example, the health check returns `WARNING` and then `OK`; Automatron's internal counter will be reset. - -##### `frequency` - -The `frequency` field is used to specify the time (in seconds) between action execution. In the above example the action will be executed every `300` seconds until either the `call_on` or `trigger` conditions are no longer met. - -If you wish to execute an action every time, simply set this value to `0` seconds. - -##### `call_on` - -The `call_on` field is a YAML list which is used to list the states that should trigger this action. Valid options are `OK`, `WARNING`, `CRITICAL` & `UNKNOWN`. - -##### `type` - -The `type` field is used to specify what type of action this action is. Acceptable values are `cmd` or `plugin`. This field is required for all actions. - -##### `plugin` - -The `plugin` field is used to specify the location of the plugin file. This is a relative file path starting from the value of the `plugin_path` parameter. - -For example, a plugin located at `/path/to/plugins/actions/myaction/myaction.pl` would require the value of `myaction/myaction.pl`. 
- -##### `args` - -The `args` field is used to specify the arguments to provide the plugin executable. In the example above the plugin will be executed as follows by Automatron - -```console -$ /path/to/plugin/dns.py add email@example.com api_key example.com www.example.com A 10.0.0.1 -``` - -## More Runbook Examples - -### Check HTTP Status - -This Runbook validates an HTTP service is accessible and will restart the system and remove DNS entries on failure. - -```yaml+jinja -name: Verify HTTP is responding to GET requests on target system -schedule: "*/2 * * * *" -nodes: - - "*" -checks: - http_is_accessible: - execute_from: remote - type: cmd - cmd: /usr/bin/curl -Lw "Response %{http_code}\\n" http://{{ facts['network']['eth0']['v4'][0] }} -o /dev/null | egrep "Response [200|301]" -actions: - restart_http: - execute_from: target - trigger: 0 - frequency: 300 - call_on: - - WARNING - - CRITICAL - - UNKNOWN - type: cmd - cmd: service nginx restart - remove_dns: - execute_from: remote - trigger: 0 - frequency: 300 - call_on: - - WARNING - - CRITICAL - - UNKNOWN - type: plugin - plugin: cloudflare/dns.py - args: remove someone@example.com 12345 example.com --content {{ facts['network']['eth0']['v4'][0] }} - reboot: - execute_from: target - trigger: 5 - frequency: 900 - call_on: - - WARNING - - CRITICAL - - UNKNOWN - type: cmd - cmd: reboot -``` - -### Check `/var/logs` available space - -This example will validate the free space on the `/var/log` filesystem and if necessary execute a `logrotate` task - -```yaml+jinja -name: Verify /var/log -schedule: "*/2 * * * *" -nodes: - - "*" -checks: - disk_free: - # Check for the % of disk free create warning with 20% free and critical for 10% free - execute_from: target - type: plugin - plugin: systems/disk_free.py - args: --warn=20 --critical=10 --filesystem=/var/log -actions: - logrotate_nicely: - execute_from: target - trigger: 0 - frequency: 300 - call_on: - - WARNING - type: cmd - cmd: bash /etc/cron.daily/logrotate - 
logrotate_forced: - execute_from: target - trigger: 5 - frequency: 300 - call_on: - - CRITICAL - type: cmd - cmd: bash /etc/cron.daily/logrotate --force -``` diff --git a/docs/runbooks/actions.md b/docs/runbooks/actions.md new file mode 100644 index 0000000..af9fca6 --- /dev/null +++ b/docs/runbooks/actions.md @@ -0,0 +1,151 @@ +When a runbook health check returns a state, Automatron checks the runbook's definition to determine if an action should be taken. Like health checks, actions come in two flavors: **arbitrary shell commands** and **plugin executables**. In this guide we will be defining two actions, one of each type. + +During this guide we will be building runbook actions for the below runbook. + +```yaml+jinja +name: Check NGINX +schedule: "*/2 * * * *" +checks: + nginx_is_running: + execute_from: target + type: cmd + cmd: service nginx status + port_443_is_up: + execute_from: target + type: plugin + plugin: network/tcp_connect.py + args: --host=localhost --port 443 +actions: + restart_nginx: + execute_from: target + trigger: 2 + frequency: 300 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: cmd + cmd: service nginx restart + remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content 10.0.0.1 +``` + +Within this runbook there are two actions: `restart_nginx` and `remove_from_dns`. In this guide we will be breaking down these two actions to gain a better understanding of how they work. + +## A command based action + +Like health checks, Automatron actions also support arbitrary shell commands. When executing this type of action Automatron simply logs into the target system and executes the defined command. + +The below example is a simple action that logs into the target system and executes the `service nginx restart` command.
+ +```yaml+jinja +restart_nginx: + execute_from: target + trigger: 2 + frequency: 300 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: cmd + cmd: service nginx restart +``` + +This action has 6 main fields defined: `execute_from`, `trigger`, `frequency`, `call_on`, `type`, and `cmd`. Let's break down what each of these fields specifies and controls about action execution. + +### Execute from + +The `execute_from` field is used to specify where to run the action. Acceptable values for this field are `target`, `remote` and `host`. + + * `target` - This value will specify that the action should be executed on the monitored host. + * `remote` - This value will specify that the action is executed from the Automatron server. + * `host` - This value will specify that the action is executed from another specified host. + +When using the `host` value, the alternative host must be specified via a key named `host`. Below is an example of a `host` based action. + +```yaml+jinja +actions: + restart_mysql: + execute_from: host + host: 10.0.0.2 + trigger: 0 + frequency: 300 + call_on: + - WARNING + - CRITICAL + type: cmd + cmd: service mysql restart +``` + +The above action will result in Automatron logging into `10.0.0.2` and executing `service mysql restart`. + +### Trigger + +The `trigger` field specifies how many times a health check must return a state listed in `call_on` before the action is executed. This number **must be reached consecutively**. If, for example, the health check returns `WARNING` and then `OK`, Automatron's internal counter will be reset. + +### Frequency + +The `frequency` field is used to specify the time (in seconds) between action executions. In the above example the action will be executed every `300` seconds until either the `call_on` or `trigger` conditions are no longer met. + +If you wish to execute an action every time, simply set this value to `0` seconds.
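The interplay of `call_on`, `trigger` and `frequency` described above can be sketched in Python. This is a minimal illustration under stated assumptions, not Automatron's actual implementation: `ActionState` and `should_run` are hypothetical names, and the sketch assumes that the action fires once the consecutive count reaches `trigger` and that `frequency` is a minimum gap in seconds between executions.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionState:
    """Hypothetical per-action bookkeeping; not Automatron's real data model."""
    consecutive: int = 0               # consecutive results listed in call_on
    last_run: Optional[float] = None   # timestamp of the last execution


def should_run(state, status, call_on, trigger, frequency, now=None):
    """Return True when the action should execute for this check result."""
    now = time.time() if now is None else now
    if status not in call_on:
        state.consecutive = 0  # e.g. WARNING followed by OK resets the counter
        return False
    state.consecutive += 1
    if state.consecutive < trigger:
        return False  # not enough consecutive matching results yet
    if state.last_run is not None and now - state.last_run < frequency:
        return False  # executed too recently; wait out the frequency window
    state.last_run = now
    return True
```

Under these assumptions, `trigger: 0` with `frequency: 0` runs the action on every matching result, while `trigger: 2` with `frequency: 300` waits for two consecutive matching results and then runs at most once every 300 seconds.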
+ +### Call on + +The `call_on` field is a YAML list which is used to list the states that should trigger this action. Valid options are `OK`, `WARNING`, `CRITICAL` & `UNKNOWN`. + +### Type + +The `type` field is used to specify what type of action will be performed. Acceptable values are `cmd` or `plugin`. This field is required for all actions. Since our action above is a command based action we will specify `cmd`. + +### Command + +The `cmd` field is used to specify the shell command to execute as part of this action. In the example above the command is simply `service nginx restart`. When this action is executed, Automatron will log in to the host specified and execute that command. + +## A plugin based action + +When a command based action is being executed Automatron will log in to the target host and execute the command specified. With plugin based actions, Automatron will upload the plugin executable and execute it with the specified arguments. + +Below is a sample Runbook using a plugin based action. + +```yaml+jinja +remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content 10.0.0.1 +``` + +This action has 7 main fields defined: `execute_from`, `trigger`, `frequency`, `call_on`, `type`, `plugin` and `args`. Let's break down what each of these fields specifies and controls about action execution. + +### Execute from, Trigger, Frequency, Call on & Type + +The `execute_from`, `trigger`, `frequency`, `call_on`, and `type` fields are common to every action. They are applied to plugin actions in the same way as they are applied to command based actions. As such we will skip repeating these fields in this section. + +### Plugin + +The `plugin` field is used to specify the location of the plugin executable.
This is a relative file path starting from the value of the `plugin_path` parameter located within the `config/config.yml` configuration file. + +For example, a plugin located at `/path/to/plugins/actions/myaction/myaction.pl` would require the value of `myaction/myaction.pl`. + +### Plugin Arguments + +The `args` field is used to specify the arguments to pass to the plugin executable. In the example above the plugin will be executed by Automatron as follows. + +```sh +$ /path/to/plugins/actions/cloudflare/dns.py remove test@example.com apikey123 example.com --content 10.0.0.1 +``` diff --git a/docs/runbooks/checks.md b/docs/runbooks/checks.md new file mode 100644 index 0000000..bb8bcfe --- /dev/null +++ b/docs/runbooks/checks.md @@ -0,0 +1,125 @@ +Automatron determines whether a runbook action should be performed based on the results of a health check. There are two types of health checks within Automatron: **arbitrary shell commands** and **plugin executables**. In this guide we will walk through defining two health checks, one of each type. + +The below runbook is a sample that this guide will be based on. + +```yaml+jinja +name: Check NGINX +schedule: "*/2 * * * *" +checks: + nginx_is_running: + execute_from: target + type: cmd + cmd: service nginx status + port_443_is_up: + execute_from: target + type: plugin + plugin: network/tcp_connect.py + args: --host=localhost --port 443 +actions: + restart_nginx: + execute_from: target + trigger: 2 + frequency: 300 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: cmd + cmd: service nginx restart + remove_from_dns: + execute_from: remote + trigger: 0 + frequency: 0 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: plugin + plugin: cloudflare/dns.py + args: remove test@example.com apikey123 example.com --content 10.0.0.1 +``` + +In the above example, there are two health checks defined: `nginx_is_running` and `port_443_is_up`.
In the below section we will break down each of these health checks to better understand how health checks are defined. + +## A command based health check + +Command based health checks are one of the simplest concepts in Automatron. This type of health check allows users to define a command that is executed to determine the health status of a target. + +This is accomplished by Automatron simply logging into the target system over SSH and executing the defined command. The exit code of the executed command is then used to determine the status of the health check. + +The below sample is the `nginx_is_running` command based health check. + +```yaml+jinja + nginx_is_running: + execute_from: target + type: cmd + cmd: service nginx status +``` + +In this sample we can see that there are 3 values required for command based health checks. Those values are `execute_from`, `type`, and `cmd`. Let's go ahead and break down these values to gain a better understanding of what they mean and tell Automatron to do. + +### Execute from + +The `execute_from` field is used to specify where to run the health check. Acceptable values for this field are `target` which is used to execute the health check on the monitored node itself and `remote`. The `remote` setting will tell Automatron to execute the health check from the system running Automatron itself. + +In our case the command we wish to execute can only be executed from the monitored system itself, as such the value of this field will be `target`. + +### Type + +The `type` field is used to specify what type of health check this check is. Acceptable values are `cmd` or `plugin`. In this case, since we are defining a command based health check our value is set to `cmd`. + +### Command + +The `cmd` field is used to specify the shell command to execute. In our example the command is simply `service nginx status`. However, this field can support much more complicated commands such as the below example. 
+ +```yaml+jinja +cmd: /usr/bin/curl -Lw "Response %{http_code}\\n" http://10.0.0.1 -o /dev/null | egrep "Response (200|301)" +``` + +It is not uncommon to use multiple commands connected with pipes, output redirection and conditionals within a runbook. + +## A plugin based health check + +Plugin based health checks are similar to command based health checks in that the exit code is used to determine status. Where these checks differ is that Automatron will copy an executable to the target system and then execute that executable with the specified arguments. + +Below is an example plugin health check. + +```yaml+jinja +port_443_is_up: + execute_from: target + type: plugin + plugin: network/tcp_connect.py + args: --host=localhost --port 443 +``` + +Plugin type health checks have 4 configuration items: `execute_from`, `type`, `plugin` & `args`. Let's go ahead and break down these values to gain a better understanding of what they mean and what they tell Automatron to do. + +### Execute from & Type + +The `execute_from` and `type` fields are common to every health check. They are applied to plugin health checks in the same way as they are applied to command based health checks. As such we will skip repeating these fields in this section. + +### Plugin + +The `plugin` field is used to specify the location of the plugin executable. This is a relative file path starting from the value of the `plugin_path` parameter located within the `config/config.yml` configuration file. + +For example, a plugin located at `/path/to/plugins/checks/mycheck/mycheck.pl` would require the value of `mycheck/mycheck.pl`. + +### Plugin Arguments + +The `args` field is used to specify the arguments to provide the plugin executable.
In the example above the plugin will be executed as follows by Automatron + +```sh +$ /path/to/plugins/checks/network/tcp_connect.py --host=localhost --port 443 +``` + +## Using Exit Codes to relay health check status + +Automatron follows the **Nagios** model for health check exit codes. When a health check is executed the exit code is used to inform Automatron of the results. The below list is a map of acceptable exit codes and how they relate to Automatron health check status. + + * `OK`: Requires a successful exit code of `0` + * `WARNING`: Is indicated by an exit code of `1` + * `CRITICAL`: Is indicated by an exit code of `2` + * `UNKNOWN`: Is indicated by any other exit code + +!!! tip + Since Automatron supports the **Nagios** exit code strategy most Nagios compliant health checks can also be used with Automatron. diff --git a/docs/runbooks/index.md b/docs/runbooks/index.md new file mode 100644 index 0000000..78e515d --- /dev/null +++ b/docs/runbooks/index.md @@ -0,0 +1,167 @@ +The core of Automatron is based around **Runbooks**. Runbooks are policies that define health checks and actions. You can think of them in the same way you would think of a printed runbook. Except with Automatron, the actions are automated. + +Below is a very simple Runbook example. + +```yaml+jinja +name: Check NGINX +schedule: "*/2 * * * *" +checks: + nginx_is_running: + execute_from: target + type: cmd + cmd: service nginx status +actions: + restart_nginx: + execute_from: target + trigger: 2 + frequency: 300 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: cmd + cmd: service nginx restart +``` + +This guide will walk through creating the above runbook as well as applying this runbook to all monitored hosts. + +## Creating the Runbook YAML file + +By default, Runbooks are specified within the `config/runbooks` directory. The runbook we will be creating is used to manage the NGINX service. We will want this runbook to be easy to find. 
An easy way to do that would be to create the runbook with a name similar to the service it manages. We can do so in one of two ways. + +We can either create a file `config/runbooks/nginx.yml` or `config/runbooks/nginx/init.yml`. Either option is acceptable for the next steps. For this guide we will create the file as `config/runbooks/nginx/init.yml`. + +```sh +$ mkdir -p config/runbooks/nginx +$ vi config/runbooks/nginx/init.yml +``` + +To get started let's go ahead and create the runbook by inserting our example runbook. + +```yaml+jinja +name: Check NGINX +schedule: "*/2 * * * *" +checks: + nginx_is_running: + execute_from: target + type: cmd + cmd: service nginx status +actions: + restart_nginx: + execute_from: target + trigger: 2 + frequency: 300 + call_on: + - WARNING + - CRITICAL + - UNKNOWN + type: cmd + cmd: service nginx restart +``` + +## The Anatomy of a Runbook + +A runbook consists of 4 major parameters: `name`, `schedule`, `checks`, & `actions`. + +### Name + +The `name` field is used to provide an arbitrary name for the runbook. This field is required. Its value must be unique and not re-used by other runbooks, as this name will be referenced internally within Automatron. + +### Schedule + +The `schedule` field is used to provide a cron formatted schedule for health check execution. A cron formatted schedule of `*/2 * * * *` will result in the health checks being executed every 2 minutes. + +!!! warning + Due to YAML formatting the cron schedule should be encased in single or double quotes such as `'*/2 * * * *'`. Failure to do so will result in a parsing error from YAML. + +#### Alternative schedule format + +It is also possible to define a schedule in a key/value based cron format such as the example below.
+ +```yaml+jinja +schedule: + second: '*/15' + minute: '*' + hour: '*' + day: '*' + month: '*' + day_of_week: '*' +``` + +Using this format you may omit keys that have a value of `*` as this is the default value. For example, the above schedule could also be written as follows. + +```yaml+jinja +schedule: + second: '*/15' +``` + +!!! warning + When using the key/value based format it is important to specify the `second` parameter, as a default value of `*` would result in checks being run every second. + +### Checks + +The `checks` field is a YAML dictionary that contains the health checks to be executed against monitored hosts. The format of `checks` is as follows. + +```yaml+jinja +checks: + name_of_check: + # health check options + another_check: + # health check options +``` + +For more details around required health check parameters please read the [Checks](checks.md) section. + +### Actions + +Like `checks`, the `actions` field is a YAML dictionary that contains actions to be executed based on health check status. The `actions` field also follows a similar format to the `checks` field. + +```yaml+jinja +actions: + name_of_action: + # Action options + another_action: + # Action options +``` + +For more details around required action parameters please read the [Actions](actions.md) section. + +## Applying the Runbook + +By creating `config/runbooks/nginx/init.yml` we have only defined the runbook itself. This runbook however will not be applied to any monitored hosts until we specify which hosts it should be applied to. + +To do this we will need to edit the `config/runbooks/init.yml` file. This file is a master list of all runbook to host mappings. To apply our runbook to all hosts we can simply insert the following into this file. + +```yaml+jinja +'*': + - nginx +``` + +The first field `'*'` is a glob pattern that is matched against the target hostname. In this case, since the value is `*`, all hosts will be matched.
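
The glob matching described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes the match behaves like `fnmatch.fnmatch` from the Python standard library, and the names `mapping` and `runbooks_for` are hypothetical, not part of Automatron.

```python
import fnmatch

# Hypothetical in-memory form of config/runbooks/init.yml:
# glob pattern -> list of runbook names.
mapping = {
    '*': ['nginx'],
}


def runbooks_for(hostname, mapping):
    """Return the runbooks whose glob pattern matches the hostname.

    Assumption: matching follows fnmatch semantics.
    """
    matched = []
    for pattern, runbooks in mapping.items():
        if fnmatch.fnmatch(hostname, pattern):
            matched.extend(runbooks)
    return matched


# With a pattern of '*', every hostname is matched.
print(runbooks_for('web001.example.com', mapping))
print(runbooks_for('db001.example.com', mapping))
```

Under these assumptions, a narrower pattern would simply leave non-matching hostnames with an empty list of runbooks.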
+ +If we wished to limit this runbook to servers with a naming scheme like `web001.example.com` we could do so with the following modification. + +```yaml+jinja +'web*': + - nginx +``` + +### Specifying multiple targets and runbooks + +It is possible to specify multiple host and runbook mappings such as the above. The below is an example of what a `runbooks/init.yml` file may look like for an environment hosting a two-tier web application. + +```yaml+jinja +'*': + - cpu + - mem_free + - disk_free + - ntp + - ssh +'web*': + - nginx + - uwsgi +'db*': + - mysql +``` + +At this point we have a basic runbook that is being applied to all hosts. To make these changes take effect, simply restart Automatron. diff --git a/mkdocs.yml b/mkdocs.yml index 807374d..63709d8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -1,21 +1,39 @@ site_name: Automatron -site_url: https://www.automatron.io +site_url: https://automatron.io repo_url: https://github.com/madflojo/automatron -site_description: Autonomous IT Systems Monitoring and Remediation +docs_dir: docs +site_dir: site +site_description: A framework for creating self-healing infrastructure. It simply detects system events & takes action.
site_author: Benjamin Cane site_favicon: img/favicon.ico -theme: readthedocs +theme: material +extra: + logo: 'img/automatron.png' + social: + - type: twitter + link: 'https://twitter.com/automatronio' + - type: github + link: 'https://github.com/madflojo/automatron' +markdown_extensions: + - codehilite + - admonition + - toc: + permalink: true + pages: - - Overview: index.md - - Getting Started: - - Automatron in 10 Minutes: automatron-in-10-minutes.md - - Deploying with Docker: deploying-with-docker.md + - Introduction: index.md + - Installation: + - Basic Installation: install/index.md + - Deploy with Docker: install/docker.md + - Configuration: configure.md - Runbooks: - - Overview: runbooks.md - - Facts: facts.md + - Basics: runbooks/index.md + - Checks: runbooks/checks.md + - Actions: runbooks/actions.md + - Facts: facts.md - Plugins: - - Index: plugins.md + - Index: plugins/index.md - Discovery: - AWS: plugins/discovery/aws.md - Digital Ocean: plugins/discovery/digitalocean.md @@ -46,3 +64,10 @@ pages: - System Info: plugins/vetting/ontarget/system-info.md - Remote: - Ping: plugins/vetting/remote/ping.md +# - Customizing +# - Actions +# - Checks +# - Datastores +# - Discovery +# - Logging +# - Vetting diff --git a/monitoring.py b/monitoring.py index e0dca90..da6dd61 100644 --- a/monitoring.py +++ b/monitoring.py @@ -11,8 +11,6 @@ ''' -import fnmatch -import os import sys import signal import json @@ -133,20 +131,11 @@ def schedule(scheduler, runbook, target, config, dbc, logger): month=task_schedule['month'], day_of_week=task_schedule['day_of_week'], ) - should_schedule = False - for node in target['runbooks'][runbook]['nodes']: - logger.debug("Checking if target {0} is {1} from list".format(target['hostname'], node)) - if fnmatch.fnmatch(target['hostname'], node): - should_schedule = True - - if should_schedule: - return scheduler.add_job( - monitor, - trigger=cron, - args=[runbook, target, config, dbc, logger] - ) - else: - return False + return 
scheduler.add_job( + monitor, + trigger=cron, + args=[runbook, target, config, dbc, logger] + ) def listen(scheduler, config, dbc, logger): ''' Listen for new events and schedule runbooks ''' diff --git a/tests/unit/test_monitoring_schedule.py b/tests/unit/test_monitoring_schedule.py index aeebf09..803e464 100644 --- a/tests/unit/test_monitoring_schedule.py +++ b/tests/unit/test_monitoring_schedule.py @@ -33,7 +33,6 @@ def tearDown(self): class TestCronSchedule(ScheduleTest): ''' Test when a cron based schedule is provided ''' @mock.patch('monitoring.CronTrigger') - @mock.patch('monitoring.fnmatch.fnmatch', new=mock.MagicMock(return_value=True)) def runTest(self, mock_triggered): ''' Execute test ''' scheduler = mock.Mock(**{ @@ -43,9 +42,6 @@ def runTest(self, mock_triggered): 'runbooks' : { 'test' : { 'schedule' : "* * * * *", - 'nodes' : [ - 'tes*' - ] } } }) @@ -66,7 +62,6 @@ def runTest(self, mock_triggered): class TestSpecificSchedule(ScheduleTest): ''' Test when a cron based schedule is provided ''' @mock.patch('monitoring.CronTrigger') - @mock.patch('monitoring.fnmatch.fnmatch', new=mock.MagicMock(return_value=True)) def runTest(self, mock_triggered): ''' Execute test ''' scheduler = mock.Mock(**{ @@ -83,9 +78,6 @@ def runTest(self, mock_triggered): 'month' : 1, 'day_of_week' : 1 }, - 'nodes' : [ - 'tes*' - ] } } }) @@ -107,7 +99,6 @@ def runTest(self, mock_triggered): class TestNoSchedule(ScheduleTest): ''' Test when a cron based schedule is provided ''' @mock.patch('monitoring.CronTrigger') - @mock.patch('monitoring.fnmatch.fnmatch', new=mock.MagicMock(return_value=True)) def runTest(self, mock_triggered): ''' Execute test ''' scheduler = mock.Mock(**{ @@ -116,9 +107,6 @@ def runTest(self, mock_triggered): self.target.update({ 'runbooks' : { 'test' : { - 'nodes' : [ - 'tes*' - ] } } })