
Commit

Merge pull request #49 from wjayesh/docs
add prometheus exporter
wjayesh authored Aug 5, 2020
2 parents 5eed8bb + 9379bc9 commit d95b788
Showing 1 changed file with 46 additions and 0 deletions.
46 changes: 46 additions & 0 deletions README.md
@@ -15,6 +15,8 @@ A binary and packages to perform health checks on pods and services running on K

* [**Workflow**](https://github.com/wjayesh/coredns-healthcheck/tree/main/#workflow)

* [**Prometheus Monitoring**]()

* [**Deployment**](https://github.com/wjayesh/coredns-healthcheck/tree/main/#deployment)

* [**Milestones**](https://github.com/WJayesh/coredns-healthcheck/tree/main#milestones-)
Expand Down Expand Up @@ -67,6 +69,50 @@ Firstly, the binary queries the CoreDNS pods from the host namespace and checks

If the service is unavailable from any namespace, the `/etc/resolv.conf` file is then inspected to look for possible causes of failure.


## Prometheus Exporter

An exporter library is implemented at [`pkg/exporter`]() that takes values from the application and registers them with Prometheus using the Go client.

The exporter helps determine how many times remedies were required, how often the pods failed, and the primary reasons for the failures (inferred from the type of remedy that fixed them), among other things.

### Available Groups Of Data

Name | Description
-----| ------
remedy | Metrics related to the remedial measures taken when the pods fail, such as restarting pods or increasing memory allocation.
dns | Metrics related to the DNS queries made by the application and their responses.


The remedy group has the following metrics:

Name | Metric Type | Exposed Information
---- | ---- | ---
`oom_count` | Counter | Counts the number of OOM errors encountered
`restart_count` | Counter | Counts the number of restarts performed on the pods
`total_failures` | Counter | Counts the total number of failures of the pods under check
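
To make the counter semantics concrete, here is a minimal sketch that uses only the Go standard library. It is not the code in `pkg/exporter`, which relies on the Prometheus Go client; the `CounterSet` type and its methods are hypothetical names invented for illustration, and the help strings are paraphrased from the table above. The sketch only shows how monotonically increasing counters with these names could appear in the Prometheus text exposition format when scraped.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// CounterSet is a hypothetical, stdlib-only stand-in for a group of
// Prometheus counters. The real exporter uses the Prometheus Go client.
type CounterSet struct {
	mu     sync.Mutex
	help   map[string]string
	values map[string]uint64
}

// NewCounterSet returns an empty set of counters.
func NewCounterSet() *CounterSet {
	return &CounterSet{help: map[string]string{}, values: map[string]uint64{}}
}

// Register declares a counter that starts at zero.
func (c *CounterSet) Register(name, help string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.help[name] = help
	c.values[name] = 0
}

// Inc adds one to a registered counter; unknown names are ignored.
// Counters only ever increase.
func (c *CounterSet) Inc(name string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.values[name]; ok {
		c.values[name]++
	}
}

// Render returns the counters in the Prometheus text exposition
// format, sorted by metric name.
func (c *CounterSet) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	names := make([]string, 0, len(c.values))
	for n := range c.values {
		names = append(names, n)
	}
	sort.Strings(names)
	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "# HELP %s %s\n# TYPE %s counter\n%s %d\n",
			n, c.help[n], n, n, c.values[n])
	}
	return b.String()
}

func main() {
	remedy := NewCounterSet()
	remedy.Register("oom_count", "Number of OOM errors encountered")
	remedy.Register("restart_count", "Number of restarts performed on the pods")
	remedy.Register("total_failures", "Total number of failures of the pods under check")

	// Simulate one failure caused by an OOM error and fixed by a restart.
	remedy.Inc("total_failures")
	remedy.Inc("oom_count")
	remedy.Inc("restart_count")

	fmt.Print(remedy.Render())
}
```

Because counters never decrease, a dashboard would typically plot their rate of change rather than the raw value.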


The DNS group has the following metrics:

Name | Metric Type | Exposed Information
---- | ---- | ---
`dns_query_count` | Counter | Counts the number of DNS queries made
`dns_query_response_time` | Histogram | Time taken to complete DNS queries
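
A histogram differs from a counter in that each observation is sorted into buckets by value. The following stdlib-only sketch illustrates how a metric such as `dns_query_response_time` could be bucketed and exposed. It is not the exporter's implementation: the `Histogram` type is a hypothetical name, and the bucket bounds and the choice of seconds as the unit are assumptions picked for the example.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// Histogram is a hypothetical, stdlib-only model of a Prometheus
// histogram. The real exporter uses the Prometheus Go client.
type Histogram struct {
	bounds []float64 // bucket upper bounds, ascending
	counts []uint64  // per-bucket counts; the last slot is the +Inf bucket
	sum    float64   // sum of all observed values
	total  uint64    // number of observations
}

// NewHistogram copies and sorts the bounds and allocates one extra
// bucket for values above the largest bound.
func NewHistogram(bounds []float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records one value in the first bucket whose upper bound is
// greater than or equal to it.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.sum += v
	h.total++
}

// Render returns the histogram in the Prometheus text exposition
// format, where bucket counts are cumulative.
func (h *Histogram) Render(name string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "# TYPE %s histogram\n", name)
	var cum uint64
	for i, ub := range h.bounds {
		cum += h.counts[i]
		fmt.Fprintf(&b, "%s_bucket{le=\"%s\"} %d\n",
			name, strconv.FormatFloat(ub, 'g', -1, 64), cum)
	}
	fmt.Fprintf(&b, "%s_bucket{le=\"+Inf\"} %d\n", name, h.total)
	fmt.Fprintf(&b, "%s_sum %s\n", name, strconv.FormatFloat(h.sum, 'g', -1, 64))
	fmt.Fprintf(&b, "%s_count %d\n", name, h.total)
	return b.String()
}

func main() {
	// Assumed bucket bounds in seconds, chosen only for illustration.
	h := NewHistogram([]float64{0.25, 0.5, 1})
	for _, seconds := range []float64{0.125, 0.5, 0.75, 2} {
		h.Observe(seconds)
	}
	fmt.Print(h.Render("dns_query_response_time"))
}
```

The cumulative `le` buckets are what allow Prometheus to estimate quantiles of the response time at query time.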

### Visuals

A sample run with an orchestrated OOM error:

![OOM](https://user-images.githubusercontent.com/37150991/89416485-f10e0100-d74a-11ea-8eeb-07ae7bda3a22.png)

The next graph shows the number of restarts that fixed the pods and the total number of failures.

![Restarts and Failures](https://user-images.githubusercontent.com/37150991/89416967-afca2100-d74b-11ea-8873-99905b42b57a.png)


More visuals will be added as tests proceed.


## Deployment
