Implemented Prometheus Rule for automated alerts #193

Merged: 4 commits, Mar 1, 2024
17 changes: 7 additions & 10 deletions Makefile
@@ -12,15 +12,12 @@ docs: ## Generate charts' docs using helm-docs
(echo "Please, install https://github.com/norwoodj/helm-docs first" && exit 1)

.PHONY: schema
schema: ## Generate charts' schema usign helm schema-gen plugin
@helm schema-gen charts/cloudnative-pg/values.yaml > charts/cloudnative-pg/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
schema: cloudnative-pg-schema cluster-schema ## Generate charts' schema using helm-schema-gen

.PHONY: pgbench-deploy
pgbench-deploy: ## Installs pgbench chart
helm dependency update charts/pgbench
helm upgrade --install pgbench --atomic charts/pgbench
cloudnative-pg-schema:
@helm schema-gen charts/cloudnative-pg/values.yaml | cat > charts/cloudnative-pg/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)

.PHONY: pgbench-uninstall
pgbench-uninstall: ## Uninstalls cnpg-pgbench chart if present
@helm uninstall pgbench
cluster-schema:
@helm schema-gen charts/cluster/values.yaml | cat > charts/cluster/values.schema.json || \
(echo "Please, run: helm plugin install https://github.com/karuppiah7890/helm-schema-gen.git" && exit 1)
12 changes: 7 additions & 5 deletions charts/cluster/README.md
@@ -88,9 +88,9 @@ Additionally you can specify the following parameters:
```yaml
backups:
scheduledBackups:
- name: daily-backup
schedule: "0 0 0 * * *" # Daily at midnight
backupOwnerReference: self
- name: daily-backup
schedule: "0 0 0 * * *" # Daily at midnight
backupOwnerReference: self
```

Each backup adapter takes its own set of parameters, listed in the [Configuration options](#Configuration-options) section.
@@ -149,8 +149,10 @@ refer to the [CloudNativePG Documentation](https://cloudnative-pg.io/documentat
| cluster.instances | int | `3` | Number of instances |
| cluster.logLevel | string | `"info"` | The instances' log level, one of the following values: error, warning, info (default), debug, trace |
| cluster.monitoring.customQueries | list | `[]` | |
| cluster.monitoring.enablePodMonitor | bool | `false` | |
| cluster.postgresql | string | `nil` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
| cluster.monitoring.enabled | bool | `false` | |
| cluster.monitoring.podMonitor.enabled | bool | `true` | |
| cluster.monitoring.prometheusRule.enabled | bool | `true` | |
| cluster.postgresql | object | `{}` | Configuration of the PostgreSQL server See: https://cloudnative-pg.io/documentation/current/cloudnative-pg.v1/#postgresql-cnpg-io-v1-PostgresConfiguration |
| cluster.primaryUpdateMethod | string | `"switchover"` | Method to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated. It can be switchover (default) or in-place (restart). |
| cluster.primaryUpdateStrategy | string | `"unsupervised"` | Strategy to follow to upgrade the primary server during a rolling update procedure, after all replicas have been successfully updated: it can be automated (unsupervised - default) or manual (supervised) |
| cluster.priorityClassName | string | `""` | |
49 changes: 49 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHACritical.md
@@ -0,0 +1,49 @@
CNPGClusterHACritical
=====================

Meaning
-------

The `CNPGClusterHACritical` alert is triggered when the CloudNativePG cluster has no ready standby replicas.

This can happen during either a normal failover or an automated minor version upgrade in a cluster with 2 or fewer
instances. The replaced instance may need some time to catch up with the cluster primary instance.

This alert will always be triggered if your cluster is configured to run with only 1 instance. In that case, you
may want to silence it.

Impact
------

Having no available replicas puts your cluster at severe risk if the primary instance fails. The primary instance is
still online and able to serve queries, although connections to the `-ro` endpoint will fail.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
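
If you have the [`cnpg` kubectl plugin](https://cloudnative-pg.io/documentation/current/kubectl-plugin/) installed, it prints a consolidated view of the instances and their replication state (optional, shown here as a convenience):

```bash
# Optional: requires the cnpg kubectl plugin.
kubectl cnpg status <cluster_name> --namespace <namespace>
```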

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
51 changes: 51 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHAWarning.md
@@ -0,0 +1,51 @@
CNPGClusterHAWarning
====================

Meaning
-------

The `CNPGClusterHAWarning` alert is triggered when the number of ready standby replicas in the CloudNativePG cluster is less than `2`.

This alert will always be triggered if your cluster is configured to run with fewer than `3` instances. In that case, you
may want to silence it.

Impact
------

Having fewer than two available replicas puts your cluster at risk if another instance fails. The cluster is still able
to operate normally, although the `-ro` and `-r` endpoints operate at reduced capacity.

This can happen during a normal failover or an automated minor version upgrade. The replaced instance may need some time
to catch up with the cluster primary instance, which will trigger the alert if the operation takes more than 5 minutes.

At `0` available ready replicas, a `CNPGClusterHACritical` alert will be triggered.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsCritical
==================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 95% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

* Increase the maximum number of connections by increasing the `max_connections` PostgreSQL parameter (see the sketch below).
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
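
As a hedged, illustrative sketch (assuming the chart was installed from the `cnpg` Helm repository as `cnpg/cluster`, that the release is named `mycluster`, and that the chart passes `cluster.postgresql.parameters` through to the Cluster's PostgreSQL configuration; verify the exact key in the chart's `values.yaml`):

```bash
# Hypothetical example: raise max_connections on an existing release.
helm upgrade mycluster cnpg/cluster --reuse-values \
  --set-string cluster.postgresql.parameters.max_connections=200

# Confirm the new setting once the rolling restart has completed:
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SHOW max_connections;"
```
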
@@ -0,0 +1,24 @@
CNPGClusterHighConnectionsWarning
=================================

Meaning
-------

This alert is triggered when the number of connections to the CloudNativePG cluster instance exceeds 85% of its capacity.

Impact
------

At 100% capacity, the CloudNativePG cluster instance will not be able to accept new connections. This will result in a service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).
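
Current connection usage can also be compared against the configured limit directly on the primary (a small sketch using the same placeholders as the other runbooks):

```bash
# Count active backends and show the configured maximum:
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT count(*) AS connections, current_setting('max_connections') AS max_connections FROM pg_stat_activity;"
```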

Mitigation
----------

* Increase the maximum number of connections by increasing the `max_connections` PostgreSQL parameter.
* Use connection pooling by enabling PgBouncer to reduce the number of connections to the database.
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterHighReplicationLag.md
@@ -0,0 +1,31 @@
CNPGClusterHighReplicationLag
=============================

Meaning
-------

This alert is triggered when the replication lag of the CloudNativePG cluster exceeds `1s`.

Impact
------

High replication lag can cause the cluster replicas to fall out of sync. Queries to the `-r` and `-ro` endpoints may return stale data.
In the event of a failover, there may be data loss for the time period of the lag.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

High replication lag can be caused by a number of factors, including:
* Network issues
* High load on the primary or replicas
* Long running queries
* Suboptimal PostgreSQL configuration, in particular a low `max_wal_senders` setting.

Check the replication status from the primary:

```bash
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- psql -c "SELECT * from pg_stat_replication;"
```

Mitigation
----------
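
The mitigation depends on the cause identified above. As one hedged example, assuming the chart passes `cluster.postgresql.parameters` through to the Cluster's PostgreSQL configuration and the release is named `mycluster` (verify the exact key in the chart's `values.yaml`), `max_wal_senders` could be raised and the lag re-checked:

```bash
# Hypothetical sketch: increase max_wal_senders on an existing release.
helm upgrade mycluster cnpg/cluster --reuse-values \
  --set-string cluster.postgresql.parameters.max_wal_senders=16

# Re-check replication lag once the change has rolled out:
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT application_name, state, write_lag, flush_lag, replay_lag FROM pg_stat_replication;"
```
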
28 changes: 28 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterInstancesOnSameNode.md
@@ -0,0 +1,28 @@
CNPGClusterInstancesOnSameNode
==============================

Meaning
-------

The `CNPGClusterInstancesOnSameNode` alert is raised when two or more database pods are scheduled on the same node.

Impact
------

A failure or scheduled downtime of a single node will lead to a potential service disruption and/or data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the CloudNativePG instance pods and the nodes they are scheduled on:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Mitigation
----------

1. Verify that you have more than one node without taints that would prevent pods from being scheduled there.
2. Verify your [affinity](https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/) configuration (see the sketch below).
3. For more information, please refer to the ["Scheduling"](https://cloudnative-pg.io/documentation/current/scheduling/) section in the documentation.
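
A hedged sketch of point 2, assuming the chart exposes the Cluster's `affinity` stanza under `cluster.affinity` and the release is named `mycluster` (check the chart's `values.yaml` for the actual keys):

```bash
# Hypothetical sketch: require pod anti-affinity across nodes.
helm upgrade mycluster cnpg/cluster --reuse-values \
  --set cluster.affinity.enablePodAntiAffinity=true \
  --set cluster.affinity.podAntiAffinityType=required \
  --set cluster.affinity.topologyKey=kubernetes.io/hostname

# Verify that the instances are now spread across different nodes:
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```
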
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceCritical.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceCritical
===============================

Meaning
-------

This alert is triggered when disk usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
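
As a quick check of the archiver (a sketch using the same placeholders as the other runbooks):

```bash
# A steadily growing failed_count, or a last_failed_time newer than
# last_archived_time, points at an archiving problem letting WAL pile up.
kubectl exec --namespace <namespace> --stdin --tty services/<cluster_name>-rw -- \
  psql -c "SELECT archived_count, failed_count, last_archived_time, last_failed_time FROM pg_stat_archiver;"
```
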
31 changes: 31 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterLowDiskSpaceWarning.md
@@ -0,0 +1,31 @@
CNPGClusterLowDiskSpaceWarning
==============================

Meaning
-------

This alert is triggered when disk usage on the CloudNativePG cluster exceeds 90%. It can be triggered by any of the following:

* the PVC hosting the `PGDATA` (`storage` section)
* the PVC hosting WAL files (`walStorage` section), where applicable
* any PVC hosting a tablespace (`tablespaces` section)

Impact
------

Excessive disk space usage can lead to fragmentation, negatively impacting performance. Reaching 100% disk usage will result
in downtime and data loss.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Mitigation
----------

If you experience issues with the WAL (Write-Ahead Logging) volume and have
set up continuous archiving, ensure that WAL archiving is functioning
correctly. This is crucial to avoid a buildup of WAL files in the `pg_wal`
folder. Monitor the `cnpg_collector_pg_wal_archive_status` metric, specifically
ensuring that the number of `ready` files does not increase linearly.
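
To confirm which volume is filling up, a hedged sketch (the `cnpg.io/cluster` label is the operator's standard cluster label; exact PVC names and mount points depend on your configuration):

```bash
# List the PVCs backing the cluster and their requested sizes:
kubectl get pvc --namespace <namespace> -l "cnpg.io/cluster=<cluster_name>"

# Check actual usage inside an affected instance pod:
kubectl exec --namespace <namespace> --stdin --tty <instance-pod-name> -- df -h
```
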
43 changes: 43 additions & 0 deletions charts/cluster/docs/runbooks/CNPGClusterOffline.md
@@ -0,0 +1,43 @@
CNPGClusterOffline
==================

Meaning
-------

The `CNPGClusterOffline` alert is triggered when there are no ready CloudNativePG instances.

Impact
------

Having an offline cluster means your applications will not be able to access the database, leading to potential service
disruption.

Diagnosis
---------

Use the [CloudNativePG Grafana Dashboard](https://grafana.com/grafana/dashboards/20417-cloudnativepg/).

Get the status of the CloudNativePG cluster instances:

```bash
kubectl get pods -A -l "cnpg.io/podRole=instance" -o wide
```

Check the logs of the affected CloudNativePG instances:

```bash
kubectl logs --namespace <namespace> pod/<instance-pod-name>
```

Check the CloudNativePG operator logs:

```bash
kubectl logs --namespace cnpg-system -l "app.kubernetes.io/name=cloudnative-pg"
```
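
The Cluster resource itself also reports the overall phase and the number of ready instances (a small sketch; the full CRD name is used to avoid ambiguity with other resources named "cluster"):

```bash
# List all CloudNativePG clusters and their reported status:
kubectl get clusters.postgresql.cnpg.io -A

# Inspect the conditions and recent events of the affected cluster:
kubectl describe clusters.postgresql.cnpg.io <cluster_name> --namespace <namespace>
```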

Mitigation
----------

Refer to the [CloudNativePG Failure Modes](https://cloudnative-pg.io/documentation/current/failure_modes/)
and [CloudNativePG Troubleshooting](https://cloudnative-pg.io/documentation/current/troubleshooting/) documentation for
more information on how to troubleshoot and mitigate this issue.