chore: add loki alert (#58)
send alert if loki tasks aren't running
jlangy authored Feb 5, 2025
1 parent ac648cb commit e0adcae
Showing 5 changed files with 95 additions and 2 deletions.
6 changes: 4 additions & 2 deletions .github/workflows/terraform.yaml
@@ -90,7 +90,7 @@ jobs:
RETENTION_PERIOD=168h
SUBNET_A=Web_Dev_aza_net
SUBNET_B=Web_Dev_azb_net
RC_PREFIX=SANDBOX
EOF
- name: Set env to production
@@ -108,7 +108,7 @@ jobs:
RETENTION_PERIOD=180d
SUBNET_A=Web_Prod_aza_net
SUBNET_B=Web_Prod_azb_net
RC_PREFIX=PRODUCTION
EOF
- name: Configure AWS Credentials
@@ -143,6 +143,8 @@ jobs:
subnet_a="${{env.SUBNET_A}}"
subnet_b="${{env.SUBNET_B}}"
loki_tag="${{env.LOKI_TAG}}"
rc_prefix="${{env.RC_PREFIX}}"
rc_webhook="${{secrets.LOKI_WEBHOOK}}"
EOF
- name: Terraform Plan
16 changes: 16 additions & 0 deletions README.md
@@ -64,6 +64,22 @@ GitHub CD pipeline scripts are triggered based on the directory that has changed

The terraform account for deployment is restricted to the required resource types for this repository. If adding new resources not currently required, you will get a permission denied error. Expand the permissions on the `sso-dashboard-boundary` as needed.

When first setting up a webhook to integrate with [AWS SNS](https://aws.amazon.com/sns), you need to confirm that the URL you provided is correct. AWS sends a confirmation link to that URL; you can find it in the `SubscribeURL` field of the request body, which Rocket.Chat exposes as `content_raw`. For example, for Rocket.Chat, the script:

``` javascript
class Script {
  process_incoming_request({ request }) {
    return {
      content: {
        text: `@here ${JSON.parse(request.content_raw).SubscribeURL}`
      }
    };
  }
}
```

would output the URL to follow.
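
Once the subscription is confirmed, the same webhook will also receive the alarm notifications. The script below is a minimal sketch of how one incoming script might handle both cases. It is not part of this commit. It assumes the standard SNS HTTP/S payload fields (`Type`, `SubscribeURL`, `Message`) and the `AlarmName`, `NewStateValue` and `NewStateReason` fields that CloudWatch places in the alarm message; the wording of the chat text is purely illustrative.

``` javascript
class Script {
  process_incoming_request({ request }) {
    // SNS posts a JSON body; Rocket.Chat exposes the raw text as content_raw.
    const body = JSON.parse(request.content_raw);

    // First contact: SNS asks the endpoint to confirm the subscription.
    if (body.Type === "SubscriptionConfirmation") {
      return { content: { text: `@here Confirm the SNS subscription: ${body.SubscribeURL}` } };
    }

    // Later deliveries: the alarm payload arrives as a JSON string in Message.
    if (body.Type === "Notification") {
      let text = String(body.Message);
      try {
        const alarm = JSON.parse(body.Message);
        if (alarm && alarm.AlarmName) {
          text = `${alarm.AlarmName} is ${alarm.NewStateValue}: ${alarm.NewStateReason}`;
        }
      } catch (err) {
        // Message was not JSON: keep the raw message text.
      }
      return { content: { text: `@here ${text}` } };
    }

    // Any other type (for example UnsubscribeConfirmation) is only reported.
    return { content: { text: `Unhandled SNS message type: ${body.Type}` } };
  }
}
```

Branching on `Type` keeps the confirmation step from the paragraph above working while turning later alarm deliveries into a readable one-line message.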

## Service accounts

Service accounts are already generated and added to github secrets, see below for the related OC secret to see the token value. If needing to recreate the service account, see the [service-account-generator directory](/service-account-generator/README.md) for how to do so.
5 changes: 5 additions & 0 deletions terraform-ecs/ecs.tf
@@ -1,5 +1,10 @@
resource "aws_ecs_cluster" "sso_ecs_cluster" {
  name = "loki-cluster"

  setting {
    name  = "containerInsights"
    value = "enhanced"
  }
}

resource "aws_ecs_task_definition" "loki_write" {
60 changes: 60 additions & 0 deletions terraform-ecs/sns.tf
@@ -0,0 +1,60 @@
locals {
  ecs_services = ["loki-write-service", "loki-read-service"]
}


resource "aws_sns_topic" "rocket_chat" {
  name = "rocketchat"
}

resource "aws_sns_topic_subscription" "rocket_chat_subscription" {
  topic_arn = aws_sns_topic.rocket_chat.arn
  protocol  = "https"
  endpoint  = var.rc_webhook
}

resource "aws_cloudwatch_metric_alarm" "loki_tasks_low" {
  for_each            = toset(local.ecs_services)
  alarm_name          = "${var.rc_prefix}: ${each.key} tasks low"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  threshold           = 0
  alarm_description   = "Alarm if DesiredTaskCount is greater than the RunningTaskCount for the ECS service ${each.key}."

  metric_query {
    id = "running_task_count"
    metric {
      metric_name = "RunningTaskCount"
      namespace   = "ECS/ContainerInsights"
      period      = 300 # 5 minutes
      stat        = "Average"
      dimensions = {
        ClusterName = "loki-cluster"
        ServiceName = each.key
      }
    }
  }

  metric_query {
    id = "desired_task_count"
    metric {
      metric_name = "DesiredTaskCount"
      namespace   = "ECS/ContainerInsights"
      period      = 300 # 5 minutes
      stat        = "Average"
      dimensions = {
        ClusterName = "loki-cluster"
        ServiceName = each.key
      }
    }
  }

  metric_query {
    id          = "task_deficit"
    expression  = "desired_task_count - running_task_count"
    label       = "Difference between Desired and Running Task Counts"
    return_data = true
  }

  alarm_actions = [aws_sns_topic.rocket_chat.arn]
}
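
The alarm's decision rule can be sketched outside Terraform as follows. This is only an illustration of the metric math above, using the hypothetical helper names `taskDeficit` and `alarmState`; it takes the two 5-minute averages as plain numbers and does not model how CloudWatch treats missing data points.

``` javascript
// Mirrors the metric math expression: task_deficit = desired - running.
function taskDeficit(desiredTaskCount, runningTaskCount) {
  return desiredTaskCount - runningTaskCount;
}

// GreaterThanThreshold with threshold = 0, evaluated over a single period.
function alarmState(desiredTaskCount, runningTaskCount, threshold = 0) {
  return taskDeficit(desiredTaskCount, runningTaskCount) > threshold ? "ALARM" : "OK";
}
```

For example, `alarmState(2, 2)` gives `"OK"`, while `alarmState(2, 1.5)` gives `"ALARM"`: because the statistic is `Average`, a task that was missing for only part of the window still produces a positive deficit.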
10 changes: 10 additions & 0 deletions terraform-ecs/variables.tf
@@ -58,3 +58,13 @@ variable "retention_period" {
  type    = string
  default = "180d"
}

variable "rc_webhook" {
  type      = string
  sensitive = true
}

variable "rc_prefix" {
  type    = string
  default = "SANDBOX"
}
