BFD-3565: Simplify BFD Server error alerting infrastructure #2534

malessi · 2025-01-29T17:08:38Z

JIRA Ticket:
BFD-3565

What Does This PR Do?

Due to resource name changes, the bfd_server_error_alerts module will need to be destroyed and then re-apply'd in all environments. This has been done in test, and will be done out-of-band in all other environments once this PR is merged.
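
For reference, the destroy-then-apply step could look roughly like the sketch below. It assumes the module is addressed as `module.bfd_server_error_alerts` from the Terraform root being applied and that the correct workspace is already selected; neither detail is taken from this PR. The `TERRAFORM` override is a hypothetical convenience for dry runs.

```shell
# Sketch only: the module address below is an assumption; confirm it with
# `terraform state list` before running anything destructive.
MODULE_ADDR="module.bfd_server_error_alerts"

redeploy_alerts_module() {
  # TERRAFORM is a hypothetical override (e.g. TERRAFORM=echo for a dry run).
  tf="${TERRAFORM:-terraform}"
  # Destroy only the renamed module, then re-apply to recreate it.
  # Both commands prompt for confirmation unless -auto-approve is passed.
  "$tf" destroy -target="$MODULE_ADDR" && "$tf" apply
}
```

Running `TERRAFORM=echo redeploy_alerts_module` prints the arguments that would be passed to terraform instead of executing anything.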

This PR greatly simplifies the BFD Server's 500 errors alerting solution and infrastructure. The "scheduler Lambda" has been removed outright along with its supporting Terraform. The Terraform for the "alerter Lambda" has been simplified, and its architecture has been updated to arm64 for (likely negligible) reduced cost. See diagram below for more details.

For context, there was originally some fear that running Log Insights queries every 5 minutes to check our logs for errors would be prohibitively expensive, so we introduced some complexity to the alerting infrastructure to ensure that those queries only ran when there were errors. Sometime after that, a change inadvertently disabled that safeguard, such that the Log Insights queries (specifically, the "alerter Lambda") began executing in every environment every 5 minutes, regardless of whether there were errors. We have since noticed no real increase in CloudWatch spend, so we concluded that this complexity was unnecessary and could be removed, resulting in this PR.

New Infrastructure Diagram

What Should Reviewers Watch For?

If you're reviewing this PR, please check for these things in particular:

What Security Implications Does This PR Have?

Please indicate if this PR does any of the following:

~~Adds any new software dependencies~~
~~Modifies any security controls~~
~~Adds new transmission or storage of data~~
~~Any other changes that could possibly affect security?~~
I have considered the above security implications as it relates to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (@sb-benohe) approval.)

Validation

Have you fully verified and tested these changes? Is the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.

- terraform applying these changes to server in test, verifying that:
  - Changes are apply'd successfully with no errors
  - All resources are created as expected
- Waiting 5 minutes for the rate() schedule to execute the alerter Lambda, verifying that the Lambda executes as expected and there are no errors


malessi · 2025-01-29T17:11:17Z

.github/scripts/pre-commit.sh

@@ -62,7 +62,7 @@ runShellCheckForCommitFiles() {

      # Skip binary formats
      case "$extension" in
-        "zip" | "p12" | "pfx" | "cer" | "pem")
+        "zip" | "p12" | "pfx" | "cer" | "pem" | "png" | "jpg")


Trying to commit the changes to the diagram PNG resulted in an invalid byte sequence error from sed, so I opted to exclude some image types from pre-commit hooks entirely.

Sounds like something that someone putting MBIs in a JPEG would say...
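
For reference, the extension check touched by this diff can be sketched as a standalone function. `should_skip_extension` is a hypothetical name, and the parameter expansion used to derive the extension is an assumption, since the diff does not show how `$extension` is computed in `pre-commit.sh`.

```shell
# Hypothetical helper: succeeds (exit 0) when the file should be skipped
# by text-based checks because its extension marks it as a binary format.
should_skip_extension() {
  # Assumption: the extension is everything after the last dot.
  # A name without a dot yields the whole name, which matches nothing below
  # unless the file is literally named e.g. "png".
  case "${1##*.}" in
    "zip" | "p12" | "pfx" | "cer" | "pem" | "png" | "jpg")
      return 0
      ;;
    *)
      return 1
      ;;
  esac
}
```

Note that the match is case-sensitive and does not cover `jpeg`, so files such as `photo.JPG` or `photo.jpeg` would still be passed to the text-based checks under this sketch.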

malessi added 3 commits January 29, 2025 11:55

Remove scheduler lambda and supporting Terraform; simplify remaining …

1409ab9

…resources; use arm64 instead of x86 for alerter lambda

Update diagram with new infrastructure; update README; run terraform-…

4be9b18

…docs

Exclude some image file types from pre-commit hook

05ca97e

malessi requested review from mjburling, dondevun and aschey-forpeople as code owners January 29, 2025 17:08


aschey-forpeople approved these changes Jan 29, 2025
