BFD-3565: Simplify BFD Server error alerting infrastructure #2534
+72
−498
JIRA Ticket:
BFD-3565
What Does This PR Do?
Due to resource name changes, the `bfd_server_error_alerts` module will need to be destroyed and then re-`apply`'d in all environments. This has been done in `test`, and will be done out-of-band in all other environments once this PR is merged.
This PR greatly simplifies the BFD Server's `500` errors alerting solution and infrastructure. The "scheduler Lambda" has been removed outright along with its supporting Terraform. The "alerter Lambda"'s Terraform has been simplified and its architecture updated to `arm64` for (likely negligible) reduced cost. See diagram below for more details.
For context, there was originally some fear that running Log Insights queries every 5 minutes to check for errors in our logs would be prohibitively expensive, so we introduced some complexity to the alerting infrastructure to ensure that we only ran those queries when there were errors. Sometime after that, a change inadvertently disabled that safeguard, so the Log Insights queries (specifically, the "alerter Lambda") began executing in every environment every 5 minutes, regardless of whether there were errors. We have since noticed no real increase in CloudWatch spend, so we concluded that this complexity was unnecessary and could be removed, resulting in this PR.
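As a rough illustration of the simplified shape described above, the following is a minimal Terraform sketch of an `arm64` Lambda invoked on a fixed `rate()` schedule. It is not taken from this PR's actual Terraform: every resource name, variable, handler, and runtime below is a hypothetical placeholder, and it assumes an EventBridge rule is the scheduling mechanism. Only the `arm64` architecture and the 5 minute interval come from the description.

```hcl
# Hypothetical sketch; names, handler, and runtime are assumptions,
# not the values used by the bfd_server_error_alerts module.
variable "alerter_role_arn" {
  type        = string
  description = "ARN of the IAM role the alerter Lambda assumes (assumed input)"
}

variable "alerter_zip_path" {
  type        = string
  description = "Path to the packaged alerter Lambda source (assumed input)"
}

resource "aws_lambda_function" "alerter" {
  function_name = "example-server-error-alerter" # placeholder name
  role          = var.alerter_role_arn
  filename      = var.alerter_zip_path
  handler       = "alerter.handler" # placeholder handler
  runtime       = "python3.11"      # placeholder runtime
  architectures = ["arm64"]         # architecture named in the PR description
  timeout       = 60
}

# Fixed schedule: invoke the alerter every 5 minutes, whether or not errors occurred.
resource "aws_cloudwatch_event_rule" "alerter_schedule" {
  name                = "example-server-error-alerter-schedule" # placeholder name
  schedule_expression = "rate(5 minutes)"
}

resource "aws_cloudwatch_event_target" "alerter" {
  rule = aws_cloudwatch_event_rule.alerter_schedule.name
  arn  = aws_lambda_function.alerter.arn
}

# Allow EventBridge to invoke the Lambda for this specific rule only.
resource "aws_lambda_permission" "allow_schedule" {
  statement_id  = "AllowExecutionFromEventBridge"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.alerter.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.alerter_schedule.arn
}
```

With a fixed schedule like this, no second Lambda is needed to decide when the alerter should run, which is the simplification the description refers to.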
New Infrastructure Diagram
What Should Reviewers Watch For?
If you're reviewing this PR, please check for these things in particular:
What Security Implications Does This PR Have?
Please indicate if this PR does any of the following:
- Adds any new software dependencies
- Modifies any security controls
- Adds new transmission or storage of data
- Any other changes that could possibly affect security?
- I have considered the above security implications as it relates to this PR. (If one or more of the above apply, it cannot be merged without the ISSO or team security engineer's (@sb-benohe) approval.)
Validation
Have you fully verified and tested these changes? Are the acceptance criteria met? Please provide reproducible testing instructions, code snippets, or screenshots as applicable.
`terraform apply`ing these changes to `server` in `test`, verifying that:
- It `apply`'d successfully with no errors
- Waiting for the `rate()` schedule to execute the alerter Lambda, verifying that the Lambda executes as expected and there are no errors