
Alert Response

elisa lee edited this page Aug 16, 2022 · 24 revisions

SimpleReport is actively monitored by Azure's Application Insights. In the event that abnormal application behavior is detected, an alert will automatically be sent to on-call engineering personnel for resolution.

Are you on-call? Lucky you! Here are some common alerts, and how to respond to them.


10+ DB queries with durations over 1.25s in the past 5 minutes

[WIP] What Went Wrong? A number of known issues can cause slow DB query responses. If no issue is open for the slow DB query, please open a new one.

What Should You Do? These alerts usually resolve themselves; however, it is recommended to check the following:


Prod alert when an ExperianAuthException is seen

Affected Component: SimpleReport backend LiveExperianService

What Went Wrong? We use Experian to verify users' identity during user signup. Before we submit a request to Experian, we must first fetch an activation token from Experian using our credentials. If we see this alert, it means there was a problem fetching the token and the identity verification steps couldn't be completed for the user. More context on how we use Experian to verify identity can be found here.

What Should You Do?

  • View the alert in the Azure portal. Query the exceptions table for ExperianAuthExceptions in the time period of the alert to get the stack trace, which will include the response from Experian.
  • Possible Experian API responses when fetching a token are documented here.
    • Most exceptions have historically been because of intermittent 500 responses from Experian which are not actionable and resolve themselves.
  • Query requests in Azure to see whether this is a one-off (we have since had successful requests to the /identity-verification/get-questions or /identity-verification/submit-answers endpoints) or whether all requests are failing.
  • This alert can be triggered if Experian doesn't recognize our credentials, which has happened in the past when they expired the application password without notifying us. If this appears to be the cause, first verify that we haven't made any changes to our credentials or the LiveExperianService code. If nothing has changed on our side, the resolution is to contact Experian for help.
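The one-off versus all-failing check above can be sketched in Python. This is a minimal illustration, not part of the SimpleReport codebase: RequestRecord, classify_failures, and the returned labels are hypothetical names, and the records are assumed to have been exported from the Azure requests query by hand.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable

# Endpoints named in this runbook.
IDENTITY_ENDPOINTS = (
    "/identity-verification/get-questions",
    "/identity-verification/submit-answers",
)


@dataclass
class RequestRecord:
    """Hypothetical shape for one exported request row."""
    timestamp: datetime
    name: str       # e.g. "POST /identity-verification/get-questions"
    success: bool


def classify_failures(records: Iterable[RequestRecord], alert_time: datetime) -> str:
    """Label the incident based on identity-verification requests seen since the alert."""
    relevant = [
        r for r in records
        if r.timestamp >= alert_time
        and any(endpoint in r.name for endpoint in IDENTITY_ENDPOINTS)
    ]
    if not relevant:
        return "no-traffic"      # nothing to judge from yet; check again later
    if any(r.success for r in relevant):
        return "one-off"         # at least one request has succeeded since the alert
    return "all-failing"         # likely a credentials or Experian-side problem
```

A result of "all-failing" is the case where checking our credentials, and then contacting Experian, becomes relevant.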

QueueBatchedReportStreamUploader failed to successfully complete

Affected Component: reportstream-batched-publisher-prod function app

What Went Wrong? The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. The function was successfully triggered, but failed either to pull messages from the queue or to complete the upload.

What Should You Do?

  • Check the function history. You can see at a glance what the most recent set of runs looks like.
  • For the failed run, take note of the Operation Id. You can cross-reference this value in Application Insights to get a better picture of what caused the failure.
  • If necessary, reach out to the ReportStream team. We will need to confirm whether the issue is on the SimpleReport side, or whether it originates from ReportStream.
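As a rough aid for the cross-referencing step above, the helper below builds a log query string for a given Operation Id. It is a sketch under assumptions: build_operation_query is a hypothetical name, and the query assumes the standard Application Insights tables (requests, exceptions, traces) and their operation_Id and timestamp columns. Paste the result into the Logs pane and adjust as needed.

```python
import re

# Operation Ids are assumed to be plain alphanumeric strings (hyphens allowed).
_OPERATION_ID_PATTERN = re.compile(r"[A-Za-z0-9-]+")


def build_operation_query(operation_id: str) -> str:
    """Return a query that lists all telemetry for one operation, oldest first."""
    if not _OPERATION_ID_PATTERN.fullmatch(operation_id):
        # Refuse anything that could break out of the quoted string.
        raise ValueError(f"unexpected Operation Id format: {operation_id!r}")
    return "\n".join([
        "union requests, exceptions, traces",
        f'| where operation_Id == "{operation_id}"',
        "| order by timestamp asc",
    ])
```

Reading the combined rows in time order usually shows whether the run failed while pulling from the queue or during the upload itself.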

QueueBatchedReportStreamUploader is not triggering on schedule

Affected Component: reportstream-batched-publisher-prod function app

What Went Wrong? The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. If this alert fires, chances are high that the code for the function is missing or corrupt.

What Should You Do?

  • Check the function history. You can see at a glance what the most recent set of runs looks like. Runs should take place every two minutes; a gap of longer than this confirms that the fired alert is valid.
  • Take a look at the Code + Test pane. Ensure that the files present here match what currently exists in the codebase.
  • If the files present differ from the files that should be present, re-deploy the functions using the corresponding GitHub Action.
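The schedule check in the first step can be expressed as a small gap detector. This is an illustrative sketch only: find_schedule_gaps and the 30-second tolerance are assumptions, and the run timestamps are assumed to be copied from the function history.

```python
from datetime import datetime, timedelta
from typing import List, Sequence, Tuple

# Runs should take place every two minutes, per this runbook.
EXPECTED_INTERVAL = timedelta(minutes=2)


def find_schedule_gaps(
    run_times: Sequence[datetime],
    now: datetime,
    tolerance: timedelta = timedelta(seconds=30),  # assumed slack for timer jitter
) -> List[Tuple[datetime, datetime]]:
    """Return (earlier, later) pairs where consecutive runs are too far apart.

    The current time is appended so a stalled timer shows up as a gap
    between the last run and now. With no runs at all, nothing is reported,
    so an empty history should be treated as a failure by the caller.
    """
    times = sorted(run_times) + [now]
    gaps = []
    for earlier, later in zip(times, times[1:]):
        if later - earlier > EXPECTED_INTERVAL + tolerance:
            gaps.append((earlier, later))
    return gaps
```

Any non-empty result supports the conclusion that the fired alert is valid.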

Twilio Alert

Impact: Twilio message sends

Issues: Twilio tracks errors that occur when trying to send messages. A high error rate can stem from, among other causes, messages sent to landlines, unreachable carriers, HTTP errors, unknown handsets, or spam filtering of messages sent by SimpleReport.

Actions to take:

  • Check the Twilio error logs to see what the problems are. It is possible to filter the results and narrow or expand the displayed time frame.
  • Check the Twilio status page for an outage.
  • Check individual errors by clicking into an error, then clicking the RESOURCE SID link to get more information. Navigating to this view will allow us to see the number we tried to send to, the body of the message, and a complete historical record of that message within Twilio.
  • Possible corrective actions:
    • Update user records within SimpleReport.
    • Submit a Twilio support ticket. Twilio suggests that we do this if we have three or more examples of filtered messages that we believe were legitimate sends.
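The triage above can be sketched as a tally over exported error records. Everything here is illustrative: the record shape (a dict with "sid" and "error_code"), the function names, and the use of 30007 as the code for filtered messages are assumptions to verify against Twilio's error reference before relying on them.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set

# Assumption: 30007 is the Twilio error code for a filtered message.
FILTERED_CODE = 30007
# Twilio suggests a ticket once we have three or more filtered examples.
SUPPORT_TICKET_THRESHOLD = 3


def summarize_errors(errors: Iterable[Dict]) -> Counter:
    """Count errors per error code to see which problem dominates."""
    return Counter(e["error_code"] for e in errors)


def should_open_support_ticket(errors: List[Dict], legit_sids: Set[str]) -> bool:
    """True when enough filtered messages were judged legitimate sends.

    legit_sids holds the message SIDs an engineer reviewed by hand and
    believes were legitimate.
    """
    filtered = [
        e for e in errors
        if e["error_code"] == FILTERED_CODE and e["sid"] in legit_sids
    ]
    return len(filtered) >= SUPPORT_TICKET_THRESHOLD
```

The manual review behind legit_sids is the important part; the code only does the counting.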
