Alert Response
SimpleReport is actively monitored by Azure's Application Insights. In the event that abnormal application behavior is detected, an alert will automatically be sent to on-call engineering personnel for resolution.
Are you on-call? Lucky you! Here are some common alerts, and how to respond to them.
Affected Component: reportstream-batched-publisher-prod function app
What Went Wrong?
The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. The function was successfully triggered, but failed either to pull messages from the queue or to perform the upload.
What Should You Do?
- Check the function history. You can see at a glance what the most recent set of runs looks like.
- For the failed run, take note of the Operation Id. You can cross-reference this value in Application Insights to get a better picture of what caused the failure.
- If necessary, reach out to the ReportStream team. We will need to confirm whether the issue is on the SimpleReport side, or whether it originates from ReportStream.
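The cross-reference in the second step can be done with a Logs (KQL) query in Application Insights. The following is a minimal sketch in Python that only builds the query text to paste into the Logs blade. The helper name `operation_query` and the sample Operation Id are hypothetical. It assumes the standard Application Insights tables (`requests`, `exceptions`, `traces`, `dependencies`) with their `operation_Id` and `timestamp` columns, and that Operation Ids are hex strings.

```python
import re


def operation_query(operation_id: str) -> str:
    """Build a KQL query that gathers all telemetry for one operation."""
    # Assumption: Operation Ids are hex strings (optionally with dashes).
    # Rejecting anything else keeps the value safe to embed in the query.
    if not re.fullmatch(r"[0-9a-fA-F-]+", operation_id):
        raise ValueError(f"unexpected Operation Id: {operation_id!r}")
    return "\n".join([
        "union requests, exceptions, traces, dependencies",
        f'| where operation_Id == "{operation_id}"',
        "| order by timestamp asc",
    ])


# Hypothetical Operation Id copied from the failed run in the function history.
print(operation_query("0123456789abcdef0123456789abcdef"))
```

Ordering by timestamp places the function's traces next to any exception raised during the same run, which usually shows whether the failure happened while reading from the queue or during the upload.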
Affected Component: reportstream-batched-publisher-prod function app
What Went Wrong?
The ReportStream Batched Publisher has a built-in timer function, QueueBatchedReportStreamUploader. If this alert fires, chances are high that the code for the function is missing or corrupt.
What Should You Do?
- Check the function history. You can see at a glance what the most recent set of runs looks like. Runs should take place every two minutes; a gap longer than this confirms that the alert is valid.
- Take a look at the Code + Test pane. Ensure that the files present here match what currently exists in the codebase.
- If there are discrepancies between what files should be present, and what files are present, re-deploy the functions using the corresponding GitHub Action.
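The two-minute check in the first step can also be done mechanically if the run timestamps are copied out of the function history. The following is a minimal sketch in Python. The `find_gaps` helper, the 30-second tolerance, and the sample timestamps are all assumptions made for illustration and are not part of the SimpleReport codebase.

```python
from datetime import datetime, timedelta

# From the runbook: the timer function should run every two minutes.
EXPECTED_INTERVAL = timedelta(minutes=2)


def find_gaps(run_times, tolerance=timedelta(seconds=30)):
    """Return (earlier, later) pairs of consecutive runs spaced too far apart.

    The tolerance is an assumed allowance for normal scheduling jitter.
    """
    ordered = sorted(run_times)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > EXPECTED_INTERVAL + tolerance
    ]


# Made-up timestamps standing in for the function history.
runs = [
    datetime(2024, 1, 1, 12, 0),
    datetime(2024, 1, 1, 12, 2),
    datetime(2024, 1, 1, 12, 10),  # eight minutes after the previous run
    datetime(2024, 1, 1, 12, 12),
]

for earlier, later in find_gaps(runs):
    print(f"gap of {later - earlier} between {earlier} and {later}")
```

Any pair reported here corresponds to a window in which the timer function did not fire, which is the condition the alert is meant to detect.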
Impact: Twilio message sends
Issues: Twilio tracks errors that occur when trying to send messages. A high error rate can stem from, among other causes, sends to landlines, unreachable carriers, HTTP errors, unknown handsets, or spam filtering of messages sent by SimpleReport.
Actions to take:
- Check the Twilio error logs to see what the problems are. It is possible to filter the results and narrow or expand the displayed time frame.
- Check the Twilio status page for outages.
- Check individual errors by clicking into an error, then clicking the RESOURCE SID link to get more information. This view shows the number we tried to send to, the body of the message, and a complete historical record of that message within Twilio.
- Possible corrective actions:
  - Update user records within SimpleReport.
  - Submit a Twilio support ticket. Twilio suggests that we do this if we have three or more examples of filtered messages that we believe were legitimate sends.
Affected Component: SimpleReport backend LiveExperianService
What Went Wrong? We use Experian to verify users' identity during user signup. Before we submit a request to Experian, we must first fetch an activation token from Experian using our credentials. If we see this alert, it means there was a problem fetching the token and the identity verification steps couldn't be completed for the user. More context on how we use Experian to verify identity can be found here.
What Should You Do?
- View the alert in the Azure portal. Query the exceptions table for ExperianAuthExceptions in the time period of the alert to get the stack trace, which will include the response from Experian.
- Possible Experian API responses when fetching a token are documented here.
- Historically, most exceptions have been caused by intermittent 500 responses from Experian, which are not actionable and resolve on their own.
- Query requests in Azure to see whether this is a one-off and we have since had successful requests to the /identity-verification/get-questions or /identity-verification/submit-answers endpoints, or whether all requests are failing.
- This alert can be triggered if Experian doesn't recognize our credentials, which has happened in the past when they expired the application password without notifying us. If this appears to be the cause, first verify that we haven't made any changes to our credentials or the LiveExperianService code. If not, the resolution is to contact Experian for help.
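The two Azure queries described in the steps above can be sketched as KQL text generated for the alert's time window. This is a minimal sketch in Python and not the team's saved queries. The helper names are hypothetical, the exception type is matched by the substring "ExperianAuthException" as an assumption about the class name, and the column names (`type`, `outerMessage`, `details`, `url`, `resultCode`, `success`) are those of the standard Application Insights `exceptions` and `requests` tables.

```python
from datetime import datetime, timedelta, timezone


def _kql_time(ts: datetime) -> str:
    """Render a timezone-aware datetime as a KQL datetime literal in UTC."""
    return ts.astimezone(timezone.utc).strftime("datetime(%Y-%m-%dT%H:%M:%SZ)")


def _window(start: datetime, end: datetime) -> str:
    return f"| where timestamp between ({_kql_time(start)} .. {_kql_time(end)})"


def experian_exceptions_query(start: datetime, end: datetime) -> str:
    """Exceptions raised while fetching the Experian token, newest first."""
    return "\n".join([
        "exceptions",
        _window(start, end),
        '| where type contains "ExperianAuthException"',  # assumed type name
        "| project timestamp, type, outerMessage, details, operation_Id",
        "| order by timestamp desc",
    ])


def identity_requests_query(start: datetime, end: datetime) -> str:
    """Success and failure counts for the identity verification endpoints."""
    return "\n".join([
        "requests",
        _window(start, end),
        '| where url contains "/identity-verification/"',
        "| summarize count() by name, resultCode, success",
    ])


# Hypothetical alert window: one hour starting at the time the alert fired.
fired_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(experian_exceptions_query(fired_at, fired_at + timedelta(hours=1)))
print(identity_requests_query(fired_at, fired_at + timedelta(hours=1)))
```

If the second query shows successful requests after the alert time, the failure was most likely one of the intermittent Experian errors. If every request fails, the credentials explanation becomes more plausible.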