RDSS dedicates the time of one runner in each sprint to ensure that errors encountered in any of our applications are ticketed and appropriately prioritized, so that bugs and other issues are routinely addressed. The runner checks Honeybadger and DataDog regularly for errors.
Every Monday and Tuesday from 5:30 to 8:30 AM EST is a planned maintenance window, during which the PUL IT Operations team applies patches and performs other maintenance work. Patch Monday applies to staging servers. Patch Tuesday applies to production and QA servers. Alerts caused by planned maintenance work may appear in our tools during these windows. Questions about work done in this window are best directed to the #infrastructure channel on Slack.
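When triaging an alert, it can help to check whether it fired inside one of these windows. The following is a minimal sketch, not an existing RDSS tool: the function name `maintenance_target` and the constants are hypothetical, and "EST" is read literally as a fixed UTC-5 offset. If the window actually follows local Eastern time including daylight saving, `zoneinfo.ZoneInfo("America/New_York")` could be substituted for the fixed offset.

```python
from datetime import datetime, time, timedelta, timezone
from typing import Optional

# Assumption: "EST" is taken literally as a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")

# Keys follow datetime.weekday(): Monday is 0, Tuesday is 1.
WINDOWS = {
    0: "staging",            # Patch Monday
    1: "production and QA",  # Patch Tuesday
}
WINDOW_START = time(5, 30)
WINDOW_END = time(8, 30)


def maintenance_target(alert_time: datetime) -> Optional[str]:
    """Return which servers were being patched at alert_time, or None."""
    if alert_time.tzinfo is None:
        raise ValueError("alert_time must be timezone-aware")
    # Convert to the window's timezone before comparing day and time.
    local = alert_time.astimezone(EST)
    target = WINDOWS.get(local.weekday())
    if target is not None and WINDOW_START <= local.time() <= WINDOW_END:
        return target
    return None
```

A return value of `None` means the alert fell outside planned maintenance and deserves a closer look. An alert inside a window may still be a real error, so this only suggests where to ask first.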
There is one runner per sprint. The new runner is selected during the Sprint Planning meeting, which occurs once every two weeks at the beginning of a new sprint.
The runner's duties are as follows:
Each workday check:
- Honeybadger for errors in the following RDSS applications:
- Honeybadger for uptime errors in PDC Globus. We should be getting emails when an application goes down, but it does not hurt to check as part of the runner duties.
- Check DataDog for errors in RDSS applications.
- Create a ticket for each error encountered, if one does not already exist.
- Bring error tickets to the RDSS team's attention at check-in for prioritization.
- If a ticket is a work-stopping issue or is considered high priority, and is therefore assigned to the current sprint, it receives the "unplanned work" label in GitHub.
- In the ticket, include a checkbox to mark the issue resolved in Honeybadger as part of the acceptance criteria if appropriate (example here).
- Check the failed jobs queue for PDC Describe in production. See details
- Review and merge Dependabot PRs in our repositories. Make sure they pass CI and deploy them to staging before merging them.
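The ticket-writing step above can be sketched as a small helper that builds a GitHub issue body with the Honeybadger checkbox in its acceptance criteria. This is only an illustration: the function name `error_ticket_body`, the section headings, and the first acceptance criterion are hypothetical placeholders, not the team's actual template. The `- [ ]` syntax is GitHub's standard task-list markup.

```python
def error_ticket_body(summary: str, error_url: str, in_honeybadger: bool = True) -> str:
    """Build a markdown issue body for an error ticket (hypothetical layout)."""
    lines = [
        "## Error",
        summary,
        "",
        f"Source: {error_url}",
        "",
        "## Acceptance criteria",
        "- [ ] The error no longer occurs",  # placeholder criterion
    ]
    # Only add the Honeybadger checkbox when the error was reported there.
    if in_honeybadger:
        lines.append("- [ ] The issue is marked resolved in Honeybadger")
    return "\n".join(lines)
```

Passing `in_honeybadger=False` covers errors found only in DataDog, where the Honeybadger checkbox would not apply.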