Merge branch 'learning-software-engineering:main' into main
Showing 11 changed files with 917 additions and 42 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
# Retries in web services | ||
|
||
Retries are especially relevant for web server applications, but useful elsewhere too, and they are tricky to get right. | ||
A retry is the practice of _retrying_ a network request, usually over HTTP or HTTPS, when it fails. It relies | ||
on the assumption that most failures are intermittent, meaning they happen only rarely. | ||
|
||
Retries and throttling are both terms used to talk about the _flow_ of traffic into a service. Often the | ||
operators/developers of that service want to make guarantees about the rate of that flow, or otherwise direct | ||
traffic. | ||
|
||
## Why retry at all? | ||
|
||
Intermittent failure can happen at any level. This can be within a single host if its on-host disk, memory, or | ||
CPU fails, or often in the communication between two hosts over a network. Networks, especially over the public | ||
internet using TCP/IP, are known to have periodic failures due to high load/congestion or network infrastructure | ||
hardware failure. | ||
|
||
Retries are a really simple, easy answer to these intermittent failures. If the error only happens rarely, then | ||
trying a task again is a really effective way to ensure that the message goes through. This often manifests | ||
itself as retrying API calls. | ||
|
||
Some services have built up language around these retries to control them. For example, calling AWS APIs returns | ||
metadata about the request itself: there is a `$retryable` field in most of the AWS SDKv2's APIs, | ||
the most common client used to make AWS API calls. If this field is set to `true`, the server is hinting that the | ||
failure was intermittent and that the client should retry. If the field is set to `false`, the server is hinting | ||
to the client that the failure is likely to happen again. | ||
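A minimal sketch of how a client might honor such a hint. The `ServiceError` class and its `retryable` attribute are hypothetical stand-ins for whatever error metadata a real SDK exposes, not the AWS SDK's actual API:

```python
import time


class ServiceError(Exception):
    """Hypothetical error type carrying a server-provided retry hint."""

    def __init__(self, message, retryable):
        super().__init__(message)
        self.retryable = retryable


def call_with_hint(operation, max_attempts=3, delay_seconds=0.0):
    """Call `operation`, retrying only while the error says it is retryable."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceError as err:
            # Give up immediately on non-retryable errors or on the last attempt.
            if not err.retryable or attempt == max_attempts:
                raise
            time.sleep(delay_seconds)
```

With this shape, a failure flagged as non-retryable reaches the caller after a single attempt instead of being repeated.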
|
||
## What are the problems with retries? | ||
|
||
Since retries are so simple to implement and elegant, they are usually the first tool that developers reach for | ||
when a dependency of theirs has intermittent failures, but how can this go wrong? | ||
|
||
Consider a case where 4 distinct software teams each build products that depend on one another, in a chain like: | ||
``` | ||
A -> B -> C -> D | ||
``` | ||
|
||
That is, service A is calling service B's APIs, and so on. Since B's APIs are known to fail occasionally, A has | ||
configured an automatic retry count of 3. Underneath the hood, B depends on C. Service A may or may not know this | ||
about B. But since C has a flaky API too, B also has a retry count of 3. And the same for C. | ||
|
||
This usually works fine. As long as all services are sufficiently scaled up to handle the load they are | ||
given, there are no problems. | ||
|
||
However, imagine a case where service D is down. Though it is at the end of the chain of dependencies, in theory, | ||
the services should be able to stay up despite their dependencies being down. This type of engineering is called | ||
fault-tolerance. | ||
|
||
The next time that service A takes a request, it forwards it to B, which forwards it to C, which tries to call D's | ||
API, which fails. C then makes its 3 attempts before reporting a failure back to B, which triggers a retry from B. | ||
That means C tries _another 3 times_ for each of B's attempts. | ||
|
||
Retries deep into services grow multiplicatively, and a single API call to A has caused | ||
``` | ||
A: 3 | ||
B: 9 | ||
C: 27 | ||
``` | ||
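different API calls to fail, counting the calls each service makes to the next one and treating a retry count of 3 as | ||
three attempts in total. C is sending D 27 times as many calls as it normally would, while itself receiving 9 times | ||
its usual load from B, and might start failing itself, further exacerbating the problem. | ||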
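The multiplication can be checked with a small simulation. This is a sketch under the same assumptions as the example: every service makes 3 attempts in total, and the last service in the chain rejects every call. The function and variable names are illustrative only.

```python
from collections import Counter


def handle(chain, index, attempts, calls):
    """Simulate chain[index] handling one request; return True on success."""
    if index == len(chain) - 1:
        return False  # the last service in the chain is down
    for _ in range(attempts):
        calls[chain[index]] += 1  # one outgoing call to the next service
        if handle(chain, index + 1, attempts, calls):
            return True
    return False  # all attempts failed, so report failure upstream


calls = Counter()
handle(["A", "B", "C", "D"], 0, attempts=3, calls=calls)
print(dict(calls))  # {'A': 3, 'B': 9, 'C': 27}
```

Each extra layer multiplies the total by the attempt count, so a chain with n retrying services turns one request into 3^n calls against the failing dependency.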
|
||
That is, D has become a single point of failure for all other services, and even if they don't fail outright, the | ||
load on B, C, and D is needlessly multiplied. | ||
|
||
## What can we do about retries? | ||
|
||
Clients calling services will nearly always have retries configured. However, internal services should rarely | ||
implement retries while calling other internal services, for precisely this reason. | ||
|
||
Another technique to get around excessive retries is to use more caching. If service C had cached the responses | ||
from service D, it's possible that service D going down would not have affected the top-level services at all, and | ||
everything would have worked as normal. The downside to this approach is that caches are often tricky to get right, | ||
and sometimes introduce modal behavior in services [1], which is usually a bad thing. | ||
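As a rough illustration, service C could keep the last good response from D and serve it when D is unreachable. The `CachedClient` class and the `fetch` callable below are hypothetical, and the sketch ignores expiry and eviction:

```python
class CachedClient:
    """Wrap a fetch function and fall back to the last good response on failure."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical function that calls service D
        self._cache = {}     # key -> last successful response

    def get(self, key):
        try:
            value = self._fetch(key)
        except ConnectionError:
            if key in self._cache:
                # D is unavailable: serve the stale value instead of retrying.
                return self._cache[key]
            raise  # nothing cached for this key, so the failure still propagates
        self._cache[key] = value
        return value
```

The sketch also shows the modal behavior mentioned above: the service behaves very differently depending on whether the cache happens to be warm when D goes down.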
|
||
## So should I retry? | ||
|
||
As always in software engineering, it depends. A good rule of thumb is the external/internal split, where external | ||
dependencies are wrapped in retries, but internal dependencies aren't. It's much easier to control the behavior of | ||
internal dependencies, either by directly contributing to their product, or speaking to the owners of that product | ||
itself. Retries are a rough band-aid, and more precise solutions are often better. For example, it might be more work, | ||
but fixing the root-cause of intermittent failures avoids the problems with retries in the first place, and also | ||
produces a more stable product. | ||
|
||
Retries are also more acceptable when they aren't in the _critical path_ of a service. For an `AddTwoNumbers` | ||
service, having retries on dependencies within the main `AddTwoNumbers` API call might not be a good idea. However, | ||
for backup jobs, batch processing, or other non-performance-critical work, retries are often a simple, | ||
engineering-efficient way to ensure reliability. | ||
|
||
## How should I retry? | ||
|
||
For most popular programming languages, retries are built into common dependencies. For example, | ||
1. Rust has `tower`, a generic HTTP service abstraction that offers automatic retries: https://github.com/tower-rs/tower [2], | ||
2. JavaScript and TypeScript have `retry`: https://www.npmjs.com/package/retry [3], and | ||
3. Go has `retry-go`: https://github.com/avast/retry-go [4] | ||
|
||
Each library works slightly differently, but can be used in simple or complex ways. For example, it could be as simple | ||
as immediately retrying the network request upon failure, or more complicated, including concepts like jitter (making sure | ||
many concurrent clients don't all retry at the same time), exponential backoff (clients waiting longer and longer between retries), | ||
or other concepts [1]. | ||
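To make jitter and exponential backoff concrete, here is a minimal hand-rolled sketch. It does not use any of the libraries listed above. The parameter names and defaults are assumptions chosen for illustration, and it treats `ConnectionError` as the only retryable failure:

```python
import random
import time


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=random.random):
    """Yield a randomized, exponentially growing delay before each retry."""
    for retry in range(max_retries):
        ceiling = min(cap, base * (2 ** retry))
        # Jitter: pick a delay uniformly between 0 and the current ceiling.
        yield rng() * ceiling


def retry_with_backoff(operation, max_retries=3, sleep=time.sleep, **backoff):
    """Call `operation` once, then retry up to `max_retries` times on ConnectionError."""
    delays = backoff_delays(max_retries, **backoff)
    while True:
        try:
            return operation()
        except ConnectionError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted, surface the last error
            sleep(delay)
```

Doubling the ceiling spreads retries further apart as failures persist, and the random factor keeps many clients that failed at the same moment from retrying in lockstep.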
|
||
## References | ||
1. https://brooker.co.za/blog/2021/05/24/metastable.html | ||
2. https://github.com/tower-rs/tower | ||
3. https://www.npmjs.com/package/retry | ||
4. https://github.com/avast/retry-go |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# Sprint Planning Meeting | ||
|
||
## Introduction | ||
In the Scrum framework, sprints are short, consistent cycles during which a set amount of work is completed. Since sprints are the heartbeat of the framework, a successful Sprint Planning Meeting is crucial, especially in large teams where effective coordination and communication are key. | ||
|
||
|
||
## What does a Sprint Planning Meeting Encompass? | ||
|
||
| Scrum Artifacts | Definition | | ||
| -------- | ------- | | ||
| Product Backlog | an ordered list of tasks essential for improving the product and satisfying the stakeholders | | ||
| Sprint Backlog | a list of tasks selected from the Product Backlog to be completed during a specific sprint | | ||
|
||
* The meeting is held with the entire Agile team (including the ScrumMaster and Product Owner) and aims to conclude with a set of agreed-upon **Product Backlog** items to be completed as the current sprint commitment. | ||
* The ScrumMaster is responsible for leading the Planning Meeting and the Product Owner is responsible for detailing the Product Backlog items and ensuring alignment with the product’s goals. | ||
* The planning addresses the value the sprint can bring to the stakeholders and the team collaborates to define a **Sprint Goal** before the end of the Sprint Planning meeting. | ||
* After discussion with the Product Owner, the developers select items from the Product Backlog to include in the **Sprint Backlog**, deciding how much to take on based on their past performance and their expected capacity for the upcoming Sprint. | ||
* For each Sprint Backlog item that is selected, the developers on the team decide how to break down and complete tasks for the sprint based on the team’s definition of “Done”. | ||
* Disagreements are resolved through open communication and consensus-seeking, facilitated by the ScrumMaster, who ensures discussions stay within the sprint's topic. | ||
|
||
## Integrating Sprint Planning Meetings in CSC301 | ||
*Consider an example of how a Sprint Planning Meeting can be integrated into our course’s workflow.* | ||
|
||
### Before the Planning Meeting: | ||
After a Product Manager and ScrumMaster are chosen in Deliverable 1, consider each deliverable (Deliverables 2 to 5) equivalent to a sprint, with each sprint having one dedicated Sprint Planning Meeting. As the ScrumMaster and Product Manager, organize an hour-long meeting within a week of the Deliverable handout being released. Prior to the Sprint Planning Meeting, you should both read through the handout, identify features/Product Backlog items for the upcoming sprint, and add them to the Product Backlog list on Jira (or the team’s chosen product management tool). | ||
|
||
### During the Planning Meeting: | ||
Start by reminding the team of the upcoming Deliverable and review the definition of “Done”, which in this case would be meeting both the rubric and your partner's requirements. As a team, take this opportunity to brainstorm more tasks that need to be completed based on what your partner has asked and add them into the Product Backlog list. | ||
|
||
Discuss how much time each Product Backlog item listed on Jira would need and, based on these estimates, determine the items for the Sprint Backlog. Then move the selected items from the Product Backlog into the Sprint Backlog, allow team members (including yourself) to assign themselves to sprint tasks, and discuss any further action items as a team. During this time, if any conflicts arise (for example, disagreements on which tasks to complete for the upcoming sprint or the estimated time for a particular task), the Product Manager should act as a facilitator and the ScrumMaster should encourage an inclusive environment where team members actively listen to each other's opinions. | ||
|
||
Once the Sprint Backlog items are finalized, as the ScrumMaster, call for a group consensus on the plan and conclude the meeting. | ||
|
||
## Key Takeaway | ||
The Sprint Planning Meeting lays the foundation for the sprint, helps the team understand their tasks and sets the stage for effective Scrum meetings in the weeks ahead. This meeting ensures everyone is aligned on deadlines and tasks, making future collaborations smoother and more productive. | ||
|
||
## Resources | ||
- [What is Sprint Planning](https://www.scrum.org/resources/what-is-sprint-planning) | ||
- [Sprint Planning Learning Series](https://www.scrum.org/learning-series/sprint-planning) | ||
- [Sprint Planning Meeting: A Simple Cheat Sheet](https://www.leadingagile.com/2012/08/simple-cheat-sheet-to-sprint-planning-meeting/) |