Add Retries.md document (#135)

learning-software-engineering · Nov 24, 2023 · e66363e · e66363e
2 parents 39511f4 + c2dd01d
commit e66363e
Showing 1 changed file with 105 additions and 0 deletions.
diff --git a/Topics/Software_Engineering/Retries.md b/Topics/Software_Engineering/Retries.md
@@ -0,0 +1,105 @@
+# Retries in web services
+
+Especially relevant for webserver applications, but useful for others, retries are really tricky to get right.
+Retries are the practice of _retrying_ a network request, usually over HTTP or HTTPS, when it fails. It relies
+on the assumption that most failures are intermittent, meaning only happen rarely.
+
+Retries and throttling are both terms used to talk about the _flow_ of traffic into a service. Often the 
+operators/developers of that service want to make guarantees about the rate of that flow, or otherwise direct 
+traffic.
+
+## Why retry at all?
+
+Intermittent failure can happen at any level. This can be within a single host if its on-host disk, memory, or 
+CPU fails, or often in the communication between two hosts over a network. Networks, especially over the public 
+internet using TCP/IP, are known to have periodic failures due to high load/congestion or network infrastructure 
+hardware failure.
+
+Retries are a really simple, easy answer to these intermittent failures. If the error only happens rarely, then 
+trying a task again is a really effective way to ensure that the message goes through. This often manifests 
+itself as retrying API calls.
+
+Some services have built up language around these retries to control them. For example, calling AWS APIs returns
+metadata about the request itself. For example, there is a `$retryable` field in most of the AWS SDKv2's APIs, 
+the most common client used to make AWS API calls. If this field is set `true`, the server is hinting that the 
+failure was intermittent and that the client should retry. If the field is set to `false`, the server is hinting 
+to the client that the failure is likely going to happen again.
+
+## What are the problems with retries?
+
+Since retries are so simple to implement and elegant, they are usually the first tool that developers reach for 
+when a dependency of theirs has intermittent failures, but how can this go wrong?
+
+Consider a case where 4 distinct software teams each build products that depend on one another, in a chain like:
+```
+A -> B -> C -> D
+```
+
+That is, service A is calling service B's APIs, and so on. Since B's APIs are known to fail occasionally, A has 
+configured an automatic retry count of 3. Underneath the hood, B depends on C. Service A may or may not know this 
+about B. But since C has a flaky API too, B also has a retry count of 3. And the same for C. 
+
+This works fine, and will usually work. If all services are sufficiently scaled up to handle the load they are 
+given, there are no problems.
+
+However, imagine a case where service D is down. Though it is at the end of the chain of dependencies, in theory, 
+the services should be able to stay up despite their dependencies being down. This type of engineering is called 
+fault-tolerance.
+
+The next time that Service A takes a request, it forwards it to B, which forwards it to C, which tries to call D's 
+API, which fails. C then, tries again 3 times before reporting a failure back to B, which also triggers a retry. 
+That means C tries _another 3 times_.
+
+Retries deep into services grow multiplicatively, and a single API call to A has caused
+```
+A: 3
+B: 9
+C: 27
+```
+different API calls to fail. C is handling 27x more load than it is used to, and might start failing itself, further 
+exacerbating the problem. 
+
+That is, D has become a single point of failure for all other services, and even if they don't outright fail, the 
+load on B, C, and D, are highly needlessly increased.
+
+## What can we do about retries?
+
+Clients calling services will nearly always have retries configured. However, internal services should rarely 
+implement retries while calling other internal services, for precisely this reason.
+
+Another technique to get around excessive retries is to utilize more caching. If service C had cached the responses 
+from service D, it's possible that service D going down would have affected the top-level services at all, and 
+everything would have worked as normal. The downside to this approach is that caches are often trick to get right, 
+and sometimes introduce modal behavior in services [1], usually a bad thing.
+
+## So should I retry?
+
+As always in software engineering, it depends. A good rule of thumb is the external/internal, where external 
+dependencies are wrapped in retries, but internal dependencies aren't. It's much easier to control the behavior of 
+internal dependencies, either by directly contributing to their product, or speaking to the owners of that product 
+itself. Retries are a rough band-aid, and more precise solutions are often better. For example, it might be more work,
+but fixing the root-cause of intermittent failures avoids the problems with retries in the first place, and also 
+produces a more stable product.
+
+Retries are also more acceptable when they aren't in the _critical path_ of a service. For an `AddTwoNumbers` 
+service, having retries on dependencies within the main `AddTwoNumbers` API call might not be a good idea. However,
+for backup jobs, batch processing, or other non-performance-critical work, retries are often a simple, 
+engineering-efficient way to ensure reliability.
+
+## How should I retry?
+
+For most popular programming languages, retries are built into common dependencies. For example, 
+1. Rust has `tower`, a generic HTTP service abstraction that offers automatic retries: https://github.com/tower-rs/tower [2],
+2. JavaScript and Typescript have `retry`: https://www.npmjs.com/package/retry [3], and
+3. Go has `retry-go`: https://github.com/avast/retry-go [4]
+
+Each library works slightly differently, but can be used in simple or complex ways. For example, it could be as simple
+as immediately retrying the network request upon failure, or more complicated, including concepts like jitter (making sure
+many concurrent clients don't all retry at the same time), exponential backoff (clients retrying less and less over time),
+or other concepts [1].
+
+## References
+1. https://brooker.co.za/blog/2021/05/24/metastable.html
+2. https://github.com/tower-rs/tower
+3. https://www.npmjs.com/package/retry
+4. https://github.com/avast/retry-go