diff --git a/primers/_index.md b/primers/README.md
similarity index 92%
rename from primers/_index.md
rename to primers/README.md
index d209d3be4..7178118b9 100644
--- a/primers/_index.md
+++ b/primers/README.md
@@ -1,7 +1,6 @@
diff --git a/primers/distributed-software-systems-architecture/reliable-rpcs.md b/primers/distributed-software-systems-architecture/1-reliable-rpcs/readme.md
similarity index 99%
rename from primers/distributed-software-systems-architecture/reliable-rpcs.md
rename to primers/distributed-software-systems-architecture/1-reliable-rpcs/readme.md
index c269fe1a0..e0f6526a3 100644
--- a/primers/distributed-software-systems-architecture/reliable-rpcs.md
+++ b/primers/distributed-software-systems-architecture/1-reliable-rpcs/readme.md
@@ -1,6 +1,6 @@
diff --git a/primers/distributed-software-systems-architecture/rpcs.webp b/primers/distributed-software-systems-architecture/1-reliable-rpcs/rpcs.webp
similarity index 100%
rename from primers/distributed-software-systems-architecture/rpcs.webp
rename to primers/distributed-software-systems-architecture/1-reliable-rpcs/rpcs.webp
diff --git a/primers/distributed-software-systems-architecture/2-state/leader-follower-diagram.png b/primers/distributed-software-systems-architecture/2-state/leader-follower-diagram.png
new file mode 100644
index 000000000..dfd98cb8b
Binary files /dev/null and b/primers/distributed-software-systems-architecture/2-state/leader-follower-diagram.png differ
diff --git a/primers/distributed-software-systems-architecture/state.md b/primers/distributed-software-systems-architecture/2-state/readme.md
similarity index 67%
rename from primers/distributed-software-systems-architecture/state.md
rename to primers/distributed-software-systems-architecture/2-state/readme.md
index 70883dc96..d0660c7b5 100644
--- a/primers/distributed-software-systems-architecture/state.md
+++ b/primers/distributed-software-systems-architecture/2-state/readme.md
@@ -1,102 +1,149 @@
# 2
-## State {#section-2-state}
+## State
**State*ful* and state*less***
-Components in distributed systems are often divided into stateful and stateless services. Stateless services don’t store any state between serving requests. Stateful services, such as databases and caches, do store state between requests. State _stays_.
+> Components in distributed systems are often divided into stateful and stateless services. Stateless services don’t store any state between serving requests. Stateful services, such as databases and caches, do store state between requests. State _stays_.
When we need to scale up a stateless service we simply run more instances and loadbalance between them. Scaling up a stateful service is different: we need to deal with splitting up or replicating our data, and keeping things in sync.
### Caching {#caching}
-Caches are an example of a stateful service. A [cache service](https://aws.amazon.com/caching/) is a high-performance storage service that stores only a subset of data, generally the data most recently accessed by your service. A cache can generally serve data faster than the primary data store (for example, a database), because a cache stores a small set of data in RAM whereas a database stores all of your data on disk, which is slower.
+Caches are an example of a stateful service. A [cache service](https://aws.amazon.com/caching/) is a high-performance storage service that stores only a subset of data, generally the data most recently accessed by your service. A cache can generally serve data faster than the primary data store (for example, a database). This is because a cache stores a small set of data in RAM, whereas a database stores all of your data on disk, which is slower.
-However, caches are not _durable_ - when an instance of your cache service restarts, it does not hold any data. Until the cache has filled with a useful _working set_ of data, all requests will be _cache misses_, meaning that they need to be served from the primary datastore. The _hit rate_ is the percentage of cacheable data that can be served from the cache. Hit rate is an important aspect of cache performance which should be monitored.
+However, caches are not _durable_. When an instance of your cache service restarts, it does not hold any data. Until the cache has filled with a useful _working set_ of data, all requests will be _cache misses_, meaning that they need to be served from the primary datastore. The _hit rate_ is the percentage of cacheable data that can be served from the cache. Hit rate is an important aspect of cache performance which should be monitored.
-There are different strategies for getting data into your cache. The most common is _lazy loading_: your application always tries to load data from the cache first when it needs it. If it isn’t in the cache, the data is fetched from the primary data store and copied into the cache. You can also use a _write through_ strategy: every time your application writes data, it writes it to the cache at the same time as writing it to the datastore. However, when using this strategy, you have to deal with cases where the data isn’t in the cache (for instance, via lazy loading). Read about the [pros and cons](https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/Strategies.html#Strategies.WriteThrough) of these cache-loading strategies.
+There are different strategies for getting data into your cache.
-#### Why a cache service? {#why-a-cache-service}
+#### Lazy Loading
-Cache services, such as [memcached](https://www.memcached.org/) or [redis](https://redis.io/), are very often used in web application architectures to improve read performance, as well as to increase system throughput (the number of requests that can be served given a particular hardware configuration). You can also cache data in your application layer – but this adds complexity, because in order to be effective, requests affecting the same set of data must be routed to the same instance of your application each time. This means that your loadbalancer has to support the use of _[sticky sessions](https://www.linode.com/docs/guides/configuring-load-balancer-sticky-session/)._
+The most common strategy is _lazy loading_: your application always tries to load data from the cache first when it needs it. If it isn’t in the cache, the data is fetched from the primary data store and copied into the cache.
-In-application caches are less effective when your application is restarted frequently, because restarting your application means that all cached data is lost and the cache will be _cold_: it does not have a useful working set of data\_. \_In many organisations, web applications get restarted very frequently. There are two main reasons for this.
+#### Write through
+
+You can also use a _write through_ strategy: every time your application writes data, it writes it to the cache at the same time as writing it to the datastore. However, when using this strategy, you have to deal with cases where the data isn’t in the cache (for instance, via lazy loading). Read about the [pros and cons](https://docs.aws.amazon.com/AmazonElastiCache/latest/mem-ug/Strategies.html#Strategies.WriteThrough) of these cache-loading strategies.
+
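+A minimal sketch of both strategies in Go (the `Cache` and `Database` interfaces here are assumptions for illustration, not a particular client library): `GetUser` lazily loads on a cache miss, and `SaveUser` writes through to both stores.
+
+```go
+package cache
+
+// Cache and Database are illustrative interfaces; a real system would use a
+// memcached or redis client and a database driver.
+type Cache interface {
+    Get(key string) (string, error) // returns an error on a miss
+    Set(key, value string) error
+}
+
+type Database interface {
+    Load(key string) (string, error)
+    Store(key, value string) error
+}
+
+// GetUser is lazy loading (cache-aside): try the cache first, and on a miss
+// read from the database and populate the cache for subsequent readers.
+func GetUser(c Cache, db Database, id string) (string, error) {
+    if v, err := c.Get(id); err == nil {
+        return v, nil // cache hit
+    }
+    v, err := db.Load(id)
+    if err != nil {
+        return "", err
+    }
+    _ = c.Set(id, v) // best effort: a failed cache write just means a later miss
+    return v, nil
+}
+
+// SaveUser is write-through: every write goes to the primary datastore and to
+// the cache, so a read shortly after a write is likely to be a hit.
+func SaveUser(c Cache, db Database, id, value string) error {
+    if err := db.Store(id, value); err != nil {
+        return err
+    }
+    return c.Set(id, value)
+}
+```
+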
+### Why a cache service?
+
+Cache services, such as [memcached](https://www.memcached.org/) or [redis](https://redis.io/), are very often used in web application architectures to improve read performance, as well as to increase system throughput (the number of requests that can be served given a particular hardware configuration).
+
+#### Sticky sessions
+
+You can also cache data in your application layer – but this adds complexity, because in order to be effective, requests affecting the same set of data must be routed to the same instance of your application each time. This means that your loadbalancer has to support the use of _[sticky sessions](https://www.linode.com/docs/guides/configuring-load-balancer-sticky-session/)._
+
+#### Restarts
+
+In-application caches are less effective when your application is restarted frequently, because restarting your application means that all cached data is lost and the cache will be _cold_: it does not have a useful working set of data. In many organisations, web applications get restarted very frequently. There are two main reasons for this.
1. Deployment of new code. Many organisations using modern [CI/CD (Continuous Integration/Continuous Delivery)](https://en.wikipedia.org/wiki/CI/CD) deploy new code on every change, or if not on every change, many times a day.
2. The use of _[autoscaling](https://en.wikipedia.org/wiki/Autoscaling)_. Autoscaling is the use of automation to adjust the number of running instances in a software service. It is very common to use autoscaling with stateless services, particularly when running on a cloud provider. Autoscaling is an easy way to scale your service up to serve a peak load, while not paying for unused resources at other times. However, autoscaling also means that the lifespan of instances of your service will be reduced.
The use of a separate cache service means that the stateless web application layer can be deployed frequently and can autoscale according to workload while still using a cache effectively. Cache services typically are not deployed frequently and are less likely to use autoscaling.
-#### Hazards of using cache services {#hazards-of-using-cache-services}
+#### Hazards of using cache services
-Operations involving cache services must be done carefully. Any operation that causes all your caches to restart will leave your application with an entirely _cold_ cache - a cache with no data in it. All reads that your cache would normally have served will go to your primary datastore. There are multiple possible outcomes in this scenario.
+Operations involving cache services must be done carefully. **Any operation that causes all your caches to restart will leave your application with an entirely _cold_ cache** - a cache with no data in it. All reads that your cache would normally have served will go to your primary datastore.
+
+There are multiple possible outcomes in this scenario.
+
+##### First case
In the first case, your primary datastore can handle the increased read load and the only consequence for your application will be an increase in duration of requests served. Your application will also consume additional system resources: it will hold more open transactions to your primary datastore than usual, and it will have more requests in flight than usual. Over time your cache service fills with a useful working set of data and everything returns to normal. Some systems offer a feature called _[cache warmup](https://github.com/facebook/mcrouter/wiki/Cold-cache-warm-up-setup)_ to speed this process up.
-In the second case, even though your datastore can handle the load, your application cannot handle the increase in in-flight requests. It may run out of resources such as worker threads, datastore connections, CPU, or RAM - this depends on the specific web application framework you are using. Applications in this state will serve errors or fail to respond to requests in a timely fashion.
+##### Second case
+
+In the second case, even though your datastore can handle the load, your application cannot handle the increase in in-flight requests. It may run out of resources such as worker threads, datastore connections, CPU, or RAM. (This depends on the specific web application framework you are using.) Applications in this state will serve errors or fail to respond to requests in a timely fashion.
+
+##### Third case
In the third case, your primary datastore is unable to handle the entire load, and some requests to your datastore begin to fail or to time out. As discussed in the previous section, you should be setting a deadline for your datastore requests. In the absence of deadlines, your application will wait indefinitely for your overloaded datastore, and is likely to run out of system resources and then grind to a halt, until it is restarted. Setting deadlines allows your application to serve what requests it can without grinding to a halt. After some time your cache will fill and your application should return to normal.
Read about an incident at Slack involving failure of a caching layer: [Slack’s Incident on 2-22-22](https://slack.engineering/slacks-incident-on-2-22-22/).
-#### Cache Invalidation {#cache-invalidation}
+### Cache Invalidation
+
+> There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
+
+Cache invalidation means removing or updating an entry in a cache.
+
+Cache invalidation is hard because caches are optimised towards fast reads, without synchronising with the primary datastore. If a cached item is updated then the application must either tolerate stale cached data, or update all cache replicas that reference the updated item.
-There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors.
+Caches generally support specifying a Time-To-Live (TTL). After the TTL passes, or expires, the item is removed from the cache and must be fetched again from the main datastore. This lets you put an upper limit on how stale cached data may be.
-Cache invalidation means removing or updating an entry in a cache. The reason that cache invalidation is hard is that caches are optimised towards fast reads, without synchronising with the primary datastore. If a cached item is updated then the application must either tolerate stale cached data or update all cache replicas that reference the updated item.
+It is useful to add some ‘jitter’ when specifying TTLs. This means varying the TTL duration slightly. For example, instead of always using a 60 second TTL, you might randomly choose a duration in the range 54 to 66 seconds, up to ten percent higher or lower. This reduces the likelihood of load spikes on backend systems as a result of coordinated waves of cache evictions.
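+
+A minimal sketch of jittering a TTL before setting a cache entry (the 60-second base and ten percent spread are just the figures from the example above):
+
+```go
+package cache
+
+import (
+    "math/rand"
+    "time"
+)
+
+// jitteredTTL spreads a base TTL by ±fraction (e.g. 60s ±10% gives 54s–66s),
+// so that entries written at the same time do not all expire together.
+func jitteredTTL(base time.Duration, fraction float64) time.Duration {
+    factor := 1 + fraction*(2*rand.Float64()-1) // uniform in [1-fraction, 1+fraction]
+    return time.Duration(float64(base) * factor)
+}
+```
+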
-Caches generally support specifying a Time-To-Live (TTL). After the TTL passes, or expires, the item is removed from the cache and must be fetched again from the main datastore. This lets you put an upper limit on how stale cached data may be. It is useful to add some ‘jitter’ when specifying TTLs. This means varying the TTL duration slightly - for example, instead of always using a 60 second TTL, you might randomly choose a duration in the range 54 to 66 seconds, up to ten percent higher or lower. This reduces the likelihood of load spikes on backend systems as a result of coordinated waves of cache evictions.
+#### Immutable data
-##### Immutable data {#immutable-data}
+One simple strategy to manage cache invalidation problems is to avoid updating data where possible. Instead, we can create new versions of the data.
-One of the simplest strategies to manage cache invalidation problems is to avoid updating data where possible. Instead, we can create new versions of the data. For example, if we are building an application that has profile pictures for users, if the user updates their profile picture we create a new profile picture with a new unique ID, and update the user data to refer to the new profile picture rather than the old one. Doing this means that there is no need to invalidate the old cached profile picture - we just stop referring to it. Eventually, the old picture will be removed from the cache, as the cache removes entries that have not been accessed recently.
+For example, if we are building an application that has profile pictures for users, when the user updates their profile picture we:
-In this strategy the profile picture data is immutable - meaning that it never changes. However, the user data does change, meaning that cached copies must be invalidated, or must have a short TTL so that users will see new profile pictures in a reasonable period of time.
+1. create a new profile picture with a new unique ID
+2. update the user data to refer to the new profile picture rather than the old one.
+
+Doing this means that there is no need to invalidate the old cached profile picture: we just stop referring to it. Eventually, the old picture will be removed from the cache, as the cache removes entries that have not been accessed recently.
+
+In this strategy the profile picture data is immutable; it never changes. However, the user data _does_ change, meaning that cached copies must be invalidated, or must have a short TTL so that users will see new profile pictures in a reasonable period of time.
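+
+A small sketch of this approach (the types and store here are hypothetical):
+
+```go
+package profile
+
+// Pictures are immutable: updating a user's picture saves a brand-new picture
+// under a fresh ID and repoints the user record at it. Nothing cached under
+// the old ID ever needs to be invalidated; it simply stops being read.
+type User struct {
+    ID        string
+    PictureID string // reference to the current, immutable picture
+}
+
+type PictureStore interface {
+    Save(data []byte) (id string, err error) // always creates a new picture
+}
+
+func UpdateProfilePicture(u *User, data []byte, store PictureStore) error {
+    id, err := store.Save(data)
+    if err != nil {
+        return err
+    }
+    u.PictureID = id // only the (mutable, short-TTL) user record changes
+    return nil
+}
+```
+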
Read more about [Ways to Maintain Cache Consistency](https://redis.com/blog/three-ways-to-maintain-cache-consistency/) in this article.
-#### Scaling Caches {#scaling-caches}
+### Scaling Caches
-##### Replicated Caches {#replicated-caches}
+#### Replicated Caches
In very large systems, it may not be possible to serve all cache reads from one instance of a cache. Caches can run out of memory, connections, network capacity, or CPU. And with only one cache instance, we lose all of our cache capacity whenever that single instance fails or needs to be updated.
We can solve these problems by running multiple cache instances. We could split the caches up according to the type of data to be cached. For example, in a collaborative document-editing system, we might have one cache for Users, one cache for Documents, one cache for Comments, and so on. This works, but we may still have more requests for one or more of these data types than a single cache instance can handle.
-A way to solve this problem is to replicate the caches. In a replicated cache setup, we would have two or more caches serving the same data - for instance, we might choose to run three replicas of the Users cache. Reads for Users data can go to any of the three replicas. However, when Users data is updated, we must either:
+A way to solve this problem is to replicate the caches. In a replicated cache setup, we would have two or more caches serving the same data. For instance, we might choose to run three replicas of the Users cache. Reads for Users data can go to any of the three replicas. However, when Users data is updated, we must either:
- invalidate all three instances of the Users cache
- tolerate stale data until the TTL expires (as described above)
-The need to invalidate all instances of the Users cache adds cost to every write operation: more latency, because we must wait for the slowest cache instance to respond; as well as higher use of bandwidth and other computing resources, because we have to do work on every cache instance that stores the data being invalidated. This cost increases the more replicated instances we run.
+##### The cost of invalidation
+
+The need to invalidate all instances of the Users cache adds cost to every write operation. It costs us latency, because we must wait for the slowest cache instance to respond. It costs us bandwidth and other computing resources, because we have to do work on every cache instance that stores the data being invalidated. This cost increases the more replicated instances we run.
+
+##### Inconsistency is always a possibility
-There is an additional complication: what happens if we write to the Users database table, but cannot connect to one or more of the cache servers? In practice, this can happen and it means that inconsistency between datastores and caches is always a possibility in distributed systems.
+There is another complication: what happens if we write to the Users database table, but cannot connect to one or more of the cache servers? In practice, this can happen and it means that inconsistency between datastores and caches is always a possibility in distributed systems.
There is further discussion of this problem below in the section on [CAP theorems](#the-cap-theorem).
-##### Sharded Caches {#sharded-caches}
+#### Sharded Caches
Another approach to scaling cache systems is to shard the data instead of replicating it. This means that your data is split across multiple machines, instead of each instance of your cache storing all of the cached items. This is a good choice when the working set of recently accessed data is too large for any one machine. Each machine hosts a _partition_ of the data set. In this setup there is usually a router service that is responsible for proxying cache requests to the correct instance: the instance that stores the data to be accessed.
Data sharding can be vertical or horizontal. In vertical sharding, we store different fields on different machines (or sets of machines). In horizontal sharding, we split the data up by rows.
-In the case of horizontal sharding, we can shard algorithmically - meaning that the shard to store a given row is determined by a function - or dynamically, meaning that the system maintains a lookup table that tracks where data ranges are stored.
+In the case of horizontal sharding, we can shard algorithmically or dynamically.
+
+##### Algorithmic sharding
+
+Algorithmic sharding means the shard to store a given row is determined by a function.
+
+##### Dynamic sharding
+
+Dynamic sharding means that the system maintains a lookup table that tracks where data ranges are stored.
Read [Introduction to Distributed Data Storage by Quentin Truong](https://towardsdatascience.com/introduction-to-distributed-data-storage-2ee03e02a11d) for more on sharding.
+#### Problems with sharding
+
A simple algorithmic sharding implementation might route requests to one of _N_ cache shards using a modulo operation: `cache_shard = id % num_shards`. The problem is that whenever a shard is added or removed, most of the keys will now be routed to a different cache server. This is equivalent to restarting all of our caches at once and starting cold. As discussed above, this is bad for performance, and could potentially cause an outage.
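
A quick way to see the scale of the problem, using a hypothetical numeric key space, is to count how many keys change shard when we grow from four shards to five:

```go
package main

import "fmt"

// Count how many of 10,000 keys map to a different shard when the shard count
// changes from 4 to 5 under simple modulo sharding.
func main() {
    const keys = 10000
    moved := 0
    for id := 0; id < keys; id++ {
        if id%4 != id%5 {
            moved++
        }
    }
    fmt.Printf("%d%% of keys now live on a different shard\n", moved*100/keys)
    // Prints: 80% of keys now live on a different shard
}
```
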
This problem is usually solved using a technique called [consistent hashing](https://en.wikipedia.org/wiki/Consistent_hashing). A consistent hash function maps a key to a data partition, and it has the very useful property that when the number of partitions changes, most keys do not get remapped to a different partition. This means that we can add and remove cache servers safely, without reducing our cache hit rate by too much. Consistent hashing is a very widely used technique for load balancing across stateful services.
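
To make this concrete, here is a bare-bones hash ring in Go. It is a sketch of the idea rather than something you would deploy: real implementations add many 'virtual nodes' per shard for better balance, and usually come from a library or proxy rather than being hand-rolled.

```go
package consistenthash

import (
    "hash/crc32"
    "sort"
)

// Ring hashes shard names onto a circle; a key belongs to the first shard
// clockwise from the key's own hash.
type Ring struct {
    hashes []uint32          // sorted shard hashes
    shards map[uint32]string // shard hash -> shard name
}

func New(shards ...string) *Ring {
    r := &Ring{shards: make(map[uint32]string)}
    for _, s := range shards {
        h := crc32.ChecksumIEEE([]byte(s))
        r.hashes = append(r.hashes, h)
        r.shards[h] = s
    }
    sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
    return r
}

// Get returns the shard responsible for key.
func (r *Ring) Get(key string) string {
    h := crc32.ChecksumIEEE([]byte(key))
    // First shard hash at or after the key hash, wrapping around the ring.
    i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
    if i == len(r.hashes) {
        i = 0
    }
    return r.shards[r.hashes[i]]
}
```

When a shard is added or removed, only the keys whose hashes fall between that shard and its neighbour on the ring move, so roughly 1/N of the keys are remapped rather than most of them.
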
You can read about [how Pinterest scaled its Cache Infrastructure](https://medium.com/pinterest-engineering/scaling-cache-infrastructure-at-pinterest-422d6d294ece) - it’s a useful example of a real architecture. They use an open-source system called [mcrouter](https://github.com/facebook/mcrouter/wiki) to horizontally scale their use of memcached. Mcrouter is one of the most widely-used distributed cache solutions in industry.
-#### Questions: {#questions}
+### Questions
- What is a cache hit rate and why is it important?
- What do we mean by ‘cold’ or ‘warm’ when we discuss caches?
@@ -106,60 +153,64 @@ You can read about [how Pinterest scaled its Cache Infrastructure](https://mediu
- What is a TTL used for in caching?
- Why should we cache in a standalone cache service, as opposed to within our application?
-### Databases {#databases}
+### Databases
Most web applications we build include at least one database. Databases don’t have to be distributed systems: you can run a database on one machine. However, there are reasons that you might want to run a distributed database across two or more machines. The logic is similar to the rationale for running more than one cache server, as described above.
-1. Reliability: you don’t want to have an interruption in service if your database server experiences a hardware failure, power failure, or network failure.
-2. Capacity: you might run a distributed database to handle more load than a single database instance can serve.
+1. **Reliability:** you don’t want to have an interruption in service if your database server experiences a hardware failure, power failure, or network failure.
+2. **Capacity:** you might run a distributed database to handle more load than a single database instance can serve.
To do either of these things we must _replicate_ the data, which means to copy the data to at least one other database instance.
-#### The CAP Theorem {#the-cap-theorem}
+#### The CAP Theorem
-Before discussing distributed datastores further, we must introduce the CAP Theorem.
+> **C**onsistency **A**vailability **P**artition tolerance
-The [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) is a computer science concept that states that any distributed data store can provide at most two of the following three properties:
+The [CAP Theorem](https://en.wikipedia.org/wiki/CAP_theorem) is a computer science concept that states that any distributed data store can provide at most **two** of the following three properties:
-- Consistency: every read should receive the most recently written data or else an error
-- Availability: every request receives a response, but there is no guarantee that it is based on the most recently written data
-- (Network) Partition tolerance: the system continues to operate even if network connectivity between some or all of the computers running the distributed data store is disrupted
+1. Consistency: every read should receive the most recently written data or else an error
+2. Availability: every request receives a response, but there is no guarantee that it is based on the most recently written data
+3. (Network) Partition tolerance: the system continues to operate even if network connectivity between some or all of the computers running the distributed data store is disrupted
This seems complicated, but let’s break it down.
-###### Network Partition {#network-partition}
+##### Network Partition
+
+Imagine that you are running your service in two datacenters, and the fiber optic cables between them are dug up. This is a network partition.
-A network partition means that your computers cannot all communicate over the network. Imagine that you are running your service in two datacenters, and the fiber optic cables between them are dug up (there should be redundancy in these paths, of course, but accidents do happen that can take out multiple connections). This is a network partition. Your servers won’t be able to communicate with servers in the opposite datacenter. Configuration problems, or simply too much network traffic can also cause network partitions.
+A network partition means that your computers cannot all communicate over the network. Your servers won’t be able to communicate with servers in the opposite datacenter. Configuration problems, or simply too much network traffic, can also cause network partitions.
-Network partitions do occur, and there is no way that you can avoid this unpleasant fact of life. So in practice, distributed datastores have to choose between being consistent when network failures occur, or being available. It’s not a matter of choosing any two properties: you must choose either consistency and network partition tolerance or availability and network partition tolerance.
+Network partitions do happen, and there is no way that you can avoid this unpleasant fact of life. So in practice, distributed datastores have to choose between being _consistent_ when network failures occur, or being _available_. It’s not a matter of choosing any two properties: you must choose either consistency and network partition tolerance _or_ availability and network partition tolerance.
-###### Consistency and availability {#consistency-and-availability}
+##### Consistency and availability
Choosing consistency means that your application won’t be available on one side of the partition (or it might be read-only and serving old, stale data). Choosing availability means that your application will remain available, including for writes, but when the network partition is healed, you need a way of merging the writes that happened on different sides of the partition.
-The CAP Theorem is a somewhat simplified model of distributed datastores, and it has been [criticised](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html) on that basis. However, it remains a reasonably useful model for learning about distributed datastore concepts. In general, replicated traditional datastores like MySQL choose consistency in the event of network partitions; [NoSQL datastores](https://www.mongodb.com/nosql-explained) such as Cassandra and Riak tend to choose availability.
+The CAP Theorem is a simplified model of distributed datastores, and it has been [criticised](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html) on that basis. However, it remains a reasonably useful model for learning about distributed datastore concepts.
-(However, the behaviour of systems in the event of network failure can also vary based on how that specific datastore is configured, in terms of how .)
+In general, replicated traditional datastores like MySQL choose consistency in the event of network partitions; [NoSQL datastores](https://www.mongodb.com/nosql-explained) such as Cassandra and Riak tend to choose availability.
-#### Leader/Follower Datastore Replication {#leader-follower-datastore-replication}
+(However, the behaviour of systems in the event of network failure can also vary based on how that specific datastore is configured.)
-The most straightforward distributed datastore is the leader/follower datastore, also known as primary/secondary or single leader replication. The aim is to increase the availability and durability of the data: in other words, if we lose one of the datastore machines, we should not lose data and we should still be able to run the system.
+#### Leader/Follower Datastore Replication
-###### Synchronous Replication {#synchronous-replication}
+The most straightforward distributed datastore is the leader/follower datastore, also known as primary/secondary or single leader replication. The aim is to increase the availability and durability of the data. In other words, if we lose one of the datastore machines, we should not lose data and we should still be able to run the system.
+
+##### Synchronous Replication
To do something synchronously means to do something at the same time as something else.
-In synchronous replication, the datastore client sends all writes to the leader. The leader has one or more followers. For every write, the leader sends those writes to its followers, and waits for all its followers to acknowledge completion of the write before the leader sends an acknowledgement to the client. Think of a race where all the runners are tied together with a rope. Leader and followers commit the data as part of the same database operation. Reads can go either to the leader, or to a follower, depending on how the datastore is configured. Reading from followers means that the datastore can serve a higher read load, which can be useful in applications that are serving a lot of traffic.
+In synchronous replication, the datastore client sends all writes to the leader. The leader has one or more followers. For every write, the leader sends those writes to its followers, and waits for all its followers to acknowledge completion of the write before the leader sends an acknowledgement to the client.
+
+Think of a race where all the runners are tied together with a rope. Leader and followers commit the data as part of the same database operation. Reads can go either to the leader, or to a follower, depending on how the datastore is configured. Reading from followers means that the datastore can serve a higher read load, which can be useful in applications that are serving a lot of traffic.
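+
+A toy sketch of the synchronous write path (the `Follower` interface and the sequential fan-out are simplifications for illustration; real systems replicate a log and contact followers concurrently):
+
+```go
+package replication
+
+import "fmt"
+
+type Follower interface {
+    Apply(record []byte) error // returns once the follower has committed
+}
+
+type Leader struct {
+    Followers []Follower
+    log       [][]byte
+}
+
+// Write commits locally and then waits for *every* follower to acknowledge
+// before returning success to the client, so one slow or unavailable follower
+// delays or fails the whole write.
+func (l *Leader) Write(record []byte) error {
+    l.log = append(l.log, record) // leader's local commit (simplified)
+    for i, f := range l.Followers {
+        if err := f.Apply(record); err != nil {
+            return fmt.Errorf("follower %d did not acknowledge: %w", i, err)
+        }
+    }
+    return nil // durable everywhere: safe to acknowledge to the client
+}
+```
+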
There is one problem with synchronous replication: availability. Not only must the leader be available, but _all of the followers_ must be available as well. This is a problem: the system’s availability for writes will actually be lower than that of a single machine, because the chance of at least one of _N_ machines being unavailable is by definition higher than the chance of one machine being unavailable; we have to add up the probabilities of downtime.
For example, if one machine has 99.9% uptime and 0.1% downtime, and we have three machines, then we would expect the availability for all three together to be closer to 99.7% (i.e. 100% - 3 x 0.1%). Adding more replicas makes this problem worse. Response time is also a problem with synchronous replication, because the leader has to wait for all followers. This means that the system cannot commit writes faster than the slowest follower.
-
-
-![alt_text](images/image2.png "image_tooltip")
+![Leader/follower replication diagram](./leader-follower-diagram.png "leader follower diagram")
-###### Asynchronous and Semisynchronous Replication {#asynchronous-and-semisynchronous-replication}
+##### Asynchronous and Semisynchronous Replication
We can solve these performance and availability problems by not requiring the leader to write to _all_ the followers before returning to the client. Asynchronous replication means that the leader can commit the transaction before replicating to any followers. Semisynchronous replication means that the leader must confirm the write at a subset of its followers before committing the transaction.
diff --git a/primers/distributed-software-systems-architecture/scaling-stateless-services.md b/primers/distributed-software-systems-architecture/3-scaling-stateless-services/readme.md
similarity index 71%
rename from primers/distributed-software-systems-architecture/scaling-stateless-services.md
rename to primers/distributed-software-systems-architecture/3-scaling-stateless-services/readme.md
index a0a039e34..9035be6b7 100644
--- a/primers/distributed-software-systems-architecture/scaling-stateless-services.md
+++ b/primers/distributed-software-systems-architecture/3-scaling-stateless-services/readme.md
@@ -6,21 +6,23 @@ forhugo-->
# 3
-## Scaling Stateless Services {#section-3-scaling-stateless-services}
+## Scaling Stateless Services
-### Microservices or monoliths {#microservices-or-monoliths}
+### Microservices or monoliths
A monolith is a single large program that encapsulates all the logic for your application. A microservice architecture, on the other hand, splits application functionality across a number of smaller programs, which are composed together to form your application. Both have advantages and drawbacks.
-Read [Microservices versus Monoliths](https://www.atlassian.com/microservices/microservices-architecture/microservices-vs-monolith) for a discussion of microservices and monoliths. Optionally, watch - a comedy about the extremes of microservice-based architectures. The middle ground between a single monolith and many tiny microservices is to break the application into a more moderate number of services, each of which have high cohesion (or relatedness). Some call this approach [‘macroservices’](https://www.geeksforgeeks.org/software-engineering-coupling-and-cohesion/).
+Read [Microservices versus Monoliths](https://www.atlassian.com/microservices/microservices-architecture/microservices-vs-monolith) for a discussion of microservices and monoliths. Optionally, watch [a comedy about the extremes of microservice-based architectures](https://www.youtube.com/watch?v=y8OnoxKotPQ).
-### Horizontal Scaling and Load Balancing {#horizontal-scaling-and-load-balancing}
+The middle ground between a single monolith and many tiny microservices is to break the application into a more moderate number of services, each of which has high cohesion (or relatedness). Some call this approach [‘macroservices’](https://www.geeksforgeeks.org/software-engineering-coupling-and-cohesion/).
-We have seen how stateless and stateful services can be scaled horizontally - i.e. run across many separate machines to provide a scalable service that can serve effectively unlimited load - with the correct architecture.
+### Horizontal Scaling and Load Balancing
+
+We have seen how stateless and stateful services can be scaled horizontally – i.e. run across many separate machines to provide a scalable service that can serve effectively unlimited load – with the correct architecture.
Load balancers are an essential component of horizontal scaling. For stateless services, the load balancers are general-purpose proxies (like Nginx, Envoy Proxy, or HAProxy). For scaling datastores, proxies are typically specialised, like mcrouter or the Vitess vtgates.
-### How load balancers work {#how-load-balancers-work}
+### How load balancers work
It is worth understanding a little of how load balancers work. Load balancers today are typically based on software, although hardware load balancer appliances are still available. The biggest split is between Layer 4 load balancers, which focus on balancing load at the level of TCP/IP packets, and Layer 7 load balancers, which understand HTTP.
@@ -34,7 +36,7 @@ Read [Introduction to modern network load balancing and proxying by Matt Klein](
- What is a Service Mesh? What are the advantages and disadvantages of a service mesh?
- What is a VIP? What is Anycast?
-#### Round Robin and other loadbalancing algorithms {#round-robin-and-other-loadbalancing-algorithms}
+#### Round Robin and other loadbalancing algorithms
Many load balancers use Round Robin to allocate connections or requests to backend servers. This means that they assign the first connection or request to the first backend, the second to the second, and so on until they loop back around to the first backend. This has several virtues: it is simple to understand, it doesn’t need a lot of complex state or feedback to be managed, and it’s difficult for anything to go very wrong with it.
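
A bare-bones round-robin picker might look like the sketch below (the `Backend` type is illustrative; real load balancers also track health, weights, and connection counts):

```go
package loadbalancer

import "sync/atomic"

type Backend struct{ Addr string }

// RoundRobin hands out backends in order, wrapping back to the first after
// the last. The atomic counter keeps it safe to call from many goroutines.
type RoundRobin struct {
    backends []Backend
    next     atomic.Uint64
}

func NewRoundRobin(backends ...Backend) *RoundRobin {
    return &RoundRobin{backends: backends}
}

func (rr *RoundRobin) Pick() Backend {
    n := rr.next.Add(1) - 1
    return rr.backends[n%uint64(len(rr.backends))]
}
```
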
@@ -44,63 +46,69 @@ Now consider the weighted response time loadbalancing algorithm, which sends the
If one server is misconfigured and happens to be very very rapidly serving errors or empty responses, then loadbalancers configured to use weighted response time algorithms would send more traffic to the faulty server.
-##### DNS {#dns}
+#### DNS
There is one more approach to load balancing that is worth knowing about: [DNS Load Balancing](https://www.cloudflare.com/en-gb/learning/performance/what-is-dns-load-balancing/). DNS-based load balancing is often used to route users to a specific set of servers that is closest to their location, in order to minimize latency (page load time).
-### Performance: Edge Serving and CDNs {#performance-edge-serving-and-cdns}
+### Performance: Edge Serving and CDNs
Your users may be anywhere on Earth, but quite often, your serving infrastructure (web applications, datastores, and so on) is all in one region. This means that users in other continents may find your application slow. A network round-trip between Sydney in Australia and most locations in the US or Europe takes around 200 milliseconds. 200ms is not overly long, but the problem is that serving a user request may take several round trips.
-##### Round trips {#round-trips}
+#### Round trips
First, the user may need to look up the DNS name for your site (this may be cached nearby). Next, they need to open a TCP connection to one of your servers, which requires a network round trip. Finally, the user must perform a [SSL handshake](https://zoompf.com/blog/2014/12/optimizing-tls-handshake/) with your server, which also requires one or two network round trips, depending on configuration of client and server (a recent session may be resumed in one round trip if both client and server support TLS 1.3).
-All of this takes place before any data may be returned to the user, and, unless there is already an open TCP connection between the user and the website, involves an absolute minimum of three network round trips before the first byte of data can be received by the client.
+All of this takes place before any data may be returned to the user, and, unless there is already an open TCP connection between the user and the website, involves an absolute minimum of _three network round trips_ before the first byte of data can be received by the client.
-##### SSL Termination at the Edge {#ssl-termination-at-the-edge}
+#### SSL Termination at the Edge
SSL termination _at the edge_ is the solution to this issue. This involves running some form of proxy much nearer to the user which can perform the SSL handshake with the user. If the user need only make network round trips of 10 milliseconds to a local Point of Presence (PoP), as opposed to 200 milliseconds to serving infrastructure in a different continent, then a standard TCP connection initiation and SSL handshake will take only around 60 milliseconds, as opposed to 1.2 seconds.
Of course, the edge proxies must still relay requests to your serving infrastructure, which remains 200 milliseconds of network latency away. However, the edge proxies will have persistent encrypted network connections to your servers: there is no per-request overhead for connection setup or handshakes.
-##### Content Delivery Networks {#content-delivery-networks}
+#### Content Delivery Networks
Termination at the edge is a service often performed by [Content Delivery Networks](https://en.wikipedia.org/wiki/Content_delivery_network) (CDNs). CDNs can also be used to cache static assets such as CSS or images close to your users, in order to reduce site load time as well as reducing load on your origin servers. You can also run your own compute infrastructure close to your users. However, this is an area of computing where being big is an advantage: it is hard to beat the number of edge locations that large providers like Cloudflare, Fastly, and AWS operate.
[Edge Regions in AWS](https://www.lastweekinaws.com/blog/what-is-an-edge-location-in-aws-a-simple-explanation/) is worth reading to get an idea of the scale of Amazon’s edge presence.
-### QUIC {#quic}
+### QUIC
It is worth being aware of [QUIC](https://en.wikipedia.org/wiki/QUIC), an emerging network protocol that is designed to be faster for modern Internet applications than TCP/IP. While it is by no means ubiquitous yet, it is certainly an area that the largest Internet companies are investing in. HTTP/3, the next major version of the HTTP protocol, uses QUIC.
Read about [HTTP over QUIC](https://blog.cloudflare.com/http3-the-past-present-and-future/) in the context of the development of the HTTP protocol.
-### Autoscaling {#autoscaling}
+### Autoscaling
Aside from load balancing, the other major component of successful horizontal scaling is autoscaling. Autoscaling means to scale the number of instances of your service up and down according to the load that the service is experiencing. This can be more cost-effective than sizing your service for expected peak loads.
On AWS, for example, you can create an Autoscaling Group (ASG) which acts as a container for your running EC2 instances. ASGs can be [configured](https://docs.aws.amazon.com/autoscaling/ec2/userguide/scale-your-group.html) to scale up or scale down the number of instances based on a schedule, based on [predicted load](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-predictive-scaling.html) (based on the past two weeks of history), or based on [current metrics](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html). The Kubernetes [Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) (HPA) provides similar functionality to ASG scaling policies, for Kubernetes workloads.
-##### CPU utilisation {#cpu-utilisation}
+#### CPU utilisation
+
+CPU utilisation is very commonly used as a scaling signal. It is a good general proxy for how 'busy' a particular instance is. It’s a generic metric that your platform understands, as opposed to an application-specific metric.
-CPU utilisation is very commonly used as a scaling signal. It is a good general proxy for how ‘busy’ a particular instance is, and it’s a ‘generic’ metric that your platform understands, as opposed to an application-specific metric.
+The CPU utilisation target should not be set too high: your service will take some time to scale up when load increases, and you might experience failure of a subset of your infrastructure (think of a bad code push to a subset of your instances, or a failure of an AWS Availability Zone).
-The CPU utilisation target should not be set too high. Your service will take some time to scale up when load increases. You might experience failure of a subset of your infrastructure (think of a bad code push to a subset of your instances, or a failure of an AWS Availability Zone). Your service probably cannot serve reliably at close to 100% CPU utilisation. 40% utilisation is a common target.
+Your service probably cannot serve reliably at close to 100% CPU utilisation. 40% utilisation is a common target.
-### Autoscaling and Long-Running Connections {#autoscaling-and-long-running-connections}
+### Autoscaling and Long-Running Connections
One case where autoscaling does not help you to manage load is when your application is based on long-running connections, such as websockets or gRPC streaming. Managing these at scale can be challenging. Read the article [Load balancing and scaling long-lived connections in Kubernetes](https://learnk8s.io/kubernetes-long-lived-connections).
- Why doesn’t autoscaling work to redistribute load in systems with long-lived connections?
- How can we make these kinds of systems robust?
-### Project work for this section {#project-work-for-this-section}
+## Project work for this section
+
+- **[Multiple Servers](projects/multiple-servers)**
+
+### Stretch
-- [https://github.com/CodeYourFuture/immersive-go-course/tree/main/multiple-servers](https://github.com/CodeYourFuture/immersive-go-course/tree/main/multiple-servers)
-- Optional: do this HashiCorp tutorial. It will give you hands-on experience with seeing health checks can be used to manage failure. [https://learn.hashicorp.com/tutorials/consul/introduction-chaos-engineering?in=consul/resiliency](https://learn.hashicorp.com/tutorials/consul/introduction-chaos-engineering?in=consul/resiliency)
-- Optional: demonstrate circuit breaking. [https://learn.hashicorp.com/tutorials/consul/service-mesh-circuit-breaking?in=consul/resiliency](https://learn.hashicorp.com/tutorials/consul/service-mesh-circuit-breaking?in=consul/resiliency)
-- Optional: see different kinds of load-balancing algorithms in use: [https://learn.hashicorp.com/tutorials/consul/load-balancing-envoy?in=consul/resiliency](https://learn.hashicorp.com/tutorials/consul/load-balancing-envoy?in=consul/resiliency)
-- Optional: do this tutorial which demonstrates autoscaling with minikube (you will need to install minikube on your computer if you don’t have it). It will give you hands-on experience in configuring autoscaling, plus some exposure to Kubernetes configuration.
+- [Consul and Chaos Engineering](https://learn.hashicorp.com/tutorials/consul/introduction-chaos-engineering?in=consul/resiliency): This HashiCorp tutorial will give you hands-on experience of how health checks can be used to manage failure.
+- [Implement Circuit Breaking in Consul Service Mesh with Envoy](https://learn.hashicorp.com/tutorials/consul/service-mesh-circuit-breaking?in=consul/resiliency): Demonstrates circuit breaking.
+- [Load Balancing Envoy](https://learn.hashicorp.com/tutorials/consul/load-balancing-envoy?in=consul/resiliency): See different kinds of load-balancing algorithms in use.
+- Do the LearnDevOps tutorial which demonstrates autoscaling with minikube.
+ You will need to install minikube on your computer if you don’t have it. It will give you hands-on experience in configuring autoscaling, plus some exposure to Kubernetes configuration.
- You may need to run this first: `minikube addons enable metrics-server`
- - [https://devops.novalagung.com/kubernetes-minikube-deployment-service-horizontal-autoscale.html](https://devops.novalagung.com/kubernetes-minikube-deployment-service-horizontal-autoscale.html)
+ - [Kubernetes - Deploy App into Minikube Cluster using Deployment controller, Service, and Horizontal Pod Autoscaler](https://learndevops.novalagung.com/kubernetes-minikube-deployment-service-horizontal-autoscale.html)
diff --git a/primers/distributed-software-systems-architecture/asynchronous-work-and-pipelines.md b/primers/distributed-software-systems-architecture/4-asynchronous-work-and-pipelines/readme.md
similarity index 98%
rename from primers/distributed-software-systems-architecture/asynchronous-work-and-pipelines.md
rename to primers/distributed-software-systems-architecture/4-asynchronous-work-and-pipelines/readme.md
index a977c6ed6..c4f74f94b 100644
--- a/primers/distributed-software-systems-architecture/asynchronous-work-and-pipelines.md
+++ b/primers/distributed-software-systems-architecture/4-asynchronous-work-and-pipelines/readme.md
@@ -1,6 +1,6 @@
diff --git a/primers/distributed-software-systems-architecture/distributed-locking-and-distributed-consensus.md b/primers/distributed-software-systems-architecture/5-distributed-locking-and-distributed-consensus/readme.md
similarity index 100%
rename from primers/distributed-software-systems-architecture/distributed-locking-and-distributed-consensus.md
rename to primers/distributed-software-systems-architecture/5-distributed-locking-and-distributed-consensus/readme.md
diff --git a/primers/distributed-software-systems-architecture/_index.md b/primers/distributed-software-systems-architecture/README.md
similarity index 100%
rename from primers/distributed-software-systems-architecture/_index.md
rename to primers/distributed-software-systems-architecture/README.md
diff --git a/primers/distributed-software-systems-architecture/rpcs.jpeg b/primers/distributed-software-systems-architecture/rpcs.jpeg
deleted file mode 100644
index 917fb6a49..000000000
Binary files a/primers/distributed-software-systems-architecture/rpcs.jpeg and /dev/null differ
diff --git a/primers/distributed-software-systems-architecture/rpcs.png b/primers/distributed-software-systems-architecture/rpcs.png
deleted file mode 100644
index 8279e0241..000000000
Binary files a/primers/distributed-software-systems-architecture/rpcs.png and /dev/null differ
diff --git a/primers/troubleshooting/_index.md b/primers/troubleshooting/README.md
similarity index 96%
rename from primers/troubleshooting/_index.md
rename to primers/troubleshooting/README.md
index 1a56e59a7..860c60f54 100644
--- a/primers/troubleshooting/_index.md
+++ b/primers/troubleshooting/README.md
@@ -2,7 +2,7 @@
+++
title="Troubleshooting Primer"
author="Laura Nolan and Radha Kumari"
-date="28 Dec 2022 12:22:11 BST"
+date="28 Dec 2022 12:22:11 BST"
+++
forhugo-->
diff --git a/primers/troubleshooting/primer.md b/primers/troubleshooting/primer.md
deleted file mode 100644
index 769bd66a8..000000000
--- a/primers/troubleshooting/primer.md
+++ /dev/null
@@ -1,310 +0,0 @@
-
-
-# Troubleshooting Primer
-
-## About this document {#about-this-document}
-
-This document is a crash course in troubleshooting. It is aimed at people with some knowledge of computing and programming, but without significant professional experience in operating distributed software systems. This document does not aim to prepare readers to be oncall or be an incident responder. It aims primarily to describe the skills needed to make progress in day-to-day software operations work, which often involves a lot of troubleshooting in development and test environments.
-
-## Learning objectives:
-
-- Explain troubleshooting and how it differs from debugging
-- Name common troubleshooting methods
-- Experience working through some example scenarios.
-- Use commonly used tools to troubleshoot
-
-## Troubleshooting Versus Debugging {#troubleshooting-versus-debugging}
-
-Troubleshooting is related to debugging, but isn’t the same thing.
-
-Debugging relates to a codebase into which the debugger has full visibility. In general, the scope of the debugging is limited to a single program; in rare cases it might extend to libraries or dependent systems. In general, debugging techniques involve the use of debuggers, logging, and code inspection.
-
-In troubleshooting we investigate and resolve problems in complex systems. The troubleshooter may not know which subsystem the fault originates with. The troubleshooter may not know the full set of relationships and dependencies in the system exhibiting the fault, and they may not be able to inspect all of the source code (or familiarity with the codebases involved).
-
-We can use debugging to fix a system of limited complexity that we know completely. When many programs interact in a complex system that we can’t know completely, we must use troubleshooting.
-
-Examples of troubleshooting:
-
-- finding the cause of unexpectedly high load in a distributed system
-- finding why packets are being dropped in a network between two hosts
-- determining why a program is failing to run correctly on a host.
-
-## General Troubleshooting Methods {#general-troubleshooting-methods}
-
-Because troubleshooting involves a wide variety of systems, some of which may be unknown, we cannot create a comprehensive troubleshooting curriculum. Troubleshooting is not just learning a bank of answers but learning how to ask, and answer, questions methodically. Troubleshooters use documentation, tooling, logic, and discussion to analyse system behaviours and close in on problems. Being able to recognize gaps in your knowledge and to fill them is a key troubleshooting skill. Filling in these gaps becomes easier the wider your knowledge becomes over time.
-
-There are some approaches to troubleshooting that are generally useful:
-
-- Defining the problem
-- Understanding the request path
-- Bisecting the problem space
-- Generating hypotheses and proving or disproving them - this step is iterative, in other words, you keep doing this until you have found the problem
-
-### Defining the problem {#defining-the-problem}
-
-To troubleshoot a problem, you must be able to understand it and describe it. This is a starting point for discussion with technical and non-technical stakeholders.
-
-Begin your notes as you begin your investigation. Take lots of notes, and record each step you take in your journey towards understanding. If things get difficult, it’s valuable to have a record of what your theories were, how you tested them, and what happened. The first step is to make your first definition of the problem, which you will update very often.
-
-This step is necessary because problem reports are often vague and poorly-defined. It’s unusual for _reports_ of a problem to accurately _diagnose_ the problem. More often, they describe symptoms. Where reports do attribute causes, this is often without systematic reasoning and might be completely wrong.
-
-For example, a user might report that your website is slow. You verify this using your monitoring systems, which display graphs of page load times showing an increase in the past day. Based on your knowledge of the system, you know that slow requests can happen when the database is having issues. Using tooling (such as a [monitoring system](https://www.rabbitmq.com/monitoring.html), or[ logging slow queries](https://severalnines.com/blog/how-identify-postgresql-performance-issues-slow-queries/)) you can verify whether the database is indeed having performance issues, and from this you have your base definition of the problem. You can share this, and use it to build upon. We still do not know \_why \_the database latency is high, but it is a specific problem that we can investigate.
-
-Often we do not know the actual cause of the problem, we just see the impact on other systems. It is like a rock thrown into a lake. The rock is out of sight, but we can still see its impact, as ripples on the water. The ripples help us guess the size and location of the rock in the lake.
-
-- What do we already know about the issue?
-- Can we describe the problem to others?
-- Can we find proof that validates our assumptions?
-- Can we reproduce the issue?
-- What are the effects of the issue?
-- Can we see the effects in a graph somewhere?
-- Is it affecting all or only some operations?
-
-It can be difficult to untangle cause and effect in computer systems. A database seeming slow may be an effect \_caused \_by additional load exhausting database resources. Another commonly-observed example is systems experiencing spikes in the number of incoming requests due to retries when there are problems with an application. If an application is very slow or serving errors, clients and users will naturally retry, creating unusually high load on the system. The additional load may cause further problems; but some other problem has triggered the application’s incorrect behaviour.
-
-### Understanding the request path {#understanding-the-request-path}
-
-Often the first challenge in any troubleshooting exercise is to figure out what is supposed to be happening in the system that is exhibiting a fault.
-
-First figure out what is _meant_ to be happening, and then determine what is _actually_ happening.
-
-It’s like ordering food and having the driver turn up at the wrong address: where in the flow did things diverge from what was expected?
-
-- Was the wrong address provided?
-- Was the address read incorrectly?
-- Are there two identically named streets?
-- Has GPS broken?
-
-Making a mental model of the request path helps you navigate through the issue and narrow down the component(s) in the request path that might be having issues. It can be helpful to sketch an actual diagram on paper to help you build your mental model.
-
-Depending on the situation, there may be documentation that can help you understand the intended operation of your system and build that mental model. Open-source software often comes with extensive manuals. In-house developed systems may also have documentation. However, often you need to use troubleshooting tools and experimentation to understand what is happening (there is a short catalogue of useful tools in a later section).
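-
-One cheap way to see what is actually happening along an HTTP request path is curl’s timing breakdown; a minimal sketch, with example.com standing in for your own endpoint:
-
-```
-# Prints DNS lookup, TCP connect, TLS handshake, time-to-first-byte, and total time for one request.
-curl -s -o /dev/null -w 'dns=%{time_namelookup} connect=%{time_connect} tls=%{time_appconnect} ttfb=%{time_starttransfer} total=%{time_total}\n' https://example.com/
-```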
-
-### Finding the cause of the problem: generating and validating hypotheses {#finding-the-cause-of-the-problem-generating-and-validating-hypotheses}
-
-Once we determine and observe that there is a problem, we have supporting evidence to say that the [happy path](https://en.wikipedia.org/wiki/Happy_path) is not being taken. However, we still do not understand the cause, or how to fix it.
-
-Once we believe we know what might be causing the problem, we have a hypothesis. We then use our own skills, or the skills of others, to verify that the suspected cause really is the cause. With that confirmed, we can mitigate or fix the problem.
-
-#### Looking for causes, or “what happened earlier” {#looking-for-causes-or-“what-happened-earlier”}
-
-When a problem has a distinct ‘starting time’, it can be worth checking for changes or events that occurred at that time. Not all problems are a consequence of recent events or changes - sometimes latent problems in systems that have existed for years can be triggered. However, changes (such as deployments and configuration updates) are the most common cause of problems in software systems, so recent changes and events should not be ignored.
-
-For example, a graph might line up with an alert we received at 13:00, but show that things actually started getting worse at 12:00. If you leave a tap running, it isn’t a problem at first, but eventually the sink overflows. The alert is the sink overflowing; the _cause_ is leaving the tap running.
-
-So when we have a better understanding of when things started to get worse, we need to see if anything changed around the same time.
-
-- Did a deploy go out?
-- Was a feature flag changed?
-- Was there a sudden increase in traffic?
-- Did a server vanish?
-
-We can see that a deployment went out at 12:00. This feels like a hypothesis to dig into.
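-
-How you answer these questions depends on your tooling. As a rough sketch, assuming the service is built from a Git repository and deployed on Kubernetes (both assumptions, with a hypothetical deployment name):
-
-```
-# What merged recently that might line up with the 12:00 regression?
-git log --since="2 hours ago" --oneline
-# When did the service last roll out?
-kubectl rollout history deployment/my-service
-```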
-
-#### Examining possible causes {#examining-possible-causes}
-
-So now we can find the changes that were included in that deployment. We, or a subject matter expert, can then assess whether those changes might be related to the problem.
-
-#### Testing solutions {#testing-solutions}
-
-If we believe that the recent change might be the problem, then we may be able to revert the change. In stateless systems reverting recent changes is generally low-risk. Reverting changes to stateful systems must only be done with great care: an older codebase may not be able to read data written by a newer version of the system and data loss or corruption may result. However, in most production software systems, the great majority of changes that are made are safe to revert, and it is often the quickest way to fix a problem.
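-
-A hedged sketch of what “reverting the change” might look like where the service is built from Git and deployed on Kubernetes (the tooling and names are assumptions, not a prescription):
-
-```
-# Revert the suspect commit and redeploy through your usual pipeline...
-git revert <suspect-commit-sha>
-# ...or roll the running deployment back to its previous revision.
-kubectl rollout undo deployment/my-service
-```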
-
-If reverting a recent change makes the problem go away, that is a very strong signal that the change was indeed the cause of the issue, although coincidences do occur: on its own, a successful revert neither definitively proves nor disproves our hypothesis that the recent change was the problem.
-
-The next step is to examine those changes more closely and determine exactly how they caused the alert, which would definitively prove the hypothesis.
-
-It is often easier to disprove a hypothesis than to prove it. For example, if one of the changes in the recent deployment (that we just rolled back) introduced a dependency on a new service, and we hypothesise that this new service’s performance might be the cause of the problem, we can inspect the monitoring dashboard for that microservice. If the monitoring shows that the new service is performing well (low latency and low error rates) then the hypothesis would be disproved. We can then generate new hypotheses and try to [falsify](https://en.wikipedia.org/wiki/Falsifiability) those.
-
-### Finding problems by iteratively reducing the possibility space {#finding-problems-by-iteratively-reducing-the-possibility-space}
-
-It is not always possible to track down problems by taking the fast path of looking for recent breaking changes. In this case, we need to use our knowledge of the system (including information that we gain by using debugging tools) to zero in on parts of our system that are not behaving as they should be.
-
-_All swans are white vs no swan is black_
-
-It is more efficient to find a way to disprove your hypothesis or falsify your proposition, if you can. This is because you only need to disprove something once to discard it, but you may _apparently_ verify a hypothesis many times in many different ways and still be wrong.
-
-Every time we disprove a hypothesis we reduce the scope of our problem, which is helpful. We close in on something.
-
-For example, let us imagine that we are troubleshooting an issue in a system that sends notifications to users at specific times. We know that the system works as follows:
-
-1. The user configures a notification using a web service
-2. The notification configuration is written into a database
-3. A scheduler reads from that database and then sends a push notification to the user’s device
-
-In this situation, the problem could be in any of these steps. Perhaps:
-
-1. The web service silently failed to write to the database for some reason
-2. The database lost data due to a hardware failure
-3. The scheduler had a bug and failed to operate correctly
-4. The push notification could not be sent to the user for some reason; perhaps network-related
-
-A good place to start would be checking the scheduler program’s logs to determine whether or not it attempted to send the notification. To do this, you would need some specifics about the request, such as a user ID, to identify the correct log lines.
-
-Finding the log line (or not finding it) will tell you if the scheduler attempted to send the notification. If the scheduler did attempt to send the notification, then it eliminates database write failures, data loss, and some types of scheduler bug from your search, and indicates that you should focus on what happens on the part of the request path involving the scheduler sending the notification. You may find error details in the scheduler log lines.
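-
-In practice, that log search might look something like the following sketch (the log path, log format, and user ID are all hypothetical):
-
-```
-# Did the scheduler attempt to send a notification for this user around the time in question?
-grep "user_id=12345" /var/log/scheduler/scheduler.log | grep -i "notification"
-```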
-
-If you do not find a log line for the specific request in question - and you do see log lines relating to other notifications around that same time - then it should signal you to examine the first three steps in the process and iterate. Checking whether the notification configuration is present in the database would further reduce the space of possible problems.
-
-### USE Method {#use-method}
-
-[Brendan Gregg’s USE](https://www.brendangregg.com/usemethod.html) (Utilisation, Saturation, Errors) method is particularly helpful for troubleshooting performance problems. Slowness and performance problems can be more difficult to debug than full breakage of some component, because when something is completely down (unresponsive and not serving requests) it’s generally more obvious than something being slow.
-
-Performance problems are generally a result either of some component in the system being highly utilised or saturated, or of errors somewhere that are going unnoticed.
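-
-These signals map onto standard Linux tools. A quick sketch of USE-style checks on a single host (in the spirit of Gregg’s 60-second analysis; exact output and column names vary by distribution):
-
-```
-uptime             # load averages: a rough CPU saturation signal
-vmstat 1 5         # r (run queue), si/so (swapping), us/sy/id (CPU utilisation)
-iostat -xz 1 3     # per-disk utilisation (%util) and queue sizes
-sar -n DEV 1 3     # network interface throughput
-dmesg | tail       # recent kernel errors (OOM kills, I/O errors, and so on)
-```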
-
-#### Bottlenecks {#bottlenecks}
-
-Imagine a wine bottle and a wide tumbler, both holding the same amount of water. Turn the bottle and the glass upside down. The water in the glass falls out at once. The water in the bottle empties more slowly: it sits above the narrow neck of the bottle, and the speed of the pour is limited by the capacity of the bottle neck.
-
-When a component - either physical, such as CPU or network, or logical, such as locks or cloud API quota - is too highly utilised, other parts of the system will end up waiting for the heavily-loaded component. This occurs because of queuing: when a system is under very heavy load, requests cannot usually be served quickly and on average, will have to wait. The heavily loaded component is known as a bottleneck.
-
-The performance problem then tends to spread beyond the original bottleneck. The clients of the bottleneck system serve requests more slowly (because they are waiting for the bottleneck system), and this in turn affects clients of those systems. This is why heavy utilisation and saturation are very important signals in your systems.
-
-## Scenarios {#scenarios}
-
-### Slack Client Crashes {#slack-client-crashes}
-
-Let’s look at [Slack's Secret STDERR Messages](https://brendangregg.com/blog/2021-08-27/slack-crashes-secret-stderr.html) by Brendan Gregg. This is a great piece not only because of Gregg’s expertise but because of how clearly Gregg describes his process.
-
-Here, Gregg is attempting to troubleshoot why his Slack client seems to be crashing. He doesn’t know much about the inner workings of the software, and he doesn’t have the code, so he has to treat it as a black box. However, in this case, he does have a fairly clear probable location of the fault: the Slack program itself.
-
-He knows that the program is exiting, so he starts there. He has a false start looking for a [core dump](https://linux-audit.com/understand-and-configure-core-dumps-work-on-linux/) to attach a debugger to, but this doesn’t work out so he moves on to try a tool called [exitsnoop](https://manpages.ubuntu.com/manpages/jammy/en/man8/exitsnoop-bpfcc.8.html). Exitsnoop is an eBPF-based tool that traces process termination. Gregg finds that the Slack client is exiting because it receives a SIGABRT signal (which by default causes the process to terminate).
-
-Gregg still doesn’t know why the program is receiving this signal. He tries some more tracing tools - trying to get a stack trace - but draws a blank. He moves on to looking for logs instead.
-
-He has no idea where Slack logs to, so he uses the lsof tool. Specifically, he runs
-
-```
-lsof -p `pgrep -n slack` | grep -i log
-```
-
-This command line does the following:
-
-1. pgrep -n slack: find the process ID of the most recently started slack process.
-2. `pgrep -n slack`: the use of backticks in a command line means ‘run the commands inside the backticks first, and then substitute the result’.
-3. lsof -p `pgrep -n slack`: this runs lsof -p <PID of the most recently started slack process>. lsof with the ‘-p’ flag lists all the open files that belong to the given PID.
-4. grep -i log: this searches for the word ‘log’ in the text that the command is given. The ‘-i’ flag just makes it case insensitive.
-5. |: this is the pipe symbol. It takes the output of the command on its left and sends it as input to the command on its right.
-
-The overall command line searches for any file that the most recently started slack process has open, with the word ‘log’ (case insensitive) in the name.
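-
-The same command can also be written with the more readable `$(...)` form of command substitution:
-
-```
-lsof -p "$(pgrep -n slack)" | grep -i log
-```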
-
-This sort of command line is a result of the [UNIX tools philosophy](https://www.linuxtopia.org/online_books/gnu_linux_tools_guide/the-unix-tools-philosophy.html): commandline tools should be flexible and composable using mechanisms like the pipe (|). This also means that in a Linux/UNIX system, there are often many ways to achieve the same result. Likewise, Linux largely makes system state available to administrators. System state is all the state of the running kernel and its resources - such as which files are being held open by which process, which TCP/IP sockets are listening, which processes are running, and so forth. This state is generally exposed via the [/proc filesystem](https://www.kernel.org/doc/html/latest/filesystems/proc.html), as well as a variety of commandline tools.
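-
-For example, much of what lsof reports can be read straight out of /proc (the PID here is hypothetical):
-
-```
-ls -l /proc/4242/fd       # open file descriptors for process 4242
-head /proc/4242/status    # process name, state, memory usage, and more
-```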
-
-Gregg doesn’t find any log files, but he realises that the logs might still exist, having been opened by another slack process. He tries a program called [pstree](https://man7.org/linux/man-pages/man1/pstree.1.html) which displays the entire tree of slack processes. It turns out that there are a number of slack processes, and he tries lsof again with the oldest. This time he finds log files.
-
-Gregg is an expert troubleshooter, but we see that he attempts and abandons several methods for understanding the reason for this program’s failure. This is fairly typical of troubleshooting, unfortunately. He is also using his knowledge of core Linux concepts - processes, open files - to increase his knowledge about how the slack program works.
-
-Once more, Gregg finds that the slack program logs don’t reveal the reason for the program’s crashing. However, he notices that the [stderr stream](https://www.howtogeek.com/435903/what-are-stdin-stdout-and-stderr-on-linux/) isn’t present in the log files.
-
-Gregg knows of another tool, shellsnoop, that he uses to read the stderr stream.
-
-Here he finds an error:
-
-```
-/snap/slack/42/usr/lib/x86_64-linux-gnu/gdk-pixbuf-2.0/2.10.0/loaders/libpixbufloader-png.so: cannot open shared object file: No such file or directory (gdk-pixbuf-error-quark, 5)
-```
-
-This error log indicates that the slack process tried to dynamically load a [shared code module](https://programs.wiki/wiki/linux-system-programming-static-and-dynamic-libraries.html) that didn’t exist. Resolving that issue resolves the failures.
-
-Gregg concludes by pointing out that the tool opensnoop could have led him straight to this answer; but of course, that’s much easier to know in hindsight.
-
-### Broken Load Balancers {#broken-load-balancers}
-
-Let’s look at the [broken loadbalancer scenario](https://hostedgraphite1.wordpress.com/2018/03/01/spooky-action-at-a-distance-how-an-aws-outage-ate-our-load-balancer/) from Hosted Graphite.
-
-On a Friday night, Hosted Graphite engineers start receiving reports of trouble: their monitoring indicates that their website is intermittently becoming unavailable. Their periodic checks of the website and the graph-rendering API are both intermittently timing out.
-
-They begin to gather more information to find out what might be happening. They note an overall drop in traffic across all of their ingestion endpoints, which suggests there might be a network connectivity issue at play. Comparing the impact on canary traffic with HTTP API endpoint traffic, they notice that canary ingestion is affected only in certain AWS regions, while HTTP API ingestion is affected regardless of location. To add further confusion, some of their internal services also start to report timeouts.
-
-The information is conflicting, but all of the events point to an AWS connectivity issue, so they decide to follow that thread.
-
-Digging further, they realise that the internal services having issues rely on S3 (another AWS service) and that their AWS-dependent integrations are also severely impacted. At this point the AWS status page is reporting connectivity issues in both the us-east-1 and us-west-2 regions, which is even more confusing: they cannot see _how_ an AWS outage could affect how they serve their website when it is neither hosted on AWS nor physically located anywhere near the affected regions.
-
-**One of the hardest problems during any incident is differentiating cause(s) from symptoms.**
-
-So they start looking into the only AWS-hosted service they use: [Route53 health checks](https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/welcome-health-checks.html) for their own (self-hosted) load balancers. These health checks are configured so that traffic is only routed to healthy load balancers to serve production traffic, with unhealthy ones removed from the DNS entry. The health check logs indicate failures from many locations. They don’t know whether this is a cause or a symptom, so they disable the Route53 health checks to confirm or rule out that theory. Unfortunately, disabling the health checks doesn’t resolve the issue, so they continue digging for clues.
-
-At this point the AWS incident is resolved, and they notice that their traffic rate starts to recover with it, further confirming that the AWS outage was the trigger for this incident. Now they know what happened, but not why.
-
-They find two visualisations from their load balancing logs which help them paint a clear picture of what happened. The first graph shows a sharp rise in active connections through their load balancing tier, followed by a flat line during exactly the period of greatest impact, and then a decline towards the end of the incident. This explains the SSL handshake problems they noticed: once the maximum connection limit was reached, their load balancers would not accept any new connections.
-
-This still doesn’t explain where these connections were coming from, as they didn’t see any increase in the number of requests received. This is where the second visualisation comes into the picture. It shows the average time it took hosts from different ISPs to send a full request, over time; the top row represents requests from AWS hosts. Those requests were taking up to five seconds to complete, while requests from other ISPs remained largely unaffected. At this point they finally crack the case.
-
-The AWS hosts from the affected regions were experiencing connectivity issues which significantly slowed down their connections to the load balancers. As a result, these hosts were hogging all the available connections until they hit a connection limit in their load balancers, causing hosts in other locations to be unable to create a new connection.
-
-## Tools {#tools}
-
-There is no exhaustive list of troubleshooting tools. We use whatever is both available and best-suited for the problem at hand. Sometimes that’s a general-purpose tool – like Linux OS-level tooling or TCP packet dumps – and sometimes it’s something system-specific, like application-specific counters, or logging.
-
-In many cases, Google is your friend. It is a starting point for understanding what an error message may mean and what may have caused the error. It is also a good way to find new observability tools (e.g. searching for things like ‘linux how to debug full disk’ will generally throw up some tutorials).
-
-However: do not spend too long trying to find information in this way. Google is your friend, but it is not your only friend. It is easy to fall into a rabbit hole and lose your entire day to Googling, so give yourself some time limits. Set an alarm for 90 minutes. Write up your journey with the problem so far and take it to a more senior engineer. They will appreciate the work you have already done on describing and exploring the problem.
-
-System-specific monitoring can help you: does the system you are investigating export metrics (statistics exposed by a software program for purposes of analysis and alerting)? Does it have a status page, or a command-line tool that you can use to understand its state?
-
-Examples:
-
-- [https://www.haproxy.com/blog/exploring-the-haproxy-stats-page/](https://www.haproxy.com/blog/exploring-the-haproxy-stats-page/)
-- [https://vitess.io/docs/13.0/user-guides/configuration-basic/monitoring/](https://vitess.io/docs/13.0/user-guides/configuration-basic/monitoring/)
-- You may have system-specific dashboards built in Grafana, Prometheus, a SaaS provider like Honeycomb or Datadog, or another monitoring tool (a minimal metrics-endpoint check is sketched below)
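-
-As a minimal sketch, many services expose Prometheus-style metrics over plain HTTP that you can inspect directly; the port and endpoint below are assumptions:
-
-```
-# Dump the service's metrics endpoint and look for error counters.
-curl -s http://localhost:9100/metrics | grep -i error | head
-```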
-
-Loadbalancers and datastore statistics are particularly useful - these can often help you to determine whether problems exist upstream (towards backends or servers) or downstream (towards frontends, or clients) of your loadbalancers/datastores. Understanding this can help you narrow down the scope of your investigations.
-
-Logs are also a great source of information, although they can also be a source of plentiful red herrings. Many organisations use a tool such as Splunk or ELK to centralise logs for searching and analysis.
-
-There is a fairly long list of tools below. You don’t need to be an expert in all of these, but it is worth knowing that they exist, what they do in broad strokes, and where to get more information (man pages, Google). Try running each of these yourself to see what they show.
-
-You should be familiar with basic Linux tooling such as:
-
-- [perf](https://www.brendangregg.com/perf.html)
-- [strace](https://man7.org/linux/man-pages/man1/strace.1.html)
-- ltrace
-- top, htop
-- sar
-- netstat
-- lsof
-- kill
-- df, du, iotop
-- ps, pstree
-- the [/proc/](https://www.kernel.org/doc/html/latest/filesystems/proc.html) filesystem
-- dmesg, journalctl, and the system logfiles - generally found under /var/log/ (e.g. /var/log/syslog)
-- tools like cat, less, grep, sed, and awk are invaluable for working with lengthy logs or text output from tools
-- jq is useful for parsing and formatting JSON
-
-For debugging network or connectivity issues, you should know tools like:
-
-- dig (for DNS)
-- traceroute
-- tcpdump and wireshark
-- netcat
-
-[curl](https://curl.se/docs/manpage.html) is invaluable for reproducing and understanding problems with HTTP servers, including issues with certificates.
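-
-Typical first checks with these tools look something like the following (example.com stands in for the host you are investigating):
-
-```
-dig +short example.com                      # does the name resolve, and to what addresses?
-traceroute example.com                      # where along the network path do packets stop?
-curl -sv https://example.com/ -o /dev/null  # TCP connect, TLS handshake, certificate, response headers
-```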
-
-[Man pages](https://www.kernel.org/doc/man-pages/) can be super useful when you are looking for more information and available options for any of the common Linux tools.
-
-[eBPF](https://ebpf.io/) is a newer technology that lets you insert traces into the running Linux kernel, making much of the system’s behaviour observable. A lot of observability tools use eBPF under the hood, or you can write your own eBPF programs.
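-
-For instance, this classic bpftrace one-liner traces which files processes are opening system-wide (it requires bpftrace and root privileges, and is shown purely as a sketch):
-
-```
-# Print the process name and filename for every openat() syscall, system-wide.
-sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
-```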
-
-# Related Reading {#related-reading}
-
-[https://github.com/iovisor/bcc/blob/master/docs/tutorial.md](https://github.com/iovisor/bcc/blob/master/docs/tutorial.md)
-
-[https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55](https://netflixtechblog.com/linux-performance-analysis-in-60-000-milliseconds-accc10403c55)
-
-[https://linuxtect.com/linux-strace-command-tutorial/](https://linuxtect.com/linux-strace-command-tutorial/)
-
-[https://www.brendangregg.com/perf.html](https://www.brendangregg.com/perf.html)
-
-[https://danielmiessler.com/study/tcpdump/](https://danielmiessler.com/study/tcpdump/)
-
-[https://www.lifewire.com/wireshark-tutorial-4143298](https://www.lifewire.com/wireshark-tutorial-4143298)