-
What is a distributed system?
According to Coulouris et al., "A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another."
The components of this system can be thought of as software programs that run on physical hardware, such as computers. These components take many forms; e.g., they can be web servers, routers, web browsers, etc. To keep a generic view, we assume that each program runs on a separate machine. We refer to each of these machines as a node.
-
Parts of a distributed system
There are two categories of the central parts that help distributed systems function:
- The various parts that compose a distributed system: These are located remotely and are separated by a network
- The network that separates the various parts of a distributed system: It acts as a communication mechanism that lets them exchange messages.
-
-
Why we need a distributed system There are three main benefits of distributed systems, as shown in the illustration below.
-
Performance
According to Mohan et al., "Performance is the degree to which a software system or component meets its objectives for timeliness."
-
Problem with a single computer The physical constraints of its hardware impose certain limits on the performance of a single computer. Moreover, it is extremely expensive to improve a single computer’s performance after a certain point.
-
Solution
We can achieve the same performance with two or more low-spec computers as with a single, high-end computer. So, distributed systems allow us to achieve better performance at a lower cost.
Note that better performance can translate to different things depending on the context, such as lower latency per request, higher throughput, etc.
-
-
Scalability
According to Bondi et al., "Scalability is the capability of a system, network, or process to handle a growing amount of work, or its potential to be enlarged to accommodate that growth."
-
Problem with a single computer
Data storage and processing are responsible for most of the value that software systems impart in the real world. As a system’s customer base grows, the system needs to handle more traffic and store larger amounts of data. However, a system that comprises a single computer can only scale up to a certain point, as explained earlier.
-
Solution
If we build a distributed system, we can split and store the data in multiple computers, and distribute the processing work.
Vertical scaling refers to the approach of scaling a system by adding resources (memory, CPU, disk, etc.) to a single node. Meanwhile, horizontal scaling refers to the approach of scaling by adding more nodes to the system.
As a result of this, we can scale our systems to sizes that we could not imagine with a single computer system.
-
-
Availability
In the context of software systems, availability is the probability of a system to work as required, when required, during a mission.
-
Problem with a single computer
Nowadays, most online services need to operate all the time (also known as "24/7 service"), which is a huge challenge. When a service states that it has five-nine availability, it usually operates 99.999% of the time. This implies that it can be down for only 5 minutes at most per year to satisfy this guarantee.
If we consider how unreliable hardware can be, we can easily understand how big an undertaking this is. Of course, it would be infeasible to provide this kind of guarantee with a single computer.
-
Solution
Redundancy is one of the widely used mechanisms to achieve higher availability. It refers to storing data into multiple, redundant computers. So, when one computer fails, we can efficiently switch to another one. This way, we’ll prevent our customers from experiencing this failure.
Given that data are stored now in multiple computers, we end up with a distributed system!
If we leverage a distributed system, we get all of the above benefits. However, as we will see later on, there is tension between them and several other properties. So, in most cases, we have to make a trade-off. To do this, we must understand the basic constraints and limitations of distributed systems. The first part of this course will help us with this.
-
-
-
The difference in developing software for distributed systems
Distributed systems are subject to many more constraints than software systems that run on a single computer. As a result, the development of software for distributed systems is also very different. However, those who are new to distributed systems make assumptions based on their experience with software development for systems that run on a single computer. Of course, this creates a lot of problems down the road in the systems they build.
To eliminate this confusion and help people build better systems, L Peter Deutsch and others at Sun Microsystems created a collection of these false assumptions. These are the fallacies of distributed computing
-
Fallacies
-
The network is reliable
The abstractions developers learn from various technologies and protocols often enforce this common fallacy. As we will see in a later chapter, networking protocols like TCP can make us believe that the network is reliable and never fails. However, this is just an illusion with significant repercussions. Also, we build network connections on top of hardware that will also fail at some point. Hence, we should design our systems accordingly.
-
Latency is zero
Libraries that attempt to model remote procedure calls as local calls, such as gRPC or Thrift, enforce this assumption. We should always remember that there’s a large difference (from milliseconds to nanoseconds) in latency between a call to a remote system and that to local memory access. This gets even worse when we consider calls between data centers on different continents. Thus, this is another thing to keep in mind when deciding how to geo-distribute our system.
-
Bandwidth is infinite
This fallacy is weaker nowadays. This is because the bandwidth we can achieve has significantly improved in the last few decades. For instance, we can now build high-bandwidth connections in our own data centers. However, this does not mean we can use all of it if our traffic needs to cross the Internet. This is important to consider when we make decisions about our distributed system’s topology, and when requests travel through the Internet.
-
The network is secure
This fallacy shows that the wider network used by two nodes to communicate is not necessarily under their control. Thus, we should consider it insecure.
The course dedicates a portion to security, where it explains the various techniques we can use to securely utilize an insecure network. This network also comprises many different parts that different organizations may manage with different hardware. Moreover, failures in some parts of this network may require us to change its topology to keep it functional.
The following fallacies highlight this:
- Topology doesn’t change
- There is one administrator
- The network is homogeneous
-
Transport cost is zero The transportation of data between two points incurs financial costs. We should factor this in when we build a distributed system.
-
-
The global clock fallacy
There’s one fallacy that’s not a part of the above set, but still often causes confusion amongst people new to distributed systems. If we follow the same style as above, we can phrase this fallacy as: > “Distributed systems have a global clock, which we can use to identify when events happen.”
This assumption is quite deceiving since it’s somewhat intuitive and holds true even in non-distributed systems. For instance, an application that runs on a single computer can use the computer’s local clock to decide when events happen, and in what order. However, this is not true in a distributed system, where every node in the system has its own local clock that runs at a unique rate.
While there are ways to keep the clocks in sync, some are very expensive and don’t completely eliminate these differences. Physical laws also bind this limitation. An example of h is the TrueTime API built by Google, which exposes the clock uncertainty explicitly as a first-class citizen.
However, as we’ll see in the upcoming lessons that discuss cause and effects, there are other ways to reason about time using logical clocks.
-
Why distributed systems are hard to design
In general, distributed systems are hard to design, build, and reason about. This increases the risk of error.
This will become more evident later in this course when we explore some algorithms that solve fundamental problems in distributed systems.
It’s worth questioning this: why are distributed systems so hard to design? The answer to this question will help us eliminate our blind spots, and provide guidance on some aspects we should pay attention to.
- Properties that make distributed systems challenging
-
Network asynchrony
Network asynchrony is a property of communication networks that cannot provide strong guarantees around delivering events, e.g., a maximum amount of time a message requires for delivery. This can create a lot of counter-intuitive behaviors that are not present in non-distributed systems. This contrasts to memory operations that provide much stricter guarantees. For instance, messages might take extremely long to deliver in a distributed system. They may even deliver out of order—or not at all.
-
Partial failures
Partial failures are the cases where only some components of a distributed system fail. This behavior can contrast with certain kinds of applications a single server deploys. These applications work under the assumption that either everything is working fine, or there has been a server crash. It introduces significant complexity when it requires atomicity across components in a distributed system. Thus, we must ensure that we either apply an operation to all the nodes of a system, or to none of them.
-
Concurrency
Concurrency is the execution of multiple computations at the same time, and potentially on the same piece of data. These computations interleave with each other. This introduces additional complexity since these computations can interfere with each other and create unexpected behaviors. This is, again, in contrast to simplistic applications with no concurrency, where the program runs in the order the sequence of commands in the source code defined.
-
- Properties that make distributed systems challenging