A guide to Managing the First Fallacy of Distributed Computing

Unreliable Networks
The Impact of Network Failures
Coding for Fault Tolerance and Resilience
- Patterns for Transitory Outages
  - Retries
  - Last Known Good Version
- Patterns for sustained outages
  - Bulkheads
  - Circuit Breakers
Conclusion

Distributed computing is a complex field with numerous challenges, and understanding the fallacies associated with it is crucial for building robust and reliable distributed systems. Here are eight fallacies of distributed computing and their significance:

The Network Is Reliable: Assuming that network connections are always available and reliable can lead to system failures when network outages occur, even when the network outages are transitory. It’s essential to design systems that can gracefully handle network failures through redundancy and fault tolerance mechanisms.
Latency Is Zero: Overestimating the speed of communication between distributed components can result in slow and unresponsive systems. Acknowledging network latency and optimizing for it is vital for delivering efficient user experiences.
Bandwidth Is Infinite: Believing that network bandwidth is unlimited can lead to overloading the network and causing congestion. Efficient data transmission and bandwidth management are crucial to avoid performance bottlenecks.
The Network Is Secure: Assuming that the network is inherently secure can result in vulnerabilities and data breaches. Implementing strong security measures, including encryption and authentication, is necessary to protect sensitive information in distributed systems.
Topology Doesn’t Change: Networks evolve, and assuming a static topology can lead to configuration errors and system instability. Systems should be designed to adapt to changing network conditions and configurations.
There Is One Administrator: Believing that a single administrator controls the entire distributed system can lead to coordination issues and conflicts. In reality, distributed systems often involve multiple administrators, and clear governance and coordination mechanisms are needed.
Transport Cost Is Zero: Neglecting the cost associated with data transfer can lead to inefficient resource utilization and increased operational expenses. Optimizing data transfer and considering the associated costs are essential for cost-effective distributed computing.
The Network Is Homogeneous: Assuming that all network components and nodes have the same characteristics can result in compatibility issues and performance disparities. Systems should be designed to handle heterogeneity and accommodate various types of devices and platforms.

Understanding these fallacies is critical because they underscore the challenges and complexities of distributed computing. Failure to account for these fallacies can lead to system failures, security breaches, and increased operational costs. Building reliable, efficient, and secure distributed systems requires a deep understanding of these fallacies and the implementation of appropriate software design and architecture, and IT operational strategies to address them.

Unreliable Networks

In this blog post, we will look at the first fallacy, its impact on microservices architecture and how to circumvent this limitation. Let’s say we’re using spring-boot to write our microservice and it uses MongoDB as the backend which is deployed as a StatefulSet in Kubernetes. and were running this on EKS. You might also question that it is your cloud provider’s job to give us a reliable network and that we’re paying them for high availability. While the expectation might not be wrong, unfortunately, it doesn’t always work as expected when you rent hardware over the cloud. Let’s say your cloud provider promises 99.99% availability, which is impressive right? No, it ain’t! and I’ll explain how. 99.99% availability could lead to

Every one request in 10,000 requests failing
Every 10 requests in 1,00,000 requests failing

Now you might say that my system doesn’t get that kind of traffic! Fair enough, but this is availability data of the cloud provider, not your instances of service, which means, that if that cloud is getting a billion network requests within its network 1,00,000 will fail! And to make things more complex you can’t expect them to distribute these failures across all accounts using their hardware, you might get hit by any number of those failures depending on your luck. The question here is, do you want to run a business just on the chance of these outages not hitting you? I hope not! So here’s the fundamental description of the first (and most critical) fallacy of distributed computing.

The Impact of Network Failures

Let’s take an example of an e-commerce system, we’d usually see a product catalogue from the Product Microservice; however, the SKU availability might be fetched from another microservice when building the Product Catalogue response. One could argue though that I can replicate the SKU information into the product catalogue via Choreography, but for the scope of this example, let’s assume that’s not in place. The product service therefore is making a REST API Call to the SKU service. What happens when this call fails? How would you convey to the end user whether the product they are looking at is available or not?

Scary stuff yeah? Well, not so scary, since we love to brave ourselves on harder frontiers as engineers, we have a few tricks up our sleeves.

Coding for Fault Tolerance and Resilience

It’s a topic worthy of a book perhaps in itself instead of a blog post. But I’ll try to cover all I can while keeping it simple. Most of what I’m sharing here are experiences gathered over transitioning from Monolith to Microservices for the SaaS business in NimbleWork. And I hope others find it helpful too.

Patterns for Transitory Outages

The following patterns help circumvent transitory outages or blips as we usually call them. The fundamental underlying assumption is that such outages have a lifetime of a second to two at worst.

Retries

One of the simplest things to do is to wrap your network calls in a retry logic so that there are multiple tries before the calling service finally gives up. The idea here is that a temporary network snag from the cloud provider wouldn’t last longer than the retries made for fetching data. Microservice libraries and frameworks in almost all common programming languages provide this feature. The retires themselves have to be nuanced or discrete; retrying when you get 400 will not change the output for example until the request signature is changed. Here’s an example of using retries when making REST API calls with Spring WebFlux WebClient.

Here’s a summary of what we’re trying to achieve with this piece of code:

Retry a maximum of three times in two seconds
Space the time between retries randomly based on the jitter
Retry only if the upstream service gave HTTP 504, 503 or 502 statuses
Log the error and pass it downstream when the maximum attempts are exhausted
Wrap an empty response instead for client errors or pass the error from the previous step downstream

These retries can help recover from blips or snags which aren’t expected to last long. This can also be a good mechanism if the upstream service we’re calling restarts for whatever reason.

Note: We’ve noticed running replica sets in Kubernetes with Rolling Updates strategy helps reduce such blips and hence retries.

While this is an example using the Reactor Project’s implementation in Spring; all major frameworks and languages provide alternatives

Spring Retry if you’re Spring Framework but not on reactive programming
Supervisor Strategy when you’re on Akka with Scala or Java
scala.util.{Failure, Try} if you’re using Scala without any framework as such
Retry Decorator in python
fetch-retry in JavaScript

and I’m sure this is not an exhaustive list. This pattern takes care of transitory network blips. But what if there’s a sustained outage? More than later in this article.

Last Known Good Version

What if the called service continuously crashes and all retries from various clients exhaust? I prefer falling back to a last-known-good version. There are a couple of strategies that can enable this last-known-good-version policy on the infrastructure and client-side. And we’ll briefly touch upon each one of them.

Deployments The simplest option from an infrastructure perspective is to redeploy to the last known stable version of the service. This is under the assumption that the downstream apps are still compatible with calling this older version. It’s easier to do this in Kubernetes, where it saves previous revisions of deployments.
Cached at downstream Another way is for the clients to save a last successful response they can fall back on in case of failures from the service, showing a stale data-related prompt to the end user on Browser or Mobile UI is a good option.

Caching at downstream The browser, or any client for that matter, continuously writes data to an in-memory store until it receives a heartbeat from the upstream. This mechanism offers various implementations for both UI and headless clients that make service calls through gRPC or REST. Here is a summary of what to do, regardless of the type of client.

Clients are registered on their first API Call for the service to keep track
Subsequent updates to the clients are managed as a push from the service to the client
Clients retain the state locally, Redux on browser; or Redis, Memcached for Headless clients (psst .. LinkedHashMaps too if your soul allows that )
If you’re not at a scale to afford push, you can use tools like RTK for ReactJS and NgRx store for Angular and keep pulling state updates, be sure to inform the end-user that they might be seeing stale data when you get any of the 5XX status errors

Patterns for sustained outages

We’d be lucky if any distributed architecture were a system of only blips, which they are not. Hence, we have to build our systems to handle long-lived outages. Here are some of the patterns that help in this regard.

Bulkheads

Bulkheads address the contingency of outages caused by slow upstream services. While the ideal solution is to fix the upstream issue, it’s not always feasible. Consider a scenario where the service (X) you’re calling relies on another service (Y) that exhibits sluggish response times. If service (X) experiences a high volume of incoming traffic, a significant portion of its threads may be left waiting for the slower upstream service (Y) to respond. This waiting not only slows down the service (X) but also increases the rate of dropped requests, leading to more client retries and exacerbating the bottleneck.

To mitigate this issue, one effective approach is to localize the impact of failures. For instance, you can create a dedicated thread pool with a limited number of threads for calling the slower service. By doing so, you confine the effects of slowness and timeouts to a specific API call, thereby enhancing the overall service throughput.

Circuit Breakers

Circuit Breakers can easily be avoided, we have to write services that will never go down! However, the reality is that our applications often rely on external services developed by others. In these situations, Circuit Breaker, as a pattern, becomes invaluable. It routes all traffic between services through a proxy, which promptly starts rejecting requests once a defined threshold of failures is reached. This pattern proves particularly useful during prolonged network outages in external services, which could otherwise lead to outages in the calling services. Nevertheless, ensuring a seamless user experience in such scenarios is vital, and we’ve found two approaches to be effective:

Notify users of the outage in the affected area while enabling them to use other parts of the system.
Allow the client to cache user transactions, providing a “202 Accepted” response instead of “200” or “201” as usual, and resume these transactions once the upstream service becomes available again.

Conclusion

The realization that, despite a cloud provider’s commitment to high availability, network failures remain an inevitability due to the vast scale and unpredictable nature of these networks, underscores the critical need for resilient systems. This journey immerses us in the realm of distributed computing, challenging us as engineers to arm ourselves with strategies for fault tolerance and resilience. Employing techniques like retries, last-known-good version policies, and the development of separate client-server architectures with state management on both ends equips us to confront the unpredictability of network outages.

As we navigate the intricacies of distributed systems, the adoption of these strategies becomes imperative to ensure smooth user experiences and system stability. Welcome to the world of Microservices in the Cloud, where challenges inspire innovation, and resilience forms the bedrock of our response to unreliable networks.

← Previous Post