Common microservice failures and how DoorDash mitigates them (2024)

DoorDash is an on-demand food delivery service that connects customers with local restaurants through its app and website. It’s currently one of the largest food marketplaces in the US, with almost 37 million users. The platform enables users to browse menus, place orders, and have meals delivered directly to their doorstep.

In 2020, with the constant increase in their user base, the team decided to move from a Django monolith to a microservice architecture. This allowed for better scalability options, shorter waits for test completion, faster deployment times, and increased developer velocity. They wrote a great blog post on how they managed the transition. But this change also brought a lot of complexity with it.


The new architecture introduced other types of issues, which we’re going to talk about in this article. We’ll look at some of the common pitfalls and anti-patterns that appear in a microservice architecture, how DoorDash solved them at a local level, and how they’re attempting to mitigate them at a global level.

Common pitfalls with microservice architectures

1. Cascading Failure - A cascading failure happens when the failure of one service leads to the failure of other dependent services. This can cause a chain reaction, potentially bringing down the entire system.

DoorDash had an outage of this kind that they talked about in this blog post. In their case, the chain of failure started from a seemingly innocuous database maintenance, which increased the database latency. The latency then bubbled up to the upstream services, causing errors from timeouts and resource exhaustion. The increased error rates triggered a misconfigured circuit breaker, which stopped traffic between a lot of unrelated services, resulting in an outage with a wide blast radius.

Why it happens:

  • Tight Coupling: Services are too dependent on each other, leading to a domino effect.

  • Lack of Isolation: Failures in one service are not contained and propagate to others.

  • Resource Exhaustion: Failure in a critical service can lead to resource exhaustion (e.g., CPU, memory) in dependent services.
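To make the resource-exhaustion path concrete, here is a small back-of-the-envelope sketch using Little's law (in-flight requests ≈ arrival rate × latency). The numbers are hypothetical, not DoorDash's: the point is that a latency increase in a dependency quickly exhausts a fixed-size worker pool upstream.

```python
# Illustrative only: hypothetical numbers showing how a downstream latency
# increase exhausts a fixed-size worker pool upstream (Little's law:
# in-flight requests ~= arrival rate x latency).
REQUESTS_PER_SECOND = 100   # steady incoming traffic to the upstream service
WORKER_POOL_SIZE = 50       # fixed number of threads/connections available

for latency_s in (0.05, 0.5, 2.0, 5.0):
    in_flight = REQUESTS_PER_SECOND * latency_s   # average concurrent requests
    status = "POOL EXHAUSTED" if in_flight > WORKER_POOL_SIZE else "ok"
    print(f"downstream latency {latency_s:>4}s -> ~{in_flight:5.0f} in flight ({status})")

# Once the pool is exhausted, new requests queue up and time out, so the
# upstream service starts failing its own callers -- and the cascade continues.
```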


2. Retry Storm - A retry storm occurs when a service failure triggers multiple retries from dependent services, which can overwhelm the failing service even more. Retries make things worse when the downstream service is unavailable or slow: each failed request is retried multiple times, amplifying the work and pushing an already degraded service further into failure.

Why it happens:

  • Uncontrolled Retries: Services automatically retry failed requests without considering the state of the failing service.

  • Lack of Backoff: Retries happen too frequently, without appropriate delay, exacerbating the problem.
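The work amplification is easy to quantify when retries are layered: if every service in a call chain makes up to R attempts and the chain is D services deep, a single user request can reach the bottom service up to R^D times. A tiny illustration with hypothetical numbers:

```python
# Illustrative only: worst-case work amplification from layered retries.
def amplification(attempts_per_layer: int, depth: int) -> int:
    """Maximum number of calls reaching the lowest service in the chain."""
    return attempts_per_layer ** depth

for attempts in (1, 3, 5):
    for depth in (1, 2, 3):
        print(f"{attempts} attempts x {depth} layers -> "
              f"up to {amplification(attempts, depth)} calls at the bottom")

# 5 attempts across 3 layers already means up to 125 calls for a single user
# request -- more than enough to bury a service that is already struggling.
```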

3. Death Spiral - A death spiral happens when the system starts to fail under load, and the attempts to handle the failure (like retries or additional resource allocation) further degrade the system’s performance, leading to a vicious cycle of deteriorating performance.

We’ve seen earlier how issues can spread vertically through different services that depend on one another. But they can also spread horizontally, inside a service cluster, from one node to another.

Why this happens:

  • Resource Contention: Excessive retries and fallback operations consume more resources, reducing availability for normal operations.

  • Unbalanced Load: Efforts to handle failures, like adding more instances, might lead to other parts of the system becoming overloaded.
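A toy simulation of the horizontal spread (hypothetical capacities, not DoorDash's): a cluster that is stable with all nodes healthy can collapse completely once a single node is lost, because the survivors inherit its traffic and get overloaded in turn.

```python
# Illustrative only: hypothetical capacities showing how overload spreads
# horizontally once load balancing redistributes a failed node's traffic.
NODE_CAPACITY = 100.0   # req/s a single node can serve before failing
TOTAL_LOAD = 950.0      # steady traffic, spread evenly across healthy nodes
healthy_nodes = 10

print(f"start: {healthy_nodes} nodes at {TOTAL_LOAD / healthy_nodes:.1f} req/s each (stable)")

# Trigger: one node is lost (bad host, failed deploy, etc.)
healthy_nodes -= 1

while healthy_nodes > 0:
    per_node_load = TOTAL_LOAD / healthy_nodes
    if per_node_load <= NODE_CAPACITY:
        print(f"{healthy_nodes} nodes at {per_node_load:.1f} req/s each -- recovered")
        break
    # Overloaded nodes fail their health checks and are removed one by one;
    # their traffic lands on the survivors, overloading them in turn.
    print(f"{healthy_nodes} nodes at {per_node_load:.1f} req/s each -- overloaded")
    healthy_nodes -= 1
else:
    print("no healthy nodes left -- total outage")
```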


4. Metastable Failure - A metastable failure occurs in an open system with an uncontrolled source of load, where a trigger pushes the system into a bad state that persists even after the trigger is removed. The system appears stable under low load, but once any of several potential triggers pushes it past a certain threshold, a feedback loop sustains the failure and keeps the system in that state until a significant (usually manual) corrective action is applied.

For example, an initial trigger such as a surge in users, might cause one of the backend services to load shed and start responding to certain requests with 429 (rate limit). The callers will then retry their calls, but these retries, combined with requests from new users, overwhelm the backend service even more, leading to further load shedding. This results in a positive feedback loop where calls are continuously retried (along with new calls), get rate limited, and then retried again, perpetuating the cycle.

The above is called a Thundering Herd problem and is one example of a Metastable failure.
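Here is a deliberately simplified simulation of that feedback loop (hypothetical numbers, not DoorDash's traffic): a short spike pushes the offered load past capacity, throughput degrades under overload, and retries keep the system overloaded long after the spike has ended.

```python
# Illustrative only: a toy model of the thundering-herd loop. A short spike
# pushes offered load past capacity; throughput degrades under overload and
# every rejected call is retried, so the overload outlives the spike.
CAPACITY = 100            # requests served per tick when healthy
DEGRADED_CAPACITY = 80    # useful throughput drops under overload (timeouts, queuing)
BASE_TRAFFIC = 90         # steady new requests per tick (below capacity)
SPIKE_TRAFFIC = 160       # new requests per tick during a short spike
SPIKE_TICKS = range(5, 8)

pending_retries = 0
for tick in range(15):
    new_requests = SPIKE_TRAFFIC if tick in SPIKE_TICKS else BASE_TRAFFIC
    offered = new_requests + pending_retries
    served = offered if offered <= CAPACITY else DEGRADED_CAPACITY
    rejected = offered - served            # shed with 429s
    pending_retries = rejected             # every rejected call retries next tick
    print(f"tick {tick:2}: offered={offered:4}  served={served:3}  rejected={rejected:4}")

# The spike ends at tick 7, but the offered load never falls back below
# capacity: retries plus new traffic keep the service in the degraded state
# until something breaks the loop (dropping retries, shedding harder, adding
# capacity).
```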

Why this happens:

  • Insufficient Capacity: The system is not designed to handle peak loads, leading to instability.

  • Hidden Bottlenecks: Bottlenecks that are not apparent under normal conditions become critical under high load.

  • Latency Sensitivity: Small increases in latency can have a disproportionate effect on the system's stability.

  • Request Retries: Retrying failed requests is widely used to mask transient issues. However, it also results in work amplification, which can lead to additional failures.


Solutions at a local level

There are several well-known techniques used to address the failures above, some of which DoorDash uses as well (a short code sketch combining a few of them follows this list):

  • Exponential Backoff - by gradually increasing the delay between retries to reduce the load

  • Retry Limits - by setting a maximum limit on the number of retries to prevent endless retry loops.

  • Circuit Breakers - by tripping after repeated failures and cutting off (or limiting) the calls a dependent service makes to an unhealthy one. Some examples of how circuit breakers are used by other companies:

    • Netflix's Hystrix library: Netflix developed Hystrix as a library to manage failures within their distributed system. When Hystrix identifies that a remote service is down or unresponsive, it activates the circuit breaker, stopping further requests from being sent to the problematic service. It can also offer alternative responses or retry the request after a designated period.

    • AWS ECS: ECS provides a deployment circuit breaker that automatically detects failing deployments and rolls them back, preventing a bad rollout from cascading into a wider application failure and reducing the time and resources spent recovering from it.

    • SoundCloud: SoundCloud also uses circuit breakers to handle failures in their distributed architecture.

  • Bulkheads - by isolating critical resources (like thread pools, database connections, or service instances) ensuring that failures or high resource usage in one microservice do not impact others sharing the same resources.

  • Timeouts & Fallbacks - by defining the maximum acceptable duration for a microservice to respond to a request and preventing indefinite waiting and and resource consumption when a service is slow or unresponsive.

  • Load Shedding - by prioritising and limiting incoming requests to prevent overload.

  • Graceful Degradation - by minimising the amount of work that needs to be done: degrading non-essential functionalities or reducing service levels during times of high load, resource scarcity, or service unavailability. This can be achieved by determining the essential functionalities that must remain operational even under degraded conditions, simplifying non-critical features to prioritise core functionalities, and implementing fallback strategies to maintain basic service levels when primary functionalities are unavailable.

  • Capacity Planning - by analysing historical data, user patterns, and trends to forecast future demand for each microservice and determining the scalability requirements based on expected growth, seasonal fluctuations, and special events.
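To show how a few of these techniques fit together in code, here is a minimal sketch, not DoorDash's implementation: bounded retries with exponential backoff and full jitter wrapped around a simple circuit breaker. `call_downstream` and its timeout parameter are hypothetical stand-ins for an actual remote call.

```python
# A minimal sketch, not DoorDash's implementation: bounded retries with
# exponential backoff and full jitter, combined with a simple circuit breaker.
# `call_downstream` and its timeout parameter are hypothetical stand-ins.
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are short-circuited."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                  # a success resets the failure count
        return result


def retry_with_backoff(fn, max_attempts: int = 3,
                       base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Retry `fn` a bounded number of times, backing off exponentially."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                          # never retry into an open circuit
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # retry limit reached
            # Exponential backoff with full jitter to avoid synchronised retries.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))


# Usage sketch (hypothetical remote call):
# breaker = CircuitBreaker()
# result = retry_with_backoff(lambda: breaker.call(call_downstream, timeout=0.5))
```

The key design point is that the breaker fails fast and the retry loop refuses to retry into an open circuit, so a struggling dependency is given room to recover instead of being hammered.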

Shortcomings of the local countermeasures

These localised mechanisms share some common limitations:

  • They rely on metrics specific to the individual service to determine its health. However, many types of failures involve interactions across multiple components, requiring a comprehensive system-wide perspective to effectively address overload conditions.

  • They use general metrics and heuristics to assess the system health, which may lack precision. For instance, high latency alone may not indicate service overload; it could stem from slow downstream services.

  • Their corrective actions are constrained. Operating within the local service, these mechanisms can only take local actions which may not be optimal for restoring the system health, as the root cause of the issue might lie elsewhere.

Solutions at a global level

One limitation of load shedding, circuit breakers, and graceful degradation is their narrow perspective within the system. These tools assess factors like their own resource usage, immediate dependencies, and incoming request volume. However, they cannot take a global view of the entire system and make decisions based on it.

Aperture, an open-source system for reliability management, goes beyond local solutions by implementing centralised load control. It provides a unified system for managing load across multiple services during outages. It has three main components:

  • Observe: Aperture gathers reliability metrics from each node and consolidates them using Prometheus.

  • Analyse: A standalone Aperture controller continuously monitors these metrics and detects deviations from Service Level Objectives (SLOs).

  • Actuate: Upon detecting anomalies, the Aperture controller triggers policies tailored to observed patterns and applies actions on each node, such as load shedding or distributed rate limiting.

Aperture uses YAML-based policies that guide its actions during system disruptions. When an alert is triggered, Aperture automatically executes actions based on these configured policies. Some of the actions it offers include distributed rate limiting and concurrency control (also known as load shedding). By maintaining centralised oversight and control of the entire system, Aperture enables various strategies to mitigate outages. For instance, it can be configured with a policy that throttles traffic to an upstream service when a downstream service is overwhelmed, preventing excessive requests from reaching the problematic area and thereby improving system responsiveness and cost-efficiency.
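To make the observe/analyse/actuate split more concrete, below is a deliberately simplified conceptual sketch of what a centralised load controller does; it is not Aperture's actual API, policy language, or algorithm. The node names, SLO value, and the proportional adjustment rule are all invented for illustration.

```python
# Conceptual sketch of an observe -> analyse -> actuate loop; this is NOT
# Aperture's actual API, policy format, or algorithm. Node names, the SLO,
# and the adjustment rule are invented for illustration.
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    node_id: str
    p95_latency_ms: float
    requests_per_s: float


def analyse(metrics, slo_p95_ms: float) -> float:
    """Compare aggregated metrics to the SLO; return an admission-rate multiplier."""
    worst_p95 = max(m.p95_latency_ms for m in metrics)
    if worst_p95 <= slo_p95_ms:
        return 1.05                        # healthy: slowly admit more traffic
    # Overloaded: shed load proportionally to how badly the SLO is missed.
    return max(0.5, slo_p95_ms / worst_p95)


def actuate(current_limits, multiplier: float):
    """Compute the new per-node rate limits the controller would push out."""
    return {node: round(limit * multiplier, 1) for node, limit in current_limits.items()}


# Hypothetical snapshot aggregated from the metrics store (e.g. Prometheus):
observed = [
    NodeMetrics("node-a", p95_latency_ms=180.0, requests_per_s=300.0),
    NodeMetrics("node-b", p95_latency_ms=450.0, requests_per_s=320.0),
]
limits = {"node-a": 350.0, "node-b": 350.0}

multiplier = analyse(observed, slo_p95_ms=250.0)
print(actuate(limits, multiplier))   # every node is throttled together, not in isolation
```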


DoorDash deployed Aperture within one of their core services and conducted artificial request simulations to evaluate its performance. They discovered that Aperture effectively operated as a robust and user-friendly global rate limiter and load shedding solution, providing a concurrency limiting algorithm which minimises the impact of unexpected load or latency.

The Aperture blog also describes use cases showing how their solution can be applied to production problems.

Conclusion

We had a look at the different pitfalls that arise in distributed systems, the triggers that can cause them, some of the localised solutions that prevent them from happening, and what can be done at a more global level.

In upcoming articles, we’re going to take a deeper look at how Aperture works, how to configure it, and how to define a set of policies for a given use case.

