Common microservice failures and how DoorDash mitigates them (2024)

DoorDash is an on-demand food delivery service that connects customers with local restaurants through its app and website. It’s currently one of the largest food marketplaces in the US, with almost 37 million users. The platform enables users to browse menus, place orders, and have meals delivered directly to their doorstep.

In 2020, with the constant increase in their user base, the team decided to move from a Django monolith to a microservice architecture. This allowed for better scalability options, shorter waits for test completion, faster deployment times, and increased developer velocity. They wrote a great blog post on how they managed the transition. But this change also brought a lot of complexity with it.


The new architecture introduced other types of issues, which we’re going to talk about in this article. We’ll look at some of the common pitfalls and anti-patterns that appear in a microservice architecture, how DoorDash solved them at a local level, and how they’re attempting to mitigate them at a global level.

Common pitfalls with microservice architectures

1. Cascading Failure - A cascading failure happens when the failure of one service leads to the failure of other dependent services. This can cause a chain reaction, potentially bringing down the entire system.

DoorDash had an outage of this kind that they talked about in this blog post. In their case, the chain of failure started from a seemingly innocuous database maintenance, which increased the database latency. The latency then bubbled up to the upstream services, causing errors from timeouts and resource exhaustion. The increased error rates triggered a misconfigured circuit breaker, which stopped traffic between a lot of unrelated services, resulting in an outage with a wide blast radius.

Why it happens:

  • Tight Coupling: Services are too dependent on each other, leading to a domino effect.

  • Lack of Isolation: Failures in one service are not contained and propagate to others.

  • Resource Exhaustion: Failure in a critical service can lead to resource exhaustion (e.g., CPU, memory) in dependent services.
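To make the resource-exhaustion path concrete, here is a small back-of-the-envelope sketch using Little's law (in-flight requests ≈ arrival rate × latency). The numbers are hypothetical, not DoorDash's: the point is that a latency increase in a dependency quickly exhausts a fixed-size worker pool upstream.

```python
# Illustrative only: hypothetical numbers showing how a downstream latency
# increase exhausts a fixed-size worker pool upstream (Little's law:
# in-flight requests ~= arrival rate x latency).
REQUESTS_PER_SECOND = 100   # steady incoming traffic to the upstream service
WORKER_POOL_SIZE = 50       # fixed number of threads/connections available

for latency_s in (0.05, 0.5, 2.0, 5.0):
    in_flight = REQUESTS_PER_SECOND * latency_s   # average concurrent requests
    status = "POOL EXHAUSTED" if in_flight > WORKER_POOL_SIZE else "ok"
    print(f"downstream latency {latency_s:>4}s -> ~{in_flight:5.0f} in flight ({status})")

# Once the pool is exhausted, new requests queue up and time out, so the
# upstream service starts failing its own callers -- and the cascade continues.
```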


2. Retry Storm - A retry storm occurs when a service failure triggers multiple retries from dependent services, which can overwhelm the failing service even more. Retries make things worse when the downstream service is unavailable or slow: each failed request is retried multiple times, amplifying the work and pushing an already degraded service further into failure.

Why it happens:

  • Uncontrolled Retries: Services automatically retry failed requests without considering the state of the failing service.

  • Lack of Backoff: Retries happen too frequently, without appropriate delay, exacerbating the problem.
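The work amplification is easy to quantify when retries are layered: if every service in a call chain makes up to R attempts and the chain is D services deep, a single user request can reach the bottom service up to R^D times. A tiny illustration with hypothetical numbers:

```python
# Illustrative only: worst-case work amplification from layered retries.
def amplification(attempts_per_layer: int, depth: int) -> int:
    """Maximum number of calls reaching the lowest service in the chain."""
    return attempts_per_layer ** depth

for attempts in (1, 3, 5):
    for depth in (1, 2, 3):
        print(f"{attempts} attempts x {depth} layers -> "
              f"up to {amplification(attempts, depth)} calls at the bottom")

# 5 attempts across 3 layers already means up to 125 calls for a single user
# request -- more than enough to bury a service that is already struggling.
```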

3. Death Spiral - A death spiral happens when the system starts to fail under load, and the attempts to handle the failure (like retries or additional resource allocation) further degrade the system’s performance, leading to a vicious cycle of deteriorating performance.

We’ve seen earlier how issues can spread vertically through different services that depend on one another. But they can also spread horizontally, inside a service cluster, from one node to another.

Why this happens:

  • Resource Contention: Excessive retries and fallback operations consume more resources, reducing availability for normal operations.

  • Unbalanced Load: Efforts to handle failures, like adding more instances, might lead to other parts of the system becoming overloaded.
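A toy simulation of the horizontal spread (hypothetical capacities, not DoorDash's): a cluster that is stable with all nodes healthy can collapse completely once a single node is lost, because the survivors inherit its traffic and get overloaded in turn.

```python
# Illustrative only: hypothetical capacities showing how overload spreads
# horizontally once load balancing redistributes a failed node's traffic.
NODE_CAPACITY = 100.0   # req/s a single node can serve before failing
TOTAL_LOAD = 950.0      # steady traffic, spread evenly across healthy nodes
healthy_nodes = 10

print(f"start: {healthy_nodes} nodes at {TOTAL_LOAD / healthy_nodes:.1f} req/s each (stable)")

# Trigger: one node is lost (bad host, failed deploy, etc.)
healthy_nodes -= 1

while healthy_nodes > 0:
    per_node_load = TOTAL_LOAD / healthy_nodes
    if per_node_load <= NODE_CAPACITY:
        print(f"{healthy_nodes} nodes at {per_node_load:.1f} req/s each -- recovered")
        break
    # Overloaded nodes fail their health checks and are removed one by one;
    # their traffic lands on the survivors, overloading them in turn.
    print(f"{healthy_nodes} nodes at {per_node_load:.1f} req/s each -- overloaded")
    healthy_nodes -= 1
else:
    print("no healthy nodes left -- total outage")
```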


4. Metastable Failure - A metastable failure occurs in an open system with an uncontrolled source of load, where a trigger pushes the system into a bad state that persists even after the trigger is removed. The system appears stable under low load, but once any of several potential triggers pushes it past a certain threshold, a feedback loop sustains the failure and keeps the system in that state until a significant (usually manual) corrective action is applied.

For example, an initial trigger such as a surge in users, might cause one of the backend services to load shed and start responding to certain requests with 429 (rate limit). The callers will then retry their calls, but these retries, combined with requests from new users, overwhelm the backend service even more, leading to further load shedding. This results in a positive feedback loop where calls are continuously retried (along with new calls), get rate limited, and then retried again, perpetuating the cycle.

The above is called a Thundering Herd problem and is one example of a Metastable failure.
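Here is a deliberately simplified simulation of that feedback loop (hypothetical numbers, not DoorDash's traffic): a short spike pushes the offered load past capacity, throughput degrades under overload, and retries keep the system overloaded long after the spike has ended.

```python
# Illustrative only: a toy model of the thundering-herd loop. A short spike
# pushes offered load past capacity; throughput degrades under overload and
# every rejected call is retried, so the overload outlives the spike.
CAPACITY = 100            # requests served per tick when healthy
DEGRADED_CAPACITY = 80    # useful throughput drops under overload (timeouts, queuing)
BASE_TRAFFIC = 90         # steady new requests per tick (below capacity)
SPIKE_TRAFFIC = 160       # new requests per tick during a short spike
SPIKE_TICKS = range(5, 8)

pending_retries = 0
for tick in range(15):
    new_requests = SPIKE_TRAFFIC if tick in SPIKE_TICKS else BASE_TRAFFIC
    offered = new_requests + pending_retries
    served = offered if offered <= CAPACITY else DEGRADED_CAPACITY
    rejected = offered - served            # shed with 429s
    pending_retries = rejected             # every rejected call retries next tick
    print(f"tick {tick:2}: offered={offered:4}  served={served:3}  rejected={rejected:4}")

# The spike ends at tick 7, but the offered load never falls back below
# capacity: retries plus new traffic keep the service in the degraded state
# until something breaks the loop (dropping retries, shedding harder, adding
# capacity).
```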

Why this happens:

  • Insufficient Capacity: The system is not designed to handle peak loads, leading to instability.

  • Hidden Bottlenecks: Bottlenecks that are not apparent under normal conditions become critical under high load.

  • Latency Sensitivity: Small increases in latency can have a disproportionate effect on the system's stability.

  • Request Retries: Retrying failed requests is widely used to mask transient issues. However, it also results in work amplification, which can lead to additional failures.


Solutions at a local level

There are several well-known techniques used to address the failures above, some of which DoorDash uses as well (a short code sketch combining a few of them follows this list):

  • Exponential Backoff - by gradually increasing the delay between retries to reduce the load

  • Retry Limits - by setting a maximum limit on the number of retries to prevent endless retry loops.

  • Circuit Breakers - by tripping after repeated failures and cutting off (or limiting) the calls a dependent service makes to an unhealthy one. Some examples of how circuit breakers are used by other companies:

    • Netflix's Hystrix library: Netflix developed Hystrix as a library to manage failures within their distributed system. When Hystrix identifies that a remote service is down or unresponsive, it activates the circuit breaker, stopping further requests from being sent to the problematic service. It can also offer alternative responses or retry the request after a designated period.

    • AWS ECS: ECS provides a deployment circuit breaker that automatically detects failing deployments and rolls them back, preventing a bad rollout from cascading into a wider application failure and reducing the time and resources spent recovering from it.

    • SoundCloud: SoundCloud also uses circuit breakers to handle failures in their distributed architecture.

  • Bulkheads - by isolating critical resources (like thread pools, database connections, or service instances) ensuring that failures or high resource usage in one microservice do not impact others sharing the same resources.

  • Timeouts & Fallbacks - by defining the maximum acceptable duration for a microservice to respond to a request and preventing indefinite waiting and and resource consumption when a service is slow or unresponsive.

  • Load Shedding - by prioritising and limiting incoming requests to prevent overload.

  • Graceful Degradation - by minimising the amount of work that needs to be done: degrading non-essential functionalities or reducing service levels during times of high load, resource scarcity, or service unavailability. This can be achieved by determining the essential functionalities that must remain operational even under degraded conditions, simplifying non-critical features to prioritise core functionalities, and implementing fallback strategies to maintain basic service levels when primary functionalities are unavailable.

  • Capacity Planning - by analysing historical data, user patterns, and trends to forecast future demand for each microservice and determining the scalability requirements based on expected growth, seasonal fluctuations, and special events.
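To show how a few of these techniques fit together in code, here is a minimal sketch, not DoorDash's implementation: bounded retries with exponential backoff and full jitter wrapped around a simple circuit breaker. `call_downstream` and its timeout parameter are hypothetical stand-ins for an actual remote call.

```python
# A minimal sketch, not DoorDash's implementation: bounded retries with
# exponential backoff and full jitter, combined with a simple circuit breaker.
# `call_downstream` and its timeout parameter are hypothetical stand-ins.
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are short-circuited."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None          # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0                  # a success resets the failure count
        return result


def retry_with_backoff(fn, max_attempts: int = 3,
                       base_delay_s: float = 0.1, max_delay_s: float = 2.0):
    """Retry `fn` a bounded number of times, backing off exponentially."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise                          # never retry into an open circuit
        except Exception:
            if attempt == max_attempts - 1:
                raise                      # retry limit reached
            # Exponential backoff with full jitter to avoid synchronised retries.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            time.sleep(random.uniform(0, delay))


# Usage sketch (hypothetical remote call):
# breaker = CircuitBreaker()
# result = retry_with_backoff(lambda: breaker.call(call_downstream, timeout=0.5))
```

The key design point is that the breaker fails fast and the retry loop refuses to retry into an open circuit, so a struggling dependency is given room to recover instead of being hammered.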

Shortcomings of the local countermeasures

These localised mechanisms share some common limitations:

  • They rely on metrics specific to the individual service to determine its health. However, many types of failures involve interactions across multiple components, requiring a comprehensive system-wide perspective to effectively address overload conditions.

  • They use general metrics and heuristics to assess the system health, which may lack precision. For instance, high latency alone may not indicate service overload; it could stem from slow downstream services.

  • Their corrective actions are constrained. Operating within the local service, these mechanisms can only take local actions which may not be optimal for restoring the system health, as the root cause of the issue might lie elsewhere.

Solutions at a global level

One limitation of load shedding, circuit breakers, and graceful degradation is their narrow perspective within the system. These tools assess factors like their own resource usage, immediate dependencies, and incoming request volume. However, they cannot take a global view of the entire system and make decisions based on it.

Aperture, an open-source system for reliability management, goes beyond local solutions by implementing centralised load control. It provides a unified system for managing load across multiple services during outages. It has three main components:

  • Observe: Aperture gathers reliability metrics from each node and consolidates them using Prometheus.

  • Analyse: A standalone Aperture controller continuously monitors these metrics and detects deviations from Service Level Objectives (SLOs).

  • Actuate: Upon detecting anomalies, the Aperture controller triggers policies tailored to observed patterns and applies actions on each node, such as load shedding or distributed rate limiting.

Aperture uses YAML-based policies that guide its actions during system disruptions. When an alert is triggered, Aperture automatically executes actions based on these configured policies. Some of the actions it offers include distributed rate limiting and concurrency control (also known as load shedding). By maintaining centralised oversight and control of the entire system, Aperture enables various strategies to mitigate outages. For instance, it can be configured with a policy that throttles traffic to an upstream service when a downstream service is overwhelmed, preventing excessive requests from reaching the problematic area and thereby improving system responsiveness and cost-efficiency.
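To make the observe/analyse/actuate split more concrete, below is a deliberately simplified conceptual sketch of what a centralised load controller does; it is not Aperture's actual API, policy language, or algorithm. The node names, SLO value, and the proportional adjustment rule are all invented for illustration.

```python
# Conceptual sketch of an observe -> analyse -> actuate loop; this is NOT
# Aperture's actual API, policy format, or algorithm. Node names, the SLO,
# and the adjustment rule are invented for illustration.
from dataclasses import dataclass


@dataclass
class NodeMetrics:
    node_id: str
    p95_latency_ms: float
    requests_per_s: float


def analyse(metrics, slo_p95_ms: float) -> float:
    """Compare aggregated metrics to the SLO; return an admission-rate multiplier."""
    worst_p95 = max(m.p95_latency_ms for m in metrics)
    if worst_p95 <= slo_p95_ms:
        return 1.05                        # healthy: slowly admit more traffic
    # Overloaded: shed load proportionally to how badly the SLO is missed.
    return max(0.5, slo_p95_ms / worst_p95)


def actuate(current_limits, multiplier: float):
    """Compute the new per-node rate limits the controller would push out."""
    return {node: round(limit * multiplier, 1) for node, limit in current_limits.items()}


# Hypothetical snapshot aggregated from the metrics store (e.g. Prometheus):
observed = [
    NodeMetrics("node-a", p95_latency_ms=180.0, requests_per_s=300.0),
    NodeMetrics("node-b", p95_latency_ms=450.0, requests_per_s=320.0),
]
limits = {"node-a": 350.0, "node-b": 350.0}

multiplier = analyse(observed, slo_p95_ms=250.0)
print(actuate(limits, multiplier))   # every node is throttled together, not in isolation
```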


DoorDash deployed Aperture within one of their core services and conducted artificial request simulations to evaluate its performance. They discovered that Aperture effectively operated as a robust and user-friendly global rate limiter and load shedding solution, providing a concurrency limiting algorithm which minimises the impact of unexpected load or latency.

The Aperture blog also describes use cases showing how their solution can be applied to production problems.

Conclusion

We had a look at the different pitfalls that arise in distributed systems, the triggers that can cause them, some of the localised solutions that prevent them from happening, and what can be done at a more global level.

In upcoming articles, we’re going to take a deeper look at how Aperture works, how to configure it, and how to define a set of policies for a given use case.

