
Circuit Breakers, Retries, Backoff, and Timeouts — Making Microservices Resilient

In microservice architectures, things don’t always fail cleanly. Sometimes a service goes down, but more often it just gets slow, times out, or behaves inconsistently.

If we don’t handle these failures properly, a small outage in one service can quickly turn into a cascading failure across the whole system. That’s where timeouts, retries, backoff, and circuit breakers come in — four reliability patterns that make microservices resilient instead of fragile.

I’ve used these in Spring Boot applications through libraries like Resilience4j and Spring Cloud, but for a long time I never really stepped back to understand why they exist or how they work together. This post is my attempt to document that understanding in simple terms, with real examples.


The Real Problem in Distributed Systems

Most failures are not hard failures (like connection refused). They are slow failures (like waiting 7 seconds for a 200ms API call). Slow failures are worse because they block threads, fill connection pools, and cause retries that make the situation even worse.

And the dangerous part: retrying a slow system can overload it further, turning a temporary issue into a full outage.

So instead of assuming “things will work unless they break”, resilient systems assume:


  • Everything will fail eventually.
  • Some failures are temporary.
  • Some failures are slow.
  • Retries are not always safe.

Pattern #1 — Timeouts

A timeout defines how long you’re willing to wait for a response before giving up.

Why timeouts matter:

  • Without one, calls hang forever and consume resources
  • Long timeouts lead to thread exhaustion
  • Increasing timeout is never the real fix — it just delays failure

Bad design:


timeout = 60 seconds

Better design:


timeout = 200-500ms (based on p95 latency)

Example (Spring Boot WebClient):

import java.time.Duration;
import org.springframework.http.client.reactive.ReactorClientHttpConnector;
import org.springframework.web.reactive.function.client.WebClient;
import reactor.netty.http.client.HttpClient;

// Give up on any response that takes longer than 500ms
WebClient webClient = WebClient.builder()
    .baseUrl("http://payment-service")
    .clientConnector(
        new ReactorClientHttpConnector(
            HttpClient.create()
                .responseTimeout(Duration.ofMillis(500))
        )
    )
    .build();
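
Using it then looks something like this (the /payments endpoint and the orderId variable are made-up placeholders, not part of any real service here):

// If payment-service takes longer than 500ms, this call fails with a
// timeout error instead of silently tying up the calling thread
String status = webClient.get()
    .uri("/payments/{orderId}/status", orderId)
    .retrieve()
    .bodyToMono(String.class)
    .block();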

Rule of thumb: every remote call needs a timeout. No timeout = hidden outage waiting to happen.


Pattern #2 — Retries

Retries help handle temporary failures (network blips, slow DB, AWS hiccup). But retries are dangerous if used blindly.

Good retry:

  • Fast failure
  • Small retry count
  • Combined with timeout
  • Combined with backoff

Bad retry:

  • Infinite retry loops
  • Retry without timeout
  • Retry in tight loop (hammering a dying service)
  • Retrying non-idempotent operations (double charge, duplicate orders)

Example wrong retry:

retry every 100ms, 10 times

Example correct retry:

retry 3 times with exponential backoff and jitter

Example (Resilience4j):

import java.time.Duration;
import io.github.resilience4j.retry.RetryConfig;

// At most 3 attempts, waiting 200ms between them
RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    .waitDuration(Duration.ofMillis(200))
    .build();
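
Wiring that config into a real call looks roughly like this (callPaymentService() is just a placeholder for whatever remote call you're protecting):

import io.github.resilience4j.retry.Retry;

// Build a named Retry instance from the config and wrap the call with it
Retry retry = Retry.of("payment-service", config);
String result = retry.executeSupplier(() -> callPaymentService());

Note that this config still waits a fixed 200ms between attempts; the next pattern replaces that fixed wait with exponential backoff and jitter.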

Pattern #3 — Exponential Backoff (with Jitter)

If 10,000 clients retry at the same time, the system collapses. Backoff spreads out retries so they don’t create a traffic spike.

Example:

Retry #    No Backoff    With Exponential Backoff
1          100ms         100ms
2          100ms         200ms
3          100ms         400ms
4          100ms         800ms

Jitter adds randomness to avoid retry storms:

800ms → random between 600–1200ms

This is why AWS, Google Cloud, and Kafka strongly recommend exponential backoff + jitter for all retry logic.
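
In Resilience4j this is a small change to the retry config from earlier: swap the fixed waitDuration for an IntervalFunction that grows exponentially and adds randomness. A rough sketch:

import java.time.Duration;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.RetryConfig;

// Start at 100ms, double the wait each attempt, and randomize it by +/-50%
// so thousands of clients don't retry in lockstep
IntervalFunction backoffWithJitter = IntervalFunction.ofExponentialRandomBackoff(
    Duration.ofMillis(100),  // initial interval
    2.0,                     // multiplier per attempt
    0.5                      // randomization factor (jitter)
);

RetryConfig config = RetryConfig.custom()
    .maxAttempts(3)
    .intervalFunction(backoffWithJitter)
    .build();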


Pattern #4 — Circuit Breaker

A circuit breaker stops calls to an unhealthy service instead of retrying forever.

State diagram:

Closed      → calls allowed normally
Open        → calls are rejected immediately
Half-open   → test a few calls before closing again

Purpose:

  • Protects the failing service from overload
  • Protects callers from hanging on slow calls
  • Prevents cascading failure across microservices

Example (Resilience4j):

import java.time.Duration;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

// Open when 50% of the last 20 calls fail; stay open for 10s before testing again
CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)
    .waitDurationInOpenState(Duration.ofSeconds(10))
    .slidingWindowSize(20)
    .build();
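
Attaching it to a call, and watching the state transitions from the diagram above, looks roughly like this (callPaymentService() is again a placeholder):

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;

CircuitBreaker circuitBreaker = CircuitBreaker.of("payment-service", config);

// Log CLOSED -> OPEN -> HALF_OPEN transitions as they happen
circuitBreaker.getEventPublisher()
    .onStateTransition(event -> System.out.println(event.getStateTransition()));

try {
    String result = circuitBreaker.executeSupplier(() -> callPaymentService());
} catch (CallNotPermittedException e) {
    // Circuit is open: fail immediately instead of piling onto a dying service
}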

Without circuit breakers, a failing payment service can take down order service, API gateway, and user frontend — even if only one service is broken.


How They Work Together

Pattern            Solves                Without It
Timeout            Slow failures         Threads hang forever
Retry              Temporary failures    One glitch becomes user-visible
Backoff            Retry storms          System collapses from traffic spike
Circuit Breaker    Cascading failure     One outage spreads to many services

Together, they form a resilience chain:

timeout → retry (limited) → backoff → circuit breaker → fallback

Mini Example: Order Service Calling Payment Service

Problem scenario:

  • Payment API becomes slow (10s instead of 200ms)
  • Order service keeps waiting → threads exhausted
  • Clients retry → more load → both services die

Fixed with resilience patterns:

timeout = 300ms
retry = 2 attempts
backoff = exponential with jitter
circuit breaker = open after 50% failures
fallback = mark order as "payment pending"

Now instead of crashing, the system degrades gracefully.
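
Put together with Resilience4j, the order-service side could look roughly like the sketch below. Names like paymentClient and orderId are made up for illustration, the retry and circuitBreaker instances are the ones configured in the earlier sections, and the 300ms timeout lives in the HTTP client config (as in the WebClient example above):

import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;

// The circuit breaker wraps the raw call, and the retry wraps the circuit
// breaker, so every retry attempt is checked against the breaker and fails
// fast once it is open
Supplier<String> paymentCall = () -> paymentClient.charge(orderId);
Supplier<String> protectedCall =
    Retry.decorateSupplier(retry, CircuitBreaker.decorateSupplier(circuitBreaker, paymentCall));

String status;
try {
    status = protectedCall.get();
} catch (Exception e) {
    // Fallback: don't fail the order, just mark the payment as pending
    status = "payment pending";
}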


Common Questions

✅ “What’s the difference between retry and circuit breaker?” → Retry assumes failure is temporary. Circuit breaker assumes failure is persistent.

✅ “Why is retry without backoff dangerous?” → Can create a thundering herd effect.

✅ “Should retries be at client or server side?” → Client, unless the operation is idempotent and safe to retry internally.

✅ “Why must timeouts be shorter than SLA?” → If users expect 300ms responses, a 5s timeout is already a failure.

✅ “Does Kafka provide exactly-once delivery?” → Only when the processing logic is idempotent or transactional.


My Takeaway

I used to treat timeouts, retries, and circuit breakers as configuration flags — something you enable because it seems like a good idea.

Now I understand they are a coordinated defense system for distributed applications. Each one solves a different failure mode, but you only get real resilience when you use them together.

The question I ask now is:

“If this dependency slows down or fails, will my service survive or collapse?”

If the answer is “collapse”, I know I need to rethink the design.

