Resilience
circuit breaker, Resilience4j, retry, rate limiting, bulkhead
Resilience is about containing partial failure so one weak dependency does not drag an entire distributed system into collapse.
1. Definíció / Definition
Mi ez? / What is it?
Resilience is the discipline of protecting distributed systems from partial failure, excessive latency, and dependency overload. In the Spring ecosystem, that usually means Resilience4j combined with Spring Cloud Circuit Breaker abstractions around remote calls.
Miért létezik? / Why does it exist?
Because networks fail, downstream services slow down, and external dependencies become intermittently unhealthy. If callers keep waiting forever or retrying blindly, they exhaust their own resources and spread failure through the system. Resilience patterns exist to fail faster, isolate damage, and preserve useful capacity.
Hol helyezkedik el? / Where does it fit?
It belongs around integration boundaries: service-to-service HTTP calls, asynchronous request workflows, and external API clients. It is not a substitute for monitoring, capacity planning, or good API design; it is a protection layer around unavoidable remote dependency risk.
2. Alapfogalmak / Core Concepts
2.1 Resilience4j as the Hystrix successor
Hystrix is now a legacy choice. Resilience4j is lightweight, modular, and better aligned with current Java and Spring ecosystems. Its main modules are:
- CircuitBreaker
- Retry
- RateLimiter
- Bulkhead
- TimeLimiter
Spring Cloud Circuit Breaker provides an abstraction on top of concrete implementations, but in most Spring-based systems Resilience4j is the practical default.
2.2 Circuit breaker states
A circuit breaker observes call outcomes and changes state according to configured thresholds.
Typical state transitions:
- CLOSED → OPEN when failures or slow calls cross the threshold.
- OPEN → HALF_OPEN after the configured wait duration expires.
- HALF_OPEN → CLOSED when the trial calls succeed.
- HALF_OPEN → OPEN again when the trial calls fail.
| State | Meaning | Behavior |
|---|---|---|
| CLOSED | Normal traffic flow | Calls pass and metrics are collected |
| OPEN | Dependency considered unhealthy | Calls are rejected immediately |
| HALF_OPEN | Trial phase | Limited calls test whether recovery occurred |
Open state reduces pressure on a failing dependency and prevents the caller from wasting resources on requests that are likely to fail.
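The state cycle above can be sketched in plain Java. This is a deliberately simplified toy model (consecutive-failure counting, a manually triggered wait-duration expiry), not Resilience4j's sliding-window implementation; the class and method names are illustrative only:

```java
// Toy model of the CLOSED -> OPEN -> HALF_OPEN cycle described above.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private final int failureThreshold;

    ToyCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    State state() { return state; }

    /** Caller asks permission before each remote call; OPEN rejects immediately. */
    boolean allowCall() {
        return state != State.OPEN;
    }

    void onSuccess() {
        // A successful trial call in HALF_OPEN closes the breaker again.
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    void onFailure() {
        consecutiveFailures++;
        // A failed trial call reopens immediately; otherwise the threshold decides.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
        }
    }

    /** Stands in for waitDurationInOpenState expiring. */
    void waitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
        }
    }
}
```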
2.3 Sliding window configuration
Circuit breaker decisions depend on a sliding window of recent observations.
| Type | What it measures | Typical use |
|---|---|---|
| Count-based | Last N calls | Stable request volume |
| Time-based | Last X seconds | Burstier or fluctuating traffic |
Common parameters include:
- failure rate threshold;
- slow call rate threshold;
- minimum number of calls before evaluation;
- wait duration in open state;
- permitted trial calls in half-open state.
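These parameters can also be set programmatically with Resilience4j's `CircuitBreakerConfig` builder. The sketch below assumes `resilience4j-circuitbreaker` is on the classpath; the class name is illustrative, and the values mirror the YAML configuration shown in section 4.4:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

import java.time.Duration;

class PaymentProviderBreakerConfig {
    static CircuitBreakerConfig build() {
        return CircuitBreakerConfig.custom()
                .slidingWindowType(SlidingWindowType.COUNT_BASED) // or TIME_BASED
                .slidingWindowSize(20)                            // last 20 calls
                .minimumNumberOfCalls(10)          // no evaluation before 10 calls
                .failureRateThreshold(50)          // open at >= 50% failures
                .slowCallRateThreshold(60)         // open at >= 60% slow calls
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .permittedNumberOfCallsInHalfOpenState(5)
                .build();
    }
}
```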
2.4 Retry
Retry is useful only when failure is likely transient and the operation can safely be attempted again. It is not a universal best practice. When applied to non-idempotent operations or already overloaded dependencies, retry becomes a force multiplier for failure.
2.5 RateLimiter and TimeLimiter
A RateLimiter controls how many requests may pass during a period. This can protect downstream capacity or ensure you stay within an external partner's contract.
A TimeLimiter sets an upper bound on how long an asynchronous result may take. Without explicit timeout control, callers retain threads, memory, and request context far longer than intended.
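What a TimeLimiter enforces can be demonstrated with plain JDK primitives: `CompletableFuture.orTimeout` (Java 9+) caps how long the caller waits for an asynchronous result. This shows the idea only; Resilience4j's TimeLimiter adds named configuration, metrics, and cancellation policy on top. The class name and the `"TIMED_OUT"` marker are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class TimeoutSketch {
    static String fetchWithTimeout(CompletableFuture<String> remoteCall, long millis) {
        try {
            // orTimeout completes the future exceptionally once the deadline passes,
            // so the caller's thread is released instead of waiting indefinitely.
            return remoteCall.orTimeout(millis, TimeUnit.MILLISECONDS).join();
        } catch (Exception e) {
            // join throws a CompletionException wrapping the TimeoutException
            return "TIMED_OUT";
        }
    }
}
```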
2.6 Bulkhead: semaphore vs thread-pool isolation
Bulkheads prevent one dependency from consuming all resources.
| Pattern | Isolation style | Best fit |
|---|---|---|
| Semaphore bulkhead | Limits concurrent executions | Low overhead concurrency protection |
| Thread-pool bulkhead | Uses a dedicated executor pool | Stronger isolation for blocking work |
Semaphore bulkheads are simpler and cheaper. Thread-pool bulkheads provide harder execution isolation but introduce more operational tuning and overhead.
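The semaphore variant can be sketched with `java.util.concurrent.Semaphore`: at most `maxConcurrent` executions run at once, and excess callers are rejected immediately instead of queueing. Resilience4j's semaphore Bulkhead follows the same principle (with an optional wait duration before rejection); the class here is an illustrative stand-in:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> work, Supplier<T> rejectedFallback) {
        // Fail fast when the bulkhead is saturated instead of piling up callers.
        if (!permits.tryAcquire()) {
            return rejectedFallback.get();
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```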
3. Gyakorlati használat / Practical Usage
A payment integration is the classic example. A checkout service calls an external payment provider. If the provider slows down, the checkout path should not hold threads indefinitely. A circuit breaker stops repeated calls into known failure, a timeout caps waiting time, retry is used only for specific transient errors, and a bulkhead prevents payment traffic from consuming the entire request-processing pool.
Inventory and pricing services are another good example. They are called frequently and sit on the hot path of user interactions. If they degrade, the impact can ripple across many upstream services. Bulkheads and rate limits can keep that degradation bounded. If the domain supports it, a carefully designed fallback such as a short-lived cached read model may preserve some user-facing behavior.
However, fallback must respect business truth. Returning an empty recommendation list is usually acceptable. Returning a guessed account balance or pretending inventory is available when validation failed is usually not. Resilience patterns must align with domain correctness, not just technical availability.
4. Kód példák / Code Examples
4.1 Annotation-based circuit breaker with fallback
```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;

@Service
class PaymentProviderClient {

    @CircuitBreaker(name = "paymentProvider", fallbackMethod = "fallbackAuthorize")
    @Retry(name = "paymentProvider")
    public PaymentResponse authorize(PaymentRequest request) {
        return invokeRemoteProvider(request);
    }

    // Stand-in for the real HTTP call to the payment provider.
    private PaymentResponse invokeRemoteProvider(PaymentRequest request) {
        throw new IllegalStateException("Provider unavailable");
    }

    // Fallback signature: same parameters as the protected method, plus the triggering Throwable.
    public PaymentResponse fallbackAuthorize(PaymentRequest request, Throwable throwable) {
        return new PaymentResponse("PENDING_MANUAL_REVIEW", "fallback due to: " + throwable.getMessage());
    }
}

record PaymentRequest(String orderId, BigDecimal amount) {}
record PaymentResponse(String status, String message) {}
```
4.2 TimeLimiter with asynchronous execution
```java
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;
import java.util.concurrent.CompletableFuture;

@Service
class PricingClient {

    // TimeLimiter requires an asynchronous return type such as CompletableFuture.
    @TimeLimiter(name = "pricingService")
    public CompletableFuture<PricingResponse> getPricing(String sku) {
        return CompletableFuture.supplyAsync(() -> fetchPricing(sku));
    }

    // Stand-in for the real pricing lookup.
    private PricingResponse fetchPricing(String sku) {
        return new PricingResponse(sku, new BigDecimal("19.99"));
    }
}

record PricingResponse(String sku, BigDecimal price) {}
```
4.3 Bulkhead-protected recommendation call
```java
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
class RecommendationClient {

    @Bulkhead(name = "recommendationService", type = Bulkhead.Type.SEMAPHORE, fallbackMethod = "fallback")
    public List<String> getRecommendations(String customerId) {
        return List.of("sku-1", "sku-2");
    }

    // Degraded behavior: an empty list is an honest, business-safe fallback here.
    public List<String> fallback(String customerId, Throwable throwable) {
        return List.of();
    }
}
```
4.4 Resilience4j YAML configuration
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentProvider:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        minimumNumberOfCalls: 10
        failureRateThreshold: 50
        slowCallRateThreshold: 60
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5
  retry:
    instances:
      paymentProvider:
        maxAttempts: 3
        waitDuration: 200ms
  ratelimiter:
    instances:
      pricingService:
        limitForPeriod: 50
        limitRefreshPeriod: 1s
        timeoutDuration: 0
```
5. Trade-offok / Trade-offs
Benefits
- prevents unhealthy dependencies from consuming all caller resources;
- improves failure behavior by failing fast and predictably;
- reduces the chance of cascade failure;
- creates explicit, tunable operational policy around remote calls.
Costs
- poor tuning can reject traffic unnecessarily;
- fallback behavior can introduce business inconsistency if designed badly;
- combining retries, timeouts, and breakers creates complex interaction effects;
- resilience libraries do not fix bad architecture or poor observability.
Use it when remote dependencies matter to service health, especially with external APIs or heavily used internal services.
Avoid simplistic use when the work is local, the operation is not safely retryable, or the team has not thought through the business meaning of degraded behavior.
6. Gyakori hibák / Common Mistakes
6.1 Retrying non-idempotent operations
Blind retries on operations that charge money, create orders, or trigger side effects can duplicate business actions. Without idempotency keys or explicit guarantees, retry can be dangerous.
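One common mitigation is an idempotency key: the first execution for a key records its result, and any retry with the same key returns that recorded result instead of repeating the side effect. The sketch below is illustrative; the in-memory map stands in for what would be persistent, shared state in a real payment or ordering system:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

class IdempotentExecutor {
    private final ConcurrentMap<String, Object> completed = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T execute(String idempotencyKey, Supplier<T> sideEffectingCall) {
        // computeIfAbsent runs the call at most once per key, even across retries.
        return (T) completed.computeIfAbsent(idempotencyKey, k -> sideEffectingCall.get());
    }
}
```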
6.2 Adding a fallback for everything
Not every failure should degrade into a default value. Sometimes the correct behavior is a fast and explicit error that preserves business integrity.
6.3 Poor sliding-window sizing
A tiny window makes the breaker twitchy; an oversized window makes it slow to react. Tune against real traffic patterns, not guesswork.
6.4 Misaligned timeout layers
If HTTP client timeouts, TimeLimiter, gateway timeouts, and upstream expectations are inconsistent, diagnosis becomes confusing and failure modes become harder to predict.
6.5 Using retry without isolation
If a dependency is already slow, retries increase load on the dependency while also consuming more caller resources. Without bulkheads or hard concurrency control, retry alone often worsens incidents.
7. Senior szintű meglátások / Senior-level Insights
Resilience policy should be designed by business criticality, not by copy-pasted annotations. Recommendations, payments, ledger writes, inventory checks, and audit streams all deserve different protection strategies. “One config for every dependency” is a common anti-pattern.
Fallback quality matters more than fallback existence. A good fallback is domain-honest. An empty recommendation set may be acceptable; a fabricated financial result is not. Senior engineering means evaluating degraded behavior through business semantics, not just uptime metrics.
Metrics are essential. Circuit breaker open rates, retry counts, rate limiter rejections, bulkhead saturation, and timeout ratios are the difference between tuning and guessing. Resilience is not a one-time code pattern; it is an operational feedback loop.
Finally, resilience patterns cannot rescue a fundamentally poor architecture. If the service graph is too chatty, if synchronous dependencies are overused, or if no clear SLA/SLO model exists, circuit breakers will merely expose those structural issues in a more controlled way.
8. Szószedet / Glossary
- Resilience4j: Java library implementing resilience patterns.
- Circuit Breaker: mechanism that stops calls when failure rates become unacceptable.
- Retry: controlled re-attempt of an operation after failure.
- RateLimiter: component that limits request throughput.
- Bulkhead: resource isolation pattern that prevents one dependency from consuming everything.
- TimeLimiter: timeout controller for asynchronous operations.
- HALF_OPEN: circuit breaker trial state used to test recovery.
- Sliding window: the observation horizon used to evaluate recent call outcomes.
9. Gyorsreferencia / Cheatsheet
| Pattern | Purpose | Key tuning point | Common mistake |
|---|---|---|---|
| CircuitBreaker | Stop repeated calls into failure | failure threshold, open wait | Overreacting to small samples |
| Retry | Recover from transient faults | attempts, backoff | Retrying non-idempotent work |
| RateLimiter | Control throughput | permits per period | Wrong capacity assumptions |
| Bulkhead | Isolate resource usage | concurrency or pool size | No isolation around slow dependencies |
| TimeLimiter | Cap wait time | timeout duration | Misaligned timeout stack |
| Fallback | Provide degraded behavior | business-valid alternative | Returning misleading data |