Resilience
circuit breaker, Resilience4j, retry, rate limiting, bulkhead
Resilience is about containing partial failure so one weak dependency does not drag an entire distributed system into collapse.
1. Definíció / Definition
Mi ez? / What is it?
Resilience is the discipline of protecting distributed systems from partial failure, excessive latency, and dependency overload. In the Spring ecosystem, that usually means Resilience4j combined with Spring Cloud Circuit Breaker abstractions around remote calls.
Miért létezik? / Why does it exist?
Because networks fail, downstream services slow down, and external dependencies become intermittently unhealthy. If callers keep waiting forever or retrying blindly, they exhaust their own resources and spread failure through the system. Resilience patterns exist to fail faster, isolate damage, and preserve useful capacity.
Hol helyezkedik el? / Where does it fit?
It belongs around integration boundaries: service-to-service HTTP calls, asynchronous request workflows, and external API clients. It is not a substitute for monitoring, capacity planning, or good API design; it is a protection layer around unavoidable remote dependency risk.
2. Alapfogalmak / Core Concepts
2.1 Resilience4j as the Hystrix successor
Hystrix is now a legacy choice. Resilience4j is lightweight, modular, and better aligned with current Java and Spring ecosystems. Its main modules are:
- CircuitBreaker
- Retry
- RateLimiter
- Bulkhead
- TimeLimiter
Spring Cloud Circuit Breaker provides an abstraction on top of concrete implementations, but in most Spring-based systems Resilience4j is the practical default.
2.2 Circuit breaker states
A circuit breaker observes call outcomes and changes state according to configured thresholds.
Typical state transitions:
- CLOSED → OPEN when failures or slow calls cross the threshold.
- OPEN → HALF_OPEN after the configured wait duration expires.
- HALF_OPEN → CLOSED when the trial calls succeed.
- HALF_OPEN → OPEN again when the trial calls fail.
| State | Meaning | Behavior |
|---|---|---|
| CLOSED | Normal traffic flow | Calls pass and metrics are collected |
| OPEN | Dependency considered unhealthy | Calls are rejected immediately |
| HALF_OPEN | Trial phase | Limited calls test whether recovery occurred |
Open state reduces pressure on a failing dependency and prevents the caller from wasting resources on requests that are likely to fail.
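The state cycle above can be sketched in plain Java. This is a deliberately simplified toy model (consecutive-failure counting, a manually triggered wait-duration expiry), not Resilience4j's sliding-window implementation; the class and method names are illustrative only:

```java
// Toy model of the CLOSED -> OPEN -> HALF_OPEN cycle described above.
class ToyCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private final int failureThreshold;

    ToyCircuitBreaker(int failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    State state() { return state; }

    /** Caller asks permission before each remote call; OPEN rejects immediately. */
    boolean allowCall() {
        return state != State.OPEN;
    }

    void onSuccess() {
        // A successful trial call in HALF_OPEN closes the breaker again.
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    void onFailure() {
        consecutiveFailures++;
        // A failed trial call reopens immediately; otherwise the threshold decides.
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
        }
    }

    /** Stands in for waitDurationInOpenState expiring. */
    void waitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
        }
    }
}
```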
2.3 Sliding window configuration
Circuit breaker decisions depend on a sliding window of recent observations.
| Type | What it measures | Typical use |
|---|---|---|
| Count-based | Last N calls | Stable request volume |
| Time-based | Last X seconds | Burstier or fluctuating traffic |
Common parameters include:
- failure rate threshold;
- slow call rate threshold;
- minimum number of calls before evaluation;
- wait duration in open state;
- permitted trial calls in half-open state.
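These parameters can also be set programmatically with Resilience4j's `CircuitBreakerConfig` builder. The sketch below assumes `resilience4j-circuitbreaker` is on the classpath; the class name is illustrative, and the values mirror the YAML configuration shown in section 4.4:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

import java.time.Duration;

class PaymentProviderBreakerConfig {
    static CircuitBreakerConfig build() {
        return CircuitBreakerConfig.custom()
                .slidingWindowType(SlidingWindowType.COUNT_BASED) // or TIME_BASED
                .slidingWindowSize(20)                            // last 20 calls
                .minimumNumberOfCalls(10)          // no evaluation before 10 calls
                .failureRateThreshold(50)          // open at >= 50% failures
                .slowCallRateThreshold(60)         // open at >= 60% slow calls
                .slowCallDurationThreshold(Duration.ofSeconds(2))
                .waitDurationInOpenState(Duration.ofSeconds(30))
                .permittedNumberOfCallsInHalfOpenState(5)
                .build();
    }
}
```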
2.4 Retry
Retry is useful only when failure is likely transient and the operation can safely be attempted again. It is not a universal best practice. When applied to non-idempotent operations or already overloaded dependencies, retry becomes a force multiplier for failure.
2.5 RateLimiter and TimeLimiter
A RateLimiter controls how many requests may pass during a period. This can protect downstream capacity or ensure you stay within an external partner's contract.
A TimeLimiter sets an upper bound on how long an asynchronous result may take. Without explicit timeout control, callers retain threads, memory, and request context far longer than intended.
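What a TimeLimiter enforces can be demonstrated with plain JDK primitives: `CompletableFuture.orTimeout` (Java 9+) caps how long the caller waits for an asynchronous result. This shows the idea only; Resilience4j's TimeLimiter adds named configuration, metrics, and cancellation policy on top. The class name and the `"TIMED_OUT"` marker are illustrative:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

class TimeoutSketch {
    static String fetchWithTimeout(CompletableFuture<String> remoteCall, long millis) {
        try {
            // orTimeout completes the future exceptionally once the deadline passes,
            // so the caller's thread is released instead of waiting indefinitely.
            return remoteCall.orTimeout(millis, TimeUnit.MILLISECONDS).join();
        } catch (Exception e) {
            // join throws a CompletionException wrapping the TimeoutException
            return "TIMED_OUT";
        }
    }
}
```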
2.6 Bulkhead: semaphore vs thread-pool isolation
Bulkheads prevent one dependency from consuming all resources.
| Pattern | Isolation style | Best fit |
|---|---|---|
| Semaphore bulkhead | Limits concurrent executions | Low overhead concurrency protection |
| Thread-pool bulkhead | Uses a dedicated executor pool | Stronger isolation for blocking work |
Semaphore bulkheads are simpler and cheaper. Thread-pool bulkheads provide harder execution isolation but introduce more operational tuning and overhead.
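The semaphore variant can be sketched with `java.util.concurrent.Semaphore`: at most `maxConcurrent` executions run at once, and excess callers are rejected immediately instead of queueing. Resilience4j's semaphore Bulkhead follows the same principle (with an optional wait duration before rejection); the class here is an illustrative stand-in:

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    <T> T execute(Supplier<T> work, Supplier<T> rejectedFallback) {
        // Fail fast when the bulkhead is saturated instead of piling up callers.
        if (!permits.tryAcquire()) {
            return rejectedFallback.get();
        }
        try {
            return work.get();
        } finally {
            permits.release();
        }
    }
}
```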
3. Gyakorlati használat / Practical Usage
A payment integration is the classic example. A checkout service calls an external payment provider. If the provider slows down, the checkout path should not hold threads indefinitely. A circuit breaker stops repeated calls into known failure, a timeout caps waiting time, retry is used only for specific transient errors, and a bulkhead prevents payment traffic from consuming the entire request-processing pool.
Inventory and pricing services are another good example. They are called frequently and sit on the hot path of user interactions. If they degrade, the impact can ripple across many upstream services. Bulkheads and rate limits can keep that degradation bounded. If the domain supports it, a carefully designed fallback such as a short-lived cached read model may preserve some user-facing behavior.
However, fallback must respect business truth. Returning an empty recommendation list is usually acceptable. Returning a guessed account balance or pretending inventory is available when validation failed is usually not. Resilience patterns must align with domain correctness, not just technical availability.
4. Kód példák / Code Examples
4.1 Annotation-based circuit breaker with fallback
```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;

@Service
class PaymentProviderClient {

    @CircuitBreaker(name = "paymentProvider", fallbackMethod = "fallbackAuthorize")
    @Retry(name = "paymentProvider")
    public PaymentResponse authorize(PaymentRequest request) {
        return invokeRemoteProvider(request);
    }

    // Stand-in for the real HTTP call to the payment provider.
    private PaymentResponse invokeRemoteProvider(PaymentRequest request) {
        throw new IllegalStateException("Provider unavailable");
    }

    // Fallback signature: same parameters as the protected method, plus the triggering Throwable.
    public PaymentResponse fallbackAuthorize(PaymentRequest request, Throwable throwable) {
        return new PaymentResponse("PENDING_MANUAL_REVIEW", "fallback due to: " + throwable.getMessage());
    }
}

record PaymentRequest(String orderId, BigDecimal amount) {}
record PaymentResponse(String status, String message) {}
```
4.2 TimeLimiter with asynchronous execution
```java
import io.github.resilience4j.timelimiter.annotation.TimeLimiter;
import org.springframework.stereotype.Service;

import java.math.BigDecimal;
import java.util.concurrent.CompletableFuture;

@Service
class PricingClient {

    // TimeLimiter requires an asynchronous return type such as CompletableFuture.
    @TimeLimiter(name = "pricingService")
    public CompletableFuture<PricingResponse> getPricing(String sku) {
        return CompletableFuture.supplyAsync(() -> fetchPricing(sku));
    }

    // Stand-in for the real pricing lookup.
    private PricingResponse fetchPricing(String sku) {
        return new PricingResponse(sku, new BigDecimal("19.99"));
    }
}

record PricingResponse(String sku, BigDecimal price) {}
```
4.3 Bulkhead-protected recommendation call
```java
import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import org.springframework.stereotype.Service;

import java.util.List;

@Service
class RecommendationClient {

    @Bulkhead(name = "recommendationService", type = Bulkhead.Type.SEMAPHORE, fallbackMethod = "fallback")
    public List<String> getRecommendations(String customerId) {
        return List.of("sku-1", "sku-2");
    }

    // Degraded behavior: an empty list is an honest, business-safe fallback here.
    public List<String> fallback(String customerId, Throwable throwable) {
        return List.of();
    }
}
```
4.4 Resilience4j YAML configuration
```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentProvider:
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        minimumNumberOfCalls: 10
        failureRateThreshold: 50
        slowCallRateThreshold: 60
        slowCallDurationThreshold: 2s
        waitDurationInOpenState: 30s
        permittedNumberOfCallsInHalfOpenState: 5
  retry:
    instances:
      paymentProvider:
        maxAttempts: 3
        waitDuration: 200ms
  ratelimiter:
    instances:
      pricingService:
        limitForPeriod: 50
        limitRefreshPeriod: 1s
        timeoutDuration: 0
```
5. Trade-offok / Trade-offs
Benefits
- prevents unhealthy dependencies from consuming all caller resources;
- improves failure behavior by failing fast and predictably;
- reduces the chance of cascade failure;
- creates explicit, tunable operational policy around remote calls.
Costs
- poor tuning can reject traffic unnecessarily;
- fallback behavior can introduce business inconsistency if designed badly;
- combining retries, timeouts, and breakers creates complex interaction effects;
- resilience libraries do not fix bad architecture or poor observability.
Use it when remote dependencies matter to service health, especially with external APIs or heavily used internal services.
Avoid simplistic use when the work is local, the operation is not safely retryable, or the team has not thought through the business meaning of degraded behavior.
6. Gyakori hibák / Common Mistakes
6.1 Retrying non-idempotent operations
Blind retries on operations that charge money, create orders, or trigger side effects can duplicate business actions. Without idempotency keys or explicit guarantees, retry can be dangerous.
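One common mitigation is an idempotency key: the first execution for a key records its result, and any retry with the same key returns that recorded result instead of repeating the side effect. The sketch below is illustrative; the in-memory map stands in for what would be persistent, shared state in a real payment or ordering system:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

class IdempotentExecutor {
    private final ConcurrentMap<String, Object> completed = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    <T> T execute(String idempotencyKey, Supplier<T> sideEffectingCall) {
        // computeIfAbsent runs the call at most once per key, even across retries.
        return (T) completed.computeIfAbsent(idempotencyKey, k -> sideEffectingCall.get());
    }
}
```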
6.2 Adding a fallback for everything
Not every failure should degrade into a default value. Sometimes the correct behavior is a fast and explicit error that preserves business integrity.
6.3 Poor sliding-window sizing
A tiny window makes the breaker twitchy; an oversized window makes it slow to react. Tune against real traffic patterns, not guesswork.
6.4 Misaligned timeout layers
If HTTP client timeouts, TimeLimiter, gateway timeouts, and upstream expectations are inconsistent, diagnosis becomes confusing and failure modes become harder to predict.
6.5 Using retry without isolation
If a dependency is already slow, retries increase load on the dependency while also consuming more caller resources. Without bulkheads or hard concurrency control, retry alone often worsens incidents.
7. Senior szintű meglátások / Senior-level Insights
Resilience policy should be designed by business criticality, not by copy-pasted annotations. Recommendations, payments, ledger writes, inventory checks, and audit streams all deserve different protection strategies. “One config for every dependency” is a common anti-pattern.
Fallback quality matters more than fallback existence. A good fallback is domain-honest. An empty recommendation set may be acceptable; a fabricated financial result is not. Senior engineering means evaluating degraded behavior through business semantics, not just uptime metrics.
Metrics are essential. Circuit breaker open rates, retry counts, rate limiter rejections, bulkhead saturation, and timeout ratios are the difference between tuning and guessing. Resilience is not a one-time code pattern; it is an operational feedback loop.
Finally, resilience patterns cannot rescue a fundamentally poor architecture. If the service graph is too chatty, if synchronous dependencies are overused, or if no clear SLA/SLO model exists, circuit breakers will merely expose those structural issues in a more controlled way.
8. Szószedet / Glossary
- Resilience4j: Java library implementing resilience patterns.
- Circuit Breaker: mechanism that stops calls when failure rates become unacceptable.
- Retry: controlled re-attempt of an operation after failure.
- RateLimiter: component that limits request throughput.
- Bulkhead: resource isolation pattern that prevents one dependency from consuming everything.
- TimeLimiter: timeout controller for asynchronous operations.
- HALF_OPEN: circuit breaker trial state used to test recovery.
- Sliding window: the observation horizon used to evaluate recent call outcomes.
9. Gyorsreferencia / Cheatsheet
| Pattern | Purpose | Key tuning point | Common mistake |
|---|---|---|---|
| CircuitBreaker | Stop repeated calls into failure | failure threshold, open wait | Overreacting to small samples |
| Retry | Recover from transient faults | attempts, backoff | Retrying non-idempotent work |
| RateLimiter | Control throughput | permits per period | Wrong capacity assumptions |
| Bulkhead | Isolate resource usage | concurrency or pool size | No isolation around slow dependencies |
| TimeLimiter | Cap wait time | timeout duration | Misaligned timeout stack |
| Fallback | Provide degraded behavior | business-valid alternative | Returning misleading data |