Monitoring
actuator, health checks, Micrometer, logging, distributed tracing
Effective monitoring turns a running Spring application from a black box into an observable system you can operate under pressure.
1. Definíció / Definition
Mi ez? / What is it?
Monitoring in Spring is the operational visibility layer around a running application: health state, metrics, logs, and traces. In practice it is assembled from Spring Boot Actuator, Micrometer, structured logging with SLF4J and Logback, and a tracing backend such as Zipkin or Jaeger.
Miért létezik? / Why does it exist?
Because production failures are rarely obvious from code alone. Teams need fast answers to questions like: is the JVM alive, is the service ready to receive traffic, which dependency is slow, what changed after deployment, and how a single request moved across multiple services.
Hol helyezkedik el? / Where does it fit?
It sits between application code and operations. Developers instrument business-critical paths, expose health information, and preserve request context in logs. Platform and SRE layers then scrape, aggregate, visualize, and alert on that information.
2. Alapfogalmak / Core Concepts
2.1 Spring Boot Actuator endpoints
Actuator provides standard operational endpoints without forcing each team to invent its own ad hoc admin API.
| Endpoint | Purpose | Typical use |
|---|---|---|
| `/actuator/health` | overall health state | load balancer, readiness, liveness |
| `/actuator/metrics` | metric discovery | debugging, exploring meter names |
| `/actuator/prometheus` | scrape endpoint | Prometheus integration |
| `/actuator/info` | build and app metadata | version, commit id, build timestamp |
| `/actuator/env` | environment inspection | config debugging, restricted access |
| `/actuator/loggers` | runtime log level control | incident response |
Actuator is powerful precisely because it reduces friction, which is also why it must be exposed carefully.
2.2 Health modeling: health, liveness, readiness
A single green health flag is not enough in modern deployments.
Useful health dimensions:
- Liveness — answers whether the process is still alive.
- Readiness — answers whether the service can serve traffic right now.
- Health details — answers whether dependencies and custom indicators are healthy.
A service can be alive but not ready. For example, the JVM is running but the application cannot connect to its database, complete startup warming, or acquire required configuration.
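One way to encode these dimensions is Spring Boot's health groups: enabling probes exposes `/actuator/health/liveness` and `/actuator/health/readiness`, and each group can include a different set of indicators. A minimal sketch (the indicator names `db` and `paymentGateway` are illustrative):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        readiness:
          include: db,paymentGateway   # readiness fails while these are down
        liveness:
          include: ping                # liveness stays cheap and dependency-free
```

This is what lets the JVM report "alive" while the database indicator keeps readiness red.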
2.3 Micrometer metrics model
Micrometer is an abstraction layer over concrete monitoring systems. That matters because instrumentation code should survive backend changes.
Important meter types:
- Counter for monotonically increasing totals.
- Gauge for instantaneous values such as queue depth.
- Timer for latency and invocation count.
- DistributionSummary for values like payload size.
A production metric is not just a number. It needs good naming, useful tags, and bounded cardinality.
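Bounded cardinality usually means normalizing raw values before they become tags. A stdlib-only sketch (the path pattern is hypothetical) that collapses numeric path segments so `/orders/123` and `/orders/456` land in one metric series:

```java
import java.util.regex.Pattern;

// Normalizes raw request paths into bounded tag values so each
// concrete ID does not create its own metric series.
public class TagNormalizer {
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+");

    public static String normalize(String rawPath) {
        // Replace numeric path segments with a placeholder to cap cardinality.
        return NUMERIC_SEGMENT.matcher(rawPath).replaceAll("/{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/orders/123"));         // /orders/{id}
        System.out.println(normalize("/orders/123/items/7")); // /orders/{id}/items/{id}
    }
}
```

The same idea applies to any tag source: map the unbounded input to a small, analytical set of values before it reaches the meter.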
2.4 Logging and MDC
Logs remain the most detailed signal during failure analysis. With plain text logging you often get events; with contextual logging you get evidence.
MDC attaches request-scoped values to log lines:
- `traceId`
- `spanId`
- `requestId`
- `tenantId`
- safely chosen domain identifiers
That allows targeted filtering instead of broad grep-driven archaeology during incidents.
2.5 Distributed tracing in modern Spring
Spring Cloud Sleuth was the common choice through Spring Boot 2.x; from Spring Boot 3 onward the direction is Micrometer Tracing. The conceptual model remains the same.
Typical trace path:
- The client request enters through the gateway.
- The same trace ID follows the call across Order Service, Payment Service, and Email Service.
- Each hop creates its own span ID so individual steps stay measurable.
- Trace ID identifies the end-to-end request path.
- Span ID identifies one step within that path.
Tracing is especially valuable when latency or failure is distributed across services rather than isolated to one JVM.
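The trace/span split becomes concrete in the W3C `traceparent` header (`version-traceId-spanId-flags`), the default propagation format in Micrometer Tracing. A stdlib-only sketch of the mechanics, not a real tracer:

```java
import java.util.concurrent.ThreadLocalRandom;

// Models a W3C traceparent header: 2-hex version, 32-hex trace ID,
// 16-hex span ID, 2-hex flags. Each hop keeps the trace ID and
// mints a fresh span ID.
public class TraceContext {
    final String traceId;
    final String spanId;

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    static String hex(int bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes; i++) {
            sb.append(String.format("%02x", ThreadLocalRandom.current().nextInt(256)));
        }
        return sb.toString();
    }

    static TraceContext newTrace() {
        return new TraceContext(hex(16), hex(8));
    }

    // A downstream hop reuses the trace ID but gets its own span ID.
    TraceContext childSpan() {
        return new TraceContext(traceId, hex(8));
    }

    String traceparent() {
        return "00-" + traceId + "-" + spanId + "-01";
    }
}
```

Following one trace ID through the logs of Order, Payment, and Email Service is exactly the query a tracing backend automates.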
2.6 Correlating the signals
Metrics, logs, and traces should not compete; they should triangulate.
- Metrics tell you something is degrading.
- Traces tell you where time is spent.
- Logs tell you what happened in technical and business context.
3. Gyakorlati használat / Practical Usage
A common production pattern is exposing separate readiness and liveness probes in Kubernetes. During startup, the process should stay alive while reporting “not ready” until migrations, caches, or dependency handshakes are complete. This prevents premature traffic routing and flaky rollout behavior.
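The stay-alive-but-not-ready pattern maps onto Spring Boot's availability API. A sketch assuming a hypothetical `warmUp()` step that runs during startup:

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;

@Component
public class StartupWarmup {

    private final ApplicationEventPublisher publisher;

    public StartupWarmup(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    public void warmUp() {
        // Report "not ready" while startup work runs; liveness stays UP
        // the whole time, so the orchestrator does not restart the pod.
        AvailabilityChangeEvent.publish(publisher, this, ReadinessState.REFUSING_TRAFFIC);
        // ... run migrations, prime caches, handshake dependencies (hypothetical) ...
        AvailabilityChangeEvent.publish(publisher, this, ReadinessState.ACCEPTING_TRAFFIC);
    }
}
```

With probes enabled, `/actuator/health/readiness` reflects these state changes automatically.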
Another frequent use case is latency analysis on a hot path such as checkout. A Timer can capture p50, p95, and p99 behavior for /checkout, while tracing reveals whether tail latency comes from the payment provider, an overloaded connection pool, or a slow database query. Without that split, teams tend to blame the wrong layer.
Monitoring also improves incident handling. Suppose operations receives a spike in error rate after a release. Actuator health details show Redis is healthy, Prometheus shows DB latency is stable, but tracing shows a new downstream HTTP call dominates request time. At that point debugging becomes evidence-driven instead of opinion-driven.
Runtime logger management is another practical tool. During a production issue you may need DEBUG logs for com.example.payment.client but not for the whole application. Actuator can adjust that at runtime, avoiding a redeploy and limiting unnecessary log volume.
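The loggers endpoint accepts a POST with the desired level. Assuming the endpoint is exposed and access-controlled, the calls look like this:

```shell
# Raise one package to DEBUG at runtime (requires /actuator/loggers exposure and auth).
curl -X POST http://localhost:8080/actuator/loggers/com.example.payment.client \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel":"DEBUG"}'

# Reset to the configured default by sending null.
curl -X POST http://localhost:8080/actuator/loggers/com.example.payment.client \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel":null}'
```

The change takes effect immediately and is lost on restart, which is usually the desired behavior for incident debugging.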
4. Kód példák / Code Examples
4.1 Actuator and Prometheus exposure
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus,loggers
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
  metrics:
    tags:
      application: billing-service
```
4.2 Custom HealthIndicator
```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {

    private final PaymentGatewayClient client;

    public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            long latencyMs = client.ping();
            return Health.up()
                    .withDetail("dependency", "payment-gateway")
                    .withDetail("latencyMs", latencyMs)
                    .build();
        } catch (Exception ex) {
            return Health.down(ex)
                    .withDetail("dependency", "payment-gateway")
                    .build();
        }
    }
}
```
4.3 Micrometer instrumentation
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class InvoiceService {

    private final Timer generationTimer;
    private final Counter generatedCounter;

    public InvoiceService(MeterRegistry registry) {
        this.generationTimer = Timer.builder("invoice.generation.duration")
                .description("Time spent generating invoices")
                .publishPercentileHistogram()
                .register(registry);
        this.generatedCounter = Counter.builder("invoice.generated.total")
                .description("Generated invoices")
                .register(registry);
    }

    public Invoice generate(InvoiceCommand command) {
        return generationTimer.record(() -> {
            Invoice invoice = new Invoice(command.id(), "READY");
            generatedCounter.increment();
            return invoice;
        });
    }
}

record InvoiceCommand(String id) {}
record Invoice(String id, String status) {}
```
4.4 MDC request filter
```java
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationLoggingFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        try {
            MDC.put("requestId", Optional.ofNullable(request.getHeader("X-Request-Id"))
                    .orElse(UUID.randomUUID().toString()));
            MDC.put("path", request.getRequestURI());
            filterChain.doFilter(request, response);
        } finally {
            // Always clear the MDC so pooled threads do not leak context.
            MDC.clear();
        }
    }
}
```
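For the MDC values set by the filter to appear in output, the log pattern must reference them. A Logback sketch using the standard `%X` conversion word (the field names match the filter's `requestId` and `path` keys):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %X{key} pulls the value from MDC; renders empty when the key is absent -->
      <pattern>%d{ISO8601} %-5level [%X{requestId}] %logger{36} %X{path} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```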
5. Trade-offok / Trade-offs
Benefits
- faster incident detection and diagnosis;
- better SLO-driven operations;
- clearer capacity and latency trends;
- safer production debugging through controlled runtime visibility.
Costs
- metrics with unbounded tags create high-cardinality pain;
- logging volume can become expensive and noisy;
- tracing introduces sampling and storage trade-offs;
- exposing operational endpoints can become a security issue if done carelessly.
Use aggressively when services are customer-facing, distributed, high-throughput, or tightly bound to SLAs.
Avoid overengineering when the system is small and low-risk. Observability should be intentional, not ornamental.
6. Gyakori hibák / Common Mistakes
6.1 Exposing too many Actuator endpoints
Developers often expose env, beans, and other diagnostic endpoints broadly because they are convenient. In production, convenience without access control becomes risk.
6.2 High-cardinality metric tags
Tags such as userId, email address, raw path parameters, or request IDs make metric aggregation explode. Keep metric labels bounded and analytical.
6.3 Expensive health checks
A health endpoint should not execute half the application. If a health probe is slow, stateful, or failure-prone, it becomes operationally dangerous itself.
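One way to keep a probe cheap is to cache the last expensive check for a short TTL, so the health endpoint answers from memory between refreshes. A stdlib-only sketch (the TTL and the `Supplier` of the real check are illustrative):

```java
import java.util.function.Supplier;

// Caches the result of an expensive dependency check so repeated
// health probes do not each hit the dependency.
public class CachedCheck<T> {

    private final Supplier<T> expensiveCheck;
    private final long ttlMillis;
    private T cached;
    private long lastRefresh;

    public CachedCheck(Supplier<T> expensiveCheck, long ttlMillis) {
        this.expensiveCheck = expensiveCheck;
        this.ttlMillis = ttlMillis;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - lastRefresh > ttlMillis) {
            cached = expensiveCheck.get();  // only hit the dependency on expiry
            lastRefresh = now;
        }
        return cached;
    }
}
```

A `HealthIndicator` wrapping such a cache stays fast and deterministic even when the underlying check is slow.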
6.4 Logging sensitive data
Credentials, tokens, card data, and unmasked personal information do not belong in logs. Cleanup after the incident is not a credible control.
6.5 Missing context propagation
Teams sometimes add tracing to one service and assume observability is solved. Without header propagation across HTTP, messaging, and async boundaries, the trace graph becomes fragmented and misleading.
7. Senior szintű meglátások / Senior-level Insights
Senior teams treat monitoring as a design concern, not a post-release checkbox. Instrumentation should align with failure modes and business criticality. For a payment flow, success rate, latency, and downstream dependency health are first-class metrics. For a background recommendation refresh job, the set of operational signals will differ.
Good health modeling is subtle. Not every dependency deserves equal weight in readiness. If an optional recommendation engine is down, your checkout path may still be ready. If the primary ledger store is unavailable, readiness should fail immediately. Senior engineering means encoding those business truths into operational semantics.
Runtime logging controls are useful but dangerous. Turning on DEBUG globally during an incident can turn a latency problem into an I/O problem. Limit scope, protect access, and make changes auditable.
Finally, standardization matters. A shared naming convention for meters, structured log fields, and trace correlation keys is what makes organization-wide observability usable at scale. The platform wins when teams do not invent incompatible telemetry dialects.
8. Szószedet / Glossary
- Actuator: Spring Boot module for operational endpoints.
- HealthIndicator: component that reports health for a subsystem or dependency.
- Micrometer: metrics facade for multiple monitoring backends.
- Counter: monotonically increasing event count.
- Gauge: instantaneous measurement.
- Timer: latency and count measurement.
- MDC: mapped diagnostic context for log correlation.
- Trace ID: identifier for an end-to-end request path.
- Span ID: identifier for a single operation inside a trace.
- Prometheus: pull-based metrics collection system.
- Grafana: dashboard and visualization layer.
9. Gyorsreferencia / Cheatsheet
| Concern | Good default | Watch out for |
|---|---|---|
| Actuator exposure | expose only required endpoints | security and network boundaries |
| Health checks | fast, cheap, deterministic | avoid heavy business logic |
| Counter | totals for events | never use for values that go down |
| Gauge | queue depth, pool size | tie it to stable objects |
| Timer | request latency | enable histograms when needed |
| Logging | structured and contextual | protect sensitive data |
| MDC | request and trace context | propagate across async boundaries |
| Tracing | traceId plus spanId | sampling and cost tuning |
| Runtime loggers | targeted incident debugging | authentication and auditability |