Monitoring
actuator, health checks, Micrometer, logging, distributed tracing
Effective monitoring turns a running Spring application from a black box into an observable system you can operate under pressure.
1. Definíció / Definition
Mi ez? / What is it?
Monitoring in Spring is the operational visibility layer around a running application: health state, metrics, logs, and traces. In practice it is assembled from Spring Boot Actuator, Micrometer, structured logging with SLF4J and Logback, and a tracing backend such as Zipkin or Jaeger.
Miért létezik? / Why does it exist?
Because production failures are rarely obvious from code alone. Teams need fast answers to questions like: is the JVM alive, is the service ready to receive traffic, which dependency is slow, what changed after deployment, and how a single request moved across multiple services.
Hol helyezkedik el? / Where does it fit?
It sits between application code and operations. Developers instrument business-critical paths, expose health information, and preserve request context in logs. Platform and SRE layers then scrape, aggregate, visualize, and alert on that information.
2. Alapfogalmak / Core Concepts
2.1 Spring Boot Actuator endpoints
Actuator provides standard operational endpoints without forcing each team to invent its own ad hoc admin API.
| Endpoint | Purpose | Typical use |
|---|---|---|
| `/actuator/health` | overall health state | load balancer, readiness, liveness |
| `/actuator/metrics` | metric discovery | debugging, exploring meter names |
| `/actuator/prometheus` | scrape endpoint | Prometheus integration |
| `/actuator/info` | build and app metadata | version, commit id, build timestamp |
| `/actuator/env` | environment inspection | config debugging, restricted access |
| `/actuator/loggers` | runtime log level control | incident response |
Actuator is powerful precisely because it reduces friction, which is also why it must be exposed carefully.
2.2 Health modeling: health, liveness, readiness
A single green health flag is not enough in modern deployments.
Useful health dimensions:
- Liveness — answers whether the process is still alive.
- Readiness — answers whether the service can serve traffic right now.
- Health details — answers whether dependencies and custom indicators are healthy.
A service can be alive but not ready. For example, the JVM is running but the application cannot connect to its database, complete startup warming, or acquire required configuration.
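One way to encode these dimensions is Spring Boot's health groups: enabling probes exposes `/actuator/health/liveness` and `/actuator/health/readiness`, and each group can include a different set of indicators. A minimal sketch (the indicator names `db` and `paymentGateway` are illustrative):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness
      group:
        readiness:
          include: db,paymentGateway   # readiness fails while these are down
        liveness:
          include: ping                # liveness stays cheap and dependency-free
```

This is what lets the JVM report "alive" while the database indicator keeps readiness red.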
2.3 Micrometer metrics model
Micrometer is an abstraction layer over concrete monitoring systems. That matters because instrumentation code should survive backend changes.
Important meter types:
- Counter for monotonically increasing totals.
- Gauge for instantaneous values such as queue depth.
- Timer for latency and invocation count.
- DistributionSummary for values like payload size.
A production metric is not just a number. It needs good naming, useful tags, and bounded cardinality.
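Bounded cardinality usually means normalizing raw values before they become tags. A stdlib-only sketch (the path pattern is hypothetical) that collapses numeric path segments so `/orders/123` and `/orders/456` land in one metric series:

```java
import java.util.regex.Pattern;

// Normalizes raw request paths into bounded tag values so each
// concrete ID does not create its own metric series.
public class TagNormalizer {
    private static final Pattern NUMERIC_SEGMENT = Pattern.compile("/\\d+");

    public static String normalize(String rawPath) {
        // Replace numeric path segments with a placeholder to cap cardinality.
        return NUMERIC_SEGMENT.matcher(rawPath).replaceAll("/{id}");
    }

    public static void main(String[] args) {
        System.out.println(normalize("/orders/123"));         // /orders/{id}
        System.out.println(normalize("/orders/123/items/7")); // /orders/{id}/items/{id}
    }
}
```

The same idea applies to any tag source: map the unbounded input to a small, analytical set of values before it reaches the meter.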
2.4 Logging and MDC
Logs remain the most detailed signal during failure analysis. With plain text logging you often get events; with contextual logging you get evidence.
MDC attaches request-scoped values to log lines:
- `traceId`
- `spanId`
- `requestId`
- `tenantId`
- safely chosen domain identifiers
That allows targeted filtering instead of broad grep-driven archaeology during incidents.
2.5 Distributed tracing in modern Spring
Spring Cloud Sleuth was the common choice through Spring Boot 2.x; from Spring Boot 3 onward the direction is Micrometer Tracing. The conceptual model remains the same.
Typical trace path:
- The client request enters through the gateway.
- The same trace ID follows the call across Order Service, Payment Service, and Email Service.
- Each hop creates its own span ID so individual steps stay measurable.
- Trace ID identifies the end-to-end request path.
- Span ID identifies one step within that path.
Tracing is especially valuable when latency or failure is distributed across services rather than isolated to one JVM.
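The trace/span split becomes concrete in the W3C `traceparent` header (`version-traceId-spanId-flags`), the default propagation format in Micrometer Tracing. A stdlib-only sketch of the mechanics, not a real tracer:

```java
import java.util.concurrent.ThreadLocalRandom;

// Models a W3C traceparent header: 2-hex version, 32-hex trace ID,
// 16-hex span ID, 2-hex flags. Each hop keeps the trace ID and
// mints a fresh span ID.
public class TraceContext {
    final String traceId;
    final String spanId;

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }

    static String hex(int bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes; i++) {
            sb.append(String.format("%02x", ThreadLocalRandom.current().nextInt(256)));
        }
        return sb.toString();
    }

    static TraceContext newTrace() {
        return new TraceContext(hex(16), hex(8));
    }

    // A downstream hop reuses the trace ID but gets its own span ID.
    TraceContext childSpan() {
        return new TraceContext(traceId, hex(8));
    }

    String traceparent() {
        return "00-" + traceId + "-" + spanId + "-01";
    }
}
```

Following one trace ID through the logs of Order, Payment, and Email Service is exactly the query a tracing backend automates.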
2.6 Correlating the signals
Metrics, logs, and traces should not compete; they should triangulate.
- Metrics tell you something is degrading.
- Traces tell you where time is spent.
- Logs tell you what happened in technical and business context.
3. Gyakorlati használat / Practical Usage
A common production pattern is exposing separate readiness and liveness probes in Kubernetes. During startup, the process should stay alive while reporting “not ready” until migrations, caches, or dependency handshakes are complete. This prevents premature traffic routing and flaky rollout behavior.
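The stay-alive-but-not-ready pattern maps onto Spring Boot's availability API. A sketch assuming a hypothetical `warmUp()` step that runs during startup:

```java
import org.springframework.boot.availability.AvailabilityChangeEvent;
import org.springframework.boot.availability.ReadinessState;
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.stereotype.Component;

@Component
public class StartupWarmup {

    private final ApplicationEventPublisher publisher;

    public StartupWarmup(ApplicationEventPublisher publisher) {
        this.publisher = publisher;
    }

    public void warmUp() {
        // Report "not ready" while startup work runs; liveness stays UP
        // the whole time, so the orchestrator does not restart the pod.
        AvailabilityChangeEvent.publish(publisher, this, ReadinessState.REFUSING_TRAFFIC);
        // ... run migrations, prime caches, handshake dependencies (hypothetical) ...
        AvailabilityChangeEvent.publish(publisher, this, ReadinessState.ACCEPTING_TRAFFIC);
    }
}
```

With probes enabled, `/actuator/health/readiness` reflects these state changes automatically.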
Another frequent use case is latency analysis on a hot path such as checkout. A Timer can capture p50, p95, and p99 behavior for /checkout, while tracing reveals whether tail latency comes from the payment provider, an overloaded connection pool, or a slow database query. Without that split, teams tend to blame the wrong layer.
Monitoring also improves incident handling. Suppose operations receives a spike in error rate after a release. Actuator health details show Redis is healthy, Prometheus shows DB latency is stable, but tracing shows a new downstream HTTP call dominates request time. At that point debugging becomes evidence-driven instead of opinion-driven.
Runtime logger management is another practical tool. During a production issue you may need DEBUG logs for com.example.payment.client but not for the whole application. Actuator can adjust that at runtime, avoiding a redeploy and limiting unnecessary log volume.
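The loggers endpoint accepts a POST with the desired level. Assuming the endpoint is exposed and access-controlled, the calls look like this:

```shell
# Raise one package to DEBUG at runtime (requires /actuator/loggers exposure and auth).
curl -X POST http://localhost:8080/actuator/loggers/com.example.payment.client \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel":"DEBUG"}'

# Reset to the configured default by sending null.
curl -X POST http://localhost:8080/actuator/loggers/com.example.payment.client \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel":null}'
```

The change takes effect immediately and is lost on restart, which is usually the desired behavior for incident debugging.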
4. Kód példák / Code Examples
4.1 Actuator and Prometheus exposure
```yaml
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus,loggers
  endpoint:
    health:
      probes:
        enabled: true
      show-details: when-authorized
  metrics:
    tags:
      application: billing-service
```
4.2 Custom HealthIndicator
```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {

    private final PaymentGatewayClient client;

    public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
        this.client = client;
    }

    @Override
    public Health health() {
        try {
            long latencyMs = client.ping();
            return Health.up()
                    .withDetail("dependency", "payment-gateway")
                    .withDetail("latencyMs", latencyMs)
                    .build();
        } catch (Exception ex) {
            return Health.down(ex)
                    .withDetail("dependency", "payment-gateway")
                    .build();
        }
    }
}
```
4.3 Micrometer instrumentation
```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Service;

@Service
public class InvoiceService {

    private final Timer generationTimer;
    private final Counter generatedCounter;

    public InvoiceService(MeterRegistry registry) {
        this.generationTimer = Timer.builder("invoice.generation.duration")
                .description("Time spent generating invoices")
                .publishPercentileHistogram()
                .register(registry);
        this.generatedCounter = Counter.builder("invoice.generated.total")
                .description("Generated invoices")
                .register(registry);
    }

    public Invoice generate(InvoiceCommand command) {
        return generationTimer.record(() -> {
            Invoice invoice = new Invoice(command.id(), "READY");
            generatedCounter.increment();
            return invoice;
        });
    }
}

record InvoiceCommand(String id) {}
record Invoice(String id, String status) {}
```
4.4 MDC request filter
```java
import java.io.IOException;
import java.util.Optional;
import java.util.UUID;

import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class CorrelationLoggingFilter extends OncePerRequestFilter {

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain filterChain) throws ServletException, IOException {
        try {
            MDC.put("requestId", Optional.ofNullable(request.getHeader("X-Request-Id"))
                    .orElse(UUID.randomUUID().toString()));
            MDC.put("path", request.getRequestURI());
            filterChain.doFilter(request, response);
        } finally {
            // Always clear the MDC so pooled threads do not leak context.
            MDC.clear();
        }
    }
}
```
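For the MDC values set by the filter to appear in output, the log pattern must reference them. A Logback sketch using the standard `%X` conversion word (the field names match the filter's `requestId` and `path` keys):

```xml
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <!-- %X{key} pulls the value from MDC; renders empty when the key is absent -->
      <pattern>%d{ISO8601} %-5level [%X{requestId}] %logger{36} %X{path} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="STDOUT"/>
  </root>
</configuration>
```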
5. Trade-offok / Trade-offs
Benefits
- faster incident detection and diagnosis;
- better SLO-driven operations;
- clearer capacity and latency trends;
- safer production debugging through controlled runtime visibility.
Costs
- metrics with unbounded tags create high-cardinality pain;
- logging volume can become expensive and noisy;
- tracing introduces sampling and storage trade-offs;
- exposing operational endpoints can become a security issue if done carelessly.
Use aggressively when services are customer-facing, distributed, high-throughput, or tightly bound to SLAs.
Avoid overengineering when the system is small and low-risk. Observability should be intentional, not ornamental.
6. Gyakori hibák / Common Mistakes
6.1 Exposing too many Actuator endpoints
Developers often expose env, beans, and other diagnostic endpoints broadly because they are convenient. In production, convenience without access control becomes risk.
6.2 High-cardinality metric tags
Tags such as userId, email address, raw path parameters, or request IDs make metric aggregation explode. Keep metric labels bounded and analytical.
6.3 Expensive health checks
A health endpoint should not execute half the application. If a health probe is slow, stateful, or failure-prone, it becomes operationally dangerous itself.
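One way to keep a probe cheap is to cache the last expensive check for a short TTL, so the health endpoint answers from memory between refreshes. A stdlib-only sketch (the TTL and the `Supplier` of the real check are illustrative):

```java
import java.util.function.Supplier;

// Caches the result of an expensive dependency check so repeated
// health probes do not each hit the dependency.
public class CachedCheck<T> {

    private final Supplier<T> expensiveCheck;
    private final long ttlMillis;
    private T cached;
    private long lastRefresh;

    public CachedCheck(Supplier<T> expensiveCheck, long ttlMillis) {
        this.expensiveCheck = expensiveCheck;
        this.ttlMillis = ttlMillis;
    }

    public synchronized T get() {
        long now = System.currentTimeMillis();
        if (cached == null || now - lastRefresh > ttlMillis) {
            cached = expensiveCheck.get();  // only hit the dependency on expiry
            lastRefresh = now;
        }
        return cached;
    }
}
```

A `HealthIndicator` wrapping such a cache stays fast and deterministic even when the underlying check is slow.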
6.4 Logging sensitive data
Credentials, tokens, card data, and unmasked personal information do not belong in logs. Cleanup after the incident is not a credible control.
6.5 Missing context propagation
Teams sometimes add tracing to one service and assume observability is solved. Without header propagation across HTTP, messaging, and async boundaries, the trace graph becomes fragmented and misleading.
7. Senior szintű meglátások / Senior-level Insights
Senior teams treat monitoring as a design concern, not a post-release checkbox. Instrumentation should align with failure modes and business criticality. For a payment flow, success rate, latency, and downstream dependency health are first-class metrics. For a background recommendation refresh job, the set of operational signals will differ.
Good health modeling is subtle. Not every dependency deserves equal weight in readiness. If an optional recommendation engine is down, your checkout path may still be ready. If the primary ledger store is unavailable, readiness should fail immediately. Senior engineering means encoding those business truths into operational semantics.
Runtime logging controls are useful but dangerous. Turning on DEBUG globally during an incident can turn a latency problem into an I/O problem. Limit scope, protect access, and make changes auditable.
Finally, standardization matters. A shared naming convention for meters, structured log fields, and trace correlation keys is what makes organization-wide observability usable at scale. The platform wins when teams do not invent incompatible telemetry dialects.
8. Szószedet / Glossary
- Actuator: Spring Boot module for operational endpoints.
- HealthIndicator: component that reports health for a subsystem or dependency.
- Micrometer: metrics facade for multiple monitoring backends.
- Counter: monotonically increasing event count.
- Gauge: instantaneous measurement.
- Timer: latency and count measurement.
- MDC: mapped diagnostic context for log correlation.
- Trace ID: identifier for an end-to-end request path.
- Span ID: identifier for a single operation inside a trace.
- Prometheus: pull-based metrics collection system.
- Grafana: dashboard and visualization layer.
9. Gyorsreferencia / Cheatsheet
| Concern | Good default | Watch out for |
|---|---|---|
| Actuator exposure | expose only required endpoints | security and network boundaries |
| Health checks | fast, cheap, deterministic | avoid heavy business logic |
| Counter | totals for events | never use for values that go down |
| Gauge | queue depth, pool size | tie it to stable objects |
| Timer | request latency | enable histograms when needed |
| Logging | structured and contextual | protect sensitive data |
| MDC | request and trace context | propagate across async boundaries |
| Tracing | traceId plus spanId | sampling and cost tuning |
| Runtime loggers | targeted incident debugging | authentication and auditability |