Observability is Not Monitoring: The Three Pillars in Production

by Syntax Void · Observability · 10 min read

Monitoring and observability are not synonyms. Monitoring is the practice of watching known failure modes. Observability is the property of a system that allows you to understand any internal state from external outputs — including failure modes you have not anticipated yet.

Most teams build monitoring. Elite teams build observability.

Why the Distinction Matters

In a monolithic system, a stack trace tells you exactly where the failure occurred. In a distributed system, a request touches 12 services before returning a 500. The stack trace only shows you the last hop. The failure may have originated three services upstream.

Observability is the infrastructure that lets you reconstruct the full causal chain from first principles, without modifying any code after the incident occurs.

The Three Pillars: How They Interlock

Metrics are aggregations over time. They answer: “Is the system behaving normally?” They are cheap to store (a single value per interval), fast to query, and excellent for alerting. But they lose detail in aggregation — a 95th percentile latency of 200ms tells you that something is slow, not which request or why.

Logs are the narrative of individual events. They answer: “What happened during this specific operation?” Structured logs (JSON, not plaintext) are queryable: filter status=500 and service=checkout and user_id=42. The cost is high: logs at high throughput are expensive to ship, store, and query.

Traces are the distributed call graph of a single request. They answer: “Why did this specific request take 3.2 seconds?” A trace contains spans — one per service, database query, or external call — connected by propagated context headers. A waterfall of spans reveals exactly where time was spent.

The interlock is the key insight: traces link to logs, logs contain trace IDs, and metrics alert you to look. An alert fires on P99 latency. You find the slow traces. Within each span you drill into structured logs. This is the debugging workflow that distributed systems require.

OpenTelemetry: The Standard Worth Adopting

Before OpenTelemetry, every vendor had its own SDK. Switching from Datadog to Honeycomb meant re-instrumenting your entire codebase. OTel solves this with a vendor-neutral instrumentation API and SDK, plus the OTel Collector — a proxy that receives telemetry and routes it to any backend.
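A minimal Collector pipeline sketch makes the routing role concrete — receive OTLP over gRPC, forward traces to whichever backend you choose (the endpoint here is a placeholder, not a recommendation):

```yaml
# Sketch of an OTel Collector config: receive OTLP, export anywhere.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlphttp:
    # Placeholder endpoint -- point this at Datadog, Honeycomb, Jaeger, etc.
    endpoint: https://telemetry-backend.example.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

Swapping vendors becomes a change to the `exporters` section, not to application code.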

Instrument once. Route everywhere.

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

func ProcessOrder(ctx context.Context, orderID string) error {
    tracer := otel.Tracer("checkout-service")
    ctx, span := tracer.Start(ctx, "ProcessOrder")
    defer span.End()

    span.SetAttributes(
        attribute.String("order.id", orderID),
        // version is a package-level build-info helper, not shown here.
        attribute.String("service.version", version.String()),
    )

    if err := validateInventory(ctx, orderID); err != nil {
        span.RecordError(err)
        span.SetStatus(codes.Error, err.Error())
        return err
    }

    // validateInventory and chargeCustomer receive ctx, so their spans
    // become children of this one automatically.
    return chargeCustomer(ctx, orderID)
}

The propagated context automatically carries the trace ID into validateInventory and chargeCustomer, creating child spans without any additional instrumentation in those functions.

Structured Logging: The Minimum Viable Investment

If you are starting from zero, structured logging is the highest-leverage first investment. Replace all fmt.Println("user logged in") calls with:

// logger is a *log/slog Logger; span is the active OTel span for this request.
logger.Info("user_login",
    slog.String("user_id", userID),
    slog.String("ip", remoteAddr),
    slog.String("trace_id", span.SpanContext().TraceID().String()),
    slog.Duration("auth_duration", time.Since(start)),
)

Every log line should include:

- a trace ID, so the line can be joined to its span
- the service name and version
- stable, typed fields (user_id, status, duration) rather than values interpolated into the message string

With this, your logs become a queryable database. Correlating all events for a single user across 30 minutes of logs becomes a filter user_id="abc123", not a grep through gigabytes of text.

SLOs: Observability with Stakes

Metrics only matter when they are tied to commitments. Service Level Objectives (SLOs) define what “good” means for your service, for example:

- Availability: 99.9% of requests return a successful response, measured over a rolling 30-day window
- Latency: P99 request latency stays below an agreed threshold

The error budget is the gap between perfect and your SLO: for 99.9% availability, you have 43.8 minutes of downtime per month to spend. When you burn through your error budget, you stop shipping features and focus on reliability. When the budget is healthy, you ship.

This is the discipline that ties observability to business outcomes. The metrics matter because they measure against a promise.

Conclusion

Observability is infrastructure. It requires upfront investment in instrumentation, a collector pipeline, and a storage backend. Teams that skip this investment pay for it during every production incident — groping in the dark with logs that contain no context, no trace IDs, no structured fields. Deploy observability first, before you need it. By the time you need it, it is already too late to build it.