
The Incident Is the Outlier

Sampling-based tracing is optimized for showing you normal requests on normal days. Incidents are neither. Why the primitive fails at incident reconstruction.

Ahmed Adly · April 22, 2026

At some point every SRE discovers that the incident they most need to debug is the one their tracing system can't show them. The dashboards light up. The alerts fire. But when you go looking for the specific chain of calls that caused the failure, you find two samples from the last four minutes, and neither happens to be the request that broke.

This isn't a bug. It's the intended behavior of sampling-based tracing.

Most distributed tracing systems are built on a probabilistic model. Somewhere in the request path, usually at the root, the system flips a coin: keep this trace, or throw it away. The coin is weighted to keep one in a hundred, or ten per second, or whatever the budget allows. Datadog's trace agent defaults to roughly ten traces per second per instance. At any non-trivial QPS that means almost everything hits the floor.
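For concreteness, here is a minimal sketch of what a head-based sampler does. The 1% rate and ten-per-second budget are illustrative numbers, not any vendor's actual defaults, and real agents layer priority rules and per-service limits on top of this:

```python
import random
import time

class HeadSampler:
    """Head-based sampler: decide at the trace root, before anything has happened.

    Combines a probabilistic rate (keep ~1%) with a hard per-second budget
    (keep at most 10 traces/s), which is roughly the shape most agents use.
    All numbers are illustrative.
    """

    def __init__(self, sample_rate=0.01, max_per_second=10):
        self.sample_rate = sample_rate
        self.max_per_second = max_per_second
        self._window_start = time.monotonic()
        self._kept_this_second = 0

    def should_keep(self) -> bool:
        now = time.monotonic()
        if now - self._window_start >= 1.0:   # start a new one-second budget window
            self._window_start = now
            self._kept_this_second = 0
        if self._kept_this_second >= self.max_per_second:
            return False                      # budget exhausted: drop
        if random.random() >= self.sample_rate:
            return False                      # lost the coin flip: drop
        self._kept_this_second += 1
        return True                           # keep: every child span inherits this decision
```

The decision is made before the request has done anything interesting, which is exactly why it can't privilege the requests that turn out to matter.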

This works fine for one job: showing you what a normal request looks like on a normal day. It fails at another: reconstructing what happened during the two minutes when everything broke.

The math is not subtle. An incident is a rare event by definition. If your failure fires on one request in ten thousand, and you sample one in a thousand, you expect to capture somewhere between zero and one failing trace per minute. You see the aftermath. Elevated error rates, latency fans, circuit breakers opening. But not the request that broke. You piece together the story from symptoms instead of from the chain.
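Worked through with a hypothetical 100k QPS of traffic, the expectation looks like this:

```python
qps = 100_000              # hypothetical traffic level
failure_rate = 1 / 10_000  # one failing request in ten thousand
sample_rate = 1 / 1_000    # keep one trace in a thousand

failing_per_minute = qps * 60 * failure_rate             # 600 failing requests/min
captured_per_minute = failing_per_minute * sample_rate   # 0.6 captured traces/min

# Probability that a given minute of the incident contains *zero* traces
# of the failing path (treating captures as independent coin flips):
p_zero = (1 - sample_rate) ** failing_per_minute          # ~0.55
print(captured_per_minute, p_zero)
```

Under these assumptions, more often than not a given minute of the incident contains no trace of the failing path at all, and the rest of the time it contains one if you're lucky.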

Most of the time this is fine. Most incidents are broad enough that the pattern shows up across many requests, and you can reconstruct what happened from dashboards. But the incidents where it doesn't work are exactly the hard ones. A specific interleaving of state changes across three services. A subtle cascade that only fires on a narrow slice of traffic. Those are the incidents that stretch from minutes into hours, and end in postmortems with twenty bullet points about tooling gaps.

Two jobs, one pipeline

I think the core problem is that the industry conflated two different jobs. Observability means maintaining a rolling understanding of system health. It's inherently statistical. It tolerates gaps in signal because what you want is a summary. Incident reconstruction is the opposite. You need one specific event. You need the chain that broke, not a representative sample of chains that didn't. A gap in signal isn't a cost you can amortize. It's the failure.

When a team adopts a sampled tracing pipeline, they're implicitly optimizing for the first job and hoping the second comes free. Most of the time it does. The fraction of time it doesn't is the reason postmortems so often have that specific cadence: investigation was slower than recovery, because the tools that should have helped with investigation weren't designed for investigation.

The obvious counter is tail-based sampling. Instead of flipping the coin at the start, wait until the trace completes and decide whether to keep it. Keep errors. Keep slow requests. Keep specific endpoints. This is genuinely better than head-based sampling, and honest teams reach for it. But it has its own problem: tail-based sampling captures the failures you predicted, not the ones that actually happen. The retention rules were written before the incident. New failure modes that don't match old criteria fall through.
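A sketch of what those retention rules tend to look like once the complete trace is in hand; the thresholds and endpoint names below are hypothetical, and that is precisely the weakness, since they encode yesterday's failure modes:

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    duration_ms: float
    error: bool = False

@dataclass
class Trace:
    root: str                                   # root endpoint, e.g. "/checkout"
    spans: list[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return sum(s.duration_ms for s in self.spans)

    @property
    def has_error(self) -> bool:
        return any(s.error for s in self.spans)

# Retention rules written *before* the incident.
SLOW_THRESHOLD_MS = 2_000
ALWAYS_KEEP_ENDPOINTS = {"/checkout", "/login"}  # hypothetical "important" routes

def keep(trace: Trace) -> bool:
    if trace.has_error:
        return True                  # keep anything that errored
    if trace.duration_ms > SLOW_THRESHOLD_MS:
        return True                  # keep anything slow
    if trace.root in ALWAYS_KEEP_ENDPOINTS:
        return True                  # keep endpoints someone predicted would matter
    return False                     # drop everything else
```

The rules are reasonable, and they still drop the fast, error-free trace whose only crime was writing bad data that a downstream reads an hour later.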

All sampling, head or tail or adaptive, optimizes for cost per unit of typical observability value. The word "typical" is doing all the work. It means signal averaged across the normal request population. Incident response needs something else entirely. You don't want the average. You want the specific. Averaging is the enemy of the incident, not its tool.

Three shapes of failure

There are a few patterns I keep seeing in outages that defeat sampling.

One is the needle outage. A request pattern fires on a narrow slice of traffic: one tenant, one combination of feature flags, one upstream state. It triggers a cascade that amplifies across the stack. At 10k QPS the trigger fires twenty times a minute. With 1% sampling you expect to capture 0.2 of them. You won't have the trace. You'll have elevated latency on one downstream, and you'll spend the next hour guessing which upstream caused it. Clerk's September 2025 database incident had a version of this: connection recycling bursts happened within seconds every fifteen minutes, but monitoring sampled at sixty-second intervals, so a periodic pattern appeared as sporadic noise.

Another is the state-coupling outage. A piece of shared state, a cache key or a connection pool, transitions into a degraded mode through an ordering of events across three services. Each individual service's traces look normal. The problem lives in the composition, in the specific sequence. Sampling makes this unreconstructable, because a random subset of a sequence is not a sequence.

Then there's the silent-failure outage. A service fails in a way that looks like success. The error rate doesn't change. The latency doesn't change. A downstream starts doing the wrong thing with wrong data. By the time the customer-visible failure surfaces, the trigger is hours old and has fallen out of the retention window.

Not every outage fits these shapes. Cloudflare's July 2019 regex incident was pure CPU exhaustion: a WAF rule with a catastrophically backtracking regular expression pinned every edge server to 100%. The signal lived in host-level CPU metrics, plain as day, and no amount of better distributed tracing would have changed the outcome. The point isn't that sampling always fails. It's that for a specific class of incidents, the long-tail cross-service failures that fill the hardest postmortems, sampling is structurally the wrong primitive.

The cost problem is real, but mislocated

The usual response is that full-fidelity tracing costs too much. At the shipping layer, this is true. Capturing every span at 100k QPS and sending it to a backend in real time would saturate your egress and multiply your observability bill by a number no finance team will approve. Sampling exists for a reason.

But there's a distinction that gets lost. There's a difference between capturing events in-process for a short window, which costs a few megabytes of memory per service, and shipping all events to a remote backend, which costs real money in network and storage. Those two costs can be separated.
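Back-of-envelope, with every figure below an assumption rather than a measurement, the two costs sit orders of magnitude apart:

```python
# All figures are assumptions for illustration, not measurements.
fleet_qps = 100_000          # requests per second across the whole system
spans_per_request = 10       # average spans created per request, fleet-wide
span_bytes = 400             # rough serialized size of one span

# Cost of shipping every span to a backend, continuously:
egress = fleet_qps * spans_per_request * span_bytes
print(f"ship-everything egress: {egress / 1e6:.0f} MB/s")   # ~400 MB/s, forever

# Cost of one service instance holding its own recent spans in memory:
instance_qps = 500           # requests handled by one instance
local_spans_per_request = 5  # spans this service itself creates per request
window_seconds = 10          # how much recent history the buffer keeps
buffer = instance_qps * local_spans_per_request * span_bytes * window_seconds
print(f"per-instance buffer: {buffer / 1e6:.0f} MB")         # ~10 MB of RAM
```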

A ring buffer that holds the last N seconds of spans in local memory and flushes only when something goes wrong pays for capture continuously and for shipping rarely. When the alert fires, the last N seconds of every involved service are already sitting in memory, waiting to be assembled into the actual trace. You didn't sample anything. You just delayed the shipping decision until you knew what mattered.
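Here is a minimal sketch of that idea, assuming an in-process SDK that appends each completed span to a bounded buffer and ships only on an explicit trigger; none of this is any particular vendor's API:

```python
import threading
import time
from collections import deque

class SpanRingBuffer:
    """Keep the last `window_seconds` of locally produced spans in memory.

    Capture is always on and costs only RAM. Shipping happens only when
    flush() is called -- typically wired to an alert, an error-budget burn,
    or an explicit "grab everything now" signal during an incident.
    """

    def __init__(self, window_seconds=10.0):
        self.window_seconds = window_seconds
        self._spans = deque()
        self._lock = threading.Lock()

    def record(self, span: dict) -> None:
        """Called on every span completion. Never samples, never ships."""
        now = time.monotonic()
        with self._lock:
            self._spans.append((now, span))
            cutoff = now - self.window_seconds
            while self._spans and self._spans[0][0] < cutoff:
                self._spans.popleft()       # evict spans older than the window

    def flush(self, ship) -> int:
        """Ship everything currently in the window, e.g. when an alert fires."""
        with self._lock:
            batch = [span for _, span in self._spans]
            self._spans.clear()
        ship(batch)                          # send to the backend after we know it matters
        return len(batch)
```

Reassembling the cross-service chain still means every involved service flushes and the backend joins spans by trace ID, which is part of the new surface area described below.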

This is a different architecture, not a parameter tweak. The window is bounded, so triggers that predate it are still lost. The flush mechanism is new surface area. The SDK has to live in-process, which isn't how sidecar agents work. These are real constraints.

But the thing you get is concrete: at the moment the alert fires, the actual request chain that caused the failure exists. Captured. Readable. The incident is no longer the outlier your system was designed to discard. It's the event the whole pipeline was built to catch.

So what

I don't think sampling-based tracing is a mistake. It's a good solution to the wrong problem. The teams that love their tracing setup tend to have the easy kind of incidents, where the blast radius is wide enough that the signal survives sampling. The teams that keep getting burned are the ones who reach for their traces at 3am and find the cupboard bare at the moment that matters.

If that's you, the fix is not to tune your sampling rate. It's to notice that what you actually need is not observability. It's incident reconstruction. And those are different problems that want different architectures.