
The First 20 Minutes of Every Incident Are Wasted

Four engineers. Four dashboards. Zero agreement. The problem isn't observability — it's that nobody is looking at the same thing.

Ahmed Adly · March 30, 2026

An alert fires at 2:47am. Four engineers join a Slack huddle by 2:52.

One is scrolling Grafana. One is tailing CloudWatch logs. One has Datadog APM open. The fourth is running kubectl logs in a terminal. They're all looking at the same incident. None of them are looking at the same thing.

For the next 18 minutes, they narrate what they see. "I'm seeing 503s on checkout." "Payments latency spiked at 2:44." "I don't see anything on my end." "Wait, which service are you looking at?"

At 3:10, someone says: "OK, so we think it started in payments?"

The fix takes 7 minutes. The agreement took 23.

This isn't a bad-team problem

This is every team. The pattern is so consistent it barely registers as a problem anymore.

An engineer gets paged. They open their preferred observability tool. They start forming a theory. Other engineers join and do the same — each from their own tool, their own slice of the system. For the next 10–20 minutes, the room isn't debugging. It's aligning. People are comparing notes, narrating screenshots, asking "which dashboard are you on?"

Google noticed this internally. The Google SRE Workbook recommends that "when three or more people work on an incident, it's useful to start a collaborative document that lists working theories, eliminated causes, and useful debugging information, such as error logs and suspect graphs." The fact that Google — with probably the best internal tooling on earth — still needs a shared doc to align engineers says something about the state of the art.

The Grafana Labs 2025 Observability Survey found that SREs use an average of 18 data sources. Developers average 10. Alert fatigue was cited as the number one obstacle to faster incident response. And engineering managers flagged "painful incident coordination across teams" as the biggest bottleneck — ahead of alert fatigue.

Nobody lacks data. Everybody lacks agreement.

The convergence gap

There's a gap between "the alert fires" and "the team agrees on what happened." I call it the convergence gap. It's the most expensive part of every incident, and no tool is designed to close it.

Observability tools are built for investigation. They're excellent at it. Grafana, Datadog, Honeycomb — they help a single engineer dig into a single signal. But they're vertical. Logs in one pane. Metrics in another. Traces in a third. Each engineer picks their starting point and reconstructs the incident from their own angle.

The convergence gap is what happens when four reconstructions have to become one. It's not a tooling failure. It's a structural gap. The tools give you data. Nothing gives you agreement.

Runframe's State of Incident Management 2025 report found that operational toil rose to 30% — the first increase in five years — despite significant AI investment. 73% of organizations experienced outages linked to ignored or suppressed alerts. By one estimate, the toil costs roughly $37,500 per engineer per year — 30% of an average salary spent on manual incident work instead of building. The tools are getting better. The coordination problem isn't.

What convergence actually looks like

Here's what changes when the incident starts from a shared artifact instead of four separate dashboards:

The alert fires. A causal trace is already assembled — built from the SDK-captured call relationships between services in the seconds before the alert. It shows the chain: which service called which, in what order, how long each call took, and where the first non-200 status appeared.
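To make "causal trace" concrete, here is a minimal sketch of the idea. The frame shape, field names, and the `firstBreak` helper are invented for illustration; they are not Incidentary's actual schema or API.

```typescript
// Hypothetical shape of one frame in a causal trace. Each frame records
// one service-to-service call: who called whom, when, and with what result.
interface TraceFrame {
  service: string;    // the caller
  callee: string;     // the service being called
  startedAt: number;  // epoch ms
  durationMs: number;
  status: number;     // HTTP status of the call
}

// "Where the first non-200 status appeared": walk the frames in call
// order and return the earliest one that didn't succeed.
function firstBreak(frames: TraceFrame[]): TraceFrame | undefined {
  return [...frames]
    .sort((a, b) => a.startedAt - b.startedAt)
    .find((f) => f.status !== 200);
}

const frames: TraceFrame[] = [
  { service: "web", callee: "checkout", startedAt: 1000, durationMs: 120, status: 200 },
  { service: "checkout", callee: "payments", startedAt: 1020, durationMs: 3000, status: 503 },
  { service: "payments", callee: "card-processor", startedAt: 1025, durationMs: 2990, status: 504 },
];

console.log(firstBreak(frames)?.callee); // "payments"
```

The point of the structure is that the ordering and the first break are computed once, deterministically, rather than re-derived in four heads from four dashboards.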

One engineer opens the trace link. Another engineer opens the same link. They're looking at the same frames, the same gaps, the same first confirmed break. No narration required. No "which dashboard are you on?" No 15 minutes of comparing notes.

If something is unclear, an engineer pins an annotation to a specific frame in the trace. "This 503 from payments — was this the card processor timeout?" The annotation is attached to the causal event, not buried in a Slack thread that nobody will find tomorrow.
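In data terms, "pinned to the causal event" just means the note is keyed by the event's identity. This is a hypothetical sketch of that idea, not the real annotation API:

```typescript
// A note attached to a specific causal event by id, so the question
// lives on the evidence instead of scrolling away in a chat thread.
interface Annotation {
  eventId: string;
  author: string;
  text: string;
  createdAt: number;
}

const annotations = new Map<string, Annotation[]>();

function pin(eventId: string, author: string, text: string): void {
  const list = annotations.get(eventId) ?? [];
  list.push({ eventId, author, text, createdAt: Date.now() });
  annotations.set(eventId, list);
}

// The 503 frame gets the question attached directly to it.
pin("evt-503-payments", "maria", "Was this the card processor timeout?");
```

Anyone who opens the trace tomorrow finds the question exactly where the 503 is, which is the whole difference from a Slack thread.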

The trace also shows what it doesn't know. Gaps in coverage are labeled, not hidden. If a service in the chain wasn't instrumented, the trace says so — a visible gap, not a silent blind spot. You know exactly where the evidence stops and where you'll need to dig manually.

No one is guessing about completeness. No one is narrating their screen. The room starts from one picture.

What this isn't

I want to be precise about what I'm describing, because the last thing you need at 3am is a tool that overpromises.

This is not real-time collaborative editing. There's no live cursor showing where your teammate is looking. It's not Google Docs for traces.

It's a shared artifact. Everyone who opens the link sees the same causal chain, can annotate specific frames, and knows exactly what's observed versus what's missing. The value isn't real-time synchronization — it's that the artifact exists before anyone opens it, and it shows the same thing to everyone who does.

That distinction matters. A dashboard requires interpretation. A causal trace is the interpretation — the chain of calls that actually happened, assembled deterministically from what the SDKs captured. You don't have to build a mental model from four different graphs. The model is already there.

This is not APM

APM tells you your p95 latency is 340ms. It shows you a flame graph for a single request. It gives you metrics about your services.

None of that helps when the room can't agree on which service broke first.

The gap isn't "we need more data." It's "we need the same data, structured as a causal chain, visible to everyone, ready before we even join the call." That's a different product category.

APM is for investigation. This is for the step before investigation — the step where the team converges on a shared understanding of what happened so they can investigate together instead of in parallel.

I built Incidentary because I got tired of the chaos that follows an incident alert. The trace is assembled before the alert fires — from a ring buffer of causal events captured by the SDK. When the page lands, the artifact is already there. One link. Same picture. Start debugging.
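The ring-buffer idea is simple enough to sketch. This is a generic illustration of keeping only the last N events so a trace can be assembled from the moments before an alert; the capacity and eviction policy here are assumptions, not the SDK's internals.

```typescript
// Fixed-capacity buffer: new events overwrite the oldest ones, so memory
// stays bounded while the most recent window is always available.
class RingBuffer<T> {
  private buf: T[] = [];
  private head = 0; // next write position
  constructor(private capacity: number) {}

  push(item: T): void {
    if (this.buf.length < this.capacity) {
      this.buf.push(item);
    } else {
      this.buf[this.head] = item; // overwrite the oldest event
    }
    this.head = (this.head + 1) % this.capacity;
  }

  // Oldest-first snapshot, taken when the alert fires.
  snapshot(): T[] {
    if (this.buf.length < this.capacity) return [...this.buf];
    return [...this.buf.slice(this.head), ...this.buf.slice(0, this.head)];
  }
}

const events = new RingBuffer<string>(3);
["a", "b", "c", "d"].forEach((e) => events.push(e));
console.log(events.snapshot()); // ["b", "c", "d"] -- "a" was evicted
```

Because the buffer is always full of recent history, the artifact can exist before anyone opens it: assembling the trace is a read of the window, not a query someone has to think to run.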

Try it

npm install @incidentary/sdk-node

One middleware on one service. When your next incident happens, you'll have a trace link to share instead of a Slack thread to scroll.
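If you want a feel for what "one middleware" amounts to before reading the quickstart, here is the rough shape. `captureCausalEvents` is an invented name standing in for the SDK's middleware, and the request types are stripped down; the actual setup is in the quickstart.

```typescript
// Minimal middleware shape, framework-agnostic for illustration: record
// the call as a causal event, then get out of the request path.
type Handler = (
  req: { url: string },
  res: { statusCode: number },
  next: () => void
) => void;

// Stand-in for what an observability middleware broadly does.
function captureCausalEvents(log: string[]): Handler {
  return (req, res, next) => {
    log.push(`${req.url} -> ${res.statusCode}`); // capture, don't interfere
    next();
  };
}

const seen: string[] = [];
const middleware = captureCausalEvents(seen);
middleware({ url: "/checkout" }, { statusCode: 503 }, () => {});
console.log(seen); // ["/checkout -> 503"]
```

In an Express app this kind of handler would be registered once with `app.use(...)`, which is why instrumenting a single service is a one-line change.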

Free tier: 200K causal events/month.

Quickstart guide →