Why incident investigation stays manual even with good observability
You have Grafana, Prometheus, Loki, maybe Datadog. The dashboards are great. Your on-call engineer still spends 45 minutes figuring out what happened. Here's why the tooling doesn't close that gap.
Most engineering teams I've talked to have genuinely good observability now. Prometheus for metrics, Grafana for dashboards, Loki or Datadog for logs. They've instrumented their services, they have meaningful alerts, they can tell when something is broken within a minute or two of it happening.
And yet when an alert fires at 3am, the on-call engineer still spends 30 to 60 minutes figuring out what actually happened before they can do anything useful. The tooling got better; the investigation didn't.
There's a specific gap that observability tools don't close, and I don't think it gets talked about enough.
What observability tools are actually good at
Modern observability tooling is good at answering one question: what is broken right now?
You get an alert. You open Grafana. You can see error rate spiking on the payment service. You can see latency went up at 2:47am. You can see the database connection pool is at 98%. You can see which pods are restarting. The tooling is very good at showing you the current state of your system and what metrics are outside their normal range.
That's genuinely valuable. Without it you'd be flying blind. But it's not the same as knowing why.
The gap: from anomaly to explanation
Knowing that error rate spiked at 2:47am doesn't tell you why it spiked. To get from the anomaly to an explanation, you need to answer a different set of questions:
- Did anything deploy to this service recently? When exactly?
- Did the errors start before or after that deploy?
- What do the error logs actually say: is it a null pointer, a timeout, a third-party API failing?
- Is this service the origin of the problem or is it downstream of something else that broke first?
- Has this same error pattern happened before? What fixed it last time?
None of those questions are answered by Grafana. The dashboard shows you what, not why. To answer them you have to leave Grafana, open GitHub, look at the deploy history, go back to Loki and filter logs to the right time window, cross-reference the timestamps, and check whether another service was also misbehaving around the same time.
That's the manual investigation. Every incident, every time, regardless of how good your dashboards are.
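To make the "before or after the deploy" question concrete, here is a minimal Python sketch of the cross-referencing an engineer does by eye. Everything in it is hypothetical: the sample data, the commit SHAs, the 5% error-rate threshold, and the 30-minute lookback are illustrative assumptions, not values from any real tool.

```python
from datetime import datetime, timedelta

def first_error_spike(samples, threshold):
    """Return the timestamp of the first sample whose error rate exceeds threshold, or None."""
    for ts, rate in samples:
        if rate > threshold:
            return ts
    return None

def deploy_preceding(deploys, onset, window=timedelta(minutes=30)):
    """Return the most recent deploy that landed within `window` before `onset`, or None."""
    candidates = [d for d in deploys if timedelta(0) <= onset - d["at"] <= window]
    return max(candidates, key=lambda d: d["at"]) if candidates else None

# Hypothetical metric samples: (timestamp, error rate as a fraction of requests).
samples = [
    (datetime(2024, 1, 1, 2, 40), 0.01),
    (datetime(2024, 1, 1, 2, 45), 0.01),
    (datetime(2024, 1, 1, 2, 47), 0.12),
    (datetime(2024, 1, 1, 2, 50), 0.15),
]

# Hypothetical deploy history for the affected service.
deploys = [
    {"sha": "abc123", "at": datetime(2024, 1, 1, 1, 0)},
    {"sha": "def456", "at": datetime(2024, 1, 1, 2, 41)},
]

onset = first_error_spike(samples, threshold=0.05)
suspect = deploy_preceding(deploys, onset)
print(f"errors started at {onset:%H:%M}; suspect deploy: {suspect['sha'] if suspect else 'none'}")
```

A deploy landing shortly before the onset is only a lead, not proof of causation. The point is that the comparison itself is mechanical once both timestamps sit in the same place.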
Why the tools don't help here
Observability tools are designed to show you data. They're good at visualization, querying, and alerting on thresholds. They're not designed to correlate data across sources and give you a causal narrative.
Datadog has some features that gesture toward this — the deployment markers on graphs help, the log correlation helps a bit. But none of it closes the loop automatically. You still have to be the one looking at the deploy marker, looking at when the error rate changed, deciding whether those two things are related.
The reason observability tools don't do this is partly that it's a hard problem, and partly that it's genuinely outside their scope. They're data platforms. The causal reasoning on top of the data is left to the engineer.
Which is fine when you have an experienced engineer who knows the system well, it's 2pm, and they have 40 minutes to investigate. It's a much bigger problem when it's a 3am page, the engineer on call is the only person awake, and every minute of confusion is costing the business.
The institutional knowledge problem
There's another layer to this. The gap between anomaly and explanation isn't just about data — it's also about context that isn't in any dashboard.
The experienced engineer who's been on the team for two years knows that when the payment service throws NullPointerException in that specific class, it's almost always because the feature flag service returned an unexpected value. They know this because they've seen it three times. That knowledge lives in their head, maybe in a Slack thread from eight months ago, probably not in any runbook.
When a new engineer gets paged with the same alert at 3am, they don't have that context. The dashboards don't have it either. They start from zero.
This is why on-call is so much harder for junior engineers and for engineers who are new to a service — not because the tooling is different, but because the investigation requires context the tooling doesn't surface.
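One way to picture what "surfacing that context" could look like is a small lookup from error signatures to notes from past incidents. This is only a sketch: the `PaymentProcessor` class name, the regular expression, and the note text are hypothetical stand-ins for whatever a team would actually record.

```python
import re

# Hypothetical knowledge base: each entry pairs a log signature with what
# the team learned the last time it appeared.
KNOWN_PATTERNS = [
    {
        "pattern": re.compile(r"NullPointerException.*PaymentProcessor"),
        "note": "Usually the feature flag service returned an unexpected value.",
    },
]

def lookup_known_causes(log_line):
    """Return notes for every known pattern that matches the log line."""
    return [p["note"] for p in KNOWN_PATTERNS if p["pattern"].search(log_line)]

line = "java.lang.NullPointerException at com.example.PaymentProcessor.charge"
for note in lookup_known_causes(line):
    print("Seen before:", note)
```

Even a list this crude puts the two-year engineer's memory in front of whoever gets paged, instead of leaving it in a Slack thread.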
What would actually close the gap
The investigation is always the same sequence. Build a timeline: what changed, what broke first, what's the probable chain of causation. The data to build that timeline exists — it's in GitHub, in Loki, in Prometheus. It just hasn't been assembled automatically.
The thing that would actually help is: when the alert fires, automatically pull the data that's relevant to that specific service and that specific time window, correlate it, and present a narrative to the engineer before they start their manual investigation. Not replace the investigation — give them a starting point with evidence already assembled.
The engineer still makes the judgment call. They can see the evidence and disagree with the automated interpretation. But they start with something instead of nothing.
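The timeline step described above is mostly a merge-and-sort over events from different sources. A minimal Python sketch, assuming each source has already been fetched and normalized into a common `Event` shape (the `Event` type, the source labels, and the sample events are all hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Event:
    at: datetime
    source: str   # e.g. "github", "loki", "prometheus"
    summary: str

def build_timeline(*sources):
    """Merge events from all sources into one list ordered by time."""
    return sorted((e for src in sources for e in src), key=lambda e: e.at)

def render(timeline):
    """Format each event as a single human-readable line."""
    return [f"{e.at:%H:%M} [{e.source}] {e.summary}" for e in timeline]

# Hypothetical events, one list per source.
metrics = [Event(datetime(2024, 1, 1, 2, 47), "prometheus", "error rate above threshold")]
logs = [Event(datetime(2024, 1, 1, 2, 46), "loki", "first timeout calling database")]
deploys = [Event(datetime(2024, 1, 1, 2, 41), "github", "deploy def456")]

timeline = build_timeline(metrics, logs, deploys)
print("\n".join(render(timeline)))
```

Read top to bottom, the output already answers "what changed, what broke first" for this toy case. The hard part in practice is fetching and normalizing the events, not ordering them.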
Most of the time the explanation is simple. Bad deploy. Config change that had an unexpected interaction. Third-party dependency having a bad hour. The investigation takes 40 minutes because you had to manually collect and correlate the data — not because the root cause was actually hard to understand once you had it.
How Wachd approaches this
When an alert fires, Wachd automatically collects the context bundle: the last 10 commits to the affected service from GitHub, 30 minutes of error logs from Loki or Datadog, and metric history from Prometheus around the alert window. PII is stripped before any of it touches the AI backend. Then it runs the correlation and sends the on-call engineer a plain-English probable cause alongside the page.
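To illustrate the shape of such a bundle, here is a rough Python sketch. It is not Wachd's code or API: the function names, the dictionary layout, the choice to take the 30 minutes of logs preceding the alert, and the email-only redaction are all assumptions made for the example. Metric history is left out to keep it short, and real PII stripping would need to cover far more than email addresses.

```python
import re
from datetime import datetime, timedelta

# Assumption: email addresses stand in for PII here; a real redactor covers much more.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def redact(line):
    """Replace email addresses in a log line with a placeholder."""
    return EMAIL.sub("[REDACTED_EMAIL]", line)

def build_bundle(alert_at, commits, logs, log_window=timedelta(minutes=30), max_commits=10):
    """Assemble recent commits and redacted logs for the window before the alert."""
    start = alert_at - log_window
    recent_commits = sorted(commits, key=lambda c: c["at"], reverse=True)[:max_commits]
    window_logs = [redact(msg) for ts, msg in logs if start <= ts <= alert_at]
    return {"window": (start, alert_at), "commits": recent_commits, "logs": window_logs}

# Hypothetical inputs.
alert_at = datetime(2024, 1, 1, 2, 47)
commits = [{"sha": f"c{i:02d}", "at": alert_at - timedelta(hours=i)} for i in range(1, 13)]
logs = [
    (alert_at - timedelta(minutes=45), "healthcheck ok"),
    (alert_at - timedelta(minutes=5), "charge failed for jane@example.com"),
]

bundle = build_bundle(alert_at, commits, logs)
print(len(bundle["commits"]), bundle["logs"])
```

Redacting before the bundle leaves the collection step means nothing downstream has to be trusted with the raw log lines.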
It doesn't replace Grafana or Prometheus — it reads from them. Your existing observability stack stays in place. Wachd is the layer that does the cross-source correlation automatically so the engineer doesn't have to.
Self-hosted, runs in your Kubernetes cluster, Apache 2.0. Air-gapped mode with Ollama for teams that can't send incident data outside their network.
Better observability is necessary but not enough
The answer to this isn't to buy more observability tooling. More dashboards don't close the last mile between anomaly and explanation — they just give you more places to look manually. The gap isn't data availability, it's the correlation step that has to happen after the data is available.
That step is still manual for almost every team. And it will stay manual until something automates the collection and correlation, not just the visualization.