The case for AI that explains instead of acts

There's a wave of agentic AI tooling being built for incident response. Engineers are right to be skeptical. Here's why Wachd is built to diagnose, not remediate.

There's a pattern emerging in the AI SRE space: tools that don't just surface what's wrong, but take action. Auto-remediate. Auto-rollback. Auto-scale. The pitch is that you remove humans from the loop and incidents resolve faster.

Engineers are right to be skeptical.

Production systems are not deterministic. An agent that rolls back a deploy because a metric crossed a threshold will sometimes be correct. It will also sometimes roll back a good deploy, mask an underlying infrastructure problem, or trigger a cascade by acting faster than the team can reason about what's happening. The blast radius of a wrong autonomous action in production is not the same as a wrong recommendation.

What "bounded" means in practice

Wachd doesn't touch your infrastructure. It has read-only access to your repos, your logs, and your metrics. It cannot deploy, roll back, restart, or scale anything. When an alert fires, it collects context, strips PII, and gives your on-call engineer a plain-English diagnosis before the page even hits their phone.

That's the boundary. The engineer decides what to do. Wachd decides what probably happened.
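As a rough illustration of the "collect context, strip PII" step, here is a minimal sketch. The function names and the two regular expressions are assumptions made for this example, not Wachd's actual implementation, and a real redactor would need to cover many more kinds of PII than email and IPv4 addresses.

```python
import re

# Hypothetical patterns for this sketch; real PII coverage is much broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def strip_pii(line: str) -> str:
    """Replace email and IPv4 addresses with fixed placeholders."""
    line = EMAIL_RE.sub("[EMAIL]", line)
    return IPV4_RE.sub("[IP]", line)


def collect_context(log_lines: list[str]) -> list[str]:
    """Read-only step: build redacted copies and leave the input untouched."""
    return [strip_pii(line) for line in log_lines]


raw = ["login failed for alice@example.com from 10.0.0.12"]
redacted = collect_context(raw)
```

The point of the sketch is the shape of the boundary: every function takes data in and returns a copy. Nothing in this path holds a handle that could write to a cluster, a repo, or a deploy pipeline.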

This isn't a limitation; it's a deliberate design choice. The teams that will actually adopt AI in their incident workflow and keep it are the ones where the AI earns trust by being right and transparent, not by being fast and opaque. Trust is built incrementally. You don't get there by letting an agent take autonomous actions in production before it has proven itself over hundreds of incidents.

What "explainable" means in practice

The analysis Wachd produces is not a confidence score or a category label. It's a short paragraph: which metric crossed its threshold, what changed in the service in the 30 minutes before the alert, what the error logs show, and what the probable cause is.

The engineer can read it and disagree. They can see where it came from — the specific commits, the specific log lines, the specific metric window. If the AI got it wrong, the engineer can see exactly why it got it wrong and correct course. There's no black box.

This matters at 3am more than any other time. An opaque "anomaly detected" is not useful. A paragraph that says "memory usage spiked 4 minutes after the last deploy, error logs show OOM kills on the worker pods, probable cause is the new batch job introduced in commit abc1234" is something you can act on.
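One way to picture a diagnosis that carries its own evidence is a small record type that keeps the paragraph and its sources together. This is a sketch only: the class, its field names, and the rendered layout are all hypothetical, and the values reuse the example from the paragraph above.

```python
from dataclasses import dataclass, field


@dataclass
class Diagnosis:
    # All field names are hypothetical, chosen for this sketch.
    summary: str          # what happened, in plain English
    probable_cause: str   # the hypothesis the engineer can accept or reject
    commits: list[str] = field(default_factory=list)    # commit IDs cited as evidence
    log_lines: list[str] = field(default_factory=list)  # log excerpts cited as evidence

    def render(self) -> str:
        """Return the paragraph followed by the evidence it was derived from."""
        parts = [f"{self.summary} Probable cause: {self.probable_cause}."]
        parts.extend(f"Commit: {sha}" for sha in self.commits)
        parts.extend(f"Log: {line}" for line in self.log_lines)
        return "\n".join(parts)


example = Diagnosis(
    summary="Memory usage spiked 4 minutes after the last deploy.",
    probable_cause="the new batch job introduced in commit abc1234",
    commits=["abc1234"],
    log_lines=["OOM kills on the worker pods"],
)
report = example.render()
```

Keeping the commits and log lines next to the conclusion is what lets the engineer disagree with it: if the cited commit is unrelated, that is visible in the report itself.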

Why agentic tooling is optimising for the wrong thing

Most of the agentic AI tooling being built right now is optimising for demo quality, not production trust. Autonomous remediation looks impressive in a demo — you watch a metric spike, an agent kicks in, the metric recovers, nobody had to do anything. It's a compelling story.

It looks different when the agent fires in a production incident your team didn't expect. When the rollback it triggered was the wrong rollback. When it masked a deeper problem that keeps coming back. When the engineer on call can't explain what the system did or why, because the reasoning was never surfaced.

The agentic tools that are shipping today are being tested in controlled environments against known failure modes. Production is different. The failure modes you haven't seen before are exactly the ones where you most need a human in the loop — and where an agent acting on incomplete context is most likely to make things worse.

The diagnostic layer

Wachd is built to be the diagnostic layer, not the autonomous operator. The job is to close the gap between "alert fired" and "engineer understands what happened" — not to close the gap between "alert fired" and "system fixed itself."

That first gap is real and it's costly. Every team I've talked to spends 20 to 60 minutes on the investigation step before they can take any useful action. That time isn't wasted because of bad tooling — it's wasted because the correlation across repos, logs, and metrics has to be done manually every single time.

Automating that correlation and handing the result to the engineer immediately after the page goes out — that's where the value is. The engineer still makes the call. They just make it with the context already assembled instead of spending 40 minutes assembling it themselves.
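The simplest piece of that correlation is finding what changed shortly before the alert. A minimal sketch follows, assuming commits arrive as hypothetical `(sha, deployed_at)` pairs and using the 30-minute lookback mentioned earlier as the default. Log and metric correlation would follow the same windowing idea but are left out here.

```python
from datetime import datetime, timedelta

# Hypothetical shape for this sketch: (commit SHA, time it reached production).
Commit = tuple[str, datetime]


def commits_before_alert(
    commits: list[Commit],
    alert_time: datetime,
    window: timedelta = timedelta(minutes=30),
) -> list[Commit]:
    """Return commits deployed within `window` before the alert, newest first."""
    start = alert_time - window
    recent = [c for c in commits if start <= c[1] <= alert_time]
    return sorted(recent, key=lambda c: c[1], reverse=True)


# Illustrative data only.
alert = datetime(2024, 1, 1, 3, 0)
history = [
    ("old0001", datetime(2024, 1, 1, 2, 0)),    # outside the window
    ("abc1234", datetime(2024, 1, 1, 2, 56)),   # 4 minutes before the alert
    ("new0002", datetime(2024, 1, 1, 3, 5)),    # after the alert, ignored
]
suspects = commits_before_alert(history, alert)
```

The function only narrows the list of candidates. Deciding whether a suspect commit is the cause, and what to do about it, stays with the engineer.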
