May 7, 2026·6 min read

Why alert fatigue survives even after you fix your routing

You deduplicate, group, tune thresholds, route to the right person. Noise goes down. Engineers are still burned out. Here's why.

You do the work. You deduplicate alerts. You set up proper grouping in Alertmanager. You tune thresholds so you're only waking people up for symptoms, not causes. You add maintenance windows, silence rules, severity tiers. The noise goes down. The team stops complaining.

And then six months later the on-call rotation is still burning people out. MTTR (mean time to resolve) hasn't moved. Engineers dread being on-call. You fixed the routing and the problem is still there.

The reason is that routing and context are different problems. Most teams solve the first one, run out of energy, and never get to the second.

What routing actually fixes

Routing is about delivery. Getting the right alert to the right person at the right time without waking up half the company for something irrelevant. Good routing means:

  • The person who gets paged is actually the one who should handle it
  • Related alerts are grouped — one incident, not fifty pages
  • Low-severity stuff goes to a Slack channel, not someone's phone at 2am
  • Expected noise during maintenance windows doesn't fire at all

All of that is real and worth doing. Teams that haven't done this work are in genuine pain. Fix it.
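
To make the four points above concrete, here is a minimal sketch of the routing decision in plain Python. It is not how Alertmanager is implemented or configured; the alert fields, the OWNERS table, and the "page" / "slack" destinations are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical ownership table: which rotation handles which service.
OWNERS = {"checkout": "payments-oncall", "search": "search-oncall"}


def route(alerts, in_maintenance=frozenset()):
    """Turn raw alerts into one notification per (service, alertname) group."""
    groups = defaultdict(list)
    for alert in alerts:
        if alert["service"] in in_maintenance:
            continue  # expected noise during a maintenance window: drop it
        groups[(alert["service"], alert["alertname"])].append(alert)

    notifications = []
    for (service, alertname), members in groups.items():
        severities = {a["severity"] for a in members}
        notifications.append({
            "service": service,
            "alertname": alertname,
            # Page a human only if something in the group is critical.
            "destination": "page" if "critical" in severities else "slack",
            "owner": OWNERS.get(service, "default-oncall"),
            "count": len(members),  # fifty raw alerts still produce one entry
        })
    return notifications
```

Notice what the output contains: who, where, and how many. Nothing in it says why the alert fired.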

What routing doesn't fix

Routing gets the alert to you. It does nothing about what's in the alert.

When it lands on your phone it still says "HighErrorRate firing" or "pod restart count > 5". You still have to figure out what service is actually affected, whether something deployed recently, whether this is the same thing as last Tuesday or something new, and where to even start looking.

The answers to those questions are scattered across separate tools. GitHub for recent commits. Datadog or Loki for error logs. Grafana or Prometheus for metric history. Slack for who mentioned this service last week. You open all of them, correlate what you're seeing, build the timeline yourself, then decide what to do.

Routing didn't change any of that. You just get paged less often now, and when you do get paged the investigation is exactly as long as it always was.

Why teams don't notice this right away

When you fix your routing, something interesting happens. Alert volume drops. Team morale improves. Management sees the metrics and declares the alert fatigue problem solved. There's a real improvement to celebrate.

But a few months in, the on-call engineers know something is still wrong. They're getting paged less often, but when they do get paged, the experience is unchanged. Acknowledge. Open laptop. Tab through the same tools. Build the timeline. Figure out what happened. Resolve.

The metric that doesn't move is MTTR, because the bottleneck was never getting the alert to the right person. It was understanding what the alert means once it arrives.
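
A quick worked example shows why. The numbers below are invented purely for illustration (minutes from alert fired to alert resolved); the point is only that MTTR is an average per incident, so halving the number of pages does nothing to it if each investigation takes as long as before.

```python
from statistics import mean


def mttr_minutes(incidents):
    """Mean time to resolve over (fired_at, resolved_at) pairs, in minutes."""
    return mean(resolved - fired for fired, resolved in incidents)


# Made-up data: six pages before the routing work, three after.
before_routing = [(0, 45), (100, 150), (300, 340), (500, 545), (700, 750), (900, 940)]
after_routing = [(0, 45), (300, 350), (700, 740)]

print(mttr_minutes(before_routing))  # 45
print(mttr_minutes(after_routing))   # 45
```

Half the pages, same 45 minutes each. Volume went down; time to resolve did not.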

Two separate problems

Alert fatigue is actually two things that got merged into one label:

Problem 1 — Volume

Too many alerts, too noisy, wrong person, wrong time. Engineers get paged for things they can't do anything about. This makes people tune out and miss the real ones.

→ Routing, deduplication, and threshold tuning fix this.

Problem 2 — Context

The right alert fires. The right person gets it. And they still spend 30 to 60 minutes doing manual investigation before they can act. This is what burns people out over time — not the frequency, the cognitive load of the investigation itself.

→ Routing doesn't help here at all.

Most tooling in this space solves problem 1. Problem 2 is mostly unsolved, which is why teams that have done everything right on the routing side still have burned-out engineers.

What actually helps with the context problem

The investigation is always the same sequence. What changed recently in this service? What do the error logs look like right now vs. 30 minutes ago? When did the metric start deviating — before or after the last deploy?

If you have that information when the alert fires, the investigation collapses. The fix is usually obvious the moment you see the timeline. A bad deploy. A config change that went out an hour ago. A dependency that started throwing errors at the same time.
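
Here is a rough sketch of that "what changed just before the metric moved" question as code. The Event type, its field names, and the one-hour lookback are assumptions for illustration, not a description of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    at: datetime
    kind: str    # e.g. "deploy", "config", "dependency_error"
    detail: str


def suspects(events, deviation_start, lookback=timedelta(hours=1)):
    """Changes that landed in the lookback window before the metric deviated.

    Returned most recent first, since the change closest to the deviation
    is usually the first thing worth checking.
    """
    window_start = deviation_start - lookback
    hits = [e for e in events if window_start <= e.at <= deviation_start]
    return sorted(hits, key=lambda e: e.at, reverse=True)
```

Anything that happened after the metric started deviating is filtered out, which is exactly the "before or after the last deploy" check an engineer does by eye.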

The expensive part is building that timeline manually every single time. For every alert. Including the 3am ones when you're half asleep.

Automate the collection. When the alert fires, automatically pull recent commits from GitHub, error logs from Loki or Datadog, metric history from Prometheus. Correlate them. Put them in front of the engineer before they even open their laptop.
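
As a minimal sketch of what "automate the collection" can look like: run one fetcher per source in parallel and hand back a single bundle. The fetchers here are hypothetical callables you would write yourself against the GitHub, Loki or Datadog, and Prometheus APIs; this code makes no real API calls.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_context(alert, fetchers):
    """Run every fetcher for the alert in parallel and gather the results.

    `fetchers` maps a source name to a callable taking the alert.
    A source that fails contributes an error note instead of aborting
    the whole collection, so the engineer still gets partial context.
    """
    context = {}
    with ThreadPoolExecutor(max_workers=len(fetchers) or 1) as pool:
        futures = {name: pool.submit(fn, alert) for name, fn in fetchers.items()}
        for name, future in futures.items():
            try:
                context[name] = future.result()
            except Exception as exc:
                context[name] = {"error": str(exc)}
    return context
```

The design choice worth copying is the per-source error handling: at 3am, two out of three sources is still far better than an empty page.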

That's the part routing never did.

How we built this into Wachd

When an alert fires, Wachd pulls the last N commits to the affected service from GitHub, 30 minutes of error logs from Loki or Datadog, and metric history from Prometheus around the alert window. It strips all PII before anything touches the AI backend. Then it runs the result through Ollama (local, air-gapped), Claude, OpenAI, or Gemini (your choice) and delivers a plain-English probable cause alongside the page.
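
To give a feel for the PII step, here is a deliberately simple sketch of redacting log lines before they leave your hands. This is not Wachd's actual redaction code; the two patterns and the placeholder tokens are assumptions, and real PII detection has to cover far more than emails and IPv4 addresses.

```python
import re

# Illustrative patterns only: emails and dotted-quad IPv4 addresses.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(line):
    """Replace emails and IPv4 addresses in a log line with placeholders."""
    line = EMAIL.sub("<email>", line)
    return IPV4.sub("<ip>", line)


print(redact("login failed for jane.doe@example.com from 10.0.0.12"))
# login failed for <email> from <ip>
```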

The engineer still makes the call. The AI gives them a starting point with actual evidence behind it, not a guess. They arrive with context instead of nothing.

Wachd is self-hosted, runs in your Kubernetes cluster, and is licensed under Apache 2.0. If you can't send incident data outside your network, Ollama runs fully in-cluster with zero external API calls.

Fix your routing. Then fix the context.

Routing work is real and worth doing. Don't skip it. But when you're done, alert fatigue will still be there — just quieter. The engineers getting paged less often will still dread being on-call, because every page is still a solo investigation.

The volume problem and the context problem both need to be solved. Most teams stop after the first one.