
May 7, 2026·7 min read

Running AI root cause analysis locally with Ollama

Most AI incident tools send your logs and stack traces to a cloud API. Here's how to run the whole thing inside your own cluster — no outbound calls, works completely air-gapped.

The pitch for AI-assisted incident response is straightforward: when an alert fires, the AI correlates your logs, metrics, and recent commits and tells you what probably went wrong. Useful if it works. But most of the tools that do this — Incident.io, PagerDuty's AI features, various copilot-style SaaS products — send your incident data to their cloud to do the analysis.

For a lot of teams that's fine. For teams in regulated industries, teams with strict data residency requirements, or teams running in air-gapped environments, it's a non-starter. The interesting question is whether you can get the same quality of analysis running a local LLM inside your own cluster.

Short answer: yes, with Ollama and a model like Llama 3.2 or Mistral, you can get useful root cause analysis with zero external API calls. Here's how it works in practice.

Why Ollama works well for this

Ollama makes running local LLMs on a server or Kubernetes node straightforward. You pull a model once (the weights live on a PVC), and then it serves a local HTTP API that looks similar to the OpenAI API. Models like Llama 3.2, Mistral 7B, and Phi-3 run on CPU-only nodes — no GPU required, though GPU will obviously be faster.
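As a rough illustration of that local HTTP API, here is a minimal sketch that calls Ollama's `/api/generate` endpoint using only the Python standard library. The URL assumes Ollama is listening on its default port (11434) on the same host; in a cluster you would point it at the Ollama Service instead. The helper names are made up for this example.

```python
import json
import urllib.request

# Assumption: Ollama's default port, reachable on localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the text from the "response" field."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With `"stream": False` the server returns a single JSON object rather than a stream of chunks, which keeps the client code short. The generous timeout reflects the CPU-only case, where a response can take tens of seconds.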

For incident analysis specifically, the task is well-suited to local models. You're not asking the AI to write code or reason about something open-ended. You're giving it a structured bundle of evidence (commits, error logs, metric values) and asking it to find the correlation and summarize it. That's a pattern-matching and summarization task, and 7B-parameter models handle it well.

The context window is the thing to watch. If you feed it 30 minutes of raw, unfiltered Loki logs, you'll blow through the window and output quality drops. The key is pre-filtering the logs before they reach the model: filter to error-level and above, deduplicate repeated lines, and keep the 50 most relevant. That's an engineering problem, not a model problem.
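The pre-filtering step can be sketched in a few lines of Python. This is an illustration, not Wachd's actual filter: it assumes the severity name appears in the line text, treats lines that differ only in numbers as duplicates, and uses frequency as a stand-in for "most relevant".

```python
import re

# Assumption: the level name appears verbatim somewhere in each log line.
SEVERITIES = ("ERROR", "CRITICAL", "FATAL")
# Mask standalone numbers and hex ids so near-identical lines share one key.
_VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+)\b")


def prefilter_logs(lines, limit=50):
    """Keep error-level lines, collapse repeats, and cap the result at `limit`."""
    counts = {}
    first_seen = {}
    for line in lines:
        if not any(level in line for level in SEVERITIES):
            continue
        key = _VARIABLE.sub("#", line)
        counts[key] = counts.get(key, 0) + 1
        first_seen.setdefault(key, line)
    # Most frequent patterns first; the stable sort keeps first-seen order on ties.
    ranked = sorted(counts, key=lambda k: -counts[k])
    return [f"{first_seen[k]} (x{counts[k]})" for k in ranked[:limit]]
```

Appending the repeat count keeps a signal the model would otherwise lose: one line that occurred 400 times reads very differently from one that occurred once.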

What gets sent to the model

The quality of the output depends almost entirely on what you put in. This is the context bundle Wachd builds before calling the AI backend — same whether that backend is Ollama locally or Claude/OpenAI in cloud mode:

Recent commits

Last 10 commits to the affected service repo. Commit hash, author, timestamp, message, and files changed. Fetched from GitHub or GitLab via read-only API token.

Error logs

30 minutes of logs from Loki, Datadog, or Splunk around the alert time. Filtered to ERROR and above, deduplicated, and capped at ~200 lines to stay within the context window.

Metric history

The metric that triggered the alert plus related metrics — latency, error rate, request volume — for the 30 minutes before and after the alert fired. From Prometheus or Grafana.

Alert metadata

Alert name, labels, annotations, the exact time it fired. Gives the model a fixed reference point for the timeline.
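One way to picture the bundle is as a single typed object that every collector fills in before the prompt is built. The sketch below is hypothetical: the class and field names are invented for illustration and are not Wachd's internal types.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class Commit:
    sha: str
    author: str
    timestamp: datetime
    message: str
    files_changed: list[str]


@dataclass
class ContextBundle:
    alert_name: str
    alert_time: datetime
    labels: dict[str, str] = field(default_factory=dict)
    commits: list[Commit] = field(default_factory=list)   # last 10 commits
    error_logs: list[str] = field(default_factory=list)   # filtered, deduplicated
    metrics: dict[str, list[tuple[datetime, float]]] = field(default_factory=dict)

    def window(self, minutes: int = 30) -> tuple[datetime, datetime]:
        """Query range covering `minutes` before and after the alert fired."""
        delta = timedelta(minutes=minutes)
        return self.alert_time - delta, self.alert_time + delta


# Hypothetical alert, used only to show the window calculation.
bundle = ContextBundle(
    alert_name="HighErrorRate",
    alert_time=datetime(2026, 5, 7, 12, 0, tzinfo=timezone.utc),
)
start, end = bundle.window()
```

Anchoring the window on `alert_time` gives the metric queries the same fixed reference point that the alert metadata gives the model.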

Before any of this touches the model, PII is stripped. Emails, IPs, UUIDs, API keys, internal hostnames — all replaced with placeholders like [EMAIL] and [IP]. Even with a local model that never sends data anywhere, stripping PII before analysis is the right habit.
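A regex-based scrubber is the simplest way to do this kind of replacement. The sketch below is illustrative only: the API-key prefixes and the internal-hostname suffixes are assumptions you would replace with your own formats, and the `[UUID]`, `[API_KEY]`, and `[HOST]` placeholder names are invented here to match the `[EMAIL]` and `[IP]` style.

```python
import re

# Order matters: emails run before hostnames so the domain part of an
# address is not rewritten as a host first.
# The API-key and hostname patterns are illustrative assumptions.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[UUID]", re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b")),
    ("[IP]", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")),
    ("[API_KEY]", re.compile(r"\b(?:sk|ghp|glpat)[-_][A-Za-z0-9_-]{16,}\b")),
    ("[HOST]", re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.(?:internal|svc\.cluster\.local)\b")),
]


def strip_pii(text: str) -> str:
    """Replace each match with its placeholder, applying patterns in order."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Regexes like these catch well-formed values only. Anything with an unusual format (a custom token scheme, an IPv6 address) needs its own pattern, so it is worth spot-checking real scrubbed output before trusting it.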

The prompt structure that works

Vague prompts get vague outputs. The prompt structure that produces consistent useful results looks like this:

// Simplified version of Wachd's analysis prompt

You are analyzing a production incident. Your job is to identify the probable root cause based only on the evidence provided. Do not speculate beyond what the evidence supports.

Alert: {alert_name} fired at {alert_time}

Recent commits to {service_name}:

{commits}

Error logs ({log_count} lines, last 30 min):

{filtered_logs}

Metric values around alert time:

{metric_history}

Answer these questions in order:

1. What changed recently that could have caused this? (cite specific commits or config changes)

2. What does the error pattern in the logs tell you?

3. What is the probable root cause in one sentence?

4. What is the suggested immediate action?

If the evidence is insufficient to determine root cause, say so clearly.

The last line matters. Local models will sometimes confabulate confidently when they don't have enough signal. Telling the model explicitly to say when the evidence is insufficient reduces that. You still get occasional hallucinations, but they're rarer when the model is given permission to be uncertain.
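Filling in the template is plain string formatting. The sketch below reuses the simplified prompt from above; the `build_prompt` function and its parameters are hypothetical names, and the inputs are assumed to be already filtered and scrubbed.

```python
PROMPT_TEMPLATE = """\
You are analyzing a production incident. Your job is to identify the probable root cause based only on the evidence provided. Do not speculate beyond what the evidence supports.

Alert: {alert_name} fired at {alert_time}

Recent commits to {service_name}:
{commits}

Error logs ({log_count} lines, last 30 min):
{filtered_logs}

Metric values around alert time:
{metric_history}

Answer these questions in order:
1. What changed recently that could have caused this? (cite specific commits or config changes)
2. What does the error pattern in the logs tell you?
3. What is the probable root cause in one sentence?
4. What is the suggested immediate action?

If the evidence is insufficient to determine root cause, say so clearly.
"""


def build_prompt(alert_name, alert_time, service_name, commits, logs, metric_history):
    """Render the prompt; `commits` and `logs` are lists of preformatted lines."""
    return PROMPT_TEMPLATE.format(
        alert_name=alert_name,
        alert_time=alert_time,
        service_name=service_name,
        commits="\n".join(commits),
        log_count=len(logs),
        filtered_logs="\n".join(logs),
        metric_history=metric_history,
    )
```

Deriving `log_count` from the list itself keeps the number in the prompt consistent with the lines the model actually sees. Log lines that contain braces, such as JSON payloads, are safe here because `str.format` does not re-parse substituted values.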

Model choice and hardware

We've tested a few models on this task. Quick summary:

Model	Size	CPU latency	Quality
llama3.2	3B	~15s	Good for simple incidents
llama3.1:8b	8B	~45s	Better reasoning, recommended default
mistral:7b	7B	~40s	Good, slightly less consistent
phi3:mini	3.8B	~20s	Fast, weaker on complex timelines

CPU latency was measured on a 4-core node with 16GB RAM. 45 seconds sounds slow, but incidents aren't millisecond-latency problems: by the time the alert has fired and been routed, and the engineer has acknowledged it, the analysis is ready. If you have GPU nodes available, latency drops to 2–5 seconds for any of these models.

For storage, model weights for llama3.1:8b are about 5GB. You want a PVC for this so the pod doesn't re-download the model on every restart.

Deploying Ollama in Kubernetes alongside Wachd

In Wachd's Helm chart, Ollama is an optional component you enable in values.yaml:

# values.yaml
analysis:
  backend: ollama
  # no API keys needed, everything runs locally
ollama:
  enabled: true
  model: llama3.1:8b
  gpuEnabled: false   # set true if you have GPU nodes
  storage: 20Gi       # PVC for model weights

This deploys Ollama as a separate Deployment in the same namespace, pulls the model on first startup, and wires Wachd's worker to call it for analysis. The worker waits for the model to be ready before it runs analysis. If Ollama is down, alerts still get routed (without analysis), so you don't lose pages.
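The "route anyway" behavior comes down to treating analysis as optional. Here is a minimal sketch of that pattern, with hypothetical `analyze` and `route` callables standing in for the Ollama call and the paging step; it is not Wachd's worker code.

```python
from typing import Callable, Optional


def handle_alert(
    alert: dict,
    analyze: Callable[[dict], str],
    route: Callable[[dict, Optional[str]], None],
) -> None:
    """Route every alert; attach analysis only when the backend answers."""
    analysis: Optional[str]
    try:
        analysis = analyze(alert)
    except OSError:
        # Connection refused, DNS failure, and socket timeouts are all
        # OSError subclasses, as is urllib's URLError.
        analysis = None
    route(alert, analysis)
```

Because `route` sits outside the `try` block, it runs whether or not the analysis call succeeded, which is what keeps pages flowing when the model is unavailable.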

Honest limitations

Local models are not as good as GPT-4 or Claude 3.5 on complex multi-service incidents where the root cause isn't obvious from the evidence. They're good at the common case — a bad deploy, a saturated resource, a dependency that started erroring — and that's most incidents. For complex cascading failures, you might get a less confident or less precise answer.

The other limitation is that local models don't know anything about your specific codebase or architecture. They're reasoning purely from the evidence in the context window. Claude or GPT-4 don't know your codebase either, but they have broader world knowledge that sometimes helps with library-level issues. For in-house service failures, this difference rarely matters.

If you're in an environment where you can use cloud AI, you'll get better results, especially for edge cases. If you can't (regulated environment, air-gapped cluster, strong preference for keeping data in-house), local models are genuinely usable and better than nothing, which is what most teams have today.

Try it

Wachd is open source, Apache 2.0, and the Ollama integration is included with no license required. You can have a working local AI analysis pipeline running in under 30 minutes on any Kubernetes cluster.
