TraceGuard AI - From failure alert to merged fix, automatically

⚠ The Problem

Observability stops at the alert

Your monitoring tool tells you something broke. Everything after that - root cause, fix, test, PR, review - is still entirely manual.

🚨

The alert is just the beginning

LangSmith, Langfuse, and Helicone tell you a run failed. Finding which file caused it, understanding why, writing a fix, and opening a PR is still entirely your problem.

🌊

Failures compound between deploys

An infinite loop or hallucination bug doesn't crash anything - it silently burns credits and returns bad answers until someone notices and the manual fix cycle begins.

⏱

The fix cycle takes hours

Reproduce the failure, find the file, write the patch, open a PR, add a regression test - for every single incident. This is toil that compounds as your agent usage grows.

🔁

Rejected fixes restart from zero

A reviewer rejects your patch with notes. Those notes live in a GitHub comment. The next iteration starts from scratch - no memory of what was tried, what failed, or why.

💡 The Solution

The missing remediation layer

Observability tools are read-only - they surface failures and stop there. TraceGuard AI is read-write - it changes your code.

When a failure alert arrives from LangSmith, Langfuse, or any monitoring tool, TraceGuard classifies the root cause, dispatches a LangGraph agent to fetch your actual code from GitHub, generates a targeted multi-file fix, and opens a PR - all in under 60 seconds.

If you reject a fix with notes, the patch bot reads your feedback and generates a revised PR automatically. You stay in the loop for final approval - everything else is handled.

Receive

Failure arrives via LangSmith webhook, Langfuse webhook, or generic ingest endpoint.

Classify

Groq LLM maps the trace to one of 10 failure types with severity and root cause.

Patch

LangGraph agent fetches your real code from GitHub and generates a multi-file fix in a single clean commit.

Validate

Groq-as-judge scores the fix quality before you approve. Scores are written back to LangSmith.

Approve or retry

One click to merge. Reject with notes and the bot generates a revised fix addressing your feedback.

⚙ How It Works

Five stages, fully automated

Each stage runs asynchronously with its own DB session. A crash in one stage doesn't cascade to the others.

Classify

A Groq LLM (llama-3.3-70b-versatile) receives a truncated excerpt of the failing trace - inputs, outputs, error, and child_run names - and returns structured JSON: failure type, severity, title, description, root cause, and trace evidence quotes.

groq · llama-3.3-70b · JSON output · 12k TPM safe
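A minimal sketch of how the classifier's structured JSON reply could be parsed and checked before the rest of the pipeline uses it. The key names (`failure_type`, `root_cause`, `evidence`, and so on) are assumptions that mirror the fields listed above; the real schema may differ.

```python
import json
from dataclasses import dataclass, field

# Assumed key names, mirroring the fields listed in the Classify stage.
REQUIRED_KEYS = ("failure_type", "severity", "title", "description", "root_cause")


@dataclass
class Classification:
    failure_type: str
    severity: str
    title: str
    description: str
    root_cause: str
    evidence: list = field(default_factory=list)  # trace evidence quotes


def parse_classification(raw: str) -> Classification:
    """Parse the model's JSON reply, failing loudly on missing fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("classifier reply must be a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"classifier reply is missing keys: {missing}")
    return Classification(
        **{key: str(data[key]) for key in REQUIRED_KEYS},
        evidence=[str(quote) for quote in data.get("evidence", [])],
    )
```

Rejecting an incomplete reply at this point keeps a malformed classification from reaching the patch stage.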

Patch Bot

A LangGraph state machine runs three nodes: fetch_code extracts real file paths from stack traces and pulls them from your GitHub repo; generate_fix produces a minimal targeted patch; open_pr creates a branch, commits, and opens a PR with explanation.

LangGraph · PyGithub · real code · auto PR
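The path-extraction step of fetch_code can be sketched with a regular expression over standard Python traceback frames. This is an illustration only: the function name, the `/app/` repo root, and the `site-packages` filter are assumptions, not TraceGuard's actual implementation.

```python
import re

# Matches the 'File "path", line N' frames that CPython prints in tracebacks.
FRAME_RE = re.compile(r'File "([^"]+)", line \d+')


def extract_repo_paths(trace_text: str, repo_root: str = "/app/") -> list:
    """Return unique repo-relative .py paths in order of first appearance.

    `repo_root` is a hypothetical container path prefix that gets stripped.
    """
    seen = []
    for path in FRAME_RE.findall(trace_text):
        if "site-packages" in path or not path.endswith(".py"):
            continue  # skip third-party and non-file frames
        if path.startswith(repo_root):
            path = path[len(repo_root):]
        if path not in seen:
            seen.append(path)
    return seen
```

The resulting relative paths are what a GitHub client would then look up in the target repository.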

Eval Writer

Groq synthesizes a LangSmith evaluator function - Python code you can add directly to your test suite. It includes a representative test input derived from the original failing trace and an expected output that the patched agent should produce.

groq · LangSmith eval · Python codegen

Shadow Runner

Groq-as-judge scores the original failing output and the patched expected output on a 0.0–1.0 quality scale. If the patched score is ≥10% better, the patch is auto-promoted. Both scores are written back to LangSmith as traceguard/quality_before and traceguard/quality_after_patch feedback.

A/B scoring · LangSmith feedback · auto-promote
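The auto-promote rule can be sketched as a small comparison function. The text does not say whether "≥10% better" is relative or absolute, so this sketch assumes a relative gain over the original score and falls back to an absolute comparison when the baseline is zero. The function and parameter names are hypothetical.

```python
def should_promote(before: float, after: float, min_gain: float = 0.10) -> bool:
    """Return True when `after` beats `before` by at least `min_gain`.

    Assumption: the gain is measured relative to the original score. A zero
    baseline has no meaningful ratio, so it is compared in absolute terms.
    """
    for score in (before, after):
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range: {score}")
    if before == 0.0:
        return after >= min_gain
    return (after - before) / before >= min_gain
```

Under this reading, a score pair such as 0.21 → 0.82 is promoted, while a patch that scores lower than the original never is.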

Dashboard Review

The React dashboard shows every failure card with its classification, root cause, linked PR URL, diff, and shadow scores. Approve squash-merges the PR on GitHub and marks the failure resolved. Reject closes the PR with a comment, and any reviewer notes are passed to the patch bot for a revised fix. All state updates are pushed live over WebSocket.

React 19 · WebSocket · human-in-loop · squash merge

Crash Recovery

On startup, TraceGuard scans for failures stuck in classified state (e.g. from a previous container crash) and re-queues them through the patch pipeline with a staggered 10-second delay between each to stay within Groq rate limits.

auto-resume · rate-limit safe · startup recovery
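A rough sketch of the staggered re-queue with asyncio. `requeue_stuck` and the `run_patch_pipeline` callback are hypothetical names; the 10-second default mirrors the delay described above.

```python
import asyncio
from collections.abc import Callable, Coroutine
from typing import Any


async def requeue_stuck(
    failure_ids: list,
    run_patch_pipeline: Callable[[str], Coroutine[Any, Any, None]],
    stagger_seconds: float = 10.0,
) -> list:
    """Start one pipeline task per stuck failure, spacing the starts apart."""
    tasks = []
    for index, failure_id in enumerate(failure_ids):
        if index:  # no wait before the first item
            await asyncio.sleep(stagger_seconds)
        tasks.append(asyncio.create_task(run_patch_pipeline(failure_id)))
    return tasks
```

Spacing the task starts, rather than running the pipelines strictly one after another, lets a slow pipeline overlap with the next one while still limiting how fast new LLM requests begin.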

🗂 Failure Taxonomy

10 classified failure types

Every trace is mapped to one of these types. The taxonomy drives which patch strategy is applied and which evaluator is generated.

infinite_loop

Agent or tool calling itself in a cycle with no exit condition

hallucination

Agent fabricated facts, citations, or data not present in context

tool_misuse

Wrong arguments, schema mismatch, or wrong tool chosen for task

context_overflow

Input exceeded the model's maximum context window limit

empty_response

Agent returned empty or null output with no fallback

format_error

Response did not match the required output schema or format

reasoning_failure

Chain-of-thought broke down, leading to a wrong conclusion

latency_regression

Response time degraded significantly from the established baseline

tool_timeout

External tool or API call timed out with no fallback strategy

unknown

Failure could not be mapped to any known taxonomy type
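One way to pin the taxonomy down in code is an enum with a fallback to `unknown`. The member values are the ten labels above; the `coerce_failure_type` helper and its normalization rules are assumptions for illustration.

```python
from enum import Enum


class FailureType(str, Enum):
    INFINITE_LOOP = "infinite_loop"
    HALLUCINATION = "hallucination"
    TOOL_MISUSE = "tool_misuse"
    CONTEXT_OVERFLOW = "context_overflow"
    EMPTY_RESPONSE = "empty_response"
    FORMAT_ERROR = "format_error"
    REASONING_FAILURE = "reasoning_failure"
    LATENCY_REGRESSION = "latency_regression"
    TOOL_TIMEOUT = "tool_timeout"
    UNKNOWN = "unknown"


def coerce_failure_type(label: str) -> FailureType:
    """Map a model-supplied label onto the taxonomy, defaulting to UNKNOWN."""
    normalized = label.strip().lower().replace(" ", "_").replace("-", "_")
    try:
        return FailureType(normalized)
    except ValueError:
        return FailureType.UNKNOWN
```

Coercing unrecognized labels to `unknown` means the classifier can never produce a type the downstream stages have no strategy for.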

✨ Capabilities

Built for production from day one

🔌

Three intake connectors

Native support for LangSmith and Langfuse webhooks. A generic /ingest endpoint accepts normalized failures from Helicone, Arize, custom scripts, or any tool that can fire an HTTP POST.

🔁

Rejection learning

Reject a PR with reviewer notes and the patch bot reads your feedback, incorporating it into a revised fix - automatically. No restarting from scratch; the context carries forward.

⚡

Groq inference - free tier friendly

Uses llama-3.3-70b-versatile on Groq's free tier. All traces are truncated to stay within the 12k TPM limit. Swap the model via GROQ_MODEL env var.
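A sketch of per-field truncation, assuming a character budget stands in for the token limit. The field names, the 2,000-character default, and the truncation marker are all illustrative assumptions.

```python
import json

MARKER = " ...[truncated]... "


def truncate_middle(text: str, limit: int) -> str:
    """Keep the head and tail of `text`, eliding the middle past `limit` chars."""
    if len(text) <= limit:
        return text
    if limit <= len(MARKER):
        return text[:limit]
    keep = limit - len(MARKER)
    head, tail = keep - keep // 2, keep // 2
    return text[:head] + MARKER + (text[-tail:] if tail else "")


def build_excerpt(trace: dict, per_field_chars: int = 2000) -> dict:
    """Serialize and cap each field independently so one large field
    cannot crowd out the others."""
    fields = ("inputs", "outputs", "error", "child_run_names")
    return {
        name: truncate_middle(json.dumps(trace.get(name), default=str), per_field_chars)
        for name in fields
    }
```

Keeping both the head and the tail of a long field preserves the start of the input and the final error text, which usually sit at opposite ends.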

🔄

WebSocket live feed

Every pipeline event - failure classified, patch generated, eval written, shadow scored - is pushed to the dashboard instantly over a persistent WebSocket connection.

🔐

Optional API key auth

Set API_KEY to protect write endpoints. /api/webhook/langsmith is intentionally open - LangSmith can't send custom headers. Frontend picks up the key from VITE_API_KEY.
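The optional-auth rule can be sketched as a pure function. It assumes "write endpoints" means any non-read HTTP method and that the key is compared in constant time; the names here are hypothetical and the real middleware may differ.

```python
import hmac
from typing import Optional

OPEN_PATHS = frozenset({"/api/webhook/langsmith"})  # intentionally unauthenticated
READ_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})


def is_authorized(
    method: str,
    path: str,
    provided_key: Optional[str],
    configured_key: Optional[str],
) -> bool:
    """Decide whether a request may proceed under optional API-key auth."""
    if not configured_key:
        return True  # API_KEY unset: auth is disabled
    if method.upper() in READ_METHODS or path in OPEN_PATHS:
        return True
    if provided_key is None:
        return False
    # compare_digest avoids leaking key contents through timing differences
    return hmac.compare_digest(provided_key.encode(), configured_key.encode())
```

Because the LangSmith webhook path stays open, anything it accepts should be treated as untrusted input.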

🗄

SQLite → PostgreSQL

SQLite for local dev, PostgreSQL for production. Railway's plugin injects DATABASE_URL automatically. Alembic manages schema migrations on every deploy.

📊

LangSmith feedback loop

Shadow scores are written back to LangSmith as structured feedback (traceguard/quality_before, traceguard/quality_after_patch), closing the observability loop.

🤖

LangGraph orchestration

Patch Bot is a proper LangGraph StateGraph - each node gets the full typed state, errors are logged individually, and the graph is compiled once and reused.

📂

Multi-file patches

The patch bot fixes multiple files in a single clean Git commit - system prompt, tool definition, and agent runner in one PR. No piecemeal commits, no rebasing.

🐳

Docker Hub image

Pre-built image published on every version tag. No Python setup required - just docker run with your env vars and you're live in under 60 seconds.

🔁

CI / CD pipeline

GitHub Actions runs backend import checks and frontend TypeScript + build on every push. Docker Hub publish triggers on v* tags.

🖥 Dashboard

Three views, one workflow

Dark-theme React 19 UI with TanStack Query and live WebSocket updates.

traceguard-ai.vercel.app/dashboard

TraceGuard AI

Dashboard · Patch Review · Eval Vault

● failure_classified · high

Total Failures · Critical · High · Patches

+ infinite loop · + hallucination · + tool misuse · + context overflow · + empty response

CRITICAL Infinite Loop Detected in Agent ● patched

The agent entered an infinite loop by repeatedly calling search_tool without an exit condition, exhausting the iteration limit of 10. Root cause: missing max_iterations guard in the ReAct agent configuration.

"Agent stopped due to iteration limit of 10."

PR open github.com/…/pull/42 ↑ score +0.31

HIGH Hallucinated Citation with Fake DOI ● patched

Agent fabricated a research citation including a non-existent DOI and journal volume. No grounding instruction was present in the system prompt.

"Dr. Smith published this in Nature 2024 [doi:10.1038/fake-doi]"

PR open github.com/…/pull/43 ↑ score +0.22

MEDIUM Context Window Exceeded - 145K tokens ◌ classified

Input to the model exceeded the 128K token context limit by 17K tokens. Full document was injected without chunking or summarization.

"maximum context length is 128000 tokens. Your messages resulted in 145230 tokens."

traceguard-ai.vercel.app/patches

● patch_generated

agent/main_agent.py · code_fix

Fix: infinite_loop - add max_iterations guard to ReAct agent

✓ Approve & Merge ✗ Reject

  def run_agent(query: str) -> str:
      agent = create_react_agent(llm, tools)
-     return agent.invoke({"input": query})["output"]
+     return agent.invoke(
+         {"input": query},
+         config={"recursion_limit": 10}
+     )["output"]

Added recursion_limit=10 to the agent invocation config to prevent unbounded tool call loops. Shadow score improved from 0.21 → 0.82.

agent/answer.py · prompt_rewrite

Fix: hallucination - add grounding instruction to system prompt

✓ Approve & Merge ✗ Reject

  def answer(question: str, context: str) -> str:
-     return llm.invoke(f"{question}\n\nContext: {context}").content
+     system = ("Only answer using facts present in the provided context. "
+               "If the answer is not in the context, say 'I don't know.'")
+     return llm.invoke([SystemMessage(system),
+                        HumanMessage(f"{question}\n\nContext: {context}")]).content

traceguard-ai.vercel.app/evals

● eval_generated

eval_infinite_loop_guard

Before patch: 0.21 → After patch: 0.82 (+61% improvement)

def eval_infinite_loop_guard(run, example):
    # Score 1 when the output does not contain the iteration-limit stop message
    output = (run.outputs or {}).get("output") or ""  # outputs may be None on failed runs
    return {"score": int("iteration limit" not in output)}

eval_hallucination_grounding

Before patch: 0.35 → After patch: 0.78 (+43% improvement)

def eval_hallucination_grounding(run, example):
    # Score 1 when the fabricated DOI from the original failure is absent
    output = (run.outputs or {}).get("output") or ""  # outputs may be None on failed runs
    return {"score": int("doi:10.1038/fake" not in output)}

eval_context_chunking

Before patch: 0.00 → After patch: 0.65 (+65% improvement)

🚀 Setup

Live in production in under 10 minutes

Three paths depending on your use case: run the pre-built Docker image, clone and run locally, or deploy to Railway and Vercel.

Run the backend

# Fastest path - no clone required
docker run -d \
  -e GROQ_API_KEY=gsk_... \
  -e LANGCHAIN_API_KEY=lsv2_... \
  -e LANGCHAIN_PROJECT=traceguard-ai \
  -e GITHUB_TOKEN=ghp_... \
  -e GITHUB_REPO=your-org/your-agent-repo \
  -e DATABASE_URL=postgresql://user:pass@host:5432/db \
  -e API_KEY=your-secret \
  -e CORS_ORIGINS=http://localhost:5173 \
  -p 8000:8000 \
  sauvast/traceguard-ai:latest

Test it

curl -X POST http://localhost:8000/api/webhook/simulate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret" \
  -d '{"failure_hint": "infinite_loop"}'

# Expected: {"status":"simulated","failure_id":"..."}
# Watch the container logs - pipeline runs in ~30 seconds

Clone and configure backend

git clone https://github.com/saurabh-oss/traceguard-ai
cd traceguard-ai/backend
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env - set GROQ_API_KEY at minimum

Start the backend

uvicorn app.main:app --reload --port 8000

Start the frontend (new terminal)

cd frontend
npm install
npm run dev
# → http://localhost:5173

Fire a demo failure

curl -X POST http://localhost:8000/api/webhook/simulate \
  -H "Content-Type: application/json" \
  -d '{"failure_hint": "hallucination"}'
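The same demo call can be made from Python with the standard library. This sketch only builds the request; the endpoint, payload, and `X-API-Key` header come from the curl examples above, while the helper name is hypothetical.

```python
import json
import urllib.request
from typing import Optional


def build_simulate_request(
    base_url: str, failure_hint: str, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST to /api/webhook/simulate mirroring the curl example."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when the backend was started with API_KEY set
        headers["X-API-Key"] = api_key
    body = json.dumps({"failure_hint": failure_hint}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/webhook/simulate",
        data=body,
        headers=headers,
        method="POST",
    )


# To send it against a running backend:
#   req = build_simulate_request("http://localhost:8000", "hallucination")
#   with urllib.request.urlopen(req, timeout=10) as resp:
#       print(resp.read().decode())
```

Looping over several failure hints with this helper is a quick way to populate the dashboard for a demo.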

Deploy backend to Railway

Railway → New Project → Deploy from GitHub → select traceguard-ai. Add a PostgreSQL plugin - Railway injects DATABASE_URL automatically.

Set Railway environment variables

Variable	Value	Status
`GROQ_API_KEY`	Your Groq key from console.groq.com	required
`LANGCHAIN_API_KEY`	Your LangSmith key	required
`LANGCHAIN_PROJECT`	traceguard-ai	required
`GITHUB_TOKEN`	GitHub PAT with `repo` scope	for PRs
`GITHUB_REPO`	your-org/your-agent-repo	for PRs
`API_KEY`	Strong random secret	recommended
`CORS_ORIGINS`	Your Vercel frontend URL	recommended
`SECRET_KEY`	`openssl rand -hex 32`	recommended

Deploy frontend to Vercel

Vercel → Add New Project → import traceguard-ai. Set Root Directory to frontend. Add two env vars:

VITE_API_URL=https://your-railway-url.up.railway.app
VITE_API_KEY=same-value-as-API_KEY-in-railway

Connect LangSmith webhook

LangSmith → your project → Settings → Webhooks → Add Webhook:

URL:     https://your-railway-url.up.railway.app/api/webhook/langsmith
Trigger: Run Failed

Every failed run in your LangSmith project now flows into TraceGuard automatically.

Everything you need

Main Repository

Demo Agent Repo

Docker Hub

Contributing Guide

Get a Groq API Key

LangSmith
