Now live on Railway + Vercel

From failure alert to merged fix

LangSmith tells you something broke. TraceGuard AI fixes it - fetching your real code, generating a patch PR, and scoring the fix before you ever look at it.

3 intake connectors (LangSmith, Langfuse, generic)
<60s from alert to patch PR
10 failure types classified
0 manual steps to ship a fix

Observability stops at the alert

Your monitoring tool tells you something broke. Everything after that - root cause, fix, test, PR, review - is still entirely manual.

🚨

The alert is just the beginning

LangSmith, Langfuse, and Helicone tell you a run failed. Finding which file caused it, understanding why, writing a fix, and opening a PR is still entirely your problem.

🌊

Failures compound between deploys

An infinite loop or hallucination bug doesn't crash anything - it silently burns credits and returns bad answers until someone notices and the manual fix cycle begins.

The fix cycle takes hours

Reproduce the failure, find the file, write the patch, open a PR, add a regression test - for every single incident. This is toil that compounds as your agent usage grows.

🔁

Rejected fixes restart from zero

A reviewer rejects your patch with notes. Those notes live in a GitHub comment. The next iteration starts from scratch - no memory of what was tried, what failed, or why.

The missing remediation layer

Observability tools are read-only - they surface failures and stop there. TraceGuard AI is read-write - it changes your code.

When a failure alert arrives from LangSmith, Langfuse, or any monitoring tool, TraceGuard classifies the root cause, dispatches a LangGraph agent to fetch your actual code from GitHub, generates a targeted multi-file fix, and opens a PR - all in under 60 seconds.

If you reject a fix with notes, the patch bot reads your feedback and generates a revised PR automatically. You stay in the loop for final approval - everything else is handled.

1

Receive

Failure arrives via LangSmith webhook, Langfuse webhook, or generic ingest endpoint.

2

Classify

Groq LLM maps the trace to one of 10 failure types with severity and root cause.

3

Patch

LangGraph agent fetches your real code from GitHub and generates a multi-file fix in a single clean commit.

4

Validate

Groq-as-judge scores the fix quality before you approve. Scores are written back to LangSmith.

5

Approve or retry

One click to merge. Reject with notes and the bot generates a revised fix addressing your feedback.

End-to-end pipeline

Every component runs asynchronously in its own database session, so a failure in any stage doesn't block the others.

INTAKE
LangSmith failed runs → webhook / poller at /api/webhook (dedup by run_id · 60s poll)

PIPELINE
① Classifier: Groq → failure_type · severity · title · root_cause_summary
② Patch Bot: LangGraph fetch_code → generate_fix → open_pr
③ Eval Writer: synthesizes evaluator code + test input
④ Shadow Runner: score_before vs score_after · auto-promote

OUTPUT
GitHub PR: auto-branch · diff · merge
⑤ Dashboard: human approve / reject PR
Feedback scores written to LangSmith
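Intake deduplicates by run_id because the webhook and the 60-second poller can both deliver the same failed run. A minimal in-memory sketch of that idea follows. The class name `RunIdDeduper` and the bounded-size eviction are assumptions, not TraceGuard's code; a production version would more likely lean on a unique constraint on run_id in the database so duplicates are also caught across restarts.

```python
from collections import OrderedDict


class RunIdDeduper:
    """Hypothetical helper: accept each run_id once, remembering at most max_entries ids."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._seen = OrderedDict()
        self._max_entries = max_entries

    def first_time(self, run_id: str) -> bool:
        """Return True the first time a run_id is seen, False for repeats."""
        if run_id in self._seen:
            # Refresh recency so a run that keeps reappearing is not evicted.
            self._seen.move_to_end(run_id)
            return False
        self._seen[run_id] = None
        if len(self._seen) > self._max_entries:
            # Forget the least recently seen id to bound memory use.
            self._seen.popitem(last=False)
        return True
```

Both the webhook handler and the poller would call `first_time(run_id)` and only enqueue the failure when it returns True.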

Five stages, fully automated

Each stage runs asynchronously with its own DB session. A crash in one stage doesn't cascade to the others.

01

Classify

A Groq LLM (llama-3.3-70b-versatile) receives a truncated excerpt of the failing trace - inputs, outputs, error, and child_run names - and returns structured JSON: failure type, severity, title, description, root cause, and trace evidence quotes.

groq llama-3.3-70b JSON output 12k TPM safe
02

Patch Bot

A LangGraph state machine runs three nodes: fetch_code extracts real file paths from stack traces and pulls them from your GitHub repo; generate_fix produces a minimal targeted patch; open_pr creates a branch, commits, and opens a PR with explanation.

LangGraph PyGitHub real code auto PR
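fetch_code starts by pulling file paths out of the stack trace. As a hedged sketch, assuming Python tracebacks and a container that checks the agent code out under `/app/` (a hypothetical prefix), the extraction might look like this:

```python
import re

# Matches traceback frames such as:  File "/app/agent/main_agent.py", line 12, in run_agent
FRAME_RE = re.compile(r'File "([^"]+)", line (\d+)')


def extract_repo_paths(traceback_text: str, repo_root: str = "/app/") -> list[str]:
    """Return repo-relative file paths from a Python traceback, in first-seen order."""
    seen: list[str] = []
    for path, _line in FRAME_RE.findall(traceback_text):
        if path.startswith("<") or "site-packages" in path or "dist-packages" in path:
            continue  # interpreter internals and third-party libraries
        if path.startswith(repo_root):
            path = path[len(repo_root):]  # hypothetical checkout prefix -> repo path
        elif path.startswith("/"):
            continue  # absolute path outside the checkout, e.g. the standard library
        if path not in seen:
            seen.append(path)
    return seen
```

Each resulting path could then be read from GitHub, for example with PyGitHub's `Repository.get_contents(path)`; the description above does not say exactly how TraceGuard resolves paths, so treat the filtering rules as assumptions.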
03

Eval Writer

Groq synthesizes a LangSmith evaluator function - Python code you can add directly to your test suite. It includes a representative test input derived from the original failing trace and an expected output that the patched agent should produce.

groq LangSmith eval Python codegen
04

Shadow Runner

Groq-as-judge scores the original failing output and the patched expected output on a 0.0–1.0 quality scale. If the patched score is ≥10% better, the patch is auto-promoted. Both scores are written back to LangSmith as traceguard/quality_before and traceguard/quality_after_patch feedback.

A/B scoring LangSmith feedback auto-promote
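"≥10% better" can be read as a relative gain or as a 0.10 absolute margin. The sketch below assumes the relative reading (score_after >= score_before * 1.10) and treats a 0.0 baseline, where a relative gain is undefined, as "leave it for human review". Both choices and the names `ShadowResult`, `should_auto_promote` and `feedback_payloads` are assumptions; only the two feedback keys come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShadowResult:
    score_before: float  # judge score of the original failing output, 0.0-1.0
    score_after: float   # judge score of the patched expected output, 0.0-1.0

    def __post_init__(self) -> None:
        for score in (self.score_before, self.score_after):
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"judge scores must be in [0.0, 1.0], got {score}")


def should_auto_promote(result: ShadowResult, min_relative_gain: float = 0.10) -> bool:
    """Assumed rule: promote when the patched score beats the original by >= 10% (relative)."""
    if result.score_before == 0.0:
        # A relative gain over zero is undefined, so defer to a human reviewer.
        return False
    return result.score_after >= result.score_before * (1 + min_relative_gain)


def feedback_payloads(result: ShadowResult) -> list[tuple[str, float]]:
    """(key, score) pairs to record on the original LangSmith run."""
    return [
        ("traceguard/quality_before", result.score_before),
        ("traceguard/quality_after_patch", result.score_after),
    ]
```

With the langsmith SDK, each pair could be written via `Client().create_feedback(run_id, key, score=score)`. Switching to an absolute margin would only change the last line of `should_auto_promote`.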
05

Dashboard Review

The React dashboard shows every failure card with its classification, root cause, linked PR URL, diff, and shadow scores. Approve squash-merges the PR on GitHub and marks the failure resolved. Reject closes the PR with a comment. All state updates are pushed live over WebSocket.

React 19 WebSocket human-in-loop squash merge
+

Crash Recovery

On startup, TraceGuard scans for failures stuck in classified state (e.g. from a previous container crash) and re-queues them through the patch pipeline with a staggered 10-second delay between each to stay within Groq rate limits.

auto-resume rate-limit safe startup recovery
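The staggered re-queue can be sketched with plain asyncio. This assumes the stuck failure IDs have already been loaded from the database and that `run_patch_pipeline` (a hypothetical name) is an async callable that runs one failure through the patch pipeline; the real startup code may differ.

```python
import asyncio
from typing import Awaitable, Callable, Iterable


def requeue_stuck(
    failure_ids: Iterable[str],
    run_patch_pipeline: Callable[[str], Awaitable[None]],
    stagger_seconds: float = 10.0,
) -> list[asyncio.Task]:
    """Schedule stuck failures at 0s, 10s, 20s, ... without blocking the caller.

    Must be called from inside a running event loop (e.g. an async startup hook).
    """

    async def run_after(failure_id: str, delay: float) -> None:
        await asyncio.sleep(delay)
        await run_patch_pipeline(failure_id)

    # Keep the returned task references: the event loop only holds weak references to tasks.
    return [
        asyncio.create_task(run_after(failure_id, index * stagger_seconds))
        for index, failure_id in enumerate(failure_ids)
    ]
```

Creating tasks instead of awaiting each failure in turn lets the server finish starting up immediately while the backlog drains in the background at one failure per 10 seconds.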

10 classified failure types

Every trace is mapped to one of these types. The taxonomy drives which patch strategy is applied and which evaluator is generated.

infinite_loop

Agent or tool calling itself in a cycle with no exit condition

hallucination

Agent fabricated facts, citations, or data not present in context

tool_misuse

Wrong arguments, schema mismatch, or wrong tool chosen for task

context_overflow

Input exceeded the model's maximum context window limit

empty_response

Agent returned empty or null output with no fallback

format_error

Response did not match the required output schema or format

reasoning_failure

Chain-of-thought broke down, leading to a wrong conclusion

latency_regression

Response time degraded significantly from the established baseline

tool_timeout

External tool or API call timed out with no fallback strategy

unknown

Failure could not be mapped to any known taxonomy type
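The taxonomy maps naturally onto a string enum. Here is a hedged sketch of validating the classifier's JSON against it. The `failure_type`, `severity`, `title` and `root_cause_summary` keys mirror the field names in the pipeline diagram; the fallback to `unknown` and the class names are assumptions.

```python
import json
from dataclasses import dataclass
from enum import Enum


class FailureType(str, Enum):
    INFINITE_LOOP = "infinite_loop"
    HALLUCINATION = "hallucination"
    TOOL_MISUSE = "tool_misuse"
    CONTEXT_OVERFLOW = "context_overflow"
    EMPTY_RESPONSE = "empty_response"
    FORMAT_ERROR = "format_error"
    REASONING_FAILURE = "reasoning_failure"
    LATENCY_REGRESSION = "latency_regression"
    TOOL_TIMEOUT = "tool_timeout"
    UNKNOWN = "unknown"


@dataclass
class Classification:
    failure_type: FailureType
    severity: str
    title: str
    root_cause_summary: str


def parse_classification(raw_json: str) -> Classification:
    """Validate the classifier's JSON reply against the 10-type taxonomy."""
    data = json.loads(raw_json)
    try:
        failure_type = FailureType(data.get("failure_type"))
    except (ValueError, TypeError):
        # Labels outside the taxonomy (or a missing field) map to UNKNOWN.
        failure_type = FailureType.UNKNOWN
    return Classification(
        failure_type=failure_type,
        severity=str(data.get("severity", "unknown")),
        title=str(data.get("title", "")),
        root_cause_summary=str(data.get("root_cause_summary", "")),
    )
```

Coercing unexpected labels to `unknown` means a model that invents a new category still produces a usable failure card instead of crashing the pipeline.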

Built for production from day one

🔌

Three intake connectors

Native support for LangSmith and Langfuse webhooks. A generic /ingest endpoint accepts normalized failures from Helicone, Arize, custom scripts, or any tool that can fire an HTTP POST.

🔁

Rejection learning

Reject a PR with reviewer notes and the patch bot reads your feedback, incorporating it into a revised fix - automatically. No restarting from scratch; the context carries forward.
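One way to make the context carry forward is to fold every rejected attempt and its review into the next patch request. The sketch below is purely illustrative: the `RejectedAttempt` record, the prompt wording and `build_revision_prompt` are hypothetical, not TraceGuard's prompt.

```python
from dataclasses import dataclass


@dataclass
class RejectedAttempt:
    """One rejected PR and the reviewer's notes on it (hypothetical record)."""
    pr_url: str
    diff: str
    reviewer_notes: str


def build_revision_prompt(root_cause: str, attempts: list[RejectedAttempt]) -> str:
    """Fold every earlier attempt and its review into the next patch request."""
    parts = [f"Root cause: {root_cause}"]
    for number, attempt in enumerate(attempts, start=1):
        parts.append(
            f"Rejected attempt {number} ({attempt.pr_url}):\n"
            f"{attempt.diff}\n"
            f"Reviewer notes: {attempt.reviewer_notes}"
        )
    parts.append("Write a new patch that keeps what worked and addresses every reviewer note above.")
    return "\n\n".join(parts)
```

Keeping attempts in order lets the model see not just the latest objection but the whole history of what was tried and why it was rejected.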

Groq inference - free tier friendly

Uses llama-3.3-70b-versatile on Groq's free tier. All traces are truncated to stay within the 12k TPM limit. Swap the model via GROQ_MODEL env var.
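Two details here can be made concrete: reading GROQ_MODEL with the documented default, and the budget implied by the numbers on this page. If one request is sent every 10 seconds (the crash-recovery stagger), that is 6 requests per minute, so staying under 12,000 tokens per minute leaves about 2,000 tokens per request. The function names are illustrative, and if each failure triggers several Groq calls (classify, patch, eval, judge), the per-call budget shrinks accordingly.

```python
import os

DEFAULT_GROQ_MODEL = "llama-3.3-70b-versatile"


def groq_model() -> str:
    """Model from the GROQ_MODEL env var, or the documented default when unset or empty."""
    return os.environ.get("GROQ_MODEL") or DEFAULT_GROQ_MODEL


def max_tokens_per_request(tpm_limit: int, seconds_between_requests: float) -> int:
    """Token budget per request for a steady request stream under a tokens-per-minute cap."""
    requests_per_minute = 60 / seconds_between_requests
    return int(tpm_limit // requests_per_minute)
```

For example, `max_tokens_per_request(12_000, 10)` returns 2000.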

🔄

WebSocket live feed

Every pipeline event - failure classified, patch generated, eval written, shadow scored - is pushed to the dashboard instantly over a persistent WebSocket connection.

🔐

Optional API key auth

Set API_KEY to protect write endpoints. /api/webhook/langsmith is intentionally open - LangSmith can't send custom headers. Frontend picks up the key from VITE_API_KEY.
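The rule described here fits in one pure function. This is a sketch, not TraceGuard's middleware: it assumes the key travels in the `X-API-Key` header (as in the curl example in Quick Setup), that "write endpoints" means POST, PUT, PATCH and DELETE, and that reads stay open. In FastAPI it could be called from a dependency or middleware.

```python
import hmac
from typing import Mapping, Optional

WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}
# Left open on purpose: LangSmith webhooks arrive without the X-API-Key header.
OPEN_PATHS = {"/api/webhook/langsmith"}


def is_authorized(method: str, path: str, headers: Mapping[str, str], api_key: Optional[str]) -> bool:
    """Decide whether a request may proceed under the optional API_KEY scheme."""
    if not api_key:
        return True  # API_KEY unset: auth is disabled entirely
    if method.upper() not in WRITE_METHODS or path in OPEN_PATHS:
        return True  # reads and the LangSmith webhook stay open
    # Header names are case-insensitive in HTTP, so normalize before the lookup.
    supplied = {name.lower(): value for name, value in headers.items()}.get("x-api-key", "")
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode(), api_key.encode())
```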

🗄

SQLite → PostgreSQL

SQLite for local dev, PostgreSQL for production. Railway's plugin injects DATABASE_URL automatically. Alembic manages schema migrations on every deploy.
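A sketch of the SQLite-for-dev, PostgreSQL-for-prod switch, assuming the app reads DATABASE_URL directly. The SQLite file name is hypothetical. The `postgres://` rewrite is a common defensive step: SQLAlchemy 1.4 and later only accept the `postgresql://` scheme, while some hosting platforms still emit the shorter form.

```python
import os
from typing import Mapping, Optional

# Hypothetical local default; the real file name may differ.
SQLITE_DEV_URL = "sqlite:///./traceguard.db"


def resolve_database_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Use DATABASE_URL when set (e.g. injected by Railway), else a local SQLite file."""
    env = os.environ if env is None else env
    url = env.get("DATABASE_URL") or SQLITE_DEV_URL
    # SQLAlchemy 1.4+ rejects the legacy "postgres://" scheme, so normalize it.
    if url.startswith("postgres://"):
        url = "postgresql://" + url[len("postgres://"):]
    return url
```

On deploy, migrations would then typically run with `alembic upgrade head` before the server starts.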

📊

LangSmith feedback loop

Shadow scores are written back to LangSmith as structured feedback (traceguard/quality_before, traceguard/quality_after_patch), closing the observability loop.

🤖

LangGraph orchestration

Patch Bot is a proper LangGraph StateGraph - each node gets the full typed state, errors are logged individually, and the graph is compiled once and reused.

📂

Multi-file patches

The patch bot fixes multiple files in a single clean Git commit - system prompt, tool definition, and agent runner in one PR. No piecemeal commits, no rebasing.

🐳

Docker Hub image

Pre-built image published on every version tag. No Python setup required - just docker run with your env vars and you're live in under 60 seconds.

🔁

CI / CD pipeline

GitHub Actions runs backend import checks plus a frontend TypeScript check and build on every push. Docker Hub publishing triggers on v* tags.

Three views, one workflow

Dark-theme React 19 UI with TanStack Query and live WebSocket updates.

traceguard-ai.vercel.app/dashboard
● failure_classified · high
12 Total Failures · 2 Critical · 5 High · 9 Patches
+ infinite loop · + hallucination · + tool misuse · + context overflow · + empty response
CRITICAL Infinite Loop Detected in Agent ● patched
The agent entered an infinite loop by repeatedly calling search_tool without an exit condition, exhausting the iteration limit of 10. Root cause: missing max_iterations guard in the ReAct agent configuration.
"Agent stopped due to iteration limit of 10."
PR open github.com/…/pull/42 ↑ score +0.31
HIGH Hallucinated Citation with Fake DOI ● patched
Agent fabricated a research citation including a non-existent DOI and journal volume. No grounding instruction was present in the system prompt.
"Dr. Smith published this in Nature 2024 [doi:10.1038/fake-doi]"
PR open github.com/…/pull/43 ↑ score +0.22
MEDIUM Context Window Exceeded - 145K tokens ◌ classified
Input to the model exceeded the 128K token context limit by 17K tokens. Full document was injected without chunking or summarization.
"maximum context length is 128000 tokens. Your messages resulted in 145230 tokens."
traceguard-ai.vercel.app/patches
● patch_generated
agent/main_agent.py · code_fix
Fix: infinite_loop - add max_iterations guard to ReAct agent
✓ Approve & Merge ✗ Reject
 def run_agent(query: str) -> str:
     agent = create_react_agent(llm, tools)
-    return agent.invoke({"input": query})["output"]
+    return agent.invoke(
+        {"input": query},
+        config={"recursion_limit": 10}
+    )["output"]
Added recursion_limit=10 to the agent invocation config to prevent unbounded tool call loops. Shadow score improved from 0.21 to 0.82.
agent/answer.py · prompt_rewrite
Fix: hallucination - add grounding instruction to system prompt
✓ Approve & Merge ✗ Reject
 def answer(question: str, context: str) -> str:
-    return llm.invoke(f"{question}\n\nContext: {context}").content
+    system = ("Only answer using facts present in the provided context. "
+              "If the answer is not in the context, say 'I don't know.'")
+    return llm.invoke([SystemMessage(system),
+                       HumanMessage(f"{question}\n\nContext: {context}")]).content
traceguard-ai.vercel.app/evals
● eval_generated
eval_infinite_loop_guard auto-promoted ✓
Before patch 0.21 → After patch 0.82 · +0.61 improvement
def eval_infinite_loop_guard(run, example):
    # Score 1 if the output no longer reports hitting the iteration limit
    output = (run.outputs or {}).get("output", "")
    return {"score": int("iteration limit" not in output)}
eval_hallucination_grounding auto-promoted ✓
Before patch 0.35 → After patch 0.78 · +0.43 improvement
def eval_hallucination_grounding(run, example):
    # Score 1 if the fabricated DOI from the original failure no longer appears
    output = (run.outputs or {}).get("output", "")
    return {"score": int("doi:10.1038/fake" not in output)}
eval_context_chunking pending review
Before patch 0.00 → After patch 0.65 · +0.65 improvement

Live in production in under 10 minutes

Three paths depending on your use case.

Path 1: Docker image

1

Run the backend

# Fastest path - no clone required
docker run -d \
  -e GROQ_API_KEY=gsk_... \
  -e LANGCHAIN_API_KEY=lsv2_... \
  -e LANGCHAIN_PROJECT=traceguard-ai \
  -e GITHUB_TOKEN=ghp_... \
  -e GITHUB_REPO=your-org/your-agent-repo \
  -e DATABASE_URL=postgresql://user:pass@host:5432/db \
  -e API_KEY=your-secret \
  -e CORS_ORIGINS=http://localhost:5173 \
  -p 8000:8000 \
  sauvast/traceguard-ai:latest
2

Test it

curl -X POST http://localhost:8000/api/webhook/simulate \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your-secret" \
  -d '{"failure_hint": "infinite_loop"}'

# Expected: {"status":"simulated","failure_id":"..."}
# Watch the container logs (docker logs -f <container-id>) - the pipeline runs in ~30 seconds
Path 2: Local development

1

Clone and configure backend

git clone https://github.com/saurabh-oss/traceguard-ai
cd traceguard-ai/backend
python3.12 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
# Edit .env - set GROQ_API_KEY at minimum
2

Start the backend

uvicorn app.main:app --reload --port 8000
3

Start the frontend (new terminal)

cd traceguard-ai/frontend   # relative to where you ran git clone
npm install
npm run dev
# → http://localhost:5173
4

Fire a demo failure

curl -X POST http://localhost:8000/api/webhook/simulate \
  -H "Content-Type: application/json" \
  -d '{"failure_hint": "hallucination"}'
Path 3: Railway + Vercel

1

Deploy backend to Railway

Railway → New Project → Deploy from GitHub → select traceguard-ai. Add a PostgreSQL plugin - Railway injects DATABASE_URL automatically.

2

Set Railway environment variables

Variable | Value | Needed
GROQ_API_KEY | Your Groq key from console.groq.com | required
LANGCHAIN_API_KEY | Your LangSmith key | required
LANGCHAIN_PROJECT | traceguard-ai | required
GITHUB_TOKEN | GitHub PAT with repo scope | for PRs
GITHUB_REPO | your-org/your-agent-repo | for PRs
API_KEY | Strong random secret | recommended
CORS_ORIGINS | Your Vercel frontend URL | recommended
SECRET_KEY | openssl rand -hex 32 | recommended
3

Deploy frontend to Vercel

Vercel → Add New Project → import traceguard-ai. Set Root Directory to frontend. Add two env vars:

VITE_API_URL=https://your-railway-url.up.railway.app
VITE_API_KEY=same-value-as-API_KEY-in-railway
4

Connect LangSmith webhook

LangSmith → your project → Settings → Webhooks → Add Webhook:

URL:     https://your-railway-url.up.railway.app/api/webhook/langsmith
Trigger: Run Failed

Every failed run in your LangSmith project now flows into TraceGuard automatically.