The 2025 AI Observability Stack
What to log, how to log it, and the tools that make AI debugging tractable.
You can't debug what you can't see. AI systems fail in ways that traditional APM tools weren't built to catch — silent quality drift, hallucinated outputs, multi-turn context degradation. Here's the observability stack that makes production AI tractable.
The core stack we deploy
- Langfuse or LangSmith — trace every LLM call with full prompt, response, latency, cost, and parent/child relationships across multi-step chains. Langfuse if you want self-hosted; LangSmith if you're all-in on the LangChain ecosystem.
- Helicone — drop-in proxy for cost tracking, rate limiting, and caching. The cheapest way to add cost-per-customer attribution (see the proxy sketch after this list).
- Phoenix (Arize) — evals + tracing in one tool. Particularly strong for RAG debugging.
- Datadog or your existing APM — for everything that isn't LLM-specific (database, queue, API latency, errors).
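Helicone's drop-in proxy pattern is worth seeing concretely. A minimal sketch, assuming the openai Python client and Helicone's documented proxy endpoint and auth headers (verify the exact URL and header names against current Helicone docs):

```python
# Sketch: route OpenAI traffic through Helicone by swapping the base URL.
# No other application code changes; Helicone records cost per request.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",  # Helicone's OpenAI proxy (per its docs)
    default_headers={
        # Auth header name and format per Helicone docs; treat as an assumption.
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # Tagging each request with a customer ID is what enables
        # cost-per-customer attribution downstream.
        "Helicone-User-Id": "customer-123",
    },
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "ping"}],
)
```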
The open-source path
Phoenix + Langfuse self-hosted covers about 90% of needs for roughly $40/month in infrastructure. You give up some UX polish and gain full data ownership. Worth it for regulated industries, or for anyone who wants to avoid vendor lock-in.
What to log on every LLM call
- Full prompt (including system prompt) and full response.
- Model identifier and version.
- Token counts (prompt, completion, total).
- Latency (time to first token, total).
- Cost (computed at log time, not query time).
- User ID, session ID, request ID, parent trace ID.
- Tool calls made (names, arguments, responses).
- Retrieved context for RAG (with chunk IDs and scores).
- Any human override or feedback signal.
- Failure mode if any (timeout, schema mismatch, content filter trip).
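All of that fits in one record per call. A minimal sketch of such a record as a Python dataclass; the field names and price table are illustrative, not any vendor's schema, and the key detail is that cost is computed when the record is built, not at query time:

```python
# Illustrative per-call log record. Field names are hypothetical; adapt to
# whatever your tracing backend expects.
from dataclasses import dataclass, field

# Hypothetical per-1K-token prices (prompt, completion); keep your table current.
PRICE_PER_1K = {"gpt-4o-mini": (0.00015, 0.0006)}

@dataclass
class LLMCallRecord:
    request_id: str
    session_id: str
    user_id: str
    parent_trace_id: str | None
    model: str                      # identifier AND version, e.g. a pinned snapshot
    prompt: str                     # full prompt, including the system prompt
    response: str
    prompt_tokens: int
    completion_tokens: int          # total is derivable: prompt + completion
    ttft_ms: float                  # time to first token
    total_latency_ms: float
    tool_calls: list[dict] = field(default_factory=list)        # name, args, result
    retrieved_chunks: list[dict] = field(default_factory=list)  # chunk_id, score
    feedback: str | None = None     # human override or thumbs-up/down signal
    failure_mode: str | None = None # timeout, schema mismatch, filter trip...
    cost_usd: float = 0.0

    def __post_init__(self):
        # Compute cost at log time so later price changes don't rewrite history.
        p, c = PRICE_PER_1K.get(self.model, (0.0, 0.0))
        self.cost_usd = (self.prompt_tokens * p + self.completion_tokens * c) / 1000
```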
Sample at 100% in staging, 10–25% in production for high-volume systems. Sampling rate is itself a metric — if you can't reproduce a production issue with the sample, raise the rate.
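One way to implement that, a sketch assuming you key sampling on session ID so every trace in a multi-turn session is kept or dropped together (hash-based, so the decision is deterministic and reproducible):

```python
# Deterministic sampling keyed on session ID: the same session always hashes
# to the same bucket, so multi-turn traces stay complete in the sample.
import hashlib
import os

def should_sample(session_id: str, rate: float) -> bool:
    digest = hashlib.sha256(session_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# 100% in staging, 10-25% in production; log the rate alongside the traces.
rate = 1.0 if os.environ.get("ENV") == "staging" else 0.10
if should_sample(session_id="sess-42", rate=rate):
    ...  # emit the full trace
```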
The dashboards that earn their keep
- Cost-per-customer over time, sliced by feature.
- P50/P95/P99 latency by route and by model.
- Eval pass rate over time (block deploys on regressions; see the gate sketch after this list).
- Tool-call success rate (a leading indicator for agent failures).
- Hallucination flags (sampled human review queue).
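For the eval gate, a sketch of the deploy check, assuming you already compute a pass rate per eval run; the margin and function names are illustrative:

```python
# Illustrative CI gate: fail the deploy if the candidate's eval pass rate
# regresses more than a small margin against the production baseline.
import sys

REGRESSION_MARGIN = 0.02  # tolerate 2 points of noise; tune to your eval size

def gate(baseline_pass_rate: float, candidate_pass_rate: float) -> None:
    if candidate_pass_rate < baseline_pass_rate - REGRESSION_MARGIN:
        print(f"FAIL: eval pass rate {candidate_pass_rate:.1%} "
              f"vs baseline {baseline_pass_rate:.1%}")
        sys.exit(1)  # non-zero exit blocks the pipeline
    print("PASS: no eval regression")

gate(baseline_pass_rate=0.94, candidate_pass_rate=0.91)  # example: blocks deploy
```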
Bottom line
If your AI system goes down at 2am and you can't tell within five minutes whether the failure is in the model, the prompt, the retrieval, or your network, your observability is broken. Invest in the stack before the incident, not after.