What voice agent observability should show before production
Voice agent observability should connect transcripts, provider spans, tool calls, memory retrieval, guardrails, cost, latency, and evaluation results in one trace.
Trace every live conversation: latency, cost, tools, and memory in one view
Short answer
Voice agent observability is the operating view for every turn. It should show what the user said, what context the agent received, which model and voice providers ran, which tools were called, what guardrails fired, how long every step took, and what the session cost.
A transcript alone is not enough. Without the trace, operators cannot tell whether a bad answer came from speech recognition, retrieval, prompt state, tool failure, model output, TTS latency, or a policy decision.
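As a rough illustration of why per-layer spans matter, the sketch below assumes each turn is recorded as a flat list of span records. The field names (`layer`, `ok`, `ms`, `error`) and the sample values are hypothetical, not any particular platform's schema. With that data, finding the layer that failed or dominated latency becomes a lookup rather than a guess.

```python
from typing import Optional

# Hypothetical span records for one turn, in execution order.
# Field names and values are illustrative assumptions only.
turn_spans = [
    {"layer": "stt", "ok": True, "ms": 180},
    {"layer": "retrieval", "ok": True, "ms": 95},
    {"layer": "tool", "ok": False, "ms": 2400, "error": "timeout"},
    {"layer": "llm", "ok": True, "ms": 850},
    {"layer": "tts", "ok": True, "ms": 310},
]


def first_failure(spans: list) -> Optional[dict]:
    """Return the earliest span that reported an error, or None if all succeeded."""
    return next((span for span in spans if not span["ok"]), None)


def slowest(spans: list) -> dict:
    """Return the span that took the longest in this turn."""
    return max(spans, key=lambda span: span["ms"])


failed = first_failure(turn_spans)
if failed is not None:
    print(f"first failing layer: {failed['layer']} ({failed.get('error', 'unknown')})")
print(f"slowest layer: {slowest(turn_spans)['layer']}")
```

A transcript of the same turn would only show an unhelpful answer; the span list points at the tool call that timed out before the model ever responded.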
The minimum useful trace
Every production voice session should produce a trace tree that lets a human answer "what happened?" without mentally reconstructing the whole system.
- User transcript and timing for each turn.
- LLM request, response, latency, token use, and model identity.
- STT and TTS provider timing and errors.
- Tool calls, arguments, results, and failures.
- Memory or knowledge-base retrieval supplied to the model.
- Guardrail decisions, escalations, and blocked output.
- Cost and latency by provider layer.
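The list above can be sketched as a small trace tree. This is a minimal illustration under assumed names: the `Span` class, its fields, the layer labels, and every sample value are hypothetical and do not come from a specific tracing library. It shows one session containing one turn, with leaf spans for each layer and a rollup of cost and latency by provider layer.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Span:
    """One node in the trace tree (hypothetical schema)."""
    name: str
    layer: str  # e.g. "session", "turn", "stt", "llm", "tool", "memory", "guardrail", "tts"
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    attributes: dict = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def walk(self) -> Iterator[Span]:
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


def rollup_by_layer(root: Span) -> dict:
    """Sum latency and cost per layer, counting leaf spans only.

    Container spans (session, turn) are skipped so their totals
    are not double counted alongside their children.
    """
    totals = defaultdict(lambda: {"latency_ms": 0.0, "cost_usd": 0.0})
    for span in root.walk():
        if span.children:
            continue
        totals[span.layer]["latency_ms"] += span.latency_ms
        totals[span.layer]["cost_usd"] += span.cost_usd
    return dict(totals)


# Illustrative session with a single turn; all values are made up.
session = Span("session-1", "session", children=[
    Span("turn-1", "turn", children=[
        Span("transcribe", "stt", latency_ms=180, cost_usd=0.0004,
             attributes={"transcript": "Where is my order?"}),
        Span("kb_lookup", "memory", latency_ms=95,
             attributes={"documents": ["shipping-policy"]}),
        Span("lookup_order", "tool", latency_ms=240,
             attributes={"arguments": {"order_id": "A123"}, "result": "shipped"}),
        Span("output_check", "guardrail", latency_ms=12,
             attributes={"decision": "allow"}),
        Span("generate", "llm", latency_ms=850, cost_usd=0.0031,
             attributes={"model": "example-model", "input_tokens": 412, "output_tokens": 58}),
        Span("speak", "tts", latency_ms=310, cost_usd=0.0012),
    ]),
])

totals = rollup_by_layer(session)
for layer, values in totals.items():
    print(f"{layer:10s} {values['latency_ms']:7.0f} ms  ${values['cost_usd']:.4f}")
```

Keeping measurements on leaf spans and grouping them under turn and session containers is one possible design; it keeps the per-layer rollup simple and lets a viewer expand a single turn without loading the whole session.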
Evaluations belong beside traces
Offline tests catch regressions before launch. Online scorers watch real production traffic. Both are more useful when they attach to the same trace that contains the prompt, retrieval, tool calls, and guardrail outcomes.
| Signal | What it catches | Where to review |
|---|---|---|
| Scenario tests | Known flows breaking after an agent edit | Before publishing and in CI-style review. |
| Online scorers | Live answer quality or safety drift | Next to production sessions. |
| Post-session runners | Structured summaries, outcomes, and follow-ups | User history and session detail. |
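One way to keep evaluations beside traces is to store each result on the trace record it scored. The sketch below is an assumption-heavy illustration: `TraceRecord`, `EvalResult`, the `grounded_in_tool_result` scorer, and the sample sessions are all hypothetical names and data, and the scorer is a deliberately simple string check rather than a real quality metric.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class EvalResult:
    scorer: str
    source: str  # "scenario_test", "online_scorer", or "post_session"
    score: float
    passed: bool


@dataclass
class TraceRecord:
    """Hypothetical trace summary that evaluation results attach to."""
    trace_id: str
    prompt: str
    response: str
    tool_results: list[str] = field(default_factory=list)
    guardrail_decision: str = "allow"
    evals: list[EvalResult] = field(default_factory=list)


def grounded_in_tool_result(trace: TraceRecord) -> EvalResult:
    """Toy online scorer: pass if the response repeats any tool result verbatim."""
    response = trace.response.lower()
    passed = any(result.lower() in response for result in trace.tool_results)
    return EvalResult("grounded_in_tool_result", "online_scorer",
                      1.0 if passed else 0.0, passed)


def attach(trace: TraceRecord, scorer: Callable[[TraceRecord], EvalResult]) -> EvalResult:
    """Run a scorer and store its result on the trace it evaluated."""
    result = scorer(trace)
    trace.evals.append(result)
    return result


def failing_trace_ids(traces: list[TraceRecord]) -> list[str]:
    """Return ids of traces with at least one failed evaluation."""
    return [t.trace_id for t in traces if any(not e.passed for e in t.evals)]


# Illustrative production sessions; contents are made up.
traces = [
    TraceRecord("t-1", "Where is my order?", "Your order has shipped.",
                tool_results=["shipped"]),
    TraceRecord("t-2", "Where is my order?", "It should arrive tomorrow.",
                tool_results=["delayed"]),
]
for trace in traces:
    attach(trace, grounded_in_tool_result)

print(failing_trace_ids(traces))
```

Because the result lives on the same record as the prompt, tool results, and guardrail decision, a reviewer who opens a failing session sees the score and the evidence for it together.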
Optimize for the operator, not only the developer
Voice agents are operated by support, product, compliance, and growth teams, not only engineers. The trace viewer should be readable enough for non-engineers to identify the moment that mattered and precise enough for engineers to fix it.