How to choose the STT, LLM, and TTS stack for a voice agent

A production voice agent stack has three separate provider choices: speech-to-text, language model, and text-to-speech. Each layer should be evaluated independently.

Voice stack · 8 min read

Route every layer on purpose

STT, LLM, and TTS without stack lock-in

Short answer

Choose speech-to-text for latency, language coverage, domain accuracy, and interruption behavior. Choose the language model for reasoning quality, cost, tool use, and reliability. Choose text-to-speech for voice quality, streaming latency, language support, and brand fit.

Do not lock the whole agent to one vendor because one layer is strong. Voice agents benefit from separate routing and fallback policies for STT, LLM, and TTS.

Evaluate each layer separately

A provider that is excellent for one layer can be mediocre for another. A low-latency STT provider may not have the voice catalog you want. A strong reasoning model may be too slow for every turn. A beautiful TTS voice may need a faster backup for peak traffic.

| Layer | Primary question | Common failure mode |
|-------|------------------|---------------------|
| STT | Can it hear the user accurately and quickly? | Late partials, noisy transcripts, poor interruption handling. |
| LLM | Can it reason, call tools, and stay in bounds? | Slow turns, weak tool arguments, policy drift. |
| TTS | Does the voice fit the product and stream fast enough? | Beautiful voice with latency that breaks conversation flow. |
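One way to make a per-layer evaluation concrete is a weighted score per candidate, computed separately for each layer. The sketch below is illustrative only: the criteria names follow this guide, but the weights, the 0 to 1 rating scale, and the `score` and `pick` helpers are assumptions, not a prescribed method.

```python
from __future__ import annotations

# Hypothetical weights per layer. The criteria mirror the guide;
# the numbers are placeholders that a team would tune for its own product.
LAYER_WEIGHTS = {
    "stt": {"latency": 0.4, "accuracy": 0.4, "interruption": 0.2},
    "llm": {"reasoning": 0.4, "tool_use": 0.3, "latency": 0.3},
    "tts": {"voice_fit": 0.5, "streaming_latency": 0.5},
}


def score(layer: str, ratings: dict[str, float]) -> float:
    """Weighted sum of 0-1 ratings for one candidate on one layer."""
    weights = LAYER_WEIGHTS[layer]
    # A criterion with no rating counts as 0 so gaps are penalized, not ignored.
    return sum(weights[name] * ratings.get(name, 0.0) for name in weights)


def pick(layer: str, candidates: dict[str, dict[str, float]]) -> str:
    """Return the candidate name with the highest score for this layer."""
    return max(candidates, key=lambda name: score(layer, candidates[name]))
```

Because each layer has its own weights, a vendor that wins on TTS voice fit gets no automatic advantage when STT or LLM candidates are compared.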

Use cascading fallbacks where the user feels risk

Fallbacks should match user impact. If STT latency spikes, the agent feels broken immediately. If a model call fails, the user hears silence unless the runtime can recover. If TTS is unavailable, the session can continue only if another compatible voice is ready.
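A cascading fallback for a single layer can be sketched as an ordered list of providers, each tried with a timeout. This is a minimal sketch under assumptions: the `call_with_fallback` helper, the `Provider` signature, and the timeout value are hypothetical, and a real runtime would also need streaming and cancellation handling.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Sequence

# Hypothetical provider interface: takes a payload, returns a result asynchronously.
Provider = Callable[[str], Awaitable[str]]


async def call_with_fallback(
    providers: Sequence[tuple[str, Provider]],
    payload: str,
    timeout_s: float,
) -> tuple[str, str]:
    """Try providers in order; return (provider_name, result) from the first success."""
    errors: list[str] = []
    for name, provider in providers:
        try:
            # A slow provider is treated like a failed one once the timeout passes.
            result = await asyncio.wait_for(provider(payload), timeout=timeout_s)
            return name, result
        except Exception as exc:  # timeout or provider error: move to the next one
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The timeout is the part that reflects user impact: a tight budget on STT or TTS trades a little quality for avoiding dead air, while a looser budget may be acceptable for a background tool call.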

Hyponema separates provider credentials, stack configuration, and session routing so teams can change a layer without rebuilding the whole agent.
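The separation described above can be illustrated with a small data model. This is not Hyponema's actual API: `StackConfig`, `CREDENTIALS`, `route_session`, and the provider names are hypothetical, chosen only to show why keeping the three concerns apart makes a single-layer swap cheap.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StackConfig:
    """Which provider serves each layer (hypothetical provider names)."""
    stt: str
    llm: str
    tts: str


# Credentials are stored apart from the stack, keyed by provider name.
# Values are references to secrets, not the secrets themselves.
CREDENTIALS = {
    "stt-a": "env:STT_A_KEY",
    "llm-a": "env:LLM_A_KEY",
    "tts-a": "env:TTS_A_KEY",
    "tts-b": "env:TTS_B_KEY",
}


def route_session(
    stack: StackConfig, credentials: dict[str, str]
) -> dict[str, tuple[str, str]]:
    """Resolve each layer to (provider, credential reference) for one session."""
    routing = {}
    for layer in ("stt", "llm", "tts"):
        provider = getattr(stack, layer)
        if provider not in credentials:
            raise KeyError(f"no credential for {layer} provider {provider!r}")
        routing[layer] = (provider, credentials[provider])
    return routing


stack = StackConfig(stt="stt-a", llm="llm-a", tts="tts-a")
# Swapping one layer creates a new config; the other layers are untouched.
stack_with_backup_voice = replace(stack, tts="tts-b")
```

In this sketch, changing the TTS provider touches one field of the stack configuration and leaves the credential store and routing logic as they were.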

Keep provider keys portable

Bring-your-own keys are not only a pricing choice. They also keep procurement, usage visibility, and vendor relationships under the operator's control. For teams that have already negotiated model, speech, or voice contracts, BYO credentials avoid double billing and lock-in.

  • Use separate credentials for each provider layer.
  • Track latency, cost, and error rate per provider.
  • Keep an escape path for each layer before launch.
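Tracking latency, cost, and error rate per provider can start as a small in-process aggregator. The sketch below assumes a hypothetical `StackMetrics` class and a caller that reports latency in milliseconds and cost in US dollars; a production setup would more likely export these figures to an existing metrics system.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ProviderStats:
    calls: int = 0
    errors: int = 0
    total_latency_ms: float = 0.0
    total_cost_usd: float = 0.0

    @property
    def error_rate(self) -> float:
        return self.errors / self.calls if self.calls else 0.0

    @property
    def mean_latency_ms(self) -> float:
        return self.total_latency_ms / self.calls if self.calls else 0.0


class StackMetrics:
    """Aggregates call outcomes per (layer, provider) pair."""

    def __init__(self) -> None:
        self._stats: dict[tuple[str, str], ProviderStats] = defaultdict(ProviderStats)

    def record(
        self, layer: str, provider: str, latency_ms: float, cost_usd: float, ok: bool
    ) -> None:
        stats = self._stats[(layer, provider)]
        stats.calls += 1
        stats.total_latency_ms += latency_ms
        stats.total_cost_usd += cost_usd
        if not ok:
            stats.errors += 1

    def stats(self, layer: str, provider: str) -> ProviderStats:
        return self._stats[(layer, provider)]
```

Keying by layer and provider together keeps the numbers comparable when the same vendor serves more than one layer.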

Questions

Should a voice agent use one provider for every layer?
Usually no. A production stack should let teams choose STT, LLM, and TTS independently because each layer has different latency, quality, cost, and reliability tradeoffs.

What is a cascading fallback?
A cascading fallback tries a backup provider or configuration when the primary layer fails, times out, or degrades. In voice agents, fallbacks are especially important because silence is immediately noticeable to the user.
