
Speech Latency: The Sub-800ms Playbook for Voice AI in 2026

Technical · April 17, 2026 · 9 min read


Pritesh Kumar · Cofounder

Key Takeaways

  • Streaming STT cuts 400ms+ over batch and sets your latency floor
  • Colocation in one availability zone saves 100-400ms of network overhead
  • Self-hosting removes per-minute fees and vendor network hops in a single move

Production voice AI is 5x slower than human conversation. Most teams optimize the wrong component. This post breaks down where the milliseconds go and how to get under 800ms with an open-source stack.

Speech latency is the total time from when a user finishes speaking to when they hear the agent's first audio response. In production, end-to-end latency averages 1,400-1,700ms across 4M+ voice agent calls, per Hamming AI's dataset. Human conversation has natural turn-taking pauses of 200-300ms, and anything above 800ms feels unnatural to callers.

Five components make up the latency stack: voice activity detection (VAD), speech-to-text (STT), LLM inference, text-to-speech (TTS), and network transport. Most guides say the LLM is the biggest bottleneck. That is only true if your STT already streams - batch STT like Whisper sets a 500ms+ floor before the LLM even sees text.

Switching to streaming STT (Deepgram Nova-3) and colocating your stack in one availability zone gets you under 800ms. Dograh, an open-source voice agent platform, makes colocation the default by letting you self-host every component.

What Speech Latency Means in Voice AI

Speech latency is the mouth-to-ear turn gap - the time between the last syllable a user speaks and the first syllable they hear back. This differs from the platform turn gap, which only measures processing time inside your infrastructure. Mouth-to-ear includes network travel and audio buffering on both ends. Twilio's latency guide makes this distinction well.

The human baseline is 200-300ms. That is the natural pause between conversational turns. Voice agents do not need to match 200ms, but they need to stay under 800ms. Above that threshold, callers notice the delay. Contact centers report a 40% increase in hang-ups when voice AI response time exceeds one second. The production reality is worse. Hamming AI analyzed 4M+ voice agent calls and found a P50 (median) of 1,400-1,700ms. P99 reaches 8.4-15.3 seconds - one in a hundred calls has a response gap longer than eight seconds. If you are making AI outbound calls work at any volume, those tail latencies destroy conversion rates.

Where Every Millisecond Goes

Five components eat your latency budget. Voice activity detection (VAD) takes 200-800ms and is the hidden killer. VAD determines when the user stops talking. A misconfigured threshold waits 500ms+ of silence before triggering the pipeline. That is dead time where no processing happens. Most off-the-shelf solutions ship with conservative 500ms defaults because false triggers are worse than slow triggers, but 200-300ms is achievable with proper tuning.
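The tradeoff is easiest to see as a silence hangover: end the turn only after N milliseconds of continuous non-speech. Here is a minimal, library-agnostic sketch - the frame size, function names, and thresholds are illustrative, not any vendor's API - showing why a stock 500ms hangover adds roughly a quarter second of dead time per turn versus a tuned 250ms threshold.

```python
# Minimal silence-hangover turn-end detector (illustrative, not a specific library's API).
FRAME_MS = 10  # one VAD decision per 10ms audio frame

def turn_end_detector(stop_ms: int = 250):
    """Feed one is_speech flag per frame; returns True once the user
    has been silent for stop_ms after having spoken."""
    silence_frames_needed = stop_ms // FRAME_MS
    state = {"silent_frames": 0, "heard_speech": False}

    def on_frame(is_speech: bool) -> bool:
        if is_speech:
            state["heard_speech"] = True
            state["silent_frames"] = 0
            return False
        state["silent_frames"] += 1
        return state["heard_speech"] and state["silent_frames"] >= silence_frames_needed

    return on_frame

# ~800ms of speech followed by silence: a 250ms hangover declares the turn
# over ~250ms after the last voiced frame; a 500ms default waits twice as long.
detector = turn_end_detector(stop_ms=250)
frames = [True] * 80 + [False] * 60
turn_ended_at = next(i for i, f in enumerate(frames) if detector(f)) * FRAME_MS
print(f"turn end declared ~{turn_ended_at}ms into the clip")
```

Tighter thresholds raise the risk of cutting callers off mid-pause, which is why vendors ship conservative defaults; tune against real call audio rather than guessing.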

Speech-to-text (STT) adds 100-500ms depending on whether you use streaming or batch. Streaming returns partial transcripts as audio arrives. Batch waits for the full utterance. This single choice determines your end-to-end voice latency floor. LLM inference (time to first token) adds 200-800ms, varying by model size. Smaller models like GPT-4o-mini (~400ms) and Claude Haiku (~360ms) are fast enough for voice agents, where speed matters more than reasoning depth. Text-to-speech (TTS) adds 40-200ms and is rarely the bottleneck in 2026 - Cartesia Sonic Turbo delivers ~40ms and ElevenLabs Flash hits ~75ms.

Network transport adds 50-200ms per hop. Every API call to a managed service is a hop, and three managed services means three hops minimum. Twilio's component analysis confirms 50-200ms per hop. Colocated services communicate in under 5ms.

Why STT Is the Bottleneck

Most voice AI guides say LLM inference is 40-70% of total latency. That framing is misleading. It is only true after you have already fixed your STT.

Deepgram Nova-3 streams transcripts in sub-300ms. Whisper, the most popular open-source STT model, has no native streaming support. Chunked Whisper implementations add 500ms+ of latency and reduce accuracy on partial utterances. That 400-500ms gap between streaming and batch STT is larger than the gap between the fastest and slowest production LLMs. GPT-4o-mini and Claude Haiku differ by about 40ms. Deepgram and Whisper differ by 400ms+.

The pipeline is sequential - STT finishes before LLM starts, LLM finishes before TTS starts. A slow STT pushes everything back. With streaming STT, the LLM receives partial text and begins processing hundreds of milliseconds earlier. The savings compound through the pipeline. This is why we exclude Whisper from real-time voice agent recommendations. It works well for batch transcription. It is wrong for real-time speech to text latency optimization. Streaming STT is the foundation.
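To make the sequencing concrete, here is a small asyncio sketch of the pattern. The stt_stream and llm_stream helpers are hypothetical stand-ins, not any vendor's SDK: the point is that with streaming STT the transcript is already assembled when the turn ends, so the LLM call fires immediately instead of waiting behind a separate batch transcription pass.

```python
import asyncio

async def stt_stream(audio_frames):
    """Streaming STT stand-in: yields (text, is_final) while audio is still arriving."""
    partial = []
    async for frame in audio_frames:
        partial.append(frame)
        yield " ".join(partial), False
    yield " ".join(partial), True        # transcript finalizes shortly after end of speech

async def llm_stream(prompt: str):
    yield f"(first token for {prompt!r})"  # placeholder for a real streamed completion

async def handle_turn(audio_frames):
    transcript = ""
    async for text, is_final in stt_stream(audio_frames):
        transcript = text                # partials accumulate during speech
        if is_final:
            break                        # LLM starts the moment the transcript settles,
    async for token in llm_stream(transcript):  # not after a batch transcription pass
        print(token)
        break                            # first tokens go straight to TTS in a real stack

async def fake_mic():
    for word in ["what's", "my", "account", "balance"]:
        yield word
        await asyncio.sleep(0.05)

asyncio.run(handle_turn(fake_mic()))
```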


Colocation and Self-Hosting

Every network hop between managed services adds 50-200ms. Three API calls to three different cloud regions can add 150-400ms of pure network overhead. No code change fixes this. Only infrastructure decisions do. Hamming AI's data shows the geographic penalty: US East to West adds 60-80ms, US to Europe adds 80-150ms, US to Asia adds 150-250ms. If your STT runs in us-east-1, your LLM in us-west-2, and your TTS in eu-west-1, you burn 200ms+ on network travel alone.

The fix: deploy STT, LLM, and TTS in the same cloud availability zone - not just the same region, the same AZ. Inter-AZ latency within a region is 1-2ms; cross-region is 40-80ms. Pick us-east-1 or us-west-2, where most providers have a presence, and run everything together. Verify with traceroute.
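A quick sanity check alongside traceroute is to time the TCP handshake from your app host to each endpoint. A stdlib sketch - the hostnames are illustrative placeholders, so point it at whatever your pipeline actually calls, including internal addresses in a self-hosted deployment: single-digit milliseconds means you are effectively colocated, while anything near 40ms+ means a hop is leaving the region.

```python
import socket
import time

# Rough round-trip check from the app host to each provider endpoint.
# Hostnames are illustrative; substitute your own STT/LLM/TTS endpoints.
ENDPOINTS = {
    "stt": ("api.deepgram.com", 443),
    "llm": ("api.groq.com", 443),
    "tts": ("api.cartesia.ai", 443),
}

for name, (host, port) in ENDPOINTS.items():
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass                                  # TCP handshake only; TLS adds at least one more RTT
    rtt_ms = (time.perf_counter() - start) * 1000
    print(f"{name:>3} {host:<20} ~{rtt_ms:6.1f} ms")
```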

Self-hosted stacks make this simple. When you control the infrastructure, colocation is a deployment choice, not a feature request to three vendors. Managed voice platforms also route every API call through their own infrastructure, adding 100-300ms of network overhead that disappears when services communicate locally. At 1,000 minutes per day, managed platforms cost $15,000-$30,000 per year in per-minute fees. Self-hosting on Dograh costs infrastructure only. Dograh is open-source (BSD-2 license) with full data sovereignty for healthcare and finance teams that cannot send call recordings to third-party APIs. As an alternative to managed voice platforms, it lets you run the full pipeline on your own infrastructure.

The Sub-800ms Provider Stack

A specific combination of providers achieves sub-800ms voice AI latency in production.

Component   | Provider               | Latency           | Cost
STT         | Deepgram Nova-3        | ~200ms streaming  | $0.0043/min
LLM         | Groq (Llama 3.3 70B)   | ~200ms TTFT       | $0.04/1M tokens
LLM (alt)   | Claude 3.5 Haiku       | ~360ms TTFT       | $0.25/1M tokens
TTS         | Cartesia Sonic Turbo   | ~40ms TTFA        | Usage-based
TTS (alt)   | ElevenLabs Flash v2.5  | ~75ms TTFA        | $0.15/1K chars

With Groq + Deepgram + Cartesia colocated: 200 + 200 + 40 + 15ms network = ~455ms platform latency. Add 40ms media edge + 30ms buffering = ~525ms mouth-to-ear. With Claude Haiku instead of Groq: ~685ms mouth-to-ear. Both are well under 800ms. The framework layer adds minimal overhead. Pipecat contributes ~5-10ms and handles streaming orchestration between components. Dograh builds on Pipecat with production deployment and self-hosting support.
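If you want to sanity-check the arithmetic against your own providers, the budget is small enough to keep as a script. This sketch just re-adds the figures quoted above (published or typical numbers, not fresh benchmarks) and flags anything over the 800ms target.

```python
# Back-of-the-envelope latency budgets using the figures cited in this post.
BUDGETS_MS = {
    "groq_stack":  {"stt": 200, "llm_ttft": 200, "tts_ttfa": 40, "network": 15},
    "haiku_stack": {"stt": 200, "llm_ttft": 360, "tts_ttfa": 40, "network": 15},
}
MEDIA_EDGE_MS, BUFFERING_MS, TARGET_MS = 40, 30, 800

for name, parts in BUDGETS_MS.items():
    platform = sum(parts.values())
    mouth_to_ear = platform + MEDIA_EDGE_MS + BUFFERING_MS
    verdict = "under budget" if mouth_to_ear < TARGET_MS else "over budget"
    print(f"{name}: platform ~{platform}ms, mouth-to-ear ~{mouth_to_ear}ms ({verdict})")
# groq_stack  -> ~455ms platform, ~525ms mouth-to-ear
# haiku_stack -> ~615ms platform, ~685ms mouth-to-ear
```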


Measuring Speech Latency in Production

Averages hide the worst user experiences. Use percentiles instead. P50 shows the typical experience - 1,400-1,700ms in production today. P90 (3.3-3.8 seconds) shows what one in ten callers experience. P99 (8.4-15.3 seconds) shows the worst case that still happens regularly. These numbers come from Hamming AI's analysis of 4M+ calls.

Measure mouth-to-ear, not just platform turn gap. Include VAD time. Many teams measure from "transcript received" and miss the 200-800ms of VAD delay before that point. Track per-component latency separately: STT, LLM TTFT, TTS TTFA, and network. When overall latency spikes, you need to know which layer caused it. Log timestamps at each pipeline boundary. The 300ms rule framework from AssemblyAI offers a solid starting model for component-level tracking.
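As a hedged example of what boundary logging can look like - the mark names and structure are placeholders, not Dograh's or Pipecat's API - capture one perf_counter() timestamp per pipeline boundary, derive per-stage durations, and report percentiles rather than means.

```python
from statistics import quantiles

# Placeholder boundary names; record whatever marks your pipeline actually emits.
def turn_breakdown(marks: dict[str, float]) -> dict[str, float]:
    """marks holds time.perf_counter() values captured at each pipeline boundary."""
    return {
        "vad":   (marks["speech_end_detected"] - marks["speech_end_actual"]) * 1000,
        "stt":   (marks["final_transcript"]    - marks["speech_end_detected"]) * 1000,
        "llm":   (marks["first_token"]         - marks["final_transcript"]) * 1000,
        "tts":   (marks["first_audio_byte"]    - marks["first_token"]) * 1000,
        "total": (marks["first_audio_byte"]    - marks["speech_end_actual"]) * 1000,
    }

def report(turns: list[dict[str, float]]) -> None:
    """Print P50/P90/P99 per stage across many recorded turns."""
    for stage in ("vad", "stt", "llm", "tts", "total"):
        values = sorted(t[stage] for t in turns)
        q = quantiles(values, n=100)
        p50, p90, p99 = q[49], q[89], q[98]
        print(f"{stage:>5}: p50={p50:6.0f}ms  p90={p90:6.0f}ms  p99={p99:6.0f}ms")
```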

Glossary

TTFA (Time to First Audio)
The elapsed time from end of user speech to the first audible byte of the agent's response. The single most important latency metric for voice agents because it captures the full pipeline delay the caller actually perceives.
Voice Activity Detection (VAD)
Algorithm that determines when a user has stopped speaking so the pipeline can begin processing. Misconfigured VAD adds 200-800ms of artificial silence before STT even starts. Often the largest hidden contributor to speech latency.
Streaming STT
Speech-to-text that returns partial transcripts as audio arrives, rather than waiting for the full utterance to complete. Reduces STT latency from 500ms+ (batch) to sub-300ms (streaming) and allows downstream LLM processing to start earlier.
Colocation
Deploying all pipeline services (STT, LLM, TTS) in the same cloud availability zone to minimize inter-service network latency. Reduces network overhead from 150-400ms (cross-region) to under 5ms (same AZ).

