
Speech Latency: The Sub-800ms Playbook for Voice AI in 2026

Technical · April 17, 2026 · 9 min read


Pritesh Kumar · Cofounder

Key Takeaways

  • Streaming STT cuts 400ms+ over batch and sets your latency floor
  • Colocation in one availability zone saves 100-400ms of network overhead
  • Self-hosting removes per-minute fees and vendor network hops in a single move

Production voice AI is 5x slower than human conversation. Most teams optimize the wrong component. This post breaks down where the milliseconds go and how to get under 800ms with an open-source stack.

Speech latency is the total time from when a user finishes speaking to when they hear the agent's first audio response. In production, end-to-end latency averages 1,400-1,700ms across 4M+ voice agent calls, per Hamming AI's dataset. Human conversation has natural turn-taking pauses of 200-300ms, and anything above 800ms feels unnatural to callers.

Five components make up the latency stack: voice activity detection (VAD), speech-to-text (STT), LLM inference, text-to-speech (TTS), and network transport. Most guides say the LLM is the biggest bottleneck. That is only true if your STT already streams - batch STT like Whisper sets a 500ms+ floor before the LLM even sees text.

Switching to streaming STT (Deepgram Nova-3) and colocating your stack in one availability zone gets you under 800ms. Dograh, an open-source voice agent platform, makes colocation the default by letting you self-host every component.

What Speech Latency Means in Voice AI

Speech latency is the mouth-to-ear turn gap - the time between the last syllable a user speaks and the first syllable they hear back. This differs from the platform turn gap, which only measures processing time inside your infrastructure. Mouth-to-ear includes network travel and audio buffering on both ends. Twilio's latency guide makes this distinction well.

The human baseline is 200-300ms. That is the natural pause between conversational turns. Voice agents do not need to match 200ms, but they need to stay under 800ms. Above that threshold, callers notice the delay. Contact centers report a 40% increase in hang-ups when voice AI response time exceeds one second. The production reality is worse. Hamming AI analyzed 4M+ voice agent calls and found a P50 (median) of 1,400-1,700ms. P99 reaches 8.4-15.3 seconds - one in a hundred calls has a response gap longer than eight seconds. If you are making AI outbound calls work at any volume, those tail latencies destroy conversion rates.

Where Every Millisecond Goes

Five components eat your latency budget. Voice activity detection (VAD) takes 200-800ms and is the hidden killer. VAD determines when the user stops talking. A misconfigured threshold waits 500ms+ of silence before triggering the pipeline. That is dead time where no processing happens. Most off-the-shelf solutions ship with conservative 500ms defaults because false triggers are worse than slow triggers, but 200-300ms is achievable with proper tuning.
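The tradeoff is easiest to see as a silence hangover: end the turn only after N milliseconds of continuous non-speech. Here is a minimal, library-agnostic sketch - the frame size, function names, and thresholds are illustrative, not any vendor's API - showing why a stock 500ms hangover adds roughly a quarter second of dead time per turn versus a tuned 250ms threshold.

```python
# Minimal silence-hangover turn-end detector (illustrative, not a specific library's API).
FRAME_MS = 10  # one VAD decision per 10ms audio frame

def turn_end_detector(stop_ms: int = 250):
    """Feed one is_speech flag per frame; returns True once the user
    has been silent for stop_ms after having spoken."""
    silence_frames_needed = stop_ms // FRAME_MS
    state = {"silent_frames": 0, "heard_speech": False}

    def on_frame(is_speech: bool) -> bool:
        if is_speech:
            state["heard_speech"] = True
            state["silent_frames"] = 0
            return False
        state["silent_frames"] += 1
        return state["heard_speech"] and state["silent_frames"] >= silence_frames_needed

    return on_frame

# ~800ms of speech followed by silence: a 250ms hangover declares the turn
# over ~250ms after the last voiced frame; a 500ms default waits twice as long.
detector = turn_end_detector(stop_ms=250)
frames = [True] * 80 + [False] * 60
turn_ended_at = next(i for i, f in enumerate(frames) if detector(f)) * FRAME_MS
print(f"turn end declared ~{turn_ended_at}ms into the clip")
```

Tighter thresholds raise the risk of cutting callers off mid-pause, which is why vendors ship conservative defaults; tune against real call audio rather than guessing.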

Speech-to-text (STT) adds 100-500ms depending on whether you use streaming or batch. Streaming returns partial transcripts as audio arrives. Batch waits for the full utterance. This single choice determines your end-to-end voice latency floor. LLM inference (time to first token) adds 200-800ms, varying by model size. Smaller models like GPT-4o-mini (~400ms) and Claude Haiku (~360ms) are fast enough for voice agents, where speed matters more than reasoning depth. Text-to-speech (TTS) adds 40-200ms and is rarely the bottleneck in 2026 - Cartesia Sonic Turbo delivers ~40ms and ElevenLabs Flash hits ~75ms.

Network transport adds 50-200ms per hop. Every API call to a managed service is a hop, and three managed services means three hops minimum. Twilio's component analysis confirms 50-200ms per hop. Colocated services communicate in under 5ms.

Why STT Is the Bottleneck

Most voice AI guides say LLM inference is 40-70% of total latency. That framing is misleading. It is only true after you have already fixed your STT.

Deepgram Nova-3 streams transcripts in sub-300ms. Whisper, the most popular open-source STT model, has no native streaming support. Chunked Whisper implementations add 500ms+ of latency and reduce accuracy on partial utterances. That 400-500ms gap between streaming and batch STT is larger than the gap between the fastest and slowest production LLMs. GPT-4o-mini and Claude Haiku differ by about 40ms. Deepgram and Whisper differ by 400ms+.

The pipeline is sequential - STT finishes before LLM starts, LLM finishes before TTS starts. A slow STT pushes everything back. With streaming STT, the LLM receives partial text and begins processing hundreds of milliseconds earlier. The savings compound through the pipeline. This is why we exclude Whisper from real-time voice agent recommendations. It works well for batch transcription. It is wrong for real-time speech to text latency optimization. Streaming STT is the foundation.
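To make the sequencing concrete, here is a small asyncio sketch of the pattern. The stt_stream and llm_stream helpers are hypothetical stand-ins, not any vendor's SDK: the point is that with streaming STT the transcript is already assembled when the turn ends, so the LLM call fires immediately instead of waiting behind a separate batch transcription pass.

```python
import asyncio

async def stt_stream(audio_frames):
    """Streaming STT stand-in: yields (text, is_final) while audio is still arriving."""
    partial = []
    async for frame in audio_frames:
        partial.append(frame)
        yield " ".join(partial), False
    yield " ".join(partial), True        # transcript finalizes shortly after end of speech

async def llm_stream(prompt: str):
    yield f"(first token for {prompt!r})"  # placeholder for a real streamed completion

async def handle_turn(audio_frames):
    transcript = ""
    async for text, is_final in stt_stream(audio_frames):
        transcript = text                # partials accumulate during speech
        if is_final:
            break                        # LLM starts the moment the transcript settles,
    async for token in llm_stream(transcript):  # not after a batch transcription pass
        print(token)
        break                            # first tokens go straight to TTS in a real stack

async def fake_mic():
    for word in ["what's", "my", "account", "balance"]:
        yield word
        await asyncio.sleep(0.05)

asyncio.run(handle_turn(fake_mic()))
```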


Colocation and Self-Hosting

Every network hop between managed services adds 50-200ms. Three API calls to three different cloud regions can add 150-400ms of pure network overhead. No code change fixes this. Only infrastructure decisions do. Hamming AI's data shows the geographic penalty: US East to West adds 60-80ms, US to Europe adds 80-150ms, US to Asia adds 150-250ms. If your STT runs in us-east-1, your LLM in us-west-2, and your TTS in eu-west-1, you burn 200ms+ on network travel alone.

The fix: deploy STT, LLM, and TTS in the same cloud availability zone - not just the same region, the same AZ. Inter-AZ latency within a region is 1-2ms; cross-region is 40-80ms. Pick us-east-1 or us-west-2, where most providers have a presence, and run everything together. Verify with traceroute.
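A quick sanity check alongside traceroute is to time the TCP handshake from your app host to each endpoint. A stdlib sketch - the hostnames are illustrative placeholders, so point it at whatever your pipeline actually calls, including internal addresses in a self-hosted deployment: single-digit milliseconds means you are effectively colocated, while anything near 40ms+ means a hop is leaving the region.

```python
import socket
import time

# Rough round-trip check from the app host to each provider endpoint.
# Hostnames are illustrative; substitute your own STT/LLM/TTS endpoints.
ENDPOINTS = {
    "stt": ("api.deepgram.com", 443),
    "llm": ("api.groq.com", 443),
    "tts": ("api.cartesia.ai", 443),
}

for name, (host, port) in ENDPOINTS.items():
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=5):
        pass                                  # TCP handshake only; TLS adds at least one more RTT
    rtt_ms = (time.perf_counter() - start) * 1000
    print(f"{name:>3} {host:<20} ~{rtt_ms:6.1f} ms")
```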

Self-hosted stacks make this simple. When you control the infrastructure, colocation is a deployment choice, not a feature request to three vendors. Managed voice platforms also route every API call through their own infrastructure, adding 100-300ms of network overhead that disappears when services communicate locally. At 1,000 minutes per day, managed platforms cost $15,000-$30,000 per year in per-minute fees. Self-hosting on Dograh costs infrastructure only. Dograh is open-source (BSD-2 license) with full data sovereignty for healthcare and finance teams that cannot send call recordings to third-party APIs. As an alternative to managed voice platforms, it lets you run the full pipeline on your own infrastructure.

The Sub-800ms Provider Stack

A specific combination of providers achieves sub-800ms voice AI latency in production.

Component   | Provider               | Latency           | Cost
STT         | Deepgram Nova-3        | ~200ms streaming  | $0.0043/min
LLM         | Groq (Llama 3.3 70B)   | ~200ms TTFT       | $0.04/1M tokens
LLM (alt)   | Claude 3.5 Haiku       | ~360ms TTFT       | $0.25/1M tokens
TTS         | Cartesia Sonic Turbo   | ~40ms TTFA        | Usage-based
TTS (alt)   | ElevenLabs Flash v2.5  | ~75ms TTFA        | $0.15/1K chars

With Groq + Deepgram + Cartesia colocated: 200 + 200 + 40 + 15ms network = ~455ms platform latency. Add 40ms media edge + 30ms buffering = ~525ms mouth-to-ear. With Claude Haiku instead of Groq: ~685ms mouth-to-ear. Both are well under 800ms. The framework layer adds minimal overhead. Pipecat contributes ~5-10ms and handles streaming orchestration between components. Dograh builds on Pipecat with production deployment and self-hosting support.
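If you want to sanity-check the arithmetic against your own providers, the budget is small enough to keep as a script. This sketch just re-adds the figures quoted above (published or typical numbers, not fresh benchmarks) and flags anything over the 800ms target.

```python
# Back-of-the-envelope latency budgets using the figures cited in this post.
BUDGETS_MS = {
    "groq_stack":  {"stt": 200, "llm_ttft": 200, "tts_ttfa": 40, "network": 15},
    "haiku_stack": {"stt": 200, "llm_ttft": 360, "tts_ttfa": 40, "network": 15},
}
MEDIA_EDGE_MS, BUFFERING_MS, TARGET_MS = 40, 30, 800

for name, parts in BUDGETS_MS.items():
    platform = sum(parts.values())
    mouth_to_ear = platform + MEDIA_EDGE_MS + BUFFERING_MS
    verdict = "under budget" if mouth_to_ear < TARGET_MS else "over budget"
    print(f"{name}: platform ~{platform}ms, mouth-to-ear ~{mouth_to_ear}ms ({verdict})")
# groq_stack  -> ~455ms platform, ~525ms mouth-to-ear
# haiku_stack -> ~615ms platform, ~685ms mouth-to-ear
```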


Measuring Speech Latency in Production

Averages hide the worst user experiences. Use percentiles instead. P50 shows the typical experience - 1,400-1,700ms in production today. P90 (3.3-3.8 seconds) shows what one in ten callers experience. P99 (8.4-15.3 seconds) shows the worst case that still happens regularly. These numbers come from Hamming AI's analysis of 4M+ calls.

Measure mouth-to-ear, not just platform turn gap. Include VAD time. Many teams measure from "transcript received" and miss the 200-800ms of VAD delay before that point. Track per-component latency separately: STT, LLM TTFT, TTS TTFA, and network. When overall latency spikes, you need to know which layer caused it. Log timestamps at each pipeline boundary. The 300ms rule framework from AssemblyAI offers a solid starting model for component-level tracking.
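As a hedged example of what boundary logging can look like - the mark names and structure are placeholders, not Dograh's or Pipecat's API - capture one perf_counter() timestamp per pipeline boundary, derive per-stage durations, and report percentiles rather than means.

```python
from statistics import quantiles

# Placeholder boundary names; record whatever marks your pipeline actually emits.
def turn_breakdown(marks: dict[str, float]) -> dict[str, float]:
    """marks holds time.perf_counter() values captured at each pipeline boundary."""
    return {
        "vad":   (marks["speech_end_detected"] - marks["speech_end_actual"]) * 1000,
        "stt":   (marks["final_transcript"]    - marks["speech_end_detected"]) * 1000,
        "llm":   (marks["first_token"]         - marks["final_transcript"]) * 1000,
        "tts":   (marks["first_audio_byte"]    - marks["first_token"]) * 1000,
        "total": (marks["first_audio_byte"]    - marks["speech_end_actual"]) * 1000,
    }

def report(turns: list[dict[str, float]]) -> None:
    """Print P50/P90/P99 per stage across many recorded turns."""
    for stage in ("vad", "stt", "llm", "tts", "total"):
        values = sorted(t[stage] for t in turns)
        q = quantiles(values, n=100)
        p50, p90, p99 = q[49], q[89], q[98]
        print(f"{stage:>5}: p50={p50:6.0f}ms  p90={p90:6.0f}ms  p99={p99:6.0f}ms")
```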

Glossary

TTFA (Time to First Audio)
The elapsed time from end of user speech to the first audible byte of the agent's response. The single most important latency metric for voice agents because it captures the full pipeline delay the caller actually perceives.
Voice Activity Detection (VAD)
Algorithm that determines when a user has stopped speaking so the pipeline can begin processing. Misconfigured VAD adds 200-800ms of artificial silence before STT even starts. Often the largest hidden contributor to speech latency.
Streaming STT
Speech-to-text that returns partial transcripts as audio arrives, rather than waiting for the full utterance to complete. Reduces STT latency from 500ms+ (batch) to sub-300ms (streaming) and allows downstream LLM processing to start earlier.
Colocation
Deploying all pipeline services (STT, LLM, TTS) in the same cloud availability zone to minimize inter-service network latency. Reduces network overhead from 150-400ms (cross-region) to under 5ms (same AZ).

