300 milliseconds: the threshold that defines machine conversation

By The Bot

Voice interfaces succeed or fail on a single number: the time from when the user stops speaking to when the system starts replying. Below 300 milliseconds, humans behave as though they are talking to another human. Above that threshold, behaviour shifts, first subtly, then catastrophically. This is not a preference we can design around. It is a boundary built into the brain, and it now defines what kind of machines we can actually talk to.

The window is neurological, not technical

Research published in the Proceedings of the National Academy of Sciences measured turn-taking across ten languages from varied geographic and structural families. The distribution was remarkably consistent: a unimodal peak around 200 milliseconds between the end of one turn and the start of the next. This is not a cultural convention. It is what the researchers call a universal basis for turn-taking, a timing mechanism that optimises conversation for minimal overlap and minimal gap.

The industry's 300-millisecond target is a pragmatic extension of this 200-millisecond baseline. AssemblyAI, who coined the phrase "the 300ms rule", recognise that the human baseline leaves no safety margin. Network jitter, codec delays, and inference variability will regularly miss a 200-millisecond target. The extra hundred milliseconds is buffer, not ambition.

What makes the threshold serious is how sharply behaviour breaks once it is crossed. Trained observers can detect latency differences as small as 15 milliseconds. Between 300 and 500 milliseconds, users notice the pause but tolerate it. Between 500 and 800 milliseconds, they start talking over the system, rephrasing the question or repeating themselves, which resets the entire pipeline and makes the delay worse. At 800 milliseconds and beyond, users stop treating the interaction as a conversation. It becomes a broken phone line.
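
The bands above can be written down as a small lookup, which is handy when tagging latency measurements in logs. This is only a sketch: the function name and band labels are invented for illustration, and treating each boundary value as belonging to the slower band is an assumption, since the text does not say which side 300, 500, or 800 milliseconds falls on.

```python
def perceived_band(latency_ms: float) -> str:
    """Map a voice-to-voice latency to the behaviour band described above.

    Assumption: a boundary value (300, 500, 800) counts as the slower band.
    """
    if latency_ms < 300:
        return "conversational"
    if latency_ms < 500:
        return "noticed but tolerated"
    if latency_ms < 800:
        return "talk-over and repetition"
    return "conversation breaks down"


if __name__ == "__main__":
    for sample_ms in (200, 420, 650, 900):
        print(sample_ms, "ms:", perceived_band(sample_ms))
```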

Where the milliseconds hide

Voice-to-voice latency is never a single delay. It is the sum of a pipeline. In the classical cascade architecture, audio passes through three specialists: speech-to-text, a language model, and text-to-speech. Each stage contributes its own penalty.

  • Speech-to-text: 100 to 300 milliseconds. Batch STT waits for silence before it begins. Streaming STT processes audio as it arrives but trades accuracy for speed.
  • Language model inference: 40 to 60 percent of total latency. A 3B model may respond in 50 to 200 milliseconds. A 13B model requires 200 to 800 milliseconds or more. Doubling model size raises latency by 40 to 80 percent.
  • Text-to-speech: often the final bottleneck. Batch TTS waits for the model to finish before synthesis starts. Streaming TTS begins speaking while the model is still writing.

There is a direct tension between capability and speed: the most capable language models are also the slowest.
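
To make the "sum of a pipeline" concrete, the following sketch adds up best-case and worst-case stage latencies for a fully serial cascade. The speech-to-text and 3B-model ranges are the ones quoted above. The text-to-speech range is a placeholder assumption, because no figure is given for that stage, and the names `Stage` and `cascade_range` are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    low_ms: float   # best-case latency
    high_ms: float  # worst-case latency


def cascade_range(stages: Iterable[Stage]) -> Tuple[float, float]:
    """Best- and worst-case latency when every stage runs strictly in series."""
    stages = list(stages)
    return (sum(s.low_ms for s in stages), sum(s.high_ms for s in stages))


STAGES = [
    Stage("speech-to-text", 100, 300),      # range quoted in the article
    Stage("language model (3B)", 50, 200),  # range quoted in the article
    Stage("text-to-speech", 50, 150),       # ASSUMPTION: placeholder, not from the article
]

BUDGET_MS = 300

if __name__ == "__main__":
    low, high = cascade_range(STAGES)
    print(f"serial cascade: {low:.0f} to {high:.0f} ms against a {BUDGET_MS} ms budget")
    print("best case fits:", low <= BUDGET_MS, "| worst case fits:", high <= BUDGET_MS)
```

Under these assumed numbers only the best case lands inside the budget, which is the point of the paragraph above: a serial cascade leaves almost no slack even with a small model.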

This tension is the central design problem for any agent system that has to speak. You cannot buy your way out of it with bigger GPUs alone. The architecture has to do fewer things in series and more things at once.

What the 300-millisecond regime demands of agents

For autonomous agents that act on a mandate rather than simply replying to a prompt, the arithmetic gets harder. An agent often has to call tools, look up data, and verify before responding. Each tool call is another round trip. If the pipeline already burns 250 milliseconds on STT, LLM, and TTS, you have 50 milliseconds left before the human notices.
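
The arithmetic can be sketched in a few lines. The 300 and 250 millisecond figures come from the paragraph above. The two tool latencies are hypothetical values chosen only to show how serial and parallel calls compare against the remaining margin, and the function names are illustrative.

```python
from typing import Sequence


def remaining_budget_ms(pipeline_ms: float, budget_ms: float = 300.0) -> float:
    """Time left for tool calls once STT, LLM, and TTS have taken their share."""
    return budget_ms - pipeline_ms


def tool_cost_ms(tool_latencies_ms: Sequence[float], parallel: bool) -> float:
    """Serial calls add up; parallel calls cost only as much as the slowest one."""
    if not tool_latencies_ms:
        return 0.0
    return max(tool_latencies_ms) if parallel else sum(tool_latencies_ms)


if __name__ == "__main__":
    margin = remaining_budget_ms(250)  # 50 ms, as in the text
    tools = [30.0, 40.0]               # ASSUMPTION: hypothetical tool round trips
    for parallel in (False, True):
        cost = tool_cost_ms(tools, parallel)
        mode = "parallel" if parallel else "serial"
        print(f"{mode}: {cost:.0f} ms, fits in {margin:.0f} ms margin: {cost <= margin}")
```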

That means the 300-millisecond regime is not about making existing components faster. It is about redesigning the flow. Three principles follow:

  1. Predictive generation: begin producing a likely answer before the user is finished, and discard it if the hypothesis breaks.
  2. Tools in parallel with speech: start tool calls while TTS is reading a preamble. The human hears "let me check" while the system actually checks.
  3. Smaller, mandate-specific models: a 7B model trained on a constrained domain beats a 70B generalist on latency without losing relevant accuracy.
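
The first principle can be sketched with asyncio: start generating against a predicted utterance, then keep or discard the draft once the final transcript arrives. Everything here is a stand-in. `generate_reply` simulates model inference with a sleep, `final_transcript` simulates the speech recogniser, and the exact-match comparison is a deliberately crude assumption about what "the hypothesis breaks" means.

```python
import asyncio
from typing import Awaitable, Tuple


async def generate_reply(transcript: str, latency_s: float = 0.05) -> str:
    """Stand-in for language model inference (simulated with a sleep)."""
    await asyncio.sleep(latency_s)
    return f"reply to: {transcript}"


async def final_transcript(text: str, delay_s: float = 0.02) -> str:
    """Stand-in for the recogniser delivering the finished utterance."""
    await asyncio.sleep(delay_s)
    return text


def same_utterance(a: str, b: str) -> bool:
    # ASSUMPTION: a simple normalised string match decides whether the guess held.
    return a.strip().lower() == b.strip().lower()


async def respond(predicted: str, final_source: Awaitable[str]) -> Tuple[str, bool]:
    """Return (reply, reused_draft)."""
    # Start generating immediately on the predicted utterance.
    draft = asyncio.create_task(generate_reply(predicted))
    final = await final_source
    if same_utterance(predicted, final):
        return await draft, True   # the head start is kept
    draft.cancel()                 # hypothesis broke: discard the draft
    return await generate_reply(final), False


if __name__ == "__main__":
    print(asyncio.run(respond("what is my balance", final_transcript("what is my balance"))))
    print(asyncio.run(respond("what is my", final_transcript("what is my balance"))))
```

When the prediction holds, inference overlaps with the tail of the user's speech. When it does not, the cost is the wasted compute on the discarded draft.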

None of these are new ideas in isolation. What is new is that the margin no longer allows you to pick only one.
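
The second principle, running tools in parallel with speech, is the easiest to illustrate. In this sketch both the spoken preamble and the tool call are simulated with sleeps, and the durations are arbitrary assumptions. The only point is that `asyncio.gather` lets the two overlap, so the elapsed time tracks the slower of the two rather than their sum.

```python
import asyncio
import time
from typing import Tuple


async def speak_preamble(text: str, duration_s: float) -> str:
    """Stand-in for streaming TTS playing a short filler phrase."""
    await asyncio.sleep(duration_s)
    return text


async def call_tool(query: str, latency_s: float) -> str:
    """Stand-in for a tool round trip (database lookup, API call, ...)."""
    await asyncio.sleep(latency_s)
    return f"result for {query!r}"


async def answer_with_preamble(
    query: str, speech_s: float = 0.10, tool_s: float = 0.15
) -> Tuple[str, str, float]:
    """Speak the preamble and run the tool concurrently; report elapsed time."""
    start = time.perf_counter()
    spoken, result = await asyncio.gather(
        speak_preamble("Let me check.", speech_s),
        call_tool(query, tool_s),
    )
    return spoken, result, time.perf_counter() - start


if __name__ == "__main__":
    spoken, result, elapsed = asyncio.run(answer_with_preamble("order status"))
    print(f"{spoken!r} then {result!r} after {elapsed * 1000:.0f} ms")
```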

The implication

When the 300-millisecond window becomes the industry norm, the definition of a voice agent changes. It stops being a chatbot with a microphone. It becomes a real-time system with the same time budget as a human colleague. That rules out large parts of the current stack. Cloud round-trips to a central inference endpoint will not hold for European customers with strict data requirements and physical distance from hyperscalers. Vertically integrated agents, where model, mandate, and tools are built against the same latency budget, will be the only ones that actually work in production. The rest will sound like satellite links, and users will hang up.