Choice of Voices

A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content

CHI 2020: ACM Conference on Human Factors in Computing Systems

Cambre, J., Colnago, J., Maddock, J., Tsai, J., & Kaye, J.

In an evaluation of 18 TTS voices and 3 human voices with 1,090 participants, this study found that the best TTS voices approach human quality, that voice preference varies from listener to listener, and that listeners comprehend content about as well as readers do.

What the Researchers Tested

The researchers recruited 1,090 participants and asked them to evaluate 18 text-to-speech voices alongside 3 professional human narrators. Participants listened to long-form content — not short clips — and rated each voice on quality, naturalness, and listening experience.

This wasn't a lab demo with cherry-picked sentences. It was a large-scale, ecologically valid study designed to answer a practical question: can synthesized voices actually work for real, sustained listening?

Key Findings

Natural TTS is real — but voice matters

Top-rated TTS voices scored within range of human narrators. However, the gap between the best and worst synthetic voices was enormous — choosing the right voice is critical.

No single perfect voice for everyone

Participant preferences varied significantly. A voice rated highly by one group was rated poorly by another. Personal choice, not a universal default, drives satisfaction.

Speed sweet spot: 163–177 WPM

Listeners preferred speech rates in the 163–177 words-per-minute range — faster than typical audiobook narration but slower than rapid-fire speech. Controllable playback speed matters.
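To make the arithmetic behind speed control concrete, here is a minimal sketch of how a player might nudge a voice into the preferred band. It is an illustration only: the function names, and the idea of a voice having a known "native" rate, are assumptions and are not taken from the paper or from any VOZCLA API.

```python
# Preferred band quoted in the summary above, in words per minute.
PREFERRED_WPM = (163, 177)


def playback_multiplier(native_wpm, band=PREFERRED_WPM):
    """Return the speed multiplier that brings native_wpm into the band.

    Hypothetical helper: a voice already inside the band is left at 1.0;
    otherwise it is scaled to the nearest edge of the band.
    """
    if native_wpm <= 0:
        raise ValueError("native_wpm must be positive")
    low, high = band
    if native_wpm < low:
        return low / native_wpm
    if native_wpm > high:
        return high / native_wpm
    return 1.0


def listening_minutes(word_count, wpm):
    """Estimate how long a text takes to hear at a given speech rate."""
    if wpm <= 0:
        raise ValueError("wpm must be positive")
    return word_count / wpm


if __name__ == "__main__":
    # Assumed example values: a voice that speaks at 150 WPM by default
    # and a 3,400-word article.
    print(round(playback_multiplier(150), 2))
    print(listening_minutes(3400, 170))
```

Under these assumptions, a voice that natively speaks at 150 WPM needs roughly a 1.09x speed-up to reach the lower edge of the band, and a 3,400-word article takes 20 minutes at 170 WPM.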

Listening preserves comprehension vs. reading

Participants who listened to content performed comparably on comprehension measures to those who read the same text. Audio is not a lossy channel — it is a viable productivity mode.

Why This Matters for VOZCLA

1. Natural voices reduce friction. When TTS quality is high enough, people stop noticing the technology and start absorbing the content. VOZCLA uses state-of-the-art voices so listening feels effortless.
2. Choice matters. Because no single voice satisfies everyone, VOZCLA offers multiple voice options so users can pick the one that fits them best.
3. Control matters. The 163–177 WPM sweet spot confirms what power users already know: speed control is essential. VOZCLA gives users full playback speed control to match their preference.
4. Audio is a real productivity mode. Comprehension parity with reading means listening isn't a compromise; it's an alternative channel that frees your eyes and hands for other tasks.

What This Means for You

You don't have to read everything. High-quality TTS lets you listen to articles, documents, and research while commuting, exercising, or doing other tasks, without losing comprehension. And with VOZCLA, you get the voice, speed, and experience that work for you, not a one-size-fits-all default.

Note: This paper was written using VOZCLA.
