◆ RESEARCH
Speech-to-Text for Thought Capture and Flow
A user-centric research paper from VOZCLA
Version 1.0
January 28, 2026
VOZCLA Research Team
Disclosure and scope: This paper is a public, user‑centric synthesis of research and design principles for speech‑to‑text (STT). It is written to explain why high‑quality dictation can improve productivity and reduce friction in knowledge work, without disclosing proprietary implementation details such as model choices, architectures, training data, or algorithms.
Abstract
Typing is a powerful tool, but it is not a natural speed match for human thought. In text‑heavy work—drafting, note‑taking, journaling, documentation, and knowledge capture—typing can introduce friction through motor effort, context switching, and the cognitive overhead of “editing while thinking.” Speech‑to‑text (STT) offers an alternative: capturing language at the speed people naturally speak, allowing ideas to move from mind to screen with less interruption. This paper argues that good STT reduces writing friction and preserves flow, while emphasizing a critical reality: productivity gains depend not only on transcription speed, but also on error correction burden, editing ergonomics, and task complexity. We synthesize evidence from comparative studies of speech vs. typing in text entry, randomized evidence from documentation settings, and research on cognitive load and usability tradeoffs. We translate these findings into practical, non‑proprietary design principles for long‑form dictation: low‑friction correction, predictable behavior, user control, privacy awareness, and seamless switching between voice and keyboard. We conclude with limitations and implications for voice‑first productivity systems that aim to support thinking—not merely increase output.
1. Introduction
Most people have experienced the same moment:
You know what you want to say.
You can feel the sentence fully formed.
But your fingers can’t keep up.
Typing is often treated as the default interface for modern work, yet knowledge work is not purely mechanical. Writing is simultaneously thinking, composing, structuring, revising, remembering, and expressing.
When your input method forces you to slow down or constantly correct yourself, it can interrupt more than speed—it interrupts flow.
Speech‑to‑text changes the fundamental economics of input. Most people speak far faster than they type, and speech can help externalize ideas quickly, especially in early drafts, brainstorming, note‑taking, or "getting it out" before editing.
However, STT success is not guaranteed. Many users have tried dictation and bounced off because of:
- transcription errors
- awkward correction workflows
- social discomfort speaking aloud
- inconsistent punctuation
- latency or interruptions
- the mental load of “speaking perfectly”
So the real question is not whether speech recognition exists. It’s whether speech recognition can support real writing workflows—including correction and editing—without adding more cognitive burden than it removes.
This paper focuses on that reality: speed, error correction, cognitive load, and design principles that make STT feel like a productivity tool rather than a novelty.
2. Speaking as an alternative mode of writing
Writing has always been more than text entry. It is a way of organizing thought. The input method you use shapes your experience of that thinking.
Speech offers several advantages as an “idea capture” mode:
2.1 Speech externalizes thoughts quickly
When someone dictates a first draft, they often get a continuous stream of language onto the page. That continuity matters. It can preserve the narrative thread that gets lost when writing is interrupted by micro‑edits and retyping.
2.2 Speech reduces motor friction
Typing requires fine motor control and constant visual attention to the cursor. Speech can reduce that motor overhead—particularly useful when hands are busy, when repetitive strain is a concern, or when the user simply wants a faster “first pass.”
2.3 Speech supports “draft first, edit later”
Many productivity strategies encourage separating drafting from editing. Speech naturally encourages this: you speak a paragraph, then review and refine. In practice, that can reduce the mental cost of trying to make every sentence perfect while it’s still being created.
At the same time, speech introduces unique constraints: you can’t backspace mid‑thought without stopping your sentence. That’s why correction and editing design is not a minor detail—it’s where dictation succeeds or fails.
3. What “good” speech‑to‑text means for real work
For users, “good STT” is not just “accurate.” It is usable in the full loop of writing.
3.1 Accuracy that supports trust
If output is consistently wrong, users don’t merely lose time; they lose confidence. They stop dictating full sentences and begin speaking unnaturally, pausing excessively, or avoiding “hard words.” That changes the entire experience.
3.2 Low latency and steady feedback
Dictation is conversational. People expect speech to appear quickly and consistently. If feedback lags, users hesitate, lose rhythm, and start “performing for the machine” instead of speaking naturally.
3.3 Error correction that doesn’t destroy momentum
Correction is where the time goes. A key insight from research on voice‑based text entry is that error correction can dominate interaction time, shrinking the assumed productivity gains of dictation. Discussions of voice text entry research note that prior work suggests a large portion of interaction time may be spent correcting errors, and that apparent productivity gains can erode once correction is factored in.
This does not mean speech is “bad.” It means speech must be paired with correction workflows that feel natural and fast.
3.4 Seamless switching between modalities
The best STT experiences usually treat voice as a layer—not a locked mode. Users should be able to speak, then immediately edit with keyboard or touch, then speak again without friction.
When switching is smooth, speech becomes a powerful “capture tool,” while the keyboard remains a precision tool.
4. Evidence: speed and accuracy of speech vs. typing
One of the clearest research signals for STT productivity comes from comparative studies of speech input versus keyboard text entry.
4.1 Speech can be substantially faster in controlled text entry studies
Ruan et al. (2017), comparing a modern speech recognition system to a smartphone keyboard, found speech text entry speeds were about 2.9× faster than typing for both English and Mandarin Chinese, under laboratory conditions designed to estimate “upper‑bound” performance.
Reported figures from the paper include:
- English: 153 WPM (speech) vs 52 WPM (keyboard)
- Mandarin: 123 WPM (speech) vs 43 WPM (keyboard)
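The "about 2.9×" figure can be checked directly from the numbers quoted above. The snippet below is purely arithmetic on those reported WPM values, not data from the study itself:

```python
# Speedup implied by the entry rates reported in Ruan et al. (2017).
# The WPM figures are the ones quoted above; the ratios are computed here.
rates = {
    "English":  {"speech": 153, "keyboard": 52},
    "Mandarin": {"speech": 123, "keyboard": 43},
}

for language, wpm in rates.items():
    speedup = wpm["speech"] / wpm["keyboard"]
    print(f"{language}: {speedup:.1f}x faster")  # ~2.9x in both languages
```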
The same work reports nuanced error findings:
- During entry, corrected error rates were favorable to speech (lower than keyboard) in both languages.
- After entry was completed, speech left slightly more uncorrected errors than keyboard.
That nuance is important for honest product messaging:
- Speech can be fast and competitive on corrected accuracy during entry.
- But if correction ergonomics are weak, more errors may remain in the final text.
This aligns with what many users experience: dictation feels fast until editing becomes painful.
4.2 Dictation can increase documentation speed in applied settings
Vogel et al. (2015), in a randomized trial of clinical documentation with web‑based speech recognition, reported:
- documentation speed increased from 173 to 217 characters per minute (an overall increase of 26%) with speech recognition assistance (P=.04).
- average document length increased substantially with automatic speech recognition (ASR) assistance (356 to 649 characters per report).
- participants reported improved mood ratings when using ASR assistance.
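The reported speed gain can be cross‑checked with simple arithmetic. The figures are from the trial as quoted above; the calculation below is ours, and the small gap from the published 26% presumably reflects rounding in the reported means:

```python
# Percentage increase implied by the documentation speeds in Vogel et al. (2015).
before, after = 173, 217  # characters per minute, as quoted above

increase_pct = (after - before) / before * 100
print(f"{increase_pct:.1f}% faster")  # ~25% on these rounded figures
```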
Even though this is a specific domain, it illustrates a broader point: when dictation fits the workflow, it can increase throughput and reduce subjective burden.
4.3 Correction and task complexity can change the outcome
Not all tasks benefit equally. Chen et al. (2020), in a controlled experiment evaluating consumer digital health documentation tasks, found that compared to keyboard and mouse, speech recognition:
- significantly increased cognitive load for both simple and complex tasks
- took longer for complex tasks
- was rated overall less usable than keyboard and mouse
- showed no difference in error rates in that setup
This is not an argument against STT. It is a reminder that task characteristics matter. If a task requires heavy recall, problem solving, and precise structured entry, speech can increase mental effort—especially if the system makes editing and revising difficult.
The takeaway for voice‑first productivity systems is straightforward:
Dictation wins when it reduces friction. Dictation loses when it adds correction burden and cognitive overhead.
5. Use cases where speech‑to‑text improves productivity
The most reliable STT wins tend to be “language‑forward” tasks where capture speed and continuity matter more than perfect formatting.
5.1 First drafts and ideation
Speaking a draft helps users get ideas out quickly. Editing can come later. This can be especially useful when users are blocked by perfectionism or slowed down by micro‑editing.
5.2 Notes, journaling, and thought capture
When the primary goal is preserving thoughts accurately and quickly, speech can reduce friction. This is often where users experience “flow”: they talk, and the text appears.
5.3 Emails and messages with predictable structure
Many users speak better than they type when they know what they’re trying to say. Dictation can be excellent for email replies, meeting follow‑ups, or customer support drafts—especially when users can quickly correct names or specific terms.
5.4 Documentation workflows (domain‑dependent)
As evidence from documentation research suggests, dictation can increase throughput when the workflow supports it and when correction is manageable.
5.5 Accessibility and fatigue reduction
Some users rely on dictation because typing is painful, difficult, or slow. For others, it becomes valuable when tired: late‑night writing, long shifts, or repetitive admin work.
6. Design principles for productive long‑form dictation
The principles below focus on user experience outcomes—not proprietary methods.
6.1 Optimize for “speak naturally,” not “speak like a robot”
Users should not have to perform for the system. The more users feel they must speak unnaturally—over‑enunciating, pausing awkwardly, avoiding certain phrases—the more cognitive overhead dictation creates.
6.2 Treat correction as a first‑class feature
The productivity story collapses if error correction is slow or frustrating. Research on voice text entry highlights that correction time can dominate and degrade assumed gains if it isn’t designed carefully.
Practical correction principles (user‑visible, not proprietary):
- make it easy to re‑speak a phrase
- make it easy to select and fix a word or name
- support fast review of recent output
- support hybrid correction (voice + keyboard)
6.3 Support seamless switching between voice and keyboard
Speech should not trap users in a “voice mode.” Users should be able to dictate, then immediately edit with keyboard/mouse, then dictate again—without breaking rhythm.
This hybrid approach aligns with the reality of text production: speech is great for generating language; keyboard is great for precision.
6.4 Provide control over punctuation and formatting without forcing complexity
Users need enough control to produce clean text, but not so many rules that dictation becomes a memorization game. The best systems handle common punctuation and formatting gracefully while allowing quick cleanup.
6.5 Prioritize predictable behavior and steady feedback
Speech input is rhythm‑based. If the system’s feedback is inconsistent—delayed outputs, sudden changes, unstable capitalization—users lose trust and start second‑guessing themselves.
6.6 Respect privacy and social context
Speech is not always socially comfortable. Users may hesitate in open offices, cafes, shared spaces, or sensitive contexts. Sengupta et al. (2020) note that privacy and self‑consciousness can affect usability.
Voice‑first systems should clearly communicate when audio is being captured, how it is handled and stored (or not stored), and how users can pause, mute, and control capture.
6.7 Measure success as “time‑to‑finished text,” not only transcription speed
The most honest metric is the time it takes to produce finished output the user is satisfied with. Speech can be fast at raw entry, but editing and correction can change the true throughput.
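One way to make this metric concrete is a back‑of‑the‑envelope model of effective throughput, where total time includes both raw entry and per‑error correction. All parameter values below are hypothetical illustrations, not measurements from any study:

```python
def effective_wpm(raw_wpm: float, words: int, error_rate: float,
                  seconds_per_fix: float) -> float:
    """Words per minute of *finished* text: raw entry time plus correction time."""
    entry_minutes = words / raw_wpm
    correction_minutes = words * error_rate * seconds_per_fix / 60
    return words / (entry_minutes + correction_minutes)

# Hypothetical numbers: speech is ~3x faster at raw entry but leaves more
# residual errors, each costing a few seconds to fix.
speech = effective_wpm(raw_wpm=150, words=300, error_rate=0.02, seconds_per_fix=6)
typing = effective_wpm(raw_wpm=50, words=300, error_rate=0.005, seconds_per_fix=6)
print(f"speech: {speech:.0f} WPM, typing: {typing:.0f} WPM")
```

Under these illustrative numbers, the raw 3× entry advantage shrinks to roughly 2.4× once correction time is counted, which is why correction ergonomics, not transcription speed alone, determine real throughput.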
7. Limitations and boundaries
A credible STT narrative needs boundaries. Dictation is not a universal win.
7.1 Noise, accents, and specialized vocabulary
Accuracy can degrade in noisy environments or with rare names and domain terms. Even when accuracy is strong, uncertainty increases correction time.
7.2 Social acceptability
Speaking aloud is not always appropriate. Some users will not dictate in shared environments, or they will dictate less effectively due to self‑consciousness.
7.3 Cognitive load for complex, recall‑heavy tasks
Speech can increase cognitive load in tasks that require heavy problem solving and recall, especially when the interface makes revision difficult.
7.4 Editing overhead can shrink speed gains
Even if speech is fast, the total workflow speed depends on correction time. Research commentary on voice text entry highlights this explicitly.
7.5 Some tasks benefit from typing’s quiet precision
Structured forms, highly technical code, and tasks requiring careful scanning may remain better suited to keyboard entry.
8. Implications for voice‑first productivity systems
Speech‑to‑text is not merely “typing replacement.” In the broader evolution of human‑computer interaction, it suggests a different approach:
8.1 Treat language as primary input
Many modern tools still treat language as output (reports, emails) rather than input. Voice‑first systems flip that: language becomes a direct control surface for thinking and creation.
8.2 Flow is a product feature
If dictation preserves flow—reducing micro‑interruptions and keeping the user’s cognitive thread intact—it becomes more than a feature. It becomes a way to protect attention, which is one of the scarcest resources in modern work.
8.3 Multimodal is the stable end state
The most resilient productivity systems will not be voice‑only. They will be multimodal: speech for fast capture, keyboard for precision edits, mouse/touch for navigation, and optional AI assistance for formatting, summarization, and structure.
The role of STT is to accelerate the parts of work where human language is already the core medium.
9. Conclusion
Speech‑to‑text can be transformational for the right workflows because it helps users capture ideas closer to the speed of thought and reduces motor friction in writing. Evidence from controlled text entry studies shows speech can be substantially faster than typing under favorable conditions, with competitive error performance during entry. Applied evidence in documentation settings suggests speech recognition can increase documentation throughput and reduce subjective burden when integrated well.
But the story is not “dictation always wins.” Research also highlights that task complexity, correction overhead, and cognitive load can shift the balance, making speech less usable for some tasks—especially when revision is difficult.
The practical takeaway is simple:
Great STT isn’t just transcription. It’s a full writing loop: capture → review → correct → refine, without losing flow.
Voice‑first productivity systems should therefore focus on the user experience outcomes that matter most: trust, low‑friction correction, predictable behavior, privacy control, and seamless switching between modalities.
References
- Ruan, S., Wobbrock, J. O., Liou, K., Ng, A., & Landay, J. (2017). Comparing Speech and Keyboard Text Entry for Short Messages in Two Languages on Touchscreen Phones. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 1(4). doi:10.1145/3161187
- Vogel, M., Kaisers, W., Wassmuth, R., & Mayatepek, E. (2015). Analysis of Documentation Speed Using Web‑Based Medical Speech Recognition Technology: Randomized Controlled Trial. Journal of Medical Internet Research, 17(11), e247. PMCID: PMC4642384
- Chen, J., Cambon, A. C., & Zhang, M. (2020). Effect of Speech Recognition on Problem Solving and Recall in Consumer Digital Health Tasks: Controlled Laboratory Experiment. JMIR Human Factors. PMCID: PMC7296411
- Sengupta, K., Ahuja, S., & MacKenzie, I. S. (2020). Leveraging Error Correction in Voice‑based Text Entry by Talk‑and‑Gaze. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI 2020).