Text-to-Speech: Is it as good as it sounds?

Global Trends | 5 April 2021

In a word — Yes & No!

Unlike Speech-to-Text (STT), Text-to-Speech (TTS) has a clear advantage: it generates a voice from an unambiguous source. Assuming the text is correct, the “voice” will reproduce it more or less accurately. The same cannot be said for Speech-to-Text, where errors routinely occur for a variety of reasons, such as the speaker’s accent, talking speed, or slurring of words.

However, the devil’s in the details. Text-to-Speech faces challenges even when the source text is perfect (as text, that is), such as:

  • foreign words
  • names (which are often mispronounced)
  • abbreviations / acronyms
  • dates, times & measures
  • homonyms, especially homographs (words spelled alike but pronounced differently, such as “lead” or “read”)

There are workarounds for all of the above, but as with Machine Translation, Speech-to-Text, and Optical Character Recognition (OCR), the time & manpower required often negate the very utility one hopes to realize by using these technologies in the first place. Whether such an investment is justified depends largely on the intended end use.
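To give a sense of what such a workaround looks like in practice, here is a minimal sketch of a text clean-up step that runs before the text is handed to a TTS engine. It only covers abbreviations and acronyms. The lookup tables and the function name `normalize_for_tts` are hypothetical and chosen for illustration; they are not taken from any particular TTS product.

```python
import re

# Hypothetical lookup tables; a real project would maintain far larger ones.
ABBREVIATIONS = {"Dr.": "Doctor", "approx.": "approximately", "km": "kilometers"}
ACRONYMS = {"OCR", "TTS"}  # acronyms to be read letter by letter


def normalize_for_tts(text: str) -> str:
    """Rewrite text so a TTS engine is more likely to read it as intended."""
    # Expand known abbreviations, longest first, only when they stand alone
    # (not glued to other letters or digits).
    for abbr in sorted(ABBREVIATIONS, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        expansion = ABBREVIATIONS[abbr]
        text = re.sub(pattern, lambda m, e=expansion: e, text)

    # Insert spaces between the letters of listed acronyms so they are
    # spelled out; unlisted all-caps words are left untouched.
    def spell_out(match: re.Match) -> str:
        word = match.group(0)
        return " ".join(word) if word in ACRONYMS else word

    return re.sub(r"\b[A-Z]{2,}\b", spell_out, text)


print(normalize_for_tts("Dr. Lee scanned approx. 5 km of film with OCR."))
```

Even this small example hints at the maintenance cost: every new abbreviation or acronym needs its own entry, and names, foreign words, dates, and homographs need further rules or pronunciation markup on top.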

For example, fine-tuning Text-to-Speech output for a 300-page novel might well be worth the investment, but it would certainly be a waste of time for a short document or web page. And even with the added costs, the result will still sound less than human, regardless of the latest improvements. Besides, audiobooks generated by Text-to-Speech might simply not be a very pleasant user experience.

That said, Text-to-Speech is still a great tool regardless of its drawbacks. It may not be the perfect substitute for humans, at least not yet, but it fills a gap that would otherwise leave certain audiences with no means whatsoever to access the written word. And if the intonation is a little off, or certain words are mispronounced, users are often still happy enough with the results. Advances in human-sounding voices enhance the experience further, and it will only get better over time. Even better, the apps for TTS are often free, with customizable voices to boot!


(Images courtesy of Pexels)


Today, the most common users of Text-to-Speech include…

  • readers / listeners on the go (audiobooks)
  • the visually impaired
  • non-native speakers who can understand but cannot read a foreign language
  • the speech impaired, to deliver their message
  • low-budget eLearning courses

Here at EQHO, we’ve definitely profited from Speech-to-Text technology, as we are often tasked with translating videos for which no script is provided. In the past, the audio had to be transcribed manually; today we use the latest software to transcribe it, followed by a human review, as there are always errors. Still, it’s a great technology that saves us a lot of time & cost.

However, the same cannot be said for Text-to-Speech. By the time we correct for intonation, apply rules for abbreviations, acronyms, homonyms, etc., we can provide a professional human voice just as easily & quickly at competitive prices. As the technology improves over time, humans may well become obsolete… But we’re not there yet — not by a long shot!

Learn more about EQHO AI-Powered Text-to-Speech here.


