SpeechSynthesis is a Dou$#@bag
Here at talkr, we've been developing iOS apps exclusively for a while, but now Android and web users can finally get silly making things talk with
Smooth Talkr. This post covers some of the hurdles we faced developing with the
HTML5 speechSynthesis interface.
First off, allow me to explain the title. It's only been 8 years since my favorite tech article, Application Cache is a Dou$#@bag, was written, but it's the 2020s now and I can't get away with such a raunchy title anymore. Jake Archibald's experience with the Application Cache has so closely mirrored my own experience developing with the speechSynthesis JavaScript API that I had to pay homage to his post.
At first I was delighted by the simplicity of the HTML5 speechSynthesis interface. A few dozen lines in a
codepen can demonstrate all of the voices and capabilities of the system. But relying on this technology in a production system has been a challenge: the hard part is getting reasonably consistent results across a diverse range of browsers and platforms. Some of the problems are easy to fix once you know about them, and some are not.
Wide Range of Quantity and Quality of Voices
An obvious, but important challenge is that every browser and platform has access to different voices of varying quality. Some browsers like Chrome and
Microsoft Edge provide access to cloud-powered voices that augment those provided by the OS. Other browsers rely on the operating system to supply voices. iOS and OSX give everyone a wide selection, Android provides a few, and Windows doesn't provide much out of the box.
iOS Voices
The following voices are available on iOS and OSX. Note that on iOS, this is a subset of the voices available to the operating system and speechSynthesis.getVoices() will report having many more (see below for an explanation). OSX has its version of these, along with many others. On OSX, any voices downloaded by the user are also available to the speechSynthesis API.
| Name | Gender | Language | Locale |
| --- | --- | --- | --- |
| Susan | Female | English | en-US |
| Daniel | Male | English | en-GB |
| Moira | Female | English | en-IE |
| Tessa | Female | English | en-ZA |
| Karen | Female | English | en-AU |
| Rishi | Male | English | en-IN |
| Sin-Ji | Male | Chinese | zh-HK |
| Mei-Jia | Female | Chinese | zh-TW |
| Tian-Tian | Female | Chinese | zh-CN |
| Mónica | Female | Spanish | es-ES |
| Paulina | Female | Spanish | es-MX |
| Thomas | Male | French | fr-FR |
| Amélie | Female | French | fr-CA |
| Milena | Female | Russian | ru-RU |
| Zuzana | Female | Czech | cs-CZ |
| Maged | Male | Arabic | ar-SA |
| Sara | Female | Danish | da-DK |
| Anna | Female | German | de-DE |
| Satu | Female | Finnish | fi-FI |
| Carmit | Female | Hebrew | he-IL |
| Lekha | Female | Hindi | hi-IN |
| Mariska | Female | Hungarian | hu-HU |
| Damayanti | Female | Indonesian | id-ID |
| Alice | Female | Italian | it-IT |
| Kyoko | Female | Japanese | ja-JP |
| Yuna | Female | Korean | ko-KR |
| Ellen | Female | Dutch | nl-BE |
| Xander | Male | Dutch | nl-NL |
| Nora | Female | Norwegian | nb-NO* |
| Zosia | Female | Polish | pl-PL |
| Luciana | Female | Portuguese | pt-BR |
| Joana | Female | Portuguese | pt-PT |
| Ioana | Female | Romanian | ro-RO |
| Laura | Female | Slovak | sk-SK |
| Alva | Female | Swedish | sv-SE |
| Kanya | Female | Thai | th-TH |
| Yelda | Female | Turkish | tr-TR |
* Note: in iOS 13, the Norwegian voice Nora reports its locale as "no-NO". macOS 10.15 lists it correctly as "nb-NO"
Chrome (Desktop) Voices
| Name | Gender | Language | Locale |
| --- | --- | --- | --- |
| Google US English | Male | English | en-US |
| Google UK English Female | Female | English | en-GB |
| Google UK English Male | Male | English | en-GB |
| Google Deutsch | Female | German | de-DE |
| Google español | Male | Spanish | es-ES |
| Google español de Estados Unidos | Female | Spanish | es-US |
| Google français | Female | French | fr-FR |
| Google italiano | Female | Italian | it-IT |
| Google русский | Female | Russian | ru-RU |
| Google Nederlands | Female | Dutch | nl-NL |
| Google polski | Female | Polish | pl-PL |
| Google português do Brasil | Female | Portuguese | pt-BR |
| Google 日本語 | Female | Japanese | ja-JP |
| Google 한국의 | Female | Korean | ko-KR |
| Google 普通话(中国大陆) | Female | Chinese | zh-CN |
| Google 粤語(香港) | Female | Cantonese | zh-HK |
| Google 國語(臺灣) | Female | Chinese | zh-TW |
| Google Bahasa Indonesia | Female | Indonesian | id-ID |
| Google हिन्दी | Female | Hindi | hi-IN |
Autoplay Disabled
For years, iOS disabled the ability for websites to autoplay videos, requiring
user interaction before they would play. This policy was extended to speechSynthesis.speak calls. Chrome followed suit and now prints a deprecation warning:
speechSynthesis.speak() without user activation is no longer allowed since M71, around December 2018. See https://www.chromestatus.com/feature/5687444770914304 for more details
The end result is that you can only play text to speech after the user has clicked on something to initiate the action. Clicking on a link works if the link is relative (i.e. from another page on your site).
SpeechRecognition events count as a user-initiated event, which is nice.
Smooth Talkr allows you to create and play "scenes" with text-to-speech. Sometimes these are triggered by clicking on buttons or links on the site, but they can also be triggered by going to the scene's URL. In the latter case, the scenes will not play.
Solution: detect when TTS is stalled, and display a "play" button.
speechSynthesis.speak(new SpeechSynthesisUtterance("hello there"))
setTimeout(() => {
  if (!speechSynthesis.speaking) {
    // We are probably stalled, so display a play button here.
  }
  // Note: Most browsers will work with a 0 millisecond timeout, but Safari
  // on iOS will trigger a false positive, so I use 100 milliseconds.
  // This delay can cause its own problems (what if the utterance is extremely
  // short or empty?). To solve these, I create a random ID for each speak
  // call and clear it when the utterance completes. Before assuming we are
  // stalled, I then check that the ID is unchanged since the speak call
  // was issued.
}, 100)
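The random-ID check described in those comments might look like the sketch below. The names speakOrShowPlayButton and pendingId are mine, not part of the Web Speech API, and the speechSynthesis object is passed in as synth so the logic is easy to exercise outside a browser:

```javascript
var pendingId = null

// Speak `utterance` via `synth` (the speechSynthesis object) and call
// `showPlayButton` if synthesis appears stalled after 100 ms.
function speakOrShowPlayButton (synth, utterance, showPlayButton) {
  var id = Math.random().toString(36).slice(2)
  pendingId = id
  utterance.onend = function () {
    // The utterance finished, so it can no longer be considered stalled.
    if (pendingId === id) pendingId = null
  }
  synth.speak(utterance)
  setTimeout(function () {
    // Only report a stall if this exact speak call is still pending and
    // nothing is playing. 100 ms avoids the iOS Safari false positive.
    if (pendingId === id && !synth.speaking) showPlayButton()
  }, 100)
}
```

In the browser you would pass window.speechSynthesis and a real SpeechSynthesisUtterance; the onend hook is what prevents a very short or empty utterance from being misreported as stalled.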
Event Handlers are Garbage Collected
SpeechSynthesisUtterance event handlers like onend, onpause, onerror, etc. are extremely useful. Unfortunately they may not work because
the browser can garbage collect the Utterance before it is finished playing. That's right, you tell the browser that you are interested in what happens after an utterance finishes playing, and it decides to free up memory associated with the utterance before playing is complete.
Thankfully, the fix isn't too bad: just keep a reference to your SpeechSynthesisUtterance. In the example below, I'm using the this keyword to store the utterance, assuming that in your implementation the containing object will be around until the utterance completes.
this.utterance = new SpeechSynthesisUtterance("hello there")
this.utterance.onend = () => {
console.log('finished!')
}
Cloud Voices Unavailable at Startup
speechSynthesis.getVoices() will not return cloud voices until it has had time to retrieve them. This is obvious when you think about it, but it's worth noting here. Use the speechSynthesis voiceschanged callback to get notified when new voices are available.
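A minimal sketch of that pattern is below. The helper name loadVoices and the timeoutMs fallback are my own additions (a safety net in case voiceschanged never fires); synth is the speechSynthesis object, passed in explicitly:

```javascript
// Resolve with the voice list, waiting for the voiceschanged event if the
// list is initially empty, and falling back to whatever getVoices() returns
// after `timeoutMs` milliseconds.
function loadVoices (synth, timeoutMs) {
  return new Promise(function (resolve) {
    var voices = synth.getVoices()
    if (voices.length > 0) {
      resolve(voices)
      return
    }
    var timer = setTimeout(function () {
      resolve(synth.getVoices())
    }, timeoutMs)
    synth.addEventListener('voiceschanged', function onChanged () {
      clearTimeout(timer)
      synth.removeEventListener('voiceschanged', onChanged)
      resolve(synth.getVoices())
    })
  })
}
```

In the browser this would be called as loadVoices(window.speechSynthesis, 1000).then(voices => ...), and the Promise resolves as soon as the cloud voices arrive.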
iOS Lies About Available Voices
It's hard to be too upset at iOS for its speech technology. Ever since Steve Jobs'
speech demo, TTS has played an important role in Apple's products. When you download a new version of iOS, a large percentage of what you are downloading is text to speech voices. There are 55 voices installed by default resulting in hundreds of MB of data.
With that said, the iOS implementation of the HTML 5 speechSynthesis standard leaves a lot to be desired. First and foremost, it downright lies about the voices you can use. Calling speechSynthesis.getVoices() will return all 55 voices (more if you have downloaded additional ones), but you can only select 36 of them. You get one voice per locale. This issue probably goes back to iOS 8 when
iOS used the locale to name the voice, but I can't be sure. The end result is that iOS may have more than one voice in a particular language (Alex and Samantha are both 'en-US' for example), but trying to use Alex will give you Samantha.
The solution is a bit dirty. Detect iOS, then make sure you only return voices that are in the predefined list.
var hasWindow = typeof window === 'object' && window !== null && window.self === window && window.navigator !== null
var bIsiOS = hasWindow && /iPad|iPhone|iPod/.test(window.navigator.userAgent) && !window.MSStream
var iOSVoiceNames = [
'Maged',
'Zuzana',
'Sara',
'Anna',
'Melina',
'Karen',
'Samantha',
'Daniel',
'Rishi',
'Moira',
'Tessa',
'Mónica',
'Paulina',
'Satu',
'Amélie',
'Thomas',
'Carmit',
'Lekha',
'Mariska',
'Damayanti',
'Alice',
'Kyoko',
'Yuna',
'Ellen',
'Xander',
'Nora',
'Zosia',
'Luciana',
'Joana',
'Ioana',
'Milena',
'Laura',
'Alva',
'Kanya',
'Yelda',
'Tian-Tian',
'Sin-Ji',
'Mei-Jia'
]
var voices = speechSynthesis.getVoices()
if (bIsiOS) {
  var iOSVoices = []
  for (var i = 0; i < voices.length; ++i) {
    if (iOSVoiceNames.includes(voices[i].name)) {
      iOSVoices.push(voices[i])
    }
  }
  voices = iOSVoices
}
iOS Safari is Silent on Soft Mute
Even if your volume is all the way up, you won't hear TTS in iOS Safari if the "soft mute" (ring/silent) switch is set to silent. Chrome on iOS plays the sound even when soft mute is on.
Android uses Device Default Voice
This is probably the most serious issue on any platform. There isn't a good workaround.
Just like iOS, Android isn't exactly honest about the voices that you can use at any one time. speechSynthesis.getVoices() will return several options for English (United States, Australia, Nigeria, India, and United Kingdom) but only one is available at a time. You can pick which one by going to the Settings app, then Controls->Language and input->Text-to-speech options. Select the gear icon next to Google Text-to-speech Engine, then under Language you can update the exact locale you want to use. If you select "Install voice data" you can even select from a sample of different voices for some locales. You need to restart the device after changing this setting for it to take effect.
The voice used on an Android device when you play a SpeechSynthesisUtterance will depend on what you have selected in the Android settings. You can choose which language you want to play from javascript (see below for details) but you have no control over the locale or exact voice used.
This problem occurs on Chrome and Firefox, so it is likely a problem with the Android platform's implementation of the speechSynthesis API. It's unlikely that a browser update will fix this, but different versions of Android might. (My test device is on Android 5.0.2, so if this is fixed in a future update, please let me know).
There isn't a great workaround for this problem, as the developer has no visibility into the default language and voice selected in Android's settings. If you do everything right, you can force an Android device to speak in any supported language, but the exact voice and locale that is used will depend on the Android settings.
Chrome on Android Requires SpeechSynthesisUtterance.lang
Most browsers do not require setting the lang property of the speech synthesis utterance. An utterance
also has a voice property, and the voice defines the language. But the
specification does say that SpeechSynthesisUtterance has a lang property that sets the language and defaults
to the html tag's lang value if unset. Having two settings to control the language doesn't make much sense to me,
and it leads to unspecified behavior like we see in Android's Chrome browser.
Firefox on Android handles this better, but to get different languages to work with Chrome on Android, you will have to set the lang property of the utterance:
var utterance = new SpeechSynthesisUtterance("hello there")
// set voice here.
// Always set the utterance language to the utterance voice's language
// to prevent unspecified behavior.
utterance.lang = utterance.voice.lang
Inconsistent Voice Locale Syntax on Android
The speechSynthesis specification is pretty clear that a
SpeechSynthesisVoice object's lang property should be a BCP 47 language code (e.g. "en-US", "ru-RU"). Android returns voices with lang properties like "en_US" and "ru_RU" however.
Solution: replace '_' characters in the language property with '-' before using them.
function getVoicesWithLangSubstring (langSubstr) {
return speechSynthesis.getVoices().filter(function (v) {
return v.lang.replace('_', '-').substring(0, langSubstr.length) === langSubstr
})
}
Note: the function above is useful for returning all voices in a particular language. You can get all English voices by calling getVoicesWithLangSubstring("en"). If you only want American English voices, call getVoicesWithLangSubstring("en-US").
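Tying the last few fixes together, here is a hedged end-to-end sketch. speakInLanguage is my name for this helper, not a standard API; it assumes a browser environment where the voice list has already loaded, normalizes the underscore locales, and sets utterance.lang from the chosen voice:

```javascript
// Pick the first voice matching a language prefix (e.g. "en" or "en-US"),
// normalize its locale, and speak `text` with it. Returns false when no
// matching voice exists.
function speakInLanguage (text, langSubstr) {
  var voices = speechSynthesis.getVoices().filter(function (v) {
    return v.lang.replace('_', '-').indexOf(langSubstr) === 0
  })
  if (voices.length === 0) return false
  var utterance = new SpeechSynthesisUtterance(text)
  utterance.voice = voices[0]
  // Set lang from the voice to avoid unspecified behavior on Android Chrome.
  utterance.lang = voices[0].lang.replace('_', '-')
  speechSynthesis.speak(utterance)
  return true
}
```

For example, speakInLanguage("hello there", "en-US") speaks with the first American English voice if one is available.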