Make my pictures talk!

SpeechSynthesis is a Dou$#@bag

Here at talkr, we've been developing iOS apps exclusively for a while, but finally Android and web users can get silly making things talk with Smooth Talkr. This post talks about some of the hurdles we faced developing with the speechSynthesis HTML5 interface.

First off, allow me to explain the title. It's only been 8 years since my favorite tech article, Application Cache is a Dou$#@bag, was written, but it's the 2020s now, and I can't get away with such a raunchy title anymore. Jake Archibald's experience with the Application Cache has so closely mirrored my own experience developing with the speechSynthesis javascript API that I had to pay homage to his post.

At first I was delighted by the simplicity of the HTML 5 speechSynthesis interface: a few dozen lines in a codepen can demonstrate every voice and capability of the system. But relying on this technology in a production system has been a challenge, and getting reasonably consistent results from a diverse range of browsers and platforms is the hard part. Some of the problems are easy to fix once you know about them, and some are not.

Wide Variation in the Quantity and Quality of Voices

An obvious but important challenge is that every browser and platform has access to different voices of varying quality. Some browsers, like Chrome and Microsoft Edge, provide access to cloud-powered voices that augment those provided by the OS. Other browsers rely entirely on the operating system to supply voices. iOS and OSX give everyone a wide selection, Android provides a few, and Windows doesn't provide much out of the box.
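
One way to see which camp your browser falls into is to check the localService flag on each SpeechSynthesisVoice. Here's a quick sketch (just console logging, nothing production-ready) that dumps what your own browser offers; note that the list may be empty until the voices have loaded, which I'll get to below:

  // Log every voice the browser reports, marking cloud voices vs. OS voices.
  // SpeechSynthesisVoice exposes name, lang, localService, and default,
  // but not gender, so gender has to be judged by ear.
  speechSynthesis.getVoices().forEach(function (voice) {
    console.log(
      voice.name,
      voice.lang,
      voice.localService ? 'local (OS)' : 'remote (cloud)',
      voice.default ? '(default)' : ''
    )
  })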

iOS Voices

The following voices are available on iOS and OSX. Note that on iOS, this is a subset of the voices available to the operating system and speechSynthesis.getVoices() will report having many more (see below for an explanation). OSX has its version of these, along with many others. On OSX, any voices downloaded by the user are also available to the speechSynthesis API.
Name Gender Language Locale
Susan Female English en-US
Daniel Male English en-GB
Moira Female English en-IE
Tessa Female English en-ZA
Karen Female English en-AU
Rishi Male English en-IN
Sin-Ji Male Chinese zh-HK
Mei-Jia Female Chinese zh-TW
Tian-Tian Female Chinese zh-CN
Mónica Female Spanish es-ES
Paulina Female Spanish es-MX
Thomas Male French fr-FR
Amélie Female French fr-CA
Milena Female Russian ru-RU
Zuzana Female Czech cs-CZ
Maged Male Arabic ar-SA
Sara Female Danish da-DK
Anna Female German de-DE
Satu Female Finnish fi-FI
Carmit Female Hebrew he-IL
Lekha Female Hindi hi-IN
Mariska Female Hungarian hu-HU
Damayanti Female Indonesian id-ID
Alice Female Italian it-IT
Kyoko Female Japanese ja-JP
Yuna Female Korean ko-KR
Ellen Female Dutch nl-BE
Xander Male Dutch nl-NL
Nora Female Norwegian nb-NO*
Zosia Female Polish pl-PL
Luciana Female Portuguese pt-BR
Joana Female Portuguese pt-PT
Ioana Female Romanian ro-RO
Laura Female Slovak sk-SK
Alva Female Swedish sv-SE
Kanya Female Thai th-TH
Yelda Female Turkish tr-TR
* Note: in iOS 13, the Norwegian voice Nora reports its locale as "no-NO". macOS 10.15 lists it correctly as "nb-NO"

Chrome (Desktop) Voices

Name Gender Language Locale
Google US English Male English en-US
Google UK English Female Female English en-GB
Google UK English Male Male English en-GB
Google Deutsch Female German de-DE
Google español Male Spanish es-ES
Google español de Estados Unidos Female Spanish es-US
Google français Female French fr-FR
Google italiano Female Italian it-IT
Google русский Female Russian ru-RU
Google Nederlands Female Dutch nl-NL
Google polski Female Polish pl-PL
Google português do Brasil Female Portuguese pt-BR
Google 日本語 Female Japanese ja-JP
Google 한국의 Female Korean ko-KR
Google 普通话(中国大陆) Female Chinese zh-CN
Google 粤語(香港) Female Cantonese zh-HK
Google 國語(臺灣) Female Chinese zh-TW
Google Bahasa Indonesia Female Indonesian id-ID
Google हिन्दी Female Hindi hi-IN

Autoplay Disabled

For years, iOS disabled the ability for websites to autoplay videos, requiring user interaction before they would play. This policy was extended to speechSynthesis.speak calls. Chrome followed suit and now prints a deprecation warning:

speechSynthesis.speak() without user activation is no longer allowed since M71, around December 2018. See https://www.chromestatus.com/feature/5687444770914304 for more details

The end result is that you can only play text to speech after the user has clicked on something to initiate the action. Clicking on a link works if the link is relative (from another page on your site). SpeechRecognition events count as user-initiated events, which is nice.

Smooth Talkr allows you to create and play "scenes" with text-to-speech. Sometimes these are triggered by clicking on buttons or links on the site, but they can also be triggered by going to the scene's URL. In the latter case, the scenes will not play.

Solution: Detect when tts is stalled, and display a "play" button.

  speechSynthesis.speak(new SpeechSynthesisUtterance("hello there"))

  // Note: Most browsers will work with a 0 millisecond timeout, but Safari
  // on iOS will trigger a false positive, so I use 100 milliseconds.
  // This delay can cause its own problems (what if the utterance is extremely
  // short or empty?).  To solve these, I create a random ID for each speak
  // call and clear it when the utterance completes.  I then test that the
  // random ID is the same as when the speak call was issued before assuming
  // we are stalled.
  setTimeout(() => {
    if (!speechSynthesis.speaking) {
      // We are probably stalled, so display a play button here.
    }
  }, 100)
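
Here's a rough sketch of that random-ID bookkeeping. The names speakWithStallCheck, onStalled, currentSpeakId, and currentUtterance are mine, purely for illustration; substitute whatever actually shows your play button:

  // Sketch of the random-ID check described in the comment above.
  var currentSpeakId = null
  var currentUtterance = null  // keep a reference so onend isn't lost to GC (see the next section)

  function speakWithStallCheck (text, onStalled) {
    var speakId = Math.random().toString(36).slice(2)
    currentSpeakId = speakId

    currentUtterance = new SpeechSynthesisUtterance(text)
    currentUtterance.onend = function () {
      // This speak call finished normally, so it is no longer pending.
      if (currentSpeakId === speakId) currentSpeakId = null
    }
    speechSynthesis.speak(currentUtterance)

    setTimeout(function () {
      // Only report a stall if this exact speak call is still pending
      // and nothing is actually being spoken.
      if (currentSpeakId === speakId && !speechSynthesis.speaking) {
        onStalled()
      }
    }, 100)
  }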

Event Handlers are Garbage Collected

SpeechSynthesisUtterance event handlers like onend, onpause, onerror, etc. are extremely useful. Unfortunately they may not work because the browser can garbage collect the Utterance before it is finished playing. That's right, you tell the browser that you are interested in what happens after an utterance finishes playing, and it decides to free up memory associated with the utterance before playing is complete.

Thankfully, the fix isn't too bad. Just keep a reference to your SpeechSynthesisUtterance around until it has finished playing. In the example below, I'm using the this keyword to store the utterance, assuming that in your implementation it will stick around until the utterance completes.

  this.utterance = new SpeechSynthesisUtterance("hello there")
  this.utterance.onend = () => {
    console.log('finished!')
  }
  // Because `this` holds a reference, the utterance (and its onend handler)
  // survives until playback finishes.
  speechSynthesis.speak(this.utterance)


Cloud Voices Unavailable at Startup

speechSynthesis.getVoices() will not return cloud voices until it has had time to retrieve them. This is obvious when you think about it, but it's worth noting here. Use the speechSynthesis voiceschanged callback to get notified when new voices are available.
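
A minimal sketch of that pattern: take whatever getVoices() returns up front, then refresh the list whenever the browser reports that more voices have arrived.

  // Cloud voices may be missing from the very first call.
  var voices = speechSynthesis.getVoices()

  speechSynthesis.addEventListener('voiceschanged', function () {
    // Re-query once the browser has finished loading its remote voices.
    voices = speechSynthesis.getVoices()
  })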

iOS Lies About Available Voices

It's hard to be too upset at iOS for its speech technology. Ever since Steve Jobs' speech demo, TTS has played an important role in Apple's products. When you download a new version of iOS, a large percentage of what you are downloading is text-to-speech voices. There are 55 voices installed by default, resulting in hundreds of MB of data.

With that said, the iOS implementation of the HTML 5 speechSynthesis standard leaves a lot to be desired. First and foremost, it downright lies about the voices you can use. Calling speechSynthesis.getVoices() will return all 55 voices (more if you have downloaded additional ones), but you can only select 36 of them. You get one voice per locale. This issue probably goes back to iOS 8 when iOS used the locale to name the voice, but I can't be sure. The end result is that iOS may have more than one voice in a particular language (Alex and Samantha are both 'en-US' for example), but trying to use Alex will give you Samantha.

The solution is a bit dirty. Detect iOS, then make sure you only return voices that are in the predefined list.

  var hasWindow = typeof window === 'object' && window !== null && window.self === window && window.navigator !== null
  var bIsiOS = hasWindow && /iPad|iPhone|iPod/.test(window.navigator.userAgent) && !window.MSStream
  var iOSVoiceNames = [
    'Maged',
    'Zuzana',
    'Sara',
    'Anna',
    'Melina',
    'Karen',
    'Samantha',
    'Daniel',
    'Rishi',
    'Moira',
    'Tessa',
    'Mónica',
    'Paulina',
    'Satu',
    'Amélie',
    'Thomas',
    'Carmit',
    'Lekha',
    'Mariska',
    'Damayanti',
    'Alice',
    'Kyoko',
    'Yuna',
    'Ellen',
    'Xander',
    'Nora',
    'Zosia',
    'Luciana',
    'Joana',
    'Ioana',
    'Milena',
    'Laura',
    'Alva',
    'Kanya',
    'Yelda',
    'Tian-Tian',
    'Sin-Ji',
    'Mei-Jia'
  ]
  var voices = speechSynthesis.getVoices()
  if (bIsiOS) {
    let iOSVoices = []
    for (var i = 0; i < voices.length; ++i) {
      if (iOSVoiceNames.includes(voices[i].name)) {
        iOSVoices.push(voices[i])
      }
    }
    voices = iOSVoices
  }


iOS Safari is Silent on Soft Mute

Even if your volume is all the way up, you won't hear tts on iOS Safari if the "soft mute" switch is on. Chrome on iOS plays the sound even when soft mute is on.

Android Uses the Device Default Voice

This is probably the most serious issue on any platform. There isn't a good workaround.

Just like iOS, Android isn't exactly honest about the voices that you can use at any one time. speechSynthesis.getVoices() will return several options for English (United States, Australia, Nigeria, India, and United Kingdom) but only one is available at a time. You can pick which one by going to the Settings app, then Controls->Language and input->Text-to-speech options. Select the gear icon next to Google Text-to-speech Engine, then under Language you can update the exact locale you want to use. If you select "Install voice data" you can even select from a sample of different voices for some locales. You need to restart the device after changing this setting for it to take effect.

The voice used on an Android device when you play a SpeechSynthesisUtterance will depend on what you have selected in the Android settings. You can choose which language you want to play from javascript (see below for details) but you have no control over the locale or exact voice used.

This problem occurs on Chrome and Firefox, so it is likely a problem with the Android platform's implementation of the speechSynthesis API. It's unlikely that a browser update will fix this, but different versions of Android might. (My test device is on Android 5.0.2, so if this is fixed in a future update, please let me know).

There isn't a great workaround for this problem, as the developer has no visibility into the default language and voice selected in Android's settings. If you do everything right, you can force an Android device to speak in any supported language, but the exact voice and locale that is used will depend on the Android settings.

Chrome on Android Requires SpeechSynthesisUtterance.lang

Most browsers do not require setting the lang property of the speech synthesis utterance. An utterance also has a voice property, and the voice defines the language. But the specification does say that SpeechSynthesisUtterance has a lang property that sets the language and defaults to the html tag's lang value if unset. Having two settings to control the language doesn't make much sense to me, and it leads to unspecified behavior like we see in Android's Chrome browser.

Firefox on Android handles this better, but to get different languages to work with Chrome on Android, you will have to set the lang property of the utterance:

  var utterance = new SpeechSynthesisUtterance("hello there")

  // Set your preferred voice here; for illustration, just grab the first one.
  utterance.voice = speechSynthesis.getVoices()[0]

  // Always set the utterance language to the utterance voice's language
  // to prevent unspecified behavior.
  utterance.lang = utterance.voice.lang

Inconsistent Voice Locale Syntax on Android

The speechSynthesis specification is pretty clear that a SpeechSynthesisVoice object's lang property should be a BCP 47 language code (e.g. "en-US", "ru-RU"). Android returns voices with lang properties like "en_US" and "ru_RU" however. Solution: replace '_' characters in the language property with '-' before using them.

  function getVoicesWithLangSubstring (langSubstr) {
    return speechSynthesis.getVoices().filter(function (v) {
      return v.lang.replace('_', '-').substring(0, langSubstr.length) === langSubstr
    })
  }
Note: the function above is useful for returning all voices in a particular language. You can get all English voices by calling getVoicesWithLangSubstring("en"). If you only want American English voices, call getVoicesWithLangSubstring("en-US").
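
Putting that together with the earlier advice about always setting the utterance's lang, a sketch for speaking with an American English voice (falling back to any English voice) might look like this:

  // Prefer an American English voice, but fall back to any English voice.
  var candidates = getVoicesWithLangSubstring('en-US')
  if (candidates.length === 0) {
    candidates = getVoicesWithLangSubstring('en')
  }

  if (candidates.length > 0) {
    var utterance = new SpeechSynthesisUtterance('hello there')
    utterance.voice = candidates[0]
    // Normalize the lang too, since Android may report "en_US".
    utterance.lang = utterance.voice.lang.replace('_', '-')
    speechSynthesis.speak(utterance)
  }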