SpeechSynthesis is a Dou$#@bag
Here at talkr, we've been developing iOS apps exclusively for a while, but now Android and web users can finally get silly making things talk with
Smooth Talkr. This post covers some of the hurdles we faced developing with the
HTML5 speechSynthesis interface.
First off, allow me to explain the title. It's only been 8 years since my favorite tech article, Application Cache is a Dou$#@bag, was written, but it's the 2020s now and I can't get away with such a raunchy title anymore. Jake Archibald's experience with the Application Cache has so closely mirrored my own experience developing with the speechSynthesis JavaScript API that I had to pay homage to his post.
At first I was delighted by the simplicity of the HTML5 speechSynthesis interface. A few dozen lines in a
codepen can demonstrate all of the voices and capabilities of the system. But relying on this technology in a production system has been a challenge: the hard part is getting reasonably consistent results across a diverse range of browsers and platforms. Some of the problems are easy to fix once you know about them, and some are not.
Wide Range of Quantity and Quality of Voices
An obvious, but important challenge is that every browser and platform has access to different voices of varying quality. Some browsers like Chrome and
Microsoft Edge provide access to cloud-powered voices that augment those provided by the OS. Other browsers rely on the operating system to supply voices. iOS and OSX give everyone a wide selection, Android provides a few, and Windows doesn't provide much out of the box.
iOS Voices
The following voices are available on iOS and OSX. Note that on iOS, this is a subset of the voices available to the operating system and speechSynthesis.getVoices() will report having many more (see below for an explanation). OSX has its version of these, along with many others. On OSX, any voices downloaded by the user are also available to the speechSynthesis API.
| Name | Gender | Language | Locale |
| --- | --- | --- | --- |
| Susan | Female | English | en-US |
| Daniel | Male | English | en-GB |
| Moira | Female | English | en-IE |
| Tessa | Female | English | en-ZA |
| Karen | Female | English | en-AU |
| Rishi | Male | English | en-IN |
| Sin-Ji | Male | Chinese | zh-HK |
| Mei-Jia | Female | Chinese | zh-TW |
| Tian-Tian | Female | Chinese | zh-CN |
| Mónica | Female | Spanish | es-ES |
| Paulina | Female | Spanish | es-MX |
| Thomas | Male | French | fr-FR |
| Amélie | Female | French | fr-CA |
| Milena | Female | Russian | ru-RU |
| Zuzana | Female | Czech | cs-CZ |
| Maged | Male | Arabic | ar-SA |
| Sara | Female | Danish | da-DK |
| Anna | Female | German | de-DE |
| Satu | Female | Finnish | fi-FI |
| Carmit | Female | Hebrew | he-IL |
| Lekha | Female | Hindi | hi-IN |
| Mariska | Female | Hungarian | hu-HU |
| Damayanti | Female | Indonesian | id-ID |
| Alice | Female | Italian | it-IT |
| Kyoko | Female | Japanese | ja-JP |
| Yuna | Female | Korean | ko-KR |
| Ellen | Female | Dutch | nl-BE |
| Xander | Male | Dutch | nl-NL |
| Nora | Female | Norwegian | nb-NO* |
| Zosia | Female | Polish | pl-PL |
| Luciana | Female | Portuguese | pt-BR |
| Joana | Female | Portuguese | pt-PT |
| Ioana | Female | Romanian | ro-RO |
| Laura | Female | Slovak | sk-SK |
| Alva | Female | Swedish | sv-SE |
| Kanya | Female | Thai | th-TH |
| Yelda | Female | Turkish | tr-TR |
* Note: in iOS 13, the Norwegian voice Nora reports its locale as "no-NO". macOS 10.15 lists it correctly as "nb-NO"
Chrome (Desktop) Voices
| Name | Gender | Language | Locale |
| --- | --- | --- | --- |
| Google US English | Male | English | en-US |
| Google UK English Female | Female | English | en-GB |
| Google UK English Male | Male | English | en-GB |
| Google Deutsch | Female | German | de-DE |
| Google español | Male | Spanish | es-ES |
| Google español de Estados Unidos | Female | Spanish | es-US |
| Google français | Female | French | fr-FR |
| Google italiano | Female | Italian | it-IT |
| Google русский | Female | Russian | ru-RU |
| Google Nederlands | Female | Dutch | nl-NL |
| Google polski | Female | Polish | pl-PL |
| Google português do Brasil | Female | Portuguese | pt-BR |
| Google 日本語 | Female | Japanese | ja-JP |
| Google 한국의 | Female | Korean | ko-KR |
| Google 普通话(中国大陆) | Female | Chinese | zh-CN |
| Google 粤語(香港) | Female | Cantonese | zh-HK |
| Google 國語(臺灣) | Female | Chinese | zh-TW |
| Google Bahasa Indonesia | Female | Indonesian | id-ID |
| Google हिन्दी | Female | Hindi | hi-IN |
Autoplay Disabled
For years, iOS disabled the ability for websites to autoplay videos, requiring
user interaction before they would play. This policy was extended to speechSynthesis.speak calls. Chrome followed suit and now prints a deprecation warning:
speechSynthesis.speak() without user activation is no longer allowed since M71, around December 2018. See https://www.chromestatus.com/feature/5687444770914304 for more details
The end result is that you can only play text to speech after the user has clicked on something to initiate the action. Clicking on a link works if the link is relative (i.e. from another page on your site).
SpeechRecognition events count as a user-initiated event, which is nice.
Smooth Talkr allows you to create and play "scenes" with text-to-speech. Sometimes these are triggered by clicking on buttons or links on the site, but they can also be triggered by going to the scene's URL. In the latter case, the scenes will not play.
Solution: detect when TTS is stalled, and display a "play" button.
speechSynthesis.speak(new SpeechSynthesisUtterance("hello there"))
setTimeout(() => {
  if (!speechSynthesis.speaking) {
    // We are probably stalled, so display a play button here.
  }
  // Note: Most browsers will work with a 0 millisecond timeout, but Safari
  // on iOS will trigger a false positive, so I use 100 milliseconds.
  // This delay can cause its own problems (what if the utterance is extremely
  // short or empty?). To solve these, I create a random ID for each speak
  // call and clear it when the utterance completes. Before assuming we are
  // stalled, I then check that the ID is unchanged since the speak call
  // was issued.
}, 100)
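The random-ID check described in those comments might look like the sketch below. The names speakOrShowPlayButton and pendingId are mine, not part of the Web Speech API, and the speechSynthesis object is passed in as synth so the logic is easy to exercise outside a browser:

```javascript
var pendingId = null

// Speak `utterance` via `synth` (the speechSynthesis object) and call
// `showPlayButton` if synthesis appears stalled after 100 ms.
function speakOrShowPlayButton (synth, utterance, showPlayButton) {
  var id = Math.random().toString(36).slice(2)
  pendingId = id
  utterance.onend = function () {
    // The utterance finished, so it can no longer be considered stalled.
    if (pendingId === id) pendingId = null
  }
  synth.speak(utterance)
  setTimeout(function () {
    // Only report a stall if this exact speak call is still pending and
    // nothing is playing. 100 ms avoids the iOS Safari false positive.
    if (pendingId === id && !synth.speaking) showPlayButton()
  }, 100)
}
```

In the browser you would pass window.speechSynthesis and a real SpeechSynthesisUtterance; the onend hook is what prevents a very short or empty utterance from being misreported as stalled.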
Event Handlers are Garbage Collected
SpeechSynthesisUtterance event handlers like onend, onpause, onerror, etc. are extremely useful. Unfortunately they may not work because
the browser can garbage collect the Utterance before it is finished playing. That's right, you tell the browser that you are interested in what happens after an utterance finishes playing, and it decides to free up memory associated with the utterance before playing is complete.
Thankfully, the fix isn't too bad: just keep a reference to your SpeechSynthesisUtterance. In the example below, I'm using the this keyword to store the utterance, assuming that in your implementation the containing object will be around until the utterance completes.
this.utterance = new SpeechSynthesisUtterance("hello there")
this.utterance.onend = () => {
console.log('finished!')
}
Cloud Voices Unavailable at Startup
speechSynthesis.getVoices() will not return cloud voices until it has had time to retrieve them. This is obvious when you think about it, but it's worth noting here. Use the speechSynthesis voiceschanged callback to get notified when new voices are available.
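A minimal sketch of that pattern is below. The helper name loadVoices and the timeoutMs fallback are my own additions (a safety net in case voiceschanged never fires); synth is the speechSynthesis object, passed in explicitly:

```javascript
// Resolve with the voice list, waiting for the voiceschanged event if the
// list is initially empty, and falling back to whatever getVoices() returns
// after `timeoutMs` milliseconds.
function loadVoices (synth, timeoutMs) {
  return new Promise(function (resolve) {
    var voices = synth.getVoices()
    if (voices.length > 0) {
      resolve(voices)
      return
    }
    var timer = setTimeout(function () {
      resolve(synth.getVoices())
    }, timeoutMs)
    synth.addEventListener('voiceschanged', function onChanged () {
      clearTimeout(timer)
      synth.removeEventListener('voiceschanged', onChanged)
      resolve(synth.getVoices())
    })
  })
}
```

In the browser this would be called as loadVoices(window.speechSynthesis, 1000).then(voices => ...), and the Promise resolves as soon as the cloud voices arrive.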
iOS Lies About Available Voices
It's hard to be too upset at iOS for its speech technology. Ever since Steve Jobs'
speech demo, TTS has played an important role in Apple's products. When you download a new version of iOS, a large percentage of what you are downloading is text to speech voices. There are 55 voices installed by default resulting in hundreds of MB of data.
With that said, the iOS implementation of the HTML 5 speechSynthesis standard leaves a lot to be desired. First and foremost, it downright lies about the voices you can use. Calling speechSynthesis.getVoices() will return all 55 voices (more if you have downloaded additional ones), but you can only select 36 of them. You get one voice per locale. This issue probably goes back to iOS 8 when
iOS used the locale to name the voice, but I can't be sure. The end result is that iOS may have more than one voice in a particular language (Alex and Samantha are both 'en-US' for example), but trying to use Alex will give you Samantha.
The solution is a bit dirty. Detect iOS, then make sure you only return voices that are in the predefined list.
var hasWindow = typeof window === 'object' && window !== null && window.self === window && window.navigator !== null
var bIsiOS = hasWindow && /iPad|iPhone|iPod/.test(window.navigator.userAgent) && !window.MSStream
var iOSVoiceNames = [
'Maged',
'Zuzana',
'Sara',
'Anna',
'Melina',
'Karen',
'Samantha',
'Daniel',
'Rishi',
'Moira',
'Tessa',
'Mónica',
'Paulina',
'Satu',
'Amélie',
'Thomas',
'Carmit',
'Lekha',
'Mariska',
'Damayanti',
'Alice',
'Kyoko',
'Yuna',
'Ellen',
'Xander',
'Nora',
'Zosia',
'Luciana',
'Joana',
'Ioana',
'Milena',
'Laura',
'Alva',
'Kanya',
'Yelda',
'Tian-Tian',
'Sin-Ji',
'Mei-Jia'
]
var voices = speechSynthesis.getVoices()
if (bIsiOS) {
  var iOSVoices = []
  for (var i = 0; i < voices.length; ++i) {
    if (iOSVoiceNames.includes(voices[i].name)) {
      iOSVoices.push(voices[i])
    }
  }
  voices = iOSVoices
}
iOS Safari is Silent on Soft Mute
Even if your volume is all the way up, you won't hear TTS in iOS Safari if the "soft mute" (ring/silent) switch is set to silent. Chrome on iOS plays the sound even when soft mute is on.
Android uses Device Default Voice
This is probably the most serious issue on any platform. There isn't a good workaround.
Just like iOS, Android isn't exactly honest about the voices that you can use at any one time. speechSynthesis.getVoices() will return several options for English (United States, Australia, Nigeria, India, and United Kingdom) but only one is available at a time. You can pick which one by going to the Settings app, then Controls->Language and input->Text-to-speech options. Select the gear icon next to Google Text-to-speech Engine, then under Language you can update the exact locale you want to use. If you select "Install voice data" you can even select from a sample of different voices for some locales. You need to restart the device after changing this setting for it to take effect.
The voice used on an Android device when you play a SpeechSynthesisUtterance will depend on what you have selected in the Android settings. You can choose which language you want to play from javascript (see below for details) but you have no control over the locale or exact voice used.
This problem occurs on Chrome and Firefox, so it is likely a problem with the Android platform's implementation of the speechSynthesis API. It's unlikely that a browser update will fix this, but different versions of Android might. (My test device is on Android 5.0.2, so if this is fixed in a future update, please let me know).
There isn't a great workaround for this problem, as the developer has no visibility into the default language and voice selected in Android's settings. If you do everything right, you can force an Android device to speak in any supported language, but the exact voice and locale that is used will depend on the Android settings.
Chrome on Android Requires SpeechSynthesisUtterance.lang
Most browsers do not require setting the lang property of the speech synthesis utterance. An utterance
also has a voice property, and the voice defines the language. But the
specification does say that SpeechSynthesisUtterance has a lang property that sets the language and defaults
to the html tag's lang value if unset. Having two settings to control the language doesn't make much sense to me,
and it leads to unspecified behavior like we see in Android's Chrome browser.
Firefox on Android handles this better, but to get different languages to work with Chrome on Android, you will have to set the lang property of the utterance:
var utterance = new SpeechSynthesisUtterance("hello there")
// set voice here.
// Always set the utterance language to the utterance voice's language
// to prevent unspecified behavior.
utterance.lang = utterance.voice.lang
Inconsistent Voice Locale Syntax on Android
The speechSynthesis specification is pretty clear that a
SpeechSynthesisVoice object's lang property should be a BCP 47 language code (e.g. "en-US", "ru-RU"). Android returns voices with lang properties like "en_US" and "ru_RU" however.
Solution: replace '_' characters in the language property with '-' before using them.
function getVoicesWithLangSubstring (langSubstr) {
return speechSynthesis.getVoices().filter(function (v) {
return v.lang.replace('_', '-').substring(0, langSubstr.length) === langSubstr
})
}
Note: the function above is useful for returning all voices in a particular language. You can get all English voices by calling getVoicesWithLangSubstring("en"). If you only want American English voices, call getVoicesWithLangSubstring("en-US").
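Tying the last few fixes together, here is a hedged end-to-end sketch. speakInLanguage is my name for this helper, not a standard API; it assumes a browser environment where the voice list has already loaded, normalizes the underscore locales, and sets utterance.lang from the chosen voice:

```javascript
// Pick the first voice matching a language prefix (e.g. "en" or "en-US"),
// normalize its locale, and speak `text` with it. Returns false when no
// matching voice exists.
function speakInLanguage (text, langSubstr) {
  var voices = speechSynthesis.getVoices().filter(function (v) {
    return v.lang.replace('_', '-').indexOf(langSubstr) === 0
  })
  if (voices.length === 0) return false
  var utterance = new SpeechSynthesisUtterance(text)
  utterance.voice = voices[0]
  // Set lang from the voice to avoid unspecified behavior on Android Chrome.
  utterance.lang = voices[0].lang.replace('_', '-')
  speechSynthesis.speak(utterance)
  return true
}
```

For example, speakInLanguage("hello there", "en-US") speaks with the first American English voice if one is available.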