I have another long-standing interest in text-to-speech rendering from my time at CNIB, where the two main outputs we were generating were XML for braille full-text production and synthetically voiced DAISY 2.02 back-matter components.
The reason we were TTS’ing back matter was that spending narration time on indexes and bibliographies is an enormous waste of human resources: it’s a lag on getting books out to readers and would have resulted in a precipitous drop in total output.
Very few people ever read the back matter anyway, at least in general circulating libraries like ours. TTS meant that we didn’t have to deny readers information that would otherwise have been omitted.
But to the point of this post, when I first saw the enhancements in EPUB 3 to improve text-to-speech playback, and a means of distributing high-quality text for rendering on the client side, I had stars in my eyes. Here was a way to bring high-quality voicing without huge audio downloads. But two plus years on, how close are we to realizing the potential?
TTS the pros
I suppose it bears some further explanation why I care about TTS in EPUB, since it wasn’t like we were distributing TTS-ready DAISY files — we were recording the TTS output and including it as audio in the final books.
A key reason I see a TTS future for ebooks is the potential for impressive reductions in audio size while still giving the reader a quality listening experience. Pre-recorded narration — whether human or synthetic — contributes to the existing distribution models for accessible texts that lean heavily on circulating CDs or overnight downloads.
If you could send an enhanced text EPUB that the user could have their reading system voice, you’re moving from hundreds of megabytes of audio, or more, down to around a single megabyte or so.
(While progressive downloading is helpful where the user will listen to a work straight from beginning to end, it’s not so effective for works where the user will be jumping around a lot.)
Narration also takes a long time to do, which makes it a barrier for time-dependent texts, like university course packs. Analyzing the text for problematic keywords, on the other hand, can be optimized by script, reducing the work to adding new words to a master lexicon. That’s a task that can easily be achieved in hours instead of days.
Enhancing the text on the publisher side can also make listening to synthetic voice readings more palatable to those who might otherwise not consider using them. It’s painful to listen to TTS for anything more than short information gathering purposes when the rendering is filled with garbled and incomprehensible words.
TTS the cons
Like all things in life, there are downsides to TTS, too, so it’s only fair to give them some airing.
The quality of synthetic voice rendering on tablets and desktops is generally quite poor, for one. I’ve heard newspapers read where you’d swear the voice was a real human, putting anything you hear in reading systems to shame.
Granted, these high-quality voices cost a lot of money, as they’re designed for large telephony systems. In other words, don’t expect them anytime soon in a reading system near you, and that quality discrepancy is enough to make TTS playback a turn-off for most readers not accustomed to assistive technologies.
The other major downside of TTS is that it is largely expressionless. If you’re listening to playback for basic information extraction purposes (e.g., education, news), monotone readings aren’t going to be much of a consideration. If you want to kick back, relax and be immersed in storytelling, TTS is not the choice for you.
To temper that with a pro, however, not all books are going to be narrated, so given the choice between waiting forever for a book and having a decent TTS option, there’s reason to make an effort to enhance the text for readers.
(Ponder this section a little more and you can understand why TTS playback is unlikely to ever meaningfully impact audiobook sales.)
The “Big Three”
Okay, so the pros and cons aren’t really what I set out to talk about (but when has that stopped me?), so back to where we are in terms of support…
SSML support is the one technology I expected would find adoption the quickest, if only because it’s the oldest, but in hindsight that was probably overly optimistic. For one, the flavour of SSML used in EPUB 3 is unique to EPUB 3: it takes the phoneme element and breaks out its ph and alphabet attributes as namespaced attributes on the markup, which is significantly different from processing an SSML file.
The uniqueness of the approach means that support requires special handling of the attributes: only an EPUB 3 reading system will recognize them and be able to pass them on to a synthetic speech engine. Even an AT that already supports SSML, in other words, would need to add handling for these attributes to the way it generates phonemes for rendering. Just having the attributes in the DOM isn’t enough for a vanilla AT to do anything with them.
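To make that concrete, here’s a minimal sketch of the attribute-based approach in an EPUB 3 content document (the word and IPA transcription are just illustrative):

```xml
<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ssml="http://www.w3.org/2001/10/synthesis">
  <body>
    <!-- What SSML expresses as <phoneme alphabet="ipa" ph="…">…</phoneme>
         becomes a pair of namespaced attributes on existing markup -->
    <p>Pass the <span ssml:alphabet="ipa" ssml:ph="ˈkiːnwɑː">quinoa</span>.</p>
  </body>
</html>
```

A reading system that understands the attributes hands the ph value to the speech engine in place of the word; anything else just ignores them.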
When you consider that problem, it’s not terribly surprising that there remains a distinct lack of support. We’re still seeing EPUB 3 reading systems that don’t handle list styles, after all.
Somewhat worrisome for future support is the trend in EPUB toward namespace agnosticism to allow both serializations of HTML5. SSML-namespaced attributes aren’t forbidden, but they aren’t valid HTML, either.
Whether a model of support builds with the Speech API remains to be seen, but the incubator report came out a couple of years ago and I don’t see SSML any closer to having native support, or at least not this kind of embedded SSML.
I’m not so surprised that PLS lexicons haven’t found support yet. The technology is relatively new, and support only arrived a couple of years back in the speech engine we were using at CNIB — the editor of the PLS specification also worked for the company whose voices we were using.
Even the “pronunciation” rel type that EPUB uses is mired in seemingly perennial proposed status. I looked at the proposal page again today, and it says it’s used in EPUB 3, which is still a draft but should be a recommendation in 2011. Hm. (At least we have “sweetheart” as a rel value, but I digress. ;)
PLS support is fortunately not burdened by the same custom implementation issue as SSML, so if any browser or reading system begins supporting PLS in HTML, nothing special needs to happen for EPUB.
But, again, I’m not aware of any movement toward support of PLS lexicons.
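For reference, a PLS lexicon is just a small XML file mapping spellings to pronunciations, attached to a content document through that “pronunciation” rel type. A minimal sketch (the word and IPA are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" alphabet="ipa" xml:lang="en"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">
  <lexeme>
    <grapheme>acetaminophen</grapheme>
    <phoneme>əˌsiːtəˈmɪnəfɪn</phoneme>
  </lexeme>
</lexicon>
```

The document then references it from its head: `<link rel="pronunciation" type="application/pls+xml" href="lexicon.pls"/>`. Unlike the SSML attributes, there is nothing EPUB-specific in the lexicon file itself.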
Despite CSS3 Speech being the only one of the three technologies that works in HTML as-is, I didn’t expect quick support for it across reading systems, and things really haven’t changed much since EPUB 3 was finalized.
You can find support for the properties in Opera, and supposedly WebKit has support for the speak property, but I couldn’t find any reading systems supporting it. Firevox and its hacked-up support has long since stopped working. So we’re effectively where we were when EPUB 3 launched.
I’m not too upset if CSS3 Speech support lags behind, though, as its most useful properties can be done in SSML and/or PLS (speaking digits and spelling out acronyms). The more sophisticated speech control properties are likely beyond what most reading systems would support (multiple voices, age of voice, etc.).
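Those “most useful” properties amount to a couple of lines of CSS. A sketch (selectors are illustrative; EPUB 3 style sheets use the -epub- prefixed forms):

```css
/* Read a numeric string digit by digit rather than as one number */
.isbn {
  speak-as: digits;
  -epub-speak-as: digits;
}

/* Spell an initialism out letter by letter, e.g. "CNIB" as C-N-I-B */
abbr.initialism {
  speak-as: spell-out;
  -epub-speak-as: spell-out;
}
```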
So, given the state of support, it still doesn’t make a lot of sense to go out of your way to produce enhanced TTS.
That doesn’t mean you shouldn’t be generating it in your EPUB 3 output if you already have the information in your production stream, though. The argument about not having to re-release content when support catches up applies here.
You may even have enhancement potential and not be aware of it. While an operation like CNIB clearly has lexicons that can be parsed and used in other outputs, think about whether you’re providing voicings of content already.
Do you include IPA spellings (either inline or as a pop-out) with a voice clip for new or difficult words? If so, how hard would it be to scan the content files to extract these and build a PLS lexicon file that could apply every time the word is encountered?
Attach your auto-generated — or hand compiled — lexicon to each HTML document and the user doesn’t have to live in a world where they can hear the correct pronunciation one time from an audio clip and the rest of the time listen to a TTS-engine garbled equivalent.
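A minimal sketch of that scan, assuming the clip-annotated words carry a hypothetical data-ipa attribute (the attribute name and markup pattern are assumptions; adapt the matching to whatever convention your content files actually use):

```python
# Sketch: harvest inline IPA annotations from XHTML content files and
# emit a PLS lexicon. Assumes markup like:
#   <span data-ipa="təˈmɑːtəʊ">tomato</span>
import re
from xml.sax.saxutils import escape

def build_pls(xhtml_sources, lang="en"):
    """Return a PLS lexicon document built from a list of XHTML strings."""
    entries = {}
    pattern = re.compile(r'<span[^>]*data-ipa="([^"]+)"[^>]*>([^<]+)</span>')
    for text in xhtml_sources:
        for ipa, word in pattern.findall(text):
            # First transcription encountered for a word wins
            entries.setdefault(word.strip(), ipa)
    lexemes = "\n".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(w)}</grapheme>\n"
        f"    <phoneme>{escape(p)}</phoneme>\n"
        "  </lexeme>"
        for w, p in sorted(entries.items())
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<lexicon version="1.0" alphabet="ipa"\n'
        f'         xml:lang="{lang}"\n'
        '         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon">\n'
        f"{lexemes}\n"
        "</lexicon>"
    )
```

Run it over the spine documents, write the result out as lexicon.pls, and link it with rel="pronunciation" so the pronunciations apply book-wide.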
But to wrap up, I have to admit it was disheartening, when searching on the above three technologies, to find that best practices book excerpts and/or the IDPF accessibility guidelines I wrote are the top results.
If I’ve learned anything since moving into accessible production, though, it’s that nothing ever changes without much perseverance. It took fourteen years from the first DAISY standard to integration with a mainstream publishing format, after all. That the accessibility features aren’t as developed as the mainstream features is no surprise in that light.
I won’t give up the dream of enhanced TTS just yet, in other words.
(Note: In case it’s not clear, text-to-speech synthesis is alive and well in reading systems (VoiceOver, etc.) and doesn’t necessitate SSML, PLS lexicons or CSS3 Speech to work. Without these technologies, however, the quality of rendering is limited, which is why they matter so much.)