Some Nuances of Media Overlays

Although support for text and audio synchronization in EPUB reading systems still isn’t complete or perfect, we’ve been learning more about the optimal ways to generate the content.

I keep having conversations about overlays and then keep forgetting to write down what the issues and resolutions are, so I decided to make that the topic for the day.

This post will probably be a bit of a hodgepodge of ideas, so you’ll just have to bear with me.

Reflowable and Fixed Layout

Content dynamically paginated by the reading system… author-defined fixed layout pages… who cares when we’re just talking about synchronizing prerecorded audio with the text, right?

Wrong. Your choice of content flow is actually more significant than you might first think, at least at this stage.

For one, if you choose to do a plain old reflowing publication, your media overlays aren’t going to work in iBooks. It only handles media overlays when they’re associated with fixed layout documents, so your users won’t have access to the audio. Maybe that’s not a concern for you, but it is an impediment to getting accessible books in the mainstream.

That said, there is a sneaky advantage in not dealing with reflowable documents. Not that I expect Apple had designs on sneakiness, but it makes an easy segue into the next problem: dynamic page turns.

It’s easy and predictable for a reading system to keep the current page synchronized with the text/audio playback when every document represents a page. That’s fixed layouts in a nutshell. When all the synchronization points have been played, it means the current page is finished and it’s time to move on to the next.

By not creating/supporting reflowable text, you never have to think about the problem of what happens when the page boundaries are fluid. If your content is synchronized to the paragraph, for example, what happens when the first half of the paragraph is at the bottom of one page and the latter half is at the top of the next, particularly when that next page is not visible?

Naturally enough, anyone following the text with the audio will be stuck listening as the narration continues on to the unseen content, as the page turn won’t occur until the next paragraph is reached.

The problem doesn’t go away if you synchronize to the sentence, either; it’s just mitigated somewhat by the shorter length of a sentence. Only word-level synchronization avoids this problem entirely (which is why you don’t find it with text-to-speech playback).

You can’t blame the reading systems for this potential oddity of media overlays, either. While it might look like a reading system bug when playing, pouring resources into heuristic tests to try and guess the right point to flip the page is unrealistic (e.g., average guess of seconds per word times the number of words showing and flip the page when the resulting time has elapsed).
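To make that heuristic concrete, here is a minimal sketch of the arithmetic. The function name and the 0.4 seconds-per-word pace are assumptions of mine for illustration, not anything a reading system is known to use.

```python
def estimated_page_turn_delay(visible_word_count: int,
                              seconds_per_word: float = 0.4) -> float:
    """Guess how long the narration needs to cover the words still visible.

    Both the function and the default pace are hypothetical; a real narrator's
    pace varies from sentence to sentence.
    """
    if visible_word_count < 0:
        raise ValueError("visible_word_count must not be negative")
    if seconds_per_word <= 0:
        raise ValueError("seconds_per_word must be positive")
    return visible_word_count * seconds_per_word


# A paragraph straddles the page break and 25 of its words are still visible.
guess = estimated_page_turn_delay(25, seconds_per_word=0.4)

# If the narrator actually averages 0.5 seconds per word here, the flip
# comes early by the difference between the two estimates.
actual = estimated_page_turn_delay(25, seconds_per_word=0.5)
print(f"guessed flip after {guess:.1f} s, narration needs {actual:.1f} s")
```

Even in this toy case the guess is off by a couple of seconds, which is the sort of drift that makes the approach hard to justify.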

It really shouldn’t be incumbent on reading systems to make up for a content design decision, but it will affect your decision making if you want to produce full text and audio synchronized books. The more granular you can get, the better, but as I outlined in the book, it can come at a cost.

To finish the earlier thought, there’s simply no win if you want full text and full audio synchronization and playback for a reflowable book in iBooks, at least at this time. If you’re only looking at structured audio (see the next section), you might want to consider putting the headings into fixed layout documents. But a hack is always a hack…

Structured Audio

The first thing to note is that EPUB 3 does not have an equivalent of DAISY NCX-only audio files. Audio has to be synchronized to text in EPUB, and media overlays are attached to content documents, so there must be at least one content document in the spine (no referencing SMIL files in the spine, like DAISY 3 did).

What that means is that you have to include the publication headings, and synchronize to them to create the most minimal of EPUB audio books.

The first thought you might have at this point is to use the EPUB navigation document as your sole spine entry, and synchronize the audio to the entries.

A conceptual problem with this approach, however, is that the navigation document is a list of links with “table of contents” as its heading. Since the user will have to enter the first document before activating overlays playback, discovering the table of contents and realizing that playback is synchronized to its entries is not exactly intuitive.

If you drop in and out of playback to move around, it’s also going to be confusing to find yourself in a list instead of being able to move by heading shortcuts, but I’m not sure if that’s a real concern or not. The key point is more that navigating a list and navigating by headings aren’t the same thing.

It’s consequently recommended that you include proper HTML headings in the content documents and synchronize to them.
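As a rough illustration of what a heading-only overlay might look like, here is a small sketch that builds one with Python’s standard library. The file names, heading ids and clip times are all made up, and I’m assuming each heading’s clip spans the narration for the section it introduces.

```python
import xml.etree.ElementTree as ET

SMIL_NS = "http://www.w3.org/ns/SMIL"


def build_heading_overlay(content_doc, audio_file, headings):
    """Return a media overlay (SMIL) string with one <par> per heading.

    headings is a list of (heading_id, clip_begin_seconds, clip_end_seconds).
    The helper itself is hypothetical; only the element and attribute names
    come from the media overlays format.
    """
    ET.register_namespace("", SMIL_NS)  # serialize SMIL as the default namespace
    smil = ET.Element(f"{{{SMIL_NS}}}smil", {"version": "3.0"})
    body = ET.SubElement(smil, f"{{{SMIL_NS}}}body")
    for index, (heading_id, begin, end) in enumerate(headings, start=1):
        par = ET.SubElement(body, f"{{{SMIL_NS}}}par", {"id": f"par{index}"})
        # Point at the heading element inside the content document.
        ET.SubElement(par, f"{{{SMIL_NS}}}text",
                      {"src": f"{content_doc}#{heading_id}"})
        # Point at the slice of audio that belongs to this heading.
        ET.SubElement(par, f"{{{SMIL_NS}}}audio",
                      {"src": audio_file,
                       "clipBegin": f"{begin}s",
                       "clipEnd": f"{end}s"})
    return ET.tostring(smil, encoding="unicode")


# Hypothetical chapter with a title and one subsection heading.
overlay_xml = build_heading_overlay(
    "chapter1.xhtml",
    "audio/chapter1.mp3",
    [("ch1-title", 0.0, 312.5), ("ch1-sec1", 312.5, 640.0)],
)
print(overlay_xml)
```

The content document only needs the headings with matching ids for this to have something to synchronize against.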

File Size Optimization

If you’re used to DAISY production, including the entire content of a publication in a single file might not seem out of the ordinary, but you really don’t want to perpetuate this habit into EPUB 3.

Reading systems don’t generally handle loading an entire book in one go, particularly ones running on low-power devices. You can overwhelm the reading system memory, which is why EPUB content is broken up by major section and assembled as a single work using the spine.

The other problem with having all the content in one file is that you’ll have to generate one massive media overlay file, since only one media overlay can be attached to any given content document. That’s another performance hit in itself, especially when added on top of the memory necessary for rendering the content.

Follow the content chunking recommendations for EPUB when creating media overlays. Break your publication up by chapter, section or whatever major structural feature makes sense. By extension, your media overlays will be broken up into similarly smaller chunks.
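Since each content document gets at most one overlay, chunking the overlays falls out of chunking the content. A minimal sketch of that grouping step, assuming your sync points are available as simple tuples (the tuple layout, function names and the `.smil` naming scheme are my own invention):

```python
from collections import defaultdict


def group_sync_points_by_document(sync_points):
    """Group sync points by the content document they point into.

    sync_points is an iterable of (text_src, audio_src, begin, end), where
    text_src looks like 'chapter1.xhtml#p3'. Each group would become one
    media overlay file.
    """
    overlays = defaultdict(list)
    for text_src, audio_src, begin, end in sync_points:
        document = text_src.split("#", 1)[0]  # drop the fragment identifier
        overlays[document].append((text_src, audio_src, begin, end))
    return dict(overlays)


def overlay_name_for(document):
    """Hypothetical naming scheme: chapter1.xhtml -> chapter1.smil."""
    stem = document.rsplit(".", 1)[0]
    return f"{stem}.smil"


# Made-up sync points spread over two chapters.
points = [
    ("chapter1.xhtml#h1", "audio/ch1.mp3", 0.0, 3.2),
    ("chapter1.xhtml#p1", "audio/ch1.mp3", 3.2, 18.9),
    ("chapter2.xhtml#h1", "audio/ch2.mp3", 0.0, 2.7),
]
for document, group in group_sync_points_by_document(points).items():
    print(overlay_name_for(document), len(group), "sync points")
```

Each resulting overlay would then be paired with its content document in the package manifest.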

That’s all I wanted to put down for now, so I’ll keep this post shorter than some of the recent ones and cut off here. If you have any feedback to offer on your own attempts at accessible EPUBs using overlays, particularly coming from DAISY production, feel free to drop them in the comments.

2 Replies to “Some Nuances of Media Overlays”

  1. Disclosure: I am one of the founders of ReadBeyond, an Italian company developing SW for producing/distributing/enjoying ebooks, currently specializing in reflowable EPUB3 Audio-eBooks with MO.

    Interesting post, touching almost all the difficulties of creating reflowable Audio-eBooks.

    I have a couple of observations.

    – reflow + pagination + MO: as you state, the “fragment split between two pages” problem cannot be avoided in general. In an ideal world, a MO-markup-aware rendering engine might distribute the text fragments in the “pages” to avoid the issue, if a sufficiently fine granularity is chosen by the book producer. (Personally, I do not think “pagination” is an essential ingredient of ebooks, so I think this is a minor problem, especially considering the current state of the art.)

    – iBooks not supporting reflow + MO: shame on Apple. An experimental JS-based workaround is here: but it might break at any moment, given the fact that Apple might alter the pagination mechanism of iBooks, or ban JS altogether.

    – Actually, no major EPUB3 RS offers good support for (reflowable) Audio-eBooks with synchronized audio/text. For example, Readium lacks the click-text-to-play function; the Kobo app is buggy; etc. The reason? As far as I can tell, there is little-to-no interest in the RS industry in supporting MO for reflowable content.

    – Why is this the case? Let’s see how many reflow + MO EPUB3 ebooks are out there. AFAIK:

    1) 1 IDPF “Moby Dick” sample (incomplete, EN):
    2) 1 AZARDI “Christmas Carol” sample (complete, EN):
    3) 4 DAISY Japan samples (complete, JA):
    4) 10 ReadBeyond samples (complete EN, IT, RU):
    5) ~40 ReadBeyond/ilNarratore audiolibri (complete, IT):

    Note that only the ebooks of bullet 5) are actual commercially-sold ebooks.

    So, it looks like this is a chicken-and-egg problem: the RS implementers do not see a sufficiently large demand to justify the development effort, hence they do not implement the reflow+MO code. But without proper RS support, the publishers are deterred from investing in creating such reflow+MO ebooks.

    – Since we believe in the effectiveness of reading+listening (e.g. for kids or for learning a foreign language), we developed Menestrello, a free Android/iOS app explicitly designed for reading+listening reflowable EPUB3 Audio-eBooks with MO. It does not paginate, although a “block scroll” is (sort of) a replacement for it.

    More info:

    Clearly this is a “proof of concept” attempt: without adoption into the major RS’s, I do not see a bright future for reflow+MO ebooks.

    – Finally, a technical note on the granularity of the text fragments: considering normal speech recordings, e.g. a common audiobook (as opposed to slowed down recordings for children common in FXL), highlighting each single word produces an annoying “flash effect”. The eye of the reader struggles to “follow the highlighting”. We found out that a sentence-level granularity works better in this case.

    1. Thanks for the detailed reply, Alberto!

      I don’t disagree with any of your points, of course, but I’d just qualify that the questions coming my way are coming from DAISY producers looking to migrate up to EPUB 3. While that won’t hit on the mainstream radar in the same way that the commercial publishers would, I do see a growing demand for reflowable MO documents as we move forward. Of course, in many cases these won’t be full text/full audio synchronization, but simple structured audio with playback linked to headings.
