Navigating EPUB CFIs – Part 1

Ugh! EPUB CFIs…

It’s not a topic I grew up dreaming of writing about one day, but it seems like a useful topic to cover, since questions about how to read them come up from time to time (see this recent thread on CFIs in the IDPF forums). I squeezed a quick explanation of them into the Best Practices book, but I had to keep it short, so maybe I can do better justice to them here.

I will say that the specification can be much more daunting to read than CFIs actually are, but that’s not an indictment of the specification. It’s just that CFIs are one of the more technically complex features that were included in EPUB 3.

But let’s start at the beginning: what is a “CFI”?

CFIs Explained

The acronym is shorthand for Canonical Fragment Identifier. Knowing that bit of information might seem irrelevant, but hold on one second. You’re probably already familiar with fragment identifiers, even if you don’t realize it. Ever seen a web address like this:

http://www.example.com/epub.html#navigation

Parsing this apart is pretty simple, right? If you were to drop it in your browser, at least in theory, the page epub.html would be opened from the server example.com. And that “#navigation” bit at the end means to scroll some element with the ID “navigation” into sight.

But, wait, what exactly is that “#navigation” thing? Funny you should ask. The hash (#) is a special character that indicates that what follows represents a … fragment identifier!

Each element within an (X)HTML document represents a fragment of the content, and each one that can be reached in this manner must have an ID. Hence, fragment identifiers.

See how easy that was?
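If you'd like to see that split in code, here is a minimal Python sketch using the standard library's urldefrag function. The variable names are just illustrative.

```python
from urllib.parse import urldefrag

url = "http://www.example.com/epub.html#navigation"

# urldefrag separates the document address from whatever follows the hash.
document, fragment = urldefrag(url)

print(document)  # http://www.example.com/epub.html
print(fragment)  # navigation
```

The fragment never identifies the document itself; it only says where to go once the document is open.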

Okay, so knowing that URL fragment identifiers allow you to reach a specific point in a web document, why did the engineers of the EPUB CFI specification need to come up with a new mechanism?

What they do

The answer to why EPUB CFIs were needed becomes obvious when you think about the things people want to do with their ebooks: highlight them, annotate them, bookmark them, etc. IDs are a very limited means of referencing points in a document, because they can only point at elements the author has tagged. That makes them too crude for these kinds of tasks, or even for a reading system to store the last reading position. Content authors can’t anticipate every possible point of interest and put an ID on each one.

EPUB CFIs, as I’ll get to later, allow much greater pinpointing granularity. They’re not just for reaching an element in a document, but any text position within it. And they’re not just for text positions. CFIs can point to specific spatial coordinates within an image, for example. They can also point to a specific time offset in an audio or video clip. Pretty powerful stuff, I’m sure you’d agree. (I bet you’re starting to like those engineers now!)

EPUB CFIs are also a key part of the answer for how to share bookmarks, highlights, annotations and other reader-definable features between reading systems and between readers. Standardizing the referencing mechanism ensures that any reading system will be able to correctly dereference the locations these CFIs point to, avoiding proprietary solutions to the same problem. (But, for true interop, a standard means of structuring the information for sharing is still necessary.)

EPUB CFIs are also a potential answer to how to link into a publication hosted on the web. Imagine being able to create a URL that points to an EPUB and have an EPUB CFI on the end that instructs a reading system how to drill down into the publication to the specific point you want the person to see.

Constructing CFIs

Enough talk, though, it’s time to get down to explaining what makes EPUB CFIs unique, and how you construct them.

Like regular fragment identifiers, EPUB CFIs follow a hash character, but that’s as similar as they get to the ID-based fragment identifiers we saw above. EPUB CFIs are easily identifiable as they are always enclosed in what will look to many like a function call: epubcfi(). The actual fragment identifier goes between the parentheses. (For the programmers reading this, note that you don’t put quote characters inside the parentheses like you would do for parameters in a function call, as they’ll break the CFI.)

Here’s what a typical EPUB CFI looks like:

mybook.epub#epubcfi(cfi-goes-here)

You’ll notice that this CFI includes the name of the EPUB file at the start, as this particular form is used to point into an EPUB: inter-publication linking. You wouldn’t use this particular form to link from one location in your publication to another, however. For intra-publication linking, you instead replace the EPUB file name with the location of the package file in the container.

For example, say our publication had this structure inside the container:

/EPUB
   /XHTML
      chapter01.xhtml
   package.opf

In this case, the CFI would first reference the package file one directory up, regardless of whether the video clip being referenced is in the same file or another one:

<a href="../package.opf#epubcfi(cfi-goes-here)">this point in the video</a>
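Both forms share the same shape: a file reference, a hash, and the epubcfi() wrapper. As a rough sketch (not taken from any reading system), here is one way to pull the two pieces apart in Python. The function name split_cfi_href is hypothetical, and the sketch assumes the fragment is not percent-encoded.

```python
import re

# Matches the epubcfi(...) wrapper and captures what is between the parentheses.
CFI_WRAPPER = re.compile(r"^epubcfi\((.*)\)$")


def split_cfi_href(href):
    """Return (path, cfi_body) for an href that ends in #epubcfi(...)."""
    path, _hash, fragment = href.partition("#")
    match = CFI_WRAPPER.match(fragment)
    if match is None:
        raise ValueError(f"not an EPUB CFI fragment: {fragment!r}")
    return path, match.group(1)


# Pointing into a publication from outside, then from inside the container.
print(split_cfi_href("mybook.epub#epubcfi(cfi-goes-here)"))
print(split_cfi_href("../package.opf#epubcfi(cfi-goes-here)"))
```

Note that the body comes back without quotes around it, because there never were any inside the parentheses.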

Path Resolution

The easy part out of the way, let’s now start looking at the actual EPUB CFI syntax.

You don’t have to be an XML expert in order to understand these identifiers, but it does help to understand a few concepts before jumping right into the deep end.

The first is that there is always space between two elements, even if you don’t see it. Take this markup for example:

<tag1><tag2/><tag2/></tag1>

Even though you don’t see any text or whitespace between these elements, it doesn’t mean inter-element slots don’t exist:

      v       v       v
<tag1> <tag2/> <tag2/> </tag1>

If you already read the post I put into the IDPF forums, try to act surprised as I explain why this matters now. If you think about how content gets tagged, every element begins with a slot for inter-element content, whether it contains text, whitespace or nothing at all.

If the tag has no children, it only has one contiguous section of this kind of content, which I’ll identify here as /1:

              /1
<tag>text and whitespace</tag>

If the tag has a single child element, it now has two instances of inter-element content, one on either side:

       /1    /2    /3
<tag> text <tag2/> text </tag>

Now, for fun, what if we add in another tag:

       /1    /2     /3    /4     /5
<tag> text <tag2/> text <tag2/> text </tag>

Are you starting to notice a pattern to the numbering? Inter-element locations are always odd, and child elements are always even when sequenced this way. It doesn’t matter how many tags you add in, there will always be an inter-element position on either side, so this odd/even numbering scheme keeps going on and on ad infinitum.
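To make the pattern concrete, here is a small Python sketch that numbers the direct children of an element the same way. It leans on how the standard library's ElementTree stores text: the text before the first child lives in the parent's text attribute, and the text after each child lives in that child's tail attribute. The function name number_children is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def number_children(element):
    """List (step, content) pairs for the direct children of element.

    Odd steps are the inter-element slots (which may be empty);
    even steps are the child elements.
    """
    slots = [(1, element.text or "")]
    for index, child in enumerate(element, start=1):
        slots.append((2 * index, f"<{child.tag}>"))
        slots.append((2 * index + 1, child.tail or ""))
    return slots


markup = "<tag>text<tag2/>text<tag2/>text</tag>"
for step, content in number_children(ET.fromstring(markup)):
    print(f"/{step}", repr(content))
# /1 'text'
# /2 '<tag2>'
# /3 'text'
# /4 '<tag2>'
# /5 'text'
```

Running it on markup with no text at all still yields the odd-numbered slots; they simply come back empty.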

And that’s the first key to understanding how to read an EPUB CFI. The fragment identifier is composed of numeric step references that walk you through the content: even-numbered steps refer to elements, and odd-numbered steps refer to inter-element locations.

There’s one more rudimentary concept to understand before seeing how to put this newfound information together, and that is the distinction between children and descendants. So far, all of the examples I’ve been showing have included empty elements, but to understand CFIs you need to wrap your head around the idea that you are progressively digging deeper and deeper into the content.

But to explain this better, take a look at this more familiar-looking HTML markup:

<p>I <em>know</em> what I <em>know</em>!</p>

A natural assumption based on the discussion so far is that you might sequence every component of text and markup, like this:

   /1 /2  /3   /4    /5    /6  /7   /8 /9
<p>I <em>know</em> what I <em>know</em>!</p>

But this would be wrong on a number of counts. Yes, the odd and even numbering match what they’re supposed to, but we’ve started overlapping different levels.

Each element represents its own uniquely sequenced set of content, so you have to first look at which content is a direct child of the p tag. Re-arranging the above markup should help clarify what I mean:

<p>
/1  I
/2  <em>know</em>
/3  what I 
/4  <em>know</em>
/5  !
</p>

The contents of the em tags are not direct children of the p tag, so they don’t count in the sequencing (end tags are never counted, as they always delimit the end of the sequence). Each new tag represents the start of another level, in other words.

Or, to modify the example yet again:

<p>
/1  I
/2  <em>
  /1  know
    </em>
/3  what I 
/4  <em>
  /1  know
    </em>
/5  !
</p>

So if “/2” refers to the em element and “/1” refers to the first instance of text in it, then by simple concatenation “/2/1” gets you a path from the p tag to the first emphasized word “know”, and “/4/1” will get you to the second.

And that’s how the concept of step resolution works in EPUB CFIs: you keep stepping further and further into your content until you reach the final destination.
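Step resolution is easy to sketch in code. The Python below is only an illustration under simple assumptions: it handles plain numeric steps, treats an odd step as the end of the path, and uses ElementTree's text and tail attributes to find the text slots. The function name resolve is hypothetical.

```python
import xml.etree.ElementTree as ET


def resolve(element, path):
    """Follow a path such as "/2/1" starting from element.

    Returns an Element when the last step is even,
    or the text of the slot when the last step is odd.
    """
    steps = [int(s) for s in path.strip("/").split("/")]
    node = element
    for position, step in enumerate(steps):
        if step < 1:
            raise ValueError("steps start at 1")
        children = list(node)
        if step % 2 == 0:
            # Even step: descend into the (step / 2)-th child element.
            node = children[step // 2 - 1]
        else:
            # Odd step: a text slot, which has nothing to descend into.
            if position != len(steps) - 1:
                raise ValueError("a text step must be the last step")
            slot = (step - 1) // 2
            text = node.text if slot == 0 else children[slot - 1].tail
            return text or ""
    return node


p = ET.fromstring("<p>I <em>know</em> what I <em>know</em>!</p>")
print(repr(resolve(p, "/2/1")))  # 'know'
print(repr(resolve(p, "/4/1")))  # 'know'
print(repr(resolve(p, "/3")))    # ' what I '
```

Each pass through the loop starts a fresh count at the current element, which is exactly why “/2/1” and “/4/1” can both end in “/1”.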

File Resolution

We’re almost ready to look at some real EPUB CFIs, but there’s still one more topic to cover. If step resolution allows you to move through the elements and text in a file, how do you get from the package document, where I said all CFIs begin, into other documents?

First, you need to know that all EPUB CFI step sequences begin with the spine. As the spine is the required third element of the package file, it’s not surprising that they all begin with a “/6” (the metadata section is “/2” and the manifest “/4”).

An even step always follows the “/6”, as the next step identifies the itemref of the content document you’re pointing into. Here’s a sample spine for demo purposes:

<spine>
  <itemref idref="cover"/>
  <itemref idref="toc"/>
  <itemref idref="chapter01"/>
</spine>

To get to chapter one, whose itemref is the third child element of the spine and therefore step “/6” (child elements take the even numbers 2, 4, 6), we’ll append another “/6”.

Let’s start building a real EPUB CFI for this:

package.opf#epubcfi(/6/6)

If you know your EPUB, the idref attribute on this element points to an item in the manifest that contains the location of the document. This is the information that reading systems use to determine which document to load as you read through a book.

In other words, it’s possible to get from the itemref we’ve reached to the content document we want to point into, we just need some way to say to follow that trail. And that’s where the special exclamation point character comes in:

package.opf#epubcfi(/6/6!)

Adding this character to the CFI tells the reading system to do exactly what we just said: follow the reference trail to the content document. Doesn’t get any easier than that! (pun intended)

We’re now into a content document and can continue writing “/#” steps as defined above to walk through to a specific location.
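Here is a hedged Python sketch of what following that trail might look like. The package document below is a trimmed stand-in built around the sample spine: the manifest entries and their href values are assumptions based on the container layout shown earlier, and a real package file needs more than this (metadata contents, a unique identifier, and so on). The function name content_document_href is hypothetical.

```python
import xml.etree.ElementTree as ET

OPF_NS = "{http://www.idpf.org/2007/opf}"

# Trimmed sample package; the manifest hrefs are assumed for illustration.
PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <metadata/>
  <manifest>
    <item id="cover" href="XHTML/cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="toc" href="XHTML/toc.xhtml" media-type="application/xhtml+xml"/>
    <item id="chapter01" href="XHTML/chapter01.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="cover"/>
    <itemref idref="toc"/>
    <itemref idref="chapter01"/>
  </spine>
</package>"""


def content_document_href(package_xml, spine_step):
    """Resolve "/6/<spine_step>!" to the href of the content document."""
    package = ET.fromstring(package_xml)
    # "/6": the third child element of the package is the spine.
    spine = list(package)[6 // 2 - 1]
    if spine.tag != OPF_NS + "spine":
        raise ValueError("expected the spine at step /6")
    # The next even step picks an itemref inside the spine.
    itemref = list(spine)[spine_step // 2 - 1]
    # "!": follow the idref to the matching manifest item.
    idref = itemref.get("idref")
    for item in package.iter(OPF_NS + "item"):
        if item.get("id") == idref:
            return item.get("href")
    raise LookupError(f"no manifest item with id {idref!r}")


print(content_document_href(PACKAGE, 6))  # XHTML/chapter01.xhtml
```

The href comes back relative to the package file, so a reading system would still have to resolve it against the location of package.opf before opening the document.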

Identifiers

Although the “/6/#!” pattern to start an EPUB CFI never changes, sometimes you’ll see bracketed text after the second number. These brackets contain the ID of the element.

For example, say the chapter one itemref had an ID like this:

<itemref id="c01ref" idref="chapter01"/>

If I want to be specific that my CFI points to chapter one, I can include this ID in the CFI by putting it in brackets after its step:

package.opf#epubcfi(/6/6[c01ref]!)

Although this might strike you as a pointless exercise, what if you later revise your publication and add an introduction to it? When you re-release the publication, the spine now looks like this:

<spine>
  <itemref idref="cover"/>
  <itemref idref="toc"/>
  <itemref idref="intro"/>
  <itemref id="c01ref" idref="chapter01"/>
</spine>

Without the ID, everyone with a bookmark, highlight or link into your publication would find them broken if they tried to migrate them to the new version. But with the ID on the itemref and in the EPUB CFI, a reading system can try to fix the problem. If it finds that the third itemref no longer has the ID “c01ref”, it can scan through all the other itemref elements to see if any of them do. Suddenly you have auto-correctable links!
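The correction described above might be sketched like this in Python. This is only a guess at one way a reading system could do it, not a description of how any particular one behaves, and the function name find_itemref is hypothetical.

```python
import xml.etree.ElementTree as ET

REVISED_SPINE = """<spine>
  <itemref idref="cover"/>
  <itemref idref="toc"/>
  <itemref idref="intro"/>
  <itemref id="c01ref" idref="chapter01"/>
</spine>"""


def find_itemref(spine, step, expected_id=None):
    """Return (itemref, step), correcting the step if the ID has moved."""
    children = list(spine)
    index = step // 2 - 1
    candidate = children[index] if 0 <= index < len(children) else None
    # Accept the numbered child when no ID was given or the ID matches.
    if candidate is not None and expected_id in (None, candidate.get("id")):
        return candidate, step
    # Otherwise scan the siblings for the ID and rebuild the step number.
    if expected_id is not None:
        for position, child in enumerate(children, start=1):
            if child.get("id") == expected_id:
                return child, 2 * position
    raise LookupError(f"cannot resolve step /{step}")


spine = ET.fromstring(REVISED_SPINE)
itemref, corrected = find_itemref(spine, 6, "c01ref")
print(itemref.get("idref"), corrected)  # chapter01 8
```

Without the ID, the same call would quietly return the new introduction at step “/6”, which is the broken-bookmark case.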

Putting it all together

Now that we know how to step into any content document, let’s quickly look at putting the steps together to make some very basic paths before moving on to the more powerful features of CFIs.

First, we’ll flesh out the earlier paragraph into a sample chapter to work with:

<html xmlns="http://www.w3.org/1999/xhtml">
  <head/>
  <body>
    <section>
      <h1>Chapter One</h1>
      <p>I <em>know</em> what I <em>know</em>!</p>
    </section>
  </body>
</html>

To point to the start of the first emphasized instance of the word “know”, the EPUB CFI now looks like this:

package.opf#epubcfi(/6/6[c01ref]!/4/2/4/2/1)

Breaking this down, we already know “/6/6[c01ref]!” gets us to this content document. The next “/4” points to the body (the head is “/2”). The “/2” following points to the first section, and the “/4” following it refers to the second child element, the p tag.

At this point we’ve circled back to where we started, with “/2/1” referencing the text inside the first em tag.
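For anyone who wants to check the breakdown mechanically, here is a Python sketch that tokenizes the CFI body and walks the second half through the sample chapter. It is deliberately limited: it assumes plain steps with optional bracketed IDs, splits on the exclamation point, and ignores everything else in the specification (escaping, offsets, ranges). The names parse_cfi and walk are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

STEP = re.compile(r"/(\d+)(?:\[([^\]]*)\])?")
PATH = re.compile(r"(?:/\d+(?:\[[^\]]*\])?)+")

CHAPTER = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head/>
  <body>
    <section>
      <h1>Chapter One</h1>
      <p>I <em>know</em> what I <em>know</em>!</p>
    </section>
  </body>
</html>"""


def parse_cfi(cfi_body):
    """Split a CFI body at "!" into lists of (number, id_or_None) steps."""
    parts = []
    for part in cfi_body.split("!"):
        if not PATH.fullmatch(part):
            raise ValueError(f"unsupported CFI path: {part!r}")
        parts.append([(int(n), ident or None) for n, ident in STEP.findall(part)])
    return parts


def walk(root, steps):
    """Return the element names visited, ending with the text if reached."""
    node, trail = root, []
    for number, _ident in steps:
        children = list(node)
        if number % 2 == 0:
            node = children[number // 2 - 1]
            trail.append(node.tag.split("}")[-1])  # drop the namespace
        else:
            slot = (number - 1) // 2
            text = node.text if slot == 0 else children[slot - 1].tail
            trail.append(repr(text or ""))
            break
    return trail


package_steps, chapter_steps = parse_cfi("/6/6[c01ref]!/4/2/4/2/1")
print(package_steps)  # [(6, None), (6, 'c01ref')]
print(walk(ET.fromstring(CHAPTER), chapter_steps))
# ['body', 'section', 'p', 'em', "'know'"]
```

Swapping the chapter half for “/4/2/2/1” lands on the text of the h1 instead, since the h1 is the first child element of the section.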

See, it’s not nearly so hard to read a CFI as you might think. Insane to write them out by hand, perhaps, but not impossible for a human to read.


With the basic formula for stepping into an EPUB under our belts, I’ll give anyone masochistic enough to have read this far a break.

I’ll write another post detailing character, temporal and spatial offsets and ranges next, as just getting to an element or text position isn’t what makes CFIs interesting. I’ll also point out why you might want to leave these links to reading system-derived features, at least at this time, even if they would make for cool referencing within your own publication.

(And for the record, my periodic sarcasm is not an indictment of CFIs! They’re a fascinating feature, but a tough concept to explain in layman’s terms…)


  1. Daniel Weck

    “The acronym is a shorthand for Content Fragment Identifier”

    ==> Canonical, not “Content” :)

    Thanks for the article Matt, it’s on my reading list for the week-end ;)


  2. matt.garrish

    Oh, wow, there’s a brain fart for you! Thanks, Dan!


  3. Barry O'Neill

    I’ve been looking for an automated process to index .EPUB files using the CFI approach. In a nutshell, the system my team is developing needs to automatically scan a repository of .EPUB’s, index each of them using CFI, and then have them ready for downloading to a users device. The Search part of it is a second task, but for the moment I’d really like to know how to batch process .EPUB’s to build CFI indexes.

    Thanks,

    Barry


  4. Aaron

    Oh, so clear! The official documentations are just driving me crazy! Thanks for this wonderful post!


  5. Rajesh

    Clearly explained!thanks a ton


  6. Peter Kosenko

    Is there a second part to this article, or did that not get written? Thanks for this much, however. It helped a lot trying to figure out the CFIs generated by epub.js

    https://github.com/futurepress/epub.js


  7. Rajesh Kumar

    Hi Ton,

    Thanks for the great articles. I just need a clarification for the following point.

    Copied from your articles:

    package.opf#epubcfi(/6/6[c01ref]!/4/2/4/2/1)

    Breaking this down, we already know “/6/6[c01ref]!” gets us to this content document. The next “/4″ points to the body (the head is “/2″). The “/2″ following points to the first section, and the “/4″ following it refers to the second child element, the p tag.

    As you mentioned that “/2” for the first section, “/4” for second child element P tag. For h1?

    My assumption, package.opf#epubcfi(/6/6[c01ref]!/4/2/6/2/1).

    Please advise.

    Thanks in advance,
    Rajesh


  8. Alain Chautard

    Thanks for the great article. I was struggling with CFI and now feel like I understand everything!

