EPUB Previews

I’m sure at some point you’ve downloaded a preview of an ebook from your ebookstore of choice. Sometimes the content is utterly useless in terms of deciding whether you want to purchase the book — a bunch of front matter splatter. Is reading the copyrights, preface, dedication, etc. really useful? What if there’s an index at the back you’d like to look at? What if you’re the content creator who wants the reader to have access to that index?

Of course, sometimes you get useful previews, but I’m trying to highlight the problem of previews being in the hands of the vendor: you can never be completely sure what will be in them. Unless the publisher runs the ebookstore, it’s not their choice what you see.

The EPUB Previews specification seeks to flip that paradigm back around so that the content creator is the one who decides what goes in the preview. The specification is not yet a recommendation, but should be sometime (hopefully early) this year. As it’s one of the more stable specifications the working group has published, I thought I’d give it a quick run-through, but caveat emptor if you try to jump the gun on it becoming a recommendation.

The specification actually defines two ways that you can create previews. The one you might expect is to create a standalone publication containing only the preview content, or what is called a preview publication. The other method is to identify the preview content within the package document, allowing the ebookstore or reading system to generate the preview (whether as a standalone publication in the case of an ebookstore, or by limiting access only to the specified preview content in the case of a reading system). Not surprisingly, these are referred to as embedded previews in the spec.

We’ll dig into each of these preview types for the rest of the post…

Standalone Previews

There’s not a lot surprising about standalone preview publications. As you’d expect, a preview publication is essentially a stripped-down version of the full publication, containing only the content you want to give readers for free.

It doesn’t mean that a preview will come DRM free, but it means that reading systems should allow full access to the content.

So if a preview is just a minimal publication, why a spec to define it, right? Well, there are a few necessary requirements that the spec standardizes even for these most basic of previews:

  • First and foremost, it standardizes how you identify a preview publication. You have to add a dc:type element with the value “preview” to the package metadata:
    <dc:type>preview</dc:type>
  • It also recommends identifying the parent publication using the dc:source element. For example, the ISBN of the parent could be specified like this:
    <dc:source>urn:isbn:9782367932095</dc:source>

    A related recommendation is to not assign the parent identifier to the preview (i.e., don’t use the same ISBN in a dc:identifier in both). As a preview is typically not considered a distinct work, it’s actually recommended not to assign it an ISBN at all.

  • And finally, the specification standardizes how to provide a link to obtain the parent publication, using the link element with the “acquire” relationship value. A link to the ebookstore page where the reader can buy the full publication could be included as follows:
    <link href="http://example.org/book/9781448103706"
          rel="acquire"
          type="text/html" />

    If you wanted to point to an OPDS catalogue entry, you’d similarly add a link like this:

    <link href="http://example.org/book/9781448103706.atom"
          rel="acquire"
          type="application/atom+xml;type=entry;profile=opds-catalog" />

The takeaway from the above is that identification of a standalone preview publication is primarily a metadata issue; there are no restrictions or requirements on the content itself. A complete set of preview-specific metadata could be as little as this:

<package …>
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:type>preview</dc:type>
        <dc:source>urn:isbn:9782367932095</dc:source>
        <link href="http://example.org/book/9781448103706"
              rel="acquire"
              type="text/html" />
        <link href="http://example.org/book/9781448103706.atom"
              rel="acquire"
              type="application/atom+xml;type=entry;profile=opds-catalog" />
        …
    </metadata>
</package>
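
To give a sense of how a reading system might consume this metadata, here’s a minimal Python sketch that checks for the preview type and pulls out the parent identifier and acquisition links. The function name and the returned dictionary are my own invention for illustration, not anything the spec defines, and the sample assumes the usual OPF namespace on the package element (the examples above elide it).

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"      # standard package namespace
DC_NS = "http://purl.org/dc/elements/1.1/"   # Dublin Core namespace

# Trimmed copy of the metadata example above (namespace added as an assumption).
SAMPLE_PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
    <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:type>preview</dc:type>
        <dc:source>urn:isbn:9782367932095</dc:source>
        <link href="http://example.org/book/9781448103706"
              rel="acquire"
              type="text/html" />
    </metadata>
</package>"""


def read_preview_metadata(package_xml):
    """Hypothetical helper: collect the preview-specific package metadata."""
    root = ET.fromstring(package_xml)
    metadata = root.find(f"{{{OPF_NS}}}metadata")
    if metadata is None:
        return {"is_preview": False, "source": None, "acquire_links": []}

    # A publication is a preview if any dc:type carries the value "preview".
    types = [(el.text or "").strip() for el in metadata.findall(f"{{{DC_NS}}}type")]
    source = metadata.find(f"{{{DC_NS}}}source")

    # rel is treated as a space-separated list, so split before testing.
    acquire_links = [
        link.get("href")
        for link in metadata.findall(f"{{{OPF_NS}}}link")
        if "acquire" in (link.get("rel") or "").split()
    ]
    return {
        "is_preview": "preview" in types,
        "source": source.text.strip() if source is not None and source.text else None,
        "acquire_links": acquire_links,
    }


info = read_preview_metadata(SAMPLE_PACKAGE)
```

With something like this in hand, a reading system could decide whether to show a “buy the full book” option and where to send the reader.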

But that’s not to suggest that there aren’t any content issues you’ll have to consider when putting together a preview. The primary one is linking to the unavailable content, which we’ll look at next.

Linking

One of the open questions about previews — both standalone and the embedded ones we’ll look at next — is what to do with the table of contents, specifically linking to the content that’s not available as part of the preview. You typically want to give the reader a full table of contents with the preview, even if the bulk of the content is not accessible, if your goal is to convince them there’s plenty of other content worth paying for.

Like many things you’ll find in EPUB, how best to do that is debatable, and that debate is still open for previews. The one thing you absolutely cannot do is strip the a tags from the table of contents entries, as it’s invalid and epubcheck will barf errors at you.

The currently proposed solutions to this problem are:

  1. Include a generic page with a message that the content is only available in the full publication, and modify all links to content not in the preview to point to it.
  2. Remove the href attribute from all a tags that point to content not included in the preview.

I’m partial to the latter approach because it potentially makes clearer what content is in the preview. If you look at the table of contents and only find a few active links, it’s obvious what you have access to. With a dummy page, all the links will appear active, so it will be harder for readers to determine which parts they can read. (Of course, many will just flip through the content provided and not look at the table of contents, but they don’t count!)

The open question about stripping the href attributes is what reading systems will do: are they going to explode? I ask that facetiously, but it’s a valid question that still needs real-world testing. Removing the href attribute is perfectly valid according to the navigation document requirements, but whether developers have accounted for links without href attributes is another matter. It might be safer to provide a dummy page.
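
If you want to experiment with the href-stripping approach, it’s easy enough to script. The following is only a rough Python sketch under my own assumptions: the nav markup is simplified, the function name is made up, and the set of available documents is something you’d supply yourself using the paths exactly as they’re written in the links.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
# Serialize XHTML elements without a prefix.
ET.register_namespace("", XHTML_NS)

# Simplified table of contents; a real navigation document has more around it.
SAMPLE_NAV = """<nav xmlns="http://www.w3.org/1999/xhtml">
    <ol>
        <li><a href="preface.xhtml">Preface</a></li>
        <li><a href="chapter01.xhtml#sec1">Chapter 1</a></li>
        <li><a href="chapter02.xhtml">Chapter 2</a></li>
    </ol>
</nav>"""


def deactivate_unavailable_links(nav_xml, available):
    """Hypothetical helper: drop href from links to documents not in the preview."""
    root = ET.fromstring(nav_xml)
    for anchor in root.iter(f"{{{XHTML_NS}}}a"):
        href = anchor.get("href")
        if href is None:
            continue
        # Ignore any fragment identifier when testing for availability.
        document = href.split("#", 1)[0]
        if document not in available:
            del anchor.attrib["href"]   # the a element and its label stay in place
    return ET.tostring(root, encoding="unicode")


stripped_nav = deactivate_unavailable_links(
    SAMPLE_NAV, {"preface.xhtml", "chapter01.xhtml"}
)
```

The a elements and their labels are left alone, so the result should stay valid; only the Chapter 2 entry loses its href.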

You’ll also need to handle linking to unavailable content in the viewable content. I’d again probably opt to remove the href and style the links grey to give a visual cue that they are there but not active. From an accessibility perspective, it might be misleading to convey potential linking only to sighted readers, but balancing that out I can’t imagine AT users being happy with a whole bunch of links that don’t go to real content.

Embedded Previews

Embedded previews are a lot more interesting than standalone previews, at least if you like markup solutions that can integrate with an existing publication. Don’t get discouraged if they sound kind of complex to create, as I’ll come back to why it’s not actually the case at the end.

The metadata for embedded previews is largely the same as what we just looked at for standalone publications, but where it’s expressed differs, as embedded previews are defined in the new package document collection element. (FYI, the collection element is basically a generic means of defining new package features without having to overload the file with more and more new elements.)

The key differences from a standalone preview are: 1) the collection’s role attribute defines that it contains a preview (you don’t specify a dc:type at the package metadata level, as the entire publication is not a preview); and 2) you can omit specifying a dc:source since the preview is embedded in its parent publication (if a standalone preview is generated from an embedded preview, the program generating it would be responsible for adding the proper metadata).

<collection role="preview">
    …
</collection>

The one consistency is that you provide a link to where to purchase the book in the package metadata, but I’m not going to repeat the same info I already detailed above. I’m lazy that way.

Anyway, it’s great that we have an empty preview collection element now, but you still have to tell the reading system/vendor which part(s) of the publication are previewable for it to be useful. There are two steps to doing this, which basically amount to mimicking the EPUB manifest and spine.

The first requirement is to create the manifest of resources necessary to render all the preview content, which is done by embedding a collection with the role of “manifest”. In it, you use link elements to point to the location of each resource.

For example, an embedded preview consisting of a preface and first chapter might also list a supplementary style sheet and images as follows:

<collection role="preview">
    <collection role="manifest">
        <link href="css/epub.css" media-type="text/css" />
        <link href="xhtml/nav.xhtml" media-type="application/xhtml+xml" />
        <link href="xhtml/preface.xhtml" media-type="application/xhtml+xml" />
        <link href="xhtml/chapter01.xhtml" media-type="application/xhtml+xml" />
        <link href="images/c01-img01.jpg" media-type="image/jpeg" />
        <link href="images/c01-img02.jpg" media-type="image/jpeg" />
    </collection>
    …
</collection>

Note that the publication’s navigation document is included as-is; you do not modify it yourself. The reading system (or vendor) is the one who processes its links, which means you have no control over whether they are removed and/or a dummy page inserted (or what the text of that dummy page says). I’m not sure if the lack of author control is a deficiency in the spec or not, but it sort of feels like one. I might have to raise that when work resumes.

The manifest probably sounds like a pain in the you-know-what to create, because it appears at first glance to just duplicate a part of the package manifest, but that difference is exactly the reason for its existence. If you want to generate a standalone publication from the embedded information, or just want to extract the preview from the larger publication, you need to know all the resources required in the rendering. The package manifest is no help, as it lists everything. To extract only the necessary files, without the burden of processing each content document to find out what it references, someone has to create the list. And that someone is going to be you if you’re a content creator.

The next step is to define a spine, or reading order, for the preview content documents, which is also done using link elements. These links are direct children of the preview collection and follow the manifest collection.

Although the previous manifest contained six resources, only two are actual content documents that are to be rendered: the preface and first chapter. Here then is the full embedded preview, placing these two documents in the preview spine:

<package …>
    …
    <collection role="preview">
        <collection role="manifest">
            <link href="css/epub.css" media-type="text/css" />
            <link href="xhtml/nav.xhtml" media-type="application/xhtml+xml" />
            <link href="xhtml/preface.xhtml" media-type="application/xhtml+xml" />
            <link href="xhtml/chapter01.xhtml" media-type="application/xhtml+xml" />
            <link href="images/c01-img01.jpg" media-type="image/jpeg" />
            <link href="images/c01-img02.jpg" media-type="image/jpeg" />
        </collection>
        <link href="xhtml/preface.xhtml" />
        <link href="xhtml/chapter01.xhtml" />
    </collection>
</package>
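
For anyone on the processing side, reading this structure back out is not hard. Here’s a minimal Python sketch, with names of my own choosing and the OPF namespace assumed on the package element, that separates the manifest links from the spine links and flags any spine document that was left out of the manifest.

```python
import xml.etree.ElementTree as ET

OPF = "{http://www.idpf.org/2007/opf}"   # standard package namespace

# Shortened version of the embedded preview above.
SAMPLE_PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
    <collection role="preview">
        <collection role="manifest">
            <link href="css/epub.css" media-type="text/css" />
            <link href="xhtml/preface.xhtml" media-type="application/xhtml+xml" />
            <link href="xhtml/chapter01.xhtml" media-type="application/xhtml+xml" />
        </collection>
        <link href="xhtml/preface.xhtml" />
        <link href="xhtml/chapter01.xhtml" />
    </collection>
</package>"""


def read_embedded_preview(package_xml):
    """Hypothetical helper: list the preview's resources and reading order."""
    root = ET.fromstring(package_xml)
    preview = root.find(f"{OPF}collection[@role='preview']")
    if preview is None:
        return None

    # Resources needed for rendering live in the nested manifest collection.
    manifest = [
        link.get("href")
        for link in preview.findall(f"{OPF}collection[@role='manifest']/{OPF}link")
    ]
    # Links that are direct children of the preview collection form the spine.
    spine = [link.get("href") for link in preview.findall(f"{OPF}link")]

    # Compare without fragment identifiers, since a spine link may carry one.
    missing = [href for href in spine if href.split("#", 1)[0] not in manifest]
    return {"manifest": manifest, "spine": spine, "missing": missing}


preview_info = read_embedded_preview(SAMPLE_PACKAGE)
```

An empty “missing” list here just means every document in the reading order can be found among the listed resources; it says nothing about whether the style sheets and images those documents reference are all present.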

I’ll briefly note here that it’s not a requirement to give access to an entire content document. If you specify a fragment identifier on the link, that indicates that the reader is only allowed to access the content up to, but not including, that point.

For example, if we only wanted to allow access to the first few paragraphs of each chapter, we could add the id “preview-end” to the paragraph where access ends and then include links like these:

<collection role="preview">
    <collection role="manifest">
        …
    </collection>
    <link href="xhtml/chapter01.xhtml#preview-end" />
    <link href="xhtml/chapter02.xhtml#preview-end" />
    …
</collection>
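
How a vendor or reading system actually cuts the content off is up to them, but to make the idea concrete, here’s one possible Python sketch. It assumes “up to, but not including” means document order, so it removes the identified element and everything that follows it. The sample markup (shown without namespaces for brevity) and the function name are mine, purely for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified chapter body.
SAMPLE_CHAPTER = """<body>
    <h1>Chapter 1</h1>
    <p>First paragraph.</p>
    <section>
        <p>Second paragraph.</p>
        <p id="preview-end">Third paragraph.</p>
        <p>Fourth paragraph.</p>
    </section>
    <p>Fifth paragraph.</p>
</body>"""


def truncate_at(root, fragment_id):
    """Hypothetical helper: remove the target element and all content after it."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    target = next((el for el in root.iter() if el.get("id") == fragment_id), None)
    if target is None or target is root:
        return root   # nothing to cut

    node = target
    remove_node = True   # the target itself goes; its ancestors are kept
    while node in parents:
        parent = parents[node]
        siblings = list(parent)
        index = siblings.index(node)
        start = index if remove_node else index + 1
        for sibling in siblings[start:]:
            parent.remove(sibling)
        if not remove_node:
            node.tail = None   # text after a kept ancestor also follows the cut
        remove_node = False
        node = parent
    return root


chapter = truncate_at(ET.fromstring(SAMPLE_CHAPTER), "preview-end")
truncated = ET.tostring(chapter, encoding="unicode")
```

In the sample, the heading and the first two paragraphs survive, while the third paragraph (the one carrying the id) and everything after it are dropped.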

One bit I haven’t shown is that you can also include a metadata element in the preview collection, but as no metadata is required at the preview level, there’s no spec-related reason to do so. Detailing completely optional elements falls into my optional list of things to do. Metadata is obtained directly from the package metadata section, since it’s generally not selective to specific documents, which is why the acquisition link stays there.

But to wrap this section up, I’ll quickly return to my promise to explain why creating an embedded preview is not so bad as it might seem. If you think about creating the collection manifest and spine equivalents, is it really any more complex than taking a publication and stripping it down? You can copy and paste manifest and spine items from the publication and just tweak them to be links.
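
If copy and paste sounds too manual, the same tweak can be scripted. This is just a sketch in Python, assuming you pick the manifest item ids yourself; the function name and its parameters are hypothetical, and it only builds the collection element rather than writing it back into the package document.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"   # standard package namespace
OPF = f"{{{OPF_NS}}}"
ET.register_namespace("", OPF_NS)

# A cut-down package manifest to draw from.
SAMPLE_PACKAGE = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
    <manifest>
        <item id="css" href="css/epub.css" media-type="text/css" />
        <item id="preface" href="xhtml/preface.xhtml" media-type="application/xhtml+xml" />
        <item id="c01" href="xhtml/chapter01.xhtml" media-type="application/xhtml+xml" />
        <item id="c02" href="xhtml/chapter02.xhtml" media-type="application/xhtml+xml" />
    </manifest>
</package>"""


def build_preview_collection(package_xml, resource_ids, spine_ids):
    """Hypothetical helper: turn chosen manifest items into preview links."""
    root = ET.fromstring(package_xml)
    items = {item.get("id"): item for item in root.iter(f"{OPF}item")}

    preview = ET.Element(f"{OPF}collection", {"role": "preview"})
    manifest = ET.SubElement(preview, f"{OPF}collection", {"role": "manifest"})

    # Each manifest item becomes a link carrying the same href and media type.
    for item_id in resource_ids:
        item = items[item_id]
        ET.SubElement(manifest, f"{OPF}link", {
            "href": item.get("href"),
            "media-type": item.get("media-type"),
        })

    # Spine entries follow the manifest collection and only need the href.
    for item_id in spine_ids:
        ET.SubElement(preview, f"{OPF}link", {"href": items[item_id].get("href")})
    return preview


collection = build_preview_collection(
    SAMPLE_PACKAGE, ["css", "preface", "c01"], ["preface", "c01"]
)
collection_xml = ET.tostring(collection, encoding="unicode")
```

The resulting collection_xml string could then be pasted into the package document after the spine.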

You also don’t have to worry about all the manual work to make the standalone publication valid, like removing manifest and spine items and tweaking the table of contents and links. With an embedded preview, it’s the vendor or reading system that has to do all that work. Granted, you lose control over the final product, but that may not be a pressing concern so much as getting the content you want to the reader.

Choosing a Preview Type

To be honest, it’s probably not worth answering this question at this time, but what the hey. You’re not going to be able to make previews as outlined above until both the 3.0.1 and previews specs are finalized anyway, as epubcheck errors are going to get in your way.

That said, if I were to predict a future for previews, I’d expect that content creators will be asked to provide embedded previews with their publication, but vendors will generate standalone previews for distribution to reading systems, sending the full publication only after the content has been purchased. I can’t really see embedded previews living outside a vendor ecosystem, at any rate, for reasons including:

  • Bandwidth: Do you really want readers downloading the full publication if potentially only a small percentage of readers who sample the ebook will buy it? In the case of a simple headings-and-text novel with no embedded fonts or other resources that would bloat the container size, this consideration is probably moot. If you have a rich multimedia publication, with all the content embedded in the container, it’s not a decision to take lightly.
  • DRM: Although it’s not stated that embedded previews can only exist within a restricted EPUB, it doesn’t make a lot of sense to include a preview in unlocked content. You might not be the one applying the DRM (the vendor would likely be the one doing that), but an open EPUB is an open EPUB to reading systems. How the reading system determines when to unlock the content is left to vendors to best determine within their ecosystems, so DRM’ed ebooks floating around waiting for some random vendor to unlock them are unlikely.

Preview publications are more likely to fill the void of content creators wanting to get their publications out to the reading public independent of a vendor ebookstore. You can make such things right now, of course, but links to buy the content have to be embedded in the content. The previews spec’s addition of acquisition links will provide greater flexibility, especially since standardizing them together with the preview identifier will allow reading systems to recognize previews and present the purchase options to the reader.

But I see that this quick post has turned into a weighty tome, so I’ll cut myself off at this point. The last thing I’ll say is that as the specification is still only a working draft, there’s plenty of time to make comments and requests if you find it lacking. Feedback is best directed at the Google issue tracker, as always.


  1. eurythrace

    I posted this already on the IDPF forum here, but after reading this article, it seems mandatory to restate the issue about the content of the <dc:identifier> and <dc:source> elements that is constantly being given as examples.

    As shown above, the typical format for an ISBN identifier is:

    <dc:identifier id="isbn">urn:isbn:9780375704024</dc:identifier>
    <meta refines="#isbn" property="identifier-type" scheme="onix:codelist5">15</meta>

    HOWEVER!

    From the Onix for Books Codelists Issue 25:

    15 ISBN-13 International Standard Book Number, from 2007, unhyphenated (13 digits starting 978 or 9791–9799).

    22 URN Uniform Resource Name: note that in trade applications an ISBN must be sent as a GTIN-13 and, where required, as an ISBN-13 – it should not be sent as a URN.

    Therefore, as I read it, the typical example format is completely incorrect. If one uses the

    urn:isbn:9780375704024

    notation, then the onix:codelist5 value should be 22, not 15, as follows:

    <dc:identifier id="isbn">urn:isbn:9780375704024</dc:identifier>
    <meta refines="#isbn" property="identifier-type" scheme="onix:codelist5">22</meta>

    BUT!

    The onix:codelist5 specifically says do NOT use the URN notation for ISBNs. Further, the format defined for code 15 is a 13 digit non-hyphenated number. Therefore, as I read it, the correct syntax for an ISBN identifier should be:

    <dc:identifier id="isbn">9780375704024</dc:identifier>
    <meta refines="#isbn" property="identifier-type" scheme="onix:codelist5">15</meta>

    In this same vein, the International DOI Foundation maintains DOIs are NOT a URN and using that type of notation is incorrect.

    URN architecture assumes a DNS-based Resolution Discovery Service (RDS) to find the service appropriate to the given URN scheme. However no such widely deployed RDS schemes currently exist…. DOI is not registered as a URN namespace, despite fulfilling all the functional requirements, since URN registration appears to offer no advantage to the DOI System. It requires an additional layer of administration for defining DOI as a URN namespace (the string urn:doi:10.1000/1 rather than the simpler doi:10.1000/1) and an additional step of unnecessary redirection to access the resolution service, already achieved through either http proxy or native resolution. If RDS mechanisms supporting URN specifications become widely available, DOI will be registered as a URN.

    — International DOI Foundation, Factsheet: DOI System and Internet Identifier Specifications

    Instead, a URL type of notation is preferred. Therefore a DOI identifier should be of the form:

    <dc:identifier id="doi">http://dx.doi.org/10.1000/182</dc:identifier>
    <meta refines="#doi" property="identifier-type" scheme="onix:codelist5">06</meta>

    The use of UUIDs or OIDs should still be correct as the typical examples show:

    <dc:identifier id="uuid">urn:uuid:f47ac10b-58cc-4372-a567-0e02b2c3d479</dc:identifier>
    <meta refines="#uuid" property="identifier-type" scheme="onix:codelist5">22</meta>

    <dc:identifier id="oid">urn:oid:1.3.6.1.4.1.41672.2.0.1.0</dc:identifier>
    <meta refines="#oid" property="identifier-type" scheme="onix:codelist5">22</meta>

    As a personal note, I was looking over the available alternatives for identifiers for ebooks rather than having to purchase ISBNs and discovered the OID system. Although UUIDs are also free, there IS a chance that duplicates can be generated since no one can use a pure random number generator, etc.

    The PEN (Private Enterprise Number) system for OIDs administered by IANA is nice because it provides a registered system that is FREE to register. Once an entity has been assigned its PEN, it has complete control over the sub-tree assignment of numbers to objects. Further, that sub-tree can remain private, or it can be registered at an OID repository.

    If I have misunderstood or misinterpreted anything, please correct my misapprehension. Thank you in advance.


    1. matt.garrish

      Sorry, looks like you got caught in a perfect storm of no-notifications. Your posts here got flagged as spam, so I never got an alert, and for some reason I haven’t been getting IDPF forum alerts so never thought to check that there were new posts until today.

      The use of 15 for the scheme values does look incorrect in the specs, I agree. There was a late change from numeric values to URNs during the 3.0 revision, and it appears the schemes weren’t updated to reflect the change. I’m not going to jump in on whether it’s appropriate or not to replace it by 22, though. Since the scheme is only used in epub package metadata to identify the nature of the identifier, and it is a URN, how much of ONIX’s recommendations have to be followed I’d only be speculating on. I’ve asked some people who definitely will know the answer to that question, and either they or I will follow up to your question in the IDPF forum. (Similarly, yes, the urn: on the doi: should be removed.)

      Whatever the answer, though, what ONIX says doesn’t mean you can’t use URNs for ISBNs in the EPUB package metadata, so I don’t see an issue with the previews spec metadata since it doesn’t specify schemes for any of the identifiers. The issue is more pertinent when suggesting the value conforms to a specific scheme.

      But thanks very much for pointing this out, as after a while the spec examples stop registering as needing review.
