So how many metadata standards is too many metadata standards? How many metadata frameworks is too many metadata frameworks? How many roads must a man walk down, before you call him a man? (Seven!)
EPUB has its own share of metadata headaches. On the one hand, you need enough centralized metadata to facilitate reading system presentation of your book to readers. On the other, you need metadata to move the publication through a distribution chain. And even then there are people who want MARC, MODS, JATS and other metadata to facilitate their own processing.
The EPUB specification sort of floats between problems neatly enough, but it’s yet another custom framework that in my experience causes as much confusion as it solves, and leaves me wondering if there were another way to handle metadata, like going all-in on schema.org in a future version.
(Note: I’m defining a metadata framework as the mechanisms for expressing metadata and a standard as a definition of properties, for those who care about words.)
Multiplicity
One of the more annoying parts of EPUB is that you have to do metadata twice, at least if you’re selling your content. You have the package document metadata to fill out so you get more than just a cover image in bookshelves, and then you need an ONIX record for distribution.
The usual question is why not just use ONIX and require reading systems to parse it? The primary reason for not going that route, at least in my travels, has been the complexity factor. Not all EPUBs are sold, and IDPF isn’t aiming solely at trade publishing adoption of the format (look at IBM’s announcement of their intention to use it for documentation, or use by accessibility-focused organizations).
Package metadata should be simple to implement and to parse. ONIX can be as complex as necessary.
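To put "simple to parse" in concrete terms, here is a minimal Python sketch of pulling the basic Dublin Core fields out of a package document with nothing but the standard library. The sample package document, its values, and the function name read_package_metadata are made up for illustration; only the two namespace URIs come from the EPUB and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET

# Namespaces used in EPUB package documents.
NS = {
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample package document, trimmed to its metadata section.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="pub-id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:identifier id="pub-id">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
    <dc:title>An Example Book</dc:title>
    <dc:language>en</dc:language>
    <dc:creator>A. N. Author</dc:creator>
    <meta property="dcterms:modified">2014-01-01T00:00:00Z</meta>
  </metadata>
</package>
"""


def read_package_metadata(opf_xml):
    """Return a dict mapping a few Dublin Core element names to lists of values."""
    root = ET.fromstring(opf_xml)
    metadata = root.find("opf:metadata", NS)
    result = {}
    for name in ("identifier", "title", "language", "creator"):
        # Each element may repeat, so every field is collected as a list.
        result[name] = [
            el.text.strip()
            for el in metadata.findall(f"dc:{name}", NS)
            if el.text
        ]
    return result


if __name__ == "__main__":
    for field, values in read_package_metadata(SAMPLE_OPF).items():
        print(field, values)
```

This is about all a bookshelf needs for a title and author line. An ONIX record would take considerably more code to read in any useful way.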
And so EPUB is in the metadata game. Like HTML, it’s trying to provide a framework for expressing metadata, but it also gets caught up in having to define metadata that reading systems can use.
I don’t see the duality going away any time soon.
Enter RDFa/Microdata + Schema.org
The EPUB 3.0.1 revision took a couple of interesting steps in allowing RDFa and microdata attributes in content documents and also opening the package document to simpler inclusion of schema.org properties (the latter being a bit of a hack because package metadata itself is a hack on RDFa).
While cool in its own right to have more mechanisms for discovery (search indexing of books), we’re now up to expressing metadata two to three times for maximal coverage.
Had these technologies been further along back in 2010-11 when the 3.0 revision was going on, it might have led to a more interesting outcome (although in hindsight I doubt it). There was interest in them, of course, but those standards folks were in their own pitched battles at the time, and schema.org wasn’t yet the behemoth of metadata it’s become.
But picture metadata that could be both machine processable and live as a human-readable document.
Sure, you’re probably saying that’s already what RDFa facilitates, but what if the package document pushed metadata out to a content document, the same way that navigation was moved from the NCX to the new EPUB navigation document?
What if your title page verso with your CiP info could be tagged up for reading system extraction?
What if metadata in the package were reduced to a set of links, one required to point to the publication metadata and the others doing what they do now, pointing out to other record formats?
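To picture the links-only idea, here is a purely speculative Python sketch of how a reading system might find the metadata document. Nothing like this exists in any EPUB specification: the version="4.0" attribute, the "publication-metadata" rel value, the file paths, and both function names are invented for illustration. The link element itself and the "onix-record" relation are borrowed from how EPUB 3 points to external records today.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Hypothetical package document: metadata is nothing but links.
# "publication-metadata" is an invented rel value, not part of any spec.
HYPOTHETICAL_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="4.0">
  <metadata>
    <link rel="publication-metadata" href="xhtml/copyright.xhtml#pub-meta"/>
    <link rel="onix-record" href="meta/onix.xml"/>
  </metadata>
</package>
"""


def metadata_links(opf_xml):
    """Map each rel value to the list of hrefs that declare it."""
    root = ET.fromstring(opf_xml)
    links = {}
    for link in root.findall("opf:metadata/opf:link", OPF_NS):
        # rel is a space-separated list, so one link can carry several relations.
        for rel in link.get("rel", "").split():
            links.setdefault(rel, []).append(link.get("href"))
    return links


def publication_metadata_target(opf_xml):
    """Return (path, fragment) for the single required metadata link."""
    hrefs = metadata_links(opf_xml).get("publication-metadata", [])
    if len(hrefs) != 1:
        raise ValueError("expected exactly one publication-metadata link")
    path, _, fragment = hrefs[0].partition("#")
    return path, fragment or None


if __name__ == "__main__":
    print(publication_metadata_target(HYPOTHETICAL_OPF))
    print(metadata_links(HYPOTHETICAL_OPF))
```

Whether the link should target a whole file or a fragment within one is exactly the kind of question a real design would have to settle; the sketch simply allows both.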
Cons
Okay, to come out of fantasy land for a moment, there are plenty of obstacles to this ever becoming a reality…
The big one is that such an approach is completely incompatible with how metadata is expressed now. It would take an EPUB 4 that doesn’t fall into the trap of having to accommodate EPUB 2 or 3. (But Dublin Core must die eventually!)
EPUB’s also probably stuck in its current ways thanks to the new collection element. It adapts the package metadata element, which might turn metadata into a proliferation of mini-HTML fragments or documents (depending on how referencing would work: to the file or fragment).
The use of the metadata.xml file by the Multiple-Rendition Publications specification is another headache, although we realized that trying to standardize metadata at that level is futile because of EPUB’s legacy, so it’s pretty weak in terms of using metadata.
The use of RDFa and microdata also might not fall into some people’s definition of simple; it will take some work to extract values.
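For a sense of what that work looks like at its most basic, here is a rough Python sketch that reads schema.org properties out of an RDFa-tagged copyright page. The sample markup, its values, and the function name extract_properties are made up for illustration. It is nowhere near a conforming RDFa processor: it ignores prefixes, vocabulary resolution, nested typed items, and the resource and href attributes, and it only works because EPUB content documents are well-formed XHTML.

```python
import xml.etree.ElementTree as ET

# Made-up title page verso, tagged with RDFa attributes and schema.org terms.
SAMPLE_VERSO = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Copyright</title></head>
  <body>
    <section vocab="http://schema.org/" typeof="Book">
      <p><span property="name">An Example Book</span></p>
      <p>by <span property="author">A. N. Author</span></p>
      <p>ISBN <span property="isbn">000-0-00-000000-0</span></p>
      <p>Published <time property="datePublished" content="2014-01-01">January 2014</time></p>
    </section>
  </body>
</html>
"""


def extract_properties(xhtml, type_name="Book"):
    """Collect property values found inside elements whose typeof matches type_name."""
    root = ET.fromstring(xhtml)
    found = {}
    for scope in root.iter():
        if scope.get("typeof") != type_name:
            continue
        for el in scope.iter():
            prop = el.get("property")
            if prop is None:
                continue
            # A content attribute, when present, overrides the visible text.
            value = el.get("content")
            if value is None:
                value = " ".join("".join(el.itertext()).split())
            found.setdefault(prop, []).append(value)
    return found


if __name__ == "__main__":
    for prop, values in extract_properties(SAMPLE_VERSO).items():
        print(prop, values)
```

The same page stays readable to humans while the date carries a machine-friendly value in its content attribute, which is the appeal. The cost is that a robust implementation would want a real RDFa library rather than a loop like this.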
Pros
The simple pro of having one way of expressing metadata, and of getting IDPF out of the metadata game once and for all, easily outweighs all the cons, in my rarely humble opinion.
But beyond that, a move away from the artificial separation of metadata from content would be another plus. It’s hard to argue it’s needed for EPUB when the primary use of the metadata is for reading system rendering. Metadata could be much more accessible, and better handle i18n, if it weren’t restricted to text strings, too.
There’s also an opportunity to enrich metadata for the web generally. Any move to schema.org would require evaluating the existing CreativeWork properties in more detail to see what might be missing. (I looked a long time back, and while there was a lot of overlap with Dublin Core, not everything people do now was covered.)
Alas, none of this will come to pass any time soon, if ever, I expect. I doubt an EPUB 4 that could make this kind of break is anywhere on the horizon; 3.x updates make the most sense until the next tide rises in digital publishing.
Still, you’d be wise to consider RDFa+schema.org metadata in your content documents, regardless.
Anyway, I thought I’d take a break from tutorials on IDPF specs for a day to engage in a little rampant speculation.
Take it for what it is. :)