Semantically Structuring HTML

I don’t know what has me thinking about the use of the epub:type attribute to structure HTML markup today, except for the obvious sad fact that I like thinking about markup issues.

The attribute is increasingly being used to build a kind of semantic scaffolding to prop up the generic markup that is HTML — going beyond simple semantic inflection of structures into semantic markup models where there are required parent and child relationships.

It’s not a new idea, but can it succeed in EPUB?

Semantic Models

I should qualify the change I’m seeing in more detail before getting in too deep. When I discussed semantic inflection in the best practices book, the semantics were mostly limited to identifying a structure: this section is a chapter, or a bibliography, or an index, or whatever.

There were some linkable semantics — like a noteref pointing to a note or a rearnote being in a section of rearnotes — but most of the EPUB Structural Semantics vocabulary was about basic labelling.
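To make the "linkable" part concrete, here is a minimal sketch of checking that each noteref actually points at a note. The sample markup, the `dangling_noterefs` helper and the choice of note semantics to accept are my own assumptions for illustration, not anything the vocabulary mandates.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes epub:type under its expanded namespace name.
EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Hypothetical sample: the second noteref has no matching note.
SAMPLE = """<section xmlns="http://www.w3.org/1999/xhtml"
         xmlns:epub="http://www.idpf.org/2007/ops"
         epub:type="chapter">
  <p>Some claim.<a epub:type="noteref" href="#n1">1</a></p>
  <p>Another claim.<a epub:type="noteref" href="#n2">2</a></p>
  <section epub:type="rearnotes">
    <aside epub:type="rearnote" id="n1">The first note.</aside>
  </section>
</section>"""

def types_of(element):
    """Return the set of space-separated epub:type values on an element."""
    return set(element.get(EPUB_TYPE, "").split())

def dangling_noterefs(root):
    """List hrefs of noterefs whose same-document target is not a note."""
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    note_types = {"footnote", "rearnote"}  # assumption: the semantics we accept
    problems = []
    for el in root.iter():
        if "noteref" in types_of(el):
            href = el.get("href", "")
            target = by_id.get(href[1:]) if href.startswith("#") else None
            if target is None or not (types_of(target) & note_types):
                problems.append(href)
    return problems

root = ET.fromstring(SAMPLE)
print(dangling_noterefs(root))
```

It only follows same-document fragment links, which is enough to show the idea: the semantic on one element creates an expectation about the semantic on another.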

That simple model has been changing as a result of the work on indexes, dictionaries and educational profiles of EPUB. To provide actual meaning beyond just “this section contains an index, make of it what you will”, richer markup was needed, and richer semantic inflection was the natural method for doing that without stepping outside HTML.

(The indexes specification is a good skim if you just want to see how semantics are being nested to create richer structures using basic HTML elements.)
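If you'd rather see the nesting than read about it, the following sketch prints the semantic outline of a tiny index. The semantic names are the ones I recall from the indexes vocabulary, so treat both them and the sample markup as illustrative; `semantic_outline` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Illustrative index fragment built from ordinary HTML elements.
INDEX = """<section xmlns="http://www.w3.org/1999/xhtml"
         xmlns:epub="http://www.idpf.org/2007/ops"
         epub:type="index">
  <ul epub:type="index-entry-list">
    <li epub:type="index-entry">
      <span epub:type="index-term">markup</span>
      <a epub:type="index-locator" href="ch1.xhtml#p12">12</a>
    </li>
  </ul>
</section>"""

def semantic_outline(element, depth=0):
    """Yield (depth, semantic, tag) for every element that carries epub:type."""
    semantic = element.get(EPUB_TYPE)
    if semantic:
        yield depth, semantic, element.tag.split("}")[-1]
        depth += 1  # only semantic elements add a level to the outline
    for child in element:
        yield from semantic_outline(child, depth)

for depth, semantic, tag in semantic_outline(ET.fromstring(INDEX)):
    print("  " * depth + f"{semantic} <{tag}>")
```

The elements are as generic as they come (section, ul, li, span, a); all of the structure a processor cares about lives in the attribute values.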

The question it leaves me with, however, is not whether this markup is valid, but whether it is changing the nature of HTML itself, and whether that’s even a bad thing considering that EPUB is a distinct format, not just a series of generic web pages.

Semantics on the Web

I’m not sure there’s a good equivalent for what EPUB is doing, which is kind of what you would hope to find when starting something new. Reinventing wheels… well you know how that goes.

When you talk about semantics and the web, the traditional expectation is that you’re talking about “The Semantic Web”, but that’s linked data and exposing the information on pages. We’re not going there today.

Structural semantics do have a home in the ARIA role attribute, but ARIA defines more specialized semantics tailored to making web page structures accessible, not exactly what EPUB is after, either.

The lack of a good mechanism for expressing structural semantics has probably played some role in holding back development more broadly on the web, with the class attribute being overloaded and largely useless except for authoring styles.

The lack of a purpose for structural semantics has probably also held back development. They don’t improve your ranking in Google, and browsers don’t stray into special handling of markup these days.

But structural semantics are important in publishing, as they facilitate content production and combination, and enable reading system behaviours.

Content Creation

From a make-or-break perspective, production is key. If you can’t produce these structures, it won’t matter what reading systems might do with them.

While the initial structural semantics can be easily applied — even if it means manually pasting into the markup — the development of the more complex semantic structures is pushing the envelope of what non-specialized tools, or human hand coders, can generate.

If you layer in too much semantic information that has no appreciable value, for example, are you just making semantics for the sake of semantics? Will people be able to keep them all straight in their heads? Will anyone spend the time to author them?

The flip side of the coin is that if you limit the semantics to only those that can be acted upon by a reading system in some meaningful way, you make it difficult to use the markup in back-end systems.

And that’s not an unimportant consideration, whatever your personal feelings, as publishers get handcuffed when it’s not possible to make their data as rich as it needs to be. Some are looking to move to XHTML as their internal format (or at least an intermediary to web-based exports), but are finding the semantics problematic.

But to pull back, the problem of semantics is that they often lead to manual creation/validation, which is a pain for everyone regardless of their proficiency with markup.

Validation

It’s true that any semantic-based extension will be more difficult to validate than native markup, as schema languages don’t all lend themselves to the potential fluidity of semantic inflection. Neither do many production tools.

The reason is that semantic inflection creates a layer of specificity on top of the markup it annotates that is not as rigid as element content models, and mixing the layers isn’t easy.

Or, put another way, it’s often possible for many elements to carry a semantic, and for the structure of the child semantics to not necessarily be tied to child elements (e.g., there could be layers of purely presentational markup between the parent semantic and its child).
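A rough sketch of what that means for a processor: to find the semantic children of an element you can't just look at its element children, you have to look through any wrappers that carry no semantic. The markup and the `semantic_children` helper below are assumptions of mine for illustration.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Hypothetical entry with presentational wrappers between parent and children.
ENTRY = """<li xmlns="http://www.w3.org/1999/xhtml"
    xmlns:epub="http://www.idpf.org/2007/ops"
    epub:type="index-entry">
  <div class="row"><b><span epub:type="index-term">markup</span></b></div>
  <div class="row"><a epub:type="index-locator" href="#p12">12</a></div>
</li>"""

def semantic_children(element):
    """Return the nearest descendants carrying epub:type, looking through
    wrappers that have none. Does not descend past a match."""
    found = []
    for child in element:
        if child.get(EPUB_TYPE):
            found.append(child)
        else:
            found.extend(semantic_children(child))
    return found

entry = ET.fromstring(ENTRY)
print([c.get(EPUB_TYPE) for c in semantic_children(entry)])
```

An element content model would see two div children here; the semantic layer sees a term and a locator. That mismatch is exactly what makes the two layers awkward to validate together.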

This fluidity is why semantics only make sense, and are only usable, to processing systems and user agents that know what to make of them. In a way, you’re changing the meaning of the content, bending HTML to your will, if you will.

I’m okay with that, if only because it’s the only way to make HTML more useful for storage, combination and processing in back-end content management systems. That it enables reading systems to present the content in meaningful ways is the icing on the cake. (Others might see those benefits the other way around, but that’s my data bent coming out.)

The problem of this scaffolding, as I’m now taking to calling it, is that the content model of the underlying HTML is static. It doesn’t change to fit the semantics unless you impose your requirements on top of it. That’s where the authoring/validation complexity emerges.

Epubcheck can handle custom validation for the final format, and Schematron is a natural fit for checking semantic relationships, but most people want on-the-fly validation as they work. That’s more difficult, and will always be a problem when it comes to off-the-shelf HTML authoring tools.

Even tools like Oxygen that can validate on the fly using Schematron can’t tell you what semantic comes next. You might get a hint from an error on a missing semantic, but what about the optional ones?

Life is hard in these parts, eh?

There’s also a critique of semantic structures: the deeper you nest them, the more careful you have to be in ensuring conformance. If I define “part” as a semantic for divisions of a book and also for divisions of an index, I have to stop looking up the ancestor chain at the first matching parent; otherwise I might try to validate an index part like a book part.
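Here is a small sketch of that ancestor-chain rule. It keeps the hypothetical dual-use “part” from the paragraph above and invents the sample markup and both helper functions, so read it as one possible way to resolve context rather than how any validator actually does it.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

# Hypothetical book: a "part" that contains an index that contains a "part".
BOOK = """<section xmlns="http://www.w3.org/1999/xhtml"
         xmlns:epub="http://www.idpf.org/2007/ops"
         epub:type="part" id="book-part">
  <section epub:type="index">
    <div class="columns">
      <section epub:type="part" id="index-part"/>
    </div>
  </section>
</section>"""

def nearest_semantic_ancestor(element, parents):
    """Walk up and stop at the first ancestor that carries any epub:type."""
    node = parents.get(element)
    while node is not None and not node.get(EPUB_TYPE):
        node = parents.get(node)
    return node

def part_context(element, parents):
    """Classify a 'part' by its nearest semantic ancestor only."""
    ancestor = nearest_semantic_ancestor(element, parents)
    if ancestor is not None and "index" in ancestor.get(EPUB_TYPE).split():
        return "index part"
    return "book part"

root = ET.fromstring(BOOK)
# ElementTree has no parent pointers, so build a child -> parent map.
parents = {child: parent for parent in root.iter() for child in parent}
for el in root.iter():
    if "part" in el.get(EPUB_TYPE, "").split():
        print(el.get("id"), "->", part_context(el, parents))
```

If the walk kept going past the index section, the inner part would also find the outer book part among its ancestors, which is the ambiguity the stopping rule avoids.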

That, in turn, leads to the temptation to get overly strict about the parent/child elements that can carry semantics, which is made complex by the million-and-one ways that people craft their markup. It might make authoring easier, but it also frustrates authors when there is only “One Way To Do It”.

So far I haven’t seen any patterns that would cause semantic ambiguities, but it is a problem that will always have to be on the radar.

Changing HTML

Production and validation aside, perhaps the most interesting effect is the way these semantics change the meaning of HTML to better reflect publishing, as I’ve been hinting at.

Where RDFa, microdata and microformats better expose information about the content for machine processing, structural semantics better expose the underlying structure that would otherwise be lost to HTML’s generic heart.

I don’t know if HTML people will ever take exception to this subtle reworking of core structures to reflect publishing-defined needs. It’s not like it’s running under the radar of the W3C, at least. (Full disclosure: I’m not a part of the digital publishing interest group they’ve spawned, but I do know discussions are taking place there.)

The beauty, I suppose, is that the semantics live in a space where they can be ignored if you don’t care about them and used if you do. EPUB has cleared the class attribute hurdle, at least, in trying this direction.

epub-type

Before signing off, I should note that the question of how to evolve the epub:type attribute has never fallen off the radar, as I’ve said numerous times on this blog. Current discussions are around an epub-type attribute to replace epub:type, similar to how ITS extended HTML with its-* attributes.
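For anyone wondering what that would mean for processing code, probably not much. The sketch below reads semantics from either form. Since epub-type is only under discussion, I'm assuming it would keep the same space-separated values; the `semantics` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

EPUB_TYPE = "{http://www.idpf.org/2007/ops}type"

def semantics(element):
    """Collect semantics from epub:type and from the proposed epub-type.
    Assumption: both hold the same space-separated list of values."""
    prefixed = element.get(EPUB_TYPE, "")
    dashed = element.get("epub-type", "")
    return set(prefixed.split()) | set(dashed.split())

old = ET.fromstring(
    '<section xmlns:epub="http://www.idpf.org/2007/ops" epub:type="chapter"/>'
)
new = ET.fromstring('<section epub-type="chapter"/>')
print(semantics(old) == semantics(new))
```

The practical difference is that the dashed form needs no namespace declaration, so it would survive in plain HTML as well as in XHTML.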

So if you’ve been reading this post with the mindset that it further entrenches EPUB into custom XML semantics and CURIE datatypes, not so. Change is eventually coming.


I guess in an evolutionary sense, it’s inevitable that simple semantics would begin to grow and coalesce into more complex semantics. If anything, it proves the value of having added epub:type to the format.

I won’t wrap up by trying to predict the future, as I’ve gone on long enough, but it’ll be interesting to see what it holds…
