So I’ve decided my new year’s resolution is to never again explain zipping an EPUB.
While it’s probably unrealistic to expect the question won’t be posed to me by someone somewhere — thinking IDPF forums here — my dream is just to point people to this post and not spend any more time on the subject.
If you’ve mastered the nuances of EPUB archives, warning in advance that this is probably not the read for you.
The whole EPUB as web site in a box thing gets played up enough, where the “box” is the zip container. So if an EPUB is just a zipped set of resources, I think it’s a fair expectation that any computer-using person of any ability should be able to unzip one. Zipping is so ubiquitous it’s built into most modern operating systems, after all. Even if you can’t figure out how to unzip a file with a .epub extension, you just have to change the extension to .zip and away you go. It really is that simple.
While you might not always think about having to unzip an EPUB you’ve generated from (insert EPUB authoring program here), problems do arise. There was a spate of reports in the IDPF forums of dtb:uid values in the NCX not matching the unique identifier in the package document metadata, for example.
But it may not even be an error that forces you to muck around with your generated content. Maybe you noticed some nasty CSS styling you want to go in and tweak. Maybe you crafted your whole EPUB by hand and are now figuring out how to package it. Let’s just agree that there are many reasons why you’ll unpack/pack an EPUB at some point in your digital life and move on.
So why, if you can unzip an EPUB with any zip program, can’t you just zip the contents right back and have a valid EPUB?
That’s the real trick. Although EPUBs are zip files, there are requirements on the order of files and how they are compressed that out-of-the-box zip programs know nothing about and so aren’t going to follow. The truth is that you can re-zip your content using any zip program provided it lets you manually add files and select the compression level for them, but you have to know what you’re doing. It’s those two points, the file order and the compression setting, that cause headaches for neophytes.
My standard answer at this point is to not worry yourself about the nuances of zipping an EPUB and just let epubcheck do the hard work for you. It has a built-in option to zip unpacked content provided the publication passes validation. You just call the program with the -save option on the directory where you unzipped the resources:
java -jar c:\epubcheck\epubcheck.jar c:\path-to-book-content\ -mode "exp" -save
I understand that not everyone likes command line java programs, but getting a properly zipped EPUB file isn’t any harder than adding that one additional option to the end. You’re going to have to master epubcheck at some point if you deal in raw data.
But my goal is completeness of explanation today, so while I’d like to leave off with epubcheck, let’s start digging into why it takes a specialized program like this to generate a valid file. The problems all boil down to one file: the mimetype. This file identifies that the zip file contains an EPUB publication, but to do so it has to be the first, uncompressed file in the container. If it’s not, epubcheck is going to throw errors at you and it’s not always clear whether a reading system will accept the file or not (some don’t care about the mimetype position, some do). Any vendor who expects a cleanly validating EPUB file isn’t going to be so forgiving.
If you want to create your zip container by hand, you’re going to need a real zip program. The Windows “send to” zip menu option isn’t going to cut it, in other words. The following screen grabs are all of WinRAR.
But getting back to business, here are the steps:
- Create an empty zip file:
- Next, add only the mimetype file. When I do this in WinRAR, I get the following screen of options before the file is actually added:
The key is to find the compression method setting and make sure no compression is applied. Zip programs normally apply compression by default, as shrinking the final archive size is one of the key advantages of zipping files. When you find this setting in the program you’re using, make sure that it is set to “store” (i.e., just store the file, don’t compress it). If you’re presented with a number range, you probably need to set compression to 0 to disable it.
- Now that the hard part is done, just add all the rest of the resources to the archive. You should be able to add them in one big batch at this point by selecting them all. You can compress these files to save space.
- And finally, double-check you’ve set the extension to .epub. Epubcheck actually cares about the extension, though technically it shouldn’t.
It’s not a terribly complex process to get a valid EPUB container, but it is a nuisance to do this manually over and over.
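If you’d rather script the steps above than click through them each time, here is a minimal sketch using Python’s standard zipfile module. It is only an illustration under a few assumptions: the function name build_epub is my own invention, the output file is assumed to live outside the source directory, and the mimetype string is written directly rather than copied from disk so no stray newline can sneak in.

```python
import os
import zipfile

MIMETYPE = "application/epub+zip"


def build_epub(source_dir, epub_path):
    """Zip an unpacked EPUB directory with the mimetype entry first and stored.

    Hypothetical helper; assumes epub_path is not inside source_dir.
    """
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype goes in first, with no compression ("store").
        archive.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)

        # Everything else can be added in one batch and compressed.
        for folder, _subdirs, files in os.walk(source_dir):
            for name in sorted(files):
                full_path = os.path.join(folder, name)
                # Zip entry names always use forward slashes.
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                if arcname == "mimetype":
                    continue  # already written above
                archive.write(full_path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Calling build_epub("path-to-book-content", "book.epub") should give you a container with the mimetype in the right place, but it does no validation of the publication itself, so you’ll still want to run the result through epubcheck.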
You should now be able to run epubcheck and not receive an error about the mimetype file not being first or the value not matching the expected string “application/epub+zip”.
If you do get an error, there’s a chance that your zip program has included optional data fields. I’ve not run into this problem in commercial products, but I’ve not tested all of them by any stretch. I know the problem exists because I’ve bumped up against it in zip libraries trying to automate production. If you get an error relating to the mimetype position or value, and you followed the above steps and didn’t introduce any typos, you’re unfortunately on your own to figure out whether this is the problem and how to prevent it. (Again, this is where it pays to use programs like epubcheck and not deal with these potential headaches.)
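One way to check whether optional data fields are the culprit is to read the first local file header of the archive yourself. The sketch below is only a diagnostic aid, and the function name first_entry_header is hypothetical. It relies on the standard zip layout: a fixed 30-byte header whose last two fields are the file name length and the extra field length.

```python
import struct


def first_entry_header(path):
    """Return a few fields from the first local file header of a zip file.

    Hypothetical diagnostic helper; it reads only the first entry.
    """
    with open(path, "rb") as handle:
        header = handle.read(30)  # fixed-size part of a local file header
        if len(header) < 30:
            raise ValueError("file is too short to be a zip archive")
        # Little-endian: signature, version, flags, method, time, date,
        # CRC-32, compressed size, uncompressed size, name length, extra length.
        (signature, _version, _flags, method, _time, _date, _crc,
         _compressed, _size, name_length, extra_length) = struct.unpack(
            "<4s5H3I2H", header)
        name = handle.read(name_length).decode("utf-8", errors="replace")
    return {
        "signature_ok": signature == b"PK\x03\x04",
        "name": name,
        "method": method,            # 0 means stored (no compression)
        "extra_length": extra_length,  # non-zero means optional data is present
    }
```

For a well-formed EPUB you would expect the name to be mimetype, the method to be 0, and the extra length to be 0. A non-zero extra length pushes the mimetype value away from its expected position, which would explain an error even when the file order is right.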
Now that the process is clear, let me tackle a few why’s…
The common first why is why does EPUB have this annoying requirement that makes zipping a pain? While you might not care what the first file is, having the mimetype first provides an easily inspectable clue that the file contains an EPUB publication. Any program can look 30 bytes into the zip file and find the mimetype file name with the value “application/epub+zip” starting at the 38 byte mark. If that sounds like a lot of technical mumbo jumbo, at a certain level it is. It’s magic number stuff (like the fact your zip file always starts with ‘PK’).
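To make the magic number idea concrete, here is a rough sketch of the kind of sniffing a program could do. The function name looks_like_epub is made up for illustration, and the fixed offsets assume the first entry carries no optional extra field.

```python
def looks_like_epub(path):
    """Sniff the first 58 bytes of a file for the EPUB signature.

    Hypothetical helper; assumes no extra field in the first local header.
    """
    with open(path, "rb") as handle:
        head = handle.read(58)
    return (
        head[0:2] == b"PK"                            # every zip starts this way
        and head[30:38] == b"mimetype"                # file name, 30 bytes in
        and head[38:58] == b"application/epub+zip"    # value, at the 38 byte mark
    )
```

Note that this never opens the zip as a zip. It only reads a handful of bytes from the front of the file, which is exactly why the position and lack of compression matter.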
You can argue that it’s not going to make or break the format to find the mimetype at a random location. If I tried to argue otherwise, I’d likely just paint myself into some corner I couldn’t get out of. Having the convention is predictable and spares reading system developers the nuisance of inspecting all the files in the container, so maybe just accept it for that reason. The mimetype is not even an EPUB requirement, per se, but comes from the ODF specification that underpins the container format. So you can blame the strictness on legacy, and sometimes legacy is hard to kick no matter how much better things could be.
That admission sometimes leads to the question of why have a mimetype file at all? While flexibility of position is theoretically possible, dropping the file is not, at least not without a radical departure from the container in the future. The reason it’s unlikely the mimetype will be dropped entirely is that it’s the only thing that tells you that you have an EPUB. Without it, you couldn’t tell a zipped EPUB publication from an iBooks Author publication from any other format that might try to use the package file for its own purposes (e.g., OpenEFT). The package document only identifies a version number, and the container.xml file in the META-INF directory only tells you where to find a package document. Only the mimetype and extension identify the format, and extensions are unreliable things.
The other why people sometimes ask is why does my invalid zip container look exactly like a valid one in a zip program, even if I don’t add the mimetype file first? The reason is that when you look at the contents of a zip file in a program like WinRAR, it presents them like a traditional directory on the operating system, not according to the order in which the files were added. So whether you add the mimetype first or not, the view you get from the zip program will show directories followed by files. In other words, looks are deceiving in this case.
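If you want to see past the tidy directory view, you can list the entries in the order the archive records them. The following is a small sketch using Python’s zipfile module, with a function name (entries_in_archive_order) that is purely my own. One caveat: zipfile reports the order of the archive’s central directory, which normally matches the order the files were added but isn’t strictly guaranteed to.

```python
import zipfile


def entries_in_archive_order(path):
    """List (name, is_stored) pairs in the order the zip records its entries.

    Hypothetical helper; the order comes from the central directory.
    """
    with zipfile.ZipFile(path) as archive:
        return [
            (info.filename, info.compress_type == zipfile.ZIP_STORED)
            for info in archive.infolist()
        ]
```

In a valid EPUB the first pair should be the mimetype, flagged as stored. If it turns up anywhere else in the list, you’ve found your problem, no matter how the zip program displays the contents.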
But that honestly has to be more than you ever wanted to read about zipping EPUB files, and even more than I expected to write, so let’s wrap this up and call it a day.
I’ll trail off on this topic by pointing to a couple of other weapons at your disposal for dealing with EPUB containers.
If you hate dealing with zipping and unzipping content altogether, I can’t recommend Oxygen enough. It’s not free, but is a case where you do get what you pay for, especially if you live in a bigger data world than just EPUB. You can open the container in the program, modify the files inside it, and revalidate with epubcheck all in one place. And unlike other programs, Oxygen won’t do any unwanted funky things to your data, like run tidy on it.
For the Mac crowd, there are also some free apple zip scripts to do the zipping and unzipping if epubcheck isn’t to your liking and you don’t want to go the completely manual route.
You can search on “epub zip” to find other programs, I’m sure.
Try to enjoy your new year if you’ve endured this far!