Experimenting with eBook Creation

If you’re going to go down the self-publishing route, it’s useful to get familiar with the software that’s available.

In preparation for publishing something one day, I’ve been playing around with authoring eBooks (ePub 3.0 format, mainly) through a combination of Scrivener, Calibre and Sigil.

Scrivener, I think, would be great for fiction books or non-fiction with minimal graphics or pictures, mainly because there aren’t many options when it comes to manipulating layout when compiling an ePub; in order to do that, you’d have to drill into the XHTML and CSS and Scrivener doesn’t give you a view into that. It also doesn’t allow you to set any fonts, which sucks if you’re big on typography and have your own ideas on how to make things look.

What do allow you to manipulate the innards of an ePub file (including setting fonts) are Sigil and Calibre. They both have great XHTML editors, although Sigil’s is slightly more powerful than Calibre’s. That said, Calibre’s automated features for format conversions are pretty good; in some cases, its Heuristic Processing functionality does a better job of importing or converting a file that was scanned by OCR than Sigil does through its related plugins.

The way I practiced was by converting some of my old computer programming PDFs into ePub files, as well as converting static HTML files (mostly GNU reference manuals) into ePub files (which are just glorified zipped bundles of HTML files anyway; you can verify this by changing the .epub extension to .zip and unzipping the file).
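If you’d rather not rename anything, here’s a minimal sketch of the same check using Python’s standard zipfile module. The function name list_epub_contents is just something I made up for illustration.

```python
import zipfile


def list_epub_contents(path):
    """Return the names of the files bundled inside an ePub.

    An ePub is a zip archive, so the standard zipfile module can open it
    without renaming the file to .zip first.
    """
    with zipfile.ZipFile(path) as book:
        return book.namelist()
```

Pointing it at any .epub file should list the XHTML, CSS, font and image files inside, plus a few bookkeeping files such as mimetype.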

Manipulating the PDFs was a pain; basically, you export OCR’d text into a Word .docx file (using something like Adobe Acrobat or the Tesseract OCR project), and then import that into either Sigil or Calibre. The Sigil .docx import plugin does its best to create as clean an HTML file as it can, but it does so by stripping out a lot of extraneous MS Word-specific tags, many of which affect formatting. Calibre does a better job of keeping formatting intact, but the HTML it outputs isn’t as clean as it could be (and it still might have artifacts in the formatting regardless).

That said, if there’s fancy text formatting or graphics like tables or figures in that PDF file, Calibre has a better chance of preserving that stuff. With Sigil, a lot of that layout information is lost when importing the .docx file, and you need to use a combination of HTML and CSS to approximate how it looks in the PDF, which is a very manual process.

Either way though, with OCR’d text from a PDF, the result was substandard as there were extraneous newlines everywhere. While the Sigil .docx import plugin tries to account for weird newlines, Calibre’s Heuristic Processing does a better job of unwrapping lines and stripping those away using the default settings (and you can tweak the settings to make things more accurate too, although it isn’t perfect and will still need some massaging afterwards). There is an MS Word plugin called ePUBTools that can do some OCR post-processing to try to recover the original formatting, but it isn’t as good as Calibre (although it doesn’t hurt to run it on the file as a first pass before importing it into Calibre).

However, regardless of what route you go, everything still needs to be eyeballed after conversion to make sure the algorithm wasn’t too aggressive in taking out newlines and that the outputted text looks like the original source material, especially when it comes to the content of paragraphs.
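To give a feel for what “unwrapping” means, here’s a deliberately naive sketch. It is not Calibre’s actual algorithm (which I haven’t looked at); it just assumes that blank lines separate paragraphs and that a line ending in a hyphen followed by a lowercase letter is a word split across lines. The function name unwrap_lines is my own.

```python
def unwrap_lines(text):
    """Join hard-wrapped lines back into paragraphs.

    Assumptions (mine, for illustration only):
      * a blank line marks a paragraph break;
      * a trailing hyphen followed by a lowercase letter on the next
        line is a word that was split by the wrap.
    """
    paragraphs = []
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # Blank line: close off the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if not current:
            current = line
        elif current.endswith("-") and line[0].islower():
            # Re-join a word that was hyphenated across the line break.
            current = current[:-1] + line
        else:
            current = current + " " + line
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

Even this toy version shows why the eyeballing step matters: the hyphen rule will happily glue together a legitimately hyphenated compound that happened to land at the end of a line.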

So, at least for PDFs that are primarily text-based, a decent workflow for converting OCR’d PDFs to ePub files is:

PDF -> Adobe Acrobat (save as .docx) -> MS Word (import .docx via ePUBTools’ Post Process OCR functionality) -> Calibre (convert .docx to ePub with Heuristic Processing enabled) -> Sigil (manually edit ePub file to ensure output is sane).
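I did the Calibre step through the GUI, but Calibre also ships a command-line tool called ebook-convert, so that step could in principle be scripted. Here’s a rough sketch, assuming ebook-convert is on your PATH and using its --enable-heuristics option; the function names and file names are hypothetical, and I haven’t compared the output against what the GUI produces.

```python
import shutil
import subprocess


def build_convert_command(source, target, heuristics=True):
    """Build the argument list for Calibre's ebook-convert tool."""
    cmd = ["ebook-convert", source, target]
    if heuristics:
        # Command-line counterpart of ticking Heuristic Processing.
        cmd.append("--enable-heuristics")
    return cmd


def convert(source, target, heuristics=True):
    """Run the conversion, failing loudly if Calibre isn't installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    subprocess.run(build_convert_command(source, target, heuristics), check=True)
```

Something like convert("manual.docx", "manual.epub") would then stand in for the Calibre step, with the manual clean-up in Sigil still to follow.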

With PDFs that contain a lot of custom graphics or more complicated text layouts, you may have a better time with importing the file straight into Calibre (i.e. without converting it to a .docx file first) and making Calibre convert it to an ePub directly. You might very well end up with an ePub full of images (if that was what the PDF file was made of in the first place), but for a quick-and-dirty result with minimal finessing later, it’ll get the job done.

Converting static HTML files to ePub format was a lot easier, but ensuring that the resulting ePub files passed validation was the most time-consuming part. There is a tool called epubcheck, which is the standard program used to ensure that ePub files adhere to the various ePub standards. Most online bookstores won’t take an ePub file unless it passes validation, which is why it’s important to make sure your files make the cut before uploading.
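This is in no way a substitute for epubcheck, but a couple of the container-level rules are easy to check by hand: the archive’s first entry has to be an uncompressed file named mimetype containing exactly application/epub+zip, and there has to be a META-INF/container.xml. Here’s a small sketch of a pre-check along those lines; the function name quick_epub_precheck is my own invention.

```python
import zipfile


def quick_epub_precheck(path):
    """Return a list of obvious container problems (empty if none found).

    This only looks at the zip container; it says nothing about whether
    the XHTML and CSS inside would pass epubcheck.
    """
    problems = []
    with zipfile.ZipFile(path) as book:
        entries = book.infolist()
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype must be the first entry in the archive")
        else:
            if entries[0].compress_type != zipfile.ZIP_STORED:
                problems.append("mimetype must be stored uncompressed")
            if book.read("mimetype") != b"application/epub+zip":
                problems.append("mimetype must contain exactly application/epub+zip")
        if "META-INF/container.xml" not in book.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems
```

An empty list only means the container looks sane; the real validation still has to come from epubcheck.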

That said, the HTML2Epub Sigil plugin chokes if there are any tables in the HTML file, so if that’s the case (which was the case with the GNU manuals I was playing with), you’re better off importing those to Calibre instead and using that to convert the HTML files into ePubs that you can then use Sigil to put the finishing touches on. Or, I suppose you could just add the HTML file to Sigil directly without running it through a conversion plugin, but that’s something I didn’t try.

In the case of ePub 3.0 (which uses HTML5 and CSS3), replacing deprecated HTML tags with HTML5 compliant ones was what took me the longest, mainly because I’m not a web programmer at all so I had to do a lot of web searching to learn what the new way of doing things with CSS was. The file was still readable by my Kobo if I did nothing, but if I ever wanted to post these on the Kobo or Google Play stores or what not, I’d need to make sure they passed validation first. If I stuck with ePub 2.0, I think I could have gotten away with doing nothing, but this is 2019 and so why publish to a format that’s almost a decade old (other than for compatibility with older eReaders, although to be fair, ePub3s are supposed to be somewhat backwards compatible)?
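As an example of the kind of clean-up involved, here’s a small sketch that swaps two tags that are obsolete in HTML5 (center and tt) for a CSS-styled div and a code element. These two tags are just illustrations, not a record of exactly what I changed in the manuals, and the class name center is something you’d have to define in your own stylesheet. Regular expressions are also a fragile way to edit HTML, so treat this as a starting point rather than a proper tool.

```python
import re

# (pattern, replacement) pairs for two obsolete presentational tags.
REPLACEMENTS = [
    (re.compile(r"<center\b[^>]*>", re.IGNORECASE), '<div class="center">'),
    (re.compile(r"</center\s*>", re.IGNORECASE), "</div>"),
    (re.compile(r"<tt\b[^>]*>", re.IGNORECASE), "<code>"),
    (re.compile(r"</tt\s*>", re.IGNORECASE), "</code>"),
]

# The matching rule to add to the book's stylesheet (hypothetical class name).
CENTER_CSS = ".center { text-align: center; }"


def modernize(html):
    """Replace the obsolete tags above with HTML5-friendly equivalents."""
    for pattern, replacement in REPLACEMENTS:
        html = pattern.sub(replacement, html)
    return html
```

Running modernize over each XHTML file and adding the CENTER_CSS rule to the stylesheet takes care of those two tags; anything else epubcheck complains about needs its own rule or a manual fix in Sigil.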

Anyway, all of the GNU manuals that I experimented with are open source and are allowed to be modified, so I figured I’d post my work in case people wanted ePub copies of these things for offline access later, rather than the PDF copies that the various projects already provide. I even embedded a set of IBM Plex fonts, which really gives them that “these are computer books!” feeling, at least when viewing on an eReader device.

They are:

  • GNU C Library Manual, glibc 2.29 ( epub | mobi )
  • GNU C Reference Manual, v0.2.5 ( epub | mobi )
  • GNU Emacs 26.2 Manual ( epub | mobi )
  • The Org Mode 9.2 Reference Manual ( epub | mobi )

I chose these ones to work with because they were all documents that I had first encountered when I started my Computer Science degree a long time ago. I had either printed them out on the department’s old-school LPR line printers (two pages per sheet, double-sided to save on paper) or bought a printed set (or had wanted to but couldn’t, because they had limited print runs and were always out of stock), mainly for my own reference. Plus, I figured it’d be nice to once again have reference copies for programs and languages that I used to be very proficient in back in the day (not that I anticipate coding in C using Emacs anytime soon; I’m a Vi guy now and I’ve given up on most types of programming these days, lol).

All in all, this was a fun exercise, and if I’m ever bored, I’ll probably attempt to convert more documents to ePubs just so I can have them on my eReader (it really gave me an opportunity to indulge my OCD, which I can really appreciate on days where I’m not feeling productive). However, I don’t think I’ll try to keep pace by creating new versions of the above files whenever upstream updates the documentation; it takes more time than I want to spend to make sure everything looks good (it’s usually a day or two’s worth of work) and things don’t change much between minor versions anyway.

However, I do feel more confident that when I’m ready to publish whatever it is I end up writing (whether that be fiction or nonfiction; at this stage, it’s 50/50 on which concept I ultimately go with), I’ll be able to author and typeset the various book files on my own, rather than relying on a third-party source or having to pay another person to do it (assuming I improve my (non-existent) skills in HTML and CSS, of course). I think that’ll make going the digital self-publishing route much, much easier for me. Print layout for a traditional paperback is a different story though; my friends who have self-published say that sometimes the platform or service you publish on will take care of that stuff for you automatically if you want it to, although my preference would be to publish things directly to the various services myself in order to maximize my earnings. For everything else to do with print layout, I suppose there is Scribus to learn.

As for the Kindle formats (mobi/azw3/KFX), I haven’t forgotten. However, I’ll leave figuring out how to directly author into those file formats for another day (it’s pretty easy using kindlegen from what I’ve read; just need to remember about adding media queries to images first. Or maybe just use Kindle Previewer to convert them? Not sure.). But if you have a Kindle device and wish you could sideload the above reference manuals into it, just use Calibre to convert them to .mobi files first and you should be fine.

Edit:  .mobi versions now available. I just used Kindle Previewer to convert them from the original .epub files, although I have no idea why the file sizes ended up so big compared to the .epub versions. If you have a Kindle device, let me know how they turned out!

(Photo Credit:  mac42 via Flickr/CC)