Why do teh ebooks bad so look?
October 11, 2011 12:00 PM

Why are pirated/cracked ebooks commonly of such low quality? This is for an academic paper.

I have to admit to downloading the odd pirated book for my Kindle. But every one has had errors in its text to some degree. Minor errors include missing punctuation or missing line breaks (between lines of dialogue, for example). Larger mistakes include whole chapters in italics, or chunks of text appearing in the wrong place (sometimes ending up several pages down the road).

I'm basically looking for the reasons why ebooks that are cracked end up with these kinds of problems. If there is an online resource for cracking books that discusses these issues, that would be fantastic, since I would be able to cite it (this is academia, so there's no fear of prosecution).

posted by hiteleven to Technology (15 answers total) 3 users marked this as a favorite
I think ebook conversion software is pretty good, but the source files vary widely and have lots of quirks (like you say, line breaks, weird typographical symbols, running heads and feet inserted in nonstandard ways, etc.) that can cause issues in conversion. There are millions of ways to typeset something badly, and those errors can then wreak even more havoc during a file conversion process. Are you looking mostly at .MOBI files? I have to say, my experience does not match yours: many pirated ebooks I see are of high quality.
posted by mattbucher at 12:05 PM on October 11, 2011

Oh and if you want to get an idea about some of the things that need to be removed for a clean ebook conversion, I'd check this out (especially the stuff about heuristic processing): http://manual.calibre-ebook.com/conversion.html
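To give a rough idea of what a conversion heuristic has to guess at, here is a minimal Python sketch of line unwrapping. It is not calibre's actual implementation; the function name `unwrap_lines` and the `min_fraction` threshold are assumptions made up for illustration. The rule is: a line ends a paragraph only if it ends in sentence punctuation and is clearly shorter than the widest line.

```python
def unwrap_lines(text, min_fraction=0.6):
    """Join hard-wrapped lines into paragraphs using a simple length heuristic."""
    lines = text.split("\n")
    lengths = [len(line.strip()) for line in lines if line.strip()]
    if not lengths:
        return text
    full_width = max(lengths)  # treat the widest line as the full line width

    paragraphs = []
    current = ""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        current = f"{current} {stripped}" if current else stripped
        ends_sentence = stripped.endswith((".", "!", "?", '"'))
        is_short = len(stripped) < min_fraction * full_width
        if ends_sentence and is_short:
            # Short line ending in punctuation: assume the paragraph stops here.
            paragraphs.append(current)
            current = ""
    if current:
        paragraphs.append(current)
    return "\n\n".join(paragraphs)
```

Note the failure mode: a sentence that happens to end right at the margin is not "short", so the next line (a new line of dialogue, say) gets glued onto it. That is one plausible way paragraph breaks between lines of dialogue go missing.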
posted by mattbucher at 12:07 PM on October 11, 2011 [1 favorite]

Before eBooks became widely available, book piracy was basically people with scanners and OCR software doing this as a hobby. Many of the books you're getting probably aren't "cracked" copies of commercially released eBooks, but much older. Now, the hobby isn't "get the best copy out there," it's "get as much stuff out there as fast as possible." Between the OCR software and an automated spellcheck, no one involved in this is going to read the book and tick off problems before throwing it up on the trackers and IRC. It's not like making sure that an AVI or MP3 file is okay for distribution and even those are imperfect.

Now, considering that books edited by people who get paid for it still come out with typos and other errors, that's your reason. Shitty eBook editing is a pretty big deal right now; Neal Stephenson had his latest pulled from Amazon because the formatting was godawful.
posted by griphus at 12:07 PM on October 11, 2011 [5 favorites]

I'm assuming that most books are scanned and 'read' by software, and not checked by human eyes before being disseminated. After all, why WOULD anyone bother to edit them? They're giving them away for free! I think that, as time goes on and more officially produced ebooks emerge (and are then pirated), the overall quality will go up.
posted by showbiz_liz at 12:09 PM on October 11, 2011 [1 favorite]

I should emphasize that I am not complaining about the quality of pirated books, nor am I going to wade into this issue in any political sense. The essay will be about linking the historic practice of regular book piracy with ebook piracy. Thanks again!
posted by hiteleven at 12:11 PM on October 11, 2011

Do you need technical reasons? As in, what's going on in the software that this happens? And by "regular book piracy" do you mean the old Scan-and-OCR fashion or some sort of physical book bootlegging?

The question is a little difficult to answer outside of "people aren't paying attention."
posted by griphus at 12:14 PM on October 11, 2011

Yes, I'm looking for technical reasons, or at least as technical as possible (the scan/OCR angle, for example, is very promising). By "regular" book piracy, I mean the old practice of running a renegade printing press and cranking out unlicensed copies of bestsellers...something that used to be a big deal.
posted by hiteleven at 12:19 PM on October 11, 2011

This essay can help you out. It's not great, but there's quite a few leads in the bibliography. Some relevant terms to google are "bookwarez IRC" and "bookz IRC"
posted by griphus at 12:23 PM on October 11, 2011

FWIW: I've noticed many of these same problems with books I've bought off Amazon, the legit digital versions of novels. Especially if they're older books.
posted by royalsong at 12:25 PM on October 11, 2011

Are you sure that the cracked books are of worse quality than the legitimate ebooks?

I say this because my legitimate ebooks are often HORRENDOUS. Blatant, blatant typographical errors on every page. If you want an example, try the Nook version of "Why Zebras Don't Get Ulcers".

The actual relevant question (one I have pondered at length myself) may be why publishers feel it is acceptable to deliver such a dreadful product.
posted by endless_forms at 12:27 PM on October 11, 2011 [4 favorites]

I volunteer with Distributed Proofreaders, which converts books in the public domain into .txt, .html, and .ePub formats. My feeling is that both crackers and most traditional publishers sacrifice quality for speed - it can take DP several months to a year to produce even a simple eText, and I have personally worked on books that took several years due to complicated texts and lack of interest from volunteers.

Here are the general steps to produce a quality etext from a physical book (I'll note where hastily-prepared etexts can go wrong in italics):

(1) Scan the book page-by-page, using the correct settings for each book. Scan quality can make or break OCR. If someone is working fast and can sacrifice the book, they will slit the binding and auto-feed the pages, which means they might be out of order.

(2) Do some pre-processing on the images (crop to the size of the text, straighten the page, make the page square instead of distorted at the spine) - each of these improves the quality of the OCR. People working fast would probably skip this step entirely.

(3a) Use OCR software to convert images to text. OK, this is a step that nearly everyone can get right. Even so, OCR software tends to miss a lot of stuff: "scannos" like "be/he" or "and/arid" slip through, it doesn't know if paragraphs are in the wrong place or interrupted by images, it has to guess at paragraph breaks, etc.

(3b) OCR software can also catch and insert formatting like bold or italics, although it has a high error rate.

(4) Use automated tools as a "first pass" to catch common OCR errors like missing punctuation (DP had to develop these tools on their own - I wouldn't be surprised if faster ebook producers don't use them).

(5) Thoroughly proofread and format the whole book. DP strives for a pretty high quality, which involves at least 4 pairs of eyes looking at every page for either proofreading errors or formatting like italics, bold, illustrations, tables, etc. Publishers might do one pass over their ebooks, but that is not sufficient to catch every error. Crackers will probably skip this step or do only a cursory pass.

(6) Post-process the book, which among other DP-specific tasks includes formatting the text for the different editions. This is probably where formatting mistakes such as entire paragraphs of italics would be introduced by hasty publishers or crackers.
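As a rough illustration of the kind of "first pass" check described in step 4, here is a minimal Python sketch. It is not DP's actual tooling: the `SUSPECT_WORDS` list and the function name `flag_scannos` are hypothetical, and a real list would be far longer. It only flags things for a human to look at; it does not fix anything.

```python
import re

# Hypothetical sample list: strings OCR commonly produces in place of another word.
SUSPECT_WORDS = {"arid": "and", "tbe": "the", "tlie": "the"}

# Characters a finished paragraph would normally end with.
TERMINAL = (".", "!", "?", '"', "'", ":", ")")

def flag_scannos(paragraphs):
    """Return (paragraph_index, message) pairs for a proofreader to review."""
    flags = []
    for i, para in enumerate(paragraphs):
        text = para.strip()
        if not text:
            continue
        # Words that are valid English but frequently an OCR misreading.
        for word in re.findall(r"[A-Za-z]+", text):
            likely = SUSPECT_WORDS.get(word.lower())
            if likely:
                flags.append((i, f"suspect word '{word}' (maybe '{likely}')"))
        # A 0 or 1 inside a word usually means a misread 'o' or 'l'.
        for match in re.finditer(r"\b[A-Za-z]+[01][A-Za-z]+\b", text):
            flags.append((i, f"digit inside word '{match.group()}'"))
        # Missing end punctuation often means a dropped mark or a bad paragraph split.
        if not text.endswith(TERMINAL):
            flags.append((i, "paragraph does not end in punctuation"))
    return flags
```

A word list like this cannot catch a pair such as "be/he", because both are common words that are valid in most positions. Those still need a human reading in context, which is why step 5 matters.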

For publishers, the "best way" to get an ebook version of a text would be from the digital master copy, which has already been proofread and formatted and would only need to go through Step 6. But this is rarely an option for piraters who are working from a physical text.
posted by muddgirl at 12:48 PM on October 11, 2011 [15 favorites]

I once uploaded an out-of-copyright (just) Robert E. Howard story to Wikibooks as they didn't have a scan yet, just the Gutenberg copy, which didn't have any version information and had conflicts with the text in my hardcopy (at least two versions of the text are known to exist).

Anyway, I managed to get 2 pages out of order, despite the fact I didn't unbind the book, and even using the Gutenberg edition as a base the text was still pretty messy. You should have seen the straight OCR of it! Stupid errors all over the place, letters merged, ligatures turned into random things, line breaks and hyphens screwing things up, the software meant to undo line-break hyphens removing real hyphens, odd characters like em dashes turning into other characters. I'm amazed pirated books are not worse than they are.
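Two of those problems are easy to show in a few lines of Python. This is only a sketch under stated assumptions, not what any particular OCR package does: the function names and the `known_words` set are made up for illustration, and a real tool would use a full dictionary.

```python
import re
import unicodedata

def normalize_ligatures(text):
    # NFKC folds compatibility characters, such as the single-character
    # "fi" ligature (U+FB01), back into plain letters.
    return unicodedata.normalize("NFKC", text)

def naive_dehyphenate(text):
    # Removes every hyphen that sits at a line end, real or not.
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def careful_dehyphenate(text, known_words):
    # Only drop the hyphen when the joined result is a known word;
    # otherwise keep the hyphen and just remove the line break.
    def join(match):
        left, right = match.group(1), match.group(2)
        merged = left + right
        if merged.lower() in known_words:
            return merged
        return f"{left}-{right}"
    return re.sub(r"(\w+)-\n(\w+)", join, text)
```

With the input "a well-\nknown writer of adven-\nture stories", the naive version produces "wellknown", while the careful version with `known_words={"adventure"}` keeps "well-known" and still repairs "adventure". Even the careful version guesses wrong when both spellings are real words ("re-creation" versus "recreation"), and NFKC also rewrites other characters (for example, it turns a single-character ellipsis into three periods), so neither step is safe to run blind.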

Also, quality might go up as publishers' masters get leaked. I know Wizards of the Coast had a problem with the 4e masters they sent to the publisher getting leaked. It was quite obvious from what I hear, as all the colour codes and automated printing instructions and such were visible in the margins.
posted by Canageek at 3:30 PM on October 11, 2011

I work in epublishing, so I have some experience of what it takes to go from a printed copy to an ebook. We have to do it more often than I'd like, and getting a well-proofed book out of it is labor-intensive. Depending on the typeface that the book is set in, OCR software can do anything from a miraculous job to a brutal one. Several passes of thorough proofreading are time-consuming (and time is money), and concerning yourself with little details like scene breaks here and smallcaps there and ensuring that quotes are all curling the right way, your em-dashes and ellipses are correct… that's all about caring. If your goal is to pack the collected works of Dick Francis into a RAR and torrent it far and wide, you probably have different priorities.

The actual relevant question (one I have pondered at length myself) may be why publishers feel it is acceptable to deliver such a dreadful product.

The dirty little secret reason that traditional publishers find this acceptable is that technically it keeps the work in "print", so that contractually, the rights don't revert to the author. As long as they're selling a Kindle edition, it hasn't gone out of "print", and consequently the publisher retains the rights (and pays the author a pretty dismal royalty). So, from their perspective, for a lot of works, having any digital edition is better than none.
posted by mumkin at 10:04 PM on October 11, 2011

This is from early 2010, an eternity ago when it comes to the ebook world, but it indicates that at least at that time, most pirated ebooks originated from pre-publication manuscripts. That means they're not final. Possibly even not anywhere close to final, depending. Publishers make a lot of changes to books during and after manuscript stage and before printing, most of which involve setting formatting and fixing mistakes.

So, even though there are a lot of errors that can be introduced in the process of making a book an ebook, many of the errors you see in a pirated copy actually may be from the fact that it's a manuscript that hasn't even been made into a book at all yet.

Also, when I try to put an actual legit manuscript on my Kindle to read, it's still tricky to get it to look right, even though that should in theory be much easier than either cracking a DRM'd ebook or scanning in a hard copy (presumably the reason pirated copies tend to be manuscripts).
posted by lampoil at 9:49 AM on October 12, 2011

Yeah, I initially speculated that book piraters might work from manuscripts or Advance Review Copies (on the theory that they will want to get the book out as early as possible), but I'm not really involved with that side of it so it was just a guess.
posted by muddgirl at 11:59 AM on October 12, 2011
