Looking to digitize my book library, but worried about corrupted PDFs
November 14, 2015 9:50 PM

I have a rather extensive (read: physically burdensome) library of books I've collected over the years that I'd like to digitize to take with me as I move on to my next adventure. I'm wondering what the best (most efficient, reliable, and economical) way is to digitally preserve a large library of books as PDFs. I'm especially concerned about preventing files from becoming corrupted over time.

Even if it isn't really possible to guarantee against file corruption, I'd like to find some way to detect corruption quickly and automatically so that I can repair it before my backup service deletes old backups with copies of the non-corrupted version of the file.

I've been using a Fujitsu ScanSnap to process the pages of books once the spines have been removed, and have been saving the resultant files to a Seagate Backup Plus Slim 2TB external hard drive. (I have a 2010 Macbook Pro running El Capitan.)

Everything's gone mostly well. But the other day, after running Repair Disk Permissions in Disk Utility on my Mac, I noticed that a folder called "DamagedFiles" was created on the Seagate. I looked into the folder and, to my horror, there were about 40 aliases to files--all PDFs--that had been corrupted. I do use a backup service (Backblaze), so I downloaded the oldest copy of the Seagate that I could restore. Even though the copy was made before I'd run Disk Utility, the same handful of PDFs (the ones later linked by the aliases made by Disk Utility) were corrupted--many beyond repair.

I'm at a loss because it seems like the files became corrupted simply with the passing of time, as I was positive I'd accessed them successfully before, and I had only run Repair Disk Permissions on the drive that one time.

I always thought that the overall best way to construct my archive would be to save all of these large PDF files to an external drive and back up the drive using Backblaze. But now I'm worried that by the time I find some instance of a corrupted file, it'll be far too late for me to access the non-corrupt version of the file.

Does anybody have any advice, potential explanations, and/or novel solutions? Is the hard drive the culprit, or is it simply bad luck? Will an online cloud storage service (something like Dropbox, etc.) be my best bet if I'm paranoid about things going awry with an external HD? Any and all advice appreciated.
posted by scriptible to Technology (13 answers total) 10 users marked this as a favorite
A RAID might be a better option for you. If you require small drives, I have a friend that uses one of these and is pretty happy with it. If physical space isn't as much of an issue, the larger synologies might be an option as well, and allow remote access.
posted by el io at 10:10 PM on November 14, 2015

There is a free file recovery system called "par2" (parity archives) for just this scenario. The system builds recovery files which can be used to detect file corruption and repair the files. The best implementation is called QuickPar. I run it on my Mac using Wine. Works great.
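To give a feel for how recovery files can repair data, here is a toy sketch in Python. It uses plain XOR parity, which is a deliberate simplification and not what par2 actually does (par2 uses Reed-Solomon codes and can survive more than one lost block). The function names and the sample blocks are made up for illustration.

```python
from functools import reduce


def xor_parity(blocks: list[bytes]) -> bytes:
    """Return the byte-wise XOR of a list of equally sized blocks."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("need at least one block, all of the same length")
    # zip(*blocks) walks the blocks column by column (byte position by byte position).
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild_block(surviving: list[bytes], parity: bytes) -> bytes:
    """Recover the single missing block from the surviving blocks plus the parity block."""
    return xor_parity(surviving + [parity])


# Three hypothetical data blocks and the parity block computed from them.
blocks = [b"page", b"text", b"scan"]
parity = xor_parity(blocks)

# Pretend the middle block was lost; rebuild it from the other two plus parity.
recovered = rebuild_block([blocks[0], blocks[2]], parity)
print(recovered == blocks[1])
```

The point is only that a small amount of extra data, stored alongside the originals, is enough to rebuild a damaged piece. Real tools also store checksums so they can tell which piece is the damaged one.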

posted by hh1000 at 10:38 PM on November 14, 2015 [1 favorite]

I'm at a loss because it seems like the files became corrupted simply with the passing of time

Far more likely scenario is that the files became corrupted due to one or more episodes of unsafe disconnection of an external hard drive, that the corruption was then propagated to your Backblaze backup, and that all of this happened long enough ago that the thirty day previous version feature didn't help.
posted by flabdablet at 10:52 PM on November 14, 2015 [3 favorites]

What if you bought a dedicated external hard drive just for the books? You could back up the external hard drive to Backblaze once, and then if you don't add anything else to the drive, you wouldn't have to do further backups, which I assume would help minimize the risk of corruption. If one of the factors in the corruption is unsafe disconnection as flabdablet posits, this would probably help minimize that risk as well, since you wouldn't be using the drive for regular backups.

Another option might be saving the books in multiple formats. For example, you could also convert the PDFs to an ebook format. I use calibre to do this; the software is free. That way, if one format gets corrupted, maybe the other one wouldn't.

In general, when it comes to preserving digital files, it seems like the more varied your backup methods and the more copies of the original files you have, the better, at least if we're talking about static files. (This gets more complicated if it's files you're constantly accessing and changing, because then you have to keep track of everything.)

Also, I've personally had really good experiences with Western Digital hard drives, on the off chance this is related to a problem with the Seagate.
posted by litera scripta manet at 11:07 PM on November 14, 2015

I'm hoping this isn't a derail, but my advice is to think carefully about the value and accessibility of these books. Are they easily available from your local library or relatively cheaply from your local bookstore? Then I wouldn't bother to scan. The time and effort to digitize all those books might be more than the effort you'd expend later on looking for a copy. Books aren't necessarily precious objects.
posted by bluedaisy at 1:04 AM on November 15, 2015 [6 favorites]

there are two issues here.

first, how did these files get damaged? while files can degrade over time, what you describe is crazy bad. either you have a bad disk (which you can check - and probably did - using disk utility), or flabdablet is on the money and this is related to disconnecting the USB disk incorrectly, or there's some other weird issue (seems unlikely).

second, how can you manage things better in the future?

backups: you're doing backups, which is good, but you need a backup that can let you get the good versions (and it's not clear to me that you have done that). so you need to find out how to do that in backblaze, or find an alternative that you can use (you want something that lets you see when a file changed and retrieve the version before the change).

parity checks: you can avoid some kinds of corruption by using parity checks, using either RAID or other software (as described above). but these are generally designed to protect from rare disk errors. i am not sure they will protect from corruption from bad unplugging.

better hardware: you can avoid the usb unplugging scenario by using network storage, rather than USB. NAS (network attached storage) devices often use RAID, so you may kill two birds with one stone, but you must check that it is not "RAID 0" (which stripes data across disks with no redundancy at all).

so if this is important to you, and you have the money, i'd try finding a NAS that uses RAID (1 or 10 is best, but RAID 5 is ok). if you don't have the money, take much more care with unplugging the USB drive.
posted by andrewcooke at 3:46 AM on November 15, 2015 [2 favorites]

Other people have data integrity suggestions down but I thought I'd just point out that while you still have a physical copy of the book handy you can open it up and plug a random sentence or phrase into a search engine with quotes wrapped around it and specify PDF format like so:
"An alien mastermind bursts into Louis Wu's life" filetype:pdf
...and you may be able to find PDFs other people have scanned. It doesn't work perfectly because of line breaks and OCR errors, so you might need to make a few different tries or look at non-PDF sources. You can also try this if all you've got is a fragment of a file, of course.
posted by XMLicious at 4:29 AM on November 15, 2015 [4 favorites]

Some very good replies above as usual, just adding my 2 cents:

1. Silent corruption of files on disk is very easy to mitigate but I'd say your worst threat is simply your HDD dying and losing everything at once. The solution is more than one (integrity protected) backup copy.

2. You may have already looked into this, but the most efficient way of getting a digital library is downloading whatever you can from Gutenberg / P2P sites.

3. Use parity! hh1000 suggests QuickPar which is certainly fine enough, but there's something even better: MultiPar (just don't use the experimental par3 format, stick with par2). Windows-only, but also runs perfectly on wine (watch out for filenames). There's also a CLI option if you prefer.

Whenever you scan a book, immediately create a par2 file to protect it (or collect a bunch of books in a folder and protect them all at once). Note that MultiPar and par2cmdline both support recursive directory scanning, but the same isn't true of QuickPar and older par clients, so I recommend you avoid using the recursive option for compatibility.
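If you go the CLI route, the "protect each book right after scanning" step could be scripted along these lines. This is a rough sketch that assumes par2cmdline is installed and on your PATH as `par2`; the 10% redundancy level, the function names, and the example path are my own placeholders, not anything prescribed by the tools.

```python
import shutil
import subprocess
from pathlib import Path


def par2_create_command(pdf: Path, redundancy_percent: int = 10) -> list[str]:
    """Build a par2cmdline 'create' command for a single scanned book."""
    # Recovery set is named after the PDF, e.g. book.pdf -> book.pdf.par2
    recovery_set = pdf.with_name(pdf.name + ".par2")
    return ["par2", "create", f"-r{redundancy_percent}", str(recovery_set), str(pdf)]


def protect(pdf: Path, redundancy_percent: int = 10) -> None:
    """Create recovery files next to the PDF; raises if par2 is missing or fails."""
    if shutil.which("par2") is None:
        raise RuntimeError("par2 not found on PATH (assumed: par2cmdline is installed)")
    subprocess.run(par2_create_command(pdf, redundancy_percent), check=True)


# Show the command that would be run for a hypothetical file, without running it.
print(par2_create_command(Path("scans/book.pdf")))
```

Later, `par2 verify book.pdf.par2` checks the file against its recovery set and `par2 repair book.pdf.par2` attempts a fix. Working one file at a time also sidesteps the recursive-option compatibility issue.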
posted by Bangaioh at 4:42 AM on November 15, 2015

parchives/par2/multipar are just doing what RAID does, but in software. If you are concerned about malicious corruption, par2 uses a weak hashing algorithm that's very easily tampered with. If these books are privately valuable to you, a local network storage unit plus multiple backups and regular testing (maybe with majority voting) would do it. Likely expensive and time-consuming.
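For what "regular testing with majority voting" might look like in practice, here is a minimal Python sketch. It assumes you keep the same file in three or more places and can read all of them at once. SHA-256 is my choice here, and the function names are hypothetical.

```python
import hashlib
from collections import Counter
from pathlib import Path
from typing import Optional


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large scans don't have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def majority_copy(copies: list[Path]) -> tuple[Optional[Path], list[Path]]:
    """Return (one copy matching the majority digest, the copies that disagree).

    If no digest is shared by more than half of the copies, there is no
    trustworthy winner, so return (None, all copies).
    """
    digests = [(path, sha256_of(path)) for path in copies]
    if not digests:
        return None, []
    winner, votes = Counter(d for _, d in digests).most_common(1)[0]
    if votes * 2 <= len(digests):
        return None, list(copies)
    good = next(path for path, d in digests if d == winner)
    suspects = [path for path, d in digests if d != winner]
    return good, suspects
```

With only two copies a disagreement can't be resolved by voting, which is one argument for keeping at least three.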

If these books are publicly valuable (and you're okay with this), torrent 'em/embed 'em in the blockchain/code them in the Linux kernel source (okay, maybe not that last one). Those methods externalize the effort needed to keep them intact.
posted by scruss at 6:23 AM on November 15, 2015

I have an app called Shelfie that tracks down digital copies of books you own based on photos of your book shelf. The majority of the books it finds are available for sale at a good discount over standard ebook prices but it will also link you to public domain copies of older texts. Paying for cheap e-copies may in fact be more economical than dismantling and scanning the books in your library (I know people who have been involved in digitization projects and it was incredibly hard on the body after a while).
posted by bibliotropic at 9:12 AM on November 15, 2015

You said any and all advice, so YMMV but for important but large files that don't change much I use different backup frequencies. I occasionally copy (NOT mirror) files from one computer to another (Windows tower, Windows laptop, Linux file server) plus a couple different external drives. You can use options that won't overwrite so you're only adding new files to your backup locations.
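A minimal Python sketch of the copy-but-never-overwrite idea, assuming the archive and the backup location are both ordinary mounted folders (the function name is hypothetical):

```python
import shutil
from pathlib import Path


def copy_new_files(source: Path, destination: Path) -> list[Path]:
    """Copy files missing from destination; never overwrite or delete anything."""
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        target = destination / src.relative_to(source)
        if target.exists():
            # Leave the existing backup alone, even if the source has changed.
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

Because existing files are skipped rather than replaced, a file that goes bad on the source can't clobber the good copy in the backup. The flip side is that legitimate edits don't propagate either, which is fine for scans that never change.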

You will find that the above tends to cause bloat, so you need some idea of whether the hard drive space you can dedicate to this process is compatible with the space each of those machines needs for its own use. Also, a couple of times a year I copy (same scheme) to a couple of Western Digital My Passport Ultra 1TB drives and rotate them, one in a safe deposit box and one at home. Those drives are less expensive than they sound, are USB3, and seem to be quality devices. From my own bad experiences I can testify there are a lot of crap external drives out there.

My large file directories are too big for the free levels of Google Drive, Copy and Dropbox. Yours may be smaller. Or you may have fewer computers than me, or might not have a safe deposit box. So these are just ideas.
posted by forthright at 7:56 PM on November 15, 2015

You may also look into Amazon Cloud Drive or Amazon Glacier as a network backup solution. Both are pretty reasonably priced. ACD is $60/year for "unlimited" storage. Glacier costs $0.007 per GB per month, which could be more or less expensive depending on data volumes.
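To make "depending on data volumes" concrete, here is the arithmetic as a small Python sketch. It uses only the two prices quoted above and counts storage alone; Glacier's retrieval and request charges are left out, so treat the result as a rough lower bound for Glacier. The sample sizes are arbitrary.

```python
ACD_PER_YEAR = 60.00          # USD, flat rate quoted above
GLACIER_PER_GB_MONTH = 0.007  # USD per GB per month, storage only


def glacier_cost_per_year(gigabytes: float) -> float:
    """Yearly Glacier storage cost for a given archive size."""
    return gigabytes * GLACIER_PER_GB_MONTH * 12


# Archive size at which the two options cost the same per year.
break_even_gb = ACD_PER_YEAR / (GLACIER_PER_GB_MONTH * 12)

for size_gb in (100, 500, 2000):
    print(f"{size_gb} GB: Glacier ${glacier_cost_per_year(size_gb):.2f}/yr "
          f"vs ACD ${ACD_PER_YEAR:.2f}/yr")
print(f"break-even at about {break_even_gb:.0f} GB")
```

At these rates the break-even is roughly 714 GB: below that Glacier is cheaper on storage alone, and a full 2TB drive would run about $168 a year.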
posted by chazlarson at 8:32 AM on November 17, 2015

Does anybody have any advice

Ideally I think you want the filesystem doing this, but you can also manually do checksums like md5sum to verify that files haven't changed, and then diff old_list_of_checksums new_list_of_checksum to see if files have changed. And of course this is scriptible and something people have worked on and written up.
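Here is a minimal Python sketch of that checksum-and-diff routine, for anyone who would rather not glue shell commands together. It swaps in SHA-256 for md5sum, which is my choice rather than anything required, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each PDF's path (relative to root) to its checksum."""
    return {
        pdf.relative_to(root).as_posix(): file_digest(pdf)
        for pdf in sorted(root.rglob("*.pdf"))
        if pdf.is_file()
    }


def compare_manifests(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Report files whose checksum changed, disappeared, or newly appeared."""
    return {
        "changed": sorted(n for n in old.keys() & new.keys() if old[n] != new[n]),
        "missing": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }
```

The old manifest has to be saved somewhere between runs (for example with `json.dump`), ideally not only on the drive being checked. Since scanned books shouldn't change after they are written, anything listed under "changed" is a candidate for restoring from backup while a good version still exists.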

md5deep looks interesting as well.

From a bit of poking around, it looks like unison is still going strong.

I'm not much of a techie right now, but I think I'd go with a combination of a home NAS synced to the cloud like Dropbox or something, making sure my setup and operating system were doing incremental backups.
posted by sebastienbailard at 10:49 PM on November 17, 2015
