PDF Filter: How can I split a 100 page PDF document, where each PDF page has two book pages, into a 200 page PDF document?
May 16, 2010 7:28 PM

I have many scans of books in PDF format. The problem is that each PDF page has two side-by-side book pages and I want to optimize them for reading on a kindle. Because the scans are pretty regular, I am hoping it would be easy to automate the slicing of one page into two and then recompiling the halfpages into a single document. Does anyone know of any open source programs to do this? Perhaps a python library? A feature of Adobe Acrobat?
posted by refractal to Technology (11 answers total) 5 users marked this as a favorite
 
If the scans are of decent quality, you could have them processed by OCR software (Microsoft Office Document Imaging comes with the Office suite). You can then simply copy and paste the text into a word processor. With just a bit of editing and formatting you'll have nice readable files.

It's really not too much work. (Although doing it for a long book could be daunting, you could more than likely set up a few macros to smooth things along.)
posted by oddman at 8:00 PM on May 16, 2010


In Acrobat Pro, you could create two copies of your scanned book. In one of the files you would crop all pages on the right, and then in the other you could crop the pages on the left. Then you could merge the two files. But reordering your even and odd pages may be a bit of a slog.
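
For anyone without Acrobat, the same "two copies" idea can be sketched in Python. This is only a rough sketch under a few assumptions: it relies on the third-party pypdf package (not mentioned elsewhere in this thread), it assumes every page is an unrotated spread split exactly down the middle, and the function names are made up for illustration. Writing the halves out in alternating order avoids the reordering step entirely.

```python
def split_box(left, bottom, right, top):
    """Return the (left_half, right_half) boxes of a page as (x0, y0, x1, y1)."""
    mid = (left + right) / 2
    return (left, bottom, mid, top), (mid, bottom, right, top)


def split_spreads(src_path, dst_path):
    """Write a PDF with two pages (left half, right half) per input page."""
    # pypdf is a third-party package (assumed installed: pip install pypdf).
    from pypdf import PdfReader, PdfWriter
    from pypdf.generic import RectangleObject

    # Open the file twice, mirroring the "two copies" approach: one copy
    # supplies the left halves, the other supplies the right halves.
    lefts = PdfReader(src_path)
    rights = PdfReader(src_path)
    writer = PdfWriter()

    for left_page, right_page in zip(lefts.pages, rights.pages):
        box = left_page.mediabox
        left_box, right_box = split_box(
            float(box.left), float(box.bottom), float(box.right), float(box.top)
        )
        # Set both boxes so a pre-existing crop box does not hide the result.
        left_page.mediabox = RectangleObject(left_box)
        left_page.cropbox = RectangleObject(left_box)
        right_page.mediabox = RectangleObject(right_box)
        right_page.cropbox = RectangleObject(right_box)
        # Alternate left and right so the output is already in reading order.
        writer.add_page(left_page)
        writer.add_page(right_page)

    writer.write(dst_path)
```

Calling something like `split_spreads("scan.pdf", "scan_split.pdf")` (both file names hypothetical) should turn a 100 page file into a 200 page one. If the scans are off-center, the midpoint in `split_box` is the value to adjust.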
posted by wabbittwax at 8:07 PM on May 16, 2010


I'm building a book scanner, and a while back, while researching whether I could put my scanned books on a Kindle somehow (the only reason I want to scan them), I typed up the following notes for myself:
Book-scanning forums that often include talk of this sort of thing:
http://www.diybookscanner.org/

The problem is that the scans are saved as graphics, not text, and this will display poorly on an eBook reader. To convert it over to text, I need to run them through OCR (Optical Character Recognition) software. There is a lot of this software out there and some of it is free, so I would need to do some research. ABBYY Finereader isn’t free but the cheapest version is only $50 and there’s a whole thread recommending it on the forums.

I haven’t had any luck with this site but some people said it will do comparable OCR for free:
http://www.cometdocs.com/

After that, there may still be some formatting necessary. I’m not entirely clear on this. This forum thread talks about the steps necessary to get a scanned book into a format that eBook readers can use:
http://www.diybookscanner.org/forum/viewtopic.php?f=1&t=115&start=0&hilit=kindle

I think the way to go is scanned images -> OCR to text (something like RTF that preserves italics and such) -> eBook format.

Then I’ve heard this is supposed to be helpful in converting it to the Kindle formatting… it will convert from PDF or RTF to Kindle:
http://calibre-ebook.com/
Hope that is helpful.
posted by Nattie at 8:09 PM on May 16, 2010


I just realized I should specify: apparently even if you have it split into left and right pages per "page" of the PDF, OCR readers will pull the text out in the correct order.
posted by Nattie at 8:28 PM on May 16, 2010


If you have Acrobat Pro, export the PDF to TIFF images at 300 dpi at least, or 600 dpi if it's colourful and photographic. Put all of those files in one folder. If you don't have Acrobat Pro, there are other utilities that can do this job. You can do it easily on Linux, but I don't know how.

Install Scan Tailor and import this folder into it. Follow the on-screen directions and, at the end, export as individual pages.

Combine these images into a PDF. Run OCR if you really need it.

I follow this procedure all the time, as I am scanning my whole library. It's very handy, and if you have a fast computer it can be very quick. Make sure to use the automate function, which is just a button next to every step.
posted by zaxour at 9:03 PM on May 16, 2010 [1 favorite]


You could do this with LaTeX (I'm not saying it's the world's best idea, but it's doable, and was in fact a fun exercise). You would need to change values in three places, which I have marked with comment lines starting with a percent symbol %. (It's set up for 8.5 by 11 letter paper and splitting down the middle right now.) Add a ~\newpage line for every page you want in the output (if your current document is 50 pages, the output would be 100 pages, and you would need ~\newpage 100 times). Then run it through pdflatex (freely available).

\documentclass[12pt]{book}

\usepackage[papersize={4.25in,11in},width=4.25in,height=11in]{geometry}
%this is the size of each page in your final output; it is given twice (papersize, then width and height)

\usepackage{graphicx,fancyhdr}

\newcounter{pagein} \setcounter{pagein}{1}

\pagestyle{fancy}
\fancyhead{} \fancyfoot{}
\headheight 11.1in
%the height of your final output, plus a little bit of padding
\fancyhead[CO]{\includegraphics[trim=0in 0in 4.25in 0in,clip=true,page=\value{pagein}]{INPUTFILE.pdf}}
\fancyhead[CE]{\includegraphics[trim=4.25in 0in 0in 0in,clip=true,page=\value{pagein}]{INPUTFILE.pdf}\addtocounter{pagein}{1}}
%the two above lines: specify the name of your input file, for trim, specify the amount to remove from each page (left, bottom, right, top), CO = odd pages, CE = even pages, first page is odd

\renewcommand{\headrulewidth}{0pt} \renewcommand{\footrulewidth}{0pt}

\begin{document}
~\newpage
~\newpage
%have a ~\newpage line for every page you want in the output file
\end{document}
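
Typing ~\newpage a hundred times gets old quickly. As a small convenience, here is a Python sketch that prints the document body for a given number of input pages. The function name is made up for illustration, and it assumes every input page holds exactly two book pages.

```python
def newpage_body(input_pages):
    """Return one ~\\newpage line per output page (two per input page)."""
    output_pages = 2 * input_pages
    return "\n".join(["~\\newpage"] * output_pages)


if __name__ == "__main__":
    # Example: a 50 page input PDF needs 100 output pages.
    print("\\begin{document}")
    print(newpage_body(50))
    print("\\end{document}")
```

Paste the printed lines over the \begin{document} ... \end{document} part of the file above.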

posted by anaelith at 9:11 PM on May 16, 2010


TIFFRotateSplitMakePDF splits side-by-side book-scan TIFs to create 1-up pdf pages using ImageMagick. But the latter also reads pdfs, and you seem willing to get hands-on, so perhaps you can get ImageMagick to do what you want on pdfs directly, without the intermediate TIFs.
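
If you would rather script the split yourself once you have one image per scan, the core step is just two crops. Here is a minimal Python sketch, assuming the third-party Pillow library is installed and that the gutter sits exactly in the middle of each scan; the function names and file names are hypothetical.

```python
def half_boxes(width, height):
    """Return (left, upper, right, lower) crop boxes for the two halves."""
    mid = width // 2
    return (0, 0, mid, height), (mid, 0, width, height)


def split_scan(path, left_out, right_out):
    """Save the left and right halves of one side-by-side scan."""
    # Pillow is a third-party package (assumed installed: pip install Pillow).
    from PIL import Image

    with Image.open(path) as img:
        left_box, right_box = half_boxes(*img.size)
        img.crop(left_box).save(left_out)
        img.crop(right_box).save(right_out)
```

For example, `split_scan("scan001.tif", "page001.tif", "page002.tif")` would write the two book pages of one scan as separate files. It does no deskewing or rotation, so crooked scans need a tool that handles that.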
posted by Dave 9 at 9:22 PM on May 16, 2010


You know, if you have a recent version of Microsoft Office, OneNote does pretty good OCR for free...
posted by Master Gunner at 9:34 PM on May 16, 2010


Best answer: A-PDF Page Cut will do what you need. The full licence is $27, but the free download works fine, as long as you don't mind the small watermark at the top of the first page. I have found this to be a very handy program, along with PDF Split and Merge, which will split and recombine the individual pages of a pdf document, and is free.
posted by Tawita at 11:51 PM on May 16, 2010


Best answer: If you have Acrobat Pro, here's some JavaScript that will do it for you.

If you can boot into Linux, you can use the poppler library to do this.
posted by stratastar at 4:46 PM on May 17, 2010


I've just stumbled onto this thread while trying to approach the same problem half a year later. For posterity, here's what I found to work well:

Scan Tailor takes a TIFF file input and automatically identifies orientation, splits pages, and even corrects skew. Has an excellent UI and even works on Linux. Absolutely amazing.

However, it outputs a TIFF file for each page. You can convert those to a single PDF using ImageMagick and pdftk:

mogrify -format pdf *.tiff
pdftk 0*.pdf cat output mybook.pdf
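
The 0*.pdf glob depends on the file names being zero-padded so that they sort in page order. If yours are not, a short Python wrapper around the same two commands can sort them numerically first. This is only a sketch: it assumes mogrify and pdftk are on your PATH, that the pages end in .tiff, and the function names are made up for illustration.

```python
import re
import subprocess
from pathlib import Path


def natural_key(name):
    """Sort key that orders page2 before page10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_book(folder, output="mybook.pdf"):
    """Convert every TIFF in `folder` to PDF and join them in page order."""
    tiffs = sorted((str(p) for p in Path(folder).glob("*.tiff")), key=natural_key)
    if not tiffs:
        raise FileNotFoundError(f"no .tiff files found in {folder}")
    # Same as: mogrify -format pdf *.tiff (writes a .pdf next to each .tiff)
    subprocess.run(["mogrify", "-format", "pdf", *tiffs], check=True)
    pdfs = [str(Path(t).with_suffix(".pdf")) for t in tiffs]
    # Same as: pdftk <pages in order> cat output mybook.pdf
    subprocess.run(["pdftk", *pdfs, "cat", "output", output], check=True)
```

Run it as `build_book("out")`, where "out" is a hypothetical name for the folder Scan Tailor wrote its pages to.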


(This is a non-OCR solution. I'm converting a math textbook, and I don't want OCR to mangle the notation.)
posted by qxntpqbbbqxl at 2:44 PM on January 24, 2011

