Help me wrangle PDFs
July 2, 2010 1:59 AM
I'm trying to go paperless, and have scanned and OCR'd huge stacks of paperwork into PDF documents. Can you recommend a tool to split, merge, delete pages etc from PDFs?
So my workflow so far has been: stick a huge pile of related docs into the sheet feeder of our networked scanner at work, and scan them, double sided to a PDF. Then feed that PDF through ABBYY FineReader.
So each PDF contains multiple documents. For example I have all the bank statements for a single account in one huge document. Double sided scanning means there are also a lot of blank pages in the docs.
So, now that I've OCR'd them, I'd like to separate them into separate docs, remove blank pages and probably merge a few docs where they've been split across batches.
I'm imagining something that works like a standard PDF viewer, but lets me drag pages around to re-order, delete pages, and Ctrl-click multiple pages and save them to a file.
Do you know of such a tool? Preferably Linux-based, but Windows will do too. Oh yeah - and free is better :)
PDF Split and Merge is in the Ubuntu repositories, so it runs under Linux.
pdfshuffler might also be worth a try.
posted by Triton at 4:04 AM on July 2, 2010
On Windows, ConcatPDF is my favorite free one. It's a GUI-based app, although I don't remember if it has exactly the interface you're talking about. Before installing ConcatPDF 1.1 you have to install Microsoft .NET Framework Version 1 and Visual J# .NET Redistributable Package.
posted by XMLicious at 4:19 AM on July 2, 2010 [1 favorite]
Best answer: If you don't mind working at the command line, look at pdftk. It is not hard to learn and very powerful. I use it all the time to split and concatenate documents for work.
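For anyone who would rather script it, here is a minimal sketch of driving pdftk from Python. It only uses pdftk's `cat` and `output` operations; the file names are hypothetical, and the helper names (`extract_cmd`, `merge_cmd`, `run`) are made up for illustration, not part of pdftk.

```python
import subprocess


def extract_cmd(src, first, last, dest):
    """Build a pdftk command that copies pages first..last of src into dest."""
    return ["pdftk", src, "cat", f"{first}-{last}", "output", dest]


def merge_cmd(sources, dest):
    """Build a pdftk command that concatenates the source files, in order, into dest."""
    return ["pdftk", *sources, "cat", "output", dest]


def run(cmd):
    """Run a command built above; raises CalledProcessError if pdftk fails."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(extract_cmd("statements.pdf", 1, 4, "2010-01.pdf")))
    print(" ".join(merge_cmd(["batch1.pdf", "batch2.pdf"], "merged.pdf")))
```

Building the argument list separately from running it means you can print and eyeball each command before letting it touch your files.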
posted by metroidhunter at 4:47 AM on July 2, 2010
You might want to download dotImage from my company Atalasoft. It is an imaging SDK that includes PDF manipulation. The evaluation gives you 30 days to play with it and one of the included samples is a tool that lets you merge/split PDF files via drag-and-drop. IIRC, that tool will run on its own even after the eval expires.
FWIW, I worked on Acrobat version 1, 2, 3, and 4, and while I didn't write the sample tool, I wrote all the code underneath that manipulates the files. If you have problems with it, be sure to memail me.
posted by plinth at 6:12 AM on July 2, 2010
Response by poster: Thanks everyone!
Just for anyone looking at this later: I found most of the apps had horrendous, unintuitive UIs, to the extent where writing my own Python glue to make pdftk do what I wanted was less painful. PDFShuffler was the closest to what I was looking for, but seemed to mess up the output PDFs - I have the OCRed text under the original scanned image, and PDFShuffler seemed to either move the text to the top, or lose the image.
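A rough sketch of the kind of glue I mean, not the actual script. It assumes you note down by hand which page each document starts on and which pages are blank (detecting blanks is not shown). The function names, the `doc` output prefix and the page numbers in the usage note are all hypothetical; the only pdftk feature used is `cat` with several page ranges.

```python
def collapse(pages):
    """Turn sorted page numbers into pdftk range strings, e.g. [1, 2, 3, 5] -> ["1-3", "5"]."""
    ranges = []
    start = prev = None
    for p in pages:
        if start is None:
            start = prev = p
        elif p == prev + 1:
            prev = p
        else:
            ranges.append(f"{start}-{prev}" if start != prev else str(start))
            start = prev = p
    if start is not None:
        ranges.append(f"{start}-{prev}" if start != prev else str(start))
    return ranges


def split_plan(total_pages, doc_starts, blank_pages):
    """For each document, list the page ranges to keep, with blank pages dropped."""
    starts = sorted(doc_starts)
    blanks = set(blank_pages)
    plan = []
    for i, first in enumerate(starts):
        # A document runs up to the page before the next start, or to the end.
        last = starts[i + 1] - 1 if i + 1 < len(starts) else total_pages
        keep = [p for p in range(first, last + 1) if p not in blanks]
        plan.append(collapse(keep))
    return plan


def commands(src, plan, prefix="doc"):
    """Build one pdftk command per document; documents with no pages left are skipped."""
    return [
        ["pdftk", src, "cat", *ranges, "output", f"{prefix}-{n:02d}.pdf"]
        for n, ranges in enumerate(plan, start=1)
        if ranges
    ]
```

For a 10-page scan with documents starting on pages 1 and 5 and blanks on pages 2, 4 and 8, `split_plan(10, [1, 5], [2, 4, 8])` gives the ranges for two output files, and `commands("scan.pdf", plan)` turns them into argument lists you can pass to `subprocess.run`.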
Sorry Plinth, I didn't get round to trying yours - a combination of being jaded from all the other tools, and having to fill in a form to try it made me give up. No offence!
posted by blacksky at 8:54 AM on July 3, 2010
I use PaperPort -- very intuitive UI.
posted by blue_wardrobe at 10:38 PM on July 3, 2010
You can do all of these things in Mac's Preview, BTW.
posted by rockindata at 10:00 AM on February 11, 2011
This thread is closed to new comments.
posted by megatherium at 2:32 AM on July 2, 2010