pdf unpaper 'ed
May 16, 2007 1:25 PM

The Unix program 'unpaper' was recently recommended for cleaning up artifacts on scanned images. Unfortunately it doesn't natively handle PDFs, only the "pnm family": pbm, pgm and ppm formats.

I'm using Ubuntu Feisty and have installed ImageMagick and unpaper; any suggestions are welcome. My first challenge is conversion from PDF to PBM.
The pdf file in question. (23 mb)
posted by acro to Computers & Internet (10 answers total)
 
$ sudo apt-get install imagemagick
$ convert foo.pdf tmp.pbm && unpaper tmp.pbm ... && convert tmp.pbm foo.pdf
posted by cmiller at 1:41 PM on May 16, 2007


Response by poster: Thanks cmiller.

When I tried earlier to convert the PDF to PBM, ImageMagick output only the first page of the multi-page PDF, and the PBM file was 3x the size of the entire original PDF (~80 MB).
posted by acro at 1:52 PM on May 16, 2007


PDF can (IIRC) LZ-compress bitmap data, while PBM stores it uncompressed, so I'd expect the PBM to be large.
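
As a rough illustration of why the uncompressed file grows, the pixel data of a raw PBM is one bit per pixel with each row padded to a whole byte, and an 8-bit PGM is one byte per pixel. The sketch below does that arithmetic. The page dimensions (an A4 sheet at 300 dpi, about 2480 x 3508 pixels) are an assumption for the example; the thread does not give the scan resolution.

```shell
#!/bin/sh
# Pixel-data size in bytes of a raw (P4) PBM: 1 bit per pixel,
# each row padded up to a whole byte. The short text header is ignored.
pbm_raw_bytes() {
    width=$1
    height=$2
    echo $(( (width + 7) / 8 * height ))
}

# Pixel-data size in bytes of a raw (P5) 8-bit PGM: 1 byte per pixel.
pgm_raw_bytes() {
    width=$1
    height=$2
    echo $(( width * height ))
}

# Assumed example page: A4 at 300 dpi, roughly 2480 x 3508 pixels.
pbm_raw_bytes 2480 3508   # prints 1087480 (about 1 MB per page)
pgm_raw_bytes 2480 3508   # prints 8699840 (about 8.7 MB per page)
```

Multiply by the page count to estimate the total for a whole book.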

If it's more than one page, you can put each page into its own file (also IIRC):

$ convert in.pdf tmp%03d.pbm
$ for inname in tmp*.pbm; do
    # keep the .pbm extension so the glob in the last step matches
    outname="out$(basename "$inname")"
    unpaper ... "$inname" ... "$outname"
done
$ convert 'out*.pbm' result.pdf   # mind the single quotes!

Note, I haven't tried this in a /long time/, so it may need tweaking.
posted by cmiller at 2:47 PM on May 16, 2007
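
The split, clean and rejoin steps above could be wrapped up roughly as follows. This is only a sketch: the function names `out_name` and `clean_pdf`, the `tmp`/`out` file prefixes and the 300 dpi density are illustrative assumptions, not anything from the thread. ImageMagick rasterizes PDFs at 72 dpi unless `-density` is given, which is why a density is set here. Output names keep their `.pbm` extension so the final glob can find them.

```shell
#!/bin/sh
# Sketch only: names, prefixes and the 300 dpi density are assumptions.

# Map an input page name to its output name, keeping the .pbm extension.
out_name() {
    echo "out$(basename "$1")"
}

# Split a PDF into one PBM per page, run unpaper on each page, and rejoin.
# Any arguments after the two file names are passed to unpaper unchanged.
clean_pdf() {
    in_pdf=$1
    out_pdf=$2
    shift 2
    convert -density 300 "$in_pdf" tmp%03d.pbm || return 1
    for page in tmp*.pbm; do
        unpaper "$@" "$page" "$(out_name "$page")" || return 1
    done
    # The shell expands the glob in sorted order, so pages stay in sequence.
    convert out*.pbm "$out_pdf"
}
```

A call might look like `clean_pdf in.pdf result.pdf --layout double --sheet-size a4-landscape`, run in an empty working directory so stale `tmp*.pbm` files are not picked up.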


Response by poster: First line ran successfully. (Thanks!)
-- Since the pages are two up (double)...

$ for inname in tmp*.pbm; do outname=out`basename $inname .pbm`; unpaper --layout double --sheet-size a4-landscape ... $inname ... $outname; done

Any suggestions for the option syntax?

*** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.
posted by acro at 3:43 PM on May 16, 2007


Dunno exactly what that error means, but are you sure you don't want either --sheet-size a4 or --layout double-rotated?

Otherwise you will be scaling the pages substantially, which doesn't seem to be what you want.
posted by rajbot at 4:30 PM on May 16, 2007


Also, it would help if you posted the pdf.
posted by rajbot at 4:36 PM on May 16, 2007


Response by poster: The pdf is the last link [more inside] ... it's a similar layout to the example picture here, regular book scan.
posted by acro at 4:59 PM on May 16, 2007


Ah sorry, missed your more inside.

Your options are fine, but your bash script is probably messed up. Using --layout double --sheet-size a4-landscape on the third page, I get this result. I didn't do a good job on the bitonalization, but was able to filter out the dark page edges.

Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don't have to go to bitonal, as required by unpaper.
posted by rajbot at 6:15 PM on May 16, 2007
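
One possible way to try the white-fill-and-deskew idea with ImageMagick is sketched below. The helper names `whiteout_primitives` and `clean_spread` are made up for illustration, the margin and gutter widths are placeholders to be tuned against the actual scan, and the `40%` deskew threshold is simply a commonly used value. It assumes a single-page image and leaves out the final scaling to A4.

```shell
#!/bin/sh
# Print ImageMagick -draw primitives that paint bars over the four page
# edges and the center gutter of a two-up scan.
# Arguments: image width, image height, edge margin, gutter width (pixels).
whiteout_primitives() {
    w=$1; h=$2; m=$3; g=$4
    x1=$(( w - 1 ))          # last column
    y1=$(( h - 1 ))          # last row
    gl=$(( (w - g) / 2 ))    # left edge of the gutter bar
    gr=$(( gl + g - 1 ))     # right edge of the gutter bar
    echo "rectangle 0,0 $x1,$(( m - 1 ))" \
         "rectangle 0,$(( h - m )) $x1,$y1" \
         "rectangle 0,0 $(( m - 1 )),$y1" \
         "rectangle $(( w - m )),0 $x1,$y1" \
         "rectangle $gl,0 $gr,$y1"
}

# White-fill the edges and gutter of one page image, then deskew it.
# Arguments: input image, output image, edge margin, gutter width.
clean_spread() {
    in=$1; out=$2; margin=$3; gutter=$4
    # identify reports the pixel dimensions of the (single-page) input.
    set -- $(identify -format '%w %h' "$in")
    convert "$in" -background white -fill white \
        -draw "$(whiteout_primitives "$1" "$2" "$margin" "$gutter")" \
        -deskew 40% "$out"
}
```

For example, `clean_spread page003.png clean003.png 60 80` would blank a 60-pixel border and an 80-pixel gutter; both numbers are guesses.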


Response by poster: "Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don't have to go to bitonal, as required by unpaper."

I've done a similar 'crop all pages' in Adobe Acrobat; can you suggest a how-to for Unix? Using ImageMagick?
posted by acro at 6:34 PM on May 16, 2007


I would do it using the Leptonica C library, but that requires that you write some glue code in C.

If you don't mind going to bitonal, unpaper is great, but since you have such clean images you might want to turn off some of the noise filters, which can be too aggressive with the default settings and actually mess with the text.
posted by rajbot at 9:54 PM on May 16, 2007
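
A small wrapper along these lines could switch off some of unpaper's filters. The choice of `--no-noisefilter` and `--no-blurfilter` is a guess at which filters are meant, and the wrapper name `unpaper_gentle` is made up; compare the output with and without each flag before settling on a set.

```shell
#!/bin/sh
# Run unpaper with two of its cleanup filters switched off.
# Which filters to disable is an assumption; adjust after checking results.
# All arguments (other options, input file, output file) pass through as-is.
unpaper_gentle() {
    unpaper --no-noisefilter --no-blurfilter "$@"
}
```

It is called like unpaper itself, for example `unpaper_gentle --layout double tmp003.pbm outtmp003.pbm`.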

