pdf unpaper 'ed
May 16, 2007 1:25 PM
The unix program 'unpaper' was recently recommended for cleaning up artifacts on scanned images. Unfortunately it doesn't natively do pdfs, only "pnm family" -- pbm, pgm and ppm formats.
Using Ubuntu Feisty; I've installed ImageMagick and unpaper, and welcome any suggestions. My first challenge is conversion from pdf to pbm.
The pdf file in question. (23 mb)
Response by poster: Thanks cmiller.
When I tried earlier to convert the pdf to pbm, ImageMagick output only the first page of the multi-page pdf, and the pbm file was 3x the size of the entire original pdf (~80 mb).
posted by acro at 1:52 PM on May 16, 2007
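A rough size check may help here. Raw PBM stores one bit per pixel with no compression, and each row is padded to a whole byte, so the file size follows directly from the pixel dimensions. The sketch below uses a hypothetical helper, pbm_bytes, and assumed page dimensions (A4 at 300 dpi); the real scan's dimensions are not stated in the thread.

```shell
#!/bin/sh
# Estimate the size in bytes of one raw PBM page, ignoring the short header.
# One bit per pixel, each row padded up to a whole number of bytes.
pbm_bytes() {
    width=$1
    height=$2
    echo $(( (width + 7) / 8 * height ))
}

# Assumed example: an A4 page at 300 dpi is roughly 2480 x 3508 pixels.
pbm_bytes 2480 3508
```

That works out to about 1 MB per uncompressed page under these assumptions, so a multi-page scan can easily end up larger than a compressed PDF.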
PDFs can (iirc) LZ-compress bitmap data, so I'd expect the PBM to be large.
If it's more than one page, you can put each page into its own file (also IIRC):
$ convert in.pdf tmp%03d.pbm
$ for inname in tmp*.pbm; do
  outname="out$(basename "$inname")" # keep the .pbm extension so the glob in the last step matches
  unpaper ... "$inname" ... "$outname"
done
$ convert 'out*.pbm' result.pdf # mind the single quotes!
Note, I haven't tried this in a /long time/, so it may need tweaking.
posted by cmiller at 2:47 PM on May 16, 2007
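One thing worth knowing about the split step: ImageMagick rasterizes PDFs at 72 dpi unless told otherwise, which is coarse for scanned text. A minimal sketch of the split with an explicit resolution follows; the function name split_pdf and the 300 dpi default are assumptions for illustration, not something from the thread.

```shell
# Rasterize every page of a PDF into numbered PBM files.
# Usage: split_pdf INPUT.pdf PREFIX [DPI]
split_pdf() {
    input=$1
    prefix=$2
    dpi=${3:-300}   # assumed default; ideally match the resolution of the original scan
    # -density has to come before the input file so it applies when the PDF is read.
    convert -density "$dpi" "$input" "${prefix}%03d.pbm"
}
```

Calling split_pdf in.pdf tmp should produce tmp000.pbm, tmp001.pbm and so on, which the loop above can then pick up. Higher densities make the PBM files correspondingly larger.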
Response by poster: First line ran successfully. (Thanks!)
-- Since the pages are two up (double)...
$ for inname in tmp*.pbm; do outname=out`basename $inname .pbm`; unpaper --layout double --sheet-size a4-landscape ... $inname ... $outname; done
Any suggestions for the option syntax?
*** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.
posted by acro at 3:43 PM on May 16, 2007
Dunno exactly what that error means, but are you sure you don't want either --sheet-size a4 or --layout double-rotated?
Otherwise you will be scaling the pages substantially, which doesn't seem to be what you want.
posted by rajbot at 4:30 PM on May 16, 2007
Response by poster: The pdf is the last link [more inside] ... it's a similar layout to the example picture here, regular book scan.
posted by acro at 4:59 PM on May 16, 2007
Ah sorry, missed your more inside.
Your options are fine, but your bash script is probably messed up. Using --layout double --sheet-size a4-landscape on the third page, I get this result. I didn't do a good job on the bitonalization, but was able to filter out the dark page edges.
Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don't have to go to bitonal, as required by unpaper.
posted by rajbot at 6:15 PM on May 16, 2007
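If the script is the problem, one possible culprit (a guess, not confirmed anywhere in the thread) is that the literal "..." placeholders were passed to unpaper as extra file names. A sketch of the loop with only real options and one input and one output file per call, using the options quoted above; the function name unpaper_double_pages is made up for illustration:

```shell
# Run unpaper on every tmpNNN.pbm in the current directory,
# writing each result to outtmpNNN.pbm.
unpaper_double_pages() {
    for inname in tmp*.pbm; do
        [ -e "$inname" ] || continue   # the glob matched nothing
        outname="out$inname"
        # Only options and two file names are passed; no "..." placeholders.
        unpaper --layout double --sheet-size a4-landscape "$inname" "$outname" || return 1
    done
}
```

Run it from the directory holding the tmp*.pbm files; the out*.pbm results can then be merged back into a PDF with convert as in the earlier comment.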
Response by poster: "Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don't have to go to bitonal, as required by unpaper."
I've done a similar 'crop all pages' in Adobe Acrobat; can you suggest a howto for unix? Using Imagemagick?
posted by acro at 6:34 PM on May 16, 2007
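For the ImageMagick route, a rough equivalent of 'crop all pages' is to shave a fixed margin off every page image. This is only a sketch: the function name shave_pages and the margin values are hypothetical, it handles the outer edges only (not the center gutter), and it does no deskewing.

```shell
# Trim a fixed margin (in pixels) from the edges of each image given.
# Usage: shave_pages MARGIN_X MARGIN_Y file...
shave_pages() {
    mx=$1
    my=$2
    shift 2
    for f in "$@"; do
        # -shave removes mx pixels from the left and right and my pixels from
        # the top and bottom; +repage resets the canvas offset left by the crop.
        convert "$f" -shave "${mx}x${my}" +repage "cropped-$(basename "$f")" || return 1
    done
}
```

For example, shave_pages 100 50 tmp*.pbm would write cropped-tmp000.pbm and so on into the current directory. The right margins depend on the scan, so try one page first.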
I would do it using the Leptonica C library, but that requires that you write some glue code in C.
If you don't mind going to bitonal, unpaper is great, but since you have such clean images you might want to turn off some of the noise filters, which can be too aggressive with the default settings and actually mess with the text.
posted by rajbot at 9:54 PM on May 16, 2007
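As a sketch of what turning off filters might look like: unpaper has --no-noisefilter, --no-blurfilter and --no-grayfilter switches. Which ones to disable for this scan is an assumption here, and the wrapper name unpaper_gentle is made up; unpaper --help lists what your version supports.

```shell
# Run unpaper on one double-page sheet with the noise-related filters disabled.
# Usage: unpaper_gentle INPUT.pbm OUTPUT.pbm
unpaper_gentle() {
    unpaper --layout double --sheet-size a4-landscape \
        --no-noisefilter --no-blurfilter --no-grayfilter \
        "$1" "$2"
}
```

Comparing one page processed with and without these switches should show whether the default filters are eating into the text.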
$ convert foo.pdf tmp.pbm && unpaper tmp.pbm ... && convert tmp.pbm foo.pdf
posted by cmiller at 1:41 PM on May 16, 2007