<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: pdf unpaper 'ed</title>
	<link>http://ask.metafilter.com/62789/pdf-unpaper-ed/</link>
	<description>Comments on Ask MetaFilter post pdf unpaper 'ed</description>
	<pubDate>Wed, 16 May 2007 13:41:22 -0800</pubDate>
	<lastBuildDate>Wed, 16 May 2007 13:41:22 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: pdf unpaper &apos;ed</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed</link>	
		<description>The Unix program &apos;unpaper&apos; was recently recommended for cleaning up artifacts on scanned images. Unfortunately it doesn&apos;t natively handle PDFs, only the &quot;pnm family&quot; -- pbm, pgm and ppm formats. &lt;br /&gt;&lt;br /&gt; I&apos;m using Ubuntu Feisty and have installed ImageMagick and unpaper; any suggestions are welcome. My first challenge is conversion from &lt;a href=&quot;http://linux.about.com/library/cmd/blcmdl1_pdftopbm.htm&quot;&gt;pdf to pbm&lt;/a&gt;.&lt;br&gt;
&lt;a href=&quot;http://www.b612foundation.org/papers/NASA-finalrpt.pdf&quot;&gt;The pdf file in question. (23 mb)&lt;/a&gt;</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2007:site.62789</guid>
		<pubDate>Wed, 16 May 2007 13:25:42 -0800</pubDate>
		<dc:creator>acro</dc:creator>
		
			<category>pdftopbm</category>
		
			<category>unpaper</category>
		
			<category>pdf</category>
		
			<category>scan</category>
		
			<category>book</category>
		
			<category>scanner</category>
		
			<category>Imagemagick</category>
		
	</item> <item>
		<title>By: cmiller</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944764</link>	
		<description>$ sudo apt-get install imagemagick&lt;br&gt;
$ convert foo.pdf tmp.pbm &amp;amp;&amp;amp; unpaper tmp.pbm ... &amp;amp;&amp;amp; convert tmp.pbm foo.pdf</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944764</guid>
		<pubDate>Wed, 16 May 2007 13:41:22 -0800</pubDate>
		<dc:creator>cmiller</dc:creator>
	</item><item>
		<title>By: acro</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944778</link>	
		<description>Thanks cmiller.&lt;br&gt;
&lt;br&gt;
When I tried earlier to convert the PDF to PBM, ImageMagick output only the first page of the multi-page PDF, and the PBM file was 3x the size of the entire original PDF (~80 MB).</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944778</guid>
		<pubDate>Wed, 16 May 2007 13:52:56 -0800</pubDate>
		<dc:creator>acro</dc:creator>
	</item><item>
		<title>By: cmiller</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944853</link>	
		<description>PDF may (IIRC) LZ-compress bitmap data, so I&apos;d expect the uncompressed PBM to be large.&lt;br&gt;
&lt;br&gt;
If it&apos;s more than one page, you can put each page into its own file (also IIRC):&lt;br&gt;
&lt;br&gt;
$ convert in.pdf tmp%03d.pbm&lt;br&gt;
$ for inname in tmp*.pbm; do&lt;br&gt;
   outname=out`basename $inname`   # keep the .pbm extension so the out*.pbm glob in the last step matches&lt;br&gt;
   unpaper ... $inname ... $outname&lt;br&gt;
done&lt;br&gt;
$ convert &apos;out*.pbm&apos; result.pdf   #mind the single quotes!&lt;br&gt;
&lt;br&gt;
Note, I haven&apos;t tried this in a /long time/, so it may need tweaking.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944853</guid>
		<pubDate>Wed, 16 May 2007 14:47:39 -0800</pubDate>
		<dc:creator>cmiller</dc:creator>
	</item><item>
		<title>By: acro</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944922</link>	
		<description>First line ran successfully. (Thanks!)&lt;br&gt;
-- Since the pages are two up (double)... &lt;br&gt;
&lt;br&gt;
$ for inname in tmp*.pbm; do outname=out`basename $inname .pbm`; unpaper &lt;strong&gt;&lt;em&gt;--layout double --sheet-size a4-landscape&lt;/em&gt;&lt;/strong&gt; ... $inname ... $outname; done&lt;br&gt;
 &lt;br&gt;
Any suggestions for the option syntax?&lt;br&gt;
&lt;br&gt;
*** error: sheet size unknown, use at least one input file per sheet, or force using --sheet-size.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944922</guid>
		<pubDate>Wed, 16 May 2007 15:43:10 -0800</pubDate>
		<dc:creator>acro</dc:creator>
	</item><item>
		<title>By: rajbot</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944958</link>	
		<description>Dunno exactly what that error means, but are you sure you don&apos;t want either &lt;tt&gt;--sheet-size a4&lt;/tt&gt; or &lt;tt&gt;--layout double-rotated&lt;/tt&gt;?&lt;br&gt;
&lt;br&gt;
Otherwise you will be scaling the pages substantially, which doesn&apos;t seem to be what you want.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944958</guid>
		<pubDate>Wed, 16 May 2007 16:30:34 -0800</pubDate>
		<dc:creator>rajbot</dc:creator>
	</item><item>
		<title>By: rajbot</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944963</link>	
		<description>Also, it would help if you posted the pdf.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944963</guid>
		<pubDate>Wed, 16 May 2007 16:36:29 -0800</pubDate>
		<dc:creator>rajbot</dc:creator>
	</item><item>
		<title>By: acro</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#944968</link>	
		<description>The PDF is the last link [more inside] ... it&apos;s a similar layout to &lt;a href=&quot;http://unpaper.berlios.de/#overview&quot;&gt;the example picture&lt;/a&gt; here: a regular book scan.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-944968</guid>
		<pubDate>Wed, 16 May 2007 16:59:53 -0800</pubDate>
		<dc:creator>acro</dc:creator>
	</item><item>
		<title>By: rajbot</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#945032</link>	
		<description>Ah sorry, missed your more inside.&lt;br&gt;
&lt;br&gt;
Your options are fine, but your bash script is probably messed up. Using &lt;tt&gt;--layout double --sheet-size a4-landscape&lt;/tt&gt; on the third page, I get &lt;a href=&quot;http://homeserver.us.archive.org/~rkumar/acro3-out.png&quot;&gt;this result&lt;/a&gt;. I didn&apos;t do a good job on the bitonalization, but was able to filter out the dark page edges.&lt;br&gt;
&lt;br&gt;
Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don&apos;t have to go to bitonal, as required by unpaper.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-945032</guid>
		<pubDate>Wed, 16 May 2007 18:15:38 -0800</pubDate>
		<dc:creator>rajbot</dc:creator>
	</item><item>
		<title>By: acro</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#945049</link>	
		<description>&lt;em&gt;Since your pages are well-registered, another way to approach this problem is to white-fill or filter the edges and the center gutter, deskew, and then scale to A4. That way you don&apos;t have to go to bitonal, as required by unpaper.&lt;/em&gt;&lt;br&gt;
&lt;br&gt;
I&apos;ve done a similar &apos;crop all pages&apos; in Adobe Acrobat; can you suggest a how-to for Unix, perhaps using ImageMagick?</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-945049</guid>
		<pubDate>Wed, 16 May 2007 18:34:28 -0800</pubDate>
		<dc:creator>acro</dc:creator>
	</item><item>
		<title>By: rajbot</title>
		<link>http://ask.metafilter.com/62789/pdf-unpaper-ed#945206</link>	
		<description>I would do it using the Leptonica C library, but that requires that you write some glue code in C.&lt;br&gt;
&lt;br&gt;
If you don&apos;t mind going to bitonal, unpaper is great, but since you have such clean images you might want to turn off some of the noise filters, which can be too aggressive at the default settings and actually mess with the text.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2007:site.62789-945206</guid>
		<pubDate>Wed, 16 May 2007 21:54:42 -0800</pubDate>
		<dc:creator>rajbot</dc:creator>
	</item>
	</channel>
</rss>
