<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:admin="http://webns.net/mvcb/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
	<channel> 

	<title>Comments on: How do I seek and destroy image-only PDFs?</title>
	<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs/</link>
	<description>Comments on Ask MetaFilter post How do I seek and destroy image-only PDFs?</description>
	<pubDate>Tue, 16 Dec 2008 23:48:52 -0800</pubDate>
	<lastBuildDate>Tue, 16 Dec 2008 23:48:52 -0800</lastBuildDate>
	<language>en-us</language>
	<docs>http://blogs.law.harvard.edu/tech/rss</docs>
	<ttl>60</ttl>

	<item>
		<title>Question: How do I seek and destroy image-only PDFs?</title>
		<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs</link>	
		<description>Is there Mac-based software available that will search my entire hard drive or designated folders for image-only PDF files (ones that have not been OCR&apos;ed), then automatically run OCR (using Acrobat Pro or whatever) and overwrite the original file with a searchable version? &lt;br /&gt;&lt;br /&gt; I am in the process of scanning all of my personal and business files. I have been very pleased with the results so far and love the ease of using Spotlight or Google Desktop to locate searchable PDF files, etc.&lt;br&gt;
&lt;br&gt;
However, I have hundreds of older image-only PDFs scattered randomly across different folders.  Currently, I open each questionable PDF and manually check whether it is image-only or searchable.  If the PDF is image-only, I manually run OCR using Acrobat Pro and then save the searchable version over the original file.&lt;br&gt;
&lt;br&gt;
I am looking for a way to automate this tedious process.  So far, I have only been able to find scripts and the like that batch process groups of PDFs.  I am looking for something that will &quot;search and destroy&quot; on its own.  &lt;br&gt;
&lt;br&gt;
I have a MacBook running OS X 10.5.6, Adobe Acrobat Pro 8 and a Fujitsu ScanSnap scanner.</description>
		<guid isPermaLink="false">post:ask.metafilter.com,2008:site.109420</guid>
		<pubDate>Tue, 16 Dec 2008 22:47:59 -0800</pubDate>
		<dc:creator>randex8</dc:creator>
		
			<category>Mac</category>
		
			<category>PDF</category>
		
			<category>OCR</category>
		
			<category>Software</category>
		
			<category>Paperless</category>
		
	</item> <item>
		<title>By: suedehead</title>
		<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs#1575509</link>	
		<description>&lt;a href=&quot;http://www.devon-technologies.com/products/devonthink/uniquefeatures.html&quot;&gt;DEVONthink&lt;/a&gt; will create a database of your documents. You don&apos;t have to import the PDFs; you can just index them, and once you do that it&apos;ll OCR them and save them. In addition, it has a very nifty algorithm that finds documents similar to a given document. It seems that it &lt;a href=&quot;http://www.devon-technologies.com/support/faqs.php?cat=21&quot;&gt;supports the ScanSnap&lt;/a&gt; pretty well.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.109420-1575509</guid>
		<pubDate>Tue, 16 Dec 2008 23:48:52 -0800</pubDate>
		<dc:creator>suedehead</dc:creator>
	</item><item>
		<title>By: Happy Dave</title>
		<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs#1575510</link>	
		<description>I believe &lt;a href=&quot;http://evernote.com/&quot;&gt;Evernote&lt;/a&gt; will index and scan PDFs (and image files like JPGs) for text and make the lot searchable.</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.109420-1575510</guid>
		<pubDate>Tue, 16 Dec 2008 23:52:09 -0800</pubDate>
		<dc:creator>Happy Dave</dc:creator>
	</item><item>
		<title>By: cschneid</title>
		<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs#1577145</link>	
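		<description>The beginnings of an approach in totally untested Ruby.&lt;br&gt;
&lt;br&gt;
files = Dir[&quot;**/*.pdf&quot;]  # every PDF in this folder and its subfolders&lt;br&gt;
files.each do |file|&lt;br&gt;
  # &quot;-&quot; makes pdftotext write to stdout; the array form avoids shell quoting problems&lt;br&gt;
  content = IO.popen([&quot;pdftotext&quot;, file, &quot;-&quot;]) { |io| io.read }&lt;br&gt;
  # skip PDFs that already have extractable text&lt;br&gt;
  next unless content.strip.empty?&lt;br&gt;
&lt;br&gt;
  ## Now fire up the OCR job, and shuffle files around&lt;br&gt;
end&lt;br&gt;
&lt;br&gt;
It&apos;d take a fair amount of tweaking and playing to make it all work, but that&apos;s the approach I&apos;d use.  The pdftotext program doesn&apos;t come with OS X, but it can be installed through MacPorts as part of the xpdf package.</description>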
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.109420-1577145</guid>
		<pubDate>Thu, 18 Dec 2008 10:18:52 -0800</pubDate>
		<dc:creator>cschneid</dc:creator>
	</item><item>
		<title>By: randex8</title>
		<link>http://ask.metafilter.com/109420/How-do-I-seek-and-destroy-imageonly-PDFs#1577637</link>	
		<description>Thanks for the quick replies!</description>
		<guid isPermaLink="false">comment:ask.metafilter.com,2008:site.109420-1577637</guid>
		<pubDate>Thu, 18 Dec 2008 19:58:26 -0800</pubDate>
		<dc:creator>randex8</dc:creator>
	</item>
	</channel>
</rss>
