A program that searches .tif files?
December 5, 2010 9:40 AM
Is there an inexpensive or free way to search a mass quantity of .tif files for words or phrases within the documents?
I have 10 CDs of documents with 10K+ documents overall. Most of the information is not pertinent to what I need, but there very well may be 2 or 3 things I need for a project I am working on. I know there are eDiscovery products like Concordance and Summation that are made for this, but they are not cost-effective and this may be the only time I will need to do this.
I have been using IrfanView to run the docs as a slide show, but this is very time-consuming. Any ideas?
Best answer: Get a trial version of Acrobat and make it batch-convert the TIFFs to PDFs. Then batch-OCR the PDFs. Optionally batch-index them. Done. Probably still 2 to 3 hours of work, though.
posted by oxit at 9:48 AM on December 5, 2010
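Once the batch OCR is done, the remaining step is the search itself. Here is a minimal sketch of that step, assuming the recognized text has been exported to one plain-text file per document. The export step, the folder layout, and the function names `normalize` and `search_ocr_text` are all hypothetical and not something from this thread. It is a stand-in for Acrobat's own index and search, not a description of how Acrobat works.

```python
import re
from pathlib import Path


def normalize(text):
    """Lower-case the text and collapse whitespace so phrases match across line breaks."""
    return re.sub(r"\s+", " ", text).strip().lower()


def search_ocr_text(text_dir, phrases):
    """Return {phrase: [file names]} for each phrase found in the .txt files under text_dir.

    Assumption: text_dir holds one exported OCR text file per scanned document.
    """
    wanted = {phrase: normalize(phrase) for phrase in phrases}
    hits = {phrase: [] for phrase in phrases}
    for path in sorted(Path(text_dir).rglob("*.txt")):
        # errors="replace" keeps the scan going if the OCR output has odd bytes
        content = normalize(path.read_text(encoding="utf-8", errors="replace"))
        for phrase, needle in wanted.items():
            if needle in content:
                hits[phrase].append(path.name)
    return hits
```

Calling `search_ocr_text("ocr_text", ["purchase order"])` would list every text file in a (placeholder) `ocr_text` folder that contains the phrase. Matching is exact after case and whitespace normalization, so OCR misreads will still be missed; searching for a few spelling variants of each phrase helps.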
Seconding batch conversion and running through Evernote... but watch the upload limit!
posted by Master Gunner at 10:53 AM on December 5, 2010
Best answer: oxit has it.
Specifically, File->Create PDF->From Multiple Files (namely, all the TIFFs, temporarily copied onto your local hard drive) and then Document->OCR Text Recognition->Recognize Text using OCR.
It will take a while to run and the accuracy will, of course, vary with the quality of the image. DO NOT convert them to JPEGs, because JPEG is a lossy compression format that will give the OCR algorithm less data and reduce its accuracy.
If your images have no pagination or other embedded reference marks, note the order in which Adobe reads them in, so you can cite your discoveries later.
posted by d. z. wang at 11:11 AM on December 5, 2010
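For the citation point above, one way to keep track of the order is to write it down before combining the files. This is a rough sketch that assumes the TIFFs get added in name-sorted order and that each TIFF is a single page; check both against what Acrobat actually does on your machine. `build_manifest` and the CSV column names are made up for illustration.

```python
import csv
from pathlib import Path


def build_manifest(tiff_dir, manifest_path):
    """Write a CSV of (sequence number, file name) for each TIFF in tiff_dir.

    Assumption: files are combined in name-sorted order, one page per file,
    so the sequence number lines up with the page number in the combined PDF.
    """
    tiffs = sorted(
        (p for p in Path(tiff_dir).iterdir() if p.suffix.lower() in {".tif", ".tiff"}),
        key=lambda p: p.name,
    )
    with open(manifest_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "file_name"])
        for number, path in enumerate(tiffs, start=1):
            writer.writerow([number, path.name])
    return len(tiffs)
```

Running it once per CD folder gives you a lookup table from page position back to the original file name, which is what you would cite.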
For optimum clarity, I would first run them through Scan Tailor to clean them up a little if they are scanned.
Then use either ABBYY or Acrobat to convert them to searchable PDFs. I prefer ABBYY, but YMMV.
It would take an hour or so to set it all up; then it can be batch processed while you are away from the computer.
posted by moochoo at 1:34 PM on December 5, 2010
If you have Microsoft Office (XP up to 2007), you already have an OCR tool that will work directly with the TIF files; no need to convert to PDF.
Instructions here!
posted by geodave at 6:56 PM on December 5, 2010
If you batch-convert the images to JPG/PNG, then you can use Google Docs to achieve this.
posted by asymptotic at 5:43 AM on December 6, 2010
This thread is closed to new comments.
posted by kdern at 9:45 AM on December 5, 2010