"Nah, boss; I'm just a speedreader"
February 12, 2014 9:32 AM

Before building my own... does this software exist? I need to search a collection of PDF & Word files for key phrases, and dump the surrounding lines (x-m to x+n characters/lines, where x is the found phrase) into text files. I occasionally need to search a few dozen files for a few dozen data items, which usually have some identifying text nearby. This needs to be automated. Big bonus if it implements OCR, but that's not essential. Freeware, or cheapware, obviously is best. Windows-based is preferable, but I can do Linux.
posted by IAmBroom to Computers & Internet (9 answers total) 8 users marked this as a favorite
 
Best answer: How much of the automation do you want? I want to think SharePoint can do the indexing and search, but I'm not as sure about dumping the surrounding text.

DIY: the Lucene, PDFBox, and POI libraries (Java) will give you almost everything you want (as in, I've done more or less what you want using those three libraries).

I don't know about OCR.
posted by k5.user at 9:36 AM on February 12, 2014 [1 favorite]


Best answer: I don't know if there is something cheap that does what you want (I know some electronic discovery software will do it at high cost), but if you roll your own, to build on k5.user's answer, tesseract-ocr is free, open-source OCR software.
posted by procrastination at 10:43 AM on February 12, 2014


Best answer: Microsoft OneNote does this. I'm afraid I don't have a copy here to see how the PDF search works, and see how easy it is to OCR the PDF.

But it's included in all versions of Office.
posted by ambrosen at 11:00 AM on February 12, 2014


Best answer: FYI - "lines" is a funny concept in PDF, especially when the distinction between text and text painted in an image on the page is not always apparent to the reader.

That said, let's just say for grins that you can code in VB or C# and downloaded a trial version of my company's products and were OK just dealing with PDFs or images in general. Then you could write something like this:

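// Framework namespaces used below; the vendor's imaging/OCR namespaces are omitted (see the product's sample code).
using System;
using System.Globalization;
using System.IO;
using System.Text;

public void SearchPdfs(string[] files, string outputDir, string searchTerm)
{
    RegisteredDecoders.Add(new PdfDecoder()); // not registered by default; it's an add-on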
    foreach (string file in files) {
        SearchPdf(file, outputDir, searchTerm);
    }
}

public void SearchPdf(string file, string outputDir, string searchTerm)
{
    string outfile = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".txt");
    using (StreamWriter writer = new StreamWriter(outfile)) {
        FileSystemImageSource source = new FileSystemImageSource(file, true);
        SearchPdf(source, writer, searchTerm);
    }
}

public void SearchPdf(ImageSource source, TextWriter writer, string searchTerm)
{
    _ocrEngine.Initialize();
    int i = 0;
    try {
        OcrDocument doc = _ocrEngine.Recognize(source);
        foreach (OcrPage page in doc.Pages) {
            SearchPage(i, page, writer, searchTerm);
            i++;
        }
    }
    finally { _ocrEngine.ShutDown(); }
}

public void SearchPage(int pageNo, OcrPage page, TextWriter writer, string searchTerm)
{
    string textInPage = GetTextInPage(page);
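    // Case-insensitive search; only the first hit on each page is reported.
    int index = _ocrEngine.RecognitionCulture.CompareInfo.IndexOf(textInPage, searchTerm,
                           CompareOptions.IgnoreCase);
    if (index >= 0) {
        // _m and _n are member fields: characters of context before and after the hit.
        int start = Math.Max(index - _m, 0);
        int end = Math.Min(textInPage.Length, index + searchTerm.Length + _n);

        // pageNo is zero-based here.
        writer.WriteLine("Found on page " + pageNo + ": " + textInPage.Substring(start, end - start));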
    }
}


public string GetTextInPage(OcrPage page)
{
     StringBuilder builder = new StringBuilder();
     foreach (OcrRegion region in page.Regions) {
         OcrTextRegion textRgn = region as OcrTextRegion;
         if (textRgn != null) builder.Append(textRgn.Text); // skip non-text regions
     }
     return builder.ToString();
}


Notes:
  • I wrote this off the top of my head. It's likely to have syntax errors.
  • I don't show you how to make the member variable _ocrEngine. Best to look at our sample code - we license several OCR engines and nearly every manufacturer goes out of their way to make the licensing process challenging. I don't say this lightly. I recommend using the GlyphReader engine if the source is academic papers.
  • This is intentionally a simple solution. It will probably cap out on documents with 300-350 pages, and it is not the way I would do it for anything that needed to be super-reliable.
  • If you know you have real text in the documents (as opposed to images), there's a better way to do this with our PDF text extraction tools.
  • With an evaluation license, you'll get a month to try this out. If you want it past that, you'll get hounded by our alert and tenacious sales department, who will want you to commit.
  • You asked for cheap/free. This is both, unless you need to go past the eval; then this particular solution will be very expensive. You get what you pay for.
  • You could probably do something similar in a shell script with Ghostscript and Tesseract. You get what you pay for.
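
For what it's worth, the shell-script route in that last note might look roughly like the sketch below. It assumes Ghostscript (gs) and Tesseract (tesseract) are installed and on your PATH. The helper name ocr_pdf is made up for illustration, and 300 dpi grayscale is only a guess at a reasonable OCR resolution, not a tuned setting.

```shell
#!/bin/sh
# Sketch only: OCR one PDF into a single text file.
# Assumes `gs` (Ghostscript) and `tesseract` are installed and on PATH.
# ocr_pdf is a hypothetical helper name, not part of either tool.
ocr_pdf() {
    pdf=$1    # input PDF
    out=$2    # output text file
    workdir=$(mktemp -d) || return 1

    # Render every page to a zero-padded, 300 dpi grayscale PNG.
    if ! gs -q -dNOPAUSE -dBATCH -sDEVICE=pnggray -r300 \
            -sOutputFile="$workdir/page-%04d.png" "$pdf"; then
        rm -rf "$workdir"
        return 1
    fi

    : > "$out"
    for img in "$workdir"/page-*.png; do
        [ -e "$img" ] || continue
        # tesseract writes its result to <outputbase>.txt
        tesseract "$img" "${img%.png}" >/dev/null 2>&1 || continue
        cat "${img%.png}.txt" >> "$out"
    done
    rm -rf "$workdir"
}

# Usage: ocr_pdf scanned.pdf scanned.txt && grep -i -C 5 'phrase' scanned.txt
```

Once each PDF has been turned into plain text this way, the searching itself is ordinary text processing. Expect the OCR step to dominate the run time.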

posted by plinth at 11:22 AM on February 12, 2014 [2 favorites]


Best answer: This sounds a little like concordancing. Maybe try this site.
posted by mukade at 1:09 PM on February 12, 2014


Best answer: Agent Ransack gets you very close. The only thing is that it doesn't have a way to configure how much surrounding text is dumped; it just dumps whatever else is on the same CR/LF-delimited line. It's possible that their paid product, FileLocator Pro, has that feature, but I haven't investigated that. Neither of them does OCR, though, so you'll need to find something else to do the OCR dumps first.
posted by Aleyn at 2:51 PM on February 12, 2014


Best answer: On Linux, I would use a combination of pdftotext, libreoffice --convert-to, and grep -C n. For example,
$ pdftotext document.pdf document.txt
$ grep -C 5 'phrase' document.txt
will get you every line with phrase in it, and 5 lines on either side. It will be more or less accurate depending on the layout of the PDF and how well pdftotext deals with it.
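
If you want a fixed number of characters around the phrase rather than whole lines, one possible approach is grep's -o option with a bounded wildcard on each side. This is only a sketch: char_context is a made-up helper name, it assumes a grep that supports -o (GNU and BSD grep do), and the context cannot cross a line break.

```shell
#!/bin/sh
# Sketch only: print up to M characters before and N characters after each hit.
# char_context is a hypothetical helper name. Assumes a grep that supports -o.
# The phrase is used as an extended regex, so escape any regex metacharacters.
char_context() {
    before=$1    # characters of context before the phrase
    after=$2     # characters of context after the phrase
    phrase=$3    # search phrase (treated as an extended regex)
    file=$4      # text file to search
    grep -o -i -E ".{0,$before}$phrase.{0,$after}" "$file"
}

# Usage: char_context 40 80 'invoice total' document.txt
```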

For Word documents,
$ libreoffice --convert-to txt:Text document.doc
$ grep -C 5 'phrase' document.txt
If you have LibreOffice installed in Windows and want to use the command line there, the option is -convert-to (single hyphen).
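
To automate this over a few dozen files and a few dozen phrases, something like the following sketch might do. It assumes the documents have already been converted to .txt as above. The helper name search_dir, the one-phrase-per-line phrase file, and the .hits.txt naming are all illustrative choices, not anything the tools require.

```shell
#!/bin/sh
# Sketch only: search every .txt file in a directory for a list of phrases
# and write the hits, with surrounding lines, to one output file per document.
# search_dir and the ".hits.txt" naming are illustrative assumptions.
search_dir() {
    txt_dir=$1         # directory of already-converted .txt files
    phrase_file=$2     # one search phrase per line, no blank lines
    out_dir=$3         # where the context dumps go (keep it separate from txt_dir)
    context=${4:-5}    # lines of context on either side, default 5
    mkdir -p "$out_dir"
    for txt in "$txt_dir"/*.txt; do
        [ -e "$txt" ] || continue
        base=$(basename "$txt" .txt)
        # -i: ignore case, -F: literal phrases, -f: read phrases from a file
        grep -i -F -C "$context" -f "$phrase_file" "$txt" > "$out_dir/$base.hits.txt" \
            || rm -f "$out_dir/$base.hits.txt"    # no match: drop the empty file
    done
}

# Usage: search_dir ./converted phrases.txt ./hits 5
```

A blank line in the phrase file would match every line with -F, so keep that file clean.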
posted by WasabiFlux at 6:11 PM on February 12, 2014 [1 favorite]


Best answer: Try In-Com's SmartTS. It's not cheap but it's great software.
posted by CathyG at 9:40 PM on February 12, 2014


Response by poster: WOW!!! Glad I asked!

There's such a wealth of possibilities here, I'm going to have to think about how to start winnowing.

Special shout-out to plinth for his generous, off-the-cuff coding.

Thanks, everybody!
posted by IAmBroom at 1:52 PM on February 13, 2014

