

"Nah, boss; I'm just a speedreader"
February 12, 2014 9:32 AM

Before building my own... does this software exist? I need to search a collection of PDF & Word files for key phrases, and dump the surrounding lines (x-m to x+n characters/lines, where x is the found phrase) into text files. I occasionally need to search a few dozen files for a few dozen data items, which usually have some identifying text nearby. This needs to be automated. Big bonus if it implements OCR, but that's not essential. Freeware, or cheapware, obviously is best. Windows-based is preferable, but I can do Linux.
posted by IAmBroom to Computers & Internet (9 answers total) 7 users marked this as a favorite
 
How much of the automation do you want? I want to think SharePoint can do the indexing and search, but I'm not as sure about the surrounding text.

DIY: the Lucene, PDFBox and POI libs (Java) will give you almost everything you want. (As in, I've done more or less what you want using those three libs.)

I don't know about OCR.
posted by k5.user at 9:36 AM on February 12 [1 favorite]


I don't know if there is something cheap that does what you want (I know some electronic discovery software will do it at high cost), but if you roll your own, to build on k5.user's answer, tesseract-ocr is freeware OCR software.
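
For scanned PDFs, a rough sketch of the OCR step might look like the following. It assumes pdftoppm (from poppler-utils) and tesseract are both on the PATH; ocr_pdf is a made-up helper name, and 300 dpi is only a guess at a reasonable resolution.

```shell
# Hypothetical helper: ocr_pdf INPUT_PDF OUTPUT_TXT
# Renders every page of the PDF to a PNG, OCRs each page, and
# concatenates the recognized text into OUTPUT_TXT.
ocr_pdf() {
    pdf=$1
    out_txt=$2
    work=$(mktemp -d)            # scratch space for the page images
    # Writes page-1.png, page-2.png, ... (zero-padded for long documents).
    pdftoppm -r 300 -png "$pdf" "$work/page"
    : > "$out_txt"               # start with an empty output file
    for img in "$work"/page-*.png; do
        [ -f "$img" ] || continue
        # tesseract IMAGE OUTPUTBASE writes its text to OUTPUTBASE.txt
        tesseract "$img" "${img%.png}" >/dev/null 2>&1
        cat "${img%.png}.txt" >> "$out_txt"
    done
    rm -rf "$work"
}
```

Usage would be something like ocr_pdf scan.pdf scan.txt, after which the text file can be searched like any other. OCR quality depends heavily on the scan, so expect to spot-check the results.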
posted by procrastination at 10:43 AM on February 12


Microsoft OneNote does this. I'm afraid I don't have a copy here to see how the PDF search works, and see how easy it is to OCR the PDF.

But it's included in all versions of Office.
posted by ambrosen at 11:00 AM on February 12


FYI - "lines" is a funny concept in PDF, especially when the distinction between text and text painted in an image on the page is not always apparent to the reader.

That said, let's just say for grins that you can code in VB or C# and downloaded a trial version of my company's products and were OK just dealing with PDFs or images in general. Then you could write something like this:

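// Requires: using System; using System.Globalization; using System.IO; using System.Text;
// plus the vendor's imaging and OCR namespaces. Assumes the class has fields _ocrEngine
// (the OCR engine), _m and _n (characters of context to keep before and after a hit).
public void SearchPdfs(string[] files, string outputDir, string searchTerm)
{
    RegisteredDecoders.Add(new PdfDecoder()); // not registered by default; the PDF decoder is an add-on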
    foreach (string file in files) {
        SearchPdf(file, outputDir, searchTerm);
    }
}

public void SearchPdf(string file, string outputDir, string searchTerm)
{
    string outfile = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".txt");
    using (StreamWriter writer = new StreamWriter(outfile)) {
        FileSystemImageSource source = new FileSystemImageSource(file, true);
        SearchPdf(source, writer, searchTerm);
    }
}

public void SearchPdf(ImageSource source, TextWriter writer, string searchTerm)
{
    _ocrEngine.Initialize();
    int i = 0;
    try {
        OcrDocument doc = _ocrEngine.Recognize(source);
        foreach (OcrPage page in doc.Pages) {
            SearchPage(i, page, writer, searchTerm);
            i++;
        }
    }
    finally { _ocrEngine.ShutDown(); }
}

public void SearchPage(int pageNo, OcrPage page, TextWriter writer, string searchTerm)
{
    string textInPage = GetTextInPage(page);
    // Culture-aware, case-insensitive search; only the first hit on each page is reported.
    int index = _ocrEngine.RecognitionCulture.CompareInfo.IndexOf(textInPage, searchTerm,
                           CompareOptions.IgnoreCase);
    if (index >= 0) {
        // Keep _m characters before the hit and _n characters after it, clamped to the page text.
        int start = Math.Max(index - _m, 0);
        int end = Math.Min(textInPage.Length, index + searchTerm.Length + _n);

        writer.WriteLine("Found on page " + pageNo + ": " + textInPage.Substring(start, end - start));
    }
}


public string GetTextInPage(OcrPage page)
{
     StringBuilder builder = new StringBuilder();
     foreach (OcrRegion region in page.Regions) {
         // Non-text regions (e.g. images) fail the cast and are skipped.
         OcrTextRegion textRgn = region as OcrTextRegion;
         if (textRgn != null) builder.Append(textRgn.Text);
     }
     return builder.ToString();
}


Notes:
posted by plinth at 11:22 AM on February 12 [2 favorites]


This sounds a little like concordancing. Maybe try this site.
posted by mukade at 1:09 PM on February 12


Agent Ransack gets you very close; the only thing is that it doesn't have a way to configure how much surrounding text is dumped; it just dumps whatever else is on the same CR/LF-delimited line. It's possible that their paid product, FileLocator Pro, has that feature, but I haven't investigated that. Neither of them does OCR, though, so you'll need to find something else to do the OCR dumps first.
posted by Aleyn at 2:51 PM on February 12


On Linux, I would use a combination of pdftotext, libreoffice --convert-to, and grep -C n. For example,
$ pdftotext document.pdf document.txt
$ grep -C 5 'phrase' document.txt
will get you every line with phrase in it, and 5 lines on either side. It will be more or less accurate depending on the layout of the PDF and how well pdftotext deals with it.
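
If a window of characters (x-m to x+n) is closer to what you want than whole lines, one possible sketch uses grep -o with a bounded pattern. char_context is a made-up helper name; -o is a GNU/BSD grep extension rather than POSIX, and the sketch assumes the phrase contains no regex metacharacters and that the context you want is on the same line as the hit.

```shell
# Hypothetical helper: char_context PHRASE CHARS_BEFORE CHARS_AFTER FILE
# Prints each hit with up to CHARS_BEFORE characters before it and up to
# CHARS_AFTER characters after it, one hit per output line.
char_context() {
    phrase=$1
    before=$2
    after=$3
    file=$4
    # -o prints only the matched text, -i ignores case,
    # -E enables the {min,max} repetition syntax.
    grep -o -i -E ".{0,${before}}${phrase}.{0,${after}}" "$file"
}
```

For example, char_context 'phrase' 40 80 document.txt would print each hit with up to 40 characters before it and 80 after. Because grep works line by line, the window never crosses a line break.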

For Word documents,
$ libreoffice --convert-to txt:Text document.doc
$ grep -C 5 'phrase' document.txt
If you have LibreOffice installed in Windows and want to use the command line there, the option is -convert-to (single hyphen).
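
To automate this over a folder of files, a loop along these lines might do. This is only a sketch: search_dir and the .hits.txt naming are made up for illustration, and it assumes pdftotext and libreoffice are on the PATH and that the input folder holds .pdf, .doc, .docx or .txt files.

```shell
# Hypothetical helper: search_dir PHRASE CONTEXT_LINES INPUT_DIR OUTPUT_DIR
# Converts each PDF or Word file to text, then writes the hits (with
# context lines) to OUTPUT_DIR/<original name>.hits.txt.
# Files with no hit get no output file.
search_dir() {
    phrase=$1
    ctx=$2
    in_dir=$3
    out_dir=$4
    work=$(mktemp -d)            # scratch space for the converted text
    mkdir -p "$out_dir"
    for f in "$in_dir"/*; do
        [ -f "$f" ] || continue
        base=$(basename "$f")
        txt="$work/${base%.*}.txt"
        case "$f" in
            *.pdf)        pdftotext "$f" "$txt" ;;
            *.doc|*.docx) libreoffice --headless --convert-to txt:Text \
                              --outdir "$work" "$f" >/dev/null 2>&1 ;;
            *.txt)        cp "$f" "$txt" ;;
            *)            continue ;;
        esac
        hits="$out_dir/$base.hits.txt"
        # -F treats the phrase literally, -i ignores case, -C adds context lines.
        grep -i -F -C "$ctx" -e "$phrase" "$txt" > "$hits" || rm -f "$hits"
    done
    rm -rf "$work"
}
```

Something like search_dir 'phrase' 5 ./docs ./hits would then leave one text file per matching document. For a few dozen phrases, the call can be wrapped in another loop that reads the phrases from a file.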
posted by WasabiFlux at 6:11 PM on February 12 [1 favorite]


Try In-Com's SmartTS. It's not cheap but it's great software.
posted by CathyG at 9:40 PM on February 12


WOW!!! Glad I asked!

There's such a wealth of possibilities here, I'm going to have to think about how to start winnowing.

Special shout-out to plinth for his generous, off-the-cuff coding.

Thanks, everybody!
posted by IAmBroom at 1:52 PM on February 13

