When Google just isn't enough.
February 6, 2009 10:58 AM

Can anyone recommend a powerful textual search tool I can use on anything I want?

So I'm looking for a powerful search tool for academic research. To clarify, I'm not having trouble finding sources. I want to be able to search within sources for inexact phrases.

As it turns out, Google is powerful in the sense that it can find terms almost anywhere, but the search engines on Westlaw and LexisNexis are ridiculously powerful in the operators they allow you to use. For example, x /s y finds x in the same sentence as y; x /p y finds x in the same paragraph as y; x /5 y finds x within five words of y; etc.

This is incredibly useful, especially if a term is used in more than one way but I'm only interested in one of them. I would like to be able to do this with arbitrary text documents from sources like Project Gutenberg, but I can't seem to get Google (or Google Desktop) to do this. Does anyone have ideas, either for improving my Google-fu or for an alternative search tool?

Web-based or Windows-compatible is fine, but I'd like to avoid paying for it if at all possible. Help me, hive mind!
posted by valkyryn to Technology (11 answers total) 14 users marked this as a favorite
Well, dtSearch would probably be what you want, but it doesn't quite meet the "avoid paying for it" criterion.
posted by trip and a half at 11:07 AM on February 6, 2009

It's not LexisNexis, but the command line tool 'agrep' does fuzzy text matches and might be useful for Gutenberg texts. I don't believe it will do the "within 5 words of" searches that you want, however.
posted by zippy at 11:09 AM on February 6, 2009

Lucene is free and lets you do "proximity queries" (among other things).
posted by zippy at 11:10 AM on February 6, 2009 [1 favorite]

Regular expressions are the tool you need, and a number of programming text editors support them in various flavors. There is a learning curve, but if you want really powerful search tools, this is where you want to go.
posted by gum at 11:12 AM on February 6, 2009

You might also want to have a look at Google API Proximity Search (GAPS).
posted by Tawita at 11:13 AM on February 6, 2009

Also, you can hack proximity searches in regular Google using the '*' wildcard operator.

For instance, if you're looking for "hamster" w/3 "dance" you can do these queries:

"hamster dance"
"hamster * dance" (separated by one word)
"hamster * * dance" (sep. by two words)
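
The list of queries above can be generated rather than typed by hand. This is only a minimal sketch: the function name `wildcard_queries` and its `max_gap` parameter are made up for illustration, and it assumes each '*' stands in for exactly one word, as described above. It only covers one word order, so swap the arguments to get the reverse.

```python
def wildcard_queries(first, second, max_gap):
    """Build quoted phrase queries with 0..max_gap '*' placeholders between two words.

    Hypothetical helper: each '*' is assumed to stand in for one word.
    """
    queries = []
    for gap in range(max_gap + 1):
        parts = [first] + ["*"] * gap + [second]
        queries.append('"' + " ".join(parts) + '"')
    return queries


# The three queries from the example above.
for query in wildcard_queries("hamster", "dance", 2):
    print(query)
```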
posted by zippy at 11:47 AM on February 6, 2009 [2 favorites]

Sounds like you want to set up a language corpus to search for collocates and other structures. It's been a while since I did this, so I don't know what's state of the art.

WordSmith Tools isn't free, but it was pretty decent for very complex searches a few years back, and is unlikely to have got worse. Version 3 is free for private use.

You're probably looking for corpus linguistics software, maybe more specifically concordance analysis. A slightly dated list of resources to get you started is here.
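
To give a rough idea of what a concordance looks like, here is a minimal keyword-in-context sketch in Python. It is not how WordSmith or any particular package works; the function name `concordance` and the `width` parameter are assumptions for illustration only. It shows every whole-word match with a fixed number of characters of context on each side.

```python
import re


def concordance(text, keyword, width=30):
    """Return keyword-in-context lines: `width` characters either side of each whole-word match."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    flat = " ".join(text.split())  # collapse line breaks and runs of whitespace
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


sample = "The bank of the river was muddy. She walked to the bank to open an account."
for line in concordance(sample, "bank", width=20):
    print(line)
```

Reading down the column of matches makes it easy to see which sense of a word each occurrence uses.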
posted by scruss at 11:56 AM on February 6, 2009

Do you want to search things on the web, or things after you've downloaded them to your local machine? Big difference.

If it's the latter, ditto gum, above. Regexps are the key. Combine with the standard UNIX command line tools (available in Windows through Cygwin) and a scripting language like Perl (or Ruby, or sed and awk, but probably not Python for this application -- it's not meant for command line pipelines, but it'd be fine if everything you were doing was complicated enough to justify its own script).

For instance, two words in the same paragraph (file must have UNIXish line endings instead of DOS):

perl -00 -ne 'print if /one/i and /two/i' filename.txt

Looking within n words of each other, or looking within the same sentence, quickly gets much more complicated (especially the latter, as figuring out what you want to interpret as a sentence isn't trivial, and will produce undesirable results on some texts no matter what you do).
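
For a sense of what the more complicated cases involve, here is a rough Python sketch of both. Everything in it is an assumption made for illustration: the names `within_n_words` and `same_sentence`, the definition of a word as a run of letters, digits, and apostrophes, the reading of "within n words" as word positions differing by at most n, and the naive rule that a sentence ends at '.', '!', or '?' followed by whitespace. That last rule will misfire on abbreviations like "Mr." and similar.

```python
import re

# Assumed definition of a word: letters, digits, and apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def within_n_words(text, x, y, n):
    """Return (x_position, y_position) pairs where the two words are at most n positions apart."""
    words = [w.lower() for w in WORD.findall(text)]
    xs = [i for i, w in enumerate(words) if w == x.lower()]
    ys = [i for i, w in enumerate(words) if w == y.lower()]
    return [(i, j) for i in xs for j in ys if i != j and abs(i - j) <= n]


def same_sentence(text, x, y):
    """Return sentences containing both words, using a naive sentence splitter."""
    flat = " ".join(text.split())  # collapse line breaks
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    hits = []
    for sentence in sentences:
        words = {w.lower() for w in WORD.findall(sentence)}
        if x.lower() in words and y.lower() in words:
            hits.append(sentence)
    return hits


sample = "The hamster began to dance. Later the dance ended. No hamster was harmed."
print(within_n_words(sample, "hamster", "dance", 3))
print(same_sentence(sample, "hamster", "dance"))
```

Note that the word-distance search happily matches across sentence boundaries, which may or may not be what you want.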
posted by Zed at 2:00 PM on February 6, 2009

Actually, that one-liner is going to fail for short words that'd be expected to occur within other words. And thinking about correcting for that in a way that recognizes words correctly regardless of proximity to punctuation... um, forget this route unless you're already a programmer or you want to be one.
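
For the words-inside-other-words problem specifically, the usual regexp answer is the word-boundary anchor `\b`, which exists in both Perl and Python. Here is a sketch of the same-paragraph search as a standalone Python script. The function name `paragraphs_with_both` is made up for illustration, and it assumes paragraphs are separated by blank lines.

```python
import re


def paragraphs_with_both(text, x, y):
    """Return blank-line-separated paragraphs containing both x and y as whole words."""
    text = text.replace("\r\n", "\n")  # tolerate DOS line endings
    pattern_x = re.compile(r"\b" + re.escape(x) + r"\b", re.IGNORECASE)
    pattern_y = re.compile(r"\b" + re.escape(y) + r"\b", re.IGNORECASE)
    paragraphs = re.split(r"\n\s*\n", text)
    return [p for p in paragraphs if pattern_x.search(p) and pattern_y.search(p)]


sample = (
    "The cat sat.\nOne dog, two birds.\r\n\r\n"
    "Someone saw a network.\n\n"
    "Only one here."
)
# "Someone" and "network" contain "one" and "two" as substrings,
# but the \b anchors keep that paragraph from matching.
for paragraph in paragraphs_with_both(sample, "one", "two"):
    print(paragraph)
```

`\b` handles adjacent punctuation, but it treats hyphens and apostrophes as boundaries too, so "two-way" would still count as containing "two".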
posted by Zed at 2:07 PM on February 6, 2009

Regexps are very powerful but I don't think they're the right tool for this job — you want some sort of specialized natural-language-search engine.
posted by hattifattener at 2:51 PM on February 6, 2009

You can still find the old 'htdig', which did remarkable things with text search (explicitly for web pages, but it handled plain text too). It's sort of the same thing as Lucene. Otherwise I'd have to point you toward crafting your own specific search engine. Fast computers and regexps are fine if you know them; with large data, you craft your own. Still, it turns out to be more "find things with these words and prune the ones you don't want," so it's Lucene-type stuff with regexps after.
posted by zengargoyle at 11:48 AM on February 7, 2009
