Comments on: When Google just isn't enough.

Question: When Google just isn't enough.

valkyryn — Fri, 06 Feb 2009 10:58:49 -0800

Can anyone recommend a powerful textual search tool I can use on anything I want?

So I'm looking for a powerful search tool for academic research. To clarify, I'm not having trouble finding sources. I want to be able to search within sources for inexact phrases.

As it turns out, Google is powerful in the sense that it can find terms almost anywhere, but the search engines on WestLaw and LexisNexis are ridiculously powerful in the search operators they allow you to use. For example, x /s y finds x in the same sentence as y; x /p y finds x in the same paragraph as y; x /5 y finds x within five words of y; and so on.

This is incredibly useful, especially if a term is used in more than one way but I'm only interested in one of them. I would like to be able to do this with arbitrary text documents from sources like Project Gutenberg, but I can't seem to get Google (or Google Desktop) to do this. Does anyone have any ideas, either for improving my Google-fu or for an alternative search tool?

Web-based or Windows-compatible is fine, but I'd like to avoid paying for it if at all possible. Help me, hive mind!

trip and a half — Fri, 06 Feb 2009 11:07:55 -0800

Well, dtSearch would probably be what you want, but it doesn't quite meet the "avoid paying for it" criterion.

zippy — Fri, 06 Feb 2009 11:09:07 -0800

It's not LexisNexis, but the command line tool 'agrep' does fuzzy text matches and might be useful for Gutenberg texts. I don't believe it will do the "within 5 words of" searches that you want, however.

zippy — Fri, 06 Feb 2009 11:10:23 -0800

Lucene is free and lets you do "proximity queries" (among other things).

gum — Fri, 06 Feb 2009 11:12:08 -0800

Regular expressions are the tool you need, and a number of programming text editors support them in various flavors. There is a learning curve, but if you want really powerful search tools, this is where you want to go.
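
To make that concrete, here is a rough sketch of a "within n words" pattern written as a standalone Python script. The function name `within_n_words` is made up for illustration, "word" here just means a run of letters, digits, or underscores, and the pattern only checks one order (first word, then second), so you would run it both ways to cover both orders.

```python
import re

def within_n_words(first, second, n):
    """Compile a regex: `first`, then at most n other words, then `second`."""
    pattern = r"\b{a}\W+(?:\w+\W+){{0,{n}}}?{b}\b".format(
        a=re.escape(first),   # escape in case the search term has regex metacharacters
        b=re.escape(second),
        n=n,
    )
    return re.compile(pattern, re.IGNORECASE)

rx = within_n_words("hamster", "dance", 3)
print(bool(rx.search("The hamster began to dance.")))        # two words in between
print(bool(rx.search("hamster one two three four dance")))   # four words in between
```

The `\b` anchors keep "hamster" from matching inside "hamsters", which is the kind of detail that makes up most of the learning curve.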

Tawita — Fri, 06 Feb 2009 11:13:41 -0800

You might also want to have a look at Google API Proximity Search (GAPS).

zippy — Fri, 06 Feb 2009 11:47:39 -0800

Also, you can hack proximity searches in regular Google using the '*' wildcard operator.

For instance, if you're looking for "hamster" w/3 "dance" you can do these queries:

"hamster dance"
"hamster * dance" (separated by one word)
"hamster * * dance" (sep. by two words)

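If you don't want to type those out by hand, a tiny script can generate the list. This is only a sketch: `wildcard_queries` is a hypothetical helper name, and it takes on faith that Google treats each '*' as a single word, which is the assumption behind the hack rather than anything guaranteed.

```python
def wildcard_queries(first, second, max_gap):
    """Return quoted phrase queries with 0..max_gap '*' placeholders between the words."""
    queries = []
    for gap in range(max_gap + 1):
        parts = [first] + ["*"] * gap + [second]
        queries.append('"' + " ".join(parts) + '"')
    return queries

for query in wildcard_queries("hamster", "dance", 2):
    print(query)
```

Swap the two words and run it again if you also want "dance" before "hamster".
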
scruss — Fri, 06 Feb 2009 11:56:29 -0800

Sounds like you want to set up a language corpus to search for collocates and other structures. It's been a while since I did this, so I don't know what's state of the art.

WordSmith Tools isn't free, but it was pretty decent for very complex searches a few years back, and is unlikely to have got worse. Version 3 is free for private use.

You're probably looking for corpus linguistics software, maybe more specifically concordance analysis. A slightly dated list of resources to get you started is here.

Zed — Fri, 06 Feb 2009 14:00:21 -0800

Do you want to search things on the web, or things after you've downloaded them to your local machine? Big difference.

If it's the latter, ditto gum, above. Regexps are the key. Combine them with the standard UNIX command-line tools (available in Windows through Cygwin) and a scripting language like Perl (or Ruby, or sed and awk, but probably not Python for this application: it's not meant for command-line pipelines, though it'd be fine if everything you were doing was complicated enough to justify its own script).

For instance, two words in the same paragraph (file must have UNIXish line endings instead of DOS):

perl -00 -ne 'print if /one/i and /two/i' filename.txt

Looking within n words of each other, or within the same sentence, quickly gets much more complicated (especially the latter, as figuring out what you want to interpret as a sentence isn't trivial, and will produce undesirable results on some texts no matter what you do).
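
To give a feel for the same-sentence case, here is a deliberately naive sketch as a standalone Python script. The big assumption, flagged in the code too, is that a sentence ends at '.', '!' or '?' followed by whitespace, so abbreviations like "Mr." or "e.g." will split sentences in the wrong place. The name `same_sentence` is made up for illustration.

```python
import re

# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def same_sentence(text, first, second):
    """Return the sentences that contain both words as whole words (case-insensitive)."""
    a = re.compile(r"\b" + re.escape(first) + r"\b", re.IGNORECASE)
    b = re.compile(r"\b" + re.escape(second) + r"\b", re.IGNORECASE)
    hits = []
    for sentence in SENTENCE_END.split(text):
        if a.search(sentence) and b.search(sentence):
            hits.append(sentence.strip())
    return hits

sample = "The hamster sat still. Then the hamster began to dance! Nobody saw the dance."
print(same_sentence(sample, "hamster", "dance"))
```

Anything smarter about sentence boundaries is where the real work is, which is the point being made above.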

Zed — Fri, 06 Feb 2009 14:07:49 -0800

Actually, that one-liner is going to fail for short words that'd be expected to occur within other words. And thinking about correcting for that in a way that recognizes words correctly regardless of proximity to punctuation... um, forget this route unless you're already a programmer or you want to be one.
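
For anyone who does want to go down that route, one way to sidestep both problems is to split the text into words first and compare whole tokens, rather than searching the raw string. A minimal Python sketch, assuming a "word" is any run of letters, digits, or underscores (so "don't" becomes two tokens); the name `near` is hypothetical:

```python
import re

def near(text, first, second, n):
    """Return (i, j) token positions where the two words are at most n tokens apart."""
    # Tokenizing on runs of word characters drops punctuation and keeps a
    # short word from matching inside a longer one ("cat" vs. "category").
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    first, second = first.lower(), second.lower()
    pos_a = [i for i, t in enumerate(tokens) if t == first]
    pos_b = [i for i, t in enumerate(tokens) if t == second]
    # Either order counts; positions must differ by 1..n tokens.
    return [(i, j) for i in pos_a for j in pos_b if 0 < abs(i - j) <= n]

print(near("The cat, oddly, sat. A category is not a cat!", "cat", "sat", 3))
```

Comparing every pair of positions is fine for a single book-length text; for a big collection you'd want an index, which is what the search-engine suggestions in this thread are for.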

hattifattener — Fri, 06 Feb 2009 14:51:21 -0800

Regexps are very powerful but I don't think they're the right tool for this job — you want some sort of specialized natural-language-search engine.

zengargoyle — Sat, 07 Feb 2009 11:48:00 -0800

You can still find the old 'htdig', which did remarkable things with text searches (web pages specifically, but it handled plain text also). Sorta the same thing as Lucene. Otherwise I'd have to point you toward crafting your own specialized search engine. Fast computers and regexps are fine if you know them; with large data, you craft your own. Still, it usually turns out to be more "find things with these words and prune the ones you don't want," so it's Lucene-type stuff with regexps after.