Synonym Finder?
February 6, 2009 8:08 AM

Is there an automated way to generate potential search terms for a given expression?

I'm trying out an idea I had about search engine queries and keep getting stuck. Ideally, I'd like to generate a file (CSV, XML, etc.) with a large listing of potential search terms, send the file to ___, and receive back a file that lists, for each given search term:

1. Synonyms
2. Related Terms
3. Common Misspellings

For example, if I entered the word 'Carpenter', I'd like to see:

1. Woodworker, Builder, Craftsman
2. Wood, Hammer, Nail, Repair, Fix
3. Caprenter

Does anyone have any ideas? I've had decent luck doing this manually, but I'd like to feed a large list (3-5K) of unique terms.
posted by jmevius to Computers & Internet (9 answers total) 2 users marked this as a favorite
I use Oracle Text to do a lot of neat stuff like this, e.g., searching headlines for "astrology" returns headlines with "scorpio", etc. All of this is bundled right into the Oracle Text product.

This AskTom article has some insight into the process they use for term-building.

All this is probably of little use to you though - I imagine if you had access to Oracle you would already be using it.
posted by H. Roark at 8:30 AM on February 6, 2009

I think eBay does this to some degree, since you no longer need to use Fat Fingers to find misspelled entries, which got rid of a cool trick. At least, it's true for common misspellings, and it usually recommends a few categories, although that might be based on results, which gets you the same information, only generated on the fly.
posted by mccarty.tim at 8:38 AM on February 6, 2009

Response by poster: To answer a question above, I don't have access to Oracle. The article does address some of the questions I have however. Thanks for the find.

Re: Fat Fingers -- that is one of the things I thought of when I started working on this. Any way to extract results?
posted by jmevius at 9:11 AM on February 6, 2009

You could use the Suggested Upper Merged Ontology data for this. In this sense, an ontology is an organization of concepts and words according to their meaning. SUMO has a notion of narrower, broader, and specific meanings and is designed to be used by computers. Thus, your program could suggest broader terms (hypernyms), narrower terms (hyponyms), and what are called instance hyponyms (e.g., proper names).

As an example, from 'carpenter' you can move around the ontology to get woodman, woodsman, woodworker, Joseph, artificer, artisan, craftsman, journeyman, cabinetmaker, furniture maker, joiner, splicer, carver, woodcarver.
posted by jedicus at 9:51 AM on February 6, 2009 [2 favorites]

Off-topic, but Google has its own internal synonym/ontology system which is activated by a tilde at the start of the term.

So one automated way to search for "carpenter" synonyms would just be to search for "~carpenter".
posted by AmbroseChapel at 3:32 PM on February 8, 2009

Response by poster: AmbroseChapel - correct me if I'm wrong, but doesn't the ~ operator only return results from synonym searches? For example, if I searched for "~carpenter", it might return results for carpenter or woodworker.

What I was looking for was a way to submit carpenter and get back the result "woodworker" (plus whatever other synonyms Google is smart enough to know). Am I misinterpreting that?
posted by jmevius at 9:30 AM on February 9, 2009

WordNet will find you all synonyms with as much detail as one can hope for. Here are the results for carpenter. You can use it online, or download and install it locally if you have many queries to perform.

Misspellings are tricky; it's easy to go from a misspelled word to the correct one, but I don't know of any available software to automatically generate mistakes. There are two viable solutions: downloading a corpus or using heuristics. A corpus like those found here will provide you with common real-life misspellings, with the downside that their number is limited to a few thousand. Alternatively, if you can program a little, you can create a simple generator that applies a few rules to a word to generate mistakes: remove/insert double letters, exchange ei/ai, on/an, etc.
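To make the heuristic approach concrete, here is a minimal sketch of such a generator in Python. The function name `misspellings` and the exact rule set (adjacent-letter swaps, double-letter removal and insertion, ei/ai and on/an exchanges) are illustrative assumptions, not an established tool, and the output will include many variants nobody actually types.

```python
def misspellings(word):
    """Generate plausible misspellings of `word` from a few simple rules.

    The rules are illustrative assumptions, not a validated error model.
    """
    word = word.lower()
    variants = set()

    # Rule 1: swap each pair of adjacent, different letters
    # (carpenter -> caprenter).
    for i in range(len(word) - 1):
        if word[i] != word[i + 1]:
            variants.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])

    # Rule 2: collapse a double letter into a single one (hammer -> hamer).
    for i in range(len(word) - 1):
        if word[i] == word[i + 1]:
            variants.add(word[:i] + word[i + 1:])

    # Rule 3: double a letter that is not already part of a double.
    for i, ch in enumerate(word):
        before = i > 0 and word[i - 1] == ch
        after = i + 1 < len(word) and word[i + 1] == ch
        if not before and not after:
            variants.add(word[:i] + ch + word[i:])

    # Rule 4: exchange commonly confused letter groups.
    swaps = [("ei", "ai"), ("ai", "ei"), ("on", "an"), ("an", "on")]
    for old, new in swaps:
        start = word.find(old)
        while start != -1:
            variants.add(word[:start] + new + word[start + len(old):])
            start = word.find(old, start + 1)

    variants.discard(word)  # never report the correct spelling
    return sorted(variants)
```

Calling `misspellings("carpenter")` returns a sorted list that includes "caprenter" among other candidates. For real use you would probably want to filter the list against a corpus of observed misspellings.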

Related terms are REALLY tricky. Basically, you need a giant semantic ontology that can link words by concept. To the best of my knowledge, that doesn't exist yet. Search companies like Yahoo and Google are trying to come up with one, but with only limited success so far. They have the raw data, from words that appear together in searches and web pages, but it takes a good few terabytes of space, and organising it meaningfully while filtering out the noise is a highly non-trivial task.
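On a very small scale, the "words that appear together" idea can be sketched as a co-occurrence count. The function names and the sample queries below are made up for illustration, and the sketch does none of the noise filtering that makes the real problem hard.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(documents):
    """Count how often each pair of words appears in the same document."""
    related = defaultdict(Counter)
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def related_terms(related, term, top_n=5):
    """Return the words seen most often alongside `term`."""
    counts = related.get(term.lower(), Counter())
    return [word for word, _ in counts.most_common(top_n)]


# Toy data, invented for illustration only.
queries = [
    "carpenter hammer nail",
    "carpenter wood repair",
    "carpenter hammer wood",
    "plumber pipe repair",
]
table = cooccurrence(queries)
print(related_terms(table, "carpenter", top_n=2))
```

With a real query log, common words like "the" or "how" would dominate these counts, which is one reason the filtering step matters so much.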

Sorry if that sounds discouraging... Natural language processing is still the Holy Grail of AI. We're getting closer to useful applications, but we're just not there yet.
posted by Spanner Nic at 9:30 PM on February 9, 2009

Response by poster: WordNet seems to be almost exactly what I was looking for. The synonym and hypernym functions are really useful to me.

Is there a way to feed the program a list of entries (instead of one at a time)?
posted by jmevius at 8:29 AM on February 10, 2009

No, I don't think so, but if you install it locally it'll be fast enough for most purposes.
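With a local install, a short script can do the looping for you. Here is a rough sketch that assumes you use the NLTK package for Python as the interface to WordNet (`pip install nltk`, then a one-time `nltk.download("wordnet")`). The helper names `wordnet_lookup`, `expand_terms`, and `write_csv` are hypothetical, and the column layout is only one possible choice.

```python
import csv


def wordnet_lookup(term):
    """Return (synonyms, hypernyms) for `term` from WordNet via NLTK.

    Assumes NLTK and its WordNet corpus are installed (third-party).
    """
    from nltk.corpus import wordnet as wn  # imported lazily on first use

    synonyms, hypernyms = set(), set()
    for synset in wn.synsets(term):
        synonyms.update(n.replace("_", " ") for n in synset.lemma_names())
        for parent in synset.hypernyms():
            hypernyms.update(n.replace("_", " ") for n in parent.lemma_names())
    synonyms.discard(term)  # the term is not its own synonym
    return sorted(synonyms), sorted(hypernyms)


def expand_terms(terms, lookup=wordnet_lookup):
    """Look up every term and return one row (a dict) per term."""
    rows = []
    for term in terms:
        synonyms, hypernyms = lookup(term)
        rows.append({
            "term": term,
            "synonyms": "; ".join(synonyms),
            "hypernyms": "; ".join(hypernyms),
        })
    return rows


def write_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    fields = ["term", "synonyms", "hypernyms"]
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Something like `write_csv(expand_terms(["carpenter", "plumber"]), "expanded.csv")` would then process a whole list in one go. The `lookup` parameter is there so a different source (a misspelling corpus, say) can be swapped in without touching the rest.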
posted by Spanner Nic at 8:40 AM on February 10, 2009
