Synonym Finder?
February 6, 2009 8:08 AM Subscribe
Is there an automated way to generate potential search terms for a given expression?
I'm trying out an idea I had about search engine queries and keep getting stuck. Ideally, I'd like to generate a file (CSV, XML, etc.) with a large listing of potential search terms, send the file to ___, and receive back a file that lists, for each given search term:
1. Synonyms
2. Related Terms
3. Common Misspellings
For example, if I entered the word 'Carpenter', I'd like to see:
1. Woodworker, Builder, Craftsman
2. Wood, Hammer, Nail, Repair, Fix
3. Caprenter
Does anyone have any ideas? I've had decent luck doing this manually, but I'd like to feed a large list (3-5K) of unique terms.
I think eBay does this to some degree, since you no longer need to use Fat Fingers to find misspelled entries, which got rid of a cool trick. At least, it's true for common misspellings, and it usually recommends a few categories, although that might be based on results, which gets you the same information, only generated on the fly.
posted by mccarty.tim at 8:38 AM on February 6, 2009
Response by poster: To answer a question above, I don't have access to Oracle. The article does address some of the questions I have, however. Thanks for the find.
Re: Fat Fingers -- that is one of the things I thought of when I started working on this. Any way to extract results?
posted by jmevius at 9:11 AM on February 6, 2009
You could use the Suggested Upper Merged Ontology data for this. In this sense, an ontology is an organization of concepts and words according to their meaning. SUMO has a notion of narrower, broader, and specific meanings and is designed to be used by computers. Thus, your program could suggest broader terms (hypernyms), narrower terms (hyponyms), and what are called instance hyponyms (e.g., proper names).
As an example, from 'carpenter' you can move around the ontology to get woodman, woodsman, woodworker, Joseph, artificer, artisan, craftsman, journeyman, cabinetmaker, furniture maker, joiner, splicer, carver, woodcarver.
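To make "moving around the ontology" concrete, here is a minimal sketch of walking broader and narrower links. The BROADER table is a tiny hand-written toy built from a few of the terms above; it is an assumption for illustration, not actual SUMO data, and the function names are hypothetical.

```python
# Toy child -> parent ("broader term") links. Hand-written for illustration only;
# a real ontology would supply these relations.
BROADER = {
    "carpenter": "woodworker",
    "cabinetmaker": "woodworker",
    "joiner": "woodworker",
    "woodworker": "craftsman",
}


def broader_terms(term):
    """Follow broader links upward and return the chain of hypernyms."""
    chain = []
    while term in BROADER:
        term = BROADER[term]
        chain.append(term)
    return chain


def narrower_terms(term):
    """Return the direct hyponyms of a term, sorted alphabetically."""
    return sorted(child for child, parent in BROADER.items() if parent == term)


if __name__ == "__main__":
    print(broader_terms("carpenter"))
    print(narrower_terms("woodworker"))
```

Siblings such as cabinetmaker and joiner fall out of going up one level and back down, which is roughly how the list above gets from carpenter to its neighbours.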
posted by jedicus at 9:51 AM on February 6, 2009 [2 favorites]
Off-topic, but Google has its own internal synonym/ontology system which is activated by a tilde at the start of the term.
So one automated way to search for "carpenter" synonyms would just be to search for "~carpenter".
posted by AmbroseChapel at 3:32 PM on February 8, 2009
Response by poster: AmbroseChapel - correct me if I'm wrong, but doesn't the ~ operator only return results from synonym searches? For example, if I searched for "~carpenter", it might return results for carpenter or woodworker.
What I was looking for is this: if I submit carpenter, I get back the result "woodworker" (plus whatever other synonyms Google is smart enough to know). Am I misinterpreting that?
posted by jmevius at 9:30 AM on February 9, 2009
WordNet will find you all synonyms with as much detail as one can hope for. Here are the results for carpenter. You can use it online, or download and install it locally if you have many queries to perform.
Misspellings are tricky; it's easy to go from a misspelled word to the correct one, but I don't know of any available software to automatically generate mistakes. There are two viable solutions: downloading a corpus or using heuristics. A corpus like those found here will provide you with common real-life misspellings, with the downside that their number is limited to a few thousand. Alternatively, if you can program a little, you can create a simple generator that applies a few rules to a word to generate mistakes - remove/insert double letters, exchange ei/ai, on/an, etc.
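A minimal sketch of such a rule-based generator, under a few assumptions: the function name and the SWAPS table are made up for illustration, the swap pairs are just the two mentioned above, and the adjacent-letter transposition rule is an extra one added because it produces the 'Caprenter' example from the question. It over-generates, so the output would still need filtering.

```python
# Letter-group swaps taken from the rules mentioned above; extend as needed.
SWAPS = [("ei", "ai"), ("on", "an")]


def generate_misspellings(word):
    """Return a sorted list of candidate misspellings for a single word."""
    word = word.lower()
    variants = set()

    for i in range(len(word) - 1):
        a, b = word[i], word[i + 1]
        # Transpose adjacent letters (assumed extra rule): carpenter -> caprenter.
        variants.add(word[:i] + b + a + word[i + 2:])
        # Remove one letter of a doubled pair: letter -> leter.
        if a == b:
            variants.add(word[:i] + word[i + 1:])

    # Insert a double letter wherever the letter is not already doubled.
    for i, ch in enumerate(word):
        prev_same = i > 0 and word[i - 1] == ch
        next_same = i + 1 < len(word) and word[i + 1] == ch
        if not prev_same and not next_same:
            variants.add(word[:i] + ch + word[i:])

    # Exchange letter groups in both directions: their -> thair.
    for left, right in SWAPS:
        for src, dst in ((left, right), (right, left)):
            start = word.find(src)
            while start != -1:
                variants.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)

    variants.discard(word)  # never report the correct spelling itself
    return sorted(variants)


if __name__ == "__main__":
    print(generate_misspellings("Carpenter"))
```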
Related terms are REALLY tricky. Basically, you need a giant semantic ontology that can link words by concept. To the best of my knowledge, that doesn't exist yet. Search companies like Yahoo and Google are trying to come up with one, but with only limited success so far. They have the raw data, from words that appear together in searches and web pages, but it takes a good few terabytes of space, and organising it meaningfully while filtering out the noise is a highly non-trivial task.
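On a small scale, the "words that appear together" idea can be sketched as a co-occurrence count. This is only an illustration under loud assumptions: the function names are hypothetical, documents are split on whitespace, and there is no noise filtering (stop words, rare terms), which is exactly the hard part described above.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence(docs):
    """Count how often each pair of words appears in the same document."""
    counts = defaultdict(Counter)
    for doc in docs:
        # Each distinct word counts once per document.
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def related(counts, term, top=5):
    """Return up to `top` words that co-occur most often with `term`."""
    return [word for word, _ in counts[term.lower()].most_common(top)]


if __name__ == "__main__":
    # Made-up sample documents, for illustration only.
    docs = [
        "carpenter hammer nail",
        "carpenter hammer wood",
        "plumber pipe",
    ]
    print(related(cooccurrence(docs), "carpenter"))
```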
Sorry if that sounds discouraging... Natural language processing is still the Holy Grail of AI. We're getting closer to useful applications, but we're just not there yet.
posted by Spanner Nic at 9:30 PM on February 9, 2009
Response by poster: Wordnet seems to be almost exactly what I was looking for. The synonym and hypernym functions are really useful to me.
Is there a way to feed the program a list of entries (instead of one at a time)?
posted by jmevius at 8:29 AM on February 10, 2009
No, I don't think so, but if you install it locally it'll be fast enough for most purposes.
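One possible workaround for batches is a short script that loops over the list and writes a CSV. The sketch below assumes the NLTK Python package and its WordNet corpus are installed (neither is mentioned in this thread); `wordnet_synonyms` and `expand_terms` are hypothetical helper names. The lookup is passed in as a function, so a different source could be swapped in.

```python
import csv
import io


def wordnet_synonyms(term):
    """Look up synonyms via NLTK's WordNet interface (assumes nltk and its corpus)."""
    from nltk.corpus import wordnet as wn  # imported here so the rest runs without nltk

    names = set()
    for synset in wn.synsets(term):
        for lemma in synset.lemma_names():
            names.add(lemma.replace("_", " "))
    names.discard(term)
    return sorted(names)


def expand_terms(terms, lookup, out):
    """Write one CSV row per non-blank term: the term and its synonyms joined by '; '."""
    writer = csv.writer(out)
    writer.writerow(["term", "synonyms"])
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip blank lines in the input list
        writer.writerow([term, "; ".join(lookup(term))])


if __name__ == "__main__":
    buffer = io.StringIO()
    expand_terms(["carpenter", "plumber"], wordnet_synonyms, buffer)
    print(buffer.getvalue())
```

For a list of 3-5K terms, `terms` could be an open text file with one term per line and `out` a file opened for writing.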
posted by Spanner Nic at 8:40 AM on February 10, 2009
This askTom has some insight into the process they use for term-building.
All this is probably of little use to you though - I imagine if you had access to Oracle you would already be using it.
posted by H. Roark at 8:30 AM on February 6, 2009