How do I collect dictionary definitions for ~350 words to put into a database?
February 11, 2011 11:12 AM   Subscribe

I have a list of about 350 words that I'm putting into a database that I'm building from scratch. For each word in the list, I'd like to include its American English dictionary definition (preferably American Heritage Dictionary), UrbanDictionary entry, and Wikipedia page link (I realize that entries may not exist for some columns for many of these words). Is there some way to streamline or data-mine this process, so I don't have to look up each word individually in 3 separate sources x350+, not to mention all the copying and pasting?

This is for academic research purposes and I have no funds. Filling the database with the definitions/part-of-speech/pronunciation is one very small part of the bigger project (not shirking research responsibilities, but rather getting hung up on this tangential but integral step).

Bonus points if anybody has tips as to how to enter International Phonetic Alphabet (IPA) characters into the database...they will inevitably be part of the dictionary definition (the pronunciation guide(s)) and I can't seem to find out any information about what the convention/standards/mappings are for SQL database entry and IPA symbols.

Thank you.
posted by iamkimiam to Computers & Internet (10 answers total) 3 users marked this as a favorite
 
Build a database with a table called words:

WordId (unique),Word,AmericanEnglish,UrbanDictionary,Wikipedia

You can build the database and enter the data very easily in phpmyadmin

I appreciate the response isn't wordy or lengthy as I assume basic knowledge of database design on your part and I'm short on time.
posted by dougrayrankin at 11:17 AM on February 11, 2011


Oh and you can have all this stuff for free on Mac, Linux or Windows.
posted by dougrayrankin at 11:20 AM on February 11, 2011


Spend a minute watching the intro videos on Google Refine. In particular, the get URL based on field functionality should streamline pulling back the definitions from the sites in question. Really, you just need to script up something to scrape URLs. If you can't do this, someone in your school's CS department would probably do it for a 6-pack.
posted by bfranklin at 11:23 AM on February 11, 2011


If you're on a mac, fake.app is fantastic for this kind of thing - it's a little like automator for working with web-based workflows.

You can set up a workflow which will iterate through your list of words, look each one up at those three sites, extract the definition (hopefully in a named div on the page), then capture all the data into a CSV file. Fairly easy, if you don't mind a little javascripting.
posted by davemee at 11:57 AM on February 11, 2011


wget and/or python are your friend for this sort of work.
posted by rr at 12:25 PM on February 11, 2011


Option #1: Mechanical Turk if you're willing to spend a few shekels to get someone else to do manual labor for real cheap.

Option #2: At least here in the US, I can get an undergrad intern to do this for me for free in exchange for research credit. This task is easy enough to explain and easy enough for someone to do at home on their own time. There may be something similar in the UK university system so it would be worth looking into.
posted by special-k at 12:35 PM on February 11, 2011


You may find the Wordnik API useful for grabbing definitions and pronunciations—AHD and Wiktionary are two of their API data sources. Doesn't help with the Urban Dictionary component, and doesn't seem to give a link back to Wiktionary although that should be easy to generate.

IPA should come encoded in Unicode/UTF-8, and UTF-8 is the keyword you're looking for as far as how to get that into your SQL database. The specifics depend on exactly which DB you're using, but all of them (at least MySQL and PostgreSQL) should be able to do the right thing if you tell them you're going to be giving them UTF-8.
posted by xueexueg at 12:41 PM on February 11, 2011


What OS are you building this on top of? The initial datamining could easily be done with a bash script or perl. Read in the list of words, feed it to the correct search string per dictionary, scrape your data and and have it spit out some relatively easy text format to clean up by hand (csv or something else with a consistent field delimiter). After it's clean, it can be easily massaged into a database.

And what xueexueg said: you definitely want to go with UTF8.
posted by Cat Pie Hurts at 1:40 PM on February 11, 2011


Response by poster: Well, despite the several great suggestions here, I'm no further along than where I started. I looked into each option and it was either too complicated for me to understand or cost money (Fakeapp was closest to a solution, but it can't really go levels deep to pull out all the bits I need; plus it's $30. Wordnik is brilliant, but API is way beyond my comprehension right now).

Thanks anyway for all your help. At this point I'm going to look each word up and manually copy/paste into the columns. What a drag.

I'm also looking into the UTF-8 stuff...is there something special I need to do to actually *enter* the symbols into the database (like, how would I enter the word [mɛtəfɪltɚ] into the database?)...conceptually I just don't get how to make the parts fit together, ya know?
posted by iamkimiam at 2:44 PM on February 22, 2011


Response by poster: hmmm...maybe all is not entirely lost. I *am* able to set up a Fake Workflow that goes to this URL: http://www.wordnik.com/words/
Then I type in the word I want to look up
Then save the raw source code as a txt file named same as the word
Then run the script!

NOW, I just need a way to automate THAT task so I don't have to type the words and the filename to save the txt file as 349 more times. And cough up the $30 if I can't do it in less than 4 moves.

Anybody have any additional thoughts about how to go about this and/or the UTF-8 data entry?
posted by iamkimiam at 3:14 PM on February 22, 2011


« Older I want to be the next Terrance Dicks.   |   Groping stepdad vs. little kid Newer »
This thread is closed to new comments.