A repository of large flashcard-suitable translation lists?
January 6, 2008 8:24 PM
I'm looking at writing a freeware language tutor for the Mac and have been scouring the web looking for language dictionaries (actually "word lists" is probably more accurate) to use as an underpinning.
While I've found a few remarkable repositories for individual languages, the work involved in negotiating public use and then regularizing the different data sources is a bit daunting.
Can anyone recommend a single source for many different languages? Bonus points if it includes pronunciation mp3s...
By the way, I think that choosing parallel passages from Gutenberg texts -- while not exactly 'flashcard' sized -- would let you also draw on MP3 audiobook versions of the corresponding texts in a way that is reasonably automatable.
posted by mrflip at 11:10 PM on January 9, 2008
There is also the NLTK -- the natural language toolkit, which comes with a variety of research-quality text corpora in both English and many other languages.
The Gutenberg project has some more interesting collections of word lists (like Grady Ward's Multiple Language Lists of Common Words), and a *huge* array of translated texts, from which you could choose parallel passages.
Another source for semantically labelled translations would be large open software projects that have been localized: Mozilla, Linux, all of the GNU utilities. Look for directories called 'locale'; the subdirectories will be named either by the two-letter ISO 639 language code (en, de, fr, etc) or by language then country, ll_CC (en_US, en_GB, ko_KR, etc). For gettext-based projects such as the GNU utilities, the installed catalogs are compiled binary .mo files, but they're easy to decode (gettext's msgunfmt will turn them back into text). The lexicon, of course, will be fairly restricted; but you'll have access to a wide variety of languages, and the advantage that you know how each atomic phrase corresponds across all those languages.
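To make the locale idea a bit more concrete, here is a rough Python sketch of reading a compiled gettext .mo catalog and pairing up phrases from two languages. The function names (parse_mo, flashcards) are made up for illustration, and it assumes the catalogs are UTF-8 and ignores plural forms and message contexts, so treat it as a starting point rather than a complete reader.

```python
import struct

MO_MAGIC = 0x950412DE  # magic number at the start of a GNU gettext .mo file


def parse_mo(data: bytes) -> dict[str, str]:
    """Return {original: translation} from the bytes of a compiled .mo catalog."""
    # The magic number tells us which byte order the file was written in.
    if struct.unpack("<I", data[:4])[0] == MO_MAGIC:
        order = "<"
    elif struct.unpack(">I", data[:4])[0] == MO_MAGIC:
        order = ">"
    else:
        raise ValueError("not a gettext .mo file")

    # Header fields after the magic: revision, number of strings, and the
    # offsets of the two tables of (length, offset) pairs.
    _rev, count, orig_tab, trans_tab = struct.unpack(order + "4I", data[4:20])

    def read(table: int, i: int) -> str:
        length, offset = struct.unpack_from(order + "2I", data, table + 8 * i)
        # Assumption: UTF-8. The real charset is declared in the catalog metadata.
        return data[offset:offset + length].decode("utf-8", errors="replace")

    pairs = {}
    for i in range(count):
        original, translated = read(orig_tab, i), read(trans_tab, i)
        if original:  # the empty original holds catalog metadata, not a phrase
            pairs[original] = translated
    return pairs


def flashcards(lang_a: dict[str, str], lang_b: dict[str, str]) -> list[tuple[str, str]]:
    """Pair translations from two catalogs that share the same source phrase."""
    return [(lang_a[k], lang_b[k]) for k in sorted(lang_a.keys() & lang_b.keys())]
```

You would feed it the bytes of two catalogs for the same program, e.g. something like locale/de/LC_MESSAGES/<program>.mo and locale/fr/LC_MESSAGES/<program>.mo (the exact layout varies by project and install), and get back German/French pairs keyed on the shared English source phrase.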
If you find or prepare any such lists, would you please get in touch by email or metafilter mail? My current project is a website to collect, share and redistribute rich datasets -- a Flickr for data, kinda -- and this is exactly the type of rich resource I'm hoping to make it easy for people to find, share and explore.
posted by mrflip at 11:07 PM on January 9, 2008