Help me build a better dictionary!
September 16, 2006 9:44 PM

How do you start digitizing a dictionary? More specifically, how (and where) do I begin putting together a digital and internally cross-referenceable edition of the Hans Wehr Dictionary of Modern Written Arabic?

[Long explanation ahead. A thousand apologetic synonyms.]

If you've ever learned Arabic, there's a good chance you've glanced once or twice at the green bible. The truth is, "Hans," as Arabic language students know it, is simply the best learner's reference available, despite a few fundamental problems. One of these has been the lack of any good way to look up words in the opposite direction. Hans only goes from Arabic to English, and no reliable dictionary designed expressly for the English-speaking student of Arabic exists.

Arabic is a language based on triliteral roots, which makes the meaning of any one word strongly dependent on the context of the collective meanings within the entire root series. Looking up any unfamiliar word from English to Arabic involves a tedious process of finding the word, then cross-referencing it to Hans, which provides the proper context on two axes: how it fits into the conjugation patterns of other words derived from the same root, and how the term was commonly used when the dictionary was written. A student can only be sure that the word she's chosen is the right one after she's double- and triple-checked all the contexts.

One of the solutions to this problem, I believe, is to put the entire Hans Wehr dictionary into a computer-searchable format. This would mean that any word searched for in English could be easily linked back to its Arabic translation, which would then be found in the original context set out by Wehr.

I would guess off the top of my head that some kind of database is the best option, since the dictionary is arranged hierarchically and alphabetically.

My question for the Hive Mind is this: what's the best software to use to get this project off the ground? I'm not so technically inclined that I can design something like this from the ground up. I'd like something that offers a frontend flexible enough that I can design my own hierarchy of tables and entries, and something that (even more ideally) would allow me to tag only certain texts to be available for search. Lastly, I'd like the software to allow for collaboration, maybe by making it easy to import other people's work into the main body.

Once I get this thing started up, I'll move it over to MeFiProjects so that everyone can get a chance to participate and see how progress is going.

This is a long question that might need some clarification if it seems I haven't explained things enough. Thanks for your collective help!
posted by awenner to Writing & Language (16 answers total) 4 users marked this as a favorite
Your first step is to contact the copyright holder for that book and get permission from them to create a derivative work.

(Or your first step is to contact an intellectual property lawyer and tell him to prepare to defend you when the copyright holder sues your tail feathers off.)
posted by Steven C. Den Beste at 10:04 PM on September 16, 2006

The original copyright is 1979.
posted by Steven C. Den Beste at 10:08 PM on September 16, 2006

For possibly free book scanning, talk to the Text Archive folks. Don't know if they can help w/ hypertext, database creation, etc. But obviously SDB is right. You won't get any help from "legitimate" organizations unless the book is in the public domain or the copyright-holder approves.
posted by Dave 9 at 10:08 PM on September 16, 2006

maybe dumb-ass idea: would a wiki be a useful mechanism? I don't know how easy it is to scale a wiki, but with open public use like Wikipedia's it could be another route to a decent dictionary-cum-thesaurus, with simple initial setup.
posted by anadem at 10:20 PM on September 16, 2006

Response by poster: Copyright issues are being dealt with, and my question isn't whether I can get legitimate organizations to help me. This is largely a project being done on the private time of individuals.

Also, text-scanning isn't the problem. Each entry is going to be full-text rewritten, so my question is about the best software to get this done in a searchable and organized format.
posted by awenner at 10:24 PM on September 16, 2006

Take a look at Matapuna which is specialised open source lexicographical software.
posted by i_am_joe's_spleen at 10:44 PM on September 16, 2006 [1 favorite]

awenner, I'm not aware of a pre-existing piece of software that would do what you're suggesting without substantial rewrites.

However, in a previous life I rigged up a quick-and-dirty cross-referencing system for a constructed language called Lojban in (IIRC) MySQL and PHP. That was quite a bit easier because Lojban has synthetic grammar and word-formation rules that make rule-based matching easy, but the same process would, I suppose, work for any other language.

Since you indicate that the entries are going to be re-entered/re-typed, you should look at doing this in two phases.

The first part is to enter the data in a marked-up, machine-readable format. My first instinct is some stripe of XML. There may already be tools in existence for this, but whatever you choose, stick with it.
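As a rough illustration of that first phase, here is a minimal sketch of one root and its entries typed up as XML and read back with Python's standard library. The element and attribute names (`root`, `letters`, `entry`, `headword`, `translit`, `gloss`) are invented for this example and are not an existing lexicographical standard; the glosses are generic illustrations, not text quoted from Wehr.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for a single root and two words derived from it.
# Every tag and attribute name here is an assumption made for illustration.
SAMPLE = """<root letters="ك ت ب">
  <entry form="I">
    <headword>كتب</headword>
    <translit>kataba</translit>
    <gloss>to write</gloss>
  </entry>
  <entry form="noun">
    <headword>كتاب</headword>
    <translit>kitāb</translit>
    <gloss>book</gloss>
  </entry>
</root>"""


def parse_root(xml_text):
    """Turn one <root> element into a list of plain dictionaries."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("entry"):
        entries.append({
            "root": root.get("letters"),
            "form": entry.get("form"),
            "headword": entry.findtext("headword"),
            "translit": entry.findtext("translit"),
            # An entry may carry several glosses; keep them as a list.
            "glosses": [g.text for g in entry.findall("gloss")],
        })
    return entries


entries = parse_root(SAMPLE)
for e in entries:
    print(e["root"], e["headword"], e["translit"], e["glosses"])
```

Keeping each root in its own small file like this would also make it easier to merge contributions from several typists, since two people rarely need to touch the same file.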

Once you have the data in a form that the machine can read, you have the fun task of designing a database that can hold all the data in a searchable format. This is a lot easier than it sounds; most word definitions are pretty simple creatures.
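To make the database idea concrete, here is a minimal sketch using SQLite through Python's built-in `sqlite3` module. The three-table layout (`roots`, `entries`, `glosses`) and all function names are assumptions chosen for illustration, not a recommended final schema. It shows the two lookups the project needs: English word to Arabic entry, and Arabic entry back to everything else under the same root.

```python
import sqlite3


def build_db():
    """Create an in-memory database with a hypothetical three-table schema."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE roots   (id INTEGER PRIMARY KEY, letters TEXT UNIQUE);
        CREATE TABLE entries (id INTEGER PRIMARY KEY,
                              root_id INTEGER REFERENCES roots(id),
                              headword TEXT, translit TEXT);
        CREATE TABLE glosses (entry_id INTEGER REFERENCES entries(id),
                              english TEXT);
        CREATE INDEX glosses_english ON glosses(english);
    """)
    return db


def add_entry(db, letters, headword, translit, glosses):
    """Insert one word, creating its root row if it does not exist yet."""
    db.execute("INSERT OR IGNORE INTO roots(letters) VALUES (?)", (letters,))
    root_id = db.execute(
        "SELECT id FROM roots WHERE letters = ?", (letters,)
    ).fetchone()[0]
    cur = db.execute(
        "INSERT INTO entries(root_id, headword, translit) VALUES (?, ?, ?)",
        (root_id, headword, translit),
    )
    db.executemany(
        "INSERT INTO glosses(entry_id, english) VALUES (?, ?)",
        [(cur.lastrowid, g) for g in glosses],
    )


def lookup_english(db, word):
    """English -> Arabic: entries whose gloss matches `word` exactly."""
    return db.execute("""
        SELECT e.headword, e.translit, r.letters
        FROM glosses g
        JOIN entries e ON e.id = g.entry_id
        JOIN roots r   ON r.id = e.root_id
        WHERE g.english = ?
        ORDER BY e.id
    """, (word,)).fetchall()


def root_family(db, letters):
    """All headwords filed under one root, in insertion order."""
    rows = db.execute("""
        SELECT e.headword FROM entries e
        JOIN roots r ON r.id = e.root_id
        WHERE r.letters = ? ORDER BY e.id
    """, (letters,)).fetchall()
    return [row[0] for row in rows]


db = build_db()
add_entry(db, "ك ت ب", "كتب", "kataba", ["to write"])
add_entry(db, "ك ت ب", "كتاب", "kitāb", ["book"])
print(lookup_english(db, "book"))
print(root_family(db, "ك ت ب"))
```

Glosses live in their own table so that one entry can have many English equivalents, which is what makes the reverse (English to Arabic) search a simple indexed query.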

The next step is to write an app around the data in the database to actually do clever things with the data.

One piece of advice: try not to let the format of the data distract you from its nature. That is, just because things are in one entry, arranged together on the page, doesn't mean they aren't distinct bits of data with their own schemas.

Again, I am nowhere near an expert in this, linguistics is barely a hobby of mine, but if you'd like to talk it over in more detail, I'd be happy to elaborate either here or at the email in my profile.

Oh, one other thing, you will have to take care to ensure that all the tools you use understand Unicode/UTF-8 in the same way so that your Arabic doesn't get garbled in strange and exciting ways.
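One way to keep the Arabic consistent is to normalize every string before it is stored or compared. The sketch below uses Python's standard `unicodedata` module; the helper names `normalize_text` and `search_key` are hypothetical, and the choice to drop the short-vowel marks (U+064B through U+0652) when building a search key is an assumption about how people will type their queries, not a rule of Arabic lexicography.

```python
import unicodedata

# Arabic short-vowel and related marks (fathatan through sukun).
# Stripping exactly this range for search is an assumption of this sketch.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}


def normalize_text(text):
    """Put text into Unicode NFC so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", text)


def search_key(text):
    """NFC-normalize, then drop vowel marks so vocalized and
    unvocalized spellings of a word map to the same key."""
    return "".join(ch for ch in normalize_text(text) if ch not in HARAKAT)


# Alef followed by a combining madda composes to the single code point U+0622.
assert normalize_text("\u0627\u0653") == "\u0622"

# Fully vocalized "kataba" and the bare consonants share one search key.
vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = "\u0643\u062A\u0628"
assert search_key(vocalized) == search_key(bare) == bare

# Round-tripping through UTF-8 bytes must not change the text.
assert vocalized.encode("utf-8").decode("utf-8") == vocalized
```

Storing the original spelling alongside the search key keeps the vocalization available for display while letting searches ignore it.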
posted by Skorgu at 1:06 AM on September 17, 2006

>no reliable dictionary designed expressly for the English-speaking student of Arabic exists

I don't doubt you, but ... really? That's bizarre. Why not?
posted by AmbroseChapel at 2:27 AM on September 17, 2006

Response by poster: >No reliable dictionary designed for the English-speaking student of Arabic exists.

Why? Because while a lot of Arabic-English dictionaries have been written, none provide the context that's absolutely necessary to understand which word is the right one to use in any particular situation. Instead, most such dictionaries get written for the Arabic-speaking student of English. Mawrid, the best such dictionary on the market, comes with important contextual information on not just words, but also public figures and useful concepts. Thomas Edison, for example, merits his own entry, which makes a lot of sense when you're trying to bridge a cultural gap that extends past word differences.

I once had the frustrating experience of trying to find the Arabic word for "sole," - that is, the sole of a shoe. After eventually finding the entry in the Oxford Arabic-English dictionary, I was informed that "sole" meant "a thing found on the bottom of a shoe," without any further translation. That's the problem.
posted by awenner at 2:53 AM on September 17, 2006

This sounds fantastic. Nothing to offer here on the technical side right now, but I'd love to help out, having experienced the horrors of the English-Arabic dictionary. I'll be watching MeFiProjects.
posted by pullayup at 3:56 AM on September 17, 2006

I use Steingass's A Learner's English-Arabic Dictionary. Sure, it's out of date (1st ed. 1882), but it's very useful for basic vocabulary.
SOLE, s. (of the foot) batan ar-rijl (pl. butun, abtun). —(of the shoe) na'l (pl. an'al, ni'al)
(Needless to say, he has the Arabic script as well, but I'm too lazy to enter it; same goes for the diacritics on the Roman letters.)

There's no Arabic etymological dictionary, either. Arabic lexicography is surprisingly limited.
posted by languagehat at 6:17 AM on September 17, 2006

My wife would love something like this.
posted by laz-e-boy at 6:34 AM on September 17, 2006

Just out of curiosity: Do you know what type of dictionary and translation tools are used by the U.S. military? Or the C.I.A.?
posted by Dave 9 at 5:04 PM on September 17, 2006

CIA uses humans.
posted by ikkyu2 at 11:56 PM on September 17, 2006

awenner: I'll be checking in periodically since I need to do this sort of thing myself for a language called Khotanese. The available lexicographical resources are not very good, and a database would be extremely useful.
posted by AArtaud at 10:58 AM on September 19, 2006

I don't know what the software behind is, but I know that both the paper and the electronic versions are indispensably rich in cross-references for the student of Chinese. Rick Harbaugh might be willing to explain. His email's on his home page.
posted by eritain at 2:31 AM on April 23, 2007
