Can the web community help improve machine translation?
March 24, 2005 4:52 PM

Babelfilter: Machine translation still sucks. Why isn't there a system that allows members of the public to collectively help improve the quality of computer-produced translations?

This question was inspired by an essay I stumbled upon through etaoin's FPP.

Don't get me wrong, I don't believe MT will ever be able to provide a transparent source-to-target language translation with all the nuance and metaphoric colloquialisms intact; however, whenever I use it, it seems to me like it could do a lot better if a large number of humans were in a position to improve and build upon the existing dictionaries. For example, a common two-word phrase will get lost in translation when that seems reasonably easy to overcome by specifying that "if words A and B occur next to each other, phrase AB is intended, therefore translate it as such" and pointing the program to the intended entry in the lexicon.
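The "if A and B occur next to each other, translate AB as a unit" rule could be sketched as a longest-match lookup against a phrase lexicon. This is a minimal illustration, not any real MT system; the English-to-Dutch entries are invented for the example:

```python
# Sketch of the bigram-phrase idea: check the phrase lexicon before
# falling back to word-for-word substitution. Entries are invented
# English->Dutch examples for illustration only.
PHRASES = {("hot", "dog"): "hotdog"}        # two-word phrase entry
WORDS = {"hot": "heet", "dog": "hond", "a": "een",
         "i": "ik", "ate": "at"}

def translate(tokens):
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:                  # phrase match takes priority
            out.append(PHRASES[pair])
            i += 2
        else:                                # fall back to single words
            out.append(WORDS.get(tokens[i], tokens[i]))
            i += 1
    return " ".join(out)

print(translate("i ate a hot dog".split()))  # -> "ik at een hotdog"
```

As the answers below point out, the hard part is not this lookup but deciding *when* the phrase reading is the intended one.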

Also, all current MT programs I'm aware of translate one word in the source language to exactly one word in the target language (the most common meaning, I guess, but I'm not sure). Words have ambiguous meanings; deal with it. I personally would have no problem with 'multifinality', i.e. more than one outcome, perhaps made visible to the user with some on-hover DHTML.
posted by goodnewsfortheinsane to Writing & Language (11 answers total)
 
Response by poster: Okay it looks like I hit "Post" a bit too soon, but I suppose the question is clear as it stands. I could think of a few other examples, but I'm very interested in your thoughts on this, and I'm willing to clarify if any part of my unedited post somehow got, erm, lost in translation.
posted by goodnewsfortheinsane at 4:56 PM on March 24, 2005


I am not a linguist. I think this is an interesting idea, but the problem is that translation is more than simple substitution. The rules of natural languages are numerous, complex, and inconsistent. The contributors would be limited to those who know two languages and understand how to specify rules in whatever system you use to specify grammar.

On the other hand, it seems your idea would be good for building multi-lingual dictionaries.
posted by rdr at 5:19 PM on March 24, 2005


They've thought of that. The relevant chapter from a book on machine translation.
In general, there are two approaches one can take to the treatment of idioms. The first is to try to represent them as single units in the monolingual dictionaries. What this means is that one will have lexical entries such as kick_the_bucket. One might try to construct special morphological rules to produce these representations before performing any syntactic analysis --- this would amount to treating idioms as a special kind of word, which just happens to have spaces in it. As will become clear, this is not a workable solution in general.
And so on... the whole book is pretty interesting.
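The "word that happens to have spaces in it" treatment from the quoted chapter could be sketched as a preprocessing pass that joins known idioms into single tokens before any syntactic analysis. The idiom inventory here is invented for illustration:

```python
# Sketch of treating idioms as single lexical units, per the quoted
# chapter: rewrite "kick the bucket" as kick_the_bucket before
# analysis. The idiom list is illustrative only.
IDIOMS = [("kick", "the", "bucket"), ("by", "and", "large")]

def join_idioms(tokens):
    out, i = [], 0
    while i < len(tokens):
        for idiom in IDIOMS:
            n = len(idiom)
            if tuple(tokens[i:i + n]) == idiom:
                out.append("_".join(idiom))  # one "word" with underscores
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out

print(join_idioms("he may kick the bucket".split()))
# -> ['he', 'may', 'kick_the_bucket']
```

As the chapter notes, this isn't workable in general: it breaks as soon as the idiom inflects ("kicked the bucket") or takes modifiers ("kick the proverbial bucket").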

OTOH, I can see how there would be some potential to distributing this kind of work. If there are 100 variations of a particular idiom, a researcher might assume that's impossible to deal with over the whole language. Too much effort. But you farm it out to people and let them earn points for translating them, and almost anything is workable.
posted by smackfu at 5:23 PM on March 24, 2005


Best answer: I've been noticing that this guy (a VC) has some interest in machine translation on the web for the purpose of improving communication between the west and the arab world (and if he can find a way to build a business around it, all the better, I'm sure).

You might browse his posts on the subject and see if there is anything relevant; you might also ping him with your idea and see if he knows anything or anyone.

I don't think it's as simple as goodnewsfortheinsane makes it out to be either, but tapping the web for collaborative training of machine translation could be interesting and wouldn't require teaching the participants a system for specifying grammar.

Rather than building dictionaries or annotating grammar, people would translate parts of documents, maybe only a few sentences. Each fragment would be translated multiple times (one might even have other people translate the translated phrase back); these multiple translations would then be fed to a machine learning application that would tune its existing grammars and vocabularies based on statistical analysis of the sample set.

To make the project fun and engaging, one could arrange a ranking system like that used in distributed computing projects. To avoid abuse one would screen submitted translations against those available from machine translation web services, and would discard and decline competition credit for translations that were too far from the norm. Accounts with repeated abuse would be locked out.
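The abuse screen described here, comparing submissions against an MT baseline and discarding outliers, could be sketched with a crude word-overlap (Jaccard) similarity. The threshold and examples are invented; a real system would need something far more robust:

```python
# Sketch of screening a submitted translation against a reference
# from an MT web service, rejecting submissions too far from the
# norm. Word-overlap similarity is a deliberately crude stand-in.
def jaccard(a, b):
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def screen(submission, machine_reference, threshold=0.2):
    """Accept only if the submission shares some vocabulary
    with the machine translation of the same fragment."""
    return jaccard(submission, machine_reference) >= threshold

print(screen("the cat sat on the mat", "the cat is on the mat"))  # True
print(screen("buy cheap pills online", "the cat is on the mat"))  # False
```

The obvious catch: a good human translation can legitimately be far from a bad machine one, so this filter would also discard exactly the corrections you most want.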

Or not. Just a thought. IANAMTEOAMLE
posted by Good Brain at 7:06 PM on March 24, 2005


Any idiot could knock together a one word => one word machine translator in half a day, so I think they're a bit more complicated than you give them credit for. They have proper linguists working on this, and I'm sure they've thought of everything obvious.

The other thing is that the public being able to tell the system what the correct answer was won't really help evolve the algorithm. It might know what to do if it sees the exact same phrase again, but it won't know in what other circumstances it needs to make the same correction, if any, or how to recognise them. Defining the scope of when to apply a certain linguistic rule is the hard part, and that's probably what MT designers spend their time hand-optimizing. I don't think throwing numbers at the problem will help; any useful data will be lost in the noise of everyone's suggestions.
posted by cillit bang at 8:34 PM on March 24, 2005


I've often wondered how feasible it would be to make such a thing human-assisted. You sign up to be a translator for a service, say, and you are assigned an instant messenger ID. You log in whenever you feel like it, and while you're in front of a computer you occasionally get requests to translate sentences or paragraphs. You fill in the answer and submit it. As described above, several people might be polled so that you have overlap for all or most of the document. Some other people who can't translate, but who speak the target language, might get sent multiple versions of sentences, or whole paragraphs, and rank them for correctness and understandability. Presumably you'd get paid for any of your translations that pass muster.

This puts the machine part into what machines do best: collecting and scoring information, processing payments, assembling text, etc., and puts the fuzzy parts into the hands of us fuzzy thinkers. You wouldn't get your response back for a while, of course; it might take minutes to hours depending on the length, the complexity, the number of people in the program, and so on.

This is something my company is having to deal with. Because we recently signed a deal with a Canadian company, all of a sudden we need a French version of our website. And this is a big deal because our website is about 5% static text and 95% dynamically generated, with huge chunks of displayed text residing in databases, templates, constructed variables, etc. Add to this the complexity of handling multiple currencies (we're an investment firm), different localization parameters, etc. Ugh.

Are there things you can pay for that are better than Babelfish? Babelfish sucks really bad and has for a long, long, LONG time, and I think it often does little better than word-for-word literal translations. For fun, translate something from English to anything else, and then back to English. Hilarity ensues. I once made some messenger plugins to do translation using Babelfish. It was pretty hilarious.
posted by RustyBrooks at 9:16 PM on March 24, 2005


There's something like this out there. Check it out.
posted by adamrice at 9:31 PM on March 24, 2005


RustyBrooks, you've obviously never translated a substantial document of significant difficulty. The problem is that single sentences don't stand alone and usually can't be easily translated without the surrounding context. Let me illustrate that with a sample sentence taken from your post:

"You fill in the answer and submit it."

This sentence contains a number of ambiguities, starting with who is meant by "you" and what "it" is. If you translate it into a language close to English, like German or Dutch, you might even come up with an understandable result, but go into something a bit further removed like Chinese or Japanese, and the result will be complete gibberish, in all likelihood, even when done by a human translator.

On to the original question: Why isn't there a system that allows members of the public to collectively help improve the quality of computer-produced translations?


I think one important factor is that it is not so much the dictionary that is the problem, but the algorithm behind the whole thing (e.g. the way ambiguities are resolved), which is much more difficult to improve in a collaborative effort. Still, some of your ideas are quite interesting; maybe part of the answer is simply that nobody has tried so far.
posted by sour cream at 12:02 AM on March 25, 2005


What you're talking about is a combination of 2 concepts that are still in pretty early stages of development: machine translation and machine learning. There are a number of researchers around who are experimenting with these technologies (I used to be one of them), and if you looked around enough you could probably find some publicly accessible experiments that demonstrate what you asked about, but for the most part both technologies are too young to really be effectively combined into a commercial product at this point. Machine learning in particular is still in very rudimentary stages of development.

So you know, most people working in the machine translation field agree that machine translation will only really become useful once it's been successfully integrated with machine learning.

Wait a few more years and someone will have developed something useful or at least released a big public test project.
posted by mexican at 12:09 AM on March 25, 2005


I dashed off a brief response earlier. I'll come back with more.

Machine translation was once considered a realistic, achievable, desirable, and very high-profile goal for computing. A huge amount of resources and scholarship were thrown at the problem, and you can all see where we're at. Frankly, I'm surprised it is as good as it is.

Some years ago, I saw a presentation by one of Japan's leading researchers on MT. He divided the problem into 5 levels--lexical analysis, syntactic analysis, semantic analysis, situational analysis, and one other I can't remember. He also discussed the two main approaches to MT: the algorithmic approach (basically diagramming sentences) and the corpus approach (starting with a colossal repository of paired canned phrases, and finding exact matches between it and your source text). I think most MT is a mix of these two now.

The classic example of the problems facing MT is the following pair of sentences:
1. The pen is in the box.
2. The box is in the pen.
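One way to see why that pair is hard: choosing the right sense of "pen" takes world knowledge about relative sizes, not grammar. A toy sketch, with all sizes and sense inventories invented for illustration:

```python
# Sketch of the world knowledge the pen/box pair demands: "X is in Y"
# only makes sense if X fits inside Y, so the right sense of "pen"
# depends on which role it plays. Sizes are invented placeholders.
SENSES = {"pen": [("writing instrument", 1), ("animal enclosure", 100)],
          "box": [("container", 10)]}

def pick_sense(word, role, other_size):
    """Pick the sense whose size is consistent with 'X is in Y'."""
    for sense, size in SENSES[word]:
        if (role == "contents" and size < other_size) or \
           (role == "container" and size > other_size):
            return sense
    return None

box_size = SENSES["box"][0][1]
print(pick_sense("pen", "contents", box_size))   # "The pen is in the box"
print(pick_sense("pen", "container", box_size))  # "The box is in the pen"
```

Encoding that kind of knowledge for a whole language is exactly the part that never scaled.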
posted by adamrice at 7:27 AM on March 25, 2005


Best answer: Scroll to the end for the short answer. Here's the long answer:

Machine translation still sucks. Why isn't there a system that allows members of the public to collectively help improve the quality of computer-produced translations?

There is a huge variety of different approaches with vastly different underpinnings. Many of them are intended to solve different types of problems. The type of knowledge needed is going to depend on the system. So there's a basic problem: in what format can a general user provide information to the system? Well, is the system a complex knowledge-based system (e.g. Babelfish, which is based on Systran, as are all other web-based systems that I'm aware of)? Those types of systems rely on complex internal representations, and it requires training to provide input in the proper format. So it's not really amenable to input from general users. On the other end of the spectrum, there are statistical systems which generally require examples of translations. However, these systems have other requirements, which I'll explain below.

whenever I use it it just seems to me like it could do a lot better if a large amount of humans were in a position to improve and build upon the existing dictionaries.

They are, just not in the precise way that you're suggesting. Modern statistical machine translation systems are based on machine learning principles. The idea is that if you provide enough example translations, then the algorithm can learn enough information to translate new sentences. The statistical models in these systems consist of billions of parameters. They require millions of input examples. Guess where all the input examples come from? We mine them from the internet. Some commonly used sources are the Bible; multilingual news services (e.g. the BBC, Xinhua); and the proceedings of multilingual governments (e.g. the Canadian parliament, the government of Hong Kong, and the European Parliament). Here's the problem with statistical systems: the performance of a system trained on one type of data (e.g. news) tends to degrade when it is applied to a different type of data (e.g. governmental proceedings). The best system to apply to any particular input is one trained only on data of the same type -- using a heterogeneous training set can actually reduce performance. In other words, even though this technology is really cool and works well within certain domains, the possibility of building a general-purpose translation engine on it is still remote.
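The core statistical idea can be shown in miniature: given enough aligned sentence pairs, plain co-occurrence counts already start to reveal word translations. Real systems use EM-style alignment models over millions of pairs; this toy version just counts, and the example pairs are invented:

```python
# Toy sketch of learning translations from example sentence pairs:
# count which target words co-occur with each source word. Real
# statistical MT uses proper alignment models (e.g. EM training)
# over millions of pairs; the data here is invented.
from collections import Counter, defaultdict

pairs = [("the house", "la maison"),
         ("the book", "le livre"),
         ("a house", "une maison")]

cooc = defaultdict(Counter)
for src, tgt in pairs:
    for s in src.split():
        for t in tgt.split():
            cooc[s][t] += 1

# "house" co-occurs with "maison" in both of its sentences,
# so "maison" wins the count.
print(cooc["house"].most_common(1))  # -> [('maison', 2)]
```

With three sentence pairs this barely works; the point of the answer above is that it takes millions of pairs, of the right domain and quality, before it works well.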

A training corpus comprised of input data from random internet users is not only likely to be problematic for reasons of heterogeneity, but also because there's no way to ensure data quality. The news agencies and governments use professional translators to produce their documents. Therefore, a minimal level of fidelity is guaranteed in their data. No such guarantee is possible in unreviewed, unedited data, so it's much more likely to produce an unusable system.

For example, a common two word phrase will get lost in translation when that seems reasonably easy to overcome by specifying that "if word A and B occur next to each other, phrase AB is intended, therefore translate it as such" and pointing the program to the intended entry in the lexicon.

I'm not sure how to answer this, except to say that current state-of-the-art statistical systems do in fact do this. They are capable of translating phrases of arbitrary length, not merely single words. By the by, it's worth pointing out here that the meaning of "word" isn't necessarily well-defined in this context anyway. Even in English, where word boundaries are commonly understood (i.e. spaces between characters), the input "words" that are processed by an MT system may be somewhat different -- they could be roots, morphemes, or some other intermediate representation, and they may very well be tokenized differently than in the original text (for instance, contractions such as "don't" may be split into two words). Once you get into languages like Chinese and German, where the word boundaries aren't even explicit in the text, it gets even more complicated, and you have to make some simplifying assumptions, which are often true in the general case, but not always true in specific cases.
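The tokenization point can be made concrete with a tiny English example: the "words" a system sees may not match the surface text. The two rules below are a minimal illustration, not any real system's tokenizer:

```python
# Sketch of MT-style tokenization: split contractions so the model
# sees "do n't" rather than "don't". These two rules are a minimal
# illustration only, not a real tokenizer.
import re

def tokenize(text):
    text = re.sub(r"n't\b", " n't", text)         # don't -> do n't
    text = re.sub(r"'(s|re|ll)\b", r" '\1", text)  # it's -> it 's
    return text.split()

print(tokenize("don't we'll it's"))
# -> ['do', "n't", 'we', "'ll", 'it', "'s"]
```

For Chinese, where the text has no spaces at all, the equivalent step is a full segmentation model rather than a couple of regexes.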

Also, all current MT programs I'm aware of translate one word in the source language to exactly one word in the target language (the most common meaning, I guess, but I'm not sure).

As I've said above, there are systems that translate multiple words to multiple words as a unit. But I'm mostly familiar with the research systems, so I can't point you to any commercial examples.

Words have ambiguous meanings, deal with it, and I personally would have no problem with 'multifinality', i.e. with more than one outcome, perhaps made visible to the user with some on-hover DHTML.

Most MT engines produce multiple outputs as a matter of course, so what you're talking about here is a matter of interface. FWIW, I know that there have been user studies on exactly what you're talking about, although I'm not sure what the general consensus of the field (if there is one) is on this matter. However, I'm pretty sure that there are cross-lingual IR systems that pursue something similar to this principle. In these systems, the purpose of the translation is not to be precise; it is simply to provide enough information to the user (a searcher who is unfamiliar with the language used in the indexed collection) to decide whether s/he has found the correct document, which can then be translated more accurately using other means (often a paid translator). I think that some of these systems offer a "multifinality" option, as you call it. Again, these are research systems.
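The "multiple outputs as a matter of course" point amounts to an n-best list: the engine scores many candidates internally, and multifinality just means surfacing the top few instead of one. The candidates and scores below are invented for illustration:

```python
# Sketch of an n-best list: engines score many candidate renderings
# internally; a "multifinality" UI would show the top few. All
# candidates and scores here are invented.
candidates = [("bank (financial institution)", 0.62),
              ("bank (river edge)", 0.31),
              ("bench", 0.07)]

def n_best(cands, n=2):
    """Return the n highest-scoring candidate translations."""
    return sorted(cands, key=lambda c: c[1], reverse=True)[:n]

for translation, score in n_best(candidates):
    print(f"{translation}\t{score:.2f}")
```

An on-hover widget would simply render this list next to the chosen translation, exactly as the original question proposes.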

The short answer is: people are working on all of these things.
posted by alopez at 9:20 AM on March 25, 2005

