Looking for the name of most languages in every language
February 16, 2012 4:33 AM

ShotInTheDarkfilter: I need the names of the top 60 or so most common languages in each of those languages.

I already have a list of the 60 with the following three columns:
ISO 639-1 Code | Name of Language in that Language | Name in English

So, the entry for German would be:
de | deutsch | German

What I need now is to fill in the rest of the list with the name of each one of those languages in every other language that's in that list.

My only idea is to write a script to query google translate, but I'm not sure of the quality of the results (and I have no way to check!), whether that would be ok with their TOS, and I'd rather avoid having to write that script in the first place. I'm kind of hoping this information is already out there somewhere.

Bonus points for it being easily machine readable - spreadsheet, CSV, xml, etc.
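
For reference, here is a minimal Python sketch of the table being asked for. It assumes the existing list is stored as pipe-separated lines exactly like the German example above. The function names, the `grid[target][in_language]` layout, and the `en` and `el` sample rows are illustrative assumptions, not an existing resource.

```python
import csv
import io

def parse_rows(text):
    """Parse 'code | native name | English name' lines into tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        code, native, english = [part.strip() for part in line.split("|")]
        rows.append((code, native, english))
    return rows

def empty_grid(rows):
    """grid[target][in_language] = name of `target` written in `in_language`.

    Only the cells already known from the three-column list are filled in.
    """
    codes = [code for code, _, _ in rows]
    grid = {target: {lang: "" for lang in codes} for target in codes}
    for code, native, english in rows:
        if "en" in grid[code]:
            grid[code]["en"] = english   # the "Name in English" column
        grid[code][code] = native        # the language's name for itself
    return grid

def to_csv(grid):
    """Serialise the grid with one row per target language."""
    codes = list(grid)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["code", *codes])
    for target in codes:
        writer.writerow([target, *(grid[target][lang] for lang in codes)])
    return buf.getvalue()

SAMPLE = """de | deutsch | German
en | English | English
el | Ελληνικά | Greek"""

grid = empty_grid(parse_rows(SAMPLE))
```

Every cell that is still an empty string is one of the names that remains to be found.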
posted by tempythethird to Writing & Language (7 answers total)
The list of Wikipedias has links to the English-language article on each of the languages, then on the left side of those Wikipedia articles are links to corresponding articles on the other-language wikis, each of which should have approximately what you're looking for as page titles. (But it might be "x language" instead of just "x", e.g. you end up at "deutsche Sprache" instead of just "deutsche".)
posted by XMLicious at 4:56 AM on February 16, 2012

ISO 639-1 codes. Copy and paste the table into a spreadsheet, export the columns you want.
posted by ellenaim at 5:06 AM on February 16, 2012

He's looking for things like the word for French in Russian or the word for Spanish in Chinese, not just the language's own name in each language.
posted by XMLicious at 5:42 AM on February 16, 2012 [1 favorite]

Best answer: Here is the Wikipedia list of languages (English), and the left-hand side menu gives you a list of alternative languages for which such a list exists.

Come to think of it, that's not terribly useful, so I think your best bet, if nobody has a better idea/knows of a resource, is to write a script which goes through the English Wikipedia list of languages (linked above) one by one, clicks on each one you have in your list, and, using the "Name of Language in that Language" image, as it were, scans the left-hand side menu for the relevant link, then goes to that page, and gets the first word(s) that looks like your language-keyword.

For instance, for Greek that would be:

ISO: el
Name of Language in that Language: Ελληνικά
Name in English: Greek
Name in German: Griechisch
Name in Abkhaz: Абарзен бызшәа

etc. At the end, you can manually fill in whatever is left empty.
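
A rough Python sketch of the script described above, with one swap: instead of scraping the left-hand menu, it asks the MediaWiki API for the same interlanguage links via `prop=langlinks`. It assumes the default JSON response format, where each link's title sits under the `"*"` key. The function names and the User-Agent string are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikipedia.org/w/api.php"

def langlinks_url(title):
    """Build the API URL listing interlanguage links for one English article."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "titles": title,
        "lllimit": "500",
        "format": "json",
    }
    return API + "?" + urlencode(params)

def parse_langlinks(response):
    """Map language code -> article title on that language's Wikipedia."""
    names = {}
    pages = response.get("query", {}).get("pages", {})
    for page in pages.values():
        for link in page.get("langlinks", []):
            # Default JSON format stores the title under "*"; fall back to "title".
            names[link["lang"]] = link.get("*", link.get("title"))
    return names

def fetch_names(title):
    """Network call; not exercised here. The User-Agent value is a placeholder."""
    request = Request(langlinks_url(title),
                      headers={"User-Agent": "language-names-sketch/0.1"})
    with urlopen(request) as resp:
        return parse_langlinks(json.load(resp))
```

Calling `fetch_names("Greek language")` should return the titles of the matching articles on the other Wikipedias, which are page titles and so may still be of the "x language" kind.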

Problem is, as XMLicious says above, that some of the pages (or many, there's no way to know, really) translate the language name as "x language", so, for instance, German has "Die griechische Sprache", and in the Abkhaz example above, the word "бызшәа" seems to translate as "language". Depending on the data quality you have in mind, you might want to have your table checked by native speakers (or, at least, relatively competent speakers - there's no native speaker level necessary for knowing how to correctly say "I speak X" or some such). And given that all you'd ask is for people to correct 60 words, all it should take is 5 minutes each, if that.

Actually, I've thought of something: given that you sound way more computer literate than I am - it might be possible for your script to get the page name, which is likely in each case to be the name of the language. If the page name is one word only, just go with that one (for instance, the Afrikaans title for the Greek page is "Grieks"). If it is 2 words (as it is in the case of German, which has "Griechische Sprache" as its title - I cheated a bit on that one), you can then get your script to look inside the page and see if the first bolded word/group of words are the same (if not, get that rather than page title), and if you are still stuck with more than one word, instruct it to get the title/first cell in the table on the right, if there is one. If that still doesn't help, you could go and do more searching and analyzing inside the text, but that seems to become too involved for a casual piece of programming. But yeah, I don't have the foggiest idea about programming, so this might all be silly.
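
The fallback order in the paragraph above could be sketched roughly like this in Python. It assumes the page title, the first bolded phrase, and the infobox heading have already been extracted by some other step. `pick_name` and the source labels it returns are hypothetical names.

```python
def is_single_word(text):
    """True if the text is exactly one whitespace-delimited word."""
    return bool(text) and len(text.split()) == 1

def pick_name(page_title, first_bold=None, infobox_title=None):
    """Return (name, source) following the title -> bold -> infobox order.

    source is "title", "bold", "infobox", or "needs_review" when every
    candidate still has more than one word.
    """
    candidate, source = page_title.strip(), "title"
    if is_single_word(candidate):
        return candidate, source

    # Prefer the first bolded phrase if it differs from the page title.
    if first_bold and first_bold.strip() != candidate:
        candidate, source = first_bold.strip(), "bold"
        if is_single_word(candidate):
            return candidate, source

    # Still more than one word: try the heading of the table on the right.
    if infobox_title and is_single_word(infobox_title.strip()):
        return infobox_title.strip(), "infobox"

    # Nothing yielded a single word; leave it for a human to check.
    return candidate, "needs_review"
```

For the Afrikaans example, `pick_name("Grieks")` takes the title as-is. Anything that comes back as `"needs_review"` lands in the pile to be filled in manually.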

Good luck, seems like you're putting together a useful resource, will it go public? Sorry if this is nosy.
posted by miorita at 5:48 AM on February 16, 2012

Response by poster: miorita
Thanks for your great answer, your proposal is right on. I would only modify the approach in your last paragraph as such: if it's two words, then run both of them through google translate to English - if one of them translates to "language" then I toss it.
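
A minimal Python sketch of that filter, using a small hand-maintained word set in place of Google Translate (a deliberate simplification, and the set below is only a placeholder built from words mentioned in this thread). Note that whatever is left after dropping the "language" word may not be the right word form on its own.

```python
# Placeholder list of words meaning "language"; a real list would need
# one or more entries per language and is an assumption of this sketch.
GENERIC_WORDS = {"language", "sprache", "limba", "бызшәа"}

def strip_generic(name, generic_words=GENERIC_WORDS):
    """Drop any word that means "language"; keep the rest in order.

    The name is returned unchanged if nothing matches or if every
    word would be removed.
    """
    words = name.split()
    kept = [w for w in words if w.casefold() not in generic_words]
    if not kept or len(kept) == len(words):
        return name
    return " ".join(kept)
```

With this set, `strip_generic("Griechische Sprache")` returns `"Griechische"`, and single-word titles such as `"Grieks"` pass through untouched.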

Naturally I would still prefer to just find a prepackaged resource (it's gotta exist, right!?) but if I don't I think your plan will be what I will do.

The accuracy has to be good as this information is going to form the core of a search system.

And sure I have no qualms over making the results public, as long as doing so doesn't violate the TOS of wikipedia or google translate (if I choose to use it). Any idea who would be interested in such a collection?
posted by tempythethird at 6:09 AM on February 16, 2012

...if it's two words, then run both of them through google translate to English - if one of them translates to "language" then I toss it.

Just note that this might not obtain the correct result, as you might be going from an adjective to a noun and end up with the wrong suffix / word form. And I wonder if in some languages the presence of the word "language" is de rigueur; I'm noticing that the left-hand links include that in some cases like "Bahasa Indonesia".

Any idea who would be interested in such a collection?

Maybe stick it somewhere in WikiBooks?
posted by XMLicious at 6:18 AM on February 16, 2012

Actually, I was going to include a warning along the same lines as XMLicious - in the German version, for instance, you would be left with "deutsche" for German, or "griechische" for my example above, which is incorrect on two counts: the language is called "Griechisch" with capital G and no final e. In "die griechische Sprache", griechische is an adjective which agrees with its noun, "Sprache" (language), whilst "Griechisch" is itself a noun and appears in its noun form (nouns are capitalised). Google translate does not capture such changes - actually, no online dictionary would allow you to verify such differences other than manually (at least, I cannot imagine being able to make this process automatic, but then, as I said, I really know nothing about programming), because such rules don't tend to be active in English. And yes, cases like "Bahasa Indonesia" (I think most languages are "Bahasa x" in Indonesian - English is, I think, "Bahasa Inggris", for instance) could be a further spoke in the wheels.

I think a potential part-solution would be as follows: have a "lead" equivalent for each language (the one you get from Wikipedia), with optional variants listed in the same slot. I think this makes sense, especially for a search engine, since some languages might well have more than one name in a given language. For instance, "German" in Romanian could be "limba germana", "germana" or "nemteste" (the latter would be more adjectival, but you never quite know what people search for), or, a better example still, for "Hungarian", Romanian uses "limba maghiara", "maghiara" or "ungureste".
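
As a sketch of the lead-plus-variants idea in Python: each (target language, naming language) slot holds one lead form and any number of variants, and a reverse index lets a search term in any form resolve to the language it names. The class and function names are hypothetical, and which form is marked as the lead below is arbitrary, chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NameEntry:
    lead: str                                   # the form shown to users
    variants: List[str] = field(default_factory=list)  # extra forms for search only

# Keys are (target language code, language the name is written in).
NAMES = {
    ("de", "ro"): NameEntry("limba germana", ["germana", "nemteste"]),
    ("hu", "ro"): NameEntry("limba maghiara", ["maghiara", "ungureste"]),
}

def build_index(names):
    """Map every lead and variant (case-folded) to the target codes it names."""
    index = {}
    for (target, _in_language), entry in names.items():
        for form in [entry.lead, *entry.variants]:
            index.setdefault(form.casefold(), set()).add(target)
    return index

def lookup(index, query):
    """Return the set of language codes a search term could refer to."""
    return index.get(query.strip().casefold(), set())

INDEX = build_index(NAMES)
```

Here `lookup(INDEX, "ungureste")` and `lookup(INDEX, "Limba maghiara")` both resolve to `{"hu"}`, while only the lead would be shown in any public list.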

If you go for this option (lead and variants), and if your list remains hidden as such, you can maybe write a script which allows for a "clever" search engine, meaning that you hard-code all the various language alternatives for the word "language", and see what form of the remaining word is both most frequently searched for AND appears in whatever results are displayed as a one-word. If this works, you will probably end up with the variant which is most reliably used as the standard name for the respective language. Alternative spellings or alternative words can be added to your list as potential variants if they are repeatedly used by different users in conjunction with one of the already existing translations.

For various reasons, this might not be as easy as it sounds, and you might have to tinker with it, but it might be worth a shot.

Anyway, getting back to the other point - I could imagine that quite a few web-based services would be interested in such a list - maybe, as XMLicious says, Wikipedia, maybe the ISO people, other language resources. I'd even see it work quite well as a translation aid on a web-page of its own - maybe a fun extra service you could offer on the site where it will be used for searches. If you do display it though, I'd probably only use "lead" entries and verified, reliable variants - the whole assembly of terms which would (maybe) enrich a search engine would possibly be inelegant in a list that can be read by everybody.

Again, good luck.
posted by miorita at 9:17 AM on February 16, 2012
