Are there any open-source language detection algos?
October 5, 2006 1:42 AM

Is there a piece of open-source software which analyzes a text file (or standard input) and tries to determine the language it's written in?

Preferably something that

1) already has training data, or at least has enough data for someone to easily train it themselves
2) can work properly if everything is UTF-8 encoded
3) can list some type of score to indicate confidence (and possibly display alternatives which scored higher than 0%)
4) can be used from the command line - something like this:


# lang essay.txt
essay.txt: Ukrainian (88.9% sure), Russian (4.3% sure)


I know how to write one myself, but I don't want to waste the effort if it already exists. However, if it turns out there's no free equivalent, I'll seriously consider doing the project and posting it on Sourceforge.
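
For anyone wondering what "writing one myself" might involve, here is a minimal sketch in Python. It assumes a character-trigram approach with cosine similarity, which is only one common way to do this and is not claimed to be how any of the tools named in this thread work. The names SAMPLES, trigrams, cosine and guess are made up for illustration, the training sentences are toy data far too small for real use, and the percentages are just normalized similarity scores, not true probabilities.

```python
import sys
from collections import Counter
from math import sqrt

# Toy training text (an assumption for illustration only);
# a usable tool would need much larger samples per language.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this "
               "is a simple sentence in the english language",
    "Ukrainian": "це просте речення українською мовою і воно потрібне "
                 "лише для прикладу",
    "Russian": "это простое предложение на русском языке и оно нужно "
               "только для примера",
}


def trigrams(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def guess(text):
    """Return (language, percent) pairs, best first, omitting zero scores."""
    doc = trigrams(text)
    scores = {lang: cosine(doc, prof) for lang, prof in PROFILES.items()}
    total = sum(scores.values())
    if total == 0:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(lang, 100.0 * s / total) for lang, s in ranked if s > 0]


def main(argv):
    # Files are read as UTF-8, matching requirement 2 above.
    for path in argv[1:]:
        with open(path, encoding="utf-8") as handle:
            results = guess(handle.read())
        summary = ", ".join(f"{lang} ({pct:.1f}% sure)" for lang, pct in results)
        print(f"{path}: {summary or 'unknown'}")


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a hypothetical name such as lang.py, it would be run as "python lang.py essay.txt" and print one line per file in roughly the format shown above. With training data this small the scores are not meaningful; the point is only the overall shape of the approach.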
posted by helios to computers & internet (4 comments total)
This tool seems to meet every criterion except #3.
posted by gsteff at 2:03 AM on October 5, 2006


Fortunately, that site also provides a long list of competitors.
posted by gsteff at 2:04 AM on October 5, 2006


Languid does a similar thing. It uses the Perl module Language::Guess behind the scenes, which could just as easily be used in a short Perl script for your command line requirement.
posted by thebabelfish at 8:45 AM on October 5, 2006


I tried text_cat, but it only supports 1 or 2 languages in UTF-8.

Languid works very well (and is on the competitors list that gsteff provided).

Thanks!
posted by helios at 6:47 PM on October 5, 2006

