Are there any open-source language detection algos?
October 5, 2006 1:42 AM

Is there a piece of open-source software which analyzes a text file (or standard input) and tries to determine the language it's written in?

Preferably something that

1) already has training data, or at least has enough data for someone to easily train it themselves
2) can work properly if everything is UTF-8 encoded
3) can list some type of score to indicate confidence (and possibly display alternatives which scored higher than 0%)
4) can be used from the command line - something like this:

# lang essay.txt
essay.txt: Ukrainian (88.9% sure), Russian (4.3% sure)

I know how to write one myself, but I don't want to waste the effort if it already exists. (However, if it turns out there's no free equivalent, I'll seriously consider doing the project and posting it on SourceForge.)
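
For reference, here is a rough sketch of what such a tool might look like, using rank-ordered character n-gram profiles, which is one common approach to this problem. Everything in it is an assumption made for illustration: the names (ngram_profile, out_of_place, guess_language, SAMPLES, PROFILES) are made up, the one-sentence training samples are toys that a real tool would replace with a large corpus per language, and the "confidence" is just a normalized similarity, not a calibrated probability. Python strings are Unicode, so reading the input as UTF-8 covers requirement 2.

```python
import sys
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Map each of the most frequent character n-grams (1..n_max) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are stable.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:top]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc, lang, penalty):
    """Sum of rank differences; n-grams missing from lang cost a fixed penalty."""
    return sum(abs(rank - lang[g]) if g in lang else penalty
               for g, rank in doc.items())

def guess_language(text, profiles, top=300):
    """Return (language, score) pairs, best first; scores sum to 1."""
    doc = ngram_profile(text, top=top)
    worst = top * len(doc) or 1  # largest distance any profile could get
    sims = {name: 1 - out_of_place(doc, prof, top) / worst
            for name, prof in profiles.items()}
    total = sum(sims.values()) or 1
    return sorted(((name, s / total) for name, s in sims.items()),
                  key=lambda pair: -pair[1])

# Toy training data (assumption): far too small for real use.
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is a sample of english text",
    "Russian": "это пример текста на русском языке который мы используем для обучения",
    "Ukrainian": "це приклад тексту українською мовою який ми використовуємо для навчання",
}
PROFILES = {name: ngram_profile(sample) for name, sample in SAMPLES.items()}

if __name__ == "__main__":
    # Usage: python lang.py essay.txt [more files...]
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            ranked = guess_language(handle.read(), PROFILES)
        print(f"{path}: " + ", ".join(f"{name} ({score:.1%} sure)"
                                      for name, score in ranked))
```

With only three tiny profiles the scores will sit fairly close together; sharper numbers would need much more training text and probably a different way of turning distances into percentages.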
posted by helios to Computers & Internet (4 answers total)
This tool seems to meet every criterion except #3.
posted by gsteff at 2:03 AM on October 5, 2006

Fortunately, that site also provides a long list of competitors.
posted by gsteff at 2:04 AM on October 5, 2006

Languid does a similar thing. It uses the Perl module Language::Guess behind the scenes, which could just as easily be used in a short Perl script for your command line requirement.
posted by thebabelfish at 8:45 AM on October 5, 2006

I tried text_cat, but it only supports 1 or 2 languages in UTF-8.

Languid works very well (and is on the competitors list that gsteff provided).

posted by helios at 6:47 PM on October 5, 2006
