Are there any open-source language detection algos?
October 5, 2006 1:42 AM
Is there a piece of open-source software which analyzes a text file (or standard input) and tries to determine the language it's written in?
Preferably something that
1) already has training data, or at least has enough data for someone to easily train it themselves
2) can work properly if everything is UTF-8 encoded
3) can list some type of score to indicate confidence (and possibly display alternatives which scored higher than 0%)
4) can be used from the command line - something like this:
# lang essay.txt
essay.txt: Ukrainian (88.9% sure), Russian (4.3% sure)
I know how to write one myself, but I don't want to waste the effort if it already exists. (However, if it turns out there's no free equivalent, I'll seriously consider doing the project and posting it on SourceForge.)
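To make the request concrete, here is a rough sketch of the kind of tool I mean. It is not an existing program, only an illustration under some assumptions: it compares character trigram counts using cosine similarity, the three training samples are toy placeholders (a real tool would need far larger corpora), and the names `SAMPLES`, `trigrams`, `cosine`, and `detect` are made up for this example. The percentages are just normalized similarity scores, not calibrated probabilities.

```python
import sys
from collections import Counter
from math import sqrt

# Toy training samples (assumption: placeholders only, far too small for real use).
SAMPLES = {
    "English": "the quick brown fox jumps over the lazy dog and this is "
               "a simple sentence in the english language",
    "Ukrainian": "це просте речення українською мовою і вона дуже гарна "
                 "та цікава для всіх хто її вивчає",
    "Russian": "это простое предложение на русском языке и он очень "
               "красивый и интересный для всех кто его изучает",
}

def trigrams(text):
    """Count overlapping character trigrams in lowercased, space-padded text."""
    text = " " + " ".join(text.lower().split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# One trigram profile per language, built once from the toy samples.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def detect(text):
    """Return (language, percent) pairs, best first, omitting zero scores."""
    probe = trigrams(text)
    sims = {lang: cosine(probe, profile) for lang, profile in PROFILES.items()}
    total = sum(sims.values())
    if total == 0:
        return []
    ranked = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)
    return [(lang, 100.0 * (score / total)) for lang, score in ranked if score > 0]

def main(argv):
    if len(argv) != 2:
        print("usage: lang FILE   (use - for standard input)", file=sys.stderr)
        return
    if argv[1] == "-":
        name = "(stdin)"
        text = sys.stdin.buffer.read().decode("utf-8")  # decode input as UTF-8
    else:
        name = argv[1]
        with open(name, encoding="utf-8") as handle:
            text = handle.read()
    results = detect(text)
    if results:
        print(f"{name}: " + ", ".join(f"{lang} ({pct:.1f}% sure)" for lang, pct in results))
    else:
        print(f"{name}: unknown")

if __name__ == "__main__":
    main(sys.argv)
```

Saved as an executable named `lang`, this would be run as `lang essay.txt`, or as `lang -` to read standard input. With samples this small the scores mean very little; the point is only the shape of the interface and the output.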
posted by helios to computers & internet (4 comments total)
posted by gsteff at 2:03 AM on October 5, 2006