need a library for text classification
July 20, 2008 4:17 PM

I need to automatically identify the language (English, Japanese, Russian, etc.) in which a particular blog post was written. (A lang attribute may or may not be available.)

A few years ago I came across a library for RSS feeds that did roughly what I need, but I can't find it anymore.
posted by chexov to Computers & Internet (4 answers total) 1 user marked this as a favorite
I'd bet that if you have a reasonably small set of possible languages, you can do it with just digraph frequencies. Each blog post can be boiled down to a point in an N^2-dimensional space (where N is the number of letters you're left with after case-smashing, removing diacriticals you don't like, and so on); use a bunch of sample text in the same way to find a point for each language of interest; then categorize each post by whichever language's example point is closest.

Not a very sophisticated algorithm, but it's simple and might work perfectly well.
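
For what it's worth, here is a minimal Python sketch of that idea, not a tested implementation. The function names (normalize, digraph_profile, distance, guess_language) and the two tiny sample sentences are made up for illustration; real use would need far more sample text per language, and plain Euclidean distance is just one reasonable reading of "closest".

```python
import math
import unicodedata
from collections import Counter


def normalize(text):
    """Lower-case, strip diacritics, keep only letters separated by single spaces."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    chars = [c if c.isalpha() else " "
             for c in decomposed
             if not unicodedata.combining(c)]  # drop the detached accent marks
    return " ".join("".join(chars).split())


def digraph_profile(text):
    """Relative frequency of each adjacent letter pair, counted within words."""
    counts = Counter()
    for word in normalize(text).split():
        counts.update(word[i:i + 2] for i in range(len(word) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def distance(p, q):
    """Euclidean distance between two sparse frequency vectors."""
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2
                         for k in set(p) | set(q)))


def guess_language(text, profiles):
    """Return the language whose sample profile is closest to the text's profile."""
    target = digraph_profile(text)
    return min(profiles, key=lambda lang: distance(target, profiles[lang]))


# Placeholder samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog "
          "and then runs through the forest",
    "ru": "съешь же ещё этих мягких французских булок да выпей чаю",
}
PROFILES = {lang: digraph_profile(text) for lang, text in SAMPLES.items()}

print(guess_language("the other one", PROFILES))
print(guess_language("этих булок", PROFILES))
```

Storing each profile as a dict of only the digraphs actually seen keeps things small even though the full space is N^2-dimensional.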
posted by hattifattener at 4:43 PM on July 20, 2008

It's written as a Perl library, available from the site above.
posted by thebabelfish at 4:50 PM on July 20, 2008

Nice find, eponysterical-babelfish.

CPAN also turns up modules like Lingua::Identify, Text::Language::Guess, Lingua::Ident (which looks like it implements my idea above)...
posted by hattifattener at 5:05 PM on July 20, 2008

- is the one I was looking for. Many, many thanks!
posted by chexov at 11:08 PM on July 20, 2008
