Recommendations for an easy, customizable Python Parser for NLP?
June 20, 2014 7:54 AM

There are loads of options (too many!), but I can't figure out whether there is a basic parser that I could use and tweak a bit, or which one would be the most straightforward starting point for creating a new parser for a language variety that doesn't have one. Any ideas for someone with limited skill would be awesome at this point.

I am working with a small corpus I made for an Arabic dialect and need to do a syntactic analysis of it. I know a little Python: I've done 95% of the Codecademy course and have been using it for things like frequency counts, removing affixes (to get accurate frequency counts), and standardizing spelling.

My corpus is about 50,000 words, so hand-tagging the whole thing would be possible but awfully painful. I'm considering it since I haven't found anything that looks right. I know I'll need to hand-tag some for training and can do that. I also know that it won't be perfect no matter what.

At a bare minimum, I need to tag nouns and verbs accurately, and full clauses would be the next major 'want' but a variety of constituents is ideal. My data contains some of another language as well, so the end goal is to compare the structure of those parts to the Arabic-only parts. This is what I can't see doing without a parser.
posted by petiteviolette to Computers & Internet (4 answers total) 3 users marked this as a favorite
You probably ran into nltk, and I'm pretty sure that's what most people use for this sort of thing. It's not the easiest library in the world to work with, but there's some good documentation out there. This stackoverflow has some links.
posted by evisceratordeath at 8:07 AM on June 20, 2014


The NLP guys I know basically say that all the options are terrible, but nltk is the least terrible for most cases.
posted by dorque at 8:27 AM on June 20, 2014


NLTK has helpful examples, don't forget about that!
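
To give a feel for the "hand-tag some for training" step, here is a minimal sketch in plain Python of the most-frequent-tag idea that NLTK's UnigramTagger (with a DefaultTagger as backoff) packages up for you. The function name, the transliterated toy sentences, and the N/V/DET tag labels are all made up for illustration; this is not the asker's data and not NLTK's own code, just the bare idea under those assumptions.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sents, default_tag="N"):
    """Learn the most frequent tag for each word from hand-tagged sentences.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    default_tag: tag assigned to words never seen in training (an assumption;
                 "N" is a common fallback because nouns are frequent).
    """
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1

    # Keep only the single most common tag for each word.
    best = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

    def tag(tokens):
        # Words not seen in training fall back to default_tag.
        return [(tok, best.get(tok, default_tag)) for tok in tokens]

    return tag


# Toy, made-up training data (transliterated; tag labels are hypothetical).
train = [
    [("katab", "V"), ("il", "DET"), ("walad", "N")],
    [("shaf", "V"), ("il", "DET"), ("kitab", "N")],
]

tagger = train_unigram_tagger(train)
print(tagger(["shaf", "il", "walad", "bint"]))
```

The unseen word "bint" gets the default tag here. With a real hand-tagged sample you would hold some sentences back to check how often the tagger is right before trusting it on the rest of the corpus.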
posted by oceanjesse at 8:40 AM on June 20, 2014


Thanks for the quick answers!

I'm fairly familiar with NLTK, but stopped trying due to some major issues. These may be due in part to my lack of experience, but:

1 - I can't figure out how to create a new grammar for it to use in its parsing/tagging. It doesn't have any built in support for Arabic dialects, so it would have to be able to accept new stuff.

2 - At least in Python 2, it was a disaster with encoding (confirmed to me by a comp sci person). If the first 'problem' has an easy answer, I will certainly try NLTK with Python 3, even though the support for it isn't as thorough as in 2.
posted by petiteviolette at 10:52 AM on June 20, 2014

