Help me dig into lexical analysis!
November 30, 2005 1:10 PM

Lexical analysis! What are some good resources for a beginner?

I'm focusing some word-nerdity on a secretive mad-scientist-flavored excursion into natural language analysis, and I know that I don't know very much about it. I'd like to have more than self-invented gut-instinct ideas to work with. What books, websites, essays, etc. will help me get up to speed on the subject?

Specific interest in word-frequency analysis, but I'm finding myself increasingly curious about the whole neighborhood of ideas. I'm frustrated by my inability to cover much ground with Google -- I don't know the words for what I don't know about!
posted by cortex to Science & Nature (14 answers total)
this is an oldie but goodie computing text that i really like. they include a bit of background (chomsky hierarchy etc) before getting into the details.

if you want to write something by hand, look for information on "recursive descent" parsers. i find them by far the easiest to write and understand.

this is computing rather than natural-language related, because that's what i know. hope it helps.
posted by andrew cooke at 1:19 PM on November 30, 2005
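
For readers unfamiliar with the "recursive descent" technique mentioned above, here is a minimal sketch in Python. It is not taken from the book andrew cooke links to; the toy arithmetic grammar and the names `tokenize`, `Parser`, and `evaluate` are assumptions made up for illustration. The idea is one method per grammar rule, with each method calling the methods for the rules it refers to.

```python
import re

# One token is a run of digits or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*/()])")


def tokenize(text):
    """Split an arithmetic expression into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    """Hypothetical toy parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def parse_expr(self):
        # expr := term (('+' | '-') term)*
        value = self.parse_term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            right = self.parse_term()
            value = value + right if op == "+" else value - right
        return value

    def parse_term(self):
        # term := factor (('*' | '/') factor)*
        value = self.parse_factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            right = self.parse_factor()
            value = value * right if op == "*" else value / right
        return value

    def parse_factor(self):
        # factor := NUMBER | '(' expr ')'
        token = self.advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            value = self.parse_expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    """Parse and evaluate a whole expression, rejecting trailing tokens."""
    parser = Parser(tokenize(text))
    value = parser.parse_expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * (4 - 1)"))
```

The loops in `parse_expr` and `parse_term` make the operators left-associative, and the nesting of the three methods is what gives `*` and `/` higher precedence than `+` and `-`.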

oops. not all those links find the book (the first one doesn't). this is the second link, and what i intended.
posted by andrew cooke at 1:21 PM on November 30, 2005

Response by poster: Computational linguistics pointers are definitely helpful -- I am indeed writing my own parser as part of this.
posted by cortex at 1:31 PM on November 30, 2005

Perhaps you could clarify what you mean by "analysis"? There are a lot (a LOT) of things that calculations of word frequencies can be used for, and the actual process of calculating them is fairly straightforward, so I suspect you don't mean that. Also, the only sense of "lexical analysis" that I'm familiar with is what the Wikipedia article says, and with respect to natural language (as opposed to a computer language), that's not a very interesting task, and doesn't seem to be what you're after.

As far as "natural language analysis" goes, well, I am a linguist, and analyzing natural language is what I do (in the sense of formulating theories that make predictions about how natural languages behave), but it's not clear if this is what you're interested in either.
posted by advil at 1:33 PM on November 30, 2005

Response by poster: Attempt at clarification:

I'm finding myself broadly interested in a subject about which I know very little. I have specific applications in mind -- I'm working on very first-draft software tools for parsing and analyzing a large body of text, and I'm performing word-frequency counts to help chase a couple of brainstorms and generate statistics about the corpus. However, I'm getting by on an armchair sensibility about all of this, and I would like to have a better understanding of the various issues and ideas tied to the subject.

The inapt use of "lexical analysis" hopefully underscores my position: I don't have the functional vocabulary to describe accurately what I'm interested in. Hence...

(But lexical analysis as described in advil's Wikipedia link is one of the things I'm interested in. I have a comp-sci background, so it's a phrase that has stuck in my head from that side of things.)
posted by cortex at 1:47 PM on November 30, 2005
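
As a rough illustration of the word-frequency counts described above, here is a minimal Python sketch. It is not cortex's actual tool; the tokenization rule (lowercase alphabetic runs, with internal apostrophes kept) and the names `tokenize` and `word_frequencies` are assumptions, and a real corpus would need decisions about hyphens, numbers, and non-ASCII letters.

```python
import re
from collections import Counter

# Assumed tokenization rule: lowercase letters, allowing internal apostrophes.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")


def tokenize(text):
    """Lowercase the text and return its word tokens in order."""
    return WORD_RE.findall(text.lower())


def word_frequencies(text):
    """Map each word to the number of times it occurs in the text."""
    return Counter(tokenize(text))


sample = "The cat sat on the mat. The mat didn't mind."
freqs = word_frequencies(sample)
total = sum(freqs.values())

# Print the most frequent words with raw and relative frequencies.
for word, count in freqs.most_common(3):
    print(f"{word}\t{count}\t{count / total:.2f}")
```

Relative frequency (count divided by total tokens) is usually the more useful figure when comparing texts of different lengths.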

I found the Cambridge Series in Computational Linguistics very helpful for explaining the basics of language analysis to new staff back in my corpus linguistics days.
I remember GATE being a fun platform/toolkit, but I can't exactly remember what I used it for.
posted by scruss at 2:01 PM on November 30, 2005

It seems that there are roughly two (closely related) fields you're interested in:
  • Information Retrieval (IR): more or less, the study of using statistical techniques to get information, in some form or other, out of natural language texts. It seems that IR may be mainly what you're interested in. A few buzzwords that might help with the searches: "topic detection and tracking", "question answering", "ontology extraction", "statistical alignment" (or "text-translation alignment"). The standard textbook is "Foundations of Statistical Natural Language Processing", by Manning and Schütze, MIT Press. If you have a CS background, this book may be approachable. A sample academic research group that does this is here.
  • Natural Language Processing (NLP): the study of taking sentences of natural language, and having a computer act more or less as a human does when they hear that sentence. Most of the work here focuses on parsing sentences into some kind of syntactic representation. There is some overlap between this and the previous topic, though less than you might think - many IR tasks don't need good parsing techniques, and any linguist will tell you that the large-scale statistical techniques that IR often uses simply aren't what humans do. These days, for CS people, question answering is the main task that needs real NLP techniques. I liked the textbook "Speech and Language Processing" by Jurafsky and Martin, Prentice Hall, 2000. It's somewhat more NLP-oriented than the Manning and Schütze book. Hopefully these recommendations won't be too technical, but I just don't know of any less technical ones.
By the way, I have no idea to what extent any of this is approachable for a non-specialist, but the ACL (Association for Computational Linguistics) has put the past 20 years or so of the journal Computational Linguistics online for free here.

If you are interested in parsing, you might want to know a bit about linguistics. Steven Pinker's "The Language Instinct" is the canonical recommendation here. It's very readable, and very interesting.
posted by advil at 2:13 PM on November 30, 2005

Foundations of Statistical Natural Language Processing is a good intro to the statistical side of NLP.

There are some sample chapters online, including one on collocations.

Searching for courses on NLP that have course notes online will also be helpful. One topic you might want to search for is n-gram models.
posted by formless at 2:19 PM on November 30, 2005
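
To give a feel for the n-gram models mentioned above, here is a minimal Python sketch of a bigram model with plain maximum-likelihood estimates and no smoothing. It is not drawn from the linked chapters; the helper names `ngrams` and `bigram_probability` and the toy sentence are assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*(tokens[i:] for i in range(n))))


def bigram_probability(tokens, first, second):
    """Estimate P(second | first) as count(first, second) / count(first).

    Unsmoothed maximum-likelihood estimate: unseen pairs get probability 0.
    """
    bigram_counts = Counter(ngrams(tokens, 2))
    # Count `first` only where it is followed by another token.
    context_counts = Counter(tokens[:-1])
    if context_counts[first] == 0:
        return 0.0
    return bigram_counts[(first, second)] / context_counts[first]


tokens = "the cat sat on the mat the cat ran".split()
print(ngrams(tokens, 3)[:2])
print(bigram_probability(tokens, "the", "cat"))
```

The same bigram counts are the raw material for collocation measures: a pair that occurs together much more often than its individual word frequencies would predict is a collocation candidate.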

Get Jurafsky & Martin's Speech and Language Processing. It's an introductory text, which means it has a very broad scope, but it does explain the practical sides too, so you'll actually be able to use what you read.

I don't know if you can download it somewhere, but if you can spare the money to buy it, it's probably worth it anyway, as it's not the type of book you want to read from a screen.
posted by fvw at 2:29 PM on November 30, 2005

Response by poster: Have I mentioned lately how great AskMe is?
posted by cortex at 3:06 PM on November 30, 2005

Response by poster: formless, your link is coming up 403 for me. Which is funny, since it's the top hit for that string. Was it working when you linked it, I presume?
posted by cortex at 5:05 PM on November 30, 2005

Response by poster: Something about my "I presume" smells kinda snarky to me, but I really didn't intend it that way.
posted by cortex at 6:04 PM on November 30, 2005

Here are some basic kinds of parsers off of the top of my head:

Chart Parsers
Link Parsers
CYK Parser
Viterbi Parsers
LR Parsers (Also SLR Parser)
Tomita Parser (An improved LR parser)

I also second the Jurafsky & Martin book as well as Manning and Schütze.

I'm a computational linguist with a pretty good background in NLP and I can certainly answer any questions. I've written an LR parser before and it took a little while. The toughest part for you will be collecting the parts of speech for each word and their probabilities without statistics. Also, there is no need to write a parser when they are available for free. But if you just want the practice, enjoy!
posted by Alison at 6:20 PM on November 30, 2005
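
As one concrete example from the list above, here is a minimal Python sketch of a CYK recognizer. It is not Alison's code; the tiny grammar, the function name `cyk_recognize`, and the dictionary layout for the rules are assumptions for illustration. CYK needs the grammar in Chomsky normal form, so every rule here is either word-to-category or a pair of categories combining into a parent.

```python
def cyk_recognize(words, lexical_rules, binary_rules, start="S"):
    """Return True if the words can be derived from `start`.

    lexical_rules: word -> set of categories (rules like N -> "dog")
    binary_rules: (left, right) -> set of parents (rules like NP -> Det N)
    """
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds every category that derives words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(lexical_rules.get(word, ()))
    # Build longer spans out of pairs of adjacent shorter spans.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split into words[i..k] and words[k+1..j]
                for (left, right), parents in binary_rules.items():
                    if left in table[i][k] and right in table[k + 1][j]:
                        table[i][j].update(parents)
    return start in table[0][n - 1]


# Hypothetical toy grammar in Chomsky normal form.
lexical = {"the": {"Det"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"}}
binary = {("Det", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}

print(cyk_recognize("the dog chased the cat".split(), lexical, binary))
```

This version only answers yes or no. A real parser would also store back-pointers in each table cell to recover the tree, and a probabilistic variant would keep the best score per category, which is where the per-word part-of-speech probabilities come in.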

The Charniak Parser is free and so is the link parser.
posted by Alison at 6:23 PM on November 30, 2005
