Learning to analyze text
August 15, 2009 8:46 PM

Please help me crash-course language and text analysis.

I'm working with a group of researchers who've won a grant to analyze over 2000 magazine articles that cover a time span of 30 years. The idea is to process the articles' texts to find how certain themes show up and mutate over the years, the first appearance of certain ideas, etc.
I'm the default computer guy, and need to figure out precisely what we're going to be studying and program or download the tools we will use to do so.
I don't have a CS degree (I have an Architecture degree & masters and make a living doing web dev). I use Python for my everyday work, and didn't have much trouble following Segaran's "Programming Collective Intelligence".
Please advise me on the books or websites I need to read, or the keywords I should be searching for.
posted by signal to Computers & Internet (12 answers total) 8 users marked this as a favorite
Isn't that the researchers' job? What did they propose to get the grant?

There are hundreds of approaches to content analysis of texts. It's not like learning how to bake bread. It's like learning to cook. The query is far too broad.

The grant proposal presumably described a methodology. What was it?
posted by fourcheesemac at 9:12 PM on August 15, 2009

To get you started with processing the text:
Natural Language Toolkit
Foundations of Statistical Natural Language Processing by Manning and Schütze

For the analysis you're speaking about, some of the work people have done with bibliometrics and social network analysis might be helpful. Look up Jon Kleinberg's papers on propagation of ideas in social networks (e.g. MemeTracker).
posted by needled at 9:15 PM on August 15, 2009 [2 favorites]

p.s. There's an O'Reilly book out on NLTK - you can read the text online here.
posted by needled at 9:32 PM on August 15, 2009

You probably _can_ take this on, but honestly, I think you (/the researchers) would be much better off teaming up with someone with existing expertise in natural language processing, or at least AI. Given your background, I think you may find that you are in somewhat over your head; this is a huge area, and what you want to do is not anything close to a solved problem. And like fourcheesemac says, I hope for the sake of your personal sanity there was some much more specific methodology proposed in the grant.

The idea is to process the articles' texts to find how certain themes show up and mutate over the years, the first appearance of certain ideas, etc.

You want to look into research on topic detection and tracking. Note that most of the research I'm aware of (which I don't have any deep knowledge of) is about things much more concrete than "themes" or "ideas" -- I'm not sure we have the technology or the understanding to actually deal with things that abstract.
posted by advil at 9:48 PM on August 15, 2009

You might find this interesting:
Topic Modeling
posted by null terminated at 11:11 PM on August 15, 2009

When I was an undergraduate in compsci (4 years ago), I was working with some New Media artists. One professor was really impressed by the work I was doing and the ideas I was generating, and so came to me with a project. At first, I thought she just wanted me to help her build some kiosks and aggregate some data. Turns out, she wanted me to automatically categorize stories by theme: are they about love, or hate, or food, or immigration, or the drug store, etc.?

She asked for exactly the same thing your colleagues are asking for.

I spent a couple of weeks talking to the right professors, reading the available literature, and doing my research. Then I told her it was impossible.

What I'm saying is that, while there are all sorts of nifty tools for statistical analysis of natural language, the actual jump to semantics is still almost completely lacking. A computer can determine all sorts of shit about natural language with the proper Bayesian algorithm. It can even, to an extent, extract what the string says. But what it means is a complete loss.

Say you have a little story: "My mother's lasagna tasted terrible. It was sloppy, runny, and bland. She knew it. And it took her all day to make it. But every Friday, she would make that lasagna, using the same unappealing recipe every single time. Because that's how her mother had made it every Friday back in Italy."

Okay, that story is "about" food. And a computer might actually pick that up, since it contains the words "lasagna" and "recipe". However, what it's not going to get, no matter how well you program it (right up until you have strong AI), is that the story is also "about" immigration and love and tradition.

So, if you're looking for a technique that analyzes a document and spits out "this is about fish" or "this document is describing abstract expressionism", there is no crash course that will prepare you to achieve your aims. I tried that, and found that essentially all the literature reported negative results. Although I did have one professor offer to sponsor my PhD if I was inclined to begin researching theme extraction.

However, depending on what exactly you need, you might look at document clustering algorithms. Basically, you would feed all of your documents into the clustering algorithm, and it would group them based on statistical analysis of (usually) word use. So, it would split up the articles into piles that mentioned food a lot, or mentioned medicine, or barbershops. You could then manually review the clustered documents more efficiently than a completely manual review of 2000 articles.
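To make the clustering idea concrete, here is a minimal sketch in plain Python. It is not any particular library's algorithm: it assumes a simple bag-of-words representation, cosine similarity, and a greedy "leader" scheme where a document joins the first cluster whose first member is similar enough. The names `tokenize`, `cosine`, `leader_cluster`, the stopword list, the sample documents and the 0.2 threshold are all made up for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; a real one would be much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "was", "with", "for", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def cosine(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def leader_cluster(docs, threshold=0.2):
    """Put each document in the first cluster whose leader is similar enough,
    otherwise start a new cluster with that document as leader."""
    vectors = [Counter(tokenize(d)) for d in docs]
    clusters = []  # each cluster is a list of document indices; index 0 is the leader
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

docs = [
    "The lasagna recipe uses cheese and tomato sauce",
    "The doctor prescribed medicine for the patient",
    "Bake the lasagna with extra cheese",
    "The patient took the medicine twice daily",
]
clusters = leader_cluster(docs)
print(clusters)  # [[0, 2], [1, 3]]
```

Note that this only groups documents by shared vocabulary. It says nothing about what a pile means, so a human still has to look at each pile and name it.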

But, if you want to know what it's about, you need to ask a human.

Your colleagues should have talked to you before they wrote a grant based on a science fiction understanding of computers.
posted by Netzapper at 11:35 PM on August 15, 2009 [4 favorites]

needled is spot-on - I actually came in here to recommend the awesome Natural Language Toolkit, which is one of the coolest things about Python.
posted by koeselitz at 12:22 AM on August 16, 2009

Without revealing too much, I'm involved with a fund that is looking at the same problem but towards a rather different goal.

I took over the project about a year ago when it was floundering; we'd had three different guys running it, one after the other, each replacing the prior technology that had been thrown at the problem. Even after hundreds of thousands of dollars in development costs, the system's discriminatory power never got to the point where we could trade off the resultant data. I wasn't about to repeat prior mistakes, so I found a way to avoid the problem.

We're using Mechanical Turk to get folks to parse and categorise our news & text snippets. We run distinct periods corresponding with what we consider "market events", pulling in not just news stories as you're proposing but also automated feeds.

Now we can push each "element" (as we call them) through the categorisation process five times, discarding outlying categorisations. We're getting confirmatory hit rates of about 95%, meaning the few elements that fall out we can manually evaluate or even (as we've been doing lately) safely ignore.
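As a rough illustration of combining several human judgments per element, here is a minimal sketch. The aggregation rule is an assumption, not a description of the actual system: it takes the majority label and flags the element for manual review when too few workers agree. The function name `aggregate_labels`, the 0.6 agreement cutoff, and the sample labels are hypothetical.

```python
from collections import Counter

def aggregate_labels(labels, min_agreement=0.6):
    """Return the majority label if enough workers agree on it,
    else None to flag the element for manual review."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes / len(labels) >= min_agreement else None

# Five hypothetical categorisations per element.
judgments = {
    "element-1": ["housing", "housing", "housing", "urbanism", "housing"],
    "element-2": ["housing", "urbanism", "materials", "theory", "urbanism"],
}
results = {name: aggregate_labels(labels) for name, labels in judgments.items()}
print(results)  # {'element-1': 'housing', 'element-2': None}
```

With five judgments and a 0.6 cutoff, at least three workers have to agree before a label is accepted; anything less falls out for a person to look at.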

Don't know if you'd like to go the fully automated route, as it seems nothing yet beats a human's reading ability. We're trading on this system's output now, so we're trusting (other people's) money to it.
posted by Mutant at 5:56 AM on August 16, 2009 [1 favorite]

The compsci side of this question, while not simple, is at least straightforward, and there are good suggestions above.

The linguistic side of this question is impossible to answer clearly without very narrow parameters being set.

Could you post an abstract of the proposal?
posted by fourcheesemac at 5:56 AM on August 16, 2009

Response by poster: Thanks for all the help so far. Some good stuff.

Background: The researchers are all Architectural History people, with no quantitative, math or compsci knowledge at all. The computer analysis of the texts is a significant part of the project, but not all of it. The actual research proposal is very vague on the technical aspects and more focused on the 'softer' historical and arch-theoretical affairs. The actual wording of the computer-related part is something like "analyse texts from the XXXX period using computational approaches".
I was going to be a researcher on the project but I was already listed as a researcher on another project for the same fund, so I was demoted to technical assistant, which doesn't look as good on my resume but pays more.
We have no extra funding for this and in fact had our budget slashed, so hiring an actual expert is out of the question. I'm as good as it gets.
Having said all this, the precise nature of the analysis is something I have to research and propose, which is why this AskMe question is probably not as precise as it could be. If extracting actual meaning is presently intractable, as Netzapper says, we might have to just work off of keywords, phrases, etc.
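Working off of keywords could be as simple as the following sketch, which finds the earliest year in which each keyword shows up. It assumes the articles are available as (year, text) pairs and that exact single-word matches are good enough; the function name `first_appearances` and the sample data are made up for illustration.

```python
import re

def first_appearances(articles, keywords):
    """Map each keyword to the earliest year whose article text contains it,
    or None if it never appears. Matching is exact, on lowercase whole words."""
    first = {k: None for k in keywords}
    for year, text in sorted(articles, key=lambda a: a[0]):
        words = set(re.findall(r"[a-z]+", text.lower()))
        for k in keywords:
            if first[k] is None and k in words:
                first[k] = year
    return first

# Hypothetical sample data: (year, article text).
articles = [
    (1975, "A new postmodern facade in the city centre"),
    (1968, "Concrete housing blocks and the modern city"),
    (1982, "Postmodern irony meets concrete"),
]
found = first_appearances(articles, ["concrete", "postmodern", "parametric"])
print(found)  # {'concrete': 1968, 'postmodern': 1975, 'parametric': None}
```

This would miss plurals, synonyms and multi-word phrases, so it is a starting point for counting, not a way of detecting ideas.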
Anywho, thanks for all the help, and please feel free to add more information, caveats or criticism of the whole damn idea.
posted by signal at 10:03 AM on August 16, 2009

Garbage in, garbage out. The need to appear to be doing quantitative analysis while not actually knowing why or how you plan to do this (or what advantage it has over qualitative analysis or interpretation) has sunk many research proposals, including plenty I've rejected as a proposal reviewer in my life. Still, the problem persists because of the mythical hoohaa so characteristic of quantophilia, and so easily pitched as "science" to the uninitiated.

It's still the "researchers'" obligation to specify what it is they want to know and how they think NLP quant analysis will produce such knowledge without bias. You might get away with a very non-granular approach -- have student assistants classify each article in the database with a few keywords, etc. But I can almost guarantee this is one of those projects where the "quantitative" part is pure window dressing, incapable of discovering any patterns in "the data" (because the "quant" data are not conceptualized in relation to the arguments, if I read your last clarification clearly) not already presumed to exist or otherwise deducible through qualitative analysis. I've seen it over and over again in my career as a social scientist.

It's the qualitative types -- and I count myself among them -- who are too easily seduced by the technical-sounding language of "textual analysis," "natural language processing," and the like. Regular humanists just read everything and synthesize a summary of the major patterns in their own supercomputers, which are made out of proteins and fats and located in their skulls. Serious qualitative social scientists are rarely impressed by attempts to automate the process of interpretation.

I'm not harshing on you, signal, but on social "science" of a certain kind in general. What will happen here is that your colleagues will expect you to perform miracles with no guidance from them, which is already pretty much the basis of this question.

The problem of textual meaning is simply not reducible to an algorithm, at least not yet. At a minimum, one needs to know how texts are normally interpreted by any specific recipient of their intended meaning.
posted by fourcheesemac at 12:20 PM on August 16, 2009

However, depending on what exactly you need, you might look at document clustering algorithms.

CLUTO is great fun for clustering. Basically, you give it a bunch of textual items and a number of groups to sort them into, and it automatically groups them and tells you which words are important to the groups.

Since you don't know what you're trying to do, really, or how you're going to do it, here's one very easy and probably only slightly informative suggestion. Break your work into a series of overlapping 5-year chunks (1950-1955, 1951-1956, 1952-1957... etc) and run CLUTO against each chunk, generating (say) 5 groups each time. Find out how the words that define the groups change over time, and whether a given article moves around as it's compared to articles 4 years before vs. 4 years after.
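The chunking step can be sketched in a few lines of Python. This does not call CLUTO; it only shows how the overlapping windows might be generated and how the articles for each window could be selected before being handed to whatever clustering tool you use. It assumes articles are (year, text) pairs, and the helper names, the crude "longer than 3 letters" word filter, and the sample data are hypothetical.

```python
import re
from collections import Counter

def sliding_windows(first_year, last_year, span=5):
    """Yield inclusive (start, end) year pairs, stepping one year at a time,
    e.g. (1950, 1955), (1951, 1956), ..."""
    for start in range(first_year, last_year - span + 1):
        yield start, start + span

def articles_in_window(articles, start, end):
    """Return the texts of articles whose year falls inside the window."""
    return [text for year, text in articles if start <= year <= end]

def top_terms(texts, n=3):
    """Most frequent words longer than 3 letters: a crude stand-in for the
    descriptive words a clustering tool would report per group."""
    counts = Counter(
        w for t in texts for w in re.findall(r"[a-z]+", t.lower()) if len(w) > 3
    )
    return [w for w, _ in counts.most_common(n)]

# Hypothetical sample data: (year, article text).
articles = [
    (1950, "concrete housing blocks"),
    (1953, "concrete towers"),
    (1956, "glass curtain walls"),
    (1957, "glass facades"),
]
windows = list(sliding_windows(1950, 1957))
print(windows)  # [(1950, 1955), (1951, 1956), (1952, 1957)]
for start, end in windows:
    texts = articles_in_window(articles, start, end)
    print(start, end, len(texts), top_terms(texts, n=1))
```

Comparing the per-window term lists side by side is then a manual job, which is probably where the interesting (or uninteresting) patterns would show up.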

For example, let's say we're doing this with cooking and times of the day. Early in the morning eggs will be breakfasty, and maybe associated with onions and bread or cheese depending on what you cook them with. They might not be an important defining ingredient - so many breakfast recipes include eggs they won't tell you much about a recipe. As the day goes on they become more associated with flour and butter as baking becomes more prevalent and egg-based breakfast foods become less popular, and if a recipe includes eggs it becomes more notable because they're a less frequent ingredient. And maybe for some reason those results are interesting!

Also echoing Mechanical Turk's ability to get things classified for you on the cheap.
posted by soma lkzx at 6:17 PM on August 16, 2009

This thread is closed to new comments.