Is there a better/faster way to operationalize the coding of messy text data?
November 29, 2012 11:24 AM

I'm trying to tag and code about 4000+ unique paragraphs of data. These are opinion responses to two similar questions. I manually went through the first 4000 responses and it took weeks using Google Refine. I'm wondering if there's a way to operationalize this to be a bit easier and less time consuming?

There are about 10 dimensions I'm coding for with tags, and the wording of the answers varies wildly. Here's an example of various responses to one of the dimensions:

"This is just how it is in my brain."
"I see it in my head like this."
"When I think about it, this is what I come up with."
"It's just the way it is, you know, inside my mind."
"Mind's eye sees it this way."

These responses would be preceded and followed by other text that's also important to me to code on other dimensions, so I don't want to replace the whole paragraph with a tag like 'in head', but rather add 'in head' to a series of tags for that response (separated by semicolons). The data is also very messy, with variation in punctuation, spelling, omissions and IPA/other characters.

If you've done something like this, how did you go about it? I'm currently using Google Refine's cluster feature to collapse about 5% of the rows that are short and straightforward. Then manually adding semicolon-separated tags in a new column for the rest. It's very tedious.

Tools I am familiar with: Google Refine, SQL, Excel (yuck), Text Wrangler and some very limited command line stuff (terminal and awk, grep).
posted by iamkimiam to Computers & Internet (5 answers total) 5 users marked this as a favorite
Have you considered using LSA to cluster them? Since you have this manually-coded "training class" of statements that typify all 10 dimensions you're looking for, it would be (nontrivial but) fairly simple to do a 1-to-many comparison ten times - once for each dimension - that ranked each response according to how closely it conformed to the prototype for the dimension. R also has some great text mining resources (of which this is one) that I have played around with occasionally and sound like they might be useful here. R looks difficult and command-liney, but if you're using SQL and have ever used SPSS or Matlab, you'll pick it up in no time.
posted by katya.lysander at 11:36 AM on November 29, 2012 [1 favorite]
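katya's suggestion can be sketched in a few lines. The sketch below is a minimal illustration of the LSA ranking idea, not her actual pipeline: it builds a bag-of-words term-document matrix, reduces it with a truncated SVD (the core of LSA), and ranks each response by cosine similarity to a hand-coded prototype for one dimension. The toy responses and the prototype text are invented for the example; a real run would use TF-IDF weighting, more latent dimensions, and a real prototype drawn from the already-coded rows.

```python
# Minimal LSA ranking sketch (toy data invented for illustration).
import re
import numpy as np

responses = [
    "This is just how it is in my brain.",
    "I see it in my head like this.",
    "We discussed it as a group and voted.",
]
prototype = "it's the way it is inside my mind, in my head"  # hand-coded exemplar

docs = responses + [prototype]
tokens = [re.findall(r"[a-z']+", d.lower()) for d in docs]
vocab = sorted(set(w for t in tokens for w in t))
index = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix: rows = documents, columns = vocabulary.
X = np.zeros((len(docs), len(vocab)))
for row, t in enumerate(tokens):
    for w in t:
        X[row, index[w]] += 1

# Truncated SVD: keep k latent dimensions (this is the "LSA" step).
k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Z = U[:, :k] * s[:k]  # documents projected into latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

proto = Z[-1]  # the prototype's latent vector
scores = [cosine(Z[i], proto) for i in range(len(responses))]
for score, resp in sorted(zip(scores, responses), reverse=True):
    print(f"{score: .3f}  {resp}")
```

Run once per dimension, the top of each ranked list is where the likely matches concentrate, so hand-checking can start there instead of at row one.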

Maybe this isn't what you're after, but what about dedoose?
posted by unknowncommand at 12:47 PM on November 29, 2012

You may already know this, in which case I apologize for being redundant, but what you're doing is coding qualitative data. There is a class of tools suited to this listed on Wikipedia.

In grad school, I used a handful of ruby scripts to kludge something together. Now that I work somewhere that can pay for contracting, I contract it out. It is slow, tedious work as you know.

Depending on the sensitivity of the data and the complexity of coding it, you could try "cheap" outsourcing by setting up HITs on Amazon's Mechanical Turk, and paying a couple pennies to have each response coded on each axis.

The best way is to have grad students do it for free, but this is not available to everyone.
posted by heliostatic at 2:40 PM on November 29, 2012 [1 favorite]

How much do you care about the quality level of the coding? Taking your example, I can formulate a rule: brain or head or think or mind ==> tag "in head", and I could then code this up in awk or something, but I'd miss: "I just sort of see it like that". I could add "see" to the rule, but then I might catch some extraneous references to vision, literal eyesight. And you won't know how accurate your characterization is, won't know if you're missing some thread in the responses you haven't considered.

How long does it take you to tag a paragraph? 15 seconds? That's 20 hours of tagging. Can you spread it over a week? You'll spend 2 days writing and debugging the damn code! And then you'll notice some rule that you need to add. Or maybe write the code, but also hand-tag 100 paragraphs, to get some confidence in the quality of your auto-tagger.

My choice of coding language (awk, grep, perl, python, Excel, OpenOffice, etc.) would depend on a) what format the orig data is in, and b) what platform you are most familiar with/have access to.
posted by at at 6:56 AM on November 30, 2012
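The rule in that comment can be turned into a short script. The sketch below is one way to do it in Python rather than awk, chosen only because it also handles the asker's semicolon-joined tag column easily; the keyword lists and tag names are invented for illustration, and the small hand-tagged sample is the "hand tag 100 paragraphs" quality check from the comment, shrunk to three rows.

```python
# Keyword-rule auto-tagger sketch (rules and tag names are illustrative).
# A response gets every tag whose trigger words appear in it; tags are
# joined with semicolons, matching the asker's coding scheme.
import re

RULES = {
    "in head": {"brain", "head", "think", "mind"},
    "visual": {"see", "picture", "image"},
}

def tag(response):
    words = set(re.findall(r"[a-z']+", response.lower()))
    hits = [t for t, triggers in RULES.items() if words & triggers]
    return ";".join(sorted(hits))

# Compare the auto-tagger against a hand-tagged sample to estimate quality.
hand_tagged = [
    ("This is just how it is in my brain.", "in head"),
    ("I see it in my head like this.", "in head;visual"),
    ("We discussed it as a group.", ""),
]
agree = sum(tag(text) == gold for text, gold in hand_tagged)
print(f"agreement: {agree}/{len(hand_tagged)}")
```

As the commenter warns, adding "see" to the "in head" rule would also catch literal eyesight; the hand-tagged sample is what tells you how much that kind of over-matching actually costs.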

R also has some great text mining resources
Wikipedia tells me there is also an R package for coding. R is definitely a useful tool to learn. Be aware that it's not very elegant for very large data sets.

As katya suggested, you could take your already-coded stuff as training data and use some kind of categorisation algorithm. If I understand correctly, each tag is boolean, so SVM might work well. Looking at your examples, things used in spam detection like a naive Bayes classifier might also be good. I've found Programming Collective Intelligence to be a useful introduction to machine learning, although I'd find a free program rather than implementing things yourself. A word of warning: clustering/categorisation is something of a dark art; there's no data-independent "you should just do this" algorithm.

Which leads me to an unhelpful answer:
I have a feeling this is a solved problem/someone must have done this before. Have you tracked down someone in your university's computer science department who deals with text mining and asked them?
posted by Erberus at 8:35 AM on December 2, 2012
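The naive Bayes idea in that answer can be sketched without any library. The example below is a minimal multinomial naive Bayes with add-one smoothing, treating one tag as a boolean class exactly as the comment suggests; the training sentences are invented for illustration, and it is written from scratch only to make the mechanics visible. In practice a packaged implementation is the better choice, as the commenter says ("find a free program rather than implementing things yourself").

```python
# Minimal multinomial naive Bayes sketch for one boolean tag
# (training texts invented for illustration; add-one smoothing used).
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z']+", text.lower())

# Hand-coded rows become training data: (text, has "in head" tag?).
train = [
    ("This is just how it is in my brain.", True),
    ("I see it in my head like this.", True),
    ("Mind's eye sees it this way.", True),
    ("We talked it over as a group.", False),
    ("I asked my teacher for the answer.", False),
]

# Word frequencies per class, and document counts for the class priors.
counts = {True: Counter(), False: Counter()}
docs = {True: 0, False: 0}
for text, label in train:
    counts[label].update(words(text))
    docs[label] += 1
vocab = set(counts[True]) | set(counts[False])

def log_prob(text, label):
    # log P(label) + sum of log P(word | label), with add-one smoothing
    lp = math.log(docs[label] / len(train))
    total = sum(counts[label].values())
    for w in words(text):
        lp += math.log((counts[label][w] + 1) / (total + len(vocab)))
    return lp

def predict(text):
    return log_prob(text, True) > log_prob(text, False)

print(predict("it's just the way it is inside my mind"))
```

One classifier like this per tag gives the ten boolean decisions per response; the already-coded 4000 rows are far more training data than this toy needs.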
