Is there a better/faster way to operationalize the coding of messy text data?
November 29, 2012 11:24 AM
I'm trying to tag and code 4000+ unique paragraphs of data. These are opinion responses to two similar questions. I manually went through the first 4000 responses, and it took weeks using Google Refine. Is there a way to operationalize this so it's easier and less time-consuming?
posted by iamkimiam to Computers & Internet (5 answers total) 5 users marked this as a favorite
There are about 10 dimensions I'm coding for with tags, and the wording of the answers varies wildly. Here's an example of various responses to one of the dimensions:
"This is just how it is in my brain."
"I see it in my head like this."
"When I think about it, this is what I come up with."
"It's just the way it is, you know, inside my mind."
"Mind's eye sees it this way."
These responses would be preceded and followed by other text that's also important for me to code on other dimensions, so I don't want to replace the whole paragraph with a tag like 'in head', but rather add 'in head' to a series of tags for that response (separated by semicolons). The data is also very messy, with variation in punctuation, spelling, omissions and IPA/other characters.
If you've done something like this, how did you go about it? I'm currently using Google Refine's cluster feature to collapse the roughly 5% of rows that are short and straightforward, then manually adding semicolon-separated tags in a new column for the rest. It's very tedious.
Tools I am familiar with: Google Refine, SQL, Excel (yuck), TextWrangler and some very limited command line stuff (terminal, awk, grep).
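To make the tagging scheme above concrete, here is a minimal sketch of rule-based tagging in Python. Python is not among the tools listed, so treat its use as an assumption. The `TAG_RULES` table, the `normalize` and `tag_response` helpers, and every pattern in them are hypothetical; only the 'in head' tag and the sample sentences come from the question. Each response is lightly cleaned, checked against a list of patterns per tag, and given a semicolon-separated tag string without touching the original text.

```python
import re
import unicodedata

# Hypothetical rule table: tag name -> regex patterns that signal it.
# Only the 'in head' tag comes from the question; the patterns are guesses
# based on the five sample responses and would need to be extended by hand.
TAG_RULES = {
    "in head": [
        r"\bin my (brain|head)\b",
        r"\binside my mind\b",
        r"\bmind'?s eye\b",
        r"\bwhen i think about it\b",
    ],
}


def normalize(text):
    """Lowercase, unify curly apostrophes, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = text.replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip()


def tag_response(text, rules=TAG_RULES):
    """Return a semicolon-separated string of every tag whose patterns match."""
    cleaned = normalize(text)
    tags = [
        tag
        for tag, patterns in rules.items()
        if any(re.search(pattern, cleaned) for pattern in patterns)
    ]
    return ";".join(tags)


if __name__ == "__main__":
    samples = [
        "This is just how it is in my brain.",
        "Mind's eye sees it this way.",
        "I learned it at school.",
    ]
    for sample in samples:
        print(repr(tag_response(sample)), "<-", sample)
```

Responses that match no pattern come back with an empty tag string, so they can be filtered out and coded by hand; each manual pass is then a chance to add new patterns to the table.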