Can I save myself hours of linguistic data analysis?
August 16, 2016 3:13 AM

I have roughly 100 narrative language transcripts that I need to code and analyse. I've spent a good few weeks getting through the coding that definitely has to be done by a human, and I'm now left with the measures that surely some software or online analyser could help me calculate, rather than working them out 100% manually.

The measures I need are:
- Total Number of Words (this is an easy one I can automate in my word processor)
- Number of C-Units ('Communication Units' - each consists of a main clause, its modifiers and any subordinate clauses)
- Number of Different Words (have found some online calculators for this but not all are geared for linguistic analysis)
- Number of Causal Clauses

Does anyone have any ideas? I would be happy to pay for software if it would save me days and days of zombifying coding. The C-Units measure in particular strikes fear into my heart.
posted by rose selavy to Writing & Language (3 answers total) 6 users marked this as a favorite
 
There is some evidence that Weka can do this. I use it to try out basic machine-learning classifiers before moving on to something more powerful or automated. Google tells me that "Communication Unit" classification has been done with Weka, but it looks like it required training a naive Bayes text classifier.

I'm not sure if this gets you closer but it might be worth looking at the documentation, as several plugins exist for the software.
posted by teabag at 5:27 AM on August 16, 2016 [1 favorite]


"Unique words" is quite easy to do. C-units is somewhat trickier because you need to parse sentences and interpret the results, but there are good software tools for at least part of this.

I have written scripts to do much of this myself. You can send me a memail if you wind up needing hands-on help.
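
For illustration only (this is not the scripts mentioned above), here is a minimal Python sketch of the two word-based measures from the question. The function name word_measures and the definition of a "word" as a run of letters with optional internal apostrophes or hyphens are my assumptions; adjust them to your own transcription conventions.

```python
import re

# Assumption: a "word" is a run of ASCII letters, optionally joined by
# internal straight apostrophes or hyphens (e.g. "didn't", "well-known").
# Curly apostrophes and transcription codes would need extra handling.
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*")


def word_measures(transcript):
    """Return (total number of words, number of different words)."""
    # Lowercase so that "The" and "the" count as the same word.
    words = [w.lower() for w in WORD_RE.findall(transcript)]
    return len(words), len(set(words))


sample = "The dog ran. The dog didn't stop because the cat ran too."
total, different = word_measures(sample)
print(total, different)
```

Note that this counts inflected forms ("run", "ran") as different words. If your coding scheme counts them as one, you would need a lemmatiser on top of this.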
posted by grobstein at 5:59 AM on August 16, 2016 [1 favorite]


If you are working with English and can program, or are comfortable with research software, what you want doesn't sound that complicated (though you may have to hand-inspect the parses). Two starting points are NLTK (Python) and Stanford CoreNLP. Note that "C-unit" isn't a standard term or definition in linguistics or NLP, so you're unlikely to find this built in (as far as I know), but from what I can gather it amounts to something like counting sentence/utterance boundaries, which any standard parser will get you.
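
As a rough sketch of the boundary-counting idea, the snippet below uses only the Python standard library. The function name count_utterances and the rule "an utterance ends at ., ! or ? followed by whitespace or the end of the text" are assumptions made for illustration. A trained tokenizer such as NLTK's sent_tokenize should handle abbreviations and ellipses far better than this.

```python
import re

# Assumption: an utterance ends at one or more of . ! ? followed by
# whitespace or the end of the text. This naive rule will miscount
# abbreviations ("Dr. Smith") and ellipses ("Well... he left").
BOUNDARY_RE = re.compile(r"[.!?]+(?:\s+|$)")


def count_utterances(transcript):
    """Count non-empty stretches of text between utterance boundaries."""
    pieces = [p.strip() for p in BOUNDARY_RE.split(transcript)]
    return sum(1 for p in pieces if p)


example = "The dog ran. Did it stop? No!"
print(count_utterances(example))
```

Going by the definition in the question, a sentence with two coordinated main clauses would presumably count as two C-units, so treat this number as a first pass to check by hand rather than a final C-unit count.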
posted by advil at 9:15 AM on August 16, 2016 [2 favorites]

