Interesting Text Analysis things
January 15, 2020 2:28 AM

I've got a dataset mainly sourced from newspapers and I'm looking for interesting analyses to run on it. What fun things could I do? I will be using R. Caveat: I only have an example subset for now, so I'd like to prepare some analysis based on that.

I'm currently preparing a Shiny application to bundle these analyses together, so that I can load up the data and get a bunch of analysis out.

There are maybe three or four text columns that I'd like to experiment with and a few more straightforward columns.
So far I've wordclouded the text columns (though I don't really have enough data to see much of interest until I get the full source).
I've got the usual counts of sources etc.
I'm going to try doing a Latent Dirichlet Allocation on the data to try and split it into topics.

Are there any other graphs, analyses, chopping, frying or other processing that might be interesting?

What other things might you
posted by Just this guy, y'know to Science & Nature (6 answers total) 6 users marked this as a favorite
 
I suspect from your question that you may know about it, but tidytext is a very good package for doing this. That link goes to an introductory vignette.
posted by Cannon Fodder at 2:50 AM on January 15, 2020 [1 favorite]


Debiasing based on word2vec is a fun project for any corpus.
posted by PMdixon at 7:01 AM on January 15, 2020 [2 favorites]


You could do some NLP analysis, e.g. sentiment analysis.
posted by lazaruslong at 7:02 AM on January 15, 2020 [2 favorites]


Do you have author information? Might be interesting to look at how consistent an author is in their use of function words and other topic-independent indicators. Variability in those indicators can be a sign of ghostwriting or heavy-handed editing.

For an example of this, see Rosenthal & Yoon's 2011 paper Judicial Ghostwriting: Authorship on the Supreme Court.
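
A rough sketch of the consistency idea, in case it helps. It is written in Python only because it is plain string handling; the same logic should port to R. Everything here is an assumption for illustration: the ten-word function-word list is hypothetical and far shorter than a real study would use, the toy documents are made up, and "variability" is taken to mean the standard deviation of each word's rate across an author's documents, averaged over the words. It is not the method from the paper.

```python
import re
from collections import Counter
from statistics import pstdev

# Hypothetical, deliberately short list; a real analysis would use many more words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def function_word_profile(text):
    """Return the rate of each function word per 1,000 tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [1000 * counts[w] / total for w in FUNCTION_WORDS]


def author_variability(docs_by_author):
    """Map each author to the mean, over function words, of the standard
    deviation of that word's rate across the author's documents."""
    result = {}
    for author, docs in docs_by_author.items():
        profiles = [function_word_profile(d) for d in docs]
        # zip(*profiles) regroups the rates word by word
        spreads = [pstdev(rates) for rates in zip(*profiles)]
        result[author] = sum(spreads) / len(spreads)
    return result


# Made-up toy data, only to show the shape of the input.
docs_by_author = {
    "author_a": [
        "The cat sat on the mat and it slept.",
        "The dog ran to the park and it barked.",
    ],
    "author_b": [
        "Of all that is known, little is certain.",
        "Run fast. Jump high. Win big.",
    ],
}
variability = author_variability(docs_by_author)
```

Authors with a higher score are less consistent in their function-word use. With only a couple of short documents per author the numbers are noise, so this only becomes meaningful on the full dataset.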
posted by jedicus at 8:11 AM on January 15, 2020 [4 favorites]


This is a little off the wall, but hear me out: depending on how your data is organized (if you can separate headlines / section headers, for example) you could try to identify headlines or section headers that are singable to the Teenage Mutant Ninja Turtles meter.
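
A very rough sketch of a first pass, in Python because it is only string handling and should port to R easily. It assumes the target is eight syllables, matching "Teenage Mutant Ninja Turtles", and it only estimates syllable counts with a vowel-group heuristic. It does not check stress at all, so it will let through headlines that have the right length but the wrong rhythm; checking stress would need a pronouncing dictionary. The function names are made up for this example.

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count vowel groups, discount a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("teenage") but keep "-le" endings ("turtle").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def fits_tmnt_count(headline, target=8):
    """True if the estimated syllable total of `headline` equals `target`."""
    return sum(estimate_syllables(w) for w in headline.split()) == target


# Keep only the candidate headlines from a list.
headlines = ["Teenage Mutant Ninja Turtles", "Mayor Resigns"]
candidates = [h for h in headlines if fits_tmnt_count(h)]
```

The heuristic miscounts plenty of English words, so treat the output as a shortlist to sing through by hand.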
posted by lazaruslong at 8:23 AM on January 15, 2020 [3 favorites]


I know you already mentioned LDA, but in case you haven’t seen the R library LDAvis, it’s a great tool for exploring topic clusters.
posted by tinymegalo at 9:11 AM on January 15, 2020 [2 favorites]

