Analyzing a large corpus of email data
July 26, 2013 3:12 PM

I'm working on a project for class that involves looking for patterns in a large corpus of email data using Gephi. Any patterns we can find are fine, so long as we can justify them and back them up with qualitative analysis. This is my first time doing analysis on this scale, and I'm not entirely sure where to start. I've run a few different layout algorithms on it and had the best results with Force Atlas 2. I've also filtered out the nodes with an out-degree of 1 and ranked node sizes by betweenness centrality. The graph is directed, with edges sized according to their weight (determined by the number of mails sent), so a lot of the layout plugins I've been finding won't work, as they're tailored for undirected graphs. Is there anything obvious I'm missing that might make for a compelling visualization or show interesting connections in the network?
posted by codacorolla to Computers & Internet (4 answers total)
 
You might be able to get some inspiration from the blog entry where Stephen Wolfram creates several visualizations based on his personal email archive (discussion on the blue).
posted by ceribus peribus at 4:10 PM on July 26, 2013


Similar to what Wolfram did, I would advocate some non-graph visualizations. Use some scatterplots based on derived values and subsets.
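As a rough sketch of what "derived values" could mean here, the snippet below turns a weighted edge list into one (sent, received) point per person, which is the kind of table a scatterplot would be drawn from. The edge list and the `sent_received` helper are made up for illustration; real data would come from the corpus export.

```python
from collections import Counter

# Hypothetical weighted edge list: (sender, recipient, number of mails).
edges = [
    ("ann", "bob", 12),
    ("ann", "cat", 3),
    ("bob", "ann", 7),
    ("cat", "ann", 1),
    ("cat", "bob", 4),
]

def sent_received(edges):
    """Return one (person, mails_sent, mails_received) tuple per person."""
    sent, received = Counter(), Counter()
    for sender, recipient, n in edges:
        sent[sender] += n
        received[recipient] += n
    people = sorted(set(sent) | set(received))
    return [(p, sent[p], received[p]) for p in people]

points = sent_received(edges)
for person, s, r in points:
    print(person, s, r)
```

Plotting sent against received would then separate people who mostly broadcast from people who mostly receive, which a force-directed layout tends to hide.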
posted by demiurge at 4:28 PM on July 26, 2013


If you have access to the text of the emails, you might try a Google Zeitgeist style approach and plot word frequency as a function of time. Bonus points if the word frequency propagates through the social graph in an interesting way. Then you could create a movie where nodes in the graph light up as the word use increases; hopefully you'll see an outward propagating wave. For example, if the emails come from a software development house, words like "bug-fix" and "beta" and "release" should stand out at particular points in the development cycle.
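A minimal sketch of the counting step, assuming each email is available as a (date, body) pair. The sample emails, the month-sized buckets, and the `monthly_term_counts` helper are all illustrative choices, not anything specific to the corpus.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical toy data: (sent date, body text).
emails = [
    (date(2013, 1, 7), "Found a bug in the parser, bug-fix coming"),
    (date(2013, 1, 9), "Another bug report from QA"),
    (date(2013, 2, 4), "Beta build is ready for testing"),
    (date(2013, 2, 20), "Beta feedback and one last bug"),
    (date(2013, 3, 1), "Release is scheduled for Friday"),
]

def monthly_term_counts(emails, terms):
    """Count how often each tracked term appears per (year, month) bucket."""
    terms = {t.lower() for t in terms}
    counts = defaultdict(Counter)
    for sent, body in emails:
        bucket = (sent.year, sent.month)
        for token in body.lower().split():
            # Strip surrounding punctuation; exact-match only, so "bug-fix" != "bug".
            token = token.strip(".,;:!?\"'()")
            if token in terms:
                counts[bucket][token] += 1
    return counts

counts = monthly_term_counts(emails, ["bug", "beta", "release"])
for bucket in sorted(counts):
    print(bucket, dict(counts[bucket]))
```

To get at the propagation idea, the same counts could be kept per sender as well as per month, so each node has its own time series to animate.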
posted by eigenman at 8:27 PM on July 26, 2013


You might look into ranking nodes by degree (the distribution will likely be power-law or something close to it) and finding some interesting terms in the email subjects and bodies. To find a starting place, compute tf-idf after filtering out typical stop words, then manually pick out a few terms. From there, watch how the appearance of those terms over time relates to where in the graph they appear: highly connected subcomponents versus sparsely connected ones.
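Here is a bare-bones sketch of the tf-idf step in plain Python, assuming each email body is one document. The stop-word list is deliberately tiny and the weighting is the simplest variant (term frequency times log of inverse document frequency); a real run would use a fuller stop-word list and probably a library implementation.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "is", "for", "to", "and", "of", "in", "on"}

# Hypothetical documents standing in for email bodies.
docs = [
    "the beta build is ready for testing",
    "the release is scheduled for friday",
    "beta feedback and the release notes",
]

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter()  # number of documents each term appears in
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

scores = tf_idf(docs)
for doc_scores in scores:
    print(sorted(doc_scores, key=doc_scores.get, reverse=True)[:3])
```

Terms that show up in most documents score near zero, so what floats to the top of each list is a reasonable shortlist to pick from by hand.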

You can get an idea of how close a node is to the 'center' of the graph by measuring the average shortest path length from it to each (or the nearest) of the 10 (or 20, or 50, whatever) highest-degree nodes.
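A small sketch of that measure, under a few assumptions of mine: edges are followed in their sent direction, every edge counts as one hop regardless of weight, degree means in-degree plus out-degree, and hubs that can't be reached from the node are left out of the average. The toy graph and function names are hypothetical.

```python
from collections import deque

# Hypothetical directed graph: sender -> list of recipients.
graph = {
    "ann": ["bob", "cat", "dan"],
    "bob": ["ann", "cat"],
    "cat": ["ann"],
    "dan": ["eve"],
    "eve": ["dan"],
}

def total_degree(graph):
    """In-degree plus out-degree for every node."""
    deg = {n: len(nbrs) for n, nbrs in graph.items()}
    for nbrs in graph.values():
        for n in nbrs:
            deg[n] = deg.get(n, 0) + 1
    return deg

def bfs_distances(graph, source):
    """Hop counts from source to every node reachable along edge direction."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph.get(node, []):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def avg_distance_to_hubs(graph, node, k=2):
    """Mean hop count from node to the k highest-degree nodes it can reach."""
    deg = total_degree(graph)
    hubs = sorted(deg, key=lambda n: (-deg[n], n))[:k]  # ties broken by name
    dist = bfs_distances(graph, node)
    reachable = [dist[h] for h in hubs if h in dist and h != node]
    return sum(reachable) / len(reachable) if reachable else None

for person in sorted(graph):
    print(person, avg_distance_to_hubs(graph, person))
```

Lower numbers mean closer to the core. A result of None means the node can't reach any of the hubs, which is worth flagging on its own in a directed graph.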
posted by axiom at 9:13 PM on July 26, 2013

