Tools for analyzing list-serv contents?
August 31, 2006 2:49 PM

Tools to analyze a mailing list archive?

I've got an archive of several years of messages sent to a very chatty list-serv. What are some good tools to analyze it? I'd like to find out things like overall volume, how many posts were made by certain people, trends, and common words. I'm just looking to make some fun charts and graphs to entertain the list.
posted by empath to Computers & Internet (6 answers total)
posted by Optimus Chyme at 3:02 PM on August 31, 2006

Agora uses mailing list archives to model the social relationships implied by mailing list interactions.
posted by jimw at 3:51 PM on August 31, 2006

What's meant by "an archive"? One enormous text file? How big? A number of text files? On your hard disk or in your email client? I'm with optimus (except for the capital G on grep). The only way you're really going to do this is by searching through the files using some kind of regular expression syntax, which is non-trivial if you're new to that kind of thing.

It's possible there is analysis software for mailing lists, but that would surely work on the server they were sent from, as part of the management software. You're not going to find specific software for analysis like this on local files.
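As a rough illustration of the regular-expression approach described above, here is a minimal sketch in Python (not a tool anyone in this thread mentioned). It assumes the archive is plain text with one "From:" header line per message; the function name and the sample text are made up for the example. A body line that happens to start with "From:" would be miscounted, so treat the numbers as approximate.

```python
import re
from collections import Counter

# Matches a "From:" header line and captures the first address-shaped token on it.
# The mbox separator line ("From user@host date") has no colon, so it is skipped.
FROM_RE = re.compile(
    r"^From:.*?([\w.+-]+@[\w-]+(?:\.[\w-]+)+)",
    re.MULTILINE | re.IGNORECASE,
)


def count_senders(text):
    """Return a Counter mapping lowercased sender address to number of From: lines."""
    return Counter(addr.lower() for addr in FROM_RE.findall(text))


# Hypothetical three-message archive, just to exercise the function.
sample = (
    "From alice@example.com Thu Aug 31 14:49:00 2006\n"
    "From: Alice <alice@example.com>\n"
    "Subject: hello\n\nbody\n\n"
    "From bob@example.org Thu Aug 31 15:02:00 2006\n"
    "From: bob@example.org\n"
    "Subject: re: hello\n\nbody\n\n"
    "From alice@example.com Thu Aug 31 15:51:00 2006\n"
    "From: ALICE@example.com\n"
    "Subject: re: hello\n\nbody\n"
)

for address, posts in count_senders(sample).most_common():
    print(address, posts)
```

For a real archive, read the file into a string (or process it line by line) and pass it to `count_senders`.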
posted by AmbroseChapel at 5:39 PM on August 31, 2006

(except for the capital G on grep)

It was the beginning of a sentence. >:(
posted by Optimus Chyme at 5:45 PM on August 31, 2006

This kind of thing is pretty easy to do in Perl. Below is a script I wrote a couple of years ago that takes an mbox file as input and just prints very basic stats on the top posters, by email address. It's not very sophisticated, but maybe you can modify it further.
#!/usr/bin/perl -w

use strict;
use FileHandle;
use Mail::Mbox::MessageParser;

my $file_name = $ARGV[0] || 'mbox';
my $file_handle = new FileHandle($file_name)
    or die "Can't open $file_name: $!\n";

my $folder_reader =
    new Mail::Mbox::MessageParser( {
        'file_name' => $file_name,
        'file_handle' => $file_handle,
        'enable_cache' => 0,
        'enable_grep' => 1,
    } );

# On failure the constructor returns an error string instead of an object
die $folder_reader unless ref $folder_reader;

# Any newlines or such before the start of the first email
my $prologue = $folder_reader->prologue;

my $msgcount = 0;
my $foundaddr = 0;
my %addrcount;

# This is the main loop. It's executed once for each email
while(!$folder_reader->end_of_file()) {
    # Progress indicator on STDERR every 100 messages
    if((++$msgcount) % 100 == 0) {
        print STDERR "\r$msgcount";
    }

    my $email = $folder_reader->read_next_email();

    # Grab the contents of the From: header
    if($$email =~ /\nFrom: (.*)\r?\n/) {
        my $fromaddr = $1;

        # Pull the bare address out of things like "Name <user@host>"
        if( $fromaddr =~ m/[\w\d\.\,\-\=\+\_\%\$\!\#\^\&]+\@([\w\d\-]+\.)+[\w\d\-]+/ ) {
            $foundaddr++;

            my $addr = lc($&);
            if(++$addrcount{$addr} == 1) {
                # First time this address has been seen
                # print "$addr\n";
            }
        }
    }
}

print STDERR "\n\n";

print "$msgcount messages processed, of which $foundaddr had legible email addresses.\n\n";
my $count = 1;
foreach my $email (sort { $addrcount{$b} <=> $addrcount{$a} } keys %addrcount) {
    printf("%2d. %-60.60s %d\n", $count++, obfuscate($email), $addrcount{$email});
}

# Make addresses a little harder to harvest before posting the results
sub obfuscate {
    my $addr = shift;

    $addr =~ s/@/ AT /ig;
    $addr =~ s/\./ dot /ig;

    return $addr;
}

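One possible direction for modifying it further is message volume over time. The following is only a sketch, written in Python rather than Perl so it can use the standard library's `mailbox` module. The function name `monthly_volume` is hypothetical, and it assumes every message carries a standard "Date:" header. Messages with a missing or unparseable date are skipped.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def monthly_volume(mbox_path):
    """Count messages per 'YYYY-MM' bucket, based on each message's Date header."""
    counts = Counter()
    box = mailbox.mbox(mbox_path)
    try:
        for msg in box:
            date_header = msg.get("Date")
            if not date_header:
                continue  # no Date header at all
            try:
                when = parsedate_to_datetime(str(date_header))
            except (TypeError, ValueError):
                continue  # Date header present but not parseable
            counts[when.strftime("%Y-%m")] += 1
    finally:
        box.close()
    return counts


# Example use (the path is a placeholder):
# for month, total in sorted(monthly_volume("list-archive.mbox").items()):
#     print(month, total)
```

The sorted month/total pairs can be pasted into a spreadsheet or fed to whatever charting tool you like.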
posted by Rhomboid at 6:21 PM on August 31, 2006

Your regular expression scares me.

\w includes both \d and underscore, for a start.
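A quick way to see the redundancy, sketched here in Python, whose `re` module treats `\w` the same way as Perl on this point (letters, digits, and underscore). The helper name and test string are made up for the example.

```python
import re


def same_matches(text):
    """True if [\\w\\d_]+ and \\w+ find exactly the same tokens in text."""
    return re.findall(r"[\w\d_]+", text) == re.findall(r"\w+", text)


# \w on its own already accepts digits and the underscore.
print(re.fullmatch(r"\w+", "abc_123") is not None)

# Adding \d and _ to the character class changes nothing.
print(same_matches("user_42@mail-host.example"))
```

Both lines should print True, which is why `[\w\d\-]` in the script can be shortened to `[\w\-]` without changing what it matches.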
posted by AmbroseChapel at 7:14 PM on August 31, 2006
