How to extract metadata about emails?
May 15, 2006 3:01 AM Subscribe
What's the easiest way to extract data about my emails? It's for the purposes of making a visualization along similar lines to Mountain.
Basically I have an idea for a Flash based visualization which would require a database of metadata about all my emails...
- who it's from
- who it's going to
- how many characters it has
- date
I want to feed that information into Flash and then do pretty things with it. The format can be flexible - e.g. CSV, XML, etc. - I've wrangled many different text sources with ActionScript before.
Is there a way to extract this data from Gmail directly, or would I need to download it all to something like mbox format and run some sort of old-skool Perl magic on them all?
From, To and Date are already in the headers. Size has been put into various headers, but all of them have been kludges that didn't work (indeed, the rule of mbox is NEVER, EVER trust those numbers.)
Scanning those four bits of information out of a mbox format isn't hard. A message is From to From, or as a regular expression, "^From ". Yes, the space is important, "^From:" is a different header.
So, scan until you see "^From " at the start of a line. That's the start of the message. Count characters until you see "^From " again. That's the size.
Now, while you're scanning, you'll want to look for "^To:", "^From:" and "^Date:", and when you see them, capture every character between the colon and the end of line.
Now, the fun part -- parsing out. Email addresses are easy, they'll be on the "^To:" and "^From:" lines, enclosed in < >. Any other text on that line is freeform.
Dates are harder. We've sort of settled on "Day, DD Mon YYYY HH:MM:SS +-TZTZ", so right now here is "Mon, 15 May 2006 06:47:50 -0500", but I've seen other bogosities. Ideally, you convert to something easier to work with.
Once you have those four bits, write out the data and then repeat until finished.
posted by eriko at 4:48 AM on May 15, 2006
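The scan described in the comment above can be sketched roughly as follows. Python is a choice made here for illustration (the thread itself only mentions Perl), and the names `scan_mbox_lines` and `to_iso` are hypothetical. This is a minimal sketch under simplifying assumptions, not a robust parser: any line starting with "From " is treated as a message boundary, header names are matched case-sensitively as written in the comment, folded (multi-line) headers are ignored, and the size is counted in characters from the line after "From " up to the next "From ".

```python
from email.utils import getaddresses, parseaddr, parsedate_to_datetime


def to_iso(value):
    """Convert a 'Mon, 15 May 2006 06:47:50 -0500' style date to ISO 8601.

    Unparseable dates are returned unchanged so no data is lost.
    """
    try:
        return parsedate_to_datetime(value).isoformat()
    except (TypeError, ValueError):
        return value


def scan_mbox_lines(lines):
    """Yield one dict per message from an iterable of mbox text lines."""
    record = None
    in_headers = False
    for line in lines:
        if line.startswith("From "):        # "^From " (with the space) starts a message
            if record is not None:
                yield record
            record = {"from": "", "to": "", "date": "", "chars": 0}
            in_headers = True
            continue
        if record is None:
            continue                        # skip anything before the first message
        record["chars"] += len(line)        # count characters until the next "From "
        if not in_headers:
            continue
        if line.strip() == "":
            in_headers = False              # first blank line ends the header block
        elif line.startswith("From:"):
            record["from"] = parseaddr(line[5:].strip())[1]
        elif line.startswith("To:"):
            pairs = getaddresses([line[3:].strip()])
            record["to"] = ";".join(addr for _name, addr in pairs if addr)
        elif line.startswith("Date:"):
            record["date"] = to_iso(line[5:].strip())
    if record is not None:
        yield record
```

Called as `scan_mbox_lines(open(path, encoding="utf-8", errors="replace"))`, it yields dictionaries that can be passed to `csv.DictWriter` to produce the CSV the question asks for.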
POP them all into a mbox file and run your stats script. I don't think there's any other sane way to do it. Screen scraping is going to be a hell of a lot slower and more complicated when you essentially want to retrieve every message anyway.
posted by Rhomboid at 5:55 AM on May 15, 2006
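For the "POP them all into a mbox file" step, one possible sketch uses Python's standard `poplib` and `mailbox` modules. Python is an illustrative choice, and the function names `copy_pop_to_mbox` and `fetch_all` are hypothetical. The host, user name and password are placeholders to be supplied by the reader, and the sketch assumes POP access is enabled on the account and that the server offers POP3 over SSL on the default port.

```python
import mailbox
import poplib


def copy_pop_to_mbox(conn, mbox_path):
    """Append every message on an open POP3 connection to an mbox file.

    `conn` is expected to behave like poplib.POP3: stat() returns
    (message_count, mailbox_size) and retr(n) returns
    (response, list_of_byte_lines, octets).
    """
    box = mailbox.mbox(mbox_path)           # the file is created if it is missing
    try:
        count, _total_size = conn.stat()
        for number in range(1, count + 1):  # POP3 numbers messages from 1
            _response, lines, _octets = conn.retr(number)
            box.add(b"\n".join(lines) + b"\n")
        box.flush()
    finally:
        box.close()
    return count


def fetch_all(host, user, password, mbox_path):
    """Log in over POP3-over-SSL and copy the whole mailbox to mbox_path."""
    conn = poplib.POP3_SSL(host)            # uses port 995 unless told otherwise
    try:
        conn.user(user)
        conn.pass_(password)
        return copy_pop_to_mbox(conn, mbox_path)
    finally:
        conn.quit()
```

Keeping the copy loop separate from the login makes it possible to try the mbox-writing part against a stand-in connection object before pointing it at a real account.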
eriko, it's more like "\n\nFrom " (i.e. it must be preceded by a blank line) and anyway, he should use an existing mbox parser from e.g. CPAN rather than reinvent this long-invented wheel.
posted by Rhomboid at 5:58 AM on May 15, 2006
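The comment above points to CPAN for an existing parser. As an illustration of the same "don't reinvent the wheel" idea in another language, here is a minimal sketch using Python's standard-library `mailbox` module instead. The function names (`addresses`, `iso_date`, `mbox_to_csv`) and the CSV column names are hypothetical, and the size is taken as the number of characters in each message after decoding it as UTF-8 with replacement, which is an assumption about what "how many characters" should mean.

```python
import csv
import email
import mailbox
from email.utils import getaddresses, parsedate_to_datetime


def addresses(value):
    """Return the bare addresses from a header value, joined with ';'."""
    return ";".join(addr for _name, addr in getaddresses([str(value)]) if addr)


def iso_date(value):
    """Convert an email Date header to ISO 8601, or return it unchanged."""
    try:
        return parsedate_to_datetime(str(value)).isoformat()
    except (TypeError, ValueError):
        return str(value)


def mbox_to_csv(mbox_path, csv_path):
    """Write one CSV row (from, to, chars, date) per message; return the count."""
    box = mailbox.mbox(mbox_path, create=False)
    count = 0
    try:
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["from", "to", "chars", "date"])
            for key in box.keys():
                raw = box.get_bytes(key)    # the message without its "From " line
                message = email.message_from_bytes(raw)
                chars = len(raw.decode("utf-8", errors="replace"))
                writer.writerow([
                    addresses(message.get("From", "")),
                    addresses(message.get("To", "")),
                    chars,
                    iso_date(message.get("Date", "")),
                ])
                count += 1
    finally:
        box.close()
    return count
```

A call such as `mbox_to_csv("mail.mbox", "mail.csv")` (both file names are placeholders) would produce a file that Flash can load as plain text.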
D'oh - completely forgot about the POP interface. That's a much better way of doing it.
Rhomboid wins!
posted by wilberforce at 7:37 AM on May 15, 2006
eriko, it's more like "\n\nFrom "
Nope -- the start is "From " -- that's it. No newline needed, nor allowed. Your pattern would never hit a message in most mailers.
The mistake I made was that the *end* of message is a bare newline -- "^\n". Thus, you scan until you hit the bare line, then scan to the next "From " and start the next message.
I'm annoyed that I shanked that, given the number of trashed mbox files I've rebuilt to recover email from damaged filesystems.
posted by eriko at 12:07 PM on May 15, 2006
It won't work if you're using line-based regexps, but if you are using whole-file regexps or setting the input delimiter as in perl/awk then it will. My point was that merely scanning for "^From " is not sufficient; the line immediately prior must be a blank line, so the effective separator of messages is "newline newline F r o m space", except, of course, for the first and last message in the file.
posted by Rhomboid at 5:33 PM on May 15, 2006
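The whole-file separator described in the comment above can be sketched in Python (an illustrative language choice; the comment itself mentions perl/awk). The names `SEPARATOR` and `split_mbox_text` are hypothetical. The sketch assumes the file fits in memory and that a body paragraph beginning with "From " after a blank line has been escaped by whatever wrote the mbox (commonly as ">From "); it does not check that.

```python
import re

# "newline newline F r o m space": a blank line followed by a "From " line.
# The lookahead keeps the "From " line attached to the message it starts.
SEPARATOR = re.compile(r"\n\n(?=From )")


def split_mbox_text(text):
    """Split the full text of an mbox file into individual messages.

    Each returned string begins with its "From " envelope line. The first
    message needs no preceding blank line, matching the exception noted
    for the first message in the file.
    """
    text = text.replace("\r\n", "\n")       # normalise line endings first
    chunks = SEPARATOR.split(text)
    return [chunk.rstrip("\n") for chunk in chunks if chunk.startswith("From ")]
```

With this separator, a body line such as "From here on it gets odd" that directly follows another body line stays inside its message, whereas a purely line-based match on "^From " would cut the message in two at that point.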
So you might need to use that or something similar to do the processing.
posted by wilberforce at 4:29 AM on May 15, 2006