How do I export or convert my Thunderbird inbox into the Unix mailbox format?
April 11, 2007 12:47 PM


I need to compile a spam corpus to test a Bayesian spam filter. A Thunderbird mailbox looks like a Unix-style mailbox, but it adds some extra headers: X-Account-Key, X-UIDL, X-Mozilla-Status, X-Mozilla-Status2.
posted by FakeOutdoorsman to Computers & Internet (9 answers total) 1 user marked this as a favorite
 
My understanding is that Thunderbird mailboxes are already in unix mailbox format.
posted by twiggy at 1:06 PM on April 11, 2007


Sorry, I clicked submit by accident. If you really don't want those headers there, you would have to write a script to remove them (in Perl or any other language)...

There's really no need to, though. Since they're present in all of your messages, good and bad, the Bayesian filter shouldn't treat them as a sign of spam. The file is already in valid unix mailbox format; it just has some extra headers, and that doesn't make it incompatible.
posted by twiggy at 1:07 PM on April 11, 2007
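
For anyone who does want such a script, here is a minimal sketch in Python using the standard library's mailbox module. The function name strip_headers and both file paths are made up for illustration, and it assumes the Thunderbird file parses cleanly as mbox. Because it works on parsed messages, it only touches real headers (folded ones included) and leaves message bodies alone.

```python
import mailbox

# The extra headers named in the question; adjust the list as needed.
THUNDERBIRD_HEADERS = (
    "X-Account-Key",
    "X-UIDL",
    "X-Mozilla-Status",
    "X-Mozilla-Status2",
)


def strip_headers(src_path, dst_path, names=THUNDERBIRD_HEADERS):
    """Copy each message from src_path to dst_path without the named headers.

    Returns the number of messages copied. (Hypothetical helper, not part
    of Thunderbird or of the mailbox module.)
    """
    src = mailbox.mbox(src_path, create=False)  # fail if the source is missing
    dst = mailbox.mbox(dst_path)                # created if it does not exist
    count = 0
    try:
        for msg in src:
            for name in names:
                # Deleting a header is case-insensitive, removes every
                # occurrence, and is a no-op if the header is absent.
                del msg[name]
            dst.add(msg)
            count += 1
        dst.flush()
    finally:
        dst.close()
        src.close()
    return count
```

Something like strip_headers("Inbox", "corpus.mbox") would then write the cleaned copy, with both paths being placeholders. Run it on a copy of the mailbox while Thunderbird is closed.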


Fwiw, instead of "Unix format", you should describe it as "mbox", assuming that's what you mean (instead of "Maildir", e.g.). You'll get more search hits that way.
posted by cmiller at 1:35 PM on April 11, 2007


to strip the unwanted headers:
grep -v X-Account-Key yourfile | grep -v X-UIDL | ... etc ... > yournewfile
assuming, of course, that you don't have those strings in any of the body of your messages. (grep is case-sensitive, so the names have to match the headers exactly.)
posted by sergeant sandwich at 4:43 PM on April 11, 2007


there's probably a cleverer way to do that using fgrep though.
posted by sergeant sandwich at 4:44 PM on April 11, 2007


egrep i mean. argh
posted by sergeant sandwich at 4:45 PM on April 11, 2007


egrep -v '^(X-Account-Key|X-UIDL|X-Mozilla-Status)' yourfile > yournewfile

Forcing it to find only at the beginning of the line ('^') is both safer and faster.
posted by oats at 5:50 PM on April 11, 2007


*Ahem*, it's legal to have those in the body of the message. Imagine, e.g., what happens when my email contains an encoded attachment with a line that starts with "X-UIDL".

You need a state machine. You can do it with sed.
posted by cmiller at 11:21 AM on April 14, 2007


Oh, and RFC 822 headers can span several lines.

Header: one two three
 four five six

The next line starts with whitespace, so it is still part of the same header. grep cannot help with this.

I hate to be a spoil-sport here.
posted by cmiller at 11:23 AM on April 14, 2007
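
The state machine described above might look roughly like this in Python rather than sed. It is only a sketch: the names UNWANTED and strip_mbox_headers are made up, and it assumes every message starts with a "From " separator line and that body lines beginning with "From " have been escaped as ">From ", which mbox writers normally do. It drops a header only inside a header block, along with that header's whitespace-led continuation lines.

```python
import re

# Headers to drop. Matched only at the start of a header line; the "2?"
# covers both X-Mozilla-Status and X-Mozilla-Status2.
UNWANTED = re.compile(
    r"^(X-Account-Key|X-UIDL|X-Mozilla-Status2?):", re.IGNORECASE
)


def strip_mbox_headers(lines):
    """Yield mbox lines, minus unwanted headers and their continuations."""
    in_headers = False  # True from a "From " separator to the first blank line
    skipping = False    # True while inside a header that is being dropped
    for line in lines:
        if line.startswith("From "):
            # Message separator: a new header block begins.
            in_headers, skipping = True, False
            yield line
        elif in_headers:
            if line.strip("\r\n") == "":
                # A blank line ends the headers; the body follows untouched.
                in_headers = skipping = False
                yield line
            elif line[0] in " \t":
                # Continuation line: shares the fate of the header it extends.
                if not skipping:
                    yield line
            else:
                skipping = bool(UNWANTED.match(line))
                if not skipping:
                    yield line
        else:
            # Body line: passed through even if it starts with "X-UIDL".
            yield line
```

To use it, feed it the lines of the mailbox file and write out whatever it yields, for example with both files opened using encoding="latin-1" and newline="" so that bytes and line endings pass through unchanged.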

