How do I export or convert my Thunderbird inbox into the Unix mailbox format?
April 11, 2007 12:47 PM
I need to compile a spam corpus to test a Bayesian spam filter. The Thunderbird mailbox looks like a Unix-style mailbox, but it adds some extra headers: X-Account-Key, X-UIDL, X-Mozilla-Status, X-Mozilla-Status2.
Sorry, I clicked submit by accident. If you really don't want those headers there, you would have to write a script to remove them (via Perl or any other language)...
There's really no need to, though. Since they're present in all of your messages good and bad, the Bayesian filter shouldn't look at them as bad or anything. It's already in valid unix mailbox format, it just has some extra headers. That doesn't make it incompatible though.
posted by twiggy at 1:07 PM on April 11, 2007
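For anyone who does want such a script, here is a minimal sketch in Python using the standard library's mailbox module. It assumes the Thunderbird file parses as a plain mbox; the function name strip_thunderbird_headers and the file paths are made up for illustration, and the header list is simply the one from the question. Because the module parses each message, it only touches real headers and leaves message bodies alone, though messages are re-serialized on the way out, so the output may not be byte-identical to the input.

```python
import mailbox

# Headers Thunderbird adds to each stored message (names from the question).
THUNDERBIRD_HEADERS = (
    "X-Account-Key",
    "X-UIDL",
    "X-Mozilla-Status",
    "X-Mozilla-Status2",
)


def strip_thunderbird_headers(src_path, dst_path, headers=THUNDERBIRD_HEADERS):
    """Copy the mbox at src_path to dst_path, dropping the given headers.

    Hypothetical helper for illustration. Returns the number of messages copied.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)
    count = 0
    try:
        for msg in src:
            for name in headers:
                # Deleting a missing header is not an error, and header
                # names are matched case-insensitively.
                del msg[name]
            dst.add(msg)
            count += 1
        dst.flush()
    finally:
        src.close()
        dst.close()
    return count
```

Calling strip_thunderbird_headers("Inbox", "Inbox.clean") on a copy of the mailbox file would write the cleaned corpus to Inbox.clean; both file names are placeholders.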
Fwiw, instead of "Unix format", you should describe it as "mbox", assuming that's what you mean (instead of "Maildir", e.g.). You'll get more search hits that way.
posted by cmiller at 1:35 PM on April 11, 2007
Best answer: to strip the unwanted headers:
grep -v X-Account-Key yourfile | grep -v X-UIDL | ... etc ... > yournewfile
assuming, of course, that you don't have those strings in any of the body of your messages.
posted by sergeant sandwich at 4:43 PM on April 11, 2007
there's probably a cleverer way to do that using fgrep though.
posted by sergeant sandwich at 4:44 PM on April 11, 2007
egrep i mean. argh
posted by sergeant sandwich at 4:45 PM on April 11, 2007
Best answer:
egrep -v '^(X-Account-Key|X-UIDL|X-Mozilla-Status)' yourfile > yournewfile
Forcing it to find only at the beginning of the line ('^') is both safer and faster.
posted by oats at 5:50 PM on April 11, 2007
*Ahem*, it's legal to have those in the body of the message. Imagine, e.g., what happens when my email contains an encoded attachment with a line that starts with "X-UIDL".
You need a state machine. You can do it with sed.
posted by cmiller at 11:21 AM on April 14, 2007
Oh, and RFC 822 headers can span several lines.
Header: one two three
    four five six
The next line starts with whitespace, so it is part of the same header. grep cannot help with this.
I hate to be a spoil-sport here.
posted by cmiller at 11:23 AM on April 14, 2007
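To make the state-machine idea concrete, here is a rough sketch in Python rather than sed. It tracks two pieces of state: whether the current line is inside a message's header block, and whether the header currently being read is one to drop, so that folded continuation lines are dropped along with it. The function name strip_headers and the regular expression are made up for illustration; the header names come from the question. It assumes body lines that begin with "From " have been escaped as ">From ", as mbox writers normally do, so an unescaped "From " line always starts a new message.

```python
import re

# Header names taken from the question; matching is case-insensitive.
UNWANTED = re.compile(
    r"^(X-Account-Key|X-UIDL|X-Mozilla-Status2?):", re.IGNORECASE
)


def strip_headers(lines):
    """Yield mbox lines, omitting unwanted headers (and their folded
    continuation lines) from each message's header block only."""
    in_headers = False  # state: inside a header block (True) or a body
    skipping = False    # state: the current header is being dropped
    for line in lines:
        if line.startswith("From "):
            # mbox separator line: a new message and header block begin.
            in_headers, skipping = True, False
            yield line
        elif not in_headers:
            # Body lines pass through untouched, whatever they look like.
            yield line
        elif line.strip("\r\n") == "":
            # The first blank line ends the headers; the body follows.
            in_headers, skipping = False, False
            yield line
        elif line[0] in " \t":
            # Folded continuation: shares the fate of the header it extends.
            if not skipping:
                yield line
        else:
            # A new header starts here; decide whether to drop it.
            skipping = bool(UNWANTED.match(line))
            if not skipping:
                yield line
```

The generator accepts any iterable of lines, so it can be fed an open file object and its output passed to writelines() on a second file. Opening both files with a single-byte encoding such as latin-1 is one way to avoid decode errors on mixed-charset mail.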
This thread is closed to new comments.
posted by twiggy at 1:06 PM on April 11, 2007