Little text manipulation help?
November 2, 2009 10:54 AM

I need some help repairing some text files. I don't understand regular expressions and the like well enough to do it myself.

I have two badly formed mbox files.

One is missing the "From " line that marks the start of each new message. Instead, there's only a "From: " header. I need that "From: " header to stay, since without it every message shows up as having no sender when I import the file into any client. So I need a way to add a "From " line above each line where a "From: " occurs. As far as I know, the "From " line just holds some mailer daemon's address, which I don't really care about.

The second is the same, but the messages also show up as having a bad date. Currently, there is a field along the lines of "Sent: Monday, January 13, 2003 1:11 PM". Proper mbox files seem to use something like "Date: Mon, 14 Apr 2003 12:53:22 -0600". Will just changing "Sent" to "Date" fix that, or do I need some really good regexing?
posted by yesno to Computers & Internet (14 answers total)
Response by poster: I am running OS X so I have the full range of Unix programs. I also have TextMate.
posted by yesno at 10:55 AM on November 2, 2009

Can you supply a sample file with the bad output and the output you want? (Just a single message.)
posted by wolfr at 10:58 AM on November 2, 2009

Response by poster: Here's a non-personal example


From "Warp Bot"
To: "Warp Info"
Sent: Thursday, March 28, 2002 12:33 PM
Subject: Warp Records Letter - March 28th - live news

Hello and welcome to a Warp Records news letter.
We have lots and lots of live action coming up.


This one lacks both a proper sender and a proper date.

It should be:


From: "Warp Bot"
To: "Warp Info"
Date: 28 Mar 2002 12:33
Subject: Warp Records Letter - March 28th - live news

Hello and welcome to a Warp Records news letter.
We have lots and lots of live action coming up.


If the date does have to be reformatted, I don't care about getting the time right. Just the date.
posted by yesno at 11:09 AM on November 2, 2009

1. Search: From \"Warp Bot\"

Replace: From MAILER-DAEMON(ctrl-m character)From: \"Warp Bot\"

2. For the date, you'll need to do 12 searches for the different months.

First, replace Sent: with Date:. Then get rid of all your days of the week (search for "Thursday, " for example, with the space, and replace with nothing). Then:

Search: March {[0-9]+},

Replace: \0 Mar(space)

Then do this with the different months.

Hope this helps; it's been a while.
posted by Melismata at 11:21 AM on November 2, 2009

Best answer: Here's a sed command to fix your From lines, if I'm understanding your problem correctly:
(Substitute appropriate values for YOUR_FILE and NEW_FILE)
cat YOUR_FILE | sed -e 's/^From/From MAILER-DAEMON\nFrom:/' > NEW_FILE
I think you missed a : after the first From in your example; if so, just remove the last : from the regex.

To just substitute Date for Sent and see if it works, also do this:
cat NEW_FILE | sed -e 's/^Sent:/Date:/' > NEWER_FILE

posted by Dr Dracator at 11:43 AM on November 2, 2009
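
A minimal Python sketch of the same two substitutions, for cases where a particular sed does not turn \n in the replacement into a line break. This is an illustration, not the method used in the thread: `repair_headers` and its `envelope` parameter are hypothetical names, and it assumes every message starts with a line beginning with "From " (no colon), as in the sample above. Like the sed command, it would also rewrite any body line that happens to begin with "From ".

```python
def repair_headers(text, envelope="From MAILER-DAEMON"):
    """Add an mbox separator line before each bare 'From ' line.

    Assumption: a line starting with 'From ' (no colon) is the sender
    header of a new message. The bare line becomes a 'From:' header and
    the separator given in `envelope` is inserted above it.
    'Sent:' is only renamed to 'Date:'; the date text is left untouched.
    """
    out = []
    for line in text.splitlines():
        if line.startswith("From "):
            out.append(envelope)                      # mbox separator line
            out.append("From:" + line[len("From"):])  # real sender header
        elif line.startswith("Sent:"):
            out.append("Date:" + line[len("Sent:"):])
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = (
        'From "Warp Bot"\n'
        'To: "Warp Info"\n'
        "Sent: Thursday, March 28, 2002 12:33 PM\n"
        "Subject: Warp Records Letter - March 28th - live news\n"
        "\n"
        "Hello and welcome to a Warp Records news letter.\n"
    )
    print(repair_headers(sample), end="")
```

To apply it to a file, read YOUR_FILE into a string, pass it through `repair_headers`, and write the result to NEW_FILE. Some mail clients may expect more on the separator line than "From MAILER-DAEMON" (typically an address followed by a timestamp); the `envelope` argument is where that text would go.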

Hard to say without trying it on your input files, but `formail` is usually good for mbox munging.
posted by alikins at 11:48 AM on November 2, 2009

Response by poster: Emails should begin in a mbox file with a "From " (no colon). The "From:" is the real-life sender.
posted by yesno at 12:06 PM on November 2, 2009

Response by poster:
cat YOUR_FILE |sed -e 's/^From/From MAILER-DAEMON\nFrom:/' > NEW_FILE

This *almost* works. It puts in the new "From " line before the "From: " field. But there's no linebreak between them.

I get this:

From MAILER-DAEMONn From: "John B"

From this:

From "John B"

When what I really want is this:

From MAILER-DAEMON
From: "John B"

posted by yesno at 12:14 PM on November 2, 2009

The \n should have been a line break. Are you sure you aren't forgetting the \ after DAEMON?
posted by Dr Dracator at 12:26 PM on November 2, 2009

To elaborate on alikins's recommendation of formail:
Processing mbox files is the reason for formail's existence. From the formail(1) manpage:
To convert a non-standard mailbox file into a standard mailbox file you can use:
    formail -ds <old_mailbox >>new_mailbox
posted by namewithoutwords at 12:29 PM on November 2, 2009 [1 favorite]
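
As a rough cross-check alongside formail, Python's standard `mailbox` module can read an mbox file and report how many messages it finds and how many lack a "From:" or "Date:" header. This is only a sketch: `summarize_mbox` is a hypothetical helper name, and it assumes the file already has a "From " separator line at the start of each message.

```python
import mailbox


def summarize_mbox(path):
    """Return (message_count, missing_from, missing_date) for an mbox file."""
    # create=False: fail instead of silently creating an empty mailbox
    box = mailbox.mbox(path, create=False)
    try:
        count = missing_from = missing_date = 0
        for msg in box:
            count += 1
            if msg["From"] is None:   # no "From:" header in this message
                missing_from += 1
            if msg["Date"] is None:   # no "Date:" header (e.g. only "Sent:")
                missing_date += 1
        return count, missing_from, missing_date
    finally:
        box.close()
```

Running it before and after a repair gives a quick indication of whether the separator lines were recognized (the message count) and whether the sender and date headers are now present.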

Response by poster: Yes, I put in the command exactly as given, except for replacing the filenames, and the newline is not put in.

I'll look at formail; these are actually "valid" mboxes as long as they have a "From " beginning each new mail, it's just that the fields are a bit out of whack.
posted by yesno at 12:41 PM on November 2, 2009

Response by poster: Ok, the sed command worked perfectly in Linux. Maybe by 2060 newline inconsistencies between platforms will be worked out.

formail didn't make heads or tails out of this, by the way. And I'm better off trying to learn sed. Thanks, everyone! I'll be back for date-mangling if I need to. I hope I don't need to.
posted by yesno at 12:59 PM on November 2, 2009

Response by poster: Oh, I did have to add a bit more verbiage to the "From " line, but that was easy.
posted by yesno at 2:12 PM on November 2, 2009

You can convert the "Sent:" lines to RFC 822 date format with a regex like this:

cat NEW_FILE | perl -pe 's/^Sent:.*([A-Z]..)[a-z]* (\d+), (\d+) (\d+):(\d+) ([AP]M)/sprintf "Date: %0.2d %s %04d %0.2d:%0.2d %+.2d%0.2d", $2, $1, $3, $4 % 12 + 12 * ($6 eq "PM"), $5, -5, 0/e' > NEWER_FILE

(where you replace that -5 with your time zone's offset from GMT in hours; the %+.2d makes sure a positive offset gets its + sign)
posted by nicwolff at 3:04 PM on November 2, 2009
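
For comparison, a minimal Python sketch of the same date conversion using the standard library's date parsing instead of a hand-written regex. `sent_to_date` and `utc_offset_hours` are hypothetical names; the sketch assumes English day and month names (the default C locale), the exact "Sent: Weekday, Month D, YYYY H:MM AM/PM" layout shown in the question, and one fixed UTC offset for every message (no daylight saving adjustment). Unlike the perl one-liner, it also emits the weekday and seconds.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

# Layout of the value after "Sent: ", e.g. "Thursday, March 28, 2002 12:33 PM"
SENT_FORMAT = "%A, %B %d, %Y %I:%M %p"


def sent_to_date(line, utc_offset_hours=-5):
    """Turn one 'Sent: ...' line (no trailing newline) into a 'Date: ...' line.

    Lines that are not 'Sent:' headers, or whose date does not match
    SENT_FORMAT, are returned unchanged.
    """
    if not line.startswith("Sent:"):
        return line
    value = line[len("Sent:"):].strip()
    try:
        parsed = datetime.strptime(value, SENT_FORMAT)
    except ValueError:
        return line  # unexpected layout: leave it for manual repair
    tz = timezone(timedelta(hours=utc_offset_hours))
    return "Date: " + format_datetime(parsed.replace(tzinfo=tz))


if __name__ == "__main__":
    print(sent_to_date("Sent: Thursday, March 28, 2002 12:33 PM"))
    # prints: Date: Thu, 28 Mar 2002 12:33:00 -0500
```

Leaving unparseable lines untouched is a deliberate choice here: a "Sent:" line in an unexpected layout stays visible in the file rather than being replaced with a wrong date.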
