How to get email addresses out of a sent mail folder?
June 6, 2006 4:05 AM

How do I extract the email addresses from a folder of sent mail sitting on an IMAP server, so that they can be put into a mailing list?

A colleague has sent individual emails to about two thousand people about an event that they have applied to attend. Those sent emails are held in five folders on an IMAP server.

I've just found out that my colleague has done it the (very) hard way. I'd like to help them out before they have to send the next email, by getting all of the individual email addresses out of those folders and into a Thunderbird mailing group (or one for each folder), so the update can be sent to everyone at the same time.

If I can get the addresses into a csv file, I can get them into Thunderbird. Any suggestions for a quick way of getting those addresses from that IMAP folder and into a csv file? I'm not afraid to go into a shell and do some commandline magic (OS X, or Cygwin on Windows), but I have no idea which incantation to use, and would really appreciate the help.

Thank you.
posted by reynir to Computers & Internet (11 answers total)
 
Grep would be the quickest, dirtiest way. Something like:

    grep "^To: " * > emails.txt

Run that in each folder. It should pull the To: lines out of each email file and store them in emails.txt. Then you can search and replace on emails.txt to get rid of all the "To: " prefixes and change the line breaks to commas.

You could do it more elegantly with awk or perl, but if it's only a few thousand emails, and you only need to run it once, grep's yer buddy.
posted by ParsonWreck at 6:22 AM on June 6, 2006
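
For anyone who would rather stay in one language for the whole job, here is a rough Python sketch of what the grep command above does. It assumes the same thing grep does, namely that each message is a separate file in the folder. The function name `collect_to_lines` is made up for illustration, and like grep it does not tell headers from body text.

```python
from pathlib import Path


def collect_to_lines(folder):
    """Return every line starting with 'To: ' from the files in `folder`.

    Assumes one message per file, as the grep command does.
    """
    lines = []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Mail files are not guaranteed to be valid UTF-8, so replace bad bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            if line.startswith("To: "):
                lines.append(line)
    return lines
```

Calling `collect_to_lines(".")` in a mail folder should give roughly the same lines that end up in emails.txt.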


(Here's how you know it's early in the day: I just used "elegant" and "perl" in the same sentence.)
posted by ParsonWreck at 6:33 AM on June 6, 2006


This will work in Python. You should redirect the output to a file and then use sort | uniq to uniquify it. I didn't do that in the program because spitting out the output as it arrives will make it easier to debug.

# Written for Python 3.
import email
import email.utils
import getpass
import imaplib

HOST = "imap.example.com"
USER = "alice"
FOLDER = "2006/sent/sent"

# Connect over SSL and log in; the password is prompted for, not stored.
connection = imaplib.IMAP4_SSL(HOST)
res, data = connection.login(USER, getpass.getpass())
assert res == "OK"

res, count = connection.select(FOLDER)
assert res == "OK"

# Ask the server for the numbers of all messages in the folder.
res, (msg_nums,) = connection.search(None, "ALL")
assert res == "OK"

for msg_num in msg_nums.split():
    res, message_data = connection.fetch(msg_num, "(RFC822)")
    assert res == "OK"

    # imaplib returns the raw message as bytes.
    message = email.message_from_bytes(message_data[0][1])
    tos = message.get_all("To") or []
    ccs = message.get_all("Cc") or []
    all_recipients = email.utils.getaddresses(tos + ccs)
    # Print one lower-cased address per line as it arrives.
    for realname, addr in all_recipients:
        if addr:
            print(addr.lower())

connection.logout()


Don't say I never did anything for you.
posted by grouse at 6:34 AM on June 6, 2006


Every line after "for msg_num" should be indented by four spaces. Stupid space-stripper.
posted by grouse at 6:35 AM on June 6, 2006


I am not sure whether I did it the hard way, but I converted the files to text/ASCII, then used grep to pull out the raw email addresses and names and piped them into a file. I then used sed and awk to clean that up before piping it into another file as a spreadsheet.

With sed and awk you can do a lot of editing or clean-up of the raw text any way you please.

I am sure that the rest of the hive mind here will provide more elegant solutions, and I look forward to reading them. Good luck.
posted by jadepearl at 6:38 AM on June 6, 2006


Response by poster: Thanks all, much appreciated. If I grep out all the To: lines, that gets me nearly there. What comes out looks like this (I have put the space after the angle brackets otherwise preview eats the whole bracketed bit):

To: Joe Bloggs < joe.bloggs@mefi.com>
To: Jane Doe < jane.doe@mefi.com>
To: < noname@mefi.com>

Are there any similar commands that can pull out the contents of the angled brackets into a file? Or delete everything which isn't within them? Sed and awk look very powerful, and I'll have to learn them for the future but I don't have much time to get this done, so any hints welcome.
posted by reynir at 7:32 AM on June 6, 2006
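
One possible way to get at the contents of the angle brackets without learning sed first is Python's standard `email.utils.getaddresses`. This is only a sketch: the function name `addresses_from_to_lines` is invented here, and it assumes the input is a list of lines like the grep output above.

```python
from email.utils import getaddresses


def addresses_from_to_lines(lines):
    """Extract bare email addresses from lines such as 'To: Joe <joe@example.com>'."""
    addresses = []
    for line in lines:
        # Drop the leading header name, if present.
        value = line[3:] if line.lower().startswith("to:") else line
        # getaddresses handles 'Name <addr>', '<addr>', 'addr (Name)' and
        # comma-separated lists of recipients.
        for _name, addr in getaddresses([value]):
            if "@" in addr:
                addresses.append(addr.lower())
    return addresses
```

Because the parsing is done by the standard library, lines with several recipients or with the name in parentheses should come out right as well.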


The script I wrote pulls out the contents of the angle brackets, deals with multi-recipient e-mails, and handles all sorts of other corner cases that you might not have thought of yet. I don't know why you want a quick and dirty solution when a complete one has been presented. *shrug*
posted by grouse at 7:41 AM on June 6, 2006


Response by poster: Grouse, sorry, I used grep as that was the first answer I read, and now I'm almost there it seemed to make sense to carry on with that approach - also, I'm not that sure what I was doing with the Python script ('use sort | uniq to uniquify it' loses me a bit). But thanks for taking the time to suggest it.
posted by reynir at 9:38 AM on June 6, 2006


It'd probably be more convenient to just run grouse's script, since you may have some cases where the emails don't match the exact pattern (e.g., some might just be "To: foo@bar.com" and some might be "To: foo@bar.com (Foo Bar)").

But anyway, assuming you used grep to pull the To: lines into emails.txt, try this to yank out just the email addresses (this should be all on one line, obviously, and copy-and-paste to make sure you get the spaces right).

cat emails.txt | sed 's,.* < *\([^ ]*@[^>]*\).*,\1,' > stripped-emails.txt

grouse's other point, about running sort and uniq, is probably intended to address the issue that there may be duplicate addresses included in the list. To remove duplicates from a file, you can do

sort file.txt | uniq > no-dups.txt

or (on most systems)

sort -u file.txt > no-dups.txt
posted by inkyz at 10:13 AM on June 6, 2006
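
The same de-duplication could be sketched in Python as below. Note one deliberate difference, which is an assumption about what is wanted here: `sort -u` treats `Joe@example.com` and `joe@example.com` as different lines, while this sketch lower-cases everything first. The name `unique_sorted` is hypothetical.

```python
def unique_sorted(addresses):
    """Return the addresses lower-cased, de-duplicated and sorted."""
    # A set keeps one copy of each address; blank entries are skipped.
    cleaned = {addr.strip().lower() for addr in addresses if addr.strip()}
    return sorted(cleaned)
```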


And having posted that I promptly forget to html-encode and get the spacing wrong myself, sigh. Try this instead:

cat emails.txt | sed 's,.* <*\([^ ]*@[^ >]*\).*,\1,' > stripped-emails.txt
posted by inkyz at 10:15 AM on June 6, 2006
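
Since the goal is a csv file for Thunderbird, the cleaned-up addresses still need a header row. A minimal sketch follows. The column name "Primary Email" is an assumption about what the import dialog expects, so check it against a csv exported from your own address book; `write_address_csv` is a made-up name.

```python
import csv


def write_address_csv(addresses, csv_path):
    """Write one address per row under a single header column."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Assumed header name; map it to the email field when importing.
        writer.writerow(["Primary Email"])
        for addr in addresses:
            writer.writerow([addr])
```

Feeding it the lines of stripped-emails.txt (with the trailing newlines removed) should give a file that can be tried in the address book import.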


There is no need to ever do "cat input | command > output". That is just doing unnecessary work. "command < input > output" achieves the same thing.

You can do the RE extraction in a single step without first doing grep and then doing sed.

perl -ne 'print "$1\n" if m/^To: .*?([^\s<>]+\@[^\s<>]+)/' < mboxfile
posted by Rhomboid at 1:39 PM on June 6, 2006
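
If the mail really is in a single mbox file, as the one-liner above assumes, Python's standard `mailbox` module can do the same single-step extraction while parsing the headers properly. This is a sketch only: `recipients_from_mbox` is an invented name, and like the one-liner it looks at To: headers and nothing else.

```python
import mailbox
from email.utils import getaddresses


def recipients_from_mbox(mbox_path):
    """Return the lower-cased To: addresses from every message in an mbox file."""
    found = []
    # create=False raises an error instead of creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            headers = message.get_all("To") or []
            for _name, addr in getaddresses([str(h) for h in headers]):
                if "@" in addr:
                    found.append(addr.lower())
    finally:
        box.close()
    return found
```

Unlike the regular expression, this picks up every recipient on a To: line, not just the first.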

