Easiest way to extract e-mail from the body of thousands of e-mail addresses?
January 25, 2012 12:05 AM   Subscribe

Long story short, I lost my contacts from an old e-mail but I have their e-mail/details in the body of invoices that I forwarded to my current address. I've been copying thousands of e-mail out out by hand, but am realizing I'll have carpal tunnel before it's done. Is there a way to do this short of hiring someone to write a script? Tried forwarding e-mail to Thunderbird, but the extension that looked like it might do this doesn't search the body of e-mail https://addons.mozilla.org/en-us/thunderbird/addon/email-address-crawler/ Also, looking into autohotkey, but am under time constraints. Grateful for any pointers here!
posted by capedape to Technology (6 answers total) 1 user marked this as a favorite
 
It's only one of the step, but if you can export all your invoices to one large text file dump somehow, you can certainly grep out everything thatlooks@likethis with a one-line 'script', producing a list in whatever format you like.

It'd be something like, um...
grep -o -E "\b[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+\.[a-zA-Z0-9.-]+\b" source.txt > output.txt 
(Though I am sure there are 30,000 far better coders (i.e. non-amateur hacks like me) here on MeFi who can do that one-liner better and more elegantly than that.)

And then once you have a nice tidy list, you can import it into whatever address book you're using.
posted by rokusan at 12:25 AM on January 25, 2012


I hope this is not too technical for you, but this is how I would deal with this issue.

This is an Excel macro solution: In general, it works like this - paste the text into Cell A1 and run this macro. It will scan cell A1 and place any email addresses it finds into the Row 1 starting at cell B1 and on over to however many emails there are. Make sure the worksheet tab is named Sheet1 and the cell you are extracting from is A1.

------
1) First you want to get as many emails into one cell just as big blob of text. Not knowing your email app, I don't know how it exports, but the idea is to create a file (csv, txt etc) of the emails.

2) Following that, you would combine those emails into one big group It will help if tabs and carriage returns are removed from that text.

3) With all that combined, you then want to put that text into cell A1. You can have up to 32,767 characters per cell, so that should leave room for multiple emails.

4) Run the macro using Alt-F8 and choose run. It will do its magic but you may get an error at the end when it cannot find anymore data to process, but just quit out of the errors. The work will have been done.

5) Following extraction, you can copy/paste/transpose the addresses into a list.


Here is the macro..............

Sub GetEmailaddress()

Dim lastrow As Long, i As Long
Dim ncol As Integer, spos As Integer
Dim n As Integer, n1 As Integer, n2 As Integer
Dim searchtxt As String
Dim email As String

With Worksheets("Sheet1")

lastrow = .Cells(Rows.Count, "A").End(xlUp).Row
For i = 1 To lastrow
searchtxt = .Range("A" & i)
ncol = 2
spos = 1
Do
n = InStr(spos, searchtxt, "@", vbTextCompare)
If n <> 0 Then
n1 = InStrRev(searchtxt, " ", n, vbTextCompare)
n2 = InStr(n, searchtxt, " ", vbTextCompare)
If n2 = 0 Then n2 = Len(searchtxt) + 1
email = Trim(Mid(searchtxt, n1, n2 - n1))
Cells(i, ncol) = email
ncol = ncol + 1
spos = n2
End If
Loop Until n = 0
Next i

End With
End Sub

posted by lampshade at 1:20 AM on January 25, 2012


This sounds like a great job for Mechanical Turk. This guide was written for journalists, but it's a good explanation of how you might use it.
posted by quadrilaterals at 5:33 AM on January 25, 2012 [1 favorite]


If you can get all the emails into one text file, and you have access to a Mac, you can download the freeware TextWrangler editor from Bare Bones Software and do the following:

1. Replace all "<" and ">" characters with spaces. This ensures that each email address enclosed by angle brackets is a distinct word set off from adjacent text.
2. Replace all spaces with line breaks. Now each email address (and every other word in the file) is on a new line.
3. Copy all lines containing "@" into a new file. (This is a TextWrangler command: "Process lines containing...") That file now has all your email addresses, as well as any other words with @ in them (such as unit prices on an invoice, e.g. "2@$1.95").
4. Save that file (you can do this later, but I prefer to save before doing any more editing).
5. Delete duplicate lines (using the "Process duplicate lines..." command).
6. Delete anything that is not really an email address.

If you want to keep their real names, it gets a little more complicated and the precise steps will depend on how your big text file of email is formatted.

Other text editors can undoubtedly do this too, but TextWrangler (and its more powerful sibling, BBEdit) are the ones I know best.
posted by brianogilvie at 7:23 AM on January 25, 2012


Response by poster: Appreciate the code and all the other angles suggested here. Will work on this tonight.

Incidentally, this has also pushed me to revisit my long forgotten basic coding skills, so thanks for that too.
posted by capedape at 3:24 PM on January 25, 2012


Response by poster: what I ended up doing was installing thunderbird and this extension for it:
http://nic-nac-project.de/~kaosmos/mboximport-en.html

It lets you:

- export of folder in a single file (mbox format), with also the subfolders if you want;
- export of all messages in single files (eml format);
- export of all messages in single files (html or plain text format), with an index;
- export of all messages in one single plain text file;
- export of index of the messages in a folder (HTML or CSV format); - mbox files import;
- eml files import;
- import of all the eml files existing in a directory;

I'll use some code/techniques from this thread to extract the relevant info I need. Thanks again
posted by capedape at 6:30 PM on January 25, 2012


« Older Apparently I wasn't at my own wedding   |   Want good golf games Newer »
This thread is closed to new comments.