extract email addresses from garbage text?
April 13, 2009 6:11 AM
I have a ton of garbage text with about 100 email addresses scattered within. I want a text file containing just those email addresses. Is there an MS Word search/replace query that will do that? What is it? (Or is there another way?)
Response by poster: Should also add that each email address is preceded and followed by at least 1 blank space, if that changes anything. Thanks for the help so far.
posted by stupidsexyFlanders at 6:42 AM on April 13, 2009
Response by poster: @bcwinters - how do you run this expression in NPP? I put it in both search and replace boxes in that dialog (checking the regex box) and nothing happens. Also, will this delete everything but the emails, or just highlight them?
posted by stupidsexyFlanders at 6:53 AM on April 13, 2009
Best answer: You definitely only want to put this in the search box, not the replace box.
Each time you hit Find it will bring you to the next match (the next email address), which you can copy and paste into a fresh document. But that's obviously not ideal...
Unfortunately it looks like Notepad++ doesn't have a "Find All" feature, which I assumed it did—in BBEdit on my Mac I would use that feature to just build a new text file with each match on its own line.
Maybe someone else can recommend a different free text editor that will do that for you, assuming you don't want to spend the time cutting, pasting, searching, cutting, pasting, searching. I'll dig around for one in the meantime.
posted by bcwinters at 7:11 AM on April 13, 2009
If you are going the copy/paste route, you could open the document in Firefox, do ctrl+f for @, and 'highlight all'. You won't get the whole address this way, but they will be easy to pick out.
posted by Who_Am_I at 7:23 AM on April 13, 2009
It would be easier to do the cutting and pasting in Notepad++, Who_Am_I, because the whole address will be highlighted instead of just the @.
stupidsexyFlanders, try this search string:
[A-Z0-9._%+-]+@[A-Z0-9.-]+
Turn on "Style found token" and click "Find All."
That will highlight every email address in the document.
posted by bcwinters at 7:27 AM on April 13, 2009
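For anyone who would rather script this than click through an editor, here is a rough Python sketch of the same "find all" idea using the search string above. The function name `find_addresses` and the sample text are made up for illustration, and `re.IGNORECASE` is an assumption standing in for a case-insensitive editor search, since the pattern itself lists only uppercase letters.

```python
import re

# The search string from the comment above. It lists only A-Z, so the
# IGNORECASE flag is assumed here to mimic a case-insensitive editor search.
PATTERN = re.compile(r"[A-Z0-9._%+-]+@[A-Z0-9.-]+", re.IGNORECASE)


def find_addresses(text):
    """Return every substring of text that matches PATTERN, in order."""
    return PATTERN.findall(text)


if __name__ == "__main__":
    # Hypothetical sample input; each match is printed on its own line.
    sample = "junk alice@example.com more junk bob.smith@mail.example.org end"
    for address in find_addresses(sample):
        print(address)
```

Because the part after the @ allows periods, an address that ends a sentence would probably come back with the trailing period attached.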
Yeah definitely bcwinters. I was going on your mention that NPP doesn't do 'find all', but since it does that's way better. Your second RegEx worked for me in NPP, BTW. stupidsexyFlanders, make sure you change the 'search mode' from normal to regular expression.
posted by Who_Am_I at 7:44 AM on April 13, 2009
OK, here we go. If you are still working on this, you can try using the program WinGrep instead of Notepad++.
When you start it, it will have a wizard where you can paste the search string and select the files you want to search in.
After the wizard does its thing, click your text file's name in the list of results. The bottom pane will show you the results of the search.
I clicked "Fancy" (to switch the view to "plain"), turned off Line Numbers and Whole Line, then did a Save As. In the Save As dialog box, I switched the type to "Results in Plain Text" and saved the resulting file. This gave me a text file with each email address on a new line. Whoo hoo!
posted by bcwinters at 7:55 AM on April 13, 2009 [1 favorite]
I could do this in a few seconds with Notepad++, and I know precisely how. You don't even really need regular expressions, which are (ugh) sort of a pain to me, anyway - Notepad++ also has an 'extended' mode, where everything is generally the same, but \r = 'return', \n = 'newline', \s = 'space' and \t = 'tab'.
Here are the steps I'd follow to do this:
(1) [Probably make a copy of the text file in case you bork it.]
(2) Control-H to bring up the 'find and replace' dialogue.
(3) Select the 'Extended' option (Alt-x is the hotkey). Type in '\n' in the 'Find What:' field. Then, make the 'Replace With:' field blank. Then 'Replace All' (Alt-a). Do the same thing again, this time with '\r' in the 'Find What:' field. This will remove all newlines and returns, so that your whole document is just one 'paragraph.'
(4) Now, with Find And Replace again, search for 'http://' and replace with '\nhttp://'. (Replace all.) This will mean that the web addresses will be at the beginning of each line. We've isolated them on one side.
(5) Again with the Find And Replace (Replace All): search for '.com ' and replace with '.com\n'. Do the same with '.org ' and '.net ', too, and whatever other top-level domains you might have in the document. Notice the space after '.com '; that's important, because it makes sure you don't put newlines in the middle of URLs that are more than just a root-level domain (e.g. 'http://www.metafilter.com/foobar'). You can also add newlines to the end of every .html, .php, and even every '/ ' if you have a lot of slash-terminated addresses. It's okay if you put too many newlines in, so long as there's a newline at the beginning and ending of every URL and none in between.
(6) Now that you have a whole bunch of lines, many of which are URLs and many of which are random text, select all (Ctrl-a) and select TextFX -> TextFX Tools -> Sort Lines Case Sensitive (at column). Once you do this, everything will be alphabetized, and everything starting with 'http://' will be in the same place.
(7) Delete everything before and everything after the chunk of URLs. Voilà!
posted by koeselitz at 10:09 AM on April 13, 2009
Argh, didn't notice that these are email addresses rather than URLs.
Goddamnit, this is stupid. Here, I'll be right back with a better, faster solution.
posted by koeselitz at 10:17 AM on April 13, 2009
Best answer: Okay, here. I wrote an AutoHotkey script real quick-like that does what you need. Here is an .exe file, and here is the .ahk source code in case anybody'd like to fool with it themselves:
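; A script to isolate email addresses from a text file.
; Saves these emails to email_addresses.txt in the same folder.
#NoTrayIcon
; Ask the user which text file to scan.
FileSelectFile, file,,, Select a Text File to Extract email Addresses from:
; Get the folder that contains the chosen file.
SplitPath, file,, directory
; Read the file line by line; FileAppend below writes to the output file named here.
Loop, Read, %file%, %directory%\email_addresses.txt
{
    ; Split the current line into 'words' at tabs and spaces.
    Loop, Parse, A_LoopReadLine, %A_Tab%%A_Space%
    {
        ; Keep any 'word' that contains an @.
        IfInString, A_LoopField, @
        {
            FileAppend, %A_LoopField%`n
        }
    }
}
ExitApp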
What this script does: when you run the .exe, it'll come up with a box asking you for a file to translate. When you select a text file, it'll take that file, examine every 'word', and output every 'word' that contains @ to a file in the same directory as the original file called email_addresses.txt. I haven't tested it extensively; I imagine it would help to remove all newlines and returns before using it, but it should work either way.
posted by koeselitz at 11:01 AM on April 13, 2009
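The same word-by-word idea can be sketched in Python for readers without AutoHotkey. This is only an approximation of the script above, not a port of it: the function name `words_with_at` is made up, and it takes an iterable of lines rather than showing a file picker.

```python
import re


def words_with_at(lines):
    """Yield every space- or tab-separated 'word' that contains an @."""
    for line in lines:
        # Drop the line ending, then split at runs of spaces and tabs.
        for word in re.split(r"[ \t]+", line.rstrip("\r\n")):
            if "@" in word:
                yield word


if __name__ == "__main__":
    # Hypothetical sample lines standing in for the contents of a text file.
    sample = ["hello alice@example.com,\tworld\n", "<bob@example.org> bye\n"]
    for word in words_with_at(sample):
        print(word)
```

Since a 'word' here is simply whatever sits between spaces or tabs, any punctuation touching an address stays attached to it in the output.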
Response by poster: bcwinters was right there from the start. koeselitz got off to a slow start but finished very strong. koeselitz takes it by a nose (how often does AskMe provide an executable written to solve your computer problem?)
Thanks everyone. The koeselitz script works great (you have to go through and delete leading and trailing brackets, carets, commas, semicolons and colons, but that's trivial).
posted by stupidsexyFlanders at 2:00 PM on April 13, 2009 [1 favorite]
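The manual cleanup mentioned in the comment above could likely be automated as well. A minimal Python sketch, assuming "brackets" covers angle, square, round and curly brackets; the names `STRIP_CHARS` and `clean` are invented for this example.

```python
# Characters to trim from both ends of each extracted 'word'. Which bracket
# types actually appear in the data is an assumption; adjust as needed.
STRIP_CHARS = "<>[](){}^,;:"


def clean(word):
    """Remove leading and trailing punctuation listed in STRIP_CHARS."""
    return word.strip(STRIP_CHARS)


if __name__ == "__main__":
    # Hypothetical examples of addresses with punctuation stuck to them.
    for raw in ["<alice@example.com>,", "^bob@example.org;", "carol@example.net"]:
        print(clean(raw))
```

`str.strip` only touches the ends of the string, so punctuation inside an address, such as the periods in the domain, is left alone.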
stupidsexyFlanders: The koeselitz script works great (you have to go through and delete leading and trailing brackets, carets, commas, semicolons and colons, but that's trivial).
Heh heh, thanks. I don't know if "works great" is very meaningful when you have to go to that much trouble, but I'm glad it was at least functional for you. It was fun - and it took, like, half an hour of coding time, so there's that. If I were really good, and if I hadn't been half-asleep, I guess it would've been better.
Anyway, glad I could help.
posted by koeselitz at 3:15 PM on April 13, 2009
Response by poster: yeah, it actually was more than functional, saved me a lot of time. Thanks again.
posted by stupidsexyFlanders at 5:52 PM on April 13, 2009
This thread is closed to new comments.
This site offers the following search as a good starting point:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
If you've never used regular expressions before, the basic gist is that this will search for:
[any bunch of letters and numbers, possibly with a coupla punctuation marks] followed by an "@" followed by [another bunch of letters and numbers, possibly with a period or dash in there] followed by a "." ending with [at least 2 but not more than 4 letters]
If you don't have a text editor that can do regular expressions, you could try something free like Notepad++.
posted by bcwinters at 6:40 AM on April 13, 2009 [3 favorites]
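To make the pieces of that search concrete, here is the same pattern written out in Python with one comment per piece. This is a sketch rather than a definitive email matcher: the names `EMAIL` and `extract_emails` are invented here, and `re.IGNORECASE` is assumed because the pattern lists only uppercase letters.

```python
import re

# The search string from the comment above, split up with re.VERBOSE so that
# each piece can carry a comment. IGNORECASE is an assumption of this sketch.
EMAIL = re.compile(
    r"""
    \b                 # word boundary before the address
    [A-Z0-9._%+-]+     # letters, digits and a few punctuation marks
    @                  # followed by an at sign
    [A-Z0-9.-]+        # letters, digits, periods or dashes
    \.                 # followed by a literal period
    [A-Z]{2,4}         # ending with at least 2 but not more than 4 letters
    \b                 # word boundary after the address
    """,
    re.IGNORECASE | re.VERBOSE,
)


def extract_emails(text):
    """Return all matches of EMAIL in text, in the order they appear."""
    return EMAIL.findall(text)


if __name__ == "__main__":
    # Hypothetical garbage text; prints one address per line.
    sample = "foo <alice@example.com>, bar; carol@x.org. end"
    print("\n".join(extract_emails(sample)))
```

The word boundaries keep surrounding brackets and a sentence-ending period out of the match. The 2-to-4-letter ending also means that an address whose final part is longer than four letters is not matched at all.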