Grep me into BBEdit address book heaven
February 26, 2007 6:16 AM
Could a good grep samaritan help me out of "find and replace" purgatory into BBEdit heaven?
I have a lovely script for Entourage on Mac OS X which looks through my inbox, winnows out the unique senders and creates a file of all the addresses so I get an address book of everyone who has ever contacted me. I have used this on a regular basis to slim down my Entourage database, which works well.
The only problem with the list is that the email addresses are joined to the sender names with a line of dots, so having created the file year after year I can't now easily import the addresses into a new e-mail application.
So far I've always used BBEdit for extremely basic find and replace operations, but I get the impression that it would be able to clean this address book file, if only I knew the right grep. And considering I'm a total heathen in this regard, I'm really struggling. Every combination I have tried so far has rendered some portion of the address book unusable.
Here is an example of what I have:
Sender Name.............................E-mail Address
Kida....................................kida@address.com
........................................xxxx@aol.com
bluematter..............................support@bluematter.com
........................................support@btclick.com
I believe there are forty characters / dots before each address, so one solution could be simply to find the first forty characters of each line and delete them.
The other could be to delete the first two dots that appear before any letter of the alphabet, to leave a space. That would make the e-mail addresses clean again.
Can anyone advise me? Thanks
Search For:
(\.){2,}
Replace With:
,
That will change every run of two or more periods into a single comma. Alternatively, you can put nothing in the Replace With box and the runs of two or more periods will simply be deleted, though I'm not certain that's what you want. It would help if you could be explicit about what you want the runs of periods transformed into.
posted by jperkins at 6:34 AM on February 26, 2007
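For anyone who wants to try the pattern above outside BBEdit, here is a minimal sketch in Python's re module. It assumes Python treats this pattern the same way BBEdit's grep does, and it rebuilds the sample rows from the question by padding each name to 40 characters with dots. The names rows, lines, dots_to_comma and converted are made up for the illustration.

```python
import re

# Sample rows rebuilt from the question; ljust pads each name to 40 characters with dots.
rows = [("Kida", "kida@address.com"), ("", "xxxx@aol.com"), ("bluematter", "support@bluematter.com")]
lines = [name.ljust(40, ".") + email for name, email in rows]

def dots_to_comma(line):
    # Replace every run of two or more literal periods with a single comma.
    return re.sub(r"(\.){2,}", ",", line)

converted = [dots_to_comma(line) for line in lines]
for line in converted:
    print(line)
```

The single dots inside the e-mail addresses are left alone because the pattern needs at least two periods in a row.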
Search for:
([^.]*)\.\.+(.*?)\r
Replace with:
\1,\2
This grabs everything before the dots if there is anything, replaces the run of dots with a comma, and grabs everything between the dots and the carriage return. It will work if there are any number of single dots in the first field, but not if there are any paired dots. (eg, "John Q.V.. Public" won't cause a problem, but "John Q.. Public" will)
posted by ardgedee at 7:43 AM on February 26, 2007
Er. Shows me for not previewing. "John Q.V. Public" will work. "John Q.V.. Public" won't.
posted by ardgedee at 7:44 AM on February 26, 2007
...and the full Replace expression should be \1,\2\r so that the line ending is put back. If there is a problem with names and doubled dots (such as "John Q.V.. Public"), add more dots to the search expression, eg ([^.]*)\.\.\.\.\.\.+(.*?)\r
Obviously I should refrain from posting before my full dose of coffee.
posted by ardgedee at 9:03 AM on February 26, 2007
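A rough Python sketch of the capture-group version above, for comparison. It assumes carriage-return line endings (which is what \r matches) and that Python's re module is close enough to BBEdit's grep for these patterns. The "John Q.V.. Public" row and its example.com address are invented to exercise the doubled-dot case.

```python
import re

# Two sample rows from the question, each ending in a carriage return.
text = "Kida" + "." * 36 + "kida@address.com\r" + "." * 40 + "xxxx@aol.com\r"

# Name (anything that is not a dot), a run of two or more dots, then the rest up to \r.
basic = re.compile(r"([^.]*)\.\.+(.*?)\r")
result = basic.sub(r"\1,\2\r", text)

# Variant that requires at least six dots, so a ".." inside a name is not mistaken for padding.
tricky = "John Q.V.. Public".ljust(40, ".") + "jqv@example.com\r"
six_dots = re.compile(r"([^.]*)\.\.\.\.\.\.+(.*?)\r")
tricky_result = six_dots.sub(r"\1,\2\r", tricky)
```

Note that the \r has to appear in the replacement as well, otherwise the matched line ending is swallowed and the rows run together.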
"add more dots to the search expression, eg ([^.]*)\.\.\.\.\.\.+(.*?)\r"
Or use (\.){2,} which matches any run of at least two periods in a row, and then replace the entire matched run with a comma.
posted by jperkins at 9:58 AM on February 26, 2007
There are great answers here, but if you're still interested in learning about the GREP functions in BBEdit, have a look at the PDF version of the user manual that's hiding inside the application bundle. I found it to be REALLY helpful in explaining the syntax in a real-world-usage kinda way. Before I read that manual I thought GREP was just for supernerds, and now I use it all the freakin' time.
(I don't mean this to be an "RTFM" kind of response at all--just wanted you to know that this is one place where the documentation is surprisingly decent and useful and could possibly help you in the future. Especially since most information you find online about GREP and regex is catered towards programmers looking for lists of commands to refresh their memories, etc.)
posted by bcwinters at 10:10 AM on February 26, 2007
If there's a real concern over double periods in the names themselves, match on at least three periods in a row:
(\.){3,}
Regarding the regex supplied by ardgedee:
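[^.] looks like the negation of the . metacharacter, which would turn "match anything" into "match nothing", but inside a character class the dot is literal, so it already means "any character except a period". If you find the escaped form easier to read, it is equivalent:
[^\.]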
But the solution that I've provided is much simpler and customizable in the event that you do have two periods in a row in the names. In fact, you could even match on (\.){10,} which says grab all of the runs of periods with at least ten periods in a row and then do as I suggested earlier in the replace with a single comma.
Also, in general using \r as the line ending is a bad idea. Use the $ metacharacter. So, ([^.]*)\.\.+(.*?)\r should instead be:
([^.]*)\.\.+(.*?)$
There's also no need to make the (.*?)\r non-greedy as . doesn't match on the end of line characters.
posted by jperkins at 10:15 AM on February 26, 2007
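The points in the post above can be checked with a small Python sketch. This is Python's re module, not BBEdit, so treat it as an approximation: in particular, Python's $ with re.MULTILINE matches before "\n", and how BBEdit handles \r line endings is not tested here. The "John Q.V.. Public" row and its example.com address are invented for the illustration.

```python
import re

# Inside a character class the dot is literal, so both forms mean "any character except a period".
sample = "Q.V. Public"
plain = re.findall(r"[^.]+", sample)
escaped = re.findall(r"[^\.]+", sample)

# Hypothetical row with a doubled dot inside the name, padded to 40 characters.
line = "John Q.V.. Public".ljust(40, ".") + "jqv@example.com"
loose = re.sub(r"(\.){2,}", ",", line)    # also rewrites the ".." inside the name
strict = re.sub(r"(\.){10,}", ",", line)  # only touches the long padding run

# With re.MULTILINE, $ matches at the end of each line (before "\n" in Python).
text = "Kida".ljust(40, ".") + "kida@address.com\n" + "." * 40 + "xxxx@aol.com"
anchored = re.sub(r"([^.]*)\.\.+(.*)$", r"\1,\2", text, flags=re.MULTILINE)
```

Raising the minimum run length is the simplest guard against names that contain their own dots.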
"I believe there are forty characters / dots before each address, so one solution could be simply to find the first forty characters of each line and delete them."
Ah! Now I understand what you meant by that. How about:
Search For:
^(.){1,40}
And leave Replace With empty.
posted by jperkins at 10:36 AM on February 26, 2007
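A quick Python sketch of the "delete the first forty characters" idea above, assuming the file really is padded to exactly 40 characters per name field and that lines end in "\n". The "stray note" line is invented to show one caveat.

```python
import re

# Header plus two rows from the question, each name padded to 40 characters with dots.
rows = [("Sender Name", "E-mail Address"), ("Kida", "kida@address.com"), ("", "xxxx@aol.com")]
text = "\n".join(name.ljust(40, ".") + email for name, email in rows)

# Delete up to the first 40 characters of every line (re.MULTILINE makes ^ match at each line start).
stripped = re.sub(r"^(.){1,40}", "", text, flags=re.MULTILINE)

# Caveat: {1,40} also empties lines shorter than 40 characters; {40} leaves them alone.
short_loose = re.sub(r"^(.){1,40}", "", "stray note")
short_exact = re.sub(r"^.{40}", "", "stray note")
```

If the file might contain short lines you want to keep, the exact-count form ^.{40} looks like the safer choice.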
Having pasted your supplied sample into an editor with a monospaced font, it appears that what you actually have is two fixed-width fields per line; the first field is a 40-character name field, right-padded with dots, and the second field is the email address.
The One Right and True Way to deal with this is to do two separate search and replace passes over the file. The first pass will insert a unique delimiter between the two fields, and the second pass will clean up the padding.
On the first pass: replace ^(.{40}) with \1\t which means: replace any sequence of 40 characters at the start of a line with the same sequence followed by a tab. Note that the . in the search expression here does not have \ in front of it; we want the dot to have its regular-expression meaning of "any character" rather than matching a literal dot.
On the second pass: replace \.*\t with \t (which means: replace any number of dots followed by a tab with a tab).
You now have a standard tab-separated file that any email client should be able to import.
If you want to discard the names altogether and generate a pure list of email addresses (not sure why you'd want this), then you only need one pass: replace ^.{40} with nothing. You'd also need to delete the first line, since that contains a header rather than data.
posted by flabdablet at 2:39 PM on February 26, 2007
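The two passes described above can be sketched in Python as follows. This assumes "\n" line endings and that Python's re module behaves like BBEdit's grep for these two patterns. The variable names are made up for the example.

```python
import re

# Header plus two rows from the question, each name padded to 40 characters with dots.
rows = [("Sender Name", "E-mail Address"), ("Kida", "kida@address.com"), ("", "xxxx@aol.com")]
text = "\n".join(name.ljust(40, ".") + email for name, email in rows)

# Pass 1: insert a tab after the first 40 characters of every line.
pass1 = re.sub(r"^(.{40})", r"\1\t", text, flags=re.MULTILINE)

# Pass 2: strip the dot padding that now sits directly before each tab.
pass2 = re.sub(r"\.*\t", "\t", pass1)

print(pass2)
```

Doing it in two passes means the tab is placed by position rather than by guessing where the dots start, so dots inside a name are not mistaken for the field boundary. One edge case to be aware of: a name that itself ends in a period would lose that trailing period in pass 2.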
Wow! Thanks so much. I barely know who to mark as best answer because there is so much amazing advice here. I will also read the flipping manual :)
posted by unclemonty at 2:58 AM on February 27, 2007
I'd also be willing to bet that your original lovely script is jumping through an extra hoop or two to generate all those dots in the first place, and that making a modified version that emits a tab between the name and address fields, instead of the variable number of dots, would be an absolutely trivial exercise. What's it written in?
posted by flabdablet at 4:11 AM on February 28, 2007
This thread is closed to new comments.
\.\.+\w
should do the match (a . then at least one more . then a word character)
If you're putting this into a replace string you need to save the word pattern to go in the replace bit - put it in brackets.
so to get, say, csv, search for
\.\.+(\w)
and replace with
,\1
I'm not a BBEdit person, but I think this should work.
posted by handee at 6:26 AM on February 26, 2007
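A small Python sketch of the search and replace in the post above, assuming Python's re module is a fair stand-in for BBEdit's grep here. The function name to_csv is invented for the example.

```python
import re

def to_csv(line):
    # A dot, at least one more dot, then a word character that is captured and put back.
    return re.sub(r"\.\.+(\w)", r",\1", line)

kida = to_csv("Kida".ljust(40, ".") + "kida@address.com")
no_name = to_csv("." * 40 + "xxxx@aol.com")
```

Because the pattern insists on a word character right after the dots, it only fires where the padding meets the address.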