REGEXpert help needed!
December 14, 2006 10:45 AM
Regex experts: I need a regular expression to help trim some lines in a text file. I haven't done regex in some time and I'm not having any luck with this. Hope it'll be easy for a wiz.
I have a text file with almost 3,000 e-mail addresses. The format per line is supposed to be:
[address], [name]
But many are:
[address], [address]
The system I'm importing to will not allow two addresses on a line, so I have to trim those instances to simply be:
[address]
Here's a rough and dirty stab at that:
s/(\S+@\S+),\s*\S+@\S+/$1/g
\S means "any non-whitespace character".
posted by Khalad at 10:51 AM on December 14, 2006
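For readers without Perl or sed, the same substitution can be sketched in Python. This is a minimal illustration, not part of the original answer: the function name `keep_first_address` and the sample lines are made up, and the pattern is copied unchanged from the s/// expression above.

```python
import re

# Same pattern as the s/// expression; group 1 captures the first address.
TWO_ADDRESSES = re.compile(r"(\S+@\S+),\s*\S+@\S+")

def keep_first_address(line):
    """Collapse '[address], [address]' to '[address]'; leave other lines alone."""
    return TWO_ADDRESSES.sub(r"\1", line)

# Hypothetical sample lines, not taken from the real file.
samples = [
    "jane@example.com, Jane Doe",
    "bob@example.com, bob@example.com",
]
cleaned = [keep_first_address(s) for s in samples]
```

Lines in the [address], [name] form contain only one @, so the pattern never matches them and they pass through untouched.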
Which address would you keep, under those circumstances? Always the first one?
And are you using sed?
posted by cerebus19 at 10:52 AM on December 14, 2006
And as a naughty, naughty tag-on: does anyone have a quick and easy way to strip out everything else in a file except for URLs? This would make exporting Tab Mix Plus's saved sessions a doddle.
posted by dance at 11:57 AM on December 14, 2006
dance, here's the regular expression I use for finding URLs:
\b[a-z]+:\d*//(?:[&.?!:]?[\w#~+=;%@\-/]+)*
That's a start.
posted by Khalad at 12:09 PM on December 14, 2006 [1 favorite]
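One way to use that pattern for "keep only the URLs" is to collect every match and discard the rest. The sketch below assumes Python; `extract_urls` and the sample text are hypothetical, and the pattern is the one quoted above, unchanged.

```python
import re

# The URL pattern quoted above. It contains no capturing groups, so
# findall() returns each full match as a string.
URL_PATTERN = re.compile(r"\b[a-z]+:\d*//(?:[&.?!:]?[\w#~+=;%@\-/]+)*")

def extract_urls(text):
    """Return every substring of text that the pattern matches, in order."""
    return URL_PATTERN.findall(text)

# Hypothetical sample text.
text = "Visit http://example.com/page?id=1 and https://example.org/a/b.html, thanks."
urls = extract_urls(text)
```

Writing `"\n".join(urls)` to a new file would give one URL per line. Punctuation such as a trailing comma or full stop is left out of the match, because the pattern only accepts those characters when more URL characters follow.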
dance: from Regexp::Common::URI::http.pm on CPAN, with over 18k test cases:
(?:(?:http)://(?:(?:(?:(?:(?:(?:[a-zA-Z0-9][-a-zA-Z0-9]*)?[a-zA-Z0-9])[.])*(?:[a-zA-Z][-a-zA-Z0-9]*[a-zA-Z0-9]|[a-zA-Z])[.]?)|(?:[0-9]+[.][0-9]+[.][0-9]+[.][0-9]+)))(?::(?:(?:[0-9]*)))?(?:/(?:(?:(?:(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*)(?:/(?:(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)(?:;(?:(?:[a-zA-Z0-9\-_.!~*'():@&=+$,]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*))*))*))(?:[?](?:(?:(?:[;/?:@&=+$,a-zA-Z0-9\-_.!~*'()]+|(?:%[a-fA-F0-9][a-fA-F0-9]))*)))?))?)
posted by moift at 3:08 PM on December 14, 2006
If """^([^,]*?),.*@.*$""" matches (that is, the text after the first comma contains an @), replace the line with the captured result.
posted by cmiller at 3:48 PM on December 14, 2006
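The match-then-replace idea could look like this in Python. This is one reading of it, under the assumption that a line should be trimmed only when the text after the first comma contains an @. The name `trim_line` is hypothetical, and lines are assumed to have their trailing newline already stripped.

```python
import re

# Group 1 is everything before the first comma; the remainder of the
# line must contain an "@" for the pattern to match.
SECOND_IS_ADDRESS = re.compile(r"^([^,]*?),.*@.*$")

def trim_line(line):
    """Return only the first field when the second field looks like an address."""
    match = SECOND_IS_ADDRESS.match(line)
    return match.group(1) if match else line

# Hypothetical sample lines.
trimmed = [trim_line(s) for s in (
    "bob@example.com, bob@example.com",
    "jane@example.com, Jane Doe",
)]
```

Unlike a search-and-replace, this replaces the whole line, so anything after the second address is dropped as well.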
Response by poster: Ugh - thanks much gang, but I'm working within NoteTab Pro which can do regex to some degree but it's apparently not totally compatible with the Unix implementation. I had to take it all into Excel temporarily and do some funky sorting and search/replace to clean it up. But I sure appreciate the efforts... AskMeFi continues to rock!
posted by Tubes at 9:49 PM on December 14, 2006
Wow, thanks - I had no idea there was a library of regular expressions. Now to figure out how to get TextWrangler/BBEdit to dump the results of a search into a new doc...
posted by dance at 12:20 PM on December 15, 2006
and Tubes, thanks for hosting my tag-on so graciously!
posted by dance at 12:21 PM on December 15, 2006
This thread is closed to new comments.
perl -pe 's/^([^@]+@[^@]+), .*@.*$/$1/'
posted by grouse at 10:50 AM on December 14, 2006