Help me find permutations of abc with \. and _ characters
May 1, 2013 10:25 AM

[Regex-and-stack-overflowfilter]: Can you help me identify abc a.b.c. a.b.c a_b_c ab_c a_b.c. and any permutation of those characters? More below the fold.

It has been a long time, a very very long time, since I've written regular expressions. I'm searching through a long list of URLs and trying to group a bunch together to clean up my dataset. Unfortunately I've got about 65K historical URLs I'm looking at. Since I'm looking at about 4 years' worth of data, there are a whole host of issues with how the naming convention for pages has changed. I'm trying to establish a grouping based on a regex. The basic pattern that I am matching is abc with _ and . as valid delimiters possibly after each character, and ideally I want the whole string where this character pattern is contained.

I have tested
(?i)a(?:[b])(?:[c])*|a(?:[\.,b])(?:[\.,c])*
but it isn't cutting it for me (and it won't work for you either - HA!)

I've been testing on http://gskinner.com/RegExr/ and while that's made the testing a little quicker, my regex skills are clearly failing.
posted by Nanukthedog to Computers & Internet (11 answers total)
 
Best answer: a[._]?b[._]?c matches all of your examples. Or do you need this to do more?
posted by zsazsa at 10:32 AM on May 1, 2013 [3 favorites]


Best answer: I think it needs one more element to match a trailing period or underscore.
a[._]?b[._]?c[._]?
posted by kidbritish at 10:35 AM on May 1, 2013 [2 favorites]
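
A quick way to check the two suggestions above is to run them over the examples from the question. This is only a sketch in Python (the poster is working in RapidMiner and SAS, so the names `samples`, `plain` and `trailing` are illustrative assumptions, not part of the thread):

```python
import re

# Examples taken from the question.
samples = ["abc", "a.b.c.", "a.b.c", "a_b_c", "ab_c", "a_b.c."]

# Pattern without, and with, the optional trailing delimiter.
plain = re.compile(r"a[._]?b[._]?c")
trailing = re.compile(r"a[._]?b[._]?c[._]?")

for s in samples:
    # Both patterns find a match in every sample; they differ only in
    # whether a trailing "." or "_" is included in the matched text.
    print(s, "->", plain.search(s).group(0), "|", trailing.search(s).group(0))
```

Both patterns match every example; the second one simply pulls a trailing delimiter into the match when one is present.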


Best answer: Do you care about matching multiple consecutive delimiters? eg. do you want "a....b._..c" to be caught?
posted by reptile at 10:36 AM on May 1, 2013 [1 favorite]
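
To see what the consecutive-delimiter question changes, here is a small Python sketch comparing `?` (at most one delimiter) with `*` (any run of delimiters). The variable names are assumptions made for illustration:

```python
import re

text = "a....b._..c"

# "?" allows at most one delimiter after each letter.
single = re.compile(r"a[._]?b[._]?c[._]?")
# "*" allows any run of delimiters, including none.
runs = re.compile(r"a[._]*b[._]*c[._]*")

print(single.search(text))         # no match: more than one delimiter between letters
print(runs.search(text).group(0))  # the whole string matches
```

If runs of delimiters should be rejected, the `?` form already does that; if they should be accepted, `*` is the one-character change.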


Best answer: I think you need a trailing [._]? as well. In Perl I would put parens around the whole thing.

(a[._]?b[._]?c[._]?)
posted by Bruce H. at 10:36 AM on May 1, 2013 [1 favorite]
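
The parenthesised version above can also be used to get both the matched piece and the whole URL it came from, which is what the question asks for. A minimal sketch in Python, assuming made-up example.com URLs in place of the real data and using `re.IGNORECASE` to mirror the `(?i)` flag from the question:

```python
import re

# Hypothetical URLs standing in for the poster's data.
urls = [
    "http://example.com/products/A.B.C./index.html",
    "http://example.com/about",
    "http://example.com/old/a_b_c/page",
]

# The parentheses capture the matched text as group 1.
pattern = re.compile(r"(a[._]?b[._]?c[._]?)", re.IGNORECASE)

matched = []
for url in urls:
    m = pattern.search(url)
    if m:
        # url is the whole string; m.group(1) is the piece that matched.
        matched.append((url, m.group(1)))

for url, piece in matched:
    print(url, "->", piece)
```

Here the first and third URLs are kept and the second is skipped, because "about" has no c after the b.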


This is not really your question, but if you're trying to extract top-level domain information from URLs, the Python module tldextract may help, as it uses Mozilla's list of effective TLDs (useful if you have any .co.uk or other 'fun' TLDs in your data). I found this, appropriately enough, on Stack Overflow; the answers include a link to Mozilla's TLD list if you're not into Python or would rather roll your own.
posted by worstname at 10:43 AM on May 1, 2013 [2 favorites]


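One way to do it in Python is to use the excellent itertools library:
import itertools, re

# Build every five-character ordering of a, b, c, _ and .
# re.escape keeps the "." literal instead of letting it match any character.
needles = []
for p in itertools.permutations('abc_.', 5):
    needles.append(re.escape(''.join(p)))
query = "foo bar ab.c_ baz"
q = re.search("|".join(needles), query)
if q:
    print(q.group(0))
The variable needles is a list containing all five-character-long permutations of the characters abc_. (each escaped so it can be used safely in a regex).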

We then use this list as a pattern in a regular expression search on the string query.
posted by Blazecock Pileon at 10:44 AM on May 1, 2013 [2 favorites]


Response by poster: Wow, big step forward, I've tested, and run into a few additional challenges.

It must contain a, b and c to be valid, which when I tested isn't completely working.

I modified the great responses to meet my needs (I think). It tests so far - but please poke holes:

a+[\.,_]?b+[\.,_]?c+[\.,_]

@ reptile: great question! Multiple delimiters should be thrown out, which means that part of the implementation is now working properly.

@worstname: if only I were so lucky.

Rough explanation of the project: Running a list of old URLs through RapidMiner to help process the names. I'm then building this list back in SAS so I can decode the IP address into its geolocational component of who hit what pages from a given location and when. Sadly no Python fun involved today.
posted by Nanukthedog at 10:48 AM on May 1, 2013
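
Since the response above asks for holes to be poked: two things stand out in a+[\.,_]?b+[\.,_]?c+[\.,_]. The commas inside the brackets are matched literally, and the last bracket has no ?, so a trailing delimiter is required. The Python sketch below shows both effects next to a possible adjustment; the names `posted` and `adjusted` are assumptions, and the adjusted pattern is one guess at the intent, not the poster's final answer:

```python
import re

# The pattern as posted: commas are literal, trailing delimiter is required.
posted = re.compile(r"a+[\.,_]?b+[\.,_]?c+[\.,_]")
# Possible adjustment: no commas in the class, trailing delimiter optional.
adjusted = re.compile(r"a+[._]?b+[._]?c+[._]?")

for s in ["abc", "a.b.c.", "a,b,c,", "aabbcc_"]:
    print(s, bool(posted.search(s)), bool(adjusted.search(s)))
```

With the posted pattern, plain "abc" is rejected and "a,b,c," is accepted; the adjusted pattern reverses both of those results.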


Response by poster: ...Yeah, I'm text mining URLs... drink.
posted by Nanukthedog at 10:49 AM on May 1, 2013


Response by poster: er, challenges were before I added the '+'es, it seemed to allow for ab combinations or ac combinations before I did that - which was problematic.
posted by Nanukthedog at 10:51 AM on May 1, 2013


Best answer: Replace a+, b+ and c+ with (a|b|c)+ in each case if you *want* to be able to match multiple a/b/c characters in a row.
posted by tylerkaraszewski at 10:52 AM on May 1, 2013 [1 favorite]
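
As a rough illustration of the substitution suggested above, the Python sketch below builds the pattern with (a|b|c)+ in each letter position (written as a non-capturing group). Because that pattern no longer forces all three letters to appear, the sketch adds a separate check; `has_all_letters` is a hypothetical helper, not something from the thread:

```python
import re

# Assumption: each letter slot becomes (a|b|c)+, here as a non-capturing group.
letters = r"(?:a|b|c)+"
any_order = re.compile(letters + r"[._]?" + letters + r"[._]?" + letters + r"[._]?")

def has_all_letters(text):
    """Hypothetical extra check: some match must contain a, b and c."""
    return any({"a", "b", "c"} <= set(m.group(0))
               for m in any_order.finditer(text))

for s in ["c.b_a", "cba", "aaa", "ab"]:
    print(s, bool(any_order.search(s)), has_all_letters(s))
```

The pattern accepts any ordering such as "c.b_a", but it also accepts "aaa", which is why the extra check may be needed if a, b and c must all be present.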


Response by poster: Thanks all for getting me back up and running quickly.
posted by Nanukthedog at 10:59 AM on May 1, 2013

