regex to find strings in strings
March 21, 2011 8:45 AM   Subscribe

I am trying to find a regex which will allow me to find out whether a string exists at least N times in a different string.

For example, if I have the string “abacabacbdaabbcedefg”, I would like the regex to tell me if there are at least 3 b’s in the string. I don’t want to know the number of b’s just a true/false answer – are there at least 3 b’s in the string.

I can achieve this will the regex “.*b.*b.*b” – crude but effective – but I would also like to be able to find at least 3 b’s AND at least 4 a’s – but in any order.

To be honest - regex gives me a headache! All help greatly appreciated
posted by the_very_hungry_caterpillar to Computers & Internet (16 answers total) 2 users marked this as a favorite
 
What language are you using? Perl's grep function can return the number of times an RE matches.
posted by scruss at 8:53 AM on March 21, 2011


Response by poster: This is for use inside an SaaS application - only the regex can be added, and the only match is true/false.
posted by the_very_hungry_caterpillar at 8:54 AM on March 21, 2011


Does the regex engine support if-then-else conditionals?
posted by bcwinters at 8:59 AM on March 21, 2011


Response by poster: don't know, but I can try
posted by the_very_hungry_caterpillar at 9:01 AM on March 21, 2011


this means you should ask how many "3bs in a string" are in the string "bbbbb" ?

Matching substrings might be your google term, wiki suggests a few named algorithms, though they don't note regex.

(And what's the regex joke ? if you have a problem and are trying to solve it by regex, you now have two problems ? )
posted by k5.user at 9:05 AM on March 21, 2011


Response by poster: no - the b's do not have to be consecutive. ".*b.*b.*b" means b then anything (including nothing) then b then anything then b, so it finds the first 3 b's in the string.
posted by the_very_hungry_caterpillar at 9:09 AM on March 21, 2011


Best answer: The good news is that since regular languages are closed under intersection, there is guaranteed to be some regex that does what you want. i.e. finds all strings that match both .*a.*a.*a.*a.* and .*b.*b.*b.*.

The bad news is that such a regex is likely to be very, very long. Off the top of my head I can't think of a shorter way to write it than just listing all the possible permutations:

.*a.*a.*a.*a.*b.*b.*b.* |
.*b.*b.*b.*a.*a.*a.*a.* |
.*a.*a.*a.*b.*a.*b.*b.* |
.*a.*a.*a.*b.*b.*a.*b.* | ...


For this simple example there would be 7!/(3!*4!) = 35 clauses, if my math is right.
posted by teraflop at 9:12 AM on March 21, 2011 [4 favorites]


Not sure what language you're using, but why not do something like this and excise the regex completely:

return String.Split('b').Length() == 4 && String.Split('a').Length() == 5;
posted by jeffamaphone at 9:16 AM on March 21, 2011


Assuming the regex system you are using supports it, use a the lookahead operator. The format is usually (?=[stuff]). It will check the stuff in the lookahead section, then return to wherever it was at that point. Example:

(?=(.*a){4})(?=(.*b){3})
posted by burnmp3s at 9:26 AM on March 21, 2011 [3 favorites]


I would also like to be able to find at least 3 b’s AND at least 4 a’s – but in any order

Maybe I'm missing something about your requirements, but aren't the searches for the b's and the a's independent? Use two regexes (regexen?) to get one true/false for "at least three b's" and one true/false for "at least four a's", and then AND them together.
posted by madmethods at 10:10 AM on March 21, 2011 [1 favorite]


Response by poster: @burnmp3s - the s/w just says error when using the supplied reges so I guess it doesn't support lookaheads

@madmethods - agreed - I am looking at running two anded regexs to achieve the results, just not very elegant

@y'all - is this @ thing considered persona non-grata on Mefi/Ask??
posted by the_very_hungry_caterpillar at 11:31 AM on March 21, 2011


Best answer: I am looking at running two anded regexs to achieve the results, just not very elegant

If using two separate regular expressions is an option, I'd say that's far more elegant than trying to do boolean logic in a regular expression.

is this @ thing considered persona non-grata on Mefi/Ask??

Yes. It's not the worst thing in the world, but many don't like it.
posted by scottreynen at 1:32 PM on March 21, 2011


Best answer: Many small regexes are often better than monolithic regexes (for example, when trying to extract information from scraped webpages). This is also one of those cases.
posted by beerbajay at 1:39 PM on March 21, 2011


I think k5.user was trying to clarify whether "bbbbb" contains the string "bbb" three times, "bbbbb" "bbbbb" "bbbbb" or if it would require "bbbbbbbbb"
posted by RobotHero at 1:51 PM on March 21, 2011


Response by poster: cheers! all working.
posted by the_very_hungry_caterpillar at 3:25 PM on March 21, 2011


The good news is that since regular languages are closed under intersection, there is guaranteed to be some regex that does what you want.

Although that is true in this case, please note that regex languages form a superset of the class of regular languages. Depending on your choice of formalization (and regex dialect), this class is not closed under intersection.
posted by erdferkel at 3:43 AM on March 22, 2011


« Older Looking for scrub in all the wrong places ...   |   Help my dress fit like a dream! Newer »
This thread is closed to new comments.