Idiot-proof RegEx creator tool?
May 12, 2008 3:47 AM
How can I find a drop-dead simple way of creating Regular Expression strings without having to learn all the underlying mechanics?
I've been trying for the past six months or so to get over my (apparent) mental block about understanding regular expressions. Sadly, I haven't really made any progress at all. It's still migraine-inducing (literally). I don't understand why someone hasn't created a software tool where I can drop in a text string and it will generate the regular expression I need.
At my job, I occasionally have to update our Spam filter. To accomplish this, I'm asked to visually/manually filter through our spam box, find the popular new trend in spam subject lines or body text. Ok, I can handle this no problem. Next, I'm supposed to create regular expressions from those strings, and update our spam filter to include those new strings. This is the part that has me totally and completely frustrated.
Every regular expression tool I'm finding on the internet seems to be focused on finding patterns in CODE, not spam. Others that I try don't seem to produce the results I want. For example, I found this regex creator, but when I feed two different strings into it (see below) I get the same regex output, which doesn't seem right:
Earn a degree --> (\w{4}\s+\w\s+\w+)
your length easily --> (\w{4}\s+\w\s+\w+)
I'm obviously not understanding regular expressions. And honestly, I don't really want (or have the time) to understand the mechanics underneath them. All I want is a tool where I can input text strings and get back a regular expression I can add to my spam filter. Yes, I realize this is somewhat of a sophomoric / whiny request ("I want someone/thing else to do all the work for me!!!")... but that's really not my attitude.
Alternatively, if someone could suggest an online link, book or some other resource that would CLEARLY explain regular expressions in a way a non-coder can understand, I'd be super ecstatic to read it. But so far (along with all my other attempts to learn coding) I haven't yet found any resource like that.
You have to specify the string you want to match somehow. No tool is going to be able to read your mind.
If all you want to match is "Earn a degree", then the expression "Earn a degree" will work just fine. But it won't match "Earn an amazing degree!!!" For that, you'll need an expression like "Earn.*degree". The ".*" in the expression will match any text between "Earn" and "degree".
I've never read this book, but it looks pretty good.
Alternatively, post your questions here and I'm sure some people will be able to walk you through it.
posted by grouse at 4:09 AM on May 12, 2008
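A quick way to see grouse's point in action, as a rough sketch only: it assumes Python's re module purely for illustration (the thread never establishes which regex engine the spam filter uses), and the subject lines are made up.

```python
import re

# Made-up subject lines for illustration.
subjects = [
    "Earn a degree",
    "Earn an amazing degree!!!",
    "Learn to cook",
]

literal = re.compile(r"Earn a degree")   # matches only that exact text
flexible = re.compile(r"Earn.*degree")   # "Earn", then any text, then "degree"

literal_hits = [s for s in subjects if literal.search(s)]
flexible_hits = [s for s in subjects if flexible.search(s)]

print("literal: ", literal_hits)
print("flexible:", flexible_hits)
```

The literal pattern finds only the first subject line, while the ".*" version also catches the "amazing" variant.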
Response by poster: Thanks grouse,
That's the part I'm worried about. ("no tool is going to be able to read your mind")
I struggled through what I needed to get done this morning by:
1.) looking at previous regex tests my coworkers had written
2.) because I'm only doing subject line searches and matching on the EXACT string (example: "Earn a degree")
I realize that if I want to do anything more complex than that (searching on variable words, etc.) I'm probably going to have to somehow force myself to learn the actual guts of regex coding. It just seems like with all the technology we have, someone somewhere should be able to come up with a semantic tool where I can point/click and say: For any subject line that starts with "FREE" and also includes "books", "degree" or "sex"..... and it creates a regex string from those rules.
The problem I seem to have with regex tutorials (and most other UNIX tutorials) is that they are poorly written, very dry/boring, and overly "syntactic" (meaning: having to read 14 pages of available command line switches for a certain utility doesn't really help me learn actual real-world application of the commands).
I wish there was a regex tutorial that:
1.) showed real world examples of regex strings
2.) highlighted or otherwise color-coded the different parts of the strings, and broke down how the stream of logic/filtering works
3.) gave me some kind of flowchart or visual representation of how the string is created
Sorry for rambling... I'm just frustrated :).. but also stubborn, so I'm going to keep trying. Thanks for the book recommendation.
posted by jmnugent at 4:25 AM on May 12, 2008
2nding grouse's recommendation for Friedl's book. Regexes are hard, but worth it.
posted by scruss at 4:25 AM on May 12, 2008
I found this online JavaScript tool on del.icio.us a couple of weeks ago; it might be of some help.
http://regex.larsolavtorvik.com/
posted by kothar at 4:28 AM on May 12, 2008 [1 favorite]
Much of business programming is not actually a matter of complicated computer science, but really in specifying the problem. This is one of the reasons that software development projects can take much longer than planned—people realize later that the problem was incompletely specified at the beginning and then have to redesign later.
This is your biggest problem. Until you can specify the problem, you can't even effectively ask for help. After you can specify the problem, the syntax is a minor issue.
For any subject line that starts with "FREE" and also includes "books", "degree" or "sex"..... and creates a regex string to those rules.
The problem with creating a tool to do this is that in order to come up with one that can express all the power of regular expressions, it will need to be at least as complicated as regular expressions. We need to know what software you are using because there are multiple incompatible variants of regexes. Assuming you are using Perl-compatible regular expressions, this should work:
^FREE.*(books|degree|sex)
^ matches the beginning of the line. The vertical bars allow you to specify several possible alternatives. The parentheses are for grouping.
posted by grouse at 4:35 AM on May 12, 2008
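grouse's pattern can be tried out with a short sketch like the one below. It assumes Python's re module, which handles this particular pattern the same way a Perl-compatible engine would; the sample subject lines are invented.

```python
import re

# The pattern from grouse's comment: starts with FREE, later contains one of three words.
rule = re.compile(r"^FREE.*(books|degree|sex)")

# Invented subject lines.
samples = [
    "FREE books for everyone",
    "FREE online degree in 2 weeks",
    "Get FREE books",            # does not start with FREE
    "FREE shipping today",       # none of the three words
    "FREE parking in Essex",     # "Essex" contains "sex": a false positive
]

results = {s: bool(rule.search(s)) for s in samples}
for subject, matched in results.items():
    print(f"{matched!s:5}  {subject}")
```

The last sample is there on purpose: the rule as written has no notion of whole words, which is the kind of over-matching several commenters warn about.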
I first learned regular expressions by reading this brief guide by Dorothea Salo. It's very short, yet quite comprehensive - most regex tutorials go on for a dozen pages but after reading this I was pretty much set. Try reading through it and see if it'll help you grok regular expressions.
posted by bent back tulips at 4:36 AM on May 12, 2008
Regex Buddy is pretty good (it's 30 euros, but they have a free trial).
posted by missmagenta at 4:36 AM on May 12, 2008
What tool are you using to check the email for your regex patterns? That's important since there's no single regex specification. Different tools support different regex patterns.
I'm not sure that regexes are really the way to go with spam filtering; much of the spam I receive is garbled in ways that would defeat most regex filtering schemes. For example, a spam I recently received contained: Pre scriptionz. It would be hard to write a regex that detected all such permutations of just this 'word'. There could be white space or misspellings or inserted/deleted characters anywhere.
If you use the regex approach, you'll probably have to get fairly good at writing them. Really, it isn't that hard. What part are you having problems with?
posted by DarkForest at 4:36 AM on May 12, 2008
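To illustrate DarkForest's "Pre scriptionz" example, here is a rough sketch (Python assumed; the helper name spaced_out and the sample strings are hypothetical) of a pattern that tolerates whitespace between letters. It catches the spaced-out variants but still misses substituted or dropped characters, which is the difficulty being described.

```python
import re

def spaced_out(word):
    """Build a pattern allowing optional whitespace between every letter."""
    return r"\s*".join(re.escape(ch) for ch in word)

# Optional trailing "s" or "z", case-insensitive.
pattern = re.compile(spaced_out("prescription") + r"[sz]?", re.IGNORECASE)

samples = [
    "Pre scriptionz",
    "cheap p r e s c r i p t i o n s",
    "Pr3scription",   # digit substituted for a letter
    "Presciption",    # letter dropped
]
caught = [s for s in samples if pattern.search(s)]
print(caught)
```

Each extra kind of garbling (digits for letters, missing letters, inserted punctuation) needs its own addition to the pattern, so it grows quickly.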
First Google on Perl regular expressions; that search brings up several other example-driven ones. You'll have more luck searching for regex examples in Perl, which seems to be where people cut their teeth on them.
If you just want to match literal strings, replace the spaces with \s's and it should be good to go. The tool you are using is a BAD IDEA since it is creating a very general expression which will match many things that your not interested in filtering.
Earn a degree --> (\w{4}\s+\w\s+\w+)
Is correct (four word characters, at least one space, a word character, at least one space, at least one word character) but will match TONS of stuff, such as "form a sentence" (4, space, 1, space, at least 1). It also won't match "earn your degree". Spam blocking is hard, and the people who generate spam content are generally conscious of not matching obvious regexes.
posted by a robot made out of meat at 4:43 AM on May 12, 2008
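The over-matching described above is easy to check. This is a minimal sketch assuming Python's re module (the filter's actual engine is unknown), comparing the generator's output with the "replace spaces with \s" literal approach.

```python
import re

generated = re.compile(r"(\w{4}\s+\w\s+\w+)")   # the generator tool's output
literal = re.compile(r"Earn\sa\sdegree")        # literal text, spaces replaced by \s

subjects = [
    "Earn a degree",
    "form a sentence",
    "earn your degree",
    "your length easily",
]

generated_hits = [s for s in subjects if generated.search(s)]
literal_hits = [s for s in subjects if literal.search(s)]

print("generated:", generated_hits)
print("literal:  ", literal_hits)
```

In this sketch the generated pattern matches "form a sentence" but not "earn your degree". It also does not match "your length easily", the second string from the original question, because that string has no one-letter word in it.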
For example, s/your\snot\sinter/you are not inter/
posted by a robot made out of meat at 4:45 AM on May 12, 2008
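For readers unfamiliar with the s/pattern/replacement/ notation used in that correction, a rough Python equivalent (an assumption for illustration; the original is sed/Perl-style syntax) is re.sub:

```python
import re

original = "many things that your not interested in filtering"

# Same pattern and replacement as in the s/.../.../ expression above.
fixed = re.sub(r"your\snot\sinter", "you are not inter", original)

print(fixed)
```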
Regular expressions are really only as complicated as the patterns they're trying to match. So just start with something like this quick start page and use it to move from matching exact phrases to also matching simple variations (added words, optional words, optional plurals, alternate spellings, etc.).
Don't try to leap in and construct incredibly powerful expressions, just consider how the string you're matching against may vary and try to pick out some simple, logical ways to express that variation with basic regex features.
posted by malevolent at 4:47 AM on May 12, 2008
When I'm constructing truly complex expressions I typically build them incrementally, using emacs interactive regular expression search to verify them. You might want to try some variation on this. The cool thing is that emacs will highlight the text that it's matching so you can see what you're doing right and what you're doing wrong.
I don't understand what you expect this tool to do. Even if you did manage to find such a tool YOU SHOULD NOT throw away e-mail based on subject lines matching an expression that you don't understand.
Given a large enough corpus of spammy and non-spammy subject lines I can see a tool that tries to find a pattern in spammy subject lines and identify new spammy subject lines, but that doesn't have much to do with regular expressions. I would look at SpamAssassin, its rules, and its publicly available spam corpus if I were interested in doing something like this.
posted by rdr at 4:47 AM on May 12, 2008
Much of business programming is not actually a matter of complicated computer science, but really in specifying the problem.
Most programming, but not regular expressions. Regexes are extremely compact representations of deterministic finite state machines. If you don't understand the math behind that, it can be hard to use them. Also, there are a lot of things that DFAs cannot do which you might think they could.
And they can be pretty difficult to do. You can accomplish something that might take 100 lines of regular code, but your resulting regex might take as long to write as that 100 lines of code, and be even harder to figure out.
But there are a few simple things to remember: If you want to find a particular string, just use that string. To find "Earn your degree" you just use "Earn your degree".
Also, Nthing those who think that this would be a hopeless way to fight spam. Spammers use huge lists of synonyms now and randomly alter some letters; regexes won't catch those.
posted by delmoi at 4:51 AM on May 12, 2008
Response by poster: DarkForest - "What tool are you using to check the email for your regex patterns?"
Honestly (and I know this isn't helpful)... I really don't know what the backend tool is. I believe our company's filter is based on SpamAssassin, but all I have for submitting spam tests is a front-end web-management interface that basically allows me to submit a new spam test, include the regex string, and modify a few basic settings. Beyond that, I really don't know the backend unix guts of it.
Malevolent - "Don't try to leap in and construct incredibly powerful expressions, just consider how the string you're matching against may vary and try to pick out some simple, logical ways to express that variation with basic regex features."
I think that's one of my frustrations. I'm frustrated because it feels like a huge leap from "simple expressions" (for my purposes, pretty useless in blocking spam)... to "incredibly powerful expressions". It's the same frustration I feel trying to learn Perl or Python or Java, etc. I get through the introduction and the basic concepts, but then it gets incredibly hard, incredibly fast (or so it seems to me). I feel like I can't do anything useful because I can't grasp (or learn) anything beyond the kindergarten basics. :(
posted by jmnugent at 4:54 AM on May 12, 2008
One thing about regular expressions which might be useful for learning how they work:
Generally, if you put parentheses around something, you will later be able to tell exactly what part of the string matched the part of the expression in parentheses.
So, you could write your own learning tool. Let's say you've seen someone use the following regexp, and you're trying to understand what it does:
a+[0-9]*\s+.*
Then get some data that this regexp is being used for, and write your own little simulator program, sticking parentheses into the regexp (I'm making this syntax up):
exp = new Regexp ( "(a+)([0-9]*)(\s+)(.*)" );
testString = "Green eggs and ham aaa0293 super!";
matchInfo = exp.matchAgainst ( testString );
if ( matchInfo.matches() ) {
for ( i = 0 to matchInfo.matchPartCount(); i++ ) {
part = matchInfo.matchPart(i);
print ( "Part " + (i+1) + ", len " + part.length() + ": " + part );
}
} else {
print ( "Does not match" );
}
This program would print out:
Part 1, len 3: aaa
Part 2, len 4: 0293
Part 3, len 5:
Part 4, len 6: super!
Then fiddle around with the input string a little at a time, seeing how changes affect the data that's matched, or even whether it's matched at all.
Of course, to do this, you'll first have to learn enough about regular expressions to know where parentheses are allowed to go, but that's not all that difficult, even without understanding what anything in an RE means. For example, don't use parens to split up a *, +, or ? from the thing that precedes it; don't split up anything inside square brackets; (generally) don't split up a backslash from the character that follows it.
posted by Flunkie at 5:09 AM on May 12, 2008
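Flunkie's made-up syntax translates fairly directly into real code. The following is a sketch in Python (an assumption; any language with capture groups would do), with the five spaces before "super!" written out explicitly since they are easy to lose in a web page.

```python
import re

# The regexp under study, with parentheses added around each piece.
pattern = re.compile(r"(a+)([0-9]*)(\s+)(.*)")

# Five spaces between the digits and "super!", spelled out so they survive copy/paste.
test_string = "Green eggs and ham aaa0293" + " " * 5 + "super!"

match = pattern.search(test_string)
parts = list(match.groups()) if match else []

if match:
    for i, part in enumerate(parts, start=1):
        print(f"Part {i}, len {len(part)}: {part}")
else:
    print("Does not match")
```

Changing test_string a little at a time and re-running shows which piece of the expression picks up which piece of the text.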
Ugh. Sorry about the lack of indentation in the code. I put it in, I swear, but I guess Metafilter has HTML nonbreaking space entities working in both "Live Preview" and "Preview", but not when you actually post.
posted by Flunkie at 5:12 AM on May 12, 2008
And also of note regarding that: "Part 3" is labelled with "length 5" because I put five nonbreaking space characters in that part of the string. Which Metafilter converted to five normal spaces, which show up in a web browser as a single space, not five. Ugh.
posted by Flunkie at 5:15 AM on May 12, 2008
Spam filtering is really hard, and I suspect that almost nobody does it by manually writing regular expressions any longer.
When you have a moment, you should take a look on the internet and find out what the state-of-the-art is for both Open Source and commercial (subscription) spam filtering, and whether using a more modern tool for this purpose will benefit your organization. As far as I know, most of these have a facility to "learn" from user-submitted messages automatically.
Personally I've had the best experience with spambayes (it relies entirely on "learning"; downside: designed for individual users, not site-wide installation) and spamassassin (which has a lot of regular expression tests plus learning; works well for site-wide installation; downside: I get about 10x more spam through the filter than on the account with spambayes). On the commercial side, the service called "postini" was apparently good enough for Google to buy.
posted by jepler at 5:30 AM on May 12, 2008
I think that's one of my frustrations. I'm frustrated because it feels like a huge leap from "simple expressions" (for my purposes, pretty useless in blocking spam)... to "incredibly powerful expressions".
Try it in small steps, then.
Take "earn a degree," for example. The simplest variations on that might be "earn your degree" or "earn this degree" and so on. So a good first step would be really understanding how to write a regex that matches "earn ___ degree," where ___ is exactly one word, no more and no less.
The next step would be matching strings like "earn your new degree" or "earn an amazing new degree". In other words, strings that start with earn, end with degree, and have one or more words in between.
Then zero or more words in between, just so you hit those rare cases where spam's not grammatically correct.
Then strings like "earn a degree for cheap", where you also match for zero or more words at the end.
Then make the match case insensitive.
Then make sure you're matching punctuation too.
Then handle strings with "degrees" as well as "degree".
Then strings that start with either "earn" or "get".
Then strings with words that have, say, five consonants in a row, for "Qxzvz earn your degree zxmcnvx".
These are simple and not enough to fish out all your spam, but they're absolutely a solid start. My suggestion is to go through simple steps like that one by one, giving yourself plenty of time to get the hang of each new trick you learn. Use the time to try to filter out as much spam as you can using the limited expressions you know, paying attention to false positives so that you learn not to over-match, and also so that you get a feel for the sorts of patterns that occur.
When you find that there's a pattern you just don't know how to express, figure out in words what exactly you need to know how to match for it and then make that your next step.
I don't think simple expressions are useless.
posted by trig at 5:33 AM on May 12, 2008
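The first few of trig's steps might look something like the sketch below. It assumes Python's re module, the patterns are just one possible way to write each step, and the subject lines are invented. The punctuation and consonant-run steps are left out.

```python
import re

# Each entry: (what the step adds, pattern, flags). Patterns are illustrative only.
steps = [
    ("exactly one word in between", r"earn \w+ degree", 0),
    ("one or more words in between", r"earn (\w+ )+degree", 0),
    ("zero or more words in between", r"earn (\w+ )*degree", 0),
    ("ignore case", r"earn (\w+ )*degree", re.IGNORECASE),
    ("earn or get, optional plural, whole words",
     r"\b(earn|get) (\w+ )*degrees?\b", re.IGNORECASE),
]

subject_lines = [
    "earn your degree",
    "earn an amazing new degree",
    "earn degree",
    "EARN YOUR DEGREE",
    "Get your degrees today",
    "learn by degrees",   # innocent line, useful for spotting over-matching
]

def hits(pattern, flags, subject):
    """Return True if the pattern is found anywhere in the subject line."""
    return re.search(pattern, subject, flags) is not None

table = {
    label: [s for s in subject_lines if hits(pattern, flags, s)]
    for label, pattern, flags in steps
}
for label, matched in table.items():
    print(f"{label}: {matched}")
```

Note that "learn by degrees" is matched by every step until the last one, because "earn by degree" hides inside it. The word-boundary marker \b in the final pattern is what stops that, which is a small example of watching for false positives at each step.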
I've heard of folks who loved working with Regex Buddy; it costs ya 40 bucks and is Windows only.
Nthing the idea that regex-based spam hunting is the path to perdition: when you get something that is general enough to whack most of the spam, you find the false positive rate is unacceptably high.
posted by jenkinsEar at 5:34 AM on May 12, 2008
I really should read more of the "more inside" before I post. Sorry. I missed the whole "not a programmer" part.
Now that I'm caught up, I'm going to nth the chorus of "this is not going to be particularly useful for fighting spam", even if you become a transcendental reg exp master.
Get yourself a good, premade, Bayesian spam filter. It will do the job a whole lot better than what you're proposing, and cost a whole hell of a lot less effort on your part.
posted by Flunkie at 5:43 AM on May 12, 2008
The key to understanding complicated regexes is to understand that every complicated regex is just a bunch of simpler ones strung together in order.
When I'm writing code that uses regexes, I always make them in pieces, using variables to hold the fiddly little intermediate parts. Regexes are much easier to write than they are to read, and using named parts in this way stops me going insane when debugging or modifying them. For example, here's the regex-definition part of a bash script I wrote recently:
# Define regexes to match main quiz result line and folded title/author lines
re_word='[^ ]+' #run of nonspaces
re_words="$re_word( $re_word)*" #words separated by at most one space
re_num='[0-9.,]+' #run of digits, points or commas
re_voice="^ +(#?)" #leading spaces, optional hash
re_quiz_no="($re_num) +" #number, trailing spaces
re_lang="(EN|SP) +" #at least two trailing spaces
re_title="($re_words) +" #words with two or more trailing spaces
re_author="($re_words) +" #words with trailing spaces
re_il="(LG|MG|UG) +"
re_bl="($re_num) +" #number with at least one trailing space
re_points="($re_num) +" #same again
re_wcnt="($re_num) +" #same again
re_fnf="($re_word)$" #word at end of line
re_quiz="$re_voice$re_quiz_no$re_lang$re_title$re_author$re_il$re_bl$re_points$re_wcnt$re_fnf"
re_title2="($re_words) " #alternate pattern: only one trailing space
re_author2="($re_word,( $re_word)*) +" #alternate pattern: word, comma, maybe more names, spaces
re_quiz2="$re_voice$re_quiz_no$re_lang$re_title2$re_author2$re_il$re_bl$re_points$re_wcnt$re_fnf"
re_ext_title="^ {17,19}($re_words)$"
re_ext_author="^ {62,71}($re_words)$"
re_ext_title_author="^ {17,19}($re_words) +($re_words)$"
re_quiz, re_quiz2 and re_ext_title_author are the only ones that get used later on. All the others are intermediates done purely for my own benefit as a human reader. As a non-coder, once you know that bash will replace an instance of $some_variable with the contents of some_variable, you won't be terribly surprised to hear that re_quiz matches any line consisting of a voice flag, a quiz number, a language code, a title, an author, an IL (whatever that is), a BL (likewise), a number of points, a word count and a fiction/nonfiction flag in that order.
Without the intermediate variables, you'd have to try to figure that out from the raw regex itself:
^ +(#?)([0-9.,]+) +(EN|SP) +([^ ]+( [^ ]+)*) +([^ ]+( [^ ]+)*) +(LG|MG|UG) +([0-9.,]+) +([0-9.,]+) +([0-9.,]+) +([^ ]+)$
which is really no fun at all. I am a coder, and I wouldn't like having to do that for a living.
Using a language that allows inline expansion of variables, like bash or perl, makes this easier, since there's very little extra syntax to muddy the regexes themselves. It's still do-able in more conventional languages but messier. As it happens, I ended up needing to translate that bash script to Visual Basic Script, so I can show you the equivalent of the above section:
' Define regex patterns to match main quiz result line and folded title/author lines
repWord = "[^ ]+" ' run of nonspaces
repWords = repWord & "( " & repWord & ")*" ' words separated by at most one space
repNum = "[0-9.,]+" ' run of digits, points or commas
repVoice = "^ +(#?)" ' leading spaces, optional hash
repQuizNum = "(" & repNum & ") +" ' number, trailing spaces
repLang = "(EN|SP) +" ' at least two trailing spaces
repTitle = "(" & repWords & ") +" ' words with two or more trailing spaces
repAuthor = "(" & repWords & ") +" ' words with trailing spaces
repIL = "(LG|MG|UG) +"
repBL = "(" & repNum & ") +" ' number with at least one trailing space
repPoints = "(" & repNum & ") +" ' same again
repWdCnt = "(" & repNum & ") +" ' same again
repFnf = "(" & repWord & ")$" ' word at end of line
set reQuiz = new regexp
reQuiz.pattern = repVoice & repQuizNum & repLang & repTitle & repAuthor & _
    repIL & repBL & repPoints & repWdCnt & repFnf
repTitle2 = "(" & repWords & ") " ' alternate pattern: only one trailing space
repAuthor2 = "(" & repWord & ",( " & repWord & ")*) +" ' alternate pattern: word, comma, maybe more names, spaces
set reQuiz2 = new regexp
reQuiz2.pattern = repVoice & repQuizNum & repLang & repTitle2 & repAuthor2 & _
    repIL & repBL & repPoints & repWdCnt & repFnf
set reExtTitle = new regexp
reExtTitle.pattern = "^ {17,19}(" & repWords & ")$"
set reExtAuthor = new regexp
reExtAuthor.pattern = "^ {62,71}(" & repWords & ")$"
set reExtTitleAuth = new regexp
reExtTitleAuth.pattern = "^ {17,19}(" & repWords & ") +(" & repWords & ")$"
As well as making things more readable, this by-pieces technique makes regex debugging a lot easier. I can build my regexes up from tested pieces, so I know that as soon as I stop getting matches I'm expecting to get, the last piece I added is probably where the trouble is - and because that piece generally has its own name, it's easy to find and fix the fault.
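For comparison, here is a minimal Python sketch of the same by-pieces idea, cut down to three fields. The field layout, the names and the sample line are made up for illustration; they are not taken from the scripts above.

```python
import re

# Small named pieces, each simple enough to test on its own.
WORD = r"[^ ]+"                    # run of non-spaces
WORDS = rf"{WORD}(?: {WORD})*"     # words separated by single spaces
NUM = r"[0-9.,]+"                  # run of digits, points or commas

# Larger pieces built from the small ones; each captures one field.
QUIZ_NO = rf"({NUM}) +"            # number, trailing spaces
LANG = r"(EN|SP) +"                # language code, trailing spaces
TITLE = rf"({WORDS})"              # title running to end of line

# The full pattern is just the pieces strung together in order.
LINE = re.compile(rf"^ *{QUIZ_NO}{LANG}{TITLE}$")


def parse(line):
    """Return (quiz_no, lang, title), or None if the line does not match."""
    m = LINE.match(line)
    return m.groups() if m else None


print(parse("  1234 EN  The Cat in the Hat"))
```

If a line stops matching after a new piece is added, that piece is the first place to look.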
That said, I agree with everybody else who has said that purely regex-based spam filtering is a losing battle. If you're currently relying on some kind of hacked-together in-house spam filter, stop doing that. It's a wheel you don't need to reinvent. I've currently got 9609 mails in my Gmail spam folder, only 78 of which I've flagged by hand, and in the four months since I started this MeTa thread I've only experienced two false positives. And I didn't have to write a single regex to make it happen.
posted by flabdablet at 5:59 AM on May 12, 2008
Regexes are extremely compact representations of discrete finite state machines. If you don't understand the math behind that, it can be hard to use them. Also, there are a lot of things that DFA's cannot do which you might think they could.
I entirely disagree. Personally, I was able to write rather complex regexes long before knowing how finite state machines worked. Although, as you point out, knowing how they work can be instructive in understanding why there are certain things you cannot do with regular expressions, or cannot do easily.
posted by grouse at 6:02 AM on May 12, 2008
Response by poster: jepler - "Spam filtering is really hard, and I suspect that almost nobody does it by manually writing regular expressions any longer."
jenkinsEar - "Nthing the idea that regex-based spam hunting is the path to perdition..."
Flunkie - "Get yourself a good, premade, Bayesian spam filter."
I completely agree, and would love to follow the above advice. However, I'm somewhat of a "noob" at this company, and I seriously doubt anyone higher up or in engineering is going to listen to my advice. It's also quite possible that we already have some sort of Bayesian (or "learning") component built into our spam filter, and I'm just not aware of it. (But if so, why are we wasting our time updating regex filter expressions?)
The filtering solution we use (and provide to clients) was something developed in-house, so I guess I'm going to have to poke around and start asking questions about why we do things the way we do. Thanks for all the great advice so far everyone! :)
posted by jmnugent at 6:04 AM on May 12, 2008
I don't know which tool (if any) you're using to check the results of your regexes, but the online one I use (and swear by) is this one.
All that being said, I agree completely with the above posts advising against using regex for spam filters. If you don't understand the expression thoroughly, it's way too easy to end up blocking regular mail as a side effect. If you were my sys-admin blocking my email due to a too-generic regex, there'd be hell to pay :)
posted by cgg at 7:20 AM on May 12, 2008
RegExes are "idiot-proof" in the same way that prescription bottles are "child-proof", and there's not really much to do about it except break down and learn them.
posted by toomuchpete at 7:30 AM on May 12, 2008
Your time would be better spent convincing your boss to use a real spam filter... I don't know what the filter du jour is these days, but the last time I installed one I used SpamBayes and it did a great job. These filters use stronger magic than regular expressions.
At the risk of sounding like a jerk: You really shouldn't be adding regular expressions to your spam filter if you don't know how they work. The business of filtering spam without trashing ham using only regular expressions is error-prone and difficult for experts, let alone novices. It is not the best way to block spam.
As others have pointed out, (\w{4}\s+\w\s+\w+) is going to match a lot of things you don't want it to in addition to "Earn a degree", and you'll make a mess of incoming email if you add this to the filter.
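A quick way to see the problem is to run the pattern against some ordinary subject lines. This is a small sketch using Python's `re` module; the subject lines are invented for illustration.

```python
import re

# The generated pattern quoted above.
pattern = re.compile(r"(\w{4}\s+\w\s+\w+)")

# Hypothetical subject lines: one spam, the rest ordinary mail.
subjects = [
    "Earn a degree",
    "Have a nice weekend",
    "Need a lift tomorrow?",
    "Meeting notes attached",
]


def flagged(subject):
    """True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


for s in subjects:
    print(f"{s!r}: {'BLOCKED' if flagged(s) else 'ok'}")
```

Any four-letter run followed by a one-letter word and another word is caught, so "Have a nice weekend" is blocked along with the spam.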
posted by qxntpqbbbqxl at 9:07 AM on May 12, 2008
Using a live regex editor will really help you understand what you're doing. I use Rubular regularly, and it saves my sanity. You paste the string you want to match in at the bottom, then construct your regex bit by bit until it matches correctly. Lastly, you perturb your test string a little to find those edge cases and false positives.
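The same build-and-perturb loop can be sketched offline in a few lines of Python. This is only an illustration of the workflow, not how Rubular works; the `check` helper and the test strings are hypothetical.

```python
import re


def check(pattern, should_match, should_not_match):
    """Return (missed, false_positives) for one candidate pattern."""
    regex = re.compile(pattern)
    missed = [s for s in should_match if not regex.search(s)]
    false_positives = [s for s in should_not_match if regex.search(s)]
    return missed, false_positives


# Hypothetical test strings; perturb these to hunt for edge cases.
spam = ["cheap watches", "Cheap Watches", "CHEAP   WATCHES today"]
ham = ["I cheaply watch TV", "cheap flights"]

# First attempt: the literal phrase. Second: ignore case, allow extra spaces.
for candidate in [r"cheap watches", r"(?i)cheap\s+watches"]:
    print(candidate, check(candidate, spam, ham))
```

The first candidate misses the capitalised and extra-space variants; the second catches all three without flagging the two ordinary lines.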
Really, regexes are a very simple concept, it's just that the syntax gets ugly after a while. Once you learn them well, you'll be amazed at how handy they are.
Also see
posted by chrisamiller at 9:45 AM on May 12, 2008
Free regex builder software: Rad Software Regular Expression Designer.
Anyone know of a REVERSE regex builder? Put in the regex and it gives you possible matches?
posted by elle.jeezy at 10:54 AM on May 12, 2008
Anyone know of a REVERSE regex builder? Put in the regex and it gives you possible matches?
I don't know of any programs that do this, but this is an example implementation. An algorithm to accomplish this (actually from the automaton to the set of strings) is described in Introduction to Automata Theory, Languages, and Computation. Don't let the long title scare you. Most of the basic algorithms for manipulating finite automata and regular expressions are pretty simple.
posted by rdr at 11:39 AM on May 12, 2008
I'm sorry. The algorithm described in my link goes from an automaton to a regular expression. Going from a regular expression to its matches should be much simpler.
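One crude way to go from a regular expression to sample matches is brute force: try every short string over a small alphabet and keep the ones that match. The sketch below is not the textbook automaton-based algorithm, and the running time grows exponentially with the length limit, so it is only practical for tiny alphabets and short strings. The function name and parameters are made up for this example.

```python
import re
from itertools import product


def sample_matches(pattern, alphabet, max_len):
    """Brute force: return every string over `alphabet`, up to `max_len`
    characters long, that the whole pattern matches."""
    regex = re.compile(pattern)
    found = []
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if regex.fullmatch(candidate):
                found.append(candidate)
    return found


print(sample_matches(r"ab*c?", "abc", 3))
# ['a', 'ab', 'ac', 'abb', 'abc']
```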
posted by rdr at 11:53 AM on May 12, 2008
I use kregexpeditor (it is part of KDE). It lets you enter a string that you want to match and then lets you know if the regex you have entered actually matches.
One thing that it does is give a natural language description of what the regex will match, so for \w{4}\s+\w\s+\w+ it gives something like:
A word character repeated exactly 4 times followed by a space character repeated at least one time followed by one word character followed by a space character repeated at least one time...
It puts these up as you type in your regex so it gives you quick feedback.
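To give a feel for how such a description can be produced, here is a toy Python sketch that handles only \w, \s and \d with simple quantifiers. It is an assumption-laden illustration, not kregexpeditor's implementation, and it raises an error on anything else.

```python
import re

# Only the handful of tokens needed for the pattern quoted above.
NAMES = {r"\w": "word character", r"\s": "space character", r"\d": "digit"}
TOKEN = re.compile(r"(\\[wsd])(\{(\d+)\}|\+|\*|\?)?")


def describe(pattern):
    """Describe a pattern built only from the tokens in NAMES."""
    parts = []
    pos = 0
    while pos < len(pattern):
        m = TOKEN.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at position {pos}")
        name, quant, count = NAMES[m.group(1)], m.group(2), m.group(3)
        if quant is None:
            parts.append(f"one {name}")
        elif count is not None:
            parts.append(f"a {name} repeated exactly {count} times")
        elif quant == "+":
            parts.append(f"a {name} repeated at least one time")
        elif quant == "*":
            parts.append(f"a {name} repeated any number of times")
        else:  # quant == "?"
            parts.append(f"an optional {name}")
        pos = m.end()
    return " followed by ".join(parts)


print(describe(r"\w{4}\s+\w\s+\w+"))
```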
There is also Regex Coach. I used it in the past but don't remember much about it.
posted by bdc34 at 4:02 PM on May 12, 2008
I want to be a dissenting opinion on the Friedl book. Everyone quotes it as the definitive text, but I honestly don't think it's a good book for most people approaching this subject for the first time. On the other hand, it's not for people approaching the subject, is it? It's for people "Mastering" the subject.
Noted PerlMonk japhy has some text from his sadly unfinished book on regexes here. Might be useful.
Anyway, nthing the idea that Bayesian is the way to go, not regular expressions.
And if I had a tip, it would be this. There are two parts to constructing a regular expression: figuring out what you want to match, and writing the regex; it may be that the former is actually harder, and you know what? There's always someone around who can help you with the latter.
Like in this example:
>If all you want to match is "Earn a degree", then the expression "Earn a degree" will work just fine. But it won't match "Earn an amazing degree!!!"
The first task you need to set yourself is to spell out logically that you want to match "earn a degree" or "earn an online degree" or a bunch of other similar phrases.
You have to reduce that to something more formal -- 'I want to match "earn", followed by a space, the letter a, and sometimes an "n" and sometimes a space, and then there can be some other stuff, but the word degree will definitely be in the rest of the line somewhere'.
If you can get to that point, any geek worth his salt can construct the regex for you.
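As a rough illustration, the spelled-out description above might translate into something like the following Python sketch. The exact pattern is one possible reading of that description, and the sample subject lines are invented.

```python
import re

# "earn", a space, "a", sometimes an "n", then other stuff,
# with "degree" somewhere in the rest of the line.
EARN_DEGREE = re.compile(r"\bearn\s+an?\b.*\bdegree", re.IGNORECASE)


def is_earn_degree(subject):
    """True if the subject fits the spelled-out description."""
    return EARN_DEGREE.search(subject) is not None


for s in ["Earn a degree", "Earn an amazing degree!!!", "earn an online degree",
          "Learn about degrees", "We earn a living"]:
    print(f"{s!r}: {is_earn_degree(s)}")
```

The word-boundary markers (`\b`) keep "Learn about degrees" from matching, and "We earn a living" is skipped because "degree" never appears later in the line.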
posted by AmbroseChapel at 6:45 PM on May 12, 2008
This thread is closed to new comments.
posted by the dief at 4:08 AM on May 12, 2008 [1 favorite]