^FREE.*(books|degree|sex)a+[0-9]*\s+.*
Then get some data that this regexp is being used for, and write your own little simulator program, sticking parentheses into the regexp (I'm making this syntax up):
exp = new Regexp ( "(a+)([0-9]*)(\s+)(.*)" );
testString = "Green eggs and ham aaa0293     super!";
matchInfo = exp.matchAgainst ( testString );
if ( matchInfo.matches() ) {
    for ( i = 0; i < matchInfo.matchPartCount(); i++ ) {
        part = matchInfo.matchPart(i);
        print ( "Part " + (i+1) + ", len " + part.length() + ": " + part );
    }
} else {
    print ( "Does not match" );
}
This program would print out:
Part 1, len 3: aaa
Part 2, len 4: 0293
Part 3, len 5:
Part 4, len 6: super!
Then fiddle around with the input string a little at a time, seeing how changes affect the data that's matched, or even whether it's matched at all.

# Define regexes to match main quiz result line and folded title/author lines
re_word='[^ ]+'                  #run of nonspaces
re_words="$re_word( $re_word)*"  #words separated by at most one space
re_num='[0-9.,]+'                #run of digits, points or commas
re_voice="^ +(#?)"               #leading spaces, optional hash
re_quiz_no="($re_num) +"         #number, trailing spaces
re_lang="(EN|SP)  +"             #at least two trailing spaces
re_title="($re_words)  +"        #words with two or more trailing spaces
re_author="($re_words) +"        #words with trailing spaces
re_il="(LG|MG|UG) +"
re_bl="($re_num) +"              #number with at least one trailing space
re_points="($re_num) +"          #same again
re_wcnt="($re_num) +"            #same again
re_fnf="($re_word)$"             #word at end of line
re_quiz="$re_voice$re_quiz_no$re_lang$re_title$re_author$re_il$re_bl$re_points$re_wcnt$re_fnf"
re_title2="($re_words) "                 #alternate pattern: only one trailing space
re_author2="($re_word,( $re_word)*) +"   #alternate pattern: word, comma, maybe more names, spaces
re_quiz2="$re_voice$re_quiz_no$re_lang$re_title2$re_author2$re_il$re_bl$re_points$re_wcnt$re_fnf"
re_ext_title="^ {17,19}($re_words)$"
re_ext_author="^ {62,71}($re_words)$"
re_ext_title_author="^ {17,19}($re_words) +($re_words)$"
re_quiz, re_quiz2 and re_ext_title_author are the only ones that get used later on. All the others are intermediates done purely for my own benefit as a human reader. As a non-coder, once you know that bash will replace an instance of $some_variable with the contents of some_variable, you won't be terribly surprised to hear that re_quiz matches any line consisting of a voice flag, a quiz number, a language code, a title, an author, an IL (whatever that is), a BL (likewise), a number of points, a word count and a fiction/nonfiction flag in that order. The alternative is to write the whole thing out in one piece as
^ +(#?)([0-9.,]+) +(EN|SP)  +([^ ]+( [^ ]+)*)  +([^ ]+( [^ ]+)*) +(LG|MG|UG) +([0-9.,]+) +([0-9.,]+) +([0-9.,]+) +([^ ]+)$
which is really no fun at all. I am a coder, and I wouldn't like having to do that for a living.
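Here is a minimal bash sketch of how a pattern assembled from named pieces like these might be applied to a single line. To keep it short it uses a cut-down pattern, re_demo, built from only a handful of the pieces. The names re_demo and parse_line are made up for this example, and the sample line is invented for illustration rather than taken from real quiz output.

```shell
#!/usr/bin/env bash
# Sketch only: re_demo, parse_line and the sample lines are hypothetical.

re_word='[^ ]+'                  # run of nonspaces
re_words="$re_word( $re_word)*"  # words separated by at most one space
re_num='[0-9.,]+'                # run of digits, points or commas
re_voice="^ +(#?)"               # leading spaces, optional hash
re_quiz_no="($re_num) +"         # number, trailing spaces
re_author="($re_words) +"        # words with trailing spaces
re_il="(LG|MG|UG) +"
re_fnf="($re_word)$"             # word at end of line

# Cut-down pattern: voice flag, quiz number, author, IL, fiction/nonfiction flag
re_demo="$re_voice$re_quiz_no$re_author$re_il$re_fnf"

parse_line() {
  local line=$1
  # The pattern variable is left unquoted so bash treats it as a regex.
  if [[ $line =~ $re_demo ]]; then
    # BASH_REMATCH[4] is the inner "( word)*" group of re_words, so it is skipped.
    printf 'voice=%s quiz=%s author=%s il=%s fnf=%s\n' \
      "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}" \
      "${BASH_REMATCH[5]}" "${BASH_REMATCH[6]}"
  else
    echo "Does not match"
  fi
}

parse_line '   #1234  Seuss, Dr.  LG  F'   # prints: voice=# quiz=1234 author=Seuss, Dr. il=LG fnf=F
parse_line 'Green eggs and ham'            # prints: Does not match
```

Note that every pair of parentheses counts as a capture group, including the one buried inside re_words, which is why the group numbers skip from 3 to 5.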
' Define regex patterns to match main quiz result line and folded title/author lines
repWord = "[^ ]+"                           ' run of nonspaces
repWords = repWord & "( " & repWord & ")*"  ' words separated by at most one space
repNum = "[0-9.,]+"                         ' run of digits, points or commas
repVoice = "^ +(#?)"                        ' leading spaces, optional hash
repQuizNum = "(" & repNum & ") +"           ' number, trailing spaces
repLang = "(EN|SP)  +"                      ' at least two trailing spaces
repTitle = "(" & repWords & ")  +"          ' words with two or more trailing spaces
repAuthor = "(" & repWords & ") +"          ' words with trailing spaces
repIL = "(LG|MG|UG) +"
repBL = "(" & repNum & ") +"                ' number with at least one trailing space
repPoints = "(" & repNum & ") +"            ' same again
repWdCnt = "(" & repNum & ") +"             ' same again
repFnf = "(" & repWord & ")$"               ' word at end of line
set reQuiz = new regexp
reQuiz.pattern = repVoice & repQuizNum & repLang & repTitle & repAuthor & _
    repIL & repBL & repPoints & repWdCnt & repFnf
repTitle2 = "(" & repWords & ") "                       ' alternate pattern: only one trailing space
repAuthor2 = "(" & repWord & ",( " & repWord & ")*) +"  ' alternate pattern: word, comma, maybe more names, spaces
set reQuiz2 = new regexp
reQuiz2.pattern = repVoice & repQuizNum & repLang & repTitle2 & repAuthor2 & _
    repIL & repBL & repPoints & repWdCnt & repFnf
set reExtTitle = new regexp
reExtTitle.pattern = "^ {17,19}(" & repWords & ")$"
set reExtAuthor = new regexp
reExtAuthor.pattern = "^ {62,71}(" & repWords & ")$"
set reExtTitleAuth = new regexp
reExtTitleAuth.pattern = "^ {17,19}(" & repWords & ") +(" & repWords & ")$"
As well as making things more readable, this by-pieces technique makes regex debugging a lot easier. I can build my regexes up from tested pieces, so I know that as soon as I stop getting matches I'm expecting to get, the last piece I added is probably where the trouble is - and because that piece generally has its own name, it's easy to find and fix the fault.
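The build-up-from-tested-pieces idea can be roughly sketched in bash as below. The helper first_failing_piece, the name=pattern entry format and the sample lines are all assumptions made for this example, not part of the scripts above. It adds one named piece at a time to the pattern and reports the name of the first piece whose addition stops the line from matching.

```shell
#!/usr/bin/env bash
# Sketch only: first_failing_piece and the "name=pattern" format are hypothetical.

# Usage: first_failing_piece LINE name=pattern [name=pattern ...]
first_failing_piece() {
  local line=$1; shift
  local pattern='' entry name
  for entry in "$@"; do
    name=${entry%%=*}        # text before the first '='
    pattern+=${entry#*=}     # append the text after the first '='
    # Test the pattern built so far; it is only anchored at the start,
    # so a partial pattern can still match the front of the line.
    if ! [[ $line =~ $pattern ]]; then
      echo "$name"
      return 0
    fi
  done
  echo "all pieces match"
}

re_num='[0-9.,]+'            # run of digits, points or commas
pieces=(
  "voice=^ +(#?)"
  "quiz_no=($re_num) +"
  "il=(LG|MG|UG) +"
  "bl=($re_num) +"
)

# Invented sample lines; "XG" is not one of the allowed IL codes.
first_failing_piece '   #1234  XG  2.5  ' "${pieces[@]}"   # prints: il
first_failing_piece '   #1234  LG  2.5  ' "${pieces[@]}"   # prints: all pieces match
```

Because each piece has its own name, the output points straight at the part of the pattern to look at, which is the same benefit described above for doing it by hand.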
posted by the dief at 4:08 AM on May 12, 2008