Little Regex Help?
March 10, 2010 9:15 AM

I need a little help with regular expressions.

I want to extract a person's date of birth and date of death (bolded) from a bio like this:
Dodgson, Charles Lutwidge. Lorem ipsum dolor sit amet, born near Daresbury, Cheshire, England, January 27, 1832. Ut dictum interdum ante commodo viverra. Phasellus ultrices interdum feugiat. He left Rugby at the end of 1849. Proin ipsum quam, mattis sed blandit nec, pellentesque a mauris. Went on to Oxford in January 1851. Curabitur porta blandit lobortis. Sed vel tellus nibh, ut ullamcorper velit. Sed ac arcu ac nisi tincidunt auctor vel et nisi. Quisque et nulla et nisi tristique pellentesque nec non sapien. Nunc accumsan posuere est sit amet elementum. Mauris neque justo, until his death in Surrey, England, on January 14, 1898, viverra nec diam.
I wrote a regular expression (born.+[\d]{4}) that finds "born" and ends in a four-digit number, but it finds the last four-digit number in the entire string, and I need to find the next four-digit number after "born."

(I've already got a regular expression that will match the date once I get "born near Daresbury, Cheshire, England, January 27, 1832.")
posted by kirkaracha to Computers & Internet (11 answers total)
 
born.+?[\d]{4}
posted by sanko at 9:22 AM on March 10, 2010


Yeah, that's because regular expression quantifiers are greedy by default - they want to match the longest string possible.

According to this handy dandy guide,
Place a question mark after the quantifier to make it lazy. <>

posted by muddgirl at 9:23 AM on March 10, 2010


You don't mention the language you're using, but the ? operator usually turns off greedy matching: (born.+?[\d]{4}) should match the first year after "born".
posted by qxntpqbbbqxl at 9:23 AM on March 10, 2010


I always have a problem getting brackets to render correctly. Stupid preview!

Anyway, what sanko said - you need to add a question mark after the + sign to make the + sign "lazy" instead of "greedy".
posted by muddgirl at 9:24 AM on March 10, 2010
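
To make the greedy/lazy difference concrete, here is a minimal sketch in Python. The choice of Python is an assumption (no language has been named at this point in the thread), and the `bio` string is a shortened stand-in for the one in the question. The `+?` syntax behaves the same way in most Perl-style regex engines.

```python
import re

# Shortened, illustrative version of the bio from the question.
bio = ("Dodgson, Charles Lutwidge. Lorem ipsum, born near Daresbury, Cheshire, "
       "England, January 27, 1832. He left Rugby at the end of 1849. "
       "Went on to Oxford in January 1851. Until his death in Surrey, "
       "England, on January 14, 1898, viverra nec diam.")

# Greedy: .+ consumes as much as it can, then backs off only far enough
# to find four digits, so the match ends at the LAST year in the string.
greedy = re.search(r"born.+\d{4}", bio).group(0)

# Lazy: .+? consumes as little as it can, so the match ends at the
# FIRST four-digit number after "born".
lazy = re.search(r"born.+?\d{4}", bio).group(0)

print(greedy[-4:])  # 1898
print(lazy)         # born near Daresbury, Cheshire, England, January 27, 1832
```

Note that "27" is skipped in the lazy case because it is only two digits; the pattern insists on four in a row.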


Great, thanks!
posted by kirkaracha at 9:43 AM on March 10, 2010


The Regex Coach lets you interactively build and debug regexps.
posted by willem at 10:07 AM on March 10, 2010


I know the question has been resolved, but there is also this neat text-to-regex tool for beginners, tired people, or experts with headaches ... http://www.txt2re.com/. It works pretty well, but the layout might cause a headache in and of itself ;)
posted by shownomercy at 10:45 AM on March 10, 2010


OK, now I'm stuck getting just the date part in PHP. This works fine:

$get_birth_date_string = preg_match('(born.+?[\d]{4})',$bio,$matches);
$birth_date_string = $matches[0];

Produces: "born near Daresbury, Cheshire, England, January 27, 1832"

I just want "January 27, 1832." I tried this:

$get_birth_date = preg_match('([a-zA-Z]{3,9}\ \d|\d\d\,\ \d\d\d\d)',$birth_date_string,$matches);
/*
three to nine letters (May/September), a space, a one- or two-digit number, a comma, a space, and a four-digit number
*/
$birth_date = $matches[0];

Which gives me "January 27." I'm not sure why because I'm only using one set of parentheses and $matches[0] should match the entire date.
posted by kirkaracha at 2:54 PM on March 10, 2010


Which gives me "January 27." I'm not sure why because I'm only using one set of parentheses and $matches[0] should match the entire date.

Because the | isn't alternating between the atoms you think it is: it splits the whole pattern in two, not just the pieces right next to it. Regrouping your regex (and removing the pointless backslashes):

(([a-zA-Z]{3,9} \d)|(\d\d\, \d\d\d\d))

That's what the engine sees. What you mean is this:

([a-zA-Z]{3,9} (\d|(\d\d\), \d\d\d\d)

Or I'd just write:

(\w{3,9} \d{1,2}, \d{4})
posted by sbutler at 4:13 PM on March 10, 2010


Sorry, I've got a stray '(' in my second pattern. I'll let you figure out where it is :)
posted by sbutler at 4:17 PM on March 10, 2010
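
For anyone following along, here is a small sketch of the three patterns side by side. It is written in Python rather than the PHP used above (an assumption made for illustration), so the outer parentheses, which PHP's preg_match treats as pattern delimiters, are dropped. The "grouped" pattern is one reading of what the second pattern was meant to be once the stray parenthesis is removed; treat that as an assumption too.

```python
import re

birth_date_string = "born near Daresbury, Cheshire, England, January 27, 1832"

# Ungrouped: | splits the WHOLE pattern, so this means
# "letters, space, one digit" OR "two digits, comma, space, four digits".
ungrouped = re.search(r"[a-zA-Z]{3,9} \d|\d\d, \d\d\d\d", birth_date_string)

# Grouped: the alternation is confined to the day of the month,
# so the comma and the year are required in every case.
grouped = re.search(r"[a-zA-Z]{3,9} (\d|\d\d), \d\d\d\d", birth_date_string)

# Compact form using counted repetition, as in the last pattern above.
# Note: \w also matches digits and underscores, so it is looser than [a-zA-Z].
compact = re.search(r"\w{3,9} \d{1,2}, \d{4}", birth_date_string)

print(ungrouped.group(0))  # stops short of the year
print(grouped.group(0))    # January 27, 1832
print(compact.group(0))    # January 27, 1832
```

The ungrouped version succeeds as soon as its first alternative matches, which is why the year never makes it into the result.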


Works great, thanks!
posted by kirkaracha at 7:13 PM on March 10, 2010

