mail usprint this pagerss feed

www.liip.ch

Liip is hiring!

The greedyness of non-greedy regular expressions

We all love regular expressions, don't we? Well I usually do, but recently I lost quite a lot of time to find out this bit of particular behavior, so i thought i might share this.

Greedyness is basically the question whether an expression matches as much as it can, or as little as necessary. You find an excellent explanation and examples here. The default is to match as much as possible. The syntax *? , +? and ?? makes the quantifiers non-greedy. However, even the non-greedy expressions may match more than you expect.

Consider this String:

<a href="/link">Label</a> <a href="/en">[en]</a> <a href="/fr">[fr]</a>

And the regular expression

<a href=".*?">\[[a-z]*\]</a>

The ".*?" after the href=" tells the regexp engine to match non-greedy. If we omit the non-greedy order, the expression would generate one single match of the whole string. The non-greedy version generates two matches. But the first match is not the absolute minimum match on the input string:

<a href="/link">Label</a> <a href="/en">[en]</a>

The non-greedyness seems to result in the expression generating the maximum number of matches, but not necessarily the smallest possible match strings. A fixed regular expression to only match the language links would be this expression, which explicitely excludes the closing quotation mark from the match.

<a href="[^"]*">\[[a-z]*\]</a>

I played around with this nice web tool to test regular expressions while writing this post.

Comments (8) |  Permalink

Comments

michael @ 24.07.2009 11:30 CEST
For testing regexps is the regex coach simply the best tool, I think:
http://weitz.de/regex-coach/
Toby @ 24.07.2009 12:24 CEST
I love regex, too, mut there are better tools for the shown task: Use DOM to extract links, not PCRE.
david @ 24.07.2009 13:16 CEST
thanks toby, a good point to add. in general, parsing html in regexp is not fun and tends to break because of the nested structure of html. i was fixing existing code where the original developper did not use dom and we expect the html to be invalid in all sorts.

as the regexp behaviour i described is not only true for parsing html but for everything you do with regexp, i still hope this post helps some people.
Lapis @ 24.07.2009 16:51 CEST
Do you know FluentDom (fluentdom.org)?
If you are planning a refactoring of your code, give it a try! It is really easy to use.
It is a jQuery like fluent interface using xpath expressions.
david @ 24.07.2009 17:11 CEST
i love jquery, and i read about fluentdom.org, but had not yet time to give it a try. sounds very promising.
Steve Clay @ 25.07.2009 23:39 CEST
Good to know. I assumed non-greedy implied minimal length, too.

I love rework for pattern work, even if the PHP code output isn't always accurate.
Olly @ 30.07.2009 02:17 CEST
I used to use this regex, which should match basically any case that one can encounter in a website:
"/(]*hrefs*=s*(['"]))s*(.+(?=\2))/isU"

It works with single and double quotes, random spaces, attributes between the '
Michael @ 30.06.2010 18:09 CEST
Thank you. I've been working on this for a while, but I couldn't "get it" until I read your post. Problem solved.

add a comment

Your email adress will never be published.
Comment spam will be deleted!

For Spammers Only
Name*
E-Mail
URL
Comment*
Notify me via E-Mail when new comments are made to this entry
Remember me (needs cookies)