The greediness of non-greedy regular expressions

We all love regular expressions, don’t we? Well, I usually do, but I recently lost quite a lot of time figuring out one particular bit of behavior, so I thought I would share it.

Greediness is basically the question of whether an expression matches as much as it can or as little as necessary. You find an excellent explanation and examples here. The default is to match as much as possible. The syntax *?, +? and ?? makes the quantifiers non-greedy. However, even non-greedy expressions may match more than you expect.
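To make the difference concrete, here is a minimal sketch in Python. The sample text is made up for illustration and the variable names are my own; other regex flavors such as PCRE behave the same way for these quantifiers.

```python
import re

# Made-up sample text with two quoted words.
text = 'say "one" and "two"'

# Greedy: .* grabs as much as it can, so the match runs
# from the first quotation mark to the last one.
greedy = re.search(r'".*"', text).group()

# Non-greedy: .*? stops at the first closing quotation mark
# that lets the rest of the pattern succeed.
lazy = re.search(r'".*?"', text).group()

print(greedy)  # "one" and "two"
print(lazy)    # "one"
```

Both searches start at the same place, the first quotation mark. Only the end of the match differs.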

Consider this String:

And the regular expression

The “.*?” after the href=” tells the regexp engine to match non-greedily. If we omit the non-greedy modifier, the expression generates one single match spanning the whole string. The non-greedy version generates two matches. But the first match is not the absolute minimum match on the input string:

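The non-greediness seems to result in the expression generating the maximum number of matches, but not necessarily the smallest possible match strings. The reason is that the engine still starts each match at the leftmost position where a match is possible, and a non-greedy quantifier only extends the match as little as needed from that starting point; it never moves the start forward to find a shorter match. A fixed regular expression to only match the language links would be this expression, which explicitly excludes the closing quotation mark from the match.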
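The effect can be reproduced with a small Python sketch. Note that the input string and the three patterns below are hypothetical stand-ins of my own, not the original examples from this post: I assume one ordinary link followed by two language links whose URLs contain /lang/.

```python
import re

# Hypothetical input: one ordinary link, then two language links.
html = (
    '<a href="/about">About</a> '
    '<a href="/lang/en">English</a> '
    '<a href="/lang/de">Deutsch</a>'
)

# Greedy: a single match from the first href to the last quotation mark.
greedy = re.findall(r'href=".*/lang/.*"', html)

# Non-greedy: two matches, but the first one starts at the "about" link
# and runs on until it finds /lang/, so it spans two links.
lazy = re.findall(r'href=".*?/lang/.*?"', html)

# Fixed: [^"]* cannot cross a quotation mark, so each match
# stays inside a single attribute value.
fixed = re.findall(r'href="[^"]*/lang/[^"]*"', html)

for name, matches in (("greedy", greedy), ("lazy", lazy), ("fixed", fixed)):
    print(name, len(matches), matches)
```

With this input, the non-greedy pattern returns href="/about">About</a> <a href="/lang/en" as its first match, while the fixed pattern returns only href="/lang/en" and href="/lang/de".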

I played around with this nice web tool to test regular expressions while writing this post.

For testing regexps, the Regex Coach is simply the best tool, I think:
http://weitz.de/regex-coach/

I love regex, too, but there are better tools for the task shown: use DOM to extract links, not PCRE.

thanks toby, a good point to add. in general, parsing html with regexp is not fun and tends to break because of the nested structure of html. i was fixing existing code where the original developer did not use dom, and we expect the html to be invalid in all sorts of ways.

as the regexp behaviour i described is not only true for parsing html but for everything you do with regexp, i still hope this post helps some people.

Do you know FluentDom (fluentdom.org)?
If you are planning a refactoring of your code, give it a try! It is really easy to use.
It is a jQuery like fluent interface using xpath expressions.

i love jquery, and i read about fluentdom.org, but have not yet had time to give it a try. sounds very promising.

Good to know. I assumed non-greedy implied minimal length, too.

I love rework for pattern work, even if the PHP code output isn’t always accurate.

I used to use this regex, which should match basically any case that one can encounter in a website:
“/(<a[^>]*href\s*=\s*(['"]))\s*(.+(?=\2))/isU”

It works with single and double quotes, random spaces, attributes between the ‘

Thank you. I’ve been working on this for a while, but I couldn’t “get it” until I read your post. Problem solved.

Thank you. You can try using a live regex tester. It’s good for real-time testing.