Parsing non-well-formed XML documents considered harmful …

… to no surprise, actually ;) My patch from the other day to allow parsing of non-well-formed XML documents in PHP 5.1 seems to have caused some disagreement about whether that should be done at all (see the comments). Ian Eure and David Hay think it should be removed completely; for Tim Bray it’s shooting yourself in the foot (at least if you use this in business-critical applications); and Daniel Veillard insists on fixing the other side instead of using this feature.

I agree with Tim and Daniel completely. Don’t use this feature on a regular basis or even by default. Fix the other side if you can, or simply reject the document. But if you use it, use it carefully and only when you really know what you are doing. And validate the XML after it has been “recovered”.
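As a minimal sketch of what “carefully” could look like: the snippet below opts in to recovery for a single document and collects the parser errors, so the caller can still decide to reject the result. The helper name `loadRecovered` and the sample feed string are made up for illustration; `DOMDocument::$recover`, `libxml_use_internal_errors()` and `libxml_get_errors()` are the existing PHP calls it assumes.

```php
<?php
// Hypothetical helper (not part of PHP or libxml2): parse in recovery
// mode and hand back both the document and every error libxml2 reported.
function loadRecovered($xml)
{
    // Collect libxml errors ourselves instead of having PHP emit warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    $doc->recover = true; // explicit opt-in; recovery is off by default
    $loaded = $doc->loadXML($xml);

    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return array($loaded ? $doc : null, $errors);
}

// Illustrative input: an unescaped ampersand makes this non-well-formed.
list($doc, $errors) = loadRecovered('<rss><title>Tom & Jerry</title></rss>');

if (count($errors) > 0) {
    // The document was only "recovered"; treat its content as suspect.
    foreach ($errors as $error) {
        echo trim($error->message), "\n";
    }
}
```

A non-empty error list means the input was not well-formed, so rejecting it at that point is still an option. If a schema or DTD for the expected format is available, the recovered document could additionally be checked with `DOMDocument::schemaValidate()` or `DOMDocument::validate()` before it is used.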

To Ian and David: It’s not enabled by default, and PHP still throws a bunch of warnings for each error. I think it should be the decision of the PHP application developer whether to use this feature, not PHP’s decision whether such a tool is provided at all. Nothing is forced upon anyone, and the default behaviour is still to reject such documents. And you should discuss this with Daniel and ask him to remove that feature from libxml2 in the first place ;)

By the way, I enabled this feature because I had a little application that parsed RSS feeds which constantly contained non-well-formed HTML code (not properly escaped, et al.). I couldn’t fix the other side, so I had to deal with HTML tag soup embedded in XML, and this looked like an easy way to go.

Update: There is a nice discussion at Phil Ringnalda’s blog as well, including a nice quote: “If there’s one thing that the RSS Draconian Wars taught us, it’s that you don’t want to be involved in any discussion of XML and error handling.” Maybe I should have known that before ;)

To be honest, I support draconian parsing. I dislike the idea that we would condone mangled or ill-formed XML; it’s like saying we support badly written HTML or bad grammar in English. Sure, it might make life a little harder, but things are meant to be done right for good reasons.

Couldn’t the Tidy extension be used if this problem is encountered?

I still *strongly* disagree with this functionality being present *at all*. This is an XML parser, not a tag-soup parser. The solution is to complain to whoever is providing and/or generating the invalid RSS and get them to fix their broken software. Or find an alternate feed that isn’t broken.

But if you’re committed to leaving it in, please at least make it painfully obvious that what the developer is asking for is *wrong*. A big fat E_WARNING message when this functionality is invoked would be a good start.

Ian Eure, aren’t you a little fanatical?

Anything that’s in touch with humans is chaotic. Would you expect XML to override this? ;)

Yep, when it comes to standards, I am.

Would you stand for it if you bought a 1/8″ bolt at a hardware store, then found that it wasn’t 1/8″ when you tried to use it? How is this any different? Something is being advertised as XML when it is not.

It’s not like it’s hard to fix broken XML. What do you think is harder, fixing your software to nest and close tags properly, or hacking a parser to try and cope with this kind of braindamage?

If you can’t get something as basic as nesting and closing tags right, you have no business working with XML in the first place. The requirement for XML documents to be well-formed is a deliberate requirement, intended to force people to adhere to the standard. And as barriers to entry go, it’s pretty low. It’s really saying something if people can’t even get this right.

As I said, this way lies madness. If parsers start accepting malformed XML, there is no reason for authors to write well-formed XML documents, and we end up in the exact same place as the HTML tag-soup mess the web has become.

Not just “considered”, it *is* harmful. HTML tag soup is the result; this is the path to XML tag soup.

Just as the case insensitivity of XML in PHP4 was a bad idea, XML tag soup in PHP5 is a bad idea.