Parsing non-well-formed XML documents in PHP 5.1

DISCLAIMER: Do not use this feature unless you really know what you are doing. Fixing the other side or rejecting the document is almost always a better option than using this feature. And above all, do not set this property by default. But there are circumstances where this feature can be useful, and that’s the reason why I enabled it.
See also the comments and follow up post for more discussion. (Disclaimer added 20 August 2004)

I just committed a patch to the 5.1 branch of PHP that allows you to parse non-well-formed XML documents and adds the missing elements, e.g. missing closing tags.

This can be very useful if you have to parse XML documents over which you have no influence. Of course, libxml2 just has to guess what’s wrong, so it’s not always perfect, but for simple errors it’s certainly good enough.

To use this feature, you just have to set the DomDocument property recover to true before loading the XML document; loading will then always return something more or less useful:

$xml = new DomDocument();
// recover must be set before the document is loaded
$xml->recover = true;
$xml->loadXML('<root><tag>hello world</root>');
echo $xml->saveXML();

which will output (besides a bunch of errors, which will still show up):

<?xml version="1.0"?>
<root><tag>hello world</tag></root>
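
If you would rather inspect the errors than have them printed as warnings, something like the following might work. It is only a sketch: it assumes the libxml error-handling functions libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() are available in your PHP build, and the exact messages depend on the libxml2 version.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$xml = new DOMDocument();
$xml->recover = true; // ask libxml2 to repair what it can
$xml->loadXML('<root><tag>hello world</root>');

// Each entry is a LibXMLError object describing one problem the parser found.
$errors = libxml_get_errors();
libxml_clear_errors();

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The repaired document is still available afterwards.
echo $xml->saveXML();
```

This way the script can at least log what was guessed, or refuse the document after all when the error list looks too serious.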

I added this to the libxml2 XML parser to be able to recover broken files, but please please, try to get the other side fixed rather than setting this by default. XML was successful because all parsers actually enforced the well-formedness rules. You don’t want XML data flow to be as messy as HTML input, so do your share to avoid the problem! If there is only so much I can ask the PHP community, then let it be that you won’t abuse this and keep XML clean.

Thanks in advance,


Hi Daniel

Thanks for your wise words. Exactly my thoughts and ideas. This feature should only be used under exceptional circumstances and not as a general rule…
If this feature is ever documented, I’ll add something like this to the documentation.


This “feature” defeats one of the main purposes of XML, namely that malformed documents are rejected. Malformed XML should *not* be accepted. Period. When it’s malformed, it’s not XML, it’s just more tag soup.

This way lies madness. We’ll get into the same situation as with HTML. Please revert this.

I have to say that I agree with Ian – if XML is not well formed then it isn’t really xml…

I think that the cost-benefit of this is substantially negative, but everyone knew I’d say that.
However, if anyone is using the XML they’re getting to drive business transactions, and is using this deplorable try-to-guess-what-I-meant flag, and something breaks and substantial money is lost, the person who’s sending the XML would be entitled to sue and would probably win. If you advertise that you accept XML input, then I regard that as a promise to fail deterministically if something goes wrong at my end and I send you borked data. Guessing what messages mean is a very dangerous business. Good luck; you’ll need it.

Parsing non well formed documents considered harmful
My patch from the other day to allow parsing of non-well-formed XML documents seems to have caused some disagreement about whether that should be done at all (see the comments). Ian Eure and David Hay think it should be removed again; for Tim Bray it’s shooting y…

Since you committed this patch, will you please, please also commit some _extensive_ documentation containing not only a disclaimer, but also a discussion of the problems you can get into when trying to ‘recover’ from broken XML?

You might also want to give some examples of when _never_ to do this (Tim has a _very_ strong point), and some examples of when it is probably ok-ish (like parsing RSS or trying to auto-correct the HTML in a weblog editor).

It would be a shame if the moronic* part of the PHP developers got into another bad habit because of poor documentation.


I’ve totally been exposed to all the explanations pro and con about the XML spec and draconian parsing, so I’m going to try and side-step that battle. :)

However, while I’m here, I just want to suggest that one possible benefit of this feature is the ability for an application to not only reject a malformed XML document, but to also provide the user with a well-formed document that they *may* wish to submit instead. (Assuming a proper review to ensure the library has correctly fixed the

Of course, if the file is totally hosed, then there’s no way a machine can properly guess what the human is trying to do. However, if it’s an “obvious” typo, then you could display a clear diff with the changes and the user could easily approve them without needing to wade through a mess of angle-bracket tags to debug the file.
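
A rough sketch of that reject-but-suggest flow might look like the code below. The helper name suggestRepair() is made up for illustration, the input is assumed to be a non-empty string, and rendering the actual diff is left out; it only relies on the recover property from the post and the libxml error functions.

```php
<?php
// Hypothetical helper: returns null when $source is already well-formed,
// otherwise libxml2's best-guess repair as a string for a human to review.
function suggestRepair($source)
{
    // Keep parser errors out of the output while we probe the document.
    $previous = libxml_use_internal_errors(true);

    // First pass: strict parsing, exactly as a normal consumer would do it.
    $strict = new DOMDocument();
    $wellFormed = $strict->loadXML($source);
    libxml_clear_errors();

    $suggestion = null;
    if (!$wellFormed) {
        // Second pass: let libxml2 guess, but only to build a suggestion.
        $lenient = new DOMDocument();
        $lenient->recover = true;
        $lenient->loadXML($source);
        if ($lenient->documentElement !== null) {
            $suggestion = $lenient->saveXML();
        }
        libxml_clear_errors();
    }

    // Restore whatever error handling mode was active before.
    libxml_use_internal_errors($previous);
    return $suggestion;
}

$input = '<root><tag>hello world</root>';
$suggestion = suggestRepair($input);
if ($suggestion === null) {
    echo "Document accepted.\n";
} else {
    echo "Document rejected. Did you mean:\n", $suggestion;
}
```

The important design choice is that the original document is still rejected; the recovered version is only ever shown to the user as a proposal.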

After all, isn’t one of the great benefits of machines the ability to quickly and easily do what takes us poor humans far too long? It seems to me that fixing markup is one of those tasks.

Finally, before I go home for the weekend, I want to add that PHP 5 *already* has this feature as part of Dave Raggett’s Tidy extension. Tidy’s whole purpose in life is to convert your unwashed (HT|X)ML, and now we bundle a PHP interface to this library as part of PHP 5. It’s a little limited when it comes to its XML support, but it’s usable for most XML files.


This makes me want to learn PHP.

This makes me want to cry.

This is the best thing ever!!1! Finally I don’t have to waste all those keystrokes closing everything! woohoo!