Processing Large XML Documents with PHP 5

There are different ways to process XML documents in PHP 5. One can process them with SimpleXML, SAX, XMLReader or DOM, and they all have their pros and cons (see my "XML in PHP 5" workshop slides for more details about them). But when it comes to large XML documents, the choices look quite limited.
Therefore I did some benchmark testing with the different extensions. The XML document is approx. 10MB in size and consists of a lot of blog entries from Planet PHP. The exercise to be solved was to get the title of the entry with the ID 4365. Not that complicated a task, and with more complicated questions the results may differ.

The results (as a text file) were actually not that surprising. SAX and XMLReader were very low on memory usage, but slower than DOM/XPath. Here's a chart of the initial results (parsing the full document).

But if we assume there's only one entry with ID = 4365, then we don't have to process the full document and can stop after the first one (aka FO or firstonly in the results) is found. As this entry is within the first 10% of our example document, the results are quite different, to no surprise. With this approach and some luck with the order of the entries, we can cut down the processing time considerably, which is not possible with the DOM approach. There, it's all-or-nothing.
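
To illustrate the firstonly idea, here's a minimal sketch of such a loop (not one of the actual benchmark scripts; the file name and the entry/id/title markup are assumptions):

<?php
// Hypothetical "firstonly" loop: stream the document with XMLReader and
// stop as soon as the entry with ID 4365 has been read.
$reader = new XMLReader();
$reader->open('planet.xml');

$title = null;
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'entry'
        && $reader->getAttribute('id') === '4365') {
        // We're on the matching entry; advance to its <title> child.
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT
                && $reader->localName === 'title') {
                $reader->read();     // move to the text node inside <title>
                $title = $reader->value;
                break 2;             // stop parsing, we have what we need
            }
        }
    }
}
$reader->close();
echo $title, "\n";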

In the result charts you may also have noticed the options "Expand" and "Expand & SimpleXML". I added a new method to XMLReader this weekend called "expand()" (it's in CVS now). With this method, you can convert a node caught with XMLReader into a DOMElement. See also the libxml2 page for more information. This can be very useful if you want to do DOM operations on only a small part of a huge XML document. With the "Expand" script, we expand the node matching ID = 4365 with XMLReader and then apply an XPath operation to it. As you can see, it needs a few lines of code (the expand() method only returns a node, but we need a document for XPath), but after that, we can use every XPath expression and DOM method we want, or even convert it to SimpleXML, as we do in the "Expand & SimpleXML" script. That's maybe a little useless in this case, as we don't save a lot of coding or time, but if your subtrees are more complex or you want to build a new XML document, it can be quite useful. The time and memory used are approx. the same as with the plain XMLReader script (no surprise, since most of the time is spent traversing the XML document, not parsing the subtree).
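
For illustration, a rough sketch of the "Expand" approach, assuming the same hypothetical markup as above (the real benchmark scripts are linked further down):

<?php
// Stream to the wanted entry, then expand() it for DOM/XPath processing.
$reader = new XMLReader();
$reader->open('planet.xml');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'entry'
        && $reader->getAttribute('id') === '4365') {

        // expand() only returns a node, but XPath needs a document,
        // so import the node into a fresh DOMDocument first.
        $doc = new DOMDocument();
        $doc->appendChild($doc->importNode($reader->expand(), true));

        $xpath = new DOMXPath($doc);
        echo $xpath->evaluate('string(/entry/title)'), "\n";
        break;
    }
}
$reader->close();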

I also did some benchmarks with XSLT (see the chart). First I used the traditional method of loading the whole XML document into memory and then transforming it. Time and memory used are more or less the same as with plain DOM processing, which is no surprise, since the task this script has to do is almost the same as the XPath example above. But it gets interesting with the expand() feature of XMLReader. As we just want to transform the one entry, we search for it with XMLReader, create a DOMElement and from that a DOMDocument, and feed only that to the XSLT processor. This saves a lot of memory and scales very well on the memory side. It takes longer time-wise (if you parse the full document, but that's the worst-case scenario anyway), but if your XML documents are really huge (larger than your available RAM, for example), then this (or another XMLReader approach) is the only feasible solution, IMHO.
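
Here's a minimal sketch of that XSLT variant, under the same assumptions as above (the stylesheet name entry.xsl is made up):

<?php
// Transform only the expanded entry instead of the whole 10MB document.
$xsl = new DOMDocument();
$xsl->load('entry.xsl');

$proc = new XSLTProcessor();
$proc->importStylesheet($xsl);

$reader = new XMLReader();
$reader->open('planet.xml');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'entry'
        && $reader->getAttribute('id') === '4365') {

        $doc = new DOMDocument();
        $doc->appendChild($doc->importNode($reader->expand(), true));

        // Only this small one-entry document is fed to the processor.
        echo $proc->transformToXML($doc);
        break;
    }
}
$reader->close();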

To sum up: XMLReader is a powerful extension for parsing large XML documents. It's usually much faster than SAX (about twice as fast), while still scaling without problems on the memory side. With the expand() method, it's now also possible to mix the features of DOM/SimpleXML/XSLT with XMLReader if you only have to process parts of an XML document.

Here are the scripts for reference:
SAX
XMLReader
Expand
Expand & SimpleXML
DOM & XPath
XSLT
XSLT w/ XMLReader

I need to dedicate a box to PHP5 testing.

Thanks for the write-up on this, because it gives me some ideas for some strategies to use with XMLReader.

Interesting. At the libxml2 level, SAX is about twice as fast as the xmlReader, but your experiment points out one more problem of the SAX API: when you cross language boundaries, the callbacks are extremely expensive, especially if converting strings is needed. That's why SAX is not a good API for exporting a fast parser to, say, PHP or Python; any advantage you may gain with the C parser is lost in the marshalling of the strings. The reader, in comparison, allows you to minimize the marshalling: you have far more integers (cheaper), all attributes and their values are marshalled only if asked for, and checking the element type allows you to short-circuit potentially expensive operations. I think that adding Reader operations like NextElement(Name?) or NextType(type) would have even more potential for fast processing, and would be very convenient for the kind of operations you describe: NextElement(title) would stop only once per article (and there is a glob of optimization possible at the libxml2 level for such searching).

Daniel

Hi Daniel,

I want to try your reference scripts, but I need the memreport.php; it would be nice if you could send it to me please???

Thanks in advance,
greetings,
Ashook

Has anyone run a test on HUGE (multi-GB) XML files? Would XMLReader scale to this size?

Hi Oskar: XMLReader scales to any size, as it parses the document chunk by chunk.

very useful, thanks!!

I am currently using the combination of XMLReader + expand and then managing the single needed node with DOM

great speed!!

This was enormously useful info, thanks! Appreciate the examples. While it’s slightly clunky to expand/SimpleXML import, it’s VERY convenient; I was pleasantly surprised to see there isn’t much of a performance penalty, either.

Looking forward to using XMLReader for processing giant feeds… (150MB+ XML…)

Hi,
Problem: I have to parse an XML file larger than 5GB. If I do that using DOM, it throws an out-of-memory exception.
Which parser should I use? Should I go for SAX, or will I face the same problem?

If you have PHP 5, I'd recommend XMLReader.

The reference scripts have been removed. Would much appreciate having them available again.

Anyhow, I would like to make some speed/memory comparisons between DOM, SAX, XMLReader and SimpleXML as implementations of an xml2array parser. The structure should be the one given by PEAR::XML_Unserializer. Any idea or prediction which of them is most useful for this job? The XML structure I would work with is not very deeply nested. Attributes are rare too.

Thx so far

Hi,
I read your blog; it seems interesting and showed me a way to find a solution for parsing and storing large XML in a DB (up to 200MB).

Can you send the example scripts that you mentioned?

The examples are online again, sorry to all who didn’t find them…

I'm trying to use XMLReader on a Windows install of PHP 5.1.2 but can't find a way to enable it. Does it require that PHP be recompiled with --enable-xmlreader added to the configure line, or is there a build of PHP out there with this already done? I'm doing this for a client and don't have the time to figure out how to compile PHP myself.

I’m trying to parse a 28MB XML file and load it into a MySQL database. Is there another way to do it that won’t require me to recompile PHP?

Any help would greatly be appreciated. Thanks.

Does this work with cross-domain fetching, without crossdomain.xml?
My testing showed varied results.
Can someone clarify, please?

Nice article, but couldn't you give an example of reading XML with XMLReader? I was looking for an example.

I think it's weird that we are going in circles. The reason large data volumes got broken down into a relational DB was for this EXACT situation: fast searching through giant data.

Now we've gone full circle, back to flat files, and needing a way to search them quickly again?

Why not just structure your file hierarchy like a database, where the folder names represent tables, and the data in the XML is structured in a relational manner? This keeps the sizes down, and your main title file would only have two elements: name and location.

Technology seems to chase its tail an awful lot. I mean, if you have PHP 5, then why not just use a DB if you suspect your files are going to become huge???

That said, if your files are reasonable enough and you don't want a DB, this advice is great, and thanks a lot!
/tre

Could I ask you for 'memreport.php' and the XML/XSL files to test your benchmark?
Your benchmark and blog are really interesting.
Thank you very much
JC

Hey Christian

Great post and thanks for doing the work, saved me some time and effort.

All the files you need are here:
https://svn.bitflux.org/repos/public/php5examples/largexml/

Perfect information – thank you. Saved me a lot of time. XMLReader it is then :)

@trevor – I agree that we are going in circles, but XML is often used to transport data, not for storage. So sometimes huge XML files have to be read and imported.

I’d just like to queue up with all the others.

I'm using XMLReader to do exactly the thing "Not Web Design" mentioned – and it works great. Especially in combination with expand and SimpleXML.

Thanks!

Great test results,
but where can I find the planet.xml?

Thanks in advance,

Blabi

Great article!
Has anyone tried to parse large XML files using XMLReader + XMLWriter?

Nice post. With this information I made my XML operations work faster; I never knew my XML parsing could be done so fast…
expand() -> nice idea

Why do you first read the XML using XMLReader, then pass it to DOMDocument (importNode), and then pass it to SimpleXML? Wouldn't that last step just be a waste?

Robert: To get to the title of the node easily (which would need more lines in pure DOM) and to show that it's possible.

Performance-wise it's not really a waste. It's zero-copy internally and, as said, with pure DOM you'd have to use XPath or several lines of iterating, which is not necessarily faster than the SimpleXML approach.
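
For the curious, a minimal sketch of that last step, assuming the same hypothetical markup as in the examples above: simplexml_import_dom() wraps the already-built DOM node without copying, so reading the title becomes a one-liner.

<?php
// Expand the wanted entry, attach it to a document, then wrap it in
// SimpleXML; the wrapper shares the DOM's internal tree (zero-copy).
$reader = new XMLReader();
$reader->open('planet.xml');

while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT
        && $reader->localName === 'entry'
        && $reader->getAttribute('id') === '4365') {

        $doc  = new DOMDocument();
        $node = $doc->importNode($reader->expand(), true);
        $doc->appendChild($node);

        $entry = simplexml_import_dom($node);
        echo (string) $entry->title, "\n";   // vs. XPath or DOM iteration
        break;
    }
}
$reader->close();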