- using the "fastXmlPull" parser which Fluid uses to parse HTML for its renderer
- using the browser's native DOM, available as the response from an XMLHttpRequest
- using Thomas Frank's "xml2json" parser, available at http://www.thomasfrank.se/xml_to_json.html. Frank references a better implementation hosted at www.terracoder.com, but at the time of writing that site has been down for several weeks.
First, the in-browser measurements, performed in various browsers on different machines:
40 documents in 1057ms: 26.425ms per call
4 documents in 1604ms: 401ms per call
40 documents in 1132ms: 28.3ms per call
4 documents in 3247ms: 811.75ms per call
16.8ms per call
467.75ms per call
17.175ms per call
40 documents in 770ms: 19.25ms per call
4 documents in 2399ms: 599.75ms per call
40 documents in 669ms: 16.725ms per call
40 documents in 2187ms: 54.675ms per call
4 documents in 2422ms: 605.5ms per call
40 documents in 3406ms: 85.15ms per call
4 documents in 2328ms: 582ms per call
40 documents in 951ms: 23.775ms per call
4 documents in 1666ms: 416.5ms per call
40 documents in 756ms: 18.9ms per call
4 documents in 2071ms: 517.75ms per call
8.2ms per call
41.75ms per call
10.5ms per call
The first three lines establish basic normalisation between the three machines. Roughly, the Antranig:Yura:Justin timings are in a ratio of around 1.5:1.2:1.
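As a rough sketch of how figures in the form listed above could be produced and compared across machines: the function names, the use of `Date.now()` for timing, and the reading of the 1.5:1.2:1 ratio as a relative elapsed-time factor are all assumptions for illustration, not details of the harness actually used for these tests.

```javascript
// Hypothetical harness: run `parse` over every document and record the total time.
function timeParser(parse, documents) {
  const start = Date.now();
  for (const text of documents) {
    parse(text);
  }
  return { count: documents.length, totalMs: Date.now() - start };
}

// Format a result in the same shape as the figures listed above.
function formatResult(result) {
  const perCall = result.totalMs / result.count;
  return `${result.count} documents in ${result.totalMs}ms: ${perCall}ms per call`;
}

// Assumption: the machine ratio is a relative elapsed-time factor (larger = slower),
// so dividing by it rescales a timing to the reference machine (factor 1).
const MACHINE_FACTOR = { Antranig: 1.5, Yura: 1.2, Justin: 1 };

function normalise(perCallMs, machine) {
  return perCallMs / MACHINE_FACTOR[machine];
}
```

For example, `formatResult({ count: 40, totalMs: 1057 })` reproduces the first figure in the list, and `normalise(30, "Antranig")` would rescale a 30ms timing to about 20ms on the reference machine under the assumption stated above.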
The Frank and Resig parsers are extremely slow. Whilst they make good use of RegExps, which could in principle accelerate performance in browsers, their overall approach to parsing is far from optimal, frequently duplicating the document text in memory several times over during the course of parsing. The Frank parser in particular makes very frequent use of "eval", which will result in poor performance on any runtime. These two parsers take roughly an order of magnitude longer than the other approaches.
NB - the Resig parser is somewhat specialised for HTML, at the expense of behaviour on XML documents. The base code has been tweaked in a few instances to allow parsing to continue on general XML but the produced JSON is not correct, as a result of an assumption in the code that any node type which is capable of containing CDATA is not also capable of containing child nodes. However, the overall run time is probably still a good estimate of the speed of this parser.
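To illustrate the kind of text duplication described above, here is a minimal sketch contrasting two ways of scanning for tags. Neither function is taken from the Frank, Resig, or fastXmlPull code; both are hypothetical and only count tag-like spans. The first chops the consumed prefix off the remaining text at every step, which may create a new string each time. The second leaves the text intact and advances an index.

```javascript
// Style A (illustrative): match at the front, then replace the working
// string with whatever is left. Each step may allocate a fresh string.
function countTagsByChopping(text) {
  let rest = text;
  let count = 0;
  while (rest.length > 0) {
    const match = /^<[^>]*>/.exec(rest);
    if (match) {
      count++;
      rest = rest.substring(match[0].length);
    } else {
      // No tag at the front: skip ahead to the next "<", or stop if none.
      const next = rest.indexOf("<", 1);
      rest = next === -1 ? "" : rest.substring(next);
    }
  }
  return count;
}

// Style B (illustrative): keep the original text and move a position forward.
function countTagsByIndex(text) {
  let pos = 0;
  let count = 0;
  while (pos < text.length) {
    const open = text.indexOf("<", pos);
    if (open === -1) break;
    const close = text.indexOf(">", open + 1);
    if (close === -1) break;
    count++;
    pos = close + 1;
  }
  return count;
}
```

Both functions return the same count for the same input. How much the chopping style actually costs depends on how a given JavaScript engine implements `substring`, so this is a sketch of the pattern rather than a measurement.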
The "fastXmlPull" parser (Fluid's homegrown approach) is in general the fastest. Despite the name, it is capable of parsing realistic browser HTML as well as XML. On some platforms the DOM approach appears marginally faster (19ms against 25ms on the very latest Firefox beta), but these tests give the DOM method an unfair advantage, since they measure only the cost of iterating over the DOM. The DOM structure itself is built inside the browser's XHR engine, and that cost cannot be profiled separately from the overall fetch of the files. Once this is taken into account, fastXmlPull would probably show equal or greater performance on all platforms.
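The point about the DOM measurement covering only iteration can be sketched as follows. This is a hypothetical converter and timer, not the code used in the tests above: `domToPlain` and `timeWalk` are invented names, and the output shape is an assumption. By the time a root node exists to be passed in, the browser has already built the tree, so that cost never appears in the timing.

```javascript
// Node type constants as defined by the DOM specification.
const ELEMENT_NODE = 1;
const TEXT_NODE = 3;

// Hypothetical converter: walk an already-built DOM subtree and
// return a plain object holding names, text, and child elements.
function domToPlain(node) {
  const result = { name: node.nodeName, text: "", children: [] };
  const kids = node.childNodes || [];
  for (let i = 0; i < kids.length; i++) {
    const kid = kids[i];
    if (kid.nodeType === ELEMENT_NODE) {
      result.children.push(domToPlain(kid));
    } else if (kid.nodeType === TEXT_NODE) {
      result.text += kid.nodeValue;
    }
  }
  return result;
}

// Times only the walk over the tree; building the tree is not included.
function timeWalk(root, repeats) {
  const start = Date.now();
  for (let i = 0; i < repeats; i++) {
    domToPlain(root);
  }
  return (Date.now() - start) / repeats;
}
```

In a browser this might be called as `timeWalk(xhr.responseXML.documentElement, 40)`, where `xhr` is a completed XMLHttpRequest. The parse that produced `responseXML` happened during the fetch and is outside the timed region.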
Measurements in Python and Java
Now, for measurements in other languages. Yura has run benchmarks on his machine using CPython and Jython, and I ran an equivalent conversion in Java on mine:
This test set contains only 4 documents, read repeatedly from the filesystem, so it neglects caching and memory image effects. These documents are around 10k each, so the Java performance equates to a conversion speed of around 10MB/sec, which would be roughly the expected sustained read speed from a fairly good disk. If the files were kept compressed in a ZIP volume, the CPU cost would begin to dominate.
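The throughput arithmetic above can be checked with a small sketch. The function names are hypothetical, and decimal units are assumed (10k taken as 10,000 bytes, 1MB as 1,000,000 bytes), which the text does not specify.

```javascript
// Throughput in MB/sec for a document of `bytes` converted in `msPerDocument`.
// Assumes decimal units: 1MB = 1,000,000 bytes.
function throughputMBps(bytes, msPerDocument) {
  return (bytes / 1e6) / (msPerDocument / 1000);
}

// Inverse: the per-document time implied by a given throughput.
function impliedMsPerDocument(bytes, mbPerSec) {
  return (bytes * 1000) / (mbPerSec * 1e6);
}
```

Under these assumptions, `impliedMsPerDocument(10000, 10)` shows that 10MB/sec on 10k documents corresponds to about 1ms per document.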