This page summarises the results of a number of techniques for converting the McCord XML data into a JSON form suitable for storage in CouchDB. A sample of the current XML schema can be seen at https://source.fluidproject.org/svn/fluid/engage/trunk/src/testdata/artifacts. Note that this schema is not stable, and is expected to change repeatedly over the course of the project; McCord may indeed adopt a JSON data format of their own in time. Each museum will likely have its own variations on a schema, and so our approach fundamentally embraces this variability.
- using the "fastXmlPull" parser which Fluid uses to parse HTML for its Renderer
- using the browser's native DOM, available as the return from an XmlHttpRequest
- Thomas Frank's "xml2json" parser, available at http://www.thomasfrank.se/xml_to_json.html. Frank references a better implementation at www.terracoder.com; however, as of the time of writing, this site has been down for several weeks.
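None of these parsers dictates a particular JSON mapping, so for illustration here is a minimal sketch (in Python, independent of the parsers above) of the kind of element-to-dictionary conversion being benchmarked. The sample fragment is invented for the example and does not reflect the real McCord schema.

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Recursively convert an XML element into a plain dict/str structure."""
    children = list(elem)
    if not children:
        return elem.text or ""
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags become lists, a common XML-to-JSON convention.
            existing = result[child.tag]
            if not isinstance(existing, list):
                result[child.tag] = [existing]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result

# A made-up fragment purely for illustration; the real schema differs.
sample = "<artifact><title>Chair</title><medium>Wood</medium></artifact>"
print(json.dumps(element_to_dict(ET.fromstring(sample))))
```

Real converters must also decide how to represent attributes and mixed content, which this sketch ignores.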
First, the in-browser measurements, performed in various browsers on different machines:
40 documents in 1057ms: 26.425ms per call
4 documents in 1604ms: 401ms per call
40 documents in 1132ms: 28.3ms per call
4 documents in 3247ms: 811.75ms per call
16.8ms per call
467.75ms per call
17.175ms per call
40 documents in 770ms: 19.25ms per call
4 documents in 2399ms: 599.75ms per call
40 documents in 669ms: 16.725ms per call
40 documents in 2187ms: 54.675ms per call
4 documents in 2422ms: 605.5ms per call
40 documents in 3406ms: 85.15ms per call
4 documents in 2328ms: 582ms per call
40 documents in 951ms: 23.775ms per call
4 documents in 1666ms: 416.5ms per call
40 documents in 756ms: 18.9ms per call
4 documents in 2071ms: 517.75ms per call
8.2ms per call
41.75ms per call
10.5ms per call
The first three lines establish a basic normalisation between the three machines: the Antranig, Yura and Justin machines perform in a ratio of roughly 1.5:1.2:1.
The Frank and Resig parsers are extremely slow. Whilst they make good use of RegExps to potentially accelerate performance in browsers, their overall approach to parsing is very suboptimal, frequently duplicating the document text in memory several times over during the course of parsing. The Frank parser in particular makes very frequent use of "eval", which results in poor performance on any runtime. These two parsers take roughly an order of magnitude longer than the other approaches.
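The eval cost is easy to demonstrate with a small micro-benchmark (in Python rather than JavaScript, but the same principle applies: each eval call recompiles its input from source on every invocation, where a dedicated parser does not):

```python
import json
import timeit

# Two equivalent inputs: one parsed via eval, one via a real JSON parser.
literal = "{'title': 'Chair', 'medium': 'Wood'}"
json_text = '{"title": "Chair", "medium": "Wood"}'

n = 10_000
eval_ms = timeit.timeit(lambda: eval(literal), number=n) / n * 1000
loads_ms = timeit.timeit(lambda: json.loads(json_text), number=n) / n * 1000
print(f"eval: {eval_ms:.4f} ms/call, json.loads: {loads_ms:.4f} ms/call")
```

On typical CPython builds the eval path is noticeably slower per call, mirroring the "ms per call" gap in the figures above, though the exact ratio varies by runtime.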
Measurements in Python and Java
Now, for measurements in other languages. Yura ran benchmarks on CPython and Jython on his machine, and on mine I ran an equivalent conversion in Java:
jython: real 2m27.557s, user 2m26.105s, sys 0m1.316s
cpython: real 1m6.248s, user 1m4.452s, sys 0m0.480s
cpython: real 1m4.939s, user 1m3.556s, sys 0m0.360s
java: real 0m1.408s, user 0m1.344s, sys 0m0.040s
This test set only contains 4 documents, read repeatedly from the filesystem. Therefore it neglects caching and memory image effects. These documents are around 10k each - so the Java performance equates to a conversion speed of around 10MB/sec, which would be roughly the expected sustained read speed from a fairly good disk. If the files were kept compressed in a ZIP volume, the CPU cost would begin to dominate.
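As a sanity check on the 10MB/sec figure, a little arithmetic recovers the implied number of passes over the 4-document set. Note that the pass count is an inference from the quoted throughput, not a number stated in the benchmark output:

```python
doc_kb = 10          # approximate size of each sample document
docs_per_pass = 4
java_seconds = 1.408 # 'real' time from the Java run above
throughput_mb_s = 10 # conversion speed quoted in the text

total_mb = throughput_mb_s * java_seconds            # data implied by that throughput
passes = total_mb * 1024 / (doc_kb * docs_per_pass)  # repeated reads of the test set
print(f"{total_mb:.1f} MB processed, roughly {passes:.0f} passes over the test set")
```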
Persisting the JSON Data in CouchDB
In terms of writing the converted JSON documents to the store, we can look at the following resource: http://aartemenko.com/texts/couchdb-bulk-inserts-performance/. McCord indicates that they have around 117,000 records of this form, corresponding to around 1.2GB of raw data. The number of records puts us towards the left end of the CouchDB insertion graphs, where insertion rate is still high. The second graph indicates that the very largest bulk sizes lead to the highest insertion rate - perhaps in excess of 1500 documents/second. This again would place conversion speed as the bottleneck: on this hardware, CouchDB can perform an insert in around 0.6ms. This would imply buffering data in units of around 100MB. If the fileset and DB were on the same machine, alternating reading and writing in these large units would also lead to better use of the machine's I/O capacity.
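To make the batching concrete, here is a rough sketch of bulk insertion against CouchDB's standard _bulk_docs endpoint. The database URL and batch size are placeholder assumptions for illustration, not values taken from the benchmark:

```python
import json
import urllib.request

COUCH_URL = "http://localhost:5984/mccord"  # hypothetical database location

def batches(docs, size):
    """Split a document list into bulk-insert payloads of the given size."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def bulk_insert(docs, size=10_000):
    """POST each batch to CouchDB's _bulk_docs endpoint in one request."""
    for batch in batches(docs, size):
        body = json.dumps({"docs": batch}).encode("utf-8")
        req = urllib.request.Request(
            COUCH_URL + "/_bulk_docs", data=body,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)

# At ~0.6ms per insert, the full McCord set is insertion-bound for ~70s:
print(f"estimated insert time: {117_000 * 0.6 / 1000:.0f}s")
```

Larger batches amortise HTTP overhead, which is why the graphs in the linked resource favour the biggest bulk sizes.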
Total McCord XML-JSON Conversion (minutes)
Some ballpark estimates for minimum conversion time for the whole workflow - including conversion and persistence - on the various platforms:
- If Java were used for the conversion, we would expect to be able to commit the set of 117,000 documents into CouchDB in around 3 minutes.
- In Python, this time would probably extend to around 8-9 minutes.
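The Java estimate can be reproduced from the figures above, assuming (somewhat pessimistically) that conversion and insertion do not overlap:

```python
data_gb = 1.2       # total raw McCord data quoted above
convert_mb_s = 10   # Java conversion speed from the measurements
insert_rate = 1500  # optimistic CouchDB bulk-insert rate, docs/sec
n_docs = 117_000

convert_s = data_gb * 1024 / convert_mb_s  # ~123s of conversion
insert_s = n_docs / insert_rate            # ~78s of insertion
print(f"total: roughly {(convert_s + insert_s) / 60:.1f} minutes")
```

This lands at a little over 3 minutes, consistent with the estimate above; overlapping reads and writes, as suggested earlier, could only reduce it.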