
This page summarises the results of a number of techniques for converting the McCord XML data into a JSON form suitable for storage in CouchDB. A sample of the current XML schema can be seen at [link]. Note that this schema is not stable, and is expected to change repeatedly over the course of the project. McCord may indeed adopt a JSON data format of their own in time.

We have evaluated multiple approaches in several languages: "pure" benchmarks written in Python and Java, using the respective languages' XML parsers, and JavaScript benchmarks written in three styles:

  1. using the "fastXmlPull" parser which Fluid uses to parse HTML for its renderer
  2. using John Resig's HTML parser (now delivered in env.js)
  3. using the browser's native DOM, available as the return from an XMLHttpRequest
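
As an illustration of what a "pure" conversion using a language's own XML parser might look like, here is a minimal Python sketch. The sample record, the `@attributes` and `#text` key names, and the function names are all assumptions made for illustration; they are neither the McCord schema nor the code that was benchmarked.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert an element into plain Python data (hypothetical mapping)."""
    node = {}
    if elem.attrib:
        node["@attributes"] = dict(elem.attrib)
    text = (elem.text or "").strip()
    children = list(elem)
    if not children:
        # Leaf element: collapse to its text when it has no attributes.
        if not node:
            return text
        if text:
            node["#text"] = text
        return node
    if text:
        node["#text"] = text
    for child in children:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag becomes a JSON array.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    return node


def xml_to_json(xml_text):
    """Parse one XML document and serialise it as a JSON string."""
    root = ET.fromstring(xml_text)
    return json.dumps({root.tag: element_to_dict(root)}, sort_keys=True)


# Made-up sample record, for illustration only.
SAMPLE = (
    '<record id="M1"><title>Portrait</title>'
    "<keyword>photo</keyword><keyword>studio</keyword></record>"
)

if __name__ == "__main__":
    print(xml_to_json(SAMPLE))
```

A mapping like this loses mixed-content ordering, which may or may not matter depending on how the real schema uses text alongside child elements.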

Firstly the in-browser measurements, performed in various browsers on different machines:






40 documents in 1057ms: 26.425ms per call | 4 documents in 1604ms: 401ms per call | 40 documents in 1132ms: 28.3ms per call
16.8ms per call | 467.75ms per call | 17.175ms per call
40 documents in 770ms: 19.25ms per call | 4 documents in 2399ms: 599.75ms per call | 40 documents in 669ms: 16.725ms per call
40 documents in 2187ms: 54.675ms per call | 4 documents in 2422ms: 605.5ms per call | 40 documents in 3406ms: 85.15ms per call
8.2ms per call | 41.75ms per call | 10.5ms per call

The first three lines establish a basic normalisation between the three machines: the Antranig:Yura:Justin ratio is roughly 1.5:1.2:1.
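
Figures of the form "N documents in T ms: X ms per call" could be produced by a harness along the following lines. This is a sketch in Python rather than the JavaScript that ran in the browsers, and the function names are hypothetical. The default of 10 repeats is an assumption, suggested by a 4-document test set yielding 40 calls.

```python
import time


def benchmark(convert, documents, repeats=10):
    """Call convert() on every document, `repeats` times over.

    Returns (number of calls, elapsed wall-clock milliseconds).
    """
    calls = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for doc in documents:
            convert(doc)
            calls += 1
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return calls, elapsed_ms


def report(calls, elapsed_ms):
    """Format a result in the same style as the measurements above."""
    per_call = elapsed_ms / calls
    return f"{calls} documents in {elapsed_ms:.0f}ms: {per_call:g}ms per call"


if __name__ == "__main__":
    # Stand-in workload; a real run would pass the XML-to-JSON converter.
    calls, elapsed = benchmark(len, ["<a/>", "<b/>", "<c/>", "<d/>"])
    print(report(calls, elapsed))
```

Dividing the total by the call count, rather than timing each call separately, keeps timer resolution from dominating when individual conversions take only a few milliseconds.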

Now for measurements in other languages. Yura has run benchmarks on his machine in CPython and Jython, and on mine I ran an equivalent conversion in Java:



 | Total time | Per doc
jython | real 2m27.557s, user 2m26.105s, sys 0m1.316s |
Yura/CPython 2.6 | real 1m6.248s, user 1m4.452s, sys 0m0.480s |
Yura/CPython 3 | real 1m4.939s, user 1m3.556s, sys 0m0.360s |
 | real 0m1.408s, user 0m1.344s, sys 0m0.040s |

So, the above figures show that the best JavaScript performance we have, in Chrome, is broadly equivalent to the worst Python performance, in Jython. The pure Java implementation is at least 3x faster than the best Python performance.

This test set contains only 4 documents, read repeatedly from the filesystem. Therefore it neglects caching and memory image effects. These documents are around 10 kB each, so the Java performance equates to a conversion speed of around 10 MB/sec, which would be roughly the expected sustained read speed from a fairly good disk. If the files were kept compressed in a ZIP volume, the CPU cost would begin to dominate.
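
The throughput arithmetic can be checked with a short sketch. The function names are hypothetical, and the figure of 1 ms per document is not a measured value quoted on this page; it is simply what 10 kB documents converted at 10 MB/sec implies (taking 1 MB as 1000 kB).

```python
def throughput_mb_per_sec(doc_size_kb, ms_per_doc):
    """Conversion throughput in MB/sec for a given per-document time."""
    docs_per_sec = 1000.0 / ms_per_doc
    return doc_size_kb * docs_per_sec / 1000.0


def ms_per_doc(doc_size_kb, mb_per_sec):
    """Per-document time in ms implied by a given throughput."""
    docs_per_sec = mb_per_sec * 1000.0 / doc_size_kb
    return 1000.0 / docs_per_sec


if __name__ == "__main__":
    # 10 kB documents at 10 MB/sec: about 1 ms per document.
    print(ms_per_doc(10, 10))
    print(throughput_mb_per_sec(10, 1.0))
```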

For writing the converted JSON documents to the store, we can look at the following resource: [link]. McCord indicates that they have around 117,000 records of this form, corresponding to around 1.2 GB of raw data. This number of records puts us towards the left end of the CouchDB insertion graphs, where the insertion rate is still high. The second graph indicates that the very largest bulk sizes lead to the highest insertion rate, perhaps in excess of 1500 documents/second. This again would make conversion speed the bottleneck: on this hardware, CouchDB can perform an insert in around 0.6ms. This would imply buffering data in units of around 100 MB. If the fileset and DB were on the same machine, alternating reading and writing in these large units would also make better use of the machine's I/O capacity.
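
A minimal sketch of buffering converted documents into large units before a bulk insert is shown below. It only builds the request bodies, in the `{"docs": [...]}` shape accepted by CouchDB's `_bulk_docs` endpoint, and does not send anything. The function names and the byte-size batching policy are assumptions for illustration. With documents of around 10 kB, a 100 MB buffer would hold roughly 10,000 documents.

```python
import json


def batches_by_size(docs, max_bytes):
    """Group documents into batches of at most max_bytes of JSON each.

    A single document larger than max_bytes still gets a batch of its own.
    """
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and size + doc_bytes > max_bytes:
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch


def bulk_payload(batch):
    """Build the JSON body for one bulk insert request."""
    return json.dumps({"docs": batch})


if __name__ == "__main__":
    # Tiny stand-in documents and a tiny limit, just to show the grouping.
    docs = [{"n": i} for i in range(10)]
    for batch in batches_by_size(docs, max_bytes=20):
        print(bulk_payload(batch))
```

Batching by byte size rather than by document count keeps memory use predictable if record sizes vary; batching by count would be simpler if they do not.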

Some ballpark estimates for minimum conversion time on the various platforms:

  • If Java were used for the conversion, we would expect to be able to commit the set of 117,000 documents into CouchDB in around 3 minutes.
  • In Python, this time would probably extend to around 8-9 minutes.
  • A JavaScript V8 (Chrome-like) solution might take more than 15 minutes.
  • A Rhino JavaScript solution, performing somewhat like Firefox 2, might take 1-2 hours.
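
The way such ballpark figures can be derived is sketched below, assuming that conversion and insertion run one after the other with no overlap. The per-document conversion times are assumptions rather than values quoted on this page: 1 ms for Java follows from 10 kB documents at 10 MB/sec, and 3 ms for Python is only the lower bound implied by "at least 3x". The function and constant names are hypothetical.

```python
RECORDS = 117_000   # number of McCord records quoted above
INSERT_MS = 0.6     # approximate CouchDB insert time per document


def total_minutes(convert_ms_per_doc, records=RECORDS, insert_ms=INSERT_MS):
    """Estimated minutes to convert and insert every record sequentially."""
    return records * (convert_ms_per_doc + insert_ms) / 60_000.0


if __name__ == "__main__":
    # Assumed per-document conversion times in milliseconds.
    assumed = {"Java": 1.0, "Python (lower bound)": 3.0}
    for platform, ms in assumed.items():
        print(f"{platform}: about {total_minutes(ms):.1f} minutes")
```

The Java case comes out at a little over 3 minutes, in line with the first estimate. The Python lower bound comes out at about 7 minutes, so the 8-9 minute estimate corresponds to a somewhat larger slowdown than 3x.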