
This page summarises the results of several techniques for converting the McCord XML data into a JSON form suitable for storage in CouchDB. A sample of the current XML schema can be seen at https://source.fluidproject.org/svn/fluid/engage/trunk/src/testdata/artifacts. Note that this schema is not stable and is expected to change repeatedly over the course of the project; McCord may in time adopt a JSON data format of their own.
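As an illustration of the kind of conversion involved, here is a minimal sketch that flattens an artifact element into a JSON-ready dictionary. The element and field names are assumptions for illustration only, not the actual McCord schema:

```python
import json
import xml.etree.ElementTree as ET

def artifact_to_dict(xml_text):
    """Flatten each child element of the artifact into a key/value pair."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}

# Hypothetical sample record; field names do not reflect the real schema.
sample = """
<artifact>
  <accessionNumber>A-0001</accessionNumber>
  <title>Example artifact</title>
</artifact>
"""
print(json.dumps(artifact_to_dict(sample)))
```

The real conversion would of course need to handle nesting and repeated elements, but the overall shape — parse, walk, emit a dictionary, serialise — is the same in every language benchmarked below.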

We have evaluated multiple approaches in several languages: "pure" benchmarks written in Python and Java using those languages' standard XML parsers, and Javascript benchmarks written in three styles:

  1. using the "fastXmlPull" parser which Fluid uses to parse HTML for its renderer
  2. using John Resig's HTML parser (now delivered in env.js)
  3. using the browser's native DOM, available as the return from an XmlHttpRequest

First, the in-browser measurements, performed in various browsers on different machines:

| Machine/Browser | fastXmlPull                            | Resig                                | DOM                                    |
|-----------------|----------------------------------------|--------------------------------------|----------------------------------------|
| Antranig/FF3    | 40 documents in 1057ms: 26.425ms/call  | 4 documents in 1604ms: 401ms/call    | 40 documents in 1132ms: 28.3ms/call    |
| Justin/FF3      | 16.8ms/call                            | 467.75ms/call                        | 17.175ms/call                          |
| Yura/FF3        | 40 documents in 770ms: 19.25ms/call    | 4 documents in 2399ms: 599.75ms/call | 40 documents in 669ms: 16.725ms/call   |
| Antranig/FF2    | 40 documents in 2187ms: 54.675ms/call  | 4 documents in 2422ms: 605.5ms/call  | 40 documents in 3406ms: 85.15ms/call   |
| Justin/Chrome   | 8.2ms/call                             | 41.75ms/call                         | 10.5ms/call                            |

The first three rows establish a basic normalisation between the three machines: Antranig:Yura:Justin run at roughly a 1.5:1.2:1 speed ratio.

Now, for measurements in other languages. Yura ran benchmarks on CPython and Jython on his machine, and I ran an equivalent conversion in Java on mine:

| Machine/Language | Reps  | Total time                                   | Per doc   |
|------------------|-------|----------------------------------------------|-----------|
| Yura/Jython      | 20000 | real 2m27.557s, user 2m26.105s, sys 0m1.316s | 7ms/doc   |
| Yura/CPython 2.6 |       | real 1m6.248s, user 1m4.452s, sys 0m0.480s   | 3.3ms/doc |
| Yura/CPython 3   | 20000 | real 1m4.939s, user 1m3.556s, sys 0m0.360s   | 3.3ms/doc |
| Yura/CPython     | 400   | real 0m1.408s, user 0m1.344s, sys 0m0.040s   | 3.5ms/doc |
| Antranig/Java    | 4000  | 3600ms                                       | 0.9ms/doc |
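The per-document figures above come from repeating the conversion over the small test set many times. A minimal harness along these lines (the conversion function and sample document here are stand-ins, not the actual benchmark code) shows the shape of the measurement:

```python
import time
import xml.etree.ElementTree as ET

def convert(xml_text):
    """Stand-in for the real XML-to-JSON conversion of one document."""
    root = ET.fromstring(xml_text)
    return {"tag": root.tag, "children": len(root)}

def benchmark(docs, reps):
    """Convert each document `reps` times and report milliseconds per document."""
    start = time.time()
    for _ in range(reps):
        for doc in docs:
            convert(doc)
    elapsed_ms = (time.time() - start) * 1000.0
    return elapsed_ms / (reps * len(docs))

sample = "<artifact><title>Sample</title></artifact>"
print("%.3fms/doc" % benchmark([sample] * 4, 100))
```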

So, the above figures show that the best Javascript performance we have, in Chrome, is broadly equivalent to the worst Python performance, in Jython. The pure Java implementation is at least 3x faster than the best Python performance.

This test set contains only 4 documents, read repeatedly from the filesystem, so it neglects caching and memory-image effects. The documents are around 10k each, so the Java performance equates to a conversion speed of around 10MB/sec, which is roughly the expected sustained read speed of a fairly good disk. If the files were kept compressed in a ZIP volume, the CPU cost would begin to dominate.

In terms of writing the converted JSON documents to store, we can look at the following resource: http://aartemenko.com/texts/couchdb-bulk-inserts-performance/. McCord indicates that they have around 117,000 records of this form, corresponding to around 1.2Gb of raw data. This record count puts us towards the left end of the CouchDB insertion graphs, where the insertion rate is still high. The second graph indicates that the very largest bulk sizes lead to the highest insertion rate, perhaps in excess of 1500 documents/second. This again would place conversion speed as the bottleneck: on that hardware, CouchDB can perform an insert in around 0.6ms. This would imply buffering data in units of around 100Mb. If the fileset and DB were on the same machine, alternating reading and writing in these large units would also make better use of the machine's I/O capacity.
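Bulk insertion into CouchDB goes through its `_bulk_docs` endpoint, which accepts a JSON body of the form `{"docs": [...]}`. A sketch of batching converted documents into large bulk requests follows; the database URL and batch size are illustrative, and the HTTP call uses the standard library but has not been exercised against a live server:

```python
import json
import urllib.request

def batches(docs, size):
    """Split the converted documents into chunks for bulk insertion."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]

def bulk_payload(batch):
    """Build the JSON body expected by CouchDB's _bulk_docs endpoint."""
    return json.dumps({"docs": batch}).encode("utf-8")

def bulk_insert(db_url, batch):
    """POST one batch; db_url such as http://localhost:5984/mccord is illustrative."""
    req = urllib.request.Request(
        db_url + "/_bulk_docs",
        data=bulk_payload(batch),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# With ~10k documents, batches of 10,000 give roughly 100Mb per bulk request.
docs = [{"_id": str(i), "title": "artifact"} for i in range(25)]
print(len(list(batches(docs, 10))))  # 3 batches
```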

Some ballpark estimates for minimum conversion time on the various platforms:

  • If Java were used for the conversion, we would expect to be able to commit the set of 117,000 documents into CouchDB in around 3 minutes.
  • In Python, this time would probably extend to around 8-9 minutes.
  • A Javascript V8 (Chrome-like) solution might take more than 15 minutes.
  • A Rhino Javascript solution, performing somewhat like Firefox 2, might take 1-2 hours.
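These estimates follow directly from the per-document figures above, taking 0.6ms/doc for the CouchDB insert. The arithmetic can be checked as follows (platform labels are shorthand for the rows in the tables above):

```python
# Conversion cost per document, taken from the benchmark tables above.
PER_DOC_MS = {"Java": 0.9, "CPython": 3.3, "V8 (Chrome)": 8.2, "Rhino (~FF2)": 54.675}
INSERT_MS = 0.6   # CouchDB bulk-insert cost per document
DOCS = 117000

for platform, convert_ms in PER_DOC_MS.items():
    total_min = DOCS * (convert_ms + INSERT_MS) / 1000.0 / 60.0
    print("%-13s %6.1f minutes" % (platform, total_min))
# prints roughly 2.9, 7.6, 17.2 and 107.8 minutes respectively
```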