Overview
Engage will need to accommodate a great deal of diversity in how data is structured. Each museum's collection is different, and unifying every type of collection under a single schema is a daunting task. Rather than taking a "one size fits all" approach at the database level, Engage will support flexible schemas.
Because no fixed schema can be settled during the design phase, a traditional SQL-based relational database is likely inappropriate.
Potential Technology
- CouchDB - an interesting system, implemented in Erlang, based on storage of JSON objects. Queries are implemented by "views" written in standard JavaScript obeying some consistency constraints. Highly horizontally scalable, and coming into some maturity (started 2005). Became an Apache top-level project in Feb 2008. It is only addressable via a RESTful JSON API, and hence would always incur an "out of process" communication hit.
- Persistence from the Google App Engine - this is a custom implementation by Google, using a novel storage idiom based on a "partitioned key" - each persistent item is assigned a key path, of which the first element may be used to assign it to a persistence group or transactional unit. The storage is queried using GQL, a reduced SQL-like language which, like the CouchDB JavaScript system, is "join-less". Currently the App Engine source code is not available and these apps must be hosted on Google's infrastructure.
- A JSR-170 compliant Java Content Repository (JCR). This stores data in a version-managed hierarchy of "nodes" each of which may store a flat collection of key-value pairs. Querying is performed via reduced XPath or a limited SQL subset. This choice limits the hosting language to Java, and JCR mandates no wire protocol beyond a recommendation for WebDAV. Also, many useful aspects of the implementation (transaction support, lazy result sets, etc.) lie outside the specification for JSR-170, though improved support is coming in JSR-283. The full JSR-170 spec, however, only appears in one implementation, Apache Jackrabbit.
- Fedora Repository is part of the overall Fedora Commons repository project. Whilst Fedora Repository provides support for internal XML-oriented storage attached to a "Digital Object", it excels particularly in management of binary data streams, and is likely to be a suitable repository for storing extremely large image or video streams. Whilst Fedora itself is implemented in Java, it specifies (XML-based) RESTful protocols for query and update. Also notable is the inclusion of Mulgara, an RDF-based triplestore, as part of the package (this somewhat answers the same semi-structured data requirement as CouchDB, for example).
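To make the "flexible schema plus view" idea from the CouchDB entry concrete, here is a minimal sketch in plain Python. It does not use CouchDB or its API: the records, field names, and the helpers `map_by_type`, `build_view` and `query` are all hypothetical, and real CouchDB views are written in JavaScript. It only illustrates how documents with different fields can sit side by side and still be queried through a map-style function, without joins.

```python
import json

# Hypothetical records from different collections; field names are
# illustrative only. Note that the documents do not share one schema.
docs = [
    {"_id": "a1", "type": "painting", "title": "Harbour at Dusk", "medium": "oil"},
    {"_id": "b7", "type": "fossil", "title": "Ammonite", "period": "Jurassic"},
    {"_id": "c3", "type": "painting", "title": "Study in Blue", "artist": "A. Example"},
]


def map_by_type(doc):
    """Emit (key, value) pairs for one document, in the spirit of a map function."""
    if "type" in doc and "title" in doc:
        yield doc["type"], doc["title"]


def build_view(documents, map_fn):
    """Run map_fn over every document and return the emitted rows sorted by key."""
    rows = [
        {"id": doc["_id"], "key": key, "value": value}
        for doc in documents
        for key, value in map_fn(doc)
    ]
    return sorted(rows, key=lambda row: (row["key"], row["id"]))


def query(view_rows, key):
    """Return the rows of a prebuilt view whose key matches exactly."""
    return [row for row in view_rows if row["key"] == key]


view = build_view(docs, map_by_type)
paintings = query(view, "painting")
print(json.dumps(paintings, indent=2))
```

The design point is that the map function, not the storage layer, decides which fields matter, so a new kind of record can be added without a schema migration.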
CouchDB
Links
- CouchDB's Apache Site
- What is CouchDB Ex-Powerpoint presentation online from Damien Katz
- Blog posting about using CouchDB from Python
- Another post about Couch and Python
- CouchDB's author weighs its pros and cons
- Sam Ruby on Ascetic Database Architectures
- A critique of CouchDB
- Sam Ruby's response to Dare's Critique
- Some informal CouchDB performance metrics - note the more recent benchmarks in the comments at the end.
- Jan Lehnardt's CouchDB-related blog
- Christopher Lenz talks about "join-like" queries in CouchDB
- CouchDB - A use case Kore Nordmann talks about how to implement a groups and permissions system with the CouchDB query system
- Reddit thread with random commentary on CouchDB
- CouchDB bulk insert's performance Shows a performance "floor" at 2.5 million records on a 1 GB machine; developers have not yet responded
- CouchDB in the browser, plus a bit about the Cloud vision for Couch
General concerns
- Amazon SimpleDB and Eventual Consistency In a distributed world, reads need not always be up to date.
1 Comment
Jun 05, 2009
David Trelles
The main goal of this document is to look at CouchDB in more depth, including some of its advantages and disadvantages.
The reasons why CouchDB is "better" than MySQL. The advantages.
Some possible disadvantages.
Note that many of these issues can be avoided by rethinking the problem. For example, reliable counts are possible in Couch by inspecting revision properties; the Transactions and Locks article shows how to rethink a few common cases. Similarly, denormalised data makes more sense in a distributed world - see the link above on Eventual Consistency.
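As a rough illustration of the revision-based approach to reliable counts, the sketch below uses an in-memory stand-in for a revisioned document store. It is not the CouchDB API: `TinyStore`, `ConflictError` and `increment` are hypothetical names, and integer revisions are a simplification. The only behaviour assumed from CouchDB is that an update carrying a stale `_rev` is rejected as a conflict, so the writer must re-read and retry.

```python
class ConflictError(Exception):
    """Raised when an update carries a stale revision."""


class TinyStore:
    """In-memory stand-in for a revisioned document store (not the CouchDB API)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, body)

    def get(self, doc_id):
        rev, body = self._docs[doc_id]
        return {"_id": doc_id, "_rev": rev, **body}

    def put(self, doc):
        doc_id = doc["_id"]
        current_rev = self._docs.get(doc_id, (0, None))[0]
        # Accept the write only if the caller saw the latest revision.
        if doc.get("_rev", 0) != current_rev:
            raise ConflictError(doc_id)
        body = {k: v for k, v in doc.items() if k not in ("_id", "_rev")}
        self._docs[doc_id] = (current_rev + 1, body)
        return current_rev + 1


def increment(store, doc_id, field, retries=5):
    """Read, modify and write back, retrying if another writer got in first."""
    for _ in range(retries):
        doc = store.get(doc_id)
        doc[field] = doc.get(field, 0) + 1
        try:
            store.put(doc)
            return doc[field]
        except ConflictError:
            continue
    raise ConflictError(doc_id)


store = TinyStore()
store.put({"_id": "visits", "count": 0})
stale = store.get("visits")          # a second client reads revision 1
increment(store, "visits", "count")
increment(store, "visits", "count")

try:
    stale["count"] = 99              # the second client writes from its old copy
    store.put(stale)
    stale_write_accepted = True
except ConflictError:
    stale_write_accepted = False

print(store.get("visits")["count"], stale_write_accepted)
```

No lock is held between the read and the write; correctness comes from the store refusing the stale update, which is why the count stays reliable even with competing writers.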
Alternatives
Clients
Libraries
Miscellaneous
Full Text Searching
(Feel free to modify this document or suggest changes)