Overview

Engage will have to accommodate a great deal of diversity in how data is structured. Each museum's collection is different, and unifying all types of collections under a single schema is a daunting task. Rather than taking a "one size fits all" approach at the database level, the approach we'll take in Engage is to enable flexible schemas.

Because the schema cannot be fixed during the design phase, a traditional SQL-based relational database is likely inappropriate.

Potential Technology

  • CouchDB - an interesting system, implemented in Erlang, based on storage of JSON objects. Queries are implemented by "views" written in standard Javascript obeying some consistency constraints. Highly horizontally scalable, and coming into some maturity (started 2005). Entered the Apache Incubator in Feb 2008 and became an Apache top-level project in Nov 2008. It is only addressable via a RESTful JSON API, and hence would always incur an "out of process" communication hit.
  • Persistence from the Google App Engine - this is a custom implementation by Google, using a novel storage idiom based on a "partitioned key" - each persistent item is assigned a key path, of which the first element may be used to assign it to a persistence group or transactional unit. The storage is queried using GQL, a reduced SQL-like language which, like the CouchDB Javascript system, is "join-less". Currently the App Engine source code is not available and these apps must be hosted on Google's infrastructure.
  • A JSR-170 compliant Java Content Repository (JCR). This stores data in a version-managed hierarchy of "nodes" each of which may store a flat collection of key-value pairs. Querying is performed via reduced XPath or a limited SQL subset. This choice limits the hosting language to Java, and JCR mandates no wire protocol beyond a recommendation for WebDAV. Also, many useful aspects of the implementation (transaction support, lazy result sets, etc.) lie outside the specification for JSR-170, though improved support is coming in JSR-283. The full JSR-170 spec, however, only appears in one implementation, Apache Jackrabbit.
  • Fedora Repository is part of the overall Fedora Commons repository project. Whilst Fedora Repository provides support for internal XML-oriented storage attached to a "Digital Object", it excels particularly in the management of binary data streams - it is likely to be a suitable repository for storing extremely large image or video streams. Whilst Fedora itself is implemented in Java, it specifies (XML-based) RESTful protocols for query and update. Also notable is the inclusion of Mulgara, an RDF-based triplestore, as part of the package (this somewhat answers the same semi-structured data requirement as CouchDB, for example).
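
To make the "join-less view over flexible documents" idea from the CouchDB entry more concrete, here is a minimal sketch in Python. It is an in-memory emulation only, not CouchDB's API: real CouchDB views are Javascript functions stored in the database. The museum records, the field names, and the helpers `by_artist`, `build_view` and `query` are all hypothetical and chosen for illustration.

```python
import json

# Hypothetical museum records with deliberately different shapes (no shared schema).
docs = [
    {"_id": "obj-1", "type": "painting", "title": "Harbour at Dusk",
     "artist": "A. Example", "year": 1887},
    {"_id": "obj-2", "type": "fossil", "title": "Trilobite", "period": "Cambrian"},
    {"_id": "obj-3", "type": "painting", "title": "Untitled", "artist": "B. Sample"},
]


def by_artist(doc):
    """Map function: yield (key, value) pairs for documents that have an artist."""
    if "artist" in doc:
        yield doc["artist"], doc["title"]


def build_view(documents, map_fn):
    """Run map_fn over every document and return the emitted rows sorted by key."""
    rows = [
        {"id": doc["_id"], "key": key, "value": value}
        for doc in documents
        for key, value in map_fn(doc)
    ]
    return sorted(rows, key=lambda row: row["key"])


def query(view_rows, key):
    """Look up all rows with an exact key match (no joins involved)."""
    return [row for row in view_rows if row["key"] == key]


artist_view = build_view(docs, by_artist)
print(json.dumps(artist_view, indent=2))
print(query(artist_view, "B. Sample"))
```

The fossil record simply emits nothing, so documents that lack a field do not need a schema change; they are just absent from that particular view.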

CouchDB

General concerns

1 Comment

  1. The main goal of this document is to look at CouchDB in more depth, including some advantages and disadvantages.

    Reasons why CouchDB is "better" than MySQL (the advantages).

    1. Schema-less db, which means that development is very fast: we don't need to do a db update every time we add a column.
    2. Everything is over plain HTTP (GET, POST, PUT and DELETE requests), which means it works with Varnish/Squid out of the box.
    3. Attachments - you can store file attachments.
    4. Map Reduce - no more SQL queries; use highly scalable map-reduce based views instead. Views, once saved, are lightning fast.
    5. Futon - comes with a friendly Javascript interface for displaying and editing data.
    6. Javascript view server using Mozilla SpiderMonkey to construct views - means no need for PHP or Flash.
    7. Zero-config replication - work offline (for example, from home with no internet) and replicate later.
    8. Python couchdb library.
    9. Bulk updates, deletes - it's possible to store 100000 docs in one post request.
    10. Each couchdb document is just a simple JSON compatible doc.
    11. It uses Erlang, which means it scales on multicore, multiprocessor machines. This is the key to good performance.
    12. Low memory requirement - takes 150MB compared to 8GB taken by MySQL for a similar db setup.
    13. Similar to ZODB, but much cleaner and more intuitive.
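
As an illustration of points 2, 9 and 10 above, the following sketch builds a single POST request that would store several JSON documents at once through CouchDB's `_bulk_docs` endpoint. It only constructs the request and does not send it. The server address (a local CouchDB on its default port 5984), the database name `engage`, the sample documents and the helper `bulk_docs_request` are assumptions made for this example.

```python
import json
import urllib.request

COUCH_URL = "http://localhost:5984"  # assumption: local CouchDB on its default port
DB_NAME = "engage"                   # hypothetical database name


def bulk_docs_request(docs, base_url=COUCH_URL, db=DB_NAME):
    """Build (but do not send) a POST to the _bulk_docs endpoint."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{db}/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents; each one is just a JSON-compatible dict.
docs = [
    {"_id": "obj-1", "type": "painting", "title": "Harbour at Dusk"},
    {"type": "fossil", "period": "Cambrian"},  # no _id: the server would assign one
]

request = bulk_docs_request(docs)
print(request.get_method(), request.full_url)

# To actually send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Because the whole exchange is ordinary HTTP with a JSON body, the same request could be issued from any language or from a command-line HTTP client.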

    Some possible disadvantages.

    1. It doesn't support transactions. This means that enforcing uniqueness of one field across all documents (for example, enforcing that a username is unique) is not safe. Another consequence of CouchDB's inability to support the typical notion of a transaction is that operations like incrementing or decrementing a value and saving it back are also dangerous. That said, there aren't many cases where we would want to simply increment or decrement a value and couldn't instead store the individual documents separately and aggregate them with a view.
    2. Relational data. If the data makes a lot of sense in 3rd normal form, and we try to follow that form in CouchDB, we are going to run into a lot of trouble. A possible way to solve this problem is with view collations, but we might constantly be fighting with the system. If the data can be reformatted to be much more denormalized, then CouchDB will work fine.
    3. Data warehouse. The problem here is that temporary views in CouchDB on large datasets are really slow. Using CouchDB with permanent views could work quite well. However, in most cases, a column-oriented database of some sort is a much better tool for the data warehousing job.
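
The suggestion in point 1, storing individual documents separately and aggregating them with a view instead of incrementing a shared value, can be sketched as follows. This is an in-memory Python emulation of a map/reduce view with grouping, not CouchDB code; the "visit" event documents and the helpers `map_visits`, `reduce_sum` and `grouped_view` are hypothetical.

```python
from collections import defaultdict

# Hypothetical "visit" events: one small document per event, never updated in place.
events = [
    {"_id": "e1", "type": "visit", "object_id": "obj-1", "count": 1},
    {"_id": "e2", "type": "visit", "object_id": "obj-2", "count": 1},
    {"_id": "e3", "type": "visit", "object_id": "obj-1", "count": 1},
    {"_id": "e4", "type": "note", "object_id": "obj-1"},
]


def map_visits(doc):
    """Map step: emit (object_id, count) for visit documents only."""
    if doc.get("type") == "visit":
        yield doc["object_id"], doc["count"]


def reduce_sum(values):
    """Reduce step: add up all values emitted for one key."""
    return sum(values)


def grouped_view(documents, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(groups.items())}


totals = grouped_view(events, map_visits, reduce_sum)
print(totals)
```

Since every writer only ever adds a new document, no two writers compete to update the same value, which is what makes the read-modify-write problem disappear in this design.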

    Note that many of these issues can be avoided by rethinking the problem. For example, reliable counts are possible in Couch through inspecting revision properties - the Transactions and Locks article shows how to rethink a few common cases. Similarly, denormalised data makes more sense in a distributed world - see link above on Eventual Consistency.
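
One way to read the remark about revision properties is as optimistic concurrency: a write that carries a stale revision is rejected rather than silently overwriting newer data, and the writer retries. The sketch below emulates that behaviour in memory. `TinyStore`, `ConflictError` and `increment` are hypothetical names, and revisions are simplified to integers here, whereas real CouchDB revision identifiers are opaque strings and a stale write is answered with HTTP 409 Conflict.

```python
class ConflictError(Exception):
    """Raised when a write carries a stale revision."""


class TinyStore:
    """In-memory stand-in for a document store with per-document revisions."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Return a copy so callers cannot mutate the stored state directly.
        return dict(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise ConflictError(doc["_id"])
        next_rev = 1 if current is None else current["_rev"] + 1
        self._docs[doc["_id"]] = dict(doc, _rev=next_rev)
        return next_rev


def increment(store, doc_id, field, retries=5):
    """Read, modify and write back, retrying from a fresh read on conflict."""
    for _ in range(retries):
        doc = store.get(doc_id)
        doc[field] = doc.get(field, 0) + 1
        try:
            store.put(doc)
            return doc[field]
        except ConflictError:
            continue
    raise ConflictError(doc_id)


store = TinyStore()
store.put({"_id": "counter", "views": 0})

first = store.get("counter")   # two clients read the same revision
second = store.get("counter")

first["views"] += 1
store.put(first)               # succeeds, the revision moves on

second["views"] += 1
try:
    store.put(second)          # stale revision: rejected, no lost update
    stale_write_rejected = False
except ConflictError:
    stale_write_rejected = True

new_value = increment(store, "counter", "views")
print(stale_write_rejected, new_value)
```

The rejected write is what keeps the count reliable: the second client has to re-read the document and apply its change on top of the latest revision.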

    Alternatives

    • Feather DB (CouchDB clone in Java)
    • StrokeDB (A CouchDB-like database written in Ruby to make embedding into Ruby apps easier)
    • Zodb (Zope Object Database)

    Clients

    • CouchDB-FUSE: mount document attachments on a virtual filesystem
    • Fuschia is a graphical document browser for CouchDB
    • Levitz - XUL based CouchDb utility client
    • Valance GUI client in PyGTK

    Libraries

    • CouchDB4J Java bindings
    • Erlang interface to CouchDB (discontinued!!)
    • Erlang interfaces:
      • erlang_couchdb
      • eCouch
    • Perl interfaces:
      • Net::CouchDb
      • CouchDB::Client
      • POE::Component::Client::CouchDB
    • Perl tools:
      • CouchDB::View, handling Perl views on both the client and server sides
      • CouchDB::Deploy, simple configuration to help deploy applications that use CouchDB
    • PHP libraries
      • PHPillow, an object orientated wrapper for CouchDB.
      • PHP library for CouchDb
    • Ruby libraries
      • CouchFoo (ActiveRecord matching API to CouchDB)
      • CouchObject (Ruby client + JsServer for views in Ruby)
    • CouchDB Python Library
    • CouchDB Common Lisp Library
    • jQuery CouchDB Library (quite interesting...)
    • Squeak CouchDB Library
    • Paisley: A Twisted Python CouchDB Client
    • Storing GeoData (PHP, Google Geocoding Service)

    Miscellaneous

    • CouchApp Utilities for developing standalone CouchDB applications using just HTML and Javascript.
    • CouchDBX Packaging CouchDB for Mac OS X.
    • Interactive CouchDB A CouchDB emulator/visualizer written in 100% Javascript.
    • Lounge A proxy-based partitioning/clustering framework for CouchDB.

    Full Text Searching

    • CouchDB Lucene Enables full-text searching of CouchDB documents using Lucene.
    • CouchDB Solr2 Integrates full-text indexing and searching with CouchDB. (Inactive as of January 2009)
    • HyperCouch Full text indexing of CouchDB via Hyper Estraier.

    (Feel free to modify this document or suggest changes)