NoSQL Database by Christof Strauch - HTML preview

5. Document Databases

 

In this chapter another class of NoSQL databases will be discussed. Document databases are considered by many to be the next logical step from simple key-/value-stores to slightly more complex and meaningful data structures, as they at least allow key-/value-pairs to be encapsulated in documents. On the other hand, there is no strict schema that documents have to conform to, which eliminates the need for schema migration efforts (cf. [Ipp09]). In this chapter Apache CouchDB and MongoDB, the two major representatives of the class of document databases, will be investigated.

5.1. Apache CouchDB

5.1.1. Overview

CouchDB is a document database written in Erlang. The name CouchDB is nowadays sometimes expanded to “Cluster of unreliable commodity hardware” database, which is in fact a backronym according to one of its main developers (cf. [PLL09]).

CouchDB can be regarded as a descendant of Lotus Notes, on which CouchDB’s main developer Damien Katz worked at IBM before he later initiated the CouchDB project on his own¹. A lot of concepts from Lotus Notes can be found in CouchDB: documents, views, distribution, and replication between servers and clients. The approach of CouchDB is to build such a document database from scratch with web technologies such as Representational State Transfer (REST; cf. [Fie00]) and JavaScript Object Notation (JSON) as a data interchange format, and with the ability to integrate with infrastructure components such as load balancers and caching proxies (cf. [PLL09]).

CouchDB can be briefly characterized as a document database which is accessible via a RESTful HTTP interface and contains schema-free documents in a flat address space. JavaScript functions select and aggregate these documents, and representations of them, in a MapReduce manner to build views of the database, which are also indexed. CouchDB is distributed and able to replicate incrementally between server nodes as well as between clients and servers. Multiple concurrent versions of the same document (MVCC) are allowed in CouchDB, and the database is able to detect conflicts and manage their resolution, which is delegated to client applications (cf. [Apa10c], [Apa10a]).
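The idea of building a view from map and reduce functions can be illustrated with a minimal in-memory sketch. It is written in Python rather than JavaScript and does not use CouchDB itself; the names `documents`, `map_categories`, `reduce_count` and `build_view` as well as the sample data are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical flat address space: document id -> schema-free document.
documents = {
    "doc-1": {"Title": "CouchDB", "Categories": ["Database", "NoSQL"]},
    "doc-2": {"Title": "MongoDB", "Categories": ["Database", "NoSQL"]},
    "doc-3": {"Title": "Erlang", "Categories": ["Programming Language"]},
}

def map_categories(doc):
    """Map step: emit one (key, value) pair per category of a document."""
    for category in doc.get("Categories", []):
        yield category, 1

def reduce_count(values):
    """Reduce step: aggregate all values emitted for the same key."""
    return sum(values)

def build_view(docs, map_fn, reduce_fn):
    """Apply map_fn to every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs.values():
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Keys are kept in sorted order, loosely mimicking an index over the view.
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}

view = build_view(documents, map_categories, reduce_count)
print(view)
```

Documents lacking a `Categories` field simply emit nothing, which is how a schema-free collection can still be queried through a view without any schema migration.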

The most notable use of CouchDB in production is Ubuntu One ([Can10a]), the cloud storage and replication service for Ubuntu Linux ([Can10b]). CouchDB is also part of the BBC’s new web application platform (cf. [Far09]). Furthermore, some (less prominent) blogs, wikis, social networks, Facebook apps and smaller web sites use CouchDB as their datastore (cf. [C+10]).

5.1.2. Data Model and Key Abstractions

Documents

The main abstraction and data structure in CouchDB is a document. Documents consist of named fields that have a key/name and a value. A field name has to be unique within a document and its assigned value may be a string (of arbitrary length), number, boolean, date, an ordered list or an associative map (cf. [Apa10a]). Documents may contain references to other documents (URIs, URLs) but these are neither checked nor kept consistent by the database (cf. [PLL09]). A further limitation is that documents in CouchDB cannot be nested (cf. [Ipp09]).
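The field model described above can be sketched as a plain Python dictionary that is serialized to JSON. This is only an illustration under stated assumptions: the field names and values are hypothetical, and since JSON has no native date type the date is stored as a string here.

```python
import json

# Hypothetical document; all field names and values are made up for illustration.
# A Python dict enforces unique keys, matching the unique-field-name rule.
document = {
    "Name": "Example entry",                           # string
    "Votes": 42,                                       # number
    "Published": True,                                 # boolean
    "Created": "2010-09-23",                           # date, kept as a string
    "Tags": ["NoSQL", "Document Database"],            # ordered list
    "Author": {"Name": "Jane Doe", "Country": "DE"},   # associative map
    "Related": "http://example.org/other-document",    # reference, never checked
}

# Round trip through JSON, the interchange format mentioned in the overview.
encoded = json.dumps(document)
decoded = json.loads(encoded)
print(decoded["Tags"])
```

The `Related` field is an ordinary string as far as the data model is concerned; nothing verifies that the referenced document exists.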

A wiki article may be an example of such a document:

{
  "Title": "CouchDB",
  "Last editor": "172.5.123.91",
  "Last modified": "9/23/2010",
  "Categories": ["Database", "NoSQL", "Document Database"],
  "Body": "CouchDB is a ...",
  "Reviewed": false
}

Besides fields, documents may also have attachments, and CouchDB maintains some metadata such as a unique identifier and a sequence id² for each document (cf. [Apa10b]). The document id is a 128 bit value (so a CouchDB database can store about 3.4 · 10^38 different documents); the revision number is a 32 bit value determined by a hash function³.
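The size of the identifier space follows directly from the bit widths: a 128 bit value can take 2^128 distinct values, roughly 3.4 · 10^38, and a 32 bit value can take 2^32. A short check of this arithmetic (the variable names are illustrative only):

```python
ID_BITS = 128        # width of the document id as stated in the text
REVISION_BITS = 32   # width of the revision number as stated in the text

id_space = 2 ** ID_BITS
revision_space = 2 ** REVISION_BITS

print(id_space)              # exact number of distinct 128 bit values
print(f"{id_space:.3e}")     # 3.403e+38
print(revision_space)        # 4294967296
```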

CouchDB considers itself a semi-structured database. While relational databases are designed for structured and interdependent data and key-/value-stores operate on uninterpreted, isolated key-/value-pairs, document databases like CouchDB pursue a third path: data is contained in documents which do not correspond to a fixed schema (schema-free) but have some inner structure known to