3. Basic Concepts, Techniques and Patterns
This chapter outlines some fundamental concepts, techniques and patterns that are common among NoSQL datastores and not unique to only one class of nonrelational databases or a single NoSQL store. Specific concepts and techniques of the various classes of NoSQL datastores and individual products will be discussed in the subsequent chapters of this paper.
In a keynote titled “Towards Robust Distributed Systems” at ACM’s PODC1 symposium in 2000 Eric Brewer came up with the so called CAP-theorem (cf. [Bre00]) which is widely adopted today by large web companies (e.g. Amazon, cf. [Vog07], [Vog08]) as well as in the NoSQL community. The CAP acronym stands for (as summarized by Gray in [Gra09]):
Consistency meaning if and how a system is in a consistent state after the execution of an operation. A distributed system is typically considered to be consistent if after an update operation of some writer all readers see his updates in some shared data source. (Nevertheless there are several alternatives towards this strict notion of consistency as we will see below.)
Availability and especially high availability meaning that a system is designed and implemented in a way that allows it to continue operation (i.e. allowing read and write operations) if e.g. nodes in a cluster crash or some hardware or software parts are down due to upgrades.
Partition Tolerance understood as the ability of the system to continue operation in the presence of network partitions. These occur if two or more “islands” of network nodes arise which (temporarily or permanently) cannot connect to each other. Some people also understand partition tolerance as the ability of a system to cope with the dynamic addition and removal of nodes (e.g. for maintainance purposes; removed and again added nodes are considered an own network partition in this notion; cf. [Ipp09]).
Now, Brewer alleges that one can at most choose two of these three characteristics in a “shared-data system” (cf. [Bre00, slide 14]). In his talk, he referred to trade-offs between ACID and BASE systems (see next subsection) and proposed as a decision criteria to select one or the other for individual use-cases: if a system or parts of a system have to be consistent and partition-tolerant, ACID properties are required and if availability and partition-tolerance are favored over consistency, the resulting system can be characterized by the BASE properties. The latter is the case for Amazon’s Dynamo (cf. [DHJ+07]), which is available and partition-tolerant but not strictly consistent, i.e. writes of one client are not seen immediately after being committed to all readers. Google’s Bigtable chooses neither ACID nor BASE but the third CAP-alternative being a consistent and available system and consequently not able to fully operate in the presence of network partitions. In his keynote Brewer points out traits and examples of the three different choices that can be made according to his CAP-theorem (see table 3.1).
Table 3.1.: CAP-Theorem – Alternatives, Traits, Examples (cf. [Bre00, slides 14–16])
With regards to databases, Brewer concludes that current “Databases [are] better at C[onsistency] than Availability” and that “Wide-area databases can’t have both” (cf. [Bre00, slide 17])—a notion that is widely adopted in the NoSQL community and has influenced the design of nonrelational datastores.
The internet with its wikis, blogs, social networks etc. creates an enormous and constantly growing amount of data needing to be processed, analyzed and delivered. Companies, organizations and individuals offering applications or services in this field have to determine their individual requirements regarding performance, reliability, availability, consistency and durability (cf. [Gra09]). As discussed above, the CAP-theorem states that a choice can only be made for two options out of consistency, availability and partition tolerance. For a growing number of applications and use-cases (including web applications, especially in large and ultra-large scale, and even in the e-commerce sector, see [Vog07], [Vog08]) availability and partition tolerance are more important than strict consistency. These applications have to be reliable which implicates availability and redundancy (consequently distribution among two or more nodes, which is necessary as many systems run on “cheap, commoditized and unreliable” machines [Ho09a] and also provides scalability). These properties are difficult to achieve with ACID properties therefore approaches like BASE are applied (cf. [Ipp09]).
The BASE approach according to Brewer forfeits the ACID properties of consistency and isolation in favor of “availability, graceful degradation, and performance” (cf. [Bre00, slide 12]). The acronym BASE is composed of the following characteristics: