NoSQL Database by Christof Strauch - HTML preview

6. Column-Oriented Databases

This chapter investigates a third class of NoSQL datastores: column-oriented databases. The approach of storing and processing data by column instead of by row has its origin in analytics and business intelligence, where column-stores operating in a shared-nothing, massively parallel processing architecture can be used to build high-performance applications. Notable products in this field are Sybase IQ and Vertica (cf. [Nor09]). In this chapter, however, the class of column-oriented stores is interpreted less strictly and also subsumes datastores that integrate column- and row-orientation. Such datastores are also described as “[sparse], distributed, persistent multidimensional sorted [maps]” (cf. e.g. [Int10]). The main inspiration for column-oriented datastores is Google’s Bigtable, which is discussed first in this chapter. A brief overview of two datastores influenced by Bigtable follows: Hypertable and HBase. The chapter concludes with an investigation of Cassandra, which is inspired by Bigtable as well as by Amazon’s Dynamo. As seen in section 2.3 on classifications of NoSQL datastores, Bigtable is classified differently by various authors, e.g. as a “wide columnar store” by Yen (cf. [Yen09]), as an “extensible record store” by Cattell (cf. [Cat10]) or as an entity-attribute-value datastore by North (cf. [Nor09]). In this paper, Cassandra is discussed along with the column-oriented databases because most authors place it in this category and because one of its main inspirations, Google’s Bigtable, has not yet been introduced.
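The difference between storing data by row and by column, mentioned at the beginning of this chapter, can be illustrated with a minimal Python sketch. The sample records, the field names and the helper `to_columns` are hypothetical and chosen only for illustration; the sketch does not reflect the storage format of any particular product.

```python
# Hypothetical sample records in a row-oriented layout:
# each record keeps all of its attributes together.
rows = [
    {"id": 1, "name": "alice", "age": 30},
    {"id": 2, "name": "bob", "age": 25},
    {"id": 3, "name": "carol", "age": 35},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists.

    Assumes every record has the same fields; otherwise the
    positions in the column lists would no longer line up.
    """
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


# Column-oriented layout: all values of one attribute are kept together.
columns = to_columns(rows)

# An aggregate over one attribute reads a single list in the column
# layout, but has to visit every record in the row layout.
average_age_rows = sum(r["age"] for r in rows) / len(rows)
average_age_columns = sum(columns["age"]) / len(columns["age"])
```

Both computations yield the same result; the column layout merely avoids reading attributes that the query does not need, which is why it suits analytical workloads.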

6.1. Google’s Bigtable

Bigtable is described as “a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers” (cf. [CDG+06, p. 1]). As of 2006, it was used by more than sixty projects at Google, including web indexing, Google Earth, Google Analytics, Orkut, and Google Docs (formerly named Writely). These projects have very different requirements regarding data size, infrastructure and latency: “from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data” (cf. [CDG+06, p. 1]). According to Chang et al., experience at Google shows that “Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability” (cf. [CDG+06, p. 1]). Its users “like the performance and high availability provided by the Bigtable implementation, and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time”. For Google as a company, the design and implementation of Bigtable has proven advantageous, as it has “gotten a substantial amount of flexibility from designing our own data model for Bigtable. In addition, our control over Bigtable’s implementation, and the other Google infrastructure upon which Bigtable depends, means that we can remove bottlenecks and inefficiencies as they arise” (cf. [CDG+06, p. 13]).

Google describes Bigtable as a database because “it shared many implementation strategies with databases”, e.g. with parallel and main-memory databases. However, it differs from relational databases in that it “does not support a full relational data model” but a simpler one that can be dynamically controlled by clients. Bigtable furthermore allows “clients to reason about the locality properties of the data” which are reflected “in the underlying storage” (cf. [CDG+06, p. 1]). In contrast to RDBMSs, Bigtable can index data in more than one dimension: not only row-wise but also column-wise. A further distinguishing feature is that Bigtable can serve data either from memory or from disk, which can be specified via configuration.
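The idea of indexing data by row as well as by column can be sketched as a sparse, sorted map keyed by a (row, column) pair. The following Python example is a toy in-memory illustration only: the class `SparseSortedTable`, its methods and the sample row and column names are hypothetical and are not Bigtable’s actual API, and distribution and persistence are ignored entirely.

```python
import bisect


class SparseSortedTable:
    """Toy in-memory map from (row, column) to value, kept sorted by key."""

    def __init__(self):
        self._keys = []    # sorted list of (row, column) tuples
        self._values = {}  # (row, column) -> value

    def put(self, row, column, value):
        key = (row, column)
        if key not in self._values:
            bisect.insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value

    def get(self, row, column):
        # Sparse: cells that were never written simply do not exist.
        return self._values.get((row, column))

    def scan_row(self, row):
        """Return all (column, value) pairs of one row in column order."""
        start = bisect.bisect_left(self._keys, (row, ""))
        result = []
        for key in self._keys[start:]:
            if key[0] != row:
                break
            result.append((key[1], self._values[key]))
        return result

    def scan_column(self, column):
        """Return all (row, value) pairs that contain the given column."""
        return [(r, self._values[(r, c)]) for r, c in self._keys if c == column]


# Hypothetical sample data: row keys and column names are illustrative.
table = SparseSortedTable()
table.put("com.example/index", "contents", "<html>...")
table.put("com.example/index", "anchor:news", "Example")
table.put("org.sample/home", "contents", "<html>...")

row_cells = table.scan_row("com.example/index")
column_cells = table.scan_column("contents")
missing = table.get("org.sample/home", "anchor:news")
```

Because the keys are kept sorted, all cells of one row are adjacent and can be read with a single range scan, whereas the column scan in this sketch has to walk all keys. A real system would organize its storage differently; the sketch only shows that a value is addressed by both a row and a column and that absent cells take up no space.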

6.1.1. Data Model

Chang et al. state that they “believe the key-value pair model provided by distributed B-trees or distributed hash tables is too limiting. Key-value pairs are a useful building block, but they should not be the only building block one provides to developers.” Therefore the data model they designed for Bigtable