NoSQL Databases: Cassandra vs. DynamoDB

Dr. Aly, O.
Computer Science

Introduction

This discussion analyzes the differences between two NoSQL databases, Cassandra and DynamoDB. It begins with a brief overview of NoSQL and the types of NoSQL data stores, followed by a more focused discussion of Cassandra and DynamoDB.

NoSQL Overview

NoSQL stands for “Not Only SQL” (EMC, 2015; Sahafizadeh & Nematbakhsh, 2015). NoSQL is used for modern, scalable databases in the age of Big Data. Scalability enables a system to increase its throughput when demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015). A platform can incorporate two types of scalability to support the processing of Big Data: horizontal scaling and vertical scaling. Horizontal scaling distributes the workload across many servers and nodes, and servers can be added to increase throughput (Sahafizadeh & Nematbakhsh, 2015). Vertical scaling, on the other hand, installs more processors, more memory, and faster hardware on a single server (Sahafizadeh & Nematbakhsh, 2015). NoSQL offers benefits such as mass storage support, fast read and write operations, easy expansion, and low cost (Sahafizadeh & Nematbakhsh, 2015). Examples of NoSQL databases are MongoDB, CouchDB, Redis, Voldemort, Cassandra, BigTable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.
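The idea of horizontal scaling can be illustrated with a small sketch. It assumes a simple hash-modulo placement rule and hypothetical node names (`node-a` and so on), neither of which comes from the cited sources. Real systems such as Cassandra use consistent hashing so that adding a node moves far fewer keys.

```python
import hashlib
from collections import Counter

def assign_node(key, nodes):
    """Pick a node for a key with a stable hash (assumed placement rule)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def load_per_node(keys, nodes):
    """Count how many keys each node would have to serve."""
    return Counter(assign_node(key, nodes) for key in keys)

keys = [f"user-{i}" for i in range(10_000)]

# Same workload, first on three servers and then on four.
three_nodes = load_per_node(keys, ["node-a", "node-b", "node-c"])
four_nodes = load_per_node(keys, ["node-a", "node-b", "node-c", "node-d"])

print("busiest node with 3 servers:", max(three_nodes.values()))
print("busiest node with 4 servers:", max(four_nodes.values()))
```

Adding the fourth server lowers the number of keys the busiest server handles, which is the throughput benefit the paragraph describes.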

NoSQL Data Stores Types

Data stores are categorized into four types: document-oriented, column-oriented (or column-family), graph, and key-value (EMC, 2015; Hashem et al., 2015).

The purpose of the document-oriented database is to store and retrieve collections of information and documents. It supports complex data forms in various formats such as XML and JSON, in addition to binary forms such as PDF and MS Word (EMC, 2015; Hashem et al., 2015). A document is similar to a tuple in a relational database. However, the document-oriented database is more flexible and can retrieve documents and information based on their contents. It offers additional features such as the creation of indexes to increase search performance (EMC, 2015). Document-oriented data stores can be used to manage the content of web pages, as well as for web analytics of log data (EMC, 2015). Examples of document-oriented data stores include MongoDB, SimpleDB, and CouchDB (Hashem et al., 2015).

The purpose of the column-oriented database is to store content in columns rather than rows, with attribute values belonging to the same column stored contiguously (Hashem et al., 2015). The column-family database can be used to store and render blog entries, tags, and viewers’ feedback. It is also used to store and update various web page metrics and counters (EMC, 2015). An example of the column-oriented database is BigTable. EMC (2015) and Erl, Khattak, and Buhler (2016) also list Cassandra as a column-family data store.

The key-value data store is designed to store and access data with the ability to scale to a very large size (Hashem et al., 2015). It contains a value and a key to access that value, and the values can be complex (EMC, 2015). For example, a login ID can serve as the key to a customer’s preference value, and a web session ID can serve as the key to the value for the session. Examples of key-value databases include DynamoDB, HBase, Cassandra, and Voldemort (Hashem et al., 2015). While HBase and Cassandra are described as the most popular and scalable key-value stores (Borkar, Carey, & Li, 2012), DynamoDB and Cassandra are described as two popular AP (Availability and Partition tolerance) systems (M. Chen, Mao, & Liu, 2014). Kaoudi and Manolescu (2015) describe Apache Accumulo, DynamoDB, and HBase as the popular key-value stores.

The purpose of the graph database is to store and represent data using a graph model with nodes, edges, and properties related to one another through relations. An example of the graph database is Neo4j (Hashem et al., 2015). Table 1 provides examples of NoSQL data stores.

Table 1. NoSQL Data Store Types with Examples.
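As a rough illustration of the four data store types, the following sketch models each one with plain Python structures. All keys, field names, and helper functions are hypothetical and chosen only to mirror the use cases mentioned above. It is not how any of the named products stores data internally.

```python
# Key-value: the value is looked up by its key only (login ID -> preferences).
preferences = {"login:alice": {"theme": "dark", "language": "en"}}

# Document: self-describing records that can be retrieved by their content.
documents = [
    {"_id": 1, "type": "page", "title": "Home", "tags": ["landing"]},
    {"_id": 2, "type": "page", "title": "Pricing", "tags": ["sales", "landing"]},
]

def find_by_tag(docs, tag):
    """Return the ids of all documents that carry the given tag."""
    return [doc["_id"] for doc in docs if tag in doc.get("tags", [])]

# Column-family: row key -> column family -> column -> value.
blog = {
    "post-42": {
        "content": {"title": "Hello", "body": "First entry"},
        "metrics": {"views": 10},
    }
}
blog["post-42"]["metrics"]["views"] += 1  # update a page counter in place

# Graph: nodes connected by labelled edges.
edges = [("alice", "FOLLOWS", "bob"), ("bob", "FOLLOWS", "carol")]

def neighbours(graph_edges, node, label):
    """Return the nodes reachable from `node` over edges with `label`."""
    return [dst for src, lbl, dst in graph_edges if src == node and lbl == label]
```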

Cassandra

Cassandra is described as the most popular NoSQL database (C. P. Chen & Zhang, 2014; Mishra, Dehuri, & Kim, 2016). It is a second-generation distributed key-value store developed by Facebook in 2008 (Bifet, 2012; Cattell, 2011; C. P. Chen & Zhang, 2014; Rabl, Sadoghi, & Jacobsen, 2012). It is also described as a clustered key-value database that uses column-oriented storage and redundant storage for accessibility in terms of both read/write performance and data size (Mishra et al., 2016).

Cassandra can handle very large amounts of data spread across many servers. It also provides a highly available service without a single point of failure (Bahrami & Singhal, 2015; Tilmann Rabl et al., 2012). Failure detection and recovery are fully automated (Cattell, 2011). It adopts concepts from both DynamoDB (more precisely its predecessor, Amazon’s Dynamo) and BigTable. Cassandra integrated the distributed technology of DynamoDB with the data model of BigTable (M. Chen et al., 2014; Tilmann Rabl et al., 2012). Thus, the architecture of Cassandra is a mixture of Google’s BigTable and Amazon’s DynamoDB, providing availability and scalability (M. Chen et al., 2014; Tilmann Rabl et al., 2012). One example of a Cassandra application is Netflix, which uses it as the back-end database for its streaming services (Bifet, 2012).

Cassandra, like HBase, is written in Java and used under Apache licensing (Cattell, 2011). Cassandra has column groups; updates are cached in memory and then flushed to disk, and the on-disk representation is compacted periodically. Cassandra supports partitioning and replication (Cattell, 2011). The partitioning and replication techniques in Cassandra are said to be similar to those of DynamoDB to achieve consistency (M. Chen et al., 2014). However, Cassandra is said to have a weaker concurrency model than other systems, as there is no locking technique and replicas are updated asynchronously (Cattell, 2011).
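The write path described above (cache updates in memory, flush them to disk, compact periodically) can be sketched as a toy store. The class name `TinyStore`, the `flush_threshold` parameter, and the use of in-memory dictionaries in place of real disk files are all assumptions made for illustration. Cassandra’s actual implementation is considerably more involved.

```python
class TinyStore:
    """Toy write path: in-memory buffer, flushed segments, periodic compaction."""

    def __init__(self, flush_threshold=2):
        self.memtable = {}   # recent updates cached in memory
        self.segments = []   # flushed, immutable segments (oldest first)
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Move the in-memory updates into a new immutable segment."""
        if self.memtable:
            self.segments.append(dict(self.memtable))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):  # newest segment wins
            if key in segment:
                return segment[key]
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value per key."""
        merged = {}
        for segment in self.segments:  # oldest first, newer values overwrite
            merged.update(segment)
        self.segments = [merged] if merged else []


store = TinyStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.put(key, value)
segments_before = len(store.segments)
store.compact()
```

In this run the four writes produce two flushed segments, and compaction merges them into one while keeping the newer value for the key `a`.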

Cassandra brings newly available nodes into a cluster automatically, uses the “phi accrual algorithm” to detect node failure, and determines cluster membership in a distributed fashion using a “gossip-style algorithm” (Cattell, 2011). Tables in Cassandra take the form of a distributed four-dimensional structured map, where the four dimensions are row, column, column family, and super column (M. Chen et al., 2014). The “super column” provides another level of grouping within column groups (Cattell, 2011). A row is distinguished by a string key of arbitrary length (M. Chen et al., 2014). The number of columns to be read or written does not matter because the operation on a row is atomic (M. Chen et al., 2014). Columns can constitute clusters called column families, similar to the data model of BigTable (M. Chen et al., 2014).
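To give a feel for phi accrual failure detection, the sketch below computes a suspicion level from heartbeat arrival times. It assumes exponentially distributed heartbeat intervals, so that phi = elapsed / (mean interval × ln 10). The class name and the threshold of 8.0 are illustrative assumptions, not values taken from the cited sources.

```python
import math

class PhiAccrualDetector:
    """Suspicion grows continuously with the time since the last heartbeat."""

    def __init__(self, threshold=8.0):
        self.threshold = threshold      # assumed cut-off for suspecting a node
        self.intervals = []             # observed gaps between heartbeats
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0
        elapsed = now - self.last_heartbeat
        # P(no heartbeat yet) = exp(-elapsed / mean); phi = -log10 of that.
        return elapsed / (mean * math.log(10))

    def is_suspected(self, now):
        return self.phi(now) > self.threshold


detector = PhiAccrualDetector()
for t in [0.0, 1.0, 2.0, 3.0]:   # heartbeats arrive once per time unit
    detector.heartbeat(t)
```

With heartbeats one time unit apart, a short delay yields a small phi, while a long silence pushes phi over the threshold. Unlike a fixed timeout, the output is a graded suspicion level, not a yes/no answer.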

Cassandra uses an “ordered hash index” that provides some of the benefits of both hash and B-Tree indexes (Cattell, 2011). However, sorting is slower with the “ordered hash index” than with a B-Tree index (Cattell, 2011). Cassandra is said to be gaining a lot of momentum as an open source project, as it has reportedly scaled to about 150 machines or more in the production platform of Facebook (Cattell, 2011). Cassandra uses the eventual-consistency model, which is not adequate for some applications. For such cases, “quorum reads” of a majority of replicas provide a way to get the latest data (Cattell, 2011). Writes in Cassandra are atomic within a column family (Cattell, 2011). Moreover, Cassandra supports versioning and conflict resolution techniques (Cattell, 2011). Key features of Cassandra include a structured and unstructured peer-to-peer (P2P) system, decentralized storage, symmetric system orientation, efficient latencies, linear scalability, and a map indexed by a unique “row-key” and “column-key” (Kalid, Syed, Mohammad, & Halgamuge, 2017).
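The reason a quorum read returns the latest data can be shown with a short sketch. It assumes a replication factor of 3, a write that reached a majority of replicas, and conflict resolution by highest timestamp. The `Versioned` type and both function names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp: int

def quorum_size(replication_factor):
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def quorum_read(responses, replication_factor):
    """Return the newest version among a majority of replica responses."""
    if len(responses) < quorum_size(replication_factor):
        raise RuntimeError("not enough replicas answered for a quorum")
    return max(responses, key=lambda version: version.timestamp)

# The latest write (timestamp 2) reached two of three replicas; one is stale.
replicas = [Versioned("new", 2), Versioned("new", 2), Versioned("old", 1)]

# Whichever two replicas answer, at least one of them holds the latest write.
results = {
    quorum_read(list(subset), 3).value
    for subset in combinations(replicas, quorum_size(3))
}
print(results)
```

Because any two majorities of the same replica set overlap in at least one replica, every possible quorum read in this example sees the newer value.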

With respect to security, Cassandra encrypts all passwords using the MD5 hash function, so the passwords are very weakly protected, which poses a threat if a malicious user can bypass client authorization (Sahafizadeh & Nematbakhsh, 2015). Such a user can extract data because inter-node message exchange lacks an authorization technique (Sahafizadeh & Nematbakhsh, 2015). In addition, Cassandra is potentially vulnerable to denial-of-service attacks because it runs one thread per client, and it does not support inline auditing (Sahafizadeh & Nematbakhsh, 2015). Cassandra uses a query language called Cassandra Query Language (CQL), which is similar to SQL (Sahafizadeh & Nematbakhsh, 2015). Experiments showed that injection attacks are possible in Cassandra using CQL, much like SQL injection (Sahafizadeh & Nematbakhsh, 2015). Moreover, Cassandra has limitations in managing inactive connections (Sahafizadeh & Nematbakhsh, 2015).
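The injection risk can be illustrated without a running cluster by comparing how the statement text is built. The `users` table, its columns, and the payload are hypothetical, and whether a given server accepts such a statement depends on the schema and CQL version. The sketch only shows how concatenation lets input alter the statement while bound parameters do not.

```python
def build_unsafe(username, password):
    """Anti-pattern: user input becomes part of the statement text."""
    return (
        "SELECT * FROM users WHERE username = '" + username
        + "' AND password = '" + password + "'"
    )

def build_parameterized(username, password):
    """The statement text is fixed; values travel separately as parameters."""
    statement = "SELECT * FROM users WHERE username = ? AND password = ?"
    return statement, (username, password)

# A quote closes the string literal and "--" comments out the password check.
payload = "alice' --"

unsafe_statement = build_unsafe(payload, "wrong-password")
safe_statement, safe_params = build_parameterized(payload, "wrong-password")

print(unsafe_statement)
print(safe_statement, safe_params)
```

In the concatenated version the payload changes the structure of the query. In the parameterized version the same input stays an ordinary value, which is why drivers that support bound parameters are generally preferred.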

DynamoDB

DynamoDB belongs to Amazon and is a highly available, scalable, low-latency key-value NoSQL database (Kalid et al., 2017; Mishra et al., 2016; Russell & Van Duren, 2016). DynamoDB is described as one of the earliest NoSQL databases, affecting the design of other NoSQL databases such as Cassandra (Mishra et al., 2016). DynamoDB supports both the key-value and document data store models (Sahafizadeh & Nematbakhsh, 2015). The goal of DynamoDB is high performance and high throughput (Thuraisingham, Parveen, Masud, & Khan, 2017). DynamoDB can expand and shrink as required by applications (Thuraisingham et al., 2017). It supports an in-memory cache using DynamoDB Accelerator, providing millisecond responses for millions of requests per second (Thuraisingham et al., 2017).

With respect to security, data security, authentication, and access control in DynamoDB can be implemented on a per-table basis by leveraging the AWS identity and access management system (Russell & Van Duren, 2016). However, at the time of the cited survey, data encryption was not supported in DynamoDB. DynamoDB supports communication between the client and server over the HTTPS protocol. It also supports authentication and authorization, and requests need to be signed using HMAC-SHA256 (Sahafizadeh & Nematbakhsh, 2015).
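The general idea of signing a request with HMAC-SHA256 can be sketched with the Python standard library. This is a simplified illustration, not the actual AWS signing procedure. The message layout, function names, key, and request values are all assumptions.

```python
import hashlib
import hmac

def sign_request(secret_key, method, path, body):
    """Compute an HMAC-SHA256 signature over a simplified request summary."""
    body_hash = hashlib.sha256(body).hexdigest()
    message = "\n".join([method, path, body_hash]).encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()

def verify_request(secret_key, method, path, body, signature):
    """Recompute the signature on the server side and compare in constant time."""
    expected = sign_request(secret_key, method, path, body)
    return hmac.compare_digest(expected, signature)

secret = b"example-secret-key"               # hypothetical shared secret
body = b'{"TableName": "ExampleTable"}'      # hypothetical request body

signature = sign_request(secret, "POST", "/", body)
accepted = verify_request(secret, "POST", "/", body, signature)
tampered = verify_request(secret, "POST", "/", body + b" ", signature)
```

The server accepts the request only when it can reproduce the same signature from the shared secret, so a request altered in transit or signed with the wrong key is rejected.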

Summary of Comparison between DynamoDB and Cassandra

DynamoDB was one of the earliest NoSQL databases, influencing the design of other NoSQL databases such as Cassandra. Cassandra integrated the data model of Google’s BigTable with the distributed system technology of DynamoDB, providing availability and scalability. DynamoDB and Cassandra are both popular AP (Availability and Partition tolerance) systems. The partitioning and replication techniques in Cassandra are similar to those of DynamoDB to achieve consistency. According to the cited sources, DynamoDB does not support data encryption, while Cassandra uses MD5 for all passwords. However, DynamoDB supports the HTTPS protocol. Table 2 summarizes the comparison between DynamoDB and Cassandra.

Table 2. Summary of Comparison between DynamoDB and Cassandra

References

Bahrami, M., & Singhal, M. (2015). The role of cloud computing architecture in big data. In Information granularity, big data, and computational intelligence (pp. 275-295): Springer.

Bifet, A. (2012). Mining big data in real time. Informatica, 37(1).

Borkar, V. R., Carey, M. J., & Li, C. (2012). Big data platforms: what’s next? XRDS: Crossroads, The ACM Magazine for Students, 19(1), 44-49.

Cattell, R. (2011). Scalable SQL and NoSQL data stores. ACM SIGMOD Record, 39(4), 12-27.

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Erl, T., Khattak, W., & Buhler, P. (2016). Big Data Fundamentals: Concepts, Drivers & Techniques: Prentice Hall Press.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Kalid, S., Syed, A., Mohammad, A., & Halgamuge, M. N. (2017). Big-data NoSQL databases: A comparison and analysis of “Big-Table”, “DynamoDB”, and “Cassandra”. Paper presented at the 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA).

Kaoudi, Z., & Manolescu, I. (2015). RDF in the clouds: a survey. The VLDB Journal, 24(1), 67-91.

Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and Environments for Big Data Analysis: Parallel, Cloud, and Grid Computing (Vol. 17): Springer.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

Rabl, T., Sadoghi, M., & Jacobsen, H. (2012). Solving Big Data Challenges for Enterprise Application Performance Management.

Russell, B., & Van Duren, D. (2016). Practical Internet of Things Security: Packt Publishing Ltd.

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Thuraisingham, B., Parveen, P., Masud, M. M., & Khan, L. (2017). Big Data Analytics with Applications in Insider Threat Detection: CRC Press.

Published by Think and Knowledge Tank