The Distributed Architecture Behind Apache Cassandra
When a client asked me to research NoSQL databases a year ago, I could not imagine how rich the experience would be in terms of technical breadth and depth. After some research we discovered that Apache Cassandra would meet the requirements we were looking for, which is why I would like to share some technical information from my journey learning about Cassandra.
The NoSQL world and Cassandra’s birth
The database management software world changed some time ago, driven mainly by high-tech companies that handle huge amounts of data distributed over clusters of commodity server machines. These machines must withstand common availability problems while serving a high volume of simultaneous users. For that reason, Facebook engineers decided to create a new solution for their users' inbox search problem, composing a new distributed storage system from the best features of two existing systems: Amazon's Dynamo and Google's Bigtable. The idea was to create a system for managing structured data, designed to scale across many servers with no single point of failure, so that common service outages would not negatively impact end users.
Cassandra was designed to fall on the "AP" side of the CAP theorem, which states that any distributed system can guarantee only two of the following properties at the same time: Consistency, Availability, and Partition tolerance. Cassandra is therefore a good fit when you need a distributed database that brings high availability to a system and remains tolerant to partitions of its data when some node in the cluster is offline, which is common in distributed systems.
Basic data structure
Cassandra is classified as a column-oriented database, which means that its basic storage structure is a set of columns, each comprising a pair of column key and column value. Every row is identified by a unique key, a string without a size limit, called the partition key. Each set of columns is called a column family, similar to a relational database table.
The following relational model analogy is often used to introduce Cassandra to newcomers:
This analogy helps make the transition from the relational to non-relational world. However, do not use this analogy while designing Cassandra column families. Instead, think of the Cassandra column family as a map of a map: an outer map keyed by a row key, and an inner map keyed by a column key. Both maps are sorted.
A nested sorted map is a more accurate analogy than a relational table, and will help you make the right decisions about your Cassandra data model.
- A map gives efficient key lookup, and the sorted nature gives efficient scans. In Cassandra, we can use row keys and column keys to do efficient lookups and range scans.
- The number of column keys is unbounded. This means you can have wide rows.
- A key can itself hold a value; in other words, you can have a valueless column.
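The nested sorted map analogy above can be sketched in a few lines of Python. This is a toy model, not Cassandra's implementation: the outer map is keyed by row (partition) key and each inner map is a sorted map of column key to column value, so both point lookups and column range scans are cheap. The class and key names are invented for illustration.

```python
from bisect import bisect_left, bisect_right

class ColumnFamily:
    """Toy 'map of a sorted map' model of a Cassandra column family."""

    def __init__(self):
        self.rows = {}  # row key -> {column key: column value}

    def put(self, row_key, column_key, value=None):
        # value may be None: the column key itself can carry the
        # information (a "valueless column").
        self.rows.setdefault(row_key, {})[column_key] = value

    def get(self, row_key, column_key):
        return self.rows[row_key][column_key]

    def slice(self, row_key, start, end):
        # Range scan over the sorted column keys of one wide row.
        cols = sorted(self.rows.get(row_key, {}))
        return cols[bisect_left(cols, start):bisect_right(cols, end)]

cf = ColumnFamily()
cf.put("user:42", "2015-01-03", "login")
cf.put("user:42", "2015-01-01", "signup")
cf.put("user:42", "2015-01-05", "logout")
print(cf.slice("user:42", "2015-01-01", "2015-01-03"))  # ['2015-01-01', '2015-01-03']
```

Note how a timestamp used as the column key naturally yields a wide row that can be scanned by time range, a pattern Cassandra handles well.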
Cassandra cluster topology
A Cassandra instance stores one or more tables defined by the user. The common topology for a Cassandra installation is a set of instances installed on different server nodes, forming a cluster of nodes, also referred to as the Cassandra ring. Each node in the ring is responsible for storing a copy of the column families' data, as determined by the partition key and the configured replication factor.
Figure 2 shows a Cassandra ring with three nodes storing different keys; Cassandra runs each key through a hash function to decide where the data and its replicas are located. A typical replication factor is 3, which means that each piece of data stored on node 1 will also be replicated (copied) to nodes 2 and 3.
The colors in the ring represent the set of keys stored on each node according to the range returned by the hash function. In this example, all keys from 1 to 38 are placed on node 1, so key 1 is stored on node 1 and, with a replication factor of three, copies of it are stored on nodes 2 and 3.
If an application needs to access key 1 but node 1 is down, Cassandra will try to get a copy of it from node 2 or 3.
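The placement scheme described above can be sketched as a simplified token ring. This is an illustration only: the ring size, token ranges, and node names mirror the example in the figure rather than Cassandra's real token allocation, and replicas are simply placed on the next nodes clockwise.

```python
import hashlib

NODES = ["node1", "node2", "node3"]
RING_SIZE = 114  # toy ring: node1 owns 1-38, node2 39-76, node3 77-114

def token(key):
    # Hash the partition key onto the ring (1..RING_SIZE).
    return int(hashlib.md5(str(key).encode()).hexdigest(), 16) % RING_SIZE + 1

def owner_index(tok):
    # Map a token to the node whose range contains it.
    return (tok - 1) * len(NODES) // RING_SIZE

def replicas(key, rf=3):
    # The owner stores the first copy; the remaining rf - 1 copies go
    # to the next nodes clockwise around the ring.
    i = owner_index(token(key))
    return [NODES[(i + j) % len(NODES)] for j in range(rf)]

print(replicas("some-partition-key"))
```

With three nodes and a replication factor of 3, every key ends up on all three nodes, which is why a read can be served by node 2 or 3 when node 1 is down.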
Another important aspect is that there is no master node in the cluster; any node can act as the coordinator node. When a client connects to a Cassandra node, that node acts as the coordinator and becomes responsible for reading or writing data from/to the nodes that own the requested keys.
It should also be mentioned that Cassandra supports specific configuration for datacenter deployments, so you can specify which nodes are located in the same datacenter, and even their rack position. This configuration helps increase availability and also reduces read latency, since clients can read data from the nearest node. This last feature relies on a specific replica placement strategy, called NetworkTopologyStrategy, defined in the keyspace definition.
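The "read from the nearest node" idea can be sketched as a datacenter-aware replica ordering, in the spirit of NetworkTopologyStrategy combined with a DC-aware driver policy. The topology map below is entirely made up for the example.

```python
# Invented topology: which datacenter and rack each node lives in.
TOPOLOGY = {
    "node1": {"dc": "DC1", "rack": "r1"},
    "node2": {"dc": "DC1", "rack": "r2"},
    "node3": {"dc": "DC2", "rack": "r1"},
}

def nearest_replicas(replicas, local_dc):
    # Prefer replicas in the client's local datacenter; keep remote
    # replicas as a fallback, preserving their original order.
    local = [n for n in replicas if TOPOLOGY[n]["dc"] == local_dc]
    remote = [n for n in replicas if TOPOLOGY[n]["dc"] != local_dc]
    return local + remote

print(nearest_replicas(["node1", "node2", "node3"], "DC2"))  # ['node3', 'node1', 'node2']
```

A client in DC2 would thus read from node3 first and only fall back to the DC1 replicas if needed.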
All nodes in a Cassandra cluster communicate with each other through a peer-to-peer protocol called the Gossip protocol, which broadcasts information about data and node health. One interesting thing about this protocol is that it really gossips: a node does not need to talk with every other node to learn something about them. To avoid communication chaos, when one node talks to another it provides not only information about its own status, but also the latest information it has about the nodes it communicated with before. Through this process, network load is reduced, more information is kept, and the efficiency of information gathering increases.
The Gossip protocol is also used for failure detection. It behaves much like the TCP protocol, trying to get an acknowledgement before deciding whether a node is up or down. When node 1 sends a SYN message (as in the TCP handshake) to node 2, it expects an ACK message back and then sends a final ACK to node 2, completing the three-way handshake. If node 2 does not reply to the SYN message, it is marked as down. Even nodes that are down continue to be pinged periodically, and that is how failure detection (and recovery) happens.
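A deliberately simplified failure detector can illustrate the idea: a node is considered up only while acknowledgements keep arriving within a timeout. Note that Cassandra's real detector is the phi-accrual failure detector, which is statistical rather than a fixed timeout; this sketch only captures the "ping, wait for ack, mark down" behavior described above.

```python
class FailureDetector:
    """Toy timeout-based failure detector (Cassandra really uses phi-accrual)."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout
        self.last_ack = {}  # node -> time of last ACK seen

    def on_ack(self, node, now):
        self.last_ack[node] = now

    def is_up(self, node, now):
        # A node is up only if we heard from it within the timeout window.
        return node in self.last_ack and now - self.last_ack[node] <= self.timeout

fd = FailureDetector(timeout=3.0)
fd.on_ack("node2", now=10.0)
print(fd.is_up("node2", now=12.0))  # True: ack seen 2s ago
print(fd.is_up("node2", now=20.0))  # False: silent for 10s, marked down
```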
This protocol is also important for providing the client driver with information about the cluster, allowing it to choose the best available node to connect to, load-balance connections, and find the nearest and fastest path to the required data.
Now that we have covered the overall architecture Cassandra is built on, we can delve deeper into how data is written and read. We already noted that Cassandra is focused on high availability and partition tolerance, but that does not mean there is no data consistency. In fact, consistency in Cassandra is tunable by the user: you can choose from low to high levels of consistency. The common levels, used for both reads and writes, are ONE, QUORUM and ALL. With the replication factor of three (3) defined for the tables of a keyspace and a consistency level of ALL, a write operation on Cassandra waits for the data to be written and confirmed by all 3 replica nodes before replying to the client. QUORUM consistency means a majority of the replicas (N/2 + 1).
From a cluster perspective, when a client connects to a node to write data, the coordinator (the node the client is connected to) first determines which nodes the partition key of that data belongs to, and then sends the data to the nodes that should store that key. Depending on the consistency level defined by the user (for example, ALL), the coordinator waits for all replica nodes to respond to the request before replying to the client. It is up to the user to define which consistency level is suitable for each part of the solution.
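The number of replica acknowledgements the coordinator waits for under each consistency level follows directly from the definitions above; a small function makes the arithmetic concrete, assuming the replication factor of 3 used in this article.

```python
def required_acks(consistency, replication_factor=3):
    """How many replica acks a coordinator waits for before replying."""
    if consistency == "ONE":
        return 1
    if consistency == "QUORUM":
        # Majority of replicas: N/2 + 1 with integer division.
        return replication_factor // 2 + 1
    if consistency == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency}")

print(required_acks("ONE"))     # 1
print(required_acks("QUORUM"))  # 2 -- majority of 3 replicas
print(required_acks("ALL"))     # 3
```

A useful consequence: with QUORUM on both reads and writes (2 + 2 > 3), a read always overlaps the most recent write on at least one replica, giving strong consistency while tolerating one node failure.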
Write and Read Path
From a single node's perspective, when a client requests to write data, the request is first persisted to a commit log file on disk, and then the data is written to an in-memory table called the memtable. When the memtable is full, i.e. after reaching a pre-configured threshold, it is flushed to disk as an immutable structure called an SSTable. Each table in Cassandra has its own memtable and SSTables.
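The write path just described can be reduced to its essentials in a short sketch: append to the commit log for durability, update the memtable, and flush to an immutable, sorted SSTable once a threshold is crossed. The threshold here counts entries instead of bytes purely to keep the example small.

```python
class Node:
    """Toy single-node write path: commit log -> memtable -> SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []        # durable, append-only record of writes
        self.memtable = {}          # in-memory, mutable
        self.sstables = []          # immutable, sorted snapshots on "disk"
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # persist to commit log first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # SSTables are written sorted by key and never modified again.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

node = Node(memtable_limit=2)
node.write("k1", "v1")
node.write("k2", "v2")   # reaches the threshold, triggers a flush
print(len(node.sstables), node.memtable)  # 1 {}
```

Because the commit log is appended before the memtable is updated, a crash before a flush loses nothing: the memtable can be rebuilt by replaying the log.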
Internally, each Cassandra node manages data between memory and disk using mechanisms that perform as few disk access operations as possible; to do that it keeps a set of caches and indexes in memory that make it faster to find data in the right location. One of them is the Bloom filter, an in-memory structure that checks whether the requested row might exist in an SSTable before reading it from disk. Another is the partition index, which stores a list of partition keys and the start position of their rows in the data file on disk. This process is represented by the figure below.
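A minimal Bloom filter shows why this check is so cheap: a few hash positions in a bit array answer "definitely not present" or "maybe present", so most SSTables that cannot contain the row are skipped without any disk access. The sizes and hash scheme below are illustrative, not Cassandra's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size=64, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        # Derive several independent bit positions from the key.
        for i in range(self.hashes):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False => key is certainly absent; True => key *may* be present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("row-1")
print(bf.might_contain("row-1"))  # True
```

A "True" answer still requires consulting the partition index and the data file; the filter only guarantees that a "False" answer never misses real data.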
My objective in this article is to focus on Cassandra's architectural building blocks, but I would like to comment on how data modeling works in Cassandra. This is one of the hardest parts of working with Cassandra, mainly because it is a paradigm shift from the well-known relational database world. The first thing a newcomer notices is the lack of "JOINs" between tables, which raises the question: how do you model your business problem's domain entities to reach the desired solution? The answer is that you do not model around key entities and their relationships in order to normalize the data; instead, you model around the queries your application needs to fulfill its user interface demands, creating a de-normalized model. If you need a new view of the same data, the recommended practice is to create a new table (column family) for it. This technique is called Query Based Modeling: first you think about the queries you need to execute, and then you model the tables based on them. Yes, you will end up storing duplicated data, but the reason is that you are trading disk space for read performance, and disk space is cheap.
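Query-based modeling can be shown in miniature with plain dictionaries standing in for tables: the same event is written once per query the application needs, instead of being normalized into one place. The table and column names are invented for the example.

```python
# One "table" per query the application needs, deliberately duplicating data.
events_by_user = {}   # answers: "all events for a given user"
events_by_day = {}    # answers: "all events on a given day"

def record_event(user, day, event):
    # Denormalized write: one logical event fans out to every query table.
    events_by_user.setdefault(user, []).append((day, event))
    events_by_day.setdefault(day, []).append((user, event))

record_event("alice", "2015-06-01", "login")
record_event("bob", "2015-06-01", "signup")
print(events_by_day["2015-06-01"])  # both users' events, no JOIN needed
```

Each query is then a single key lookup, at the cost of writing (and storing) the event twice, which is exactly the disk-space-for-read-performance trade-off described above.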
One of the things that impresses me about Cassandra is the level of configuration available to tune its behavior to fit your solution. As a distributed database it is prepared to bring a high level of system availability, with no single point of failure, and it also offers low latency for writes; you can find detailed benchmarks against other NoSQL products on the internet.
You may have heard of Apache Cassandra and be interested in using it in your project. I would first recommend evaluating your business requirements and verifying whether your project really needs this type of database management system; otherwise you may face many implementation difficulties that could be avoided with a traditional relational database. Bear in mind that Cassandra was created to solve specific problems of availability and of writing and accessing large volumes of data quickly. Some use cases have been tested and are well addressed by Cassandra, such as time-series data storage, immutable event persistence, and analytical databases. In the financial industry, there are companies using Cassandra as part of fraud detection systems.
My advice to those considering Cassandra is to begin with the learning resources available on community websites and from the companies supporting Cassandra's development, such as DataStax. You can then proceed with a proof of concept on a small cluster, with a specific use case in mind. In parallel, research successful implementations that use Cassandra as distributed persistence storage; this will help you make clear, assertive decisions and build a good solution.
Apache Cassandra Website: https://cassandra.apache.org/
Planet Cassandra Community: http://planetcassandra.org
Datastax Website: http://www.datastax.com/