Thursday 17 May 2012

Cassandra - The Legend

According to the myth, Cassandra, the daughter of Priam, king of Troy, and Hecuba, was asked by the god Apollo for her love and devotion. After a long period of unemployment, Cassandra had resorted to freelance divination. Her refusal of Apollo's proposal provoked the wrath of the god, who cursed her with a poor lot as a prophet: her divinations would no longer be believed. A long period of frustration followed; but Cassandra was cunning and, together with some Native Americans, the Apache, she built a key-value store named after the two of them... That's more or less how Apache Cassandra was created.

This text is intended to serve as a hands-on tutorial that takes newcomers step by step into the world of Apache Cassandra.

Features

Cassandra offers the following features:
  • Linear Incremental Scalability - the more nodes you have the more you get out of Cassandra (High R/W throughput).
  • Fault Tolerance - No single point of failure.
  • Eventual Consistency, with a tunable consistency level.
  • Distributed Architecture modelled on the Dynamo ring (all nodes are equal).
  • Abides by the BigTable Data Model (Column Families, Columns).
  • Customizable/Tunable (there's a lot to read about that).
  • Easy to install (at least on Linux, which is what I have tried). Installation instructions follow below.
  • Good documentation and support.
In the examples that follow, we assume that Cassandra 1.1.0 is installed. Tests were carried out on Ubuntu 10.04.

Installation Instructions: It is quite easy to install Cassandra on your system. Download the binary tarball and extract it, then create the directory /var/lib/cassandra, take ownership of it and allow everyone to write to it. For example:

$ wget http://apache.forthnet.gr/cassandra/1.1.0/apache-cassandra-1.1.0-bin.tar.gz
$ tar -zxvf apache-cassandra-1.1.0-bin.tar.gz
$ sudo mkdir /var/lib/cassandra
$ sudo chown $USER /var/lib/cassandra
$ sudo chmod a+w /var/lib/cassandra

You have now successfully installed Cassandra!

The BigTable Data Model

Figure: This is more or less what the BigTable DM is about...

Cassandra makes use of the BigTable Data Model, and it is necessary to understand a few things before taking our first steps in setting up our own database.
Simple! Isn't it?
In brief, there are three basic concepts in this DM: Keyspaces, Columns and Column Families (CF).
  A Keyspace is the counterpart of a database or a schema in relational DB terminology. It is the namespace over which the database structure is defined and a way to isolate data stores. As a rule, different applications use different keyspaces.
  A Column is the basic unit of data used for storage. A column consists of a name, a value and a timestamp. A column may have a predefined name (static) or a name assigned at runtime (dynamic). Cassandra is therefore not strictly schema-less: one can define a structure beforehand, and at the same time the application can add new columns whenever this is necessary.

Figure: Schematic representation of a Column in the BigTable data model for Cassandra
This introduces a completely different rationale compared to the crisp and strict one that comes with a relational (SQL) database. (Disclaimer: you should take some time to understand the pros and cons of a relational database and of a NoSQL one like Cassandra. In some cases, a combination of both is advisable. Everything has its purpose; NoSQL databases have not replaced relational ones.)
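To make the contrast with a fixed relational schema more concrete, here is a minimal Python sketch of the data model described above. It is only a toy model built from plain dictionaries, not how Cassandra stores data internally. The names `make_column`, `insert`, `nickname` and `chungy` are made up for illustration; the outer `users` dictionary plays the role of a column family, the table-like container described next.

```python
import time
from collections import namedtuple

# A column is a (name, value, timestamp) triple, as described above.
Column = namedtuple("Column", ["name", "value", "timestamp"])

def make_column(name, value):
    """Build a column stamped with the current time in microseconds."""
    return Column(name, value, int(time.time() * 1_000_000))

# Toy layout: keyspace -> column family -> row key -> {column name: Column}
keyspace = {"users": {}}

def insert(column_family, row_key, name, value):
    """Add or overwrite one column in the row identified by row_key."""
    row = keyspace[column_family].setdefault(row_key, {})
    row[name] = make_column(name, value)

# Two "static" columns that a schema might declare up front.
insert("users", "ID::asdf", "full_name", "Chung")
insert("users", "ID::asdf", "email", "my@mail.org")
# A "dynamic" column (hypothetical): declared nowhere, added at runtime.
insert("users", "ID::asdf", "nickname", "chungy")

row = keyspace["users"]["ID::asdf"]
print(sorted(row))  # names of the columns present in this particular row
```

Note that nothing forces two rows of the same column family to carry the same set of columns, which is the main difference from a relational table.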

A Column Family is the counterpart of a table in relational terminology.

A CF is a set of columns
As the Cassandra Wiki puts it:

"A column family is a container for rows, analogous to the table in a relational system. Each row in a column family can be referenced by its key."

Setting up Cassandra 

Setting up the Server and the Client

We first need to start the Cassandra database server. For that, from inside the installation directory, run the command:

$ bin/cassandra -f

This starts Cassandra and, because of the -f flag, keeps it in the foreground (without -f it runs as a daemon). Next, we need to connect to Cassandra. For that, open a second terminal and, again from inside the installation directory, run:

$ bin/cassandra-cli

If you see the following message, it means you are successfully connected to the database:

Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.1.0

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown]


You are now in a console from which you have access to your brand new database.
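If the CLI does not show the banner above, it can help to check whether anything is listening on the address it tries to reach. The following is a small sketch using only Python's standard library; the host and port are the ones printed in the banner (127.0.0.1 and 9160), and `port_is_open` is a helper name invented for this example. A successful connection only shows that the port accepts TCP connections, not that Cassandra is healthy.

```python
import socket

def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False

# Address taken from the CLI banner shown above.
reachable = port_is_open("127.0.0.1", 9160)
print("Cassandra port reachable:", reachable)
```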

Create and Use a Keyspace

The first thing we need to do is create a keyspace:

create keyspace CodeOfHonour;

This is the output we see in the client console:

[default@unknown] create keyspace CodeOfHonour;
bb9f00fa-9b6d-32c3-a85d-1780921f65da
Waiting for schema agreement...
... schemas agree across the cluster

Next, start using this keyspace:

use CodeOfHonour;

The following message should be returned:

[default@unknown] use CodeOfHonour;
Authenticated to keyspace: CodeOfHonour

Definition of our "Schema"

We then need to create a Column Family:

CREATE COLUMN FAMILY users 
WITH comparator=UTF8Type 
AND column_metadata=[
{column_name: full_name, validation_class: UTF8Type},
{column_name: email, validation_class: UTF8Type}
];

This describes the following entity, called User:

Figure: ER diagram for our entity

Notice that the Key was not explicitly declared in the command that creates the column family. A Key is always present in a column family! Additionally, data types are different from the ones that appear in SQL. In Cassandra, data types are defined in terms of Validators and Comparators. A very good article on data types is available from DataStax.
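As a rough illustration of the two roles, the sketch below mimics them in plain Python: a validator decides whether a value is acceptable for a column, and a comparator decides the order of column names within a row. This is a simplified model under stated assumptions, not Cassandra's implementation; the function names are invented here, and ordering names by their encoded bytes is my simplification of what UTF8Type does.

```python
def utf8_validator(raw):
    """Stand-in for UTF8Type: accept only bytes that decode as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def bytes_validator(raw):
    """Stand-in for BytesType: any byte sequence is acceptable."""
    return isinstance(raw, bytes)

# Per-column validators, mirroring the column_metadata declared above.
column_validators = {"full_name": utf8_validator, "email": utf8_validator}

def validate(name, raw_value):
    """Check a value; undeclared columns fall back to the permissive validator."""
    return column_validators.get(name, bytes_validator)(raw_value)

def comparator_sorted(names):
    """Stand-in for the UTF8Type comparator: order names by encoded bytes."""
    return sorted(names, key=lambda n: n.encode("utf-8"))

print(validate("email", b"my@mail.org"))          # valid UTF-8 text
print(validate("email", b"\xff\xfe"))             # not valid UTF-8
print(comparator_sorted(["full_name", "email"]))  # order of columns in a row
```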

We can inspect the column family we created by typing:

[default@CodeOfHonour] describe users;

This returns all the information one needs to know about the given column family (I deleted a few lines and details for brevity):

ColumnFamily: users 
  Key Validation Class: BytesType 
  Default column value validator: BytesType 
  Columns sorted by: UTF8Type
  Column Metadata: 
    Column Name: full_name 
      Validation Class: UTF8Type 
    Column Name: email 
      Validation Class: UTF8Type 
  Compaction Strategy: SizeTieredCompactionStrategy 
  Compression Options: 
    sstable_compression: SnappyCompressor

Registering Key-Value Pairs

Let us now add our first entry in the database - a new user with assigned key 'ID::asdf':

set users[utf8('ID::asdf')][utf8('full_name')] = 
   utf8('Chung');
set users[utf8('ID::asdf')][utf8('email')] = 
   utf8('my@mail.org');

Let us now inspect the row that we have just added:

[default@CodeOfHonour] get users[utf8('ID::asdf')]; 

This returns the following:

=> (column=email, value=my@mail.org, timestamp=1337289942300000)
=> (column=full_name, value=Chung, timestamp=1337289941659000)
Returned 2 results.
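The timestamp attached to each column can be turned into a readable date. The sketch below assumes that the values are microseconds since the Unix epoch, which is the convention the CLI uses when it stamps writes; `column_time` is a helper name invented for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def column_time(timestamp_us):
    """Convert a column timestamp (assumed microseconds since epoch) to UTC."""
    # Integer microseconds avoid the rounding issues of float seconds.
    return EPOCH + timedelta(microseconds=timestamp_us)

email_written = column_time(1337289942300000)
name_written = column_time(1337289941659000)

print(email_written.isoformat())
print(email_written - name_written)  # time elapsed between the two writes
```

For the email column this gives 2012-05-17T21:25:42.300000+00:00, which agrees with the date of this post.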

If you have added a few users and you want to list them, use the command:

list users;

Let's now go back a few steps and take another look at the output of the command describe users. The validator of the keys of this Column Family is reported as Key Validation Class: BytesType (short for org.apache.cassandra.db.marshal.BytesType). In other words, the key of the Column Family is binary, not UTF8-encoded text. This is why we had to wrap everything in utf8() when writing to and reading from this CF. Without changing the structure of the CF, we can tell the CLI to assume that the keys the client provides are UTF8-encoded, so that utf8() can be omitted. To do so, just run the following command:

[default@CodeOfHonour] assume users keys as utf8;
Assumption for column family 'users' added successfully.

Now, one can simply run:

[default@CodeOfHonour] get users['ID::asdf'];   
=> (column=email, value=my@mail.org, timestamp=1337289942300000)
=> (column=full_name, value=Chung, timestamp=1337289941659000)
Returned 2 results.
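The effect of the assumption can be pictured with a small toy client in Python. This is only a sketch of the idea (keys are stored as bytes, and the client may or may not encode text keys for you), not a description of how cassandra-cli is implemented. `ToyClient` and its attributes are invented for this example.

```python
def utf8(text):
    """Stand-in for the CLI's utf8() function: text -> UTF-8 bytes."""
    return text.encode("utf-8")

class ToyClient:
    """Toy store whose row keys are raw bytes, like a BytesType key validator."""

    def __init__(self):
        self.rows = {}
        self.assume_utf8_keys = False  # mirrors 'assume users keys as utf8;'

    def _to_key(self, key):
        if isinstance(key, bytes):
            return key
        if self.assume_utf8_keys:
            return utf8(key)  # the client encodes text keys on our behalf
        raise TypeError("text key given, but keys are raw bytes; use utf8()")

    def set(self, key, column, value):
        self.rows.setdefault(self._to_key(key), {})[column] = value

    def get(self, key):
        return self.rows[self._to_key(key)]

client = ToyClient()
client.set(utf8("ID::asdf"), "full_name", "Chung")
client.set(utf8("ID::asdf"), "email", "my@mail.org")

client.assume_utf8_keys = True
print(client.get("ID::asdf"))  # same row as client.get(utf8("ID::asdf"))
```

The stored data never changes; only the way the client interprets what we type does.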

1 comment:

  1. This a great tutorial on how to install and run Cassandra, I've used it on a Centos 6.5 without any difficulties. Thanks Pantelis!