Monday 24 June 2013

Cassandra configuration

Chuck Norris can optimize Cassandra without reading any instructions!
In this post we'll go through Cassandra's architecture and explain how exactly data is stored (in memory and on the filesystem). It is important to have an overview of the overall workflow in order to identify critical points, optimize performance and avoid OutOfMemoryErrors. According to the Cassandra Wiki, 4GB is the bare minimum for running Cassandra, and I can verify that 4GB is necessary even for testing. For production, at least 8GB of RAM is recommended, while some people report that 12GB is OK. In all cases, one should take care to fine-tune Cassandra to avoid nasty surprises...

How it works

You need to strike a balance between performance and consumption of RAM resources.
Here is what happens when you write data to your Cassandra node. Insertion of data is an O(1) operation, as it is performed sequentially and no search is needed. Once new data are committed to Cassandra, they are first appended to the commitlog (a file on the filesystem). Afterwards, the data are written into a table in memory called a memtable, where they are available for reads. The data stay in the memtable for a certain period of time or until it becomes full. They are then flushed into an SSTable, which lives on the filesystem. If some unexpected failure happens on your machine before the data have been flushed to the hard disk, the commitlog is replayed into the memtable on restart and no data are lost. Various questions arise naturally: How often should the memtables be flushed? How large should they be? These are things we need to calibrate by changing certain values inside cassandra.yaml (which reminds a lot of people of MySQL's my.cnf file).
How data are written to Cassandra
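To make the write path concrete, here is a minimal Python sketch of the commitlog, memtable and SSTable steps described above. It is a toy model, not Cassandra's implementation: the class `ToyNode`, its methods and the `memtable_limit` parameter are all hypothetical names made up for illustration, and the flush trigger is a simple entry count rather than a memory or time threshold.

```python
class ToyNode:
    """Toy model of the write path: commitlog -> memtable -> SSTable."""

    def __init__(self, memtable_limit=3):
        # Hypothetical flush threshold, counted in entries for simplicity.
        self.memtable_limit = memtable_limit
        self.commitlog = []  # stands in for the append-only file on disk
        self.memtable = {}   # in-memory table
        self.sstables = []   # sorted, immutable tables "on disk"

    def write(self, key, value):
        # 1. Sequential append to the commitlog: no search needed.
        self.commitlog.append((key, value))
        # 2. Update the in-memory table.
        self.memtable[key] = value
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as a sorted SSTable and reset it."""
        if not self.memtable:
            return
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # Flushed entries no longer need to be replayed after a crash.
        self.commitlog = []

    def crash_and_recover(self):
        """Simulate losing memory, then rebuild the memtable from the commitlog."""
        self.memtable = {}
        for key, value in self.commitlog:
            self.memtable[key] = value
```

In this sketch, calling `flush()` before a planned shutdown leaves the commitlog empty, so there is nothing to replay on the next start.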
The larger the commitlog files are, the longer it will take Cassandra to start up and the more memory it will consume on startup. What you can do, as long as Cassandra is being shut down deliberately rather than because of a failure, is flush before shutting it down.

For n rows there are O(log n) SSTables on the file system containing the flushed data. The more of these tables there are, the harder it is to go through them and find a key, if it is there at all. Periodically, when the number of SSTables exceeds a certain threshold, they are merged into one single SSTable. This operation is known as compaction.

Periodic Compaction of SSTables
Once SSTables are merged into a single SSTable, old tables are not immediately deleted. Instead they are marked as obsolete and are cleaned during garbage collection.
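
Compaction as described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each SSTable is modelled as a sorted list of (key, value) pairs, tables are ordered oldest first, and the function names and the `threshold` default are hypothetical rather than taken from Cassandra.

```python
def compact(sstables):
    """Merge SSTables (ordered oldest first) into one sorted table.

    When the same key appears in several tables, the newest value wins.
    """
    merged = {}
    for table in sstables:  # oldest to newest
        for key, value in table:
            merged[key] = value
    return sorted(merged.items())


def maybe_compact(sstables, threshold=4):
    """Compact only when the number of SSTables exceeds the threshold.

    Returns (live_tables, obsolete_tables). The old tables are not deleted
    here; they are only reported as obsolete, to be cleaned up later.
    """
    if len(sstables) <= threshold:
        return sstables, []
    return [compact(sstables)], list(sstables)
```

After a compaction, a lookup has a single table to search instead of several, which is the point of the exercise.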
