We did a small two person hackday to get to know Cassandra a bit. Lukas already learned a bit about Cassandra at Berlin Buzzwords but it was time to get some hands on experience to better understand how Cassandra would work out in a real client project. Our agenda for the day was however essentially: get it to run, see how to get it connected with PHP and then import some data and do some queries.
But first a very brief summary of what Cassandra is about:
To get up and running we tried different approaches. One was just to download one of the VMs provided by DataStax (the main company driving Cassandra). That sort of works but the VM operating system is not really nicely configured. But it was at least enough to run a first few queries we found in the Cassandra wiki.
Now we wanted to install the PDO driver for Cassandra. So we setup a new VM and installed Cassandra. Compiling the PDO driver proofed to be a bit more complicated. We messed around for a while until Henrik saved us by uploading debian packages for us compatible for Ubuntu 14.04 64bit:
sudo dpkg -i thrift-0.9.1_amd64.deb
sudo dpkg -i pdo_cassandra-0.5.1_amd64.deb
sudo php5enmod pdo_cassandra
We then started playing a bit. We looked for some interesting data sources and found this blog post listing a bunch of options for us, we ended up going with a dump of reddit posts. Not a particularly challenging data set for Cassandra, but enough to get a feel for things. Cassandra provides a PostgreSQL insipired variant of COPY to import CSVs. Using the following table definition, we started writing some queries against the dataset.
CREATE TABLE reddit_posts (
PRIMARY KEY (id, num_comments)
We quickly realized that for index supported queries, one needs to define composite key indexes. With a bit of time we got it to work.
$db = new PDO("cassandra:host=localhost;port=9160", null, null, array(PDO::CASSANDRA_ATTR_THRIFT_DEBUG => true));
$stmt = $db->prepare("SELECT * FROM mykeyspace.reddit_posts WHERE num_comments > 11 limit 1 allow filtering");
We then played with a counter, but realized that counters cannot be in a composite index. Finally we briefly looked at their “transactions”, which however require every query essentially specify a condition for concurrency.
Pros and cons
- Write speed
- High availability due to decentralisation, no master node
- Linearly scalable without downtime
- Consistency highly controllable, use when needed
- Ad-hoc queries possible via SQL-ish language (*)
- Index supported queries need to be planned in advance, may require denormalization to support different index supported queries in the same data-set
- Not all queries possible (very limited aggregation capabilities), so in practice other systems might need to be attached for searching
Tags: cassandra, database