Clojure and Cassandra with clj-cassandra


I’ve just finished playing with Robert Luo’s clj-cassandra for today and it’s time to recap what’s happened so far.

First of all, simply downloading clj-cassandra and running lein jar didn’t do it for me: too many dependencies were missing, so I ended up downloading the jar from Clojars.

Somehow I managed to track down all the dependencies needed to actually run my testing environment. I took apache-cassandra-0.6.1.jar from the Cassandra bin download. The UUID jar can be gotten from here.

Anyway, my repl.sh looks like this:

java -cp .:lib/commons-collections-3.2.1.jar:lib/commons-pool-1.5.4.jar:lib/jline-0.9.94.jar:lib/slf4j-log4j12-1.5.8.jar:lib/slf4j-api-1.5.8.jar:lib/log4j-1.2.14.jar:lib/libthrift-r917130.jar:lib/clojure-1.1.0.jar:lib/clojure-contrib-1.1.0.jar:lib/apache-cassandra-0.6.1.jar:lib/cassandra-javautils.jar:lib/clhm-production.jar:lib/high-scale-lib.jar:lib/clj-cassandra-0.1.1.jar:lib/uuid-3.2.jar:classes jline.ConsoleRunner clojure.lang.Repl

The how-to on clj-cassandra’s GitHub page is pretty spartan. Let’s flesh it out, but first you might want to read WTF is a SuperColumn and Up and running with Cassandra.

In this short tutorial we’re going to go for a simplified version of the blog post tags in Arin’s tutorial, but using articles and feeds instead since I’m heavily into feed reading at the moment 🙂

So we will have three tables (CFs): feeds, articles, and links between articles and feeds. If an article could be uniquely linked to a single feed, we could have used super columns just like in Arin’s tutorial, mapping articles to a feed instead of comments to a blog post. However, that’s not the case: many feeds are simply aggregations of articles that really belong to other feeds, and that’s before we account for Twitter, where many tweets show up in many different feeds. Therefore some kind of normalization (a link table) is required if we are to avoid drowning in duplicate data.

For this example I’ve added the following to the default Keyspace1 in storage-conf.xml:

<ColumnFamily CompareWith="BytesType" Name="Feeds"/>
<ColumnFamily CompareWith="BytesType" Name="Articles"/>
<ColumnFamily CompareWith="TimeUUIDType" Name="ArFeLinks" />

As you can see, the columns in ArFeLinks (the links between articles and feeds) are sorted by TimeUUID. More on that later.

Let’s start walking the code, from top to bottom:

(ns cassandra-test
   (:use cassandra.client)
   (:import
      [java.util UUID]
      [com.eaio.uuid.UUID]
      [cassandra TimeUUID]))

(defn mk-table
   [tbl]
   (-> (mk-client "localhost" 9160)
      (key-space "Keyspace1")
      (column-family tbl)))

Note the time stuff: we will be using some of that functionality directly, which is why we need to import it. As a guy who works with PHP for a living I find the -> macro to be ironic, no further comments on the above listing 🙂
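For anyone who hasn’t met -> before, here is a minimal sketch of what it does, using plain arithmetic instead of the Cassandra client. The names threaded and nested are only there for illustration.

```clojure
;; (-> x (f a) (g b)) is rewritten by the macro to (g (f x a) b):
;; each result is inserted as the first argument of the next form.
(def threaded (-> 5 (- 2) (* 10)))

;; The same computation written inside-out, without the macro.
(def nested (* (- 5 2) 10))

;; By the same rule, the body of mk-table is equivalent to:
;; (column-family (key-space (mk-client "localhost" 9160) "Keyspace1") tbl)
```

Both threaded and nested evaluate to 30, the threaded version just reads top to bottom.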

(def articles-tbl (mk-table "Articles"))
(def arfelinks-tbl (mk-table "ArFeLinks"))
(def feeds-tbl (mk-table "Feeds"))
    
(dotimes [i 100] 
   (set-attrs! 
      feeds-tbl 
      (str "www.feed" i ".com/feed") 
      (hash-map :title (str "Feed " i) :xmlurl (str "http://www.feed" i ".com/feed"))))

(dotimes [i 100]
   (set-attrs! 
      articles-tbl 
      (str "www.somedomain.com/article" i) 
      (hash-map :title (str "Article " i) :htmlurl (str "www.somedomain.com/article" i))))

So we add 100 articles and 100 feeds. For simplicity’s sake I’ve limited the number of attributes each article has to just :title and :htmlurl, where the htmlurl is also the key.

Note the use of dotimes. As a Clojure noob I was tempted to use for to begin with, before I realized it’s a (lazy) list comprehension rather than a loop meant for side effects.
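The difference matters because for builds a lazy sequence, so writes placed inside it only happen once something consumes the result. A small sketch of that, where fake-write! is a hypothetical stand-in for set-attrs! that just records the key in an atom:

```clojure
;; Records every "write" so we can count how many actually happened.
(def written (atom []))

;; Hypothetical stand-in for set-attrs!: only remembers the key it was given.
(defn fake-write! [k]
  (swap! written conj k))

;; The lazy seq is created but never consumed, so nothing is written yet.
(def pending (for [i (range 3)] (fake-write! (str "feed" i))))
(def count-before (count @written))

;; dotimes runs its body immediately, three times.
(dotimes [i 3] (fake-write! (str "article" i)))
(def count-after-dotimes (count @written))

;; Forcing the lazy seq with doall finally runs the writes from `for`.
(doall pending)
(def count-after-doall (count @written))
```

The three counts come out as 0, 3 and 6. Typing a bare for at the REPL hides this, since printing the result consumes the sequence, which is why the sketch wraps it in a def.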

Let’s start linking (I just realized I could’ve used add-collection! instead of the below, oh well…):

(dotimes [i 4] 
   (set-attr! 
      arfelinks-tbl 
      "www.feed1.com/feed" 
      (TimeUUID/getTimeUUID) 
      (str "www.somedomain.com/article" i)))

So we’re creating a few links between feed 1 and some of the articles. Note the UUID stuff for generating column names: these ids can later be used in a GUI, for instance, so that we can go in and fetch exactly the information we need without doing any scans. But for now they simply serve to order our stuff by date, so we don’t have to work so hard when we want a list of the 20 newest articles belonging to feed X.
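To illustrate why time-ordered column names make the “newest N” question cheap, here is a sketch that models one ArFeLinks row as a sorted map. Plain longs stand in for the TimeUUIDs, and newest-articles is a hypothetical helper, so this shows the idea only and not how Cassandra or clj-cassandra does it.

```clojure
;; Stand-in for one ArFeLinks row: column name -> column value.
;; Plain longs play the role of TimeUUIDs (an assumption for illustration).
(def feed1-links
  (sorted-map
    1003 "www.somedomain.com/article2"
    1001 "www.somedomain.com/article0"
    1004 "www.somedomain.com/article3"
    1002 "www.somedomain.com/article1"))

(defn newest-articles
  "Returns the values of the n columns with the highest (most recent) names."
  [row n]
  ;; rseq walks the sorted map from the largest key down, no re-sorting needed.
  (map val (take n (rseq row))))

(def newest-two (newest-articles feed1-links 2))
```

Because the row is already kept in order, taking the newest entries is just reading from one end of it.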

Now we are ready to do a query:

(println 
   (get-keys-attrs 
      articles-tbl 
      (map last (get-collection arfelinks-tbl "www.feed1.com/feed"))
      [:title]))

So get-collection will fetch the {“uuid”: “article html url”} columns which we then use as input to get-keys-attrs.

This is where you would do a join in an RDBMS, or submit some kind of map reduce job to a system that supports it. However, AFAIK Cassandra isn’t doing map reduce yet, so we have to submit two requests to get the information that we need (the titles of the articles): first to get the article keys/URLs, and then to get the articles themselves with the help of those keys.
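The two-step lookup can be sketched with ordinary maps standing in for the two column families. The data, the uuid-a/uuid-b keys and the titles-for-feed helper are all made up for illustration; no Cassandra calls are involved.

```clojure
;; Stand-in for ArFeLinks: feed key -> {column name -> article key}.
(def links
  {"www.feed1.com/feed" {"uuid-a" "www.somedomain.com/article0"
                         "uuid-b" "www.somedomain.com/article1"}})

;; Stand-in for Articles: article key -> attributes.
(def articles
  {"www.somedomain.com/article0" {:title "Article 0"}
   "www.somedomain.com/article1" {:title "Article 1"}
   "www.somedomain.com/article2" {:title "Article 2"}})

(defn titles-for-feed
  "Hypothetical client-side join: resolve a feed to its article titles."
  [feed-key]
  ;; Step 1: read the link row and keep only the values (the article keys).
  ;; (map last ...) works because each map entry is a [name value] pair.
  (let [article-keys (map last (get links feed-key))]
    ;; Step 2: look up each article by key and pull out its title.
    (map (fn [k] (:title (get articles k))) article-keys)))

(def feed1-titles (titles-for-feed "www.feed1.com/feed"))
```

The join happens in the client, so the cost is one read for the link row plus one read (or one batched read) for the articles.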

As you can see, going in the opposite direction, i.e. fetching all feeds to which an article belongs, would be much more of a chore. That’s because we’ve planned it that way: the ArFeLinks rows are keyed by feed, so nothing is indexed by article. This need for careful planning is one of many drawbacks compared to an RDBMS. That key-value stores are not a silver bullet has become clear as daylight during the past couple of weeks. It’s still good to know them and how they work, in order to be able to apply them where they are a good fit.
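To make the “chore” concrete, here is an in-memory sketch of the reverse lookup. With only a feed-keyed link table, every row has to be scanned. The alternative is to maintain a second, article-keyed table at write time. Both helpers and the sample data are hypothetical, and the second table is my assumption about how one might plan for this query, not something the example above sets up.

```clojure
;; Stand-in for ArFeLinks: feed key -> {column name -> article key}.
(def links
  {"www.feed1.com/feed" {"uuid-a" "www.somedomain.com/article0"
                         "uuid-b" "www.somedomain.com/article1"}
   "www.feed2.com/feed" {"uuid-c" "www.somedomain.com/article1"}})

(defn feeds-for-article-by-scan
  "Walks every feed row looking for the article: a full scan."
  [article-key]
  (for [[feed-key columns] links
        :when (some #(= article-key %) (vals columns))]
    feed-key))

(defn invert-links
  "Builds article key -> set of feed keys, i.e. the second link table
   we would have had to plan for and keep up to date on every write."
  [links]
  (reduce (fn [acc [feed-key columns]]
            (reduce (fn [a article-key]
                      (assoc a article-key
                             (conj (get a article-key #{}) feed-key)))
                    acc
                    (vals columns)))
          {}
          links))

(def feeds-by-article (invert-links links))
```

With the inverted table the reverse query is a single key lookup again, at the price of writing every link twice.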

