Riak at inagist.com
At inagist.com we have been using Riak and yes we are loving it. We moved away from Cassandra after it started taxing our limited resources. The nice thing about Cassandra was the data model. Super columns allowed us to store metadata for a resource as needed. For example the retweets and replies of a tweet were stored in their own super columns associated with a tweet and we could pull it out as needed. Concurrency issues were also not a bother. We could do simultaneous updates to columns and super columns and not worry about data consistency issues. This is seriously tricky when maintaining tweet statistics. Popular tweets keep getting retweeted and replied to concurrently by many people.
When looking for alternatives Riak was our first choice primarily because of it being in Erlang and since it had a map-reduce option which looked seriously promising. The ability to have a choice of backends was another compelling factor. Here are some of the interesting stuff that we have worked out in using Riak.
Using the Data model to our advantage
At the heart of Riak everything is a key-value. All metadata is associated with the value and has to be read and updated as a single unit. The most interesting metadata is of-course the Links that you store along with a keys value. Interesting because Riak's map reduce has an extra option called link walking. This allows you to filter the links on a document by tag or bucket and feed the linked documents to the next phase. Infact Riak's map-reduce allows you to have any combination of link, map, reduce options to process your data. And yes these are optional too. So infact you can have a link-reduce, link-link-reduce or link-link type queries too.
Why is this interesting? It allows us to store metadata on a seperate resource and link it to the main resource. Meaning we could have say 10 buckets storing the ids of the retweets as Links and the main resource has a link to these 10 buckets. You could parse through this list of with a link-link query. This reduces the contention on one resource for updates, parallelizes the read and spreads it across the cluster. We store replies and retweet details of a tweet in this model.
A link has three attributes Bucket, Key and Tag. Its supposed to refer to a Bucket and Key if you intend to further get data out of the linked document. But if you know what you are upto this allows to do for some serious extra data management. We currently store some tweet meta-data in Bucket and Key with a well known tag. When we later want to query for all replies to a
tweet we do a link-link-reduce on well known tags and get the replies or retweets out. I'm not getting into specifics but it should give you the idea.
Interfacing with Riak
Most references point to using Riak via the HTTP interface or via the protocol buffers client. Great if you are working from a non Erlang environment.
We currently use the built in client for Riak over the protocol buffers client. With the main processing being in Erlang, and being a distributed app at that, there was no point in going thru extra layers to get into Riak. This also gives us some interesting options, like for example the "Your Friends" tab on the conversations page that you see once you log into inagist.com. This does a link-link-reduce-reduce where an extra reduce talks to remote erlang process for the logged in user to filter out only replies from his followers. See it in action on a popular tweet like this one from @BarackObama. Mind you the "Your Friends" feature will work only after your account is enabled on inagist.com, but the Popular tab works for anyone and pulls out only the replies which are of interest.
Storage options
And yes the back-end, currently we run the innostore back-end based on Embedded Innodb. This kind of makes Riak a distribution layer over a trusted storage layer. Of the back-ends available this has worked best for us giving a consistent performance along with being reasonable on the resource usage. But the biggest factor here is that it gives an option to plugin what you want like the trial we did with Tokyo Cabinet.
Our biggest bottle neck now is the disk space, we keep pruning the data set at a failry fast pace, roughly one week of data is all that we hold. We get a little above 5 million tweets a day from the twitter pipe and we keep cleaning out as the disks fill up.
A big thank you to the guys at Basho for Riak, its seriously awesome.