Searching with RiakSearch

I had in my previous post mentioned what we do with the search functionality at inagist.com. This post will look into the technical details behind the implementation. We use Riak as our storage layer, RiakSearch was a natural add on to it. I will try to detail my understanding of how riak search works and how we use it.

At its heart RiakSearch is an inverted index of terms to document id's. The inverted index maintains an ordered set of document id's, the merge_index backend which stores this index, splits this across various files. Specifically the backend has buffers which maintain the index in ETS as well as files, and segments which are files using a custom format to store ordered keys associated with an index, field name, field value tuple. Segments store metadata information regarding file offsets for key lookup and are loaded into ETS at startup for faster access. Additionally bloom filters speed up lookup in each offset. Buffers are periodicaly merged into segments, and segments once created are not updated except for the merge of segments into a single segment. Very much like the BitCask store for Riak. All this happens at the vnode level and riak core sits on top of all this and distributes the operations across vnodes. Index name, field name and field value are used to determine the hash for mapping to a vnode. At indexing time a document is split into postings which have index name, field name, field value mapping to document id and a bunch of properties. These are batched and send to vnodes responsible for each hash in parallel. 

Queries are broken into a set of logical operations which combine each individual matching term and brings up a final list of matching documents which are sorted and ranked. A query like "tweet:facebook email" is broken into something like "tweet has facebook and email". This translates to a logical and of docs having tweet:facebook and tweet:email, these operations are then send to the vnodes to stream the doc-id's matching this operation. The doc id's are then merged in order, via a merge sort since the keys are already sorted. This results in a final list of doc id's matching the query and the properties for each doc. The results are sorted based on these properties and finally returned. The properties have a term frequency for each doc and a pre-plan operation give the document frequencies for each term allowing to sort the docs based on term frequency and inverse document frequency.

Now that was a whirl-wind simplified wrap up of my reading of the search code. To note here is that its a very performance aware implementation for indexing and simple queries. Queries with low cardinality terms could throw away your search times, there is something in the works for this specific issue with inline fields. Also queries which could return millions of rows are also possible memory busters, even if you give options to limit the number of results these are operations which happen after the full results of the operation are in memory. Both of these re-iterates the fact that this is as the name implies "RiakSearch", an addon to make your life easier when working with Riak. The implementation is tailored for mating with the map-reduce operations for riak and that is readily exposed via all the interfaces to riak search.

How do we use it?

The search box on inagist.com is directly wired into riaksearch. To prevent our queries from leading to memory exhaustion we do a couple of tricks. As previously mentioned our backend is fully in Erlang and we directly talk to Riak in Erlang. We directly call the search_fold on the riak_search_client from our code and break the search operation when we have enough results. Our keys, the tweet id's are stored as negative numbers so that the sort ordering of the keys means we get the first n docs ordered in a latest first manner. We then rank them in this limited set.

The next place we use search is for threading conversations on the tweet detail page. I had in an earlier blog post mentioned how we did that with links and and link-map-reduce operations. With search we just index the replies against the tweet it is in reply to and bring it back in from search on the reply to field. This is better since link updates modifies the whole document un-necessarily, where we only needed the meta-data on the tweet to be updated.

Another place we plugin search is to run our clean-up operation. We index the tweet timestamps to the minute granularity and cleanup tweets older than a certain time period. Getting to older tweets without search would have meant we maintain the ids separately to get a handle on which id's to flush out.

While you can choose to have riaksearch index all documents that are stored in your riak cluster via a pre-commit hook, we decided to trigger the indexing via our own calls into riak search. Two reasons to this, we found the pre-commit hook fail a couple of time with a timeout under heavy load, also our indexing needs meant we index the text in a tweet at a point when the tweet was determined to be indexable by the app and not at the point of insert into the back end store. Params like the time stamp however are indexable at insert time.

Final Thoughts

RiakSearch perfectly complements Riak key value store. It frees you from having to access documents by id alone and managing your data is simpler. The fact that it works well with existing java code for text analysis is also worth mentioning. Its still in beta so I guess things are only going to get better from here.

 

Filed under  //  erlang   filtering   inagist   real time search   riak   riak search    search   twitter  
Posted by Jebu Ittiachen 

How we rate content at inagist.com

Its been often asked of us, "How do you decide scores for the tweets that you show on inagist.com?" Well here is how we determine what to show and what not to show. 

We divide tweets coming in from the Twitter stream into 2 categories, ones which have a URL and ones which do not. The assumption being that tweets with URL tend to be teasers to the content at the linked URL. Tweets from different users mentioning the same URL (possibly through a URL shortner) are scored against the URL.

Another basic assumption is how we classify your follow community. The follow community is composed of the people you follow directly and people whom you follow indirectly through lists. If you follow a person directly we give a higher score to activity from that user opposed to activity from a user in one of the lists that you follow. The reasoning here being that people you directly follow is what you see on most Twitter clients by default and possibly people whom you care about more.

Once your twitter id is enabled on inagist.com, we fetch your follow community and start watching the twitter stream on your behalf for activity from, or about, the people in this community. A retweet or a reply to a tweet makes an activity on a tweet. A retweet of a person you directly follow by a person in one of your lists, gains a higher score than a retweet of that person by someone not in your community. And a retweet of that person by a person you directly follow gains an even higher score. Similar scoring applies to replies except that reply from a person not in your community to a person in your community does not count. Conversations between people you directly follow gain higher scores than conversations across people in your lists. 

We present tweets which have a high score sorted by time on your user page. And yes options are coming soon to sort by score and filter by the score.

We do all of this in real-time for each enabled user on inagist. To be reasonable on the costing behind all of this we use bounded caches for scoring tweets. So if you have more people in your follow community you will see tweets expiring faster from your main page. We do archive these tweets which we discover, so you can go back and see those tweets which we had picked out in the past. Clicking on a list name pulls in tweets from the archive for that particular list.

Conversations page gives you more context on the activity surrounding these filtered tweets. Clicking on the activity icon   takes you to the conversation page where all the replies to a tweet are shown in order. We also present two different views on these replies. Popular replies are ones which have been further replied to or retweeted. Replies from your friends brings out only replies from your community.

We encourage you to sign in and see this in action for yourself. Also get into Twitter and start following lists of your choice on topics that interest you, or create lists of your own and curate your sources. May we recommend verifiedtlists and scobleizer for a variety of lists to get you started. Continue using twitter with your favorite client and inagist.com to surface interesting tweets from all your sources, we assure you, you will not miss a single one of those interesting 140 characters.

 

Filed under  //  filtering   inagist   ranking   real time   relevance   scoring   tweet classification  
Posted by Jebu Ittiachen