Archive for

August 2010

Link-Map-Reduce in Riak an example from inagist.com

My last post felt a little incomplete without some code backing it up. I'm following it up with a code sample of how exactly this map reduce is wired up. 

I will walk through how we do the "Popular Replies" section on the conversation page. Again here is a @BarackObama tweet, with more than a 500 replies. Popular replies extracts only those replies which have been further replied to, re-tweeted or a reply from the author of the tweet itself. Right now its picked out 1 of these 500+ replies.

Data Model

Resonses to a tweet are captured in a bucket of its own <<"tweet_responses_bucket">>. Each tweet is keyed by its tweet id as a 128 bit binary <<TweetId:128>>. Response details are not stored directly on this resource but a linked value in a bucket called "tweet_responses_subkeys_bucket". Responses are stored as links on a resource keyed as <<TweetId:128, (ResponseId rem 10):8>> in this bucket. This resource is added as a link on the {<<"tweet_responses_bucket">>, <<TweetId:128>>} resource and tagged as <<"tweet_response">>. A reply is recorded as a link of the form {{<<ResponseId:128>>, <<ResponseAuthorId:128>>}, <<"reply">>}. A link is represented as {{Bucket, Key}, Tag}, this link does not point to a valid bucket, key pair but is purely for our own interpretation.

Here is how it would look

 

           <<"tweet_responses_bucket">>

           ----------------------------

 

           |----------------------------------------|

           |   <<20337776197:128>>                  |

           |----------------------------------------|

           |   Links                                |

           |                                        |

           | {{<<"tweet_responses_subkeys_bucket">>,| 

           |  <<20337776197:128,0:8>>},             |

           |  <<"tweet_response">>},                |

           | {{<<"tweet_responses_subkeys_bucket">>,| 

           |  <<20337776197:128,1:8>>},             |

           |  <<"tweet_response">>},                |

           |  ....                                  |

           |----------------------------------------|

           |   Value                                |

           |                                        |

           |----------------------------------------|

 

 

           <<"tweet_responses_subkeys_bucket">>

           ------------------------------------

 

           |----------------------------------------|

           |   <<20337776197:128,0:8>>              |

           |----------------------------------------|

           |   Links                                |

           |                                        |

           |{{<<20339861590:128>>,<<18035803:128>>},|

           |  <<"reply">>},                         |

           |  ....                                  |

           |----------------------------------------|

           |   Value                                |

           |                                        |

           |----------------------------------------|

 

           |----------------------------------------|

           |   <<20337776197:128,1:8>>              |

           |----------------------------------------|

           |   Links                                |

           |                                        |

           |{{<<20337857101:128>>,<<82294968:128>>},|

           |  <<"reply">>},                         |

           |  ....                                  |

           |----------------------------------------|

           |   Value                                |

           |                                        |

           |----------------------------------------|

 

 

 

Code

And now here is the piece of code this does the extraction of the popular replies. The function gives a sorted list of {TweetId, AuthorId} tuples which are then looked up and served.

Hopefully the code is self explanatory. Of interest is the make_local_fun which creates a function reference which can be passed over to a remote node, without the remote node having a copy of this compiled code in its path.

Feel free to comment on anything I have overlooked or could be done better :)

Filed under  //  code   erlang   map-reduce   riak  
Posted by Jebu Ittiachen 

Riak at inagist.com

At inagist.com we have been using Riak and yes we are loving it. We moved away from Cassandra after it started taxing our limited resources. The nice thing about Cassandra was the data model. Super columns allowed us to store metadata for a resource as needed. For example the retweets and replies of a tweet were stored in their own super columns associated with a tweet and we could pull it out as needed. Concurrency issues were also not a bother. We could do simultaneous updates to columns and super columns and not worry about data consistency issues. This is seriously tricky when maintaining tweet statistics. Popular tweets keep getting retweeted and replied to concurrently by many people. 

When looking for alternatives Riak was our first choice primarily because of it being in Erlang and since it had a map-reduce option which looked seriously promising. The ability to have a choice of backends was another compelling factor. Here are some of the interesting stuff that we have worked out in using Riak.

Using the Data model to our advantage 

At the heart of Riak everything is a key-value. All metadata is associated with the value and has to be read and updated as a single unit. The most interesting metadata is of-course the Links that you store along with a keys value. Interesting because Riak's map reduce has an extra option called link walking. This allows you to filter the links on a document by tag or bucket and feed the linked documents to the next phase. Infact Riak's map-reduce allows you to have any combination of link, map, reduce options to process your data. And yes these are optional too. So infact you can have a link-reduce, link-link-reduce or link-link type queries too. 

Why is this interesting? It allows us to store metadata on a seperate resource and link it to the main resource. Meaning we could have say 10 buckets storing the ids of the retweets as Links and the main resource has a link to these 10 buckets. You could parse through this list of with a link-link query. This reduces the contention on one resource for updates, parallelizes the read and spreads it across the cluster. We store replies and retweet details of a tweet in this model.

A link has three attributes Bucket, Key and Tag. Its supposed to refer to a Bucket and Key if you intend to further get data out of the linked document. But if you know what you are upto this allows to do for some serious extra data management. We currently store some tweet meta-data in Bucket and Key with a well known tag. When we later want to query for all replies to a
 tweet we do a link-link-reduce on well known tags and get the replies or retweets out. I'm not getting into specifics but it should give you the idea.

Interfacing with Riak

Most references point to using Riak via the HTTP interface or via the protocol buffers client. Great if you are working from a non Erlang environment.

We currently use the built in client for Riak over the protocol buffers client. With the main processing being in Erlang, and being a distributed app at that, there was no point in going thru extra layers to get into Riak. This also gives us some interesting options, like for example the "Your Friends" tab on the conversations page that you see once you log into inagist.com. This does a link-link-reduce-reduce where an extra reduce talks to remote erlang process for the logged in user to filter out only replies from his followers. See it in action on a popular tweet like this one from @BarackObama. Mind you the "Your Friends" feature will work only after your account is enabled on inagist.com, but the Popular tab works for anyone and pulls out only the replies which are of interest.

Storage options

And yes the back-end, currently we run the innostore back-end based on Embedded Innodb. This kind of makes Riak a distribution layer over a trusted storage layer. Of the back-ends available this has worked best for us giving a consistent performance along with being reasonable on the resource usage. But the biggest factor here is that it gives an option to plugin what you want like the trial we did with Tokyo Cabinet.

Our biggest bottle neck now is the disk space, we keep pruning the data set at a failry fast pace, roughly one week of data is all that we hold. We get a little above 5 million tweets a day from the twitter pipe and we keep cleaning out as the disks fill up.

A big thank you to the guys at Basho for Riak, its seriously awesome.

Filed under  //  erlang   map-reduce   nosql   riak  
Posted by Jebu Ittiachen 

How we rate content at inagist.com

Its been often asked of us, "How do you decide scores for the tweets that you show on inagist.com?" Well here is how we determine what to show and what not to show. 

We divide tweets coming in from the Twitter stream into 2 categories, ones which have a URL and ones which do not. The assumption being that tweets with URL tend to be teasers to the content at the linked URL. Tweets from different users mentioning the same URL (possibly through a URL shortner) are scored against the URL.

Another basic assumption is how we classify your follow community. The follow community is composed of the people you follow directly and people whom you follow indirectly through lists. If you follow a person directly we give a higher score to activity from that user opposed to activity from a user in one of the lists that you follow. The reasoning here being that people you directly follow is what you see on most Twitter clients by default and possibly people whom you care about more.

Once your twitter id is enabled on inagist.com, we fetch your follow community and start watching the twitter stream on your behalf for activity from, or about, the people in this community. A retweet or a reply to a tweet makes an activity on a tweet. A retweet of a person you directly follow by a person in one of your lists, gains a higher score than a retweet of that person by someone not in your community. And a retweet of that person by a person you directly follow gains an even higher score. Similar scoring applies to replies except that reply from a person not in your community to a person in your community does not count. Conversations between people you directly follow gain higher scores than conversations across people in your lists. 

We present tweets which have a high score sorted by time on your user page. And yes options are coming soon to sort by score and filter by the score.

We do all of this in real-time for each enabled user on inagist. To be reasonable on the costing behind all of this we use bounded caches for scoring tweets. So if you have more people in your follow community you will see tweets expiring faster from your main page. We do archive these tweets which we discover, so you can go back and see those tweets which we had picked out in the past. Clicking on a list name pulls in tweets from the archive for that particular list.

Conversations page gives you more context on the activity surrounding these filtered tweets. Clicking on the activity icon   takes you to the conversation page where all the replies to a tweet are shown in order. We also present two different views on these replies. Popular replies are ones which have been further replied to or retweeted. Replies from your friends brings out only replies from your community.

We encourage you to sign in and see this in action for yourself. Also get into Twitter and start following lists of your choice on topics that interest you, or create lists of your own and curate your sources. May we recommend verifiedtlists and scobleizer for a variety of lists to get you started. Continue using twitter with your favorite client and inagist.com to surface interesting tweets from all your sources, we assure you, you will not miss a single one of those interesting 140 characters.

 

Filed under  //  filtering   inagist   ranking   real time   relevance   scoring   tweet classification  
Posted by Jebu Ittiachen