Why Erlang?
I often get these weird looks whenever I mention that inagist is written in Erlang. So here is some of the key areas where Erlang is a winner for us.
What we do
At inagist we try to summarize a real-time stream in real-time. Currently we work on top of the Twitter stream api. By summarizing I mean filter in real time the popular tweets in a stream and group tweets based on trends. See it in action at justinbieber.inagist.com, libya.inagist.com (try with chrome / safari for best results since it uses websockets). We do this summary on a stream which could be combined in any number of ways. ie; a users own stream (my stream), a keyword based search stream (libya.inagist.com), keyword + geo-location based tweet stream (sxsw.inagist.com).
Lightweight Processes
The key differentiator here is the real-time nature of how we summarize the tweet stream. Instead of possibly persisting each tweet in the stream and running off-line analytics we maintain limited caches for each user and keep popping in and out each tweet as it gains popularity or when key words are repeated in the stream for trend detection. Here is where a key aspect of Erlang fits in. Each of the stream consumer is modelled as an Erlang process, being light weight and isolated. It essentially is a proxy for each user provisioned. It receives tweets from the stream, manipulates the caches and responds to api queries for serving data. Each of these stream consumers are gen_server implementations, tied in a supervisor chain. In case one of the consumers goes down the supervisor brings it back up with no impact on the rest of the user base.
Messages
So how do we couple this to the stream of incoming data, with messages. Each tweet is delivered to the consumers as messages from the stream api client. Each tweet consumer is part of a process group tree spanning across machines. The moment a tweet is received from the n/w it is spawned off as a seperate process which json decodes the tweet and send messages to the distribution tree. The message trickles down to the consumer process which does its job of cache updations. Being asynchronous messages the client is not concerned about how many consumers there are to a tweet. If a consumer process is interested in a tweet it can consume it. Being a decentralized delivery it scales when load of incoming tweets goes up or if the number of consumers increase.
Distributed Erlang
As the number of consumers increase another key aspect of Erlang comes into play. Distribution across machines. Consumer processes by design are known only by a process id. Erlang works with local process ids the same way as it does with distributed processes. As long as a consumer is part of the distribution tree the tweet will be delivered to the process. This helps us in easily scaling out. Individual machines could fail independently without affecting the cluster as a whole only users provisioned on that machine are offline for the time the machine is offline.
Real Time Delivery
To be as real-time as possible we prefer to deliver over websockets to connected clients. Messages come into play here again. Each stream consumer generates messages as its caches figure out a trending tweet or trend. Web socket clients tap into this message stream, convert them to JSON and deliver to the client. Our chrome browser plugin is a websocket client which utilizes this delivery model. The extension notifies via desktop popups when ever a trend is detected or a tweet gains popularity above a certain level. We also bring in a different angle to real-time search with the extension. When a trend is detected the extension automatically starts searching for prominent tweets with those detected trends.
Streaming Search
I have written previously about how we use Riak for search and storage. In addition we have built some custom stuff which enables streaming search. Whenever we index documents in Riak we also send out the indexing data as messages to the search infrastructure. We also send this index terms as soon as a tweet is received on a seperate index. Here we have process waiting for <index, field, value> tuples to match against and notifies waiting processes of a document matching the search criteria. We currently support ==, and, or, >, <, >=, =< operators so we can detect any tweet containing sxsw (==), justin bieber (and), documents containing a text or lat long within a bounding box etc. Stream consumers use this to get real time filtered tweets from the stream. Also the chrome plugin taps into this search stream to notify a user whenever a tweet matches a detected trend or an explicit search query. This is really powerful since we now automatically figure out what a users interest topics are by way of trends and we can let the user know whenever there is something matching this interest topic in real time. This whole streaming search works with such low over head because of the nature of the the message based architecture that we can stream this all the way upto the browser typically under a second or two. You can see it in work when you click on the "Live Stream" heading on pages like these justinbieber.inagist.com.
I have just given a high level view of where Erlang acts as a differentiator but it should give some insight into why we do Erlang all the way. Drop in a comment or @jebui and will be happy to give more information if needed.
PS: I have not heard any of Justin Bieber's or Lady Gaga's albums they just happen to have a very active tweet stream.