Thursday, January 01, 2015

Probabilistic techniques, data streams and online learning - Looking forward to a bigger 2015

I look forward to 2015 as the year when randomized algorithms, probabilistic techniques and data structures become more pervasive and mainstream. The primary driving factors will be the growing prevalence of big data and the need to process it in near real time using minimal (or constant) memory. You are given data streams where you will possibly see each data item only once in your lifetime, and you need to churn out analytics from them in real time. You cannot afford to store all of it in a database on disk, since serving queries in real time would then incur an unrealistic performance penalty. And you cannot afford to keep everything in memory, even if you add RAM at will. You need to find clever ways to optimize your storage, and employ algorithms and data structures that use sublinear space and yet deliver information in real time.

Many such data structures are already being used quite heavily for specialized processing of data streams ..


These data structures are becoming more and more useful as we prepare to embrace and process larger data sets with fairly strict online requirements. And they have started making a difference. Take, for example, Impala, the open source analytic database from Cloudera that works on top of Hadoop. Impala's NDV aggregate function (number of distinct values) uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space. This blog post has the details of the performance improvement it offers over a standard distinct count. The immensely popular NoSQL store Redis also offers a HyperLogLog implementation that you can use to get an approximation of the cardinality of a set using randomization. Salvatore has the details here on the implementation of the HyperLogLog algorithm in Redis.
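To give a feel for how these estimators work, here is a deliberately simplified HyperLogLog-style sketch in Python. It is not the Redis or Impala implementation (both add bias correction, sparse encodings and careful register packing); the class name, parameter choices and the trailing-zero rank are all simplifications made for illustration.

```python
import hashlib
import math

class SimpleHLL:
    """Simplified HyperLogLog-style cardinality sketch (illustrative only).

    Core idea: hash each item, use the low p bits to pick one of 2**p
    registers, and let each register remember the maximum 'rank' (here,
    trailing zeros + 1 of the remaining hash bits) it has ever seen.
    """

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p                      # 2**14 = 16384 registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        h = int(hashlib.sha1(item.encode("utf-8")).hexdigest(), 16)
        idx = h & (self.m - 1)               # low p bits choose the register
        w = h >> self.p                      # remaining bits feed the rank
        rank = (w & -w).bit_length() if w else 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:    # small-range (linear counting) correction
            est = self.m * math.log(self.m / zeros)
        return int(est)

if __name__ == "__main__":
    hll = SimpleHLL()
    for i in range(1000000):
        hll.add("user-%d" % i)
    print(hll.count())                       # close to 1,000,000 distinct values
```

The point is the space bound: whether the stream carries a thousand or a billion distinct values, the sketch never grows beyond its 2^p registers (roughly 16 KB of register state here), while the relative error stays around one percent.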

The most important reason these algorithms and data structures are becoming popular is the increased focus on our "online" requirements. We are not only processing bigger and bigger data sets, we need results faster too. We just cannot afford to push all analytics to batch mode and expect results to come out after an overnight batch run. Various architectural paradigms like the lambda architecture also aim to address this niche. But before investing in such complex architectures, some neat data structures that use probabilistic techniques and randomization may often offer the much lighter-weight solution you are looking for.

Consider processing the Twitter stream and generating analytics (of whatever form) online. This means that immediately after seeing one tweet you must be able to predict something and update your model at the same time. In other words, you need to absorb each data point as it arrives in the feed and apply it to update your model, yet you cannot store the entire firehose you have seen so far. This is online learning, and it is the essence of techniques like stochastic gradient descent: the model is capable of making up-to-date predictions after every data point it sees. John Myles White has an excellent presentation on this topic.
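As a concrete illustration of that loop, here is a minimal online learner in Python: a logistic-regression model whose weights are updated by one stochastic gradient descent step per incoming example. Everything here (the class, the constant learning rate, the synthetic stand-in for tweet features) is an illustrative sketch, not a reference implementation.

```python
import math
import random

class OnlineLogisticRegression:
    """Logistic regression trained one example at a time with SGD.

    Minimal sketch: dense weights and a constant learning rate; real
    streaming learners add regularization, adaptive rates, feature
    hashing, and so on.
    """

    def __init__(self, n_features, learning_rate=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = learning_rate

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """One SGD step on a single (features, 0/1 label) pair."""
        error = self.predict_proba(x) - y        # gradient of log loss w.r.t. z
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * error * xi
        self.b -= self.lr * error

if __name__ == "__main__":
    # Synthetic stream: label is 1 when x0 + x1 > 1 (a stand-in for tweet features).
    model = OnlineLogisticRegression(n_features=2)
    for _ in range(10000):
        x = [random.random(), random.random()]
        y = 1 if x[0] + x[1] > 1 else 0
        _ = model.predict_proba(x)   # the model can answer before it learns from x
        model.update(x, y)           # then it folds x into the weights and moves on
    print(model.predict_proba([0.9, 0.9]), model.predict_proba([0.1, 0.1]))
```

Nothing about the stream is retained except the weight vector itself, which is exactly the property you want when you cannot store the hose.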

Consider this other problem of detecting similarities between documents. When you are doing this at Web scale, you will have to deal with millions of documents to find the similar sets. There are techniques like minhash which enable you to compress documents into signature matrices. But even then the scale becomes too big to be processed and reported to the user in a meaningful amount of time. As an example (from Mining Massive Datasets), if you process 1 million documents using signatures of length 250, you still need about 1000 bytes per document; the total comes to 1 gigabyte, which fits comfortably into the memory of a standard laptop. But when you check for similar pairs, you need to process (1,000,000 choose 2), or half a trillion, pairs of documents, which would take almost 6 days to compute on a laptop. Enter probabilistic techniques: the locality sensitive hashing (LSH) algorithm fits this problem like a charm. Detecting similarity is a problem that also arises in recommender systems with collaborative filtering, and LSH can be used there as well. The basic idea of LSH as applied to similarity detection is to hash the items multiple times and identify candidate pairs that qualify for a full similarity check, as the sketch below shows. The goal is to shrink the search space using probabilistic techniques, eliminating the candidates that have a very low chance of being similar.
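To make the banding idea concrete, here is a small self-contained sketch in Python. It builds minhash signatures of length 100 and then splits them into 20 LSH bands of 5 rows, so that documents colliding in any band become candidate pairs. The word-level shingling, the hash family and the three toy documents are illustrative choices, not the book's exact construction.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

NUM_HASHES = 100          # minhash signature length
BANDS, ROWS = 20, 5       # 20 bands of 5 rows each
PRIME = (1 << 61) - 1     # large prime for the universal-hash family

random.seed(42)
HASH_FUNCS = [(random.randrange(1, PRIME), random.randrange(PRIME))
              for _ in range(NUM_HASHES)]

def shingles(text, k=3):
    """k-word shingles; real systems often shingle on characters instead."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)} or set(words)

def stable_hash(s):
    """Deterministic 64-bit hash of a shingle."""
    return int(hashlib.md5(s.encode("utf-8")).hexdigest()[:16], 16)

def minhash_signature(shingle_set):
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * stable_hash(s) + b) % PRIME for s in shingle_set)
            for a, b in HASH_FUNCS]

def lsh_candidate_pairs(signatures):
    """Band the signatures; documents sharing any band bucket become candidates."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for band in range(BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

if __name__ == "__main__":
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "the quick brown fox jumps over the lazy cat",
        "d3": "an entirely different sentence about stream processing",
    }
    sigs = {d: minhash_signature(shingles(t)) for d, t in docs.items()}
    # d1/d2 are highly similar and should surface as a candidate pair; d3 should not.
    print(lsh_candidate_pairs(sigs))
```

Only the pairs that land in the same bucket for some band ever get their exact similarity computed, which is how LSH turns the half-trillion-pair comparison into something a single machine can handle.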

Here I have only scratched the surface of the areas where we apply randomization and probabilistic techniques to solve problems that are very real today. There are plenty of other areas in data mining, graph clustering, machine learning and big data processing where similar techniques are employed to tame the curse of dimensionality and provide practical solutions at scale. 2014 has already seen a big surge in popularizing these techniques. I expect 2015 to be bigger and more mainstream in terms of their usage.

Personally, I have been exploring data stream algorithms a lot and have prepared a collection of some useful references. Feel free to share it if you find it useful. I hope to do something more meaningful with stream processing data structures and online learning in 2015. Have a very happy and joyous new year ..