Monday, June 07, 2010

Functional Data Structures in a Persistent Setting - Part 1

When you have multiple options for implementing a data structure, or you have multiple implementations of the same ADT to choose from, what do you do? Do you pick the one with the simplest implementation, the one with the best amortized complexity, or the one with the best worst-case complexity? Dick Lipton has a great post on how to decide What is the right complexity measure for algorithms?. In case you haven't read it yet, leave this useless blog post and go read it over at his blog.

Which implementation of a data structure you should use depends on your use case. If you are more concerned about the overall time complexity of a sequence of operations on the data structure, choose the one with the best amortized complexity bounds. With an amortized data structure that guarantees a time bound of O(n) for a sequence of n operations, some individual operations may take more than O(1). But you are OK so long as the overall bound of O(n) is maintained for the whole sequence.

However, there are situations where you cannot afford amortized data structures. Many real-time systems need predictability that can only be guaranteed through worst-case time bounds. For such cases, you need to use a data structure implementation that gives you worst-case guarantees.

In my last post I discussed several functional data structures that come with the guarantee of being persistent, which means that even after you make changes to the data structure, you can still access all the earlier versions at no added performance cost. In that post I briefly touched upon some of the techniques that implementations of persistence employ to ensure there's no brute-force copying going on behind the scenes. Clojure uses them to great effect in implementing its family of persistent data structures. Scala 2.8 also has an implementation of Bagwell's Hash Array Mapped Trie, the secret sauce behind Clojure's persistent hash maps and vectors. But this is not a post for discussing Clojure or Scala data structure implementations .. so let's move on ..

Amortized data structures, in a sense, walk the middle path between Dick Lipton's Genius Channels and Dumb Channels. You have an adversary that makes some of the operations expensive, threatening to pull down your promised bound. But at the same time you have the cheap, friendly operations that act as a counterbalancing force, maintaining the amortized bound over the whole sequence of operations.

Consider Okasaki's BatchedQueue implementation, which uses a pair of lists to implement a queue. The elements of the queue are split across the two lists, front and rear. front contains the first half of the elements in the correct order, while rear contains the other half in reverse order. Dequeue takes place from the front (see head and tail below), while enqueue takes place at the rear (see snoc below). Here's a sketch of the Haskell implementation of BatchedQueue, following Okasaki's book ..
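The check smart constructor below maintains the invariant that front can be empty only if the entire queue is empty; the names follow the book's presentation, though minor details here are my own.

module BatchedQueue where

import Prelude hiding (head, tail)

data BatchedQueue a = BQ [a] [a]

-- smart constructor: front may be empty only if the whole queue is empty
check :: [a] -> [a] -> BatchedQueue a
check [] r = BQ (reverse r) []
check f  r = BQ f r

empty :: BatchedQueue a
empty = BQ [] []

isEmpty :: BatchedQueue a -> Bool
isEmpty (BQ f _) = null f

-- enqueue at the rear
snoc :: BatchedQueue a -> a -> BatchedQueue a
snoc (BQ f r) x = check f (x : r)

-- peek at the front
head :: BatchedQueue a -> a
head (BQ []      _) = error "empty queue"
head (BQ (x : _) _) = x

-- dequeue from the front
tail :: BatchedQueue a -> BatchedQueue a
tail (BQ []      _) = error "empty queue"
tail (BQ (_ : f) r) = check f r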

Clearly head runs in worst-case O(1). But tail has a worst-case cost of O(n), since it has to reverse the rear list when front becomes empty. Still, with an amortized analysis, using either the credits of the Banker's method or the potential of the Physicist's method, you can prove that all operations run in amortized O(1). Okasaki's book contains the full proof of the amortized bounds for BatchedQueue.
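Here's a quick sketch of the Banker's method accounting (my summary; the full proof is in the book): charge every snoc one extra credit, saved on the element it pushes onto rear. A reverse of a rear list of length m then costs m steps, exactly paid for by the m credits accumulated on its elements:

snoc: 1 (actual cons) + 1 (credit saved) = 2 amortized
tail, no reverse: 1 amortized
tail, with reverse of length m: 1 + m (actual) - m (credits spent) = 1 amortized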

The problem with BatchedQueue is that all of these bounds break down in a persistent setting. Remember we talked about multiple logical futures for a persistent data structure. Take one specific instance of a BatchedQueue: if you invoke tail on it persistently, the same expensive reverse gets executed multiple times, once in each logical future that creates a new version from it. Have a look at the following diagram that illustrates why BatchedQueue cannot offer amortized O(1) in a persistent setting.

[Figure: a BatchedQueue holding 1 through 10, split evenly across front and rear; after four invocations of tail, a single element remains in front, and two persistent invocations of tail on that version each force their own reverse of the rear list]

In the figure above, we have the original queue containing the numbers 1 through 10 distributed evenly across the front and rear lists. After 4 invocations of tail, we reach the second scenario, with only a single element left in the front list. If we now invoke tail twice in a persistent setting, each invocation runs reverse separately, even though the rear list carries only 5 credits, enough to pay for just one of those operations.
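To see this concretely, here's a small standalone experiment (my own instrumentation, not from the book) where check announces every reverse it performs. Both logical futures of the same queue version end up reversing the same five-element rear list:

import Debug.Trace (trace)
import Prelude hiding (tail)

data BatchedQueue a = BQ [a] [a]

-- same check as before, instrumented to log every reverse
check :: [a] -> [a] -> BatchedQueue a
check [] r = trace ("reversing " ++ show (length r) ++ " elements")
                   (BQ (reverse r) [])
check f  r = BQ f r

tail :: BatchedQueue a -> BatchedQueue a
tail (BQ []      _) = error "empty queue"
tail (BQ (_ : f) r) = check f r

front :: BatchedQueue a -> [a]
front (BQ f _) = f

main :: IO ()
main = do
  -- the queue from the figure: 1..5 in front, 6..10 (reversed) in rear
  let q  = BQ [1 .. 5] (reverse [6 .. 10]) :: BatchedQueue Int
      q' = tail (tail (tail (tail q)))  -- front is now [5]
  print (front (tail q'))  -- first logical future: triggers a reverse
  print (front (tail q'))  -- second logical future: triggers it again

Running this prints "reversing 5 elements" twice, once per logical future, even though a single amortized analysis only budgeted for one of them.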

In an upcoming post, I will discuss BankersQueue, an alternative implementation of the queue data structure that offers O(1) amortized complexity even in a persistent setting.
