Introduction to Programming, Aug-Dec 2008
Lecture 17, Wednesday 22 Oct 2008

The abstract datatype Set
-------------------------

A set is a collection of elements without repetitions and without any
prespecified order.  We associate two collections of operations with
sets.

Dictionary operations:

  empty   :: Set a
  isempty :: Set a -> Bool
  member  :: Set a -> a -> Bool
  insert  :: Set a -> a -> Set a
  delete  :: Set a -> a -> Set a

Set operations:

  union     :: Set a -> Set a -> Set a
  intersect :: Set a -> Set a -> Set a
  setdiff   :: Set a -> Set a -> Set a

Possible implementations:

1.  The first possibility is to represent a set as a list, possibly
    with elements repeated.  We use the definition:

      data Set a = Setof [a]

    With this definition, the dictionary operations can be defined as
    follows.

      empty = Setof []
      isempty (Setof l) = l == []
      member (Setof l) x = elem x l
      insert (Setof l) x = Setof (x:l)
      delete (Setof l) x = Setof (filter (/= x) l)

    or, alternatively,

      delete (Setof l) x = Setof [y | y <- l, y /= x]

    If the size of the set is n, the size of the representation, N,
    may be much larger than n because of repetitions.

    Complexity of dictionary operations:

      empty   : O(1)
      isempty : O(1)
      member  : O(N)
      insert  : O(1)
      delete  : O(N)

    Other set operations:

      union (Setof xs) (Setof ys) = Setof (xs++ys)
      intersect (Setof xs) (Setof ys) = Setof [y | y <- ys, elem y xs]
      setdiff (Setof xs) (Setof ys) = Setof [x | x <- xs, not (elem x ys)]

    Let M and N be the sizes of the representations Setof xs and
    Setof ys, respectively.  Then union takes time proportional to
    M+N, while intersect and setdiff take time proportional to MN.

2.  To reduce the size of the list representation to the size of the
    set being represented, we can inductively maintain the set as a
    list without repetitions.  The data definition remains the same:

      data Set a = Setof [a]

    Under the assumption that the list has no repetitions, the
    dictionary operations can be defined as follows.
      empty = Setof []
      isempty (Setof l) = l == []
      member (Setof l) x = elem x l
      insert (Setof l) x = Setof (x:(filter (/= x) l))
      delete (Setof l) x = Setof (filter (/= x) l)

    The only change is in the definition of insert, where we ensure
    that all existing copies of x are removed before we put in the
    new value.

    Now, if the size of the set is n, the complexity of the
    dictionary operations is given by:

      empty   : O(1)
      isempty : O(1)
      member  : O(n)
      insert  : O(n)
      delete  : O(n)

    For the other set operations, intersect and setdiff can be
    retained as in the earlier representation.  When computing the
    union, we have to remove duplicates.  One way is to
    systematically insert all the values from the second set into the
    first set, making use of the fact that insert handles the problem
    of filtering out duplicates.

      union (Setof xs) (Setof []) = Setof xs
      union (Setof xs) (Setof (y:ys)) =
        union (insert (Setof xs) y) (Setof ys)

    With this definition, the union of two sets of sizes m and n
    takes time O(mn), as do intersect and setdiff.

3.  If the values being stored in the set can be compared, we can
    maintain the set as a sorted list without repetitions.

      data (Ord a) => Set a = Setof [a]

      empty = Setof []
      isempty (Setof l) = l == []
      member (Setof l) x = elem x l
      insert (Setof l) x = Setof (listinsert x l)
      delete (Setof l) x = Setof (filter (/= x) l)

    Here, the insert function on sets calls the insert function on
    sorted lists that we have seen in the definition of insertion
    sort (renamed listinsert here, to avoid a clash with the insert
    function on sets).  We do not gain anything in terms of
    complexity for the dictionary operations.  We still have:

      empty   : O(1)
      isempty : O(1)
      member  : O(n)
      insert  : O(n)
      delete  : O(n)

    However, we can now use the fact that we can merge two sorted
    lists in linear time to write more efficient implementations of
    the other set operations.
      union (Setof xs) (Setof ys) = Setof (unionmerge xs ys)
        where
          unionmerge [] ys = ys
          unionmerge xs [] = xs
          unionmerge (x:xs) (y:ys)
            | x < y     = x:(unionmerge xs (y:ys))
            | y < x     = y:(unionmerge (x:xs) ys)
            | otherwise = x:(unionmerge xs ys)

      intersect (Setof xs) (Setof ys) = Setof (intersectmerge xs ys)
        where
          intersectmerge [] ys = []
          intersectmerge xs [] = []
          intersectmerge (x:xs) (y:ys)
            | x < y     = intersectmerge xs (y:ys)
            | y < x     = intersectmerge (x:xs) ys
            | otherwise = x:(intersectmerge xs ys)

      setdiff (Setof xs) (Setof ys) = Setof (setdiffmerge xs ys)
        where
          setdiffmerge [] ys = []
          setdiffmerge xs [] = xs
          setdiffmerge (x:xs) (y:ys)
            | x < y     = x:(setdiffmerge xs (y:ys))
            | y < x     = setdiffmerge (x:xs) ys
            | otherwise = setdiffmerge xs ys

    Merging two lists of lengths m and n takes time O(m+n), so each
    of union, intersect and setdiff now takes time O(m+n).

4.  If the elements of the set can be compared with each other, we
    can move from a sorted list representation to a balanced search
    tree representation.

      data (Ord a) => Set a = Setof (STree a)

    Recall that to maintain balanced search trees, we store the
    height of a node along with the value, so we have

      data (Ord a) => STree a = Nil | Node Int (STree a) a (STree a)

    In this representation, it is clear that the dictionary
    operations member, insert and delete take time O(log n) for a
    set of size n, thus beating the other representations by a long
    way.

    How about union, intersect and setdiff?  The naive
    implementation of union would be to insert each element of the
    second set into the first, or vice versa, yielding an algorithm
    with complexity O(min(n log m, m log n)).  To implement this, we
    can first flatten one of the sets into a list using an inorder,
    preorder or postorder traversal and then systematically run
    through the elements of this list.

    Can we do better?  We know that we can perform unionmerge,
    intersectmerge and setdiffmerge on two sorted lists in linear
    time.
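The three merge functions from the sorted-list representation can be
checked independently of the Set wrapper.  The following
self-contained sketch uses the same names and clauses as in the
notes, lifted to top level:

```haskell
-- Merge-style set operations on sorted, duplicate-free lists,
-- following the definitions in the notes.

unionmerge :: (Ord a) => [a] -> [a] -> [a]
unionmerge [] ys = ys
unionmerge xs [] = xs
unionmerge (x:xs) (y:ys)
  | x < y     = x : unionmerge xs (y:ys)
  | y < x     = y : unionmerge (x:xs) ys
  | otherwise = x : unionmerge xs ys      -- x == y: keep one copy

intersectmerge :: (Ord a) => [a] -> [a] -> [a]
intersectmerge [] _ = []
intersectmerge _ [] = []
intersectmerge (x:xs) (y:ys)
  | x < y     = intersectmerge xs (y:ys)
  | y < x     = intersectmerge (x:xs) ys
  | otherwise = x : intersectmerge xs ys  -- keep common elements

setdiffmerge :: (Ord a) => [a] -> [a] -> [a]
setdiffmerge [] _ = []
setdiffmerge xs [] = xs
setdiffmerge (x:xs) (y:ys)
  | x < y     = x : setdiffmerge xs (y:ys)  -- x is not in the second list
  | y < x     = setdiffmerge (x:xs) ys
  | otherwise = setdiffmerge xs ys          -- drop common elements
```

Each call consumes at least one element of one of the two input
lists, so a call on lists of lengths m and n makes at most m+n
recursive steps.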
    We can obtain a sorted list from a (balanced) search tree using
    inorder.  We can also write a function mkbtree to construct a
    balanced search tree from a sorted list.  If inorder and mkbtree
    work in linear time, we can implement union, intersect and
    setdiff in linear time in the balanced search tree
    representation.

    Let us start by analyzing inorder:

      inorder :: (STree a) -> [a]
      inorder Nil = []
      inorder (Node m t1 x t2) = (inorder t1)++[x]++(inorder t2)

    If t is balanced, the left and right subtrees t1 and t2 are half
    the size of the main tree.  Let T(n) denote the time required to
    generate the inorder traversal of a balanced tree with n nodes.
    The time taken to combine the two recursive inorder traversals is
    proportional to the length of (inorder t1), because ++ takes time
    proportional to the length of its left argument.  Thus, we have

      T(n) = 2 T(n/2) + O(n)

    We have seen this recurrence before (e.g., mergesort) and its
    solution is T(n) = O(n log n), which is larger than the O(n)
    solution that we seek.

    The problem arises because Haskell lists are built up right to
    left, so it is inefficient to write a function that builds a list
    left to right.  An example of this is the naive reverse function
    that we have seen earlier, which takes quadratic time.

      reverse [] = []
      reverse (x:xs) = (reverse xs) ++ [x]

    As with reverse, the key to making inorder more efficient is to
    use an auxiliary parameter.  Let us define a new function

      inorderaux :: (STree a) -> [a] -> [a]

    such that inorderaux t l yields (inorder t)++l.  We can then
    recover inorder t as inorderaux t [].  Here is a definition of
    inorderaux.

      inorderaux Nil l = l
      inorderaux (Node m t1 x t2) l =
        inorderaux t1 (x:(inorderaux t2 l))

    In other words, we compute the inorder traversal of Node m t1 x t2
    from right to left, first placing inorder t2 to the left of l,
    then adding x to the left of the resulting list and finally
    placing inorder t1 at the leftmost position.
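A small sanity check that inorderaux agrees with inorder, under the
STree definition above (the example tree, storing 1..3 with heights
in the Int field, is ours):

```haskell
data STree a = Nil | Node Int (STree a) a (STree a)

-- direct traversal: on balanced trees this is O(n log n), because
-- each use of ++ copies the left traversal
inorder :: STree a -> [a]
inorder Nil = []
inorder (Node _ t1 x t2) = inorder t1 ++ [x] ++ inorder t2

-- accumulator version: inorderaux t l == inorder t ++ l, computed
-- in O(n) because each node does O(1) work (a single cons)
inorderaux :: STree a -> [a] -> [a]
inorderaux Nil l = l
inorderaux (Node _ t1 x t2) l = inorderaux t1 (x : inorderaux t2 l)

-- a small balanced search tree holding 1, 2, 3
example :: STree Int
example = Node 2 (Node 1 Nil 1 Nil) 2 (Node 1 Nil 3 Nil)
```

Evaluating inorderaux example [] should produce [1,2,3], the same
list as inorder example; the accumulator argument simply ends up as
the tail of the result.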
    The complexity of inorderaux is given by:

      T(n) = 2 T(n/2) + O(1)

    The crucial feature is that we need only constant time to combine
    the two recursive computations of size n/2.  For this recurrence,
    the solution is T(n) = O(n), which is what we seek.

    The next function that we have to implement in linear time is
    mkbtree, which constructs a balanced search tree from a sorted
    list.  We need not insist that the input list be sorted.  We
    shall write mkbtree in such a way that the output of mkbtree is a
    balanced tree and

      inorder (mkbtree l) == l

    If l is sorted, this ensures that mkbtree generates a search
    tree.

    The naive way to make a balanced tree from a list is to use the
    centre element of the list as the root and recursively construct
    balanced left and right subtrees from the first and second halves
    of the list (for readability, we omit the height field of Node
    below):

      mkbtree :: [a] -> (STree a)
      mkbtree [] = Nil
      mkbtree [x] = Node Nil x Nil
      mkbtree l = Node (mkbtree left) root (mkbtree right)
        where
          m = (length l) `div` 2
          root = l!!m
          left = take m l
          right = drop (m+1) l

    The complexity of mkbtree is given by the following recurrence:

      T(n) = 2 T(n/2) + O(n)

    The O(n) factor comes in because it takes linear time to compute
    the midpoint of the input list and break it up into two halves.
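The naive mkbtree can be made concrete as follows.  This sketch fills
in the height field that the notes leave out, using a hypothetical
smart constructor node (our addition, not in the notes) that computes
the height from the subtrees; with it, the separate base case for
one-element lists is no longer needed:

```haskell
data STree a = Nil | Node Int (STree a) a (STree a)

height :: STree a -> Int
height Nil = 0
height (Node h _ _ _) = h

-- hypothetical smart constructor: fills in the height field
node :: STree a -> a -> STree a -> STree a
node t1 x t2 = Node (1 + max (height t1) (height t2)) t1 x t2

-- centre element becomes the root; the two halves become subtrees
mkbtree :: [a] -> STree a
mkbtree [] = Nil
mkbtree l  = node (mkbtree left) root (mkbtree right)
  where
    m     = length l `div` 2
    root  = l !! m
    left  = take m l
    right = drop (m+1) l

inorder :: STree a -> [a]
inorder Nil = []
inorder (Node _ t1 x t2) = inorder t1 ++ [x] ++ inorder t2
```

By construction, inorder (mkbtree l) returns l, and the two subtrees
at every node differ in size by at most one, so the tree is balanced.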