Introduction to Programming, Aug-Dec 2008
Lecture 11, Wednesday 17 Sep 2008


User defined datatypes
----------------------

A datatype is a collection of values with a collective name.  For
instance, the datatype Int consists of the values
{...,-2,-1,0,1,2,...} while the datatype Bool consists of the
values {False,True}.  Datatypes can have structure, and may be
polymorphic --- for example, tuples.  Datatypes can also be
recursively defined and hence of unbounded size --- for example,
lists.

In Haskell, we can extend the set of built-in types using the
"data" statement.

Enumerated datatypes
--------------------

The simplest form of datatype is one consisting of a finite set
of values.  We can define such a type using the "data" statement,
as follows.

  data Day = Sun | Mon | Tue | Wed | Thu | Fri | Sat

Having introduced this new type, we can directly use it in
functions such as:

  weekday :: Day -> Bool
  weekday Sun = False
  weekday Sat = False
  weekday _   = True

We can also write a function "nextday".

  nextday :: Day -> Day
  nextday Sun = Mon
  nextday Mon = Tue
  ...
  nextday Fri = Sat
  nextday Sat = Sun

What happens if we ask Haskell to evaluate "nextday Fri"?  The
answer is computed correctly as "Sat" but we get a message

  ERROR - Cannot find "show" function for:
  *** Expression : nextday Fri
  *** Of type    : Day

Similarly, if we ask whether "Tue == Wed", the response is

  ERROR - Cannot infer instance
  *** Instance   : Eq Day
  *** Expression : Tue == Wed

The problem is that we have not associated the new datatype with
any type classes, including the most basic ones such as Eq and
Show.  One way to do this is to write our own instance declarations.

  instance Eq Day where
    Sun == Sun = True
    Mon == Mon = True
    ...
    Sat == Sat = True
    _   == _   = False


  instance Show Day where
    show Sun = "Sun"
    show Mon = "Mon"
    ...
    show Sat = "Sat"

These are the most natural definitions for Eq and Show --- each
value is distinct and equal only to itself and each value is
displayed in the same way it is defined.  To make things easier,
we can include these "default" instance definitions for Eq and
Show using the word "deriving" as follows:

   data Day = Sun | Mon | Tue | Wed | Thu | Fri | Sat
       deriving (Eq, Show)

In the same way, we can derive an instance definition for Ord ---
the default definition would order the values in the sequence
that they are presented, namely Sun < Mon < ... < Sat.
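
As a minimal sketch of what the derived instances provide, the
following file declares Day with Eq, Ord and Show derived.  The
helper "isWeekend" is a hypothetical function added only for
illustration; it is not part of the notes.

```haskell
-- Day with derived instances for equality, ordering and display.
data Day = Sun | Mon | Tue | Wed | Thu | Fri | Sat
    deriving (Eq, Ord, Show)

-- Hypothetical helper: relies on the derived Eq instance.
isWeekend :: Day -> Bool
isWeekend d = d == Sat || d == Sun

-- Relies on the derived Ord instance: constructors are ordered
-- in the sequence in which they are listed in the declaration.
latest :: [Day] -> Day
latest = maximum
```

With this file loaded, "Tue == Wed" evaluates to False, "Mon < Sat"
to True, and "show Fri" to the string "Fri".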

Note that the built in datatype Bool can be thought of as defined
in this way:

  data Bool = False | True
     deriving (Eq, Ord, Show)

In fact, even Char and Int (whose range is effectively finite
because we use a fixed number of bits to represent an Int) can be
thought of as defined in the same way.

Datatypes with parameters
-------------------------

We can go beyond finite enumerated types and describe datatypes
with a parameter, as in the following example.

   data Shape = Square Float | Circle Float | Rectangle Float Float
      deriving (Eq, Ord, Show)

   area :: Shape -> Float

   area (Square x)      = x*x
   area (Circle r)      = pi*r*r   -- pi is defined in the Prelude
   area (Rectangle l w) = l*w

Each variant of Shape has a constructor: Square, Circle or
Rectangle.  Each constructor is attached to a group of values, which
can vary from constructor to constructor.  The values Sun, Mon
etc in the type Day are also constructors with zero values
attached.

What happens when we derive Eq for Shape?  At the level of Shape,
this will ensure that (Square x) will be equal to (Square y)
provided x == y but (Square x) is never equal to (Circle y), etc.
When we derive Ord, we have Square < Circle < Rectangle so
(Square x) < (Circle y) for all x and y, (Circle z) < (Circle w)
if z < w, etc.
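
The behaviour of the derived instances can be checked directly.
The sketch below lists a few comparisons, each of which should
evaluate to True; the name "checks" is a hypothetical one chosen
for this example.

```haskell
data Shape = Square Float | Circle Float | Rectangle Float Float
    deriving (Eq, Ord, Show)

-- Each entry is expected to be True under the derived instances.
checks :: [Bool]
checks =
  [ Square 2.0 == Square 2.0                -- same constructor, equal arguments
  , Square 2.0 /= Circle 2.0                -- different constructors never equal
  , Square 100.0 < Circle 1.0               -- Square precedes Circle
  , Circle 1.0 < Circle 2.0                 -- same constructor: compare arguments
  , Circle 50.0 < Rectangle 1.0 1.0         -- Circle precedes Rectangle
  , Rectangle 1.0 5.0 < Rectangle 2.0 1.0   -- arguments compared left to right
  ]
```

Evaluating "and checks" should give True.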


Polymorphic datatypes
---------------------

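We can extend our definition of Shape to permit any numeric type
as the parameter.  Here is the corresponding definition.  Note
that we need to include the type parameter a in the name of the
type: the datatype is "Shape a", not just "Shape".  The
restriction to numeric types is stated as a constraint "Num a" on
the functions that need it.  (Haskell 98 also allowed the
constraint in the declaration itself, as in
"data Num a => Shape a = ...", but current versions of GHC reject
this unless the deprecated DatatypeContexts extension is enabled,
and the functions need the constraint in their own signatures
in either case.)

   data Shape a = Square a | Circle a | Rectangle a a
      deriving (Eq, Ord, Show)

   size :: Num a => Shape a -> a

   size (Square x)      = x
   size (Circle r)      = r
   size (Rectangle l w) = l*w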
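
The following sketch shows "Shape a" used at two different numeric
types.  It assumes the Num a constraint is written on the
functions rather than in the data declaration, which GHC accepts
without extensions.  The helper "scale" and the two sample values
are hypothetical additions for illustration.

```haskell
data Shape a = Square a | Circle a | Rectangle a a
    deriving (Eq, Ord, Show)

-- The multiplication in the last clause is why Num a is needed.
size :: Num a => Shape a -> a
size (Square x)      = x
size (Circle r)      = r
size (Rectangle l w) = l * w

-- Hypothetical helper: multiply every dimension by a factor.
scale :: Num a => a -> Shape a -> Shape a
scale k (Square x)      = Square (k * x)
scale k (Circle r)      = Circle (k * r)
scale k (Rectangle l w) = Rectangle (k * l) (k * w)

-- The same constructors work at different numeric types.
intShape :: Shape Int
intShape = Rectangle 3 4

doubleShape :: Shape Double
doubleShape = Circle 1.5
```

Here "size intShape" is 12 and "scale 2 intShape" is
"Rectangle 6 8".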

Recursive datatypes
-------------------

We can have recursive datatypes.  Here is an example.

   data Mylist = Empty | Listof Int Mylist

Here the constructors are Empty and Listof.  Empty has zero
arguments and is hence a constant, representing the base case of
the recursive type.  The constructor Listof combines a value of
type Int with a nested instance of Mylist.   Here is a
value of type Mylist corresponding to the list [1,3,2].

   Listof 1 (Listof 3 (Listof 2 Empty))

In Haskell's builtin definition of lists, Empty is written as []
and Listof is written as an infix constructor ":", so the value
above becomes the more familiar

   1 : (3 : (2 : []))

from which we can eliminate the brackets using the right
associativity of ":".

It is a small step to extend Mylist to be polymorphic.

   data Mylist a = Empty | Listof a (Mylist a)

Now, a term that uses the constructor Listof has a value of type
"a" and a nested list of the same type.  Note again that the full
name of the type is "Mylist a", not just "Mylist".
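
To make the correspondence with builtin lists concrete, here is a
small sketch that converts between the two representations.  The
function names "fromList" and "toList" are hypothetical, and the
deriving clause is added only so that values can be compared and
displayed.

```haskell
data Mylist a = Empty | Listof a (Mylist a)
    deriving (Eq, Show)

-- Replace [] by Empty and (:) by Listof.
fromList :: [a] -> Mylist a
fromList []     = Empty
fromList (x:xs) = Listof x (fromList xs)

-- The inverse: replace Empty by [] and Listof by (:).
toList :: Mylist a -> [a]
toList Empty         = []
toList (Listof x xs) = x : toList xs
```

For instance, "fromList [1,3,2]" is
"Listof 1 (Listof 3 (Listof 2 Empty))".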

If we change the definition slightly, we get a version of lists
where each new element is appended to the right, rather than the
left. 

   data Mylist a = Empty | Listof (Mylist a) a

In this representation, a list such as [1,3,2] would be written
as

  Listof (Listof (Listof Empty 1) 3) 2

For inductively defined types, we can write inductive functions
to process them.   Just as for builtin lists, we can use pattern
matching to decompose a value into its parts.  For instance, here
is a definition of length corresponding to the last definition of
Mylist a.

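  -- The Prelude also defines length, so this file needs
  -- "import Prelude hiding (length)" (see the section on modules).
  length :: (Mylist a) -> Int
  length Empty = 0
  length (Listof l x) = 1 + length l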
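
A self-contained sketch of this definition follows.  It assumes
the Prelude's own "length" is hidden so that the name can be
reused, and it adds two hypothetical conversion functions,
"fromList" and "toList", to show how the right-appended
representation relates to builtin lists.

```haskell
import Prelude hiding (length)

-- Lists in which each new element is attached on the right.
data Mylist a = Empty | Listof (Mylist a) a
    deriving (Eq, Show)

-- The pattern needs parentheses around the constructor application.
length :: Mylist a -> Int
length Empty        = 0
length (Listof l _) = 1 + length l

-- Hypothetical helper: build the structure left to right,
-- so [1,3,2] becomes Listof (Listof (Listof Empty 1) 3) 2.
fromList :: [a] -> Mylist a
fromList = foldl Listof Empty

-- Hypothetical helper: the last element sits at the outermost level.
toList :: Mylist a -> [a]
toList Empty        = []
toList (Listof l x) = toList l ++ [x]
```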

To illustrate the role played by the type variables in the
definition of an inductive datatype, let us consider an example
of a polymorphic type that uses multiple type variables.  Suppose
we want to define lists that contain elements of types a and b,
such that values of types a and b alternate in the list,
beginning with a value of type a.  There is no restriction on the
last value --- if the list has an odd number of elements, the
last value is of type a, otherwise it is of type b.

Such a list will look like [x_1,y_1,x_2,y_2,....,x_m,y_m] where
each x_i is of type a and each y_i is of type b.  Notice that if
we strip off x_1, the remaining list is of the form
[y_1,x_2,y_2,....,x_m,y_m].  This is again a list in which values
of type a and b alternate, except that the first value is of type
b.  This observation leads us to the following definition.

   data Twolist a b = Empty | Listof a (Twolist b a)

Notice that within Listof, the inductive call to Twolist inverts
the order of the type variables.  Thus, after a value of type a,
we have a list that has alternate values starting with b.  The
next unfolding of the inductive definition would again invert the
types, so we have a list in which the first value is of type a,
and so on.
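
The sketch below builds a small Twolist and defines two functions
over it.  The names "example", "twoLength" and "split" are
hypothetical.  Note that each recursive call is made at the type
"Twolist b a" rather than "Twolist a b", so the type signatures
are required: Haskell cannot infer the type of a function that
calls itself at a different type.

```haskell
data Twolist a b = Empty | Listof a (Twolist b a)

-- The alternating list [1,'x',2,'y'].
example :: Twolist Int Char
example = Listof 1 (Listof 'x' (Listof 2 (Listof 'y' Empty)))

-- Count all elements, regardless of type.
twoLength :: Twolist a b -> Int
twoLength Empty           = 0
twoLength (Listof _ rest) = 1 + twoLength rest

-- Separate the values of type a from those of type b.
-- The tail has its type variables swapped, so its result
-- comes back with the two components in the opposite order.
split :: Twolist a b -> ([a], [b])
split Empty           = ([], [])
split (Listof x rest) = (x : xs, ys)
  where (ys, xs) = split rest
```

Here "twoLength example" is 4 and "split example" is
([1,2],"xy").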


======================================================================

Organizing functions in modules
-------------------------------

For small function definitions, it is acceptable to write all
definitions in a single file and include all dependent
definitions.  However, as programs grow in size, it is desirable
to break them up into separate units for the following reasons:

1. The functions defined in one unit may be useful in many
  contexts.  For instance, if we define quicksort and save it as
  a separate unit, we should be able to include it automatically
  in another set of functions without rewriting the definition
  of quicksort.

2. Keeping functions in separate units makes it easier to
  maintain the programs.  Finished portions are guaranteed not
  to be touched while editing definitions still under
  development, thus avoiding unintended modifications to
  definitions that are already correct and complete.

3. By separating out functions, the interdependence of functions
  on each other is more clearly specified.  In particular, we
  can identify exactly what "interface" each function provides
  to the rest of the world.  Provided we do not change this
  "interface", we can reimplement the actual function without
  changing the correctness of the overall code.  For instance,
  we might organize a unit containing a function "sort" to sort
  lists.  Initially, we may have implemented "sort" using
  insertion sort.  At a later date, if we replace the insertion
  sort implementation by a better algorithm, such as quicksort,
  the rest of the code is not affected.

The mechanism for collecting Haskell functions in a reusable unit
is to declare them as a module.  For simplicity, Haskell requires
that each module should be in a separate file and the name of the
module should be the same as that of the file containing it.

Thus, we can make a unit consisting of quicksort and mergesort as
follows:

  module Sortfunctions where

  quicksort :: (Ord a) => [a] -> [a]
  quicksort [] = []
  quicksort (x:xs) = (quicksort lower) ++ [splitter] ++ (quicksort upper)
   where
     splitter = x
     lower    = [ y | y <- xs, y <= x ]
     upper    = [ y | y <- xs, y > x ]

  mergesort :: (Ord a) => [a] -> [a]
  mergesort [] = []
  mergesort [x] = [x]
  mergesort l = merge (mergesort (front l)) (mergesort (back l))
   where
    front l = take ((length l) `div` 2) l
    back l = drop ((length l) `div` 2) l
    merge [] ys = ys
    merge xs [] = xs
    merge (x:xs) (y:ys) 
      | x < y     = x:(merge xs (y:ys)) 
      | otherwise = y:(merge (x:xs) ys)

These definitions should be stored in a file called
"Sortfunctions.hs", to match the module name.  Notice that other
than adding an initial line

  module Sortfunctions where

we have not changed the definitions of quicksort and mergesort in
any way.

We can now invoke this module in another file as follows:

  import Sortfunctions

  ...

After "import Sortfunctions", we can freely use quicksort and
mergesort.   The file that invokes "import Sortfunctions" need
not be a module --- it can be a simple Haskell file that has some
additional function definitions, which can freely use mergesort
and quicksort.  We have seen an example of invoking modules when
we used the functions "ord" and "chr" for the Char type, which
required importing the module Char (named Data.Char in current
versions of GHC).
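
As a small illustration of importing a library module, the sketch
below uses "ord" and "chr" under the assumption that the module
is available as Data.Char, which is its name in current versions
of GHC.  The function "shiftChar" is a hypothetical example.

```haskell
import Data.Char

-- Hypothetical helper: move a character n positions along
-- in the character code ordering.
shiftChar :: Int -> Char -> Char
shiftChar n c = chr (ord c + n)
```

For instance, "shiftChar 1 'a'" is 'b'.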

Sometimes, we may not want to import all functions from a
module.  For instance, suppose we want to use only quicksort from
Sortfunctions and write our own mergesort.  We can then say

  import Sortfunctions hiding (mergesort)

If we did not hide mergesort, we would have to use a different
name for the new implementation of mergesort because the same
name cannot be given two different definitions.

The builtin functions in Haskell (e.g. take, drop, max, etc) are
defined in the Standard Prelude, which is implemented as a module
called Prelude.  This module is imported implicitly in every
Haskell file.  However, it is possible to explicitly import
Prelude and hide some of the builtin functions in case one wants
to rewrite these functions.  For instance, if we wanted to write
different definitions for take and drop, we could initially write

  import Prelude hiding (take,drop)
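
A complete file along these lines might look as follows.  The
definitions of take and drop given here are one possible way of
writing them and are not claimed to match the Prelude's own
source.

```haskell
import Prelude hiding (take, drop)

-- Our own take: the first n elements (fewer if the list is short).
take :: Int -> [a] -> [a]
take n _ | n <= 0 = []
take _ []         = []
take n (x:xs)     = x : take (n - 1) xs

-- Our own drop: everything after the first n elements.
drop :: Int -> [a] -> [a]
drop n xs | n <= 0 = xs
drop _ []          = []
drop n (_:xs)      = drop (n - 1) xs
```

Within this file the names take and drop now refer to these
definitions, while all other Prelude functions remain available.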

Symmetrically, it may be desirable to restrict what is visible
outside a module.  Suppose we use an auxiliary function in a
module to define the main function.  We may not want this
auxiliary function to be visible outside.  If we want to restrict
the list of functions that a module exports, we write the list of
exported functions in the module header line, as follows.

  module Sortfunctions(quicksort,mergesort) where

This line specifies that among all the possible functions that
may be defined in the module Sortfunctions, only quicksort and
mergesort are actually visible to any file that imports this
module. 
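
To see how an export list hides an auxiliary function, consider
this hypothetical variant of the module, which exports a single
function "sort" implemented by insertion sort.  The helper
"insert" is not listed in the header and so is invisible to any
importing file.  The module contents here are an illustration
only and differ from the Sortfunctions module defined earlier.

```haskell
module Sortfunctions (sort) where

-- Exported: the only name an importing file can see.
sort :: Ord a => [a] -> [a]
sort = foldr insert []

-- Not exported: an auxiliary function used only inside the module.
-- Inserts x into an already sorted list, keeping it sorted.
insert :: Ord a => a -> [a] -> [a]
insert x [] = [x]
insert x (y:ys)
  | x <= y    = x : y : ys
  | otherwise = y : insert x ys
```

Because only "sort" is part of the interface, insertion sort could
later be replaced by quicksort or mergesort without affecting any
file that imports the module.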

======================================================================