Same (ADT) type, different meaning
July 08, 2018
There are instances where the semantics of distinct types overlap. Through ADTs and OOP, it is possible to represent this using different sets of types, while still being able to work with the set unions and intersections in a way that is type safe.
Introduction
ADTs (algebraic data types) is a feature that many functional programming languages have. An ADT is an abstract type made up of concrete types. The concrete types act as the constructors for the top-level type, meaning they are needed in order to create a value of the top-level type. Many of these same languages support a feature called pattern matching, which is a form of checking and matching the pattern of a given value.
Languages like Scala, Standard ML, and OCaml are able to check for the exhaustiveness of a pattern match, meaning they are able to check that the patterns in a pattern match expression account for all possible values of the input. For example, in Scala there exists an Option[T]
sum type which is made up of two types: None
and Some[T]
. When matching a value of type Option[T]
, the Scala compiler is able to check for the existence of patterns matching both of the possible values (and any additional matching ensuring T
in Some
is accounted for as well.)
Exhaustive checking acts as a safety net that ensures that all possible values are handled, and there are no runtime errors due to unhandled values.
Scala’s support for OOP makes for some really interesting uses of distinct ADTs with inheritance. A good example to show this off is a small interpreter with shared data structures for the lexing, parsing, and evaluation steps.
Let’s say we have a language that supports only arithmetic expressions:
expr ::= num | arith ;
arith ::= expr op expr ;
op ::= '+' | '-' | '*' ;
num ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;
It’s a simple grammar but it’s one that demonstrates the usefulness of ADTs plus OOP.
Data Structures
Finding yourself in a situation where you have to model this as tokens, an AST, or runtime data, you may opt for a solution like the following:
sealed trait Token
sealed trait Operator extends Token
case object Plus extends Operator
case object Minus extends Operator
case object Mult extends Operator
sealed trait Expr
case class Number(num: Double) extends Token with Expr
case class Arithmetic(lhs: Expr, op: Operator, rhs: Expr) extends Expr
Notice the top-level Token
type, with it’s two children Operator
and Number
, and Expr
with Arithmetic
and Number
as well. Operator
and Number
are special, since they are both separate types and of type Token
as well – this will be useful later on when we start treating certain tokens as valid expressions. Let’s see this hierarchy in a more visual form:
With the diagram it’s easier to see that when we are working with a Token
type, we have to know how to handle both Number
, a type constructor, and Operator
, a sum type.
Another way of thinking about this is using sets:
Token = {Number, Operator}
Expr = {Number, Arithmetic}
Operator = {Plus, Minus, Mult}
where Number = Expr ∩ Token
With this, we are able to define three functions which act as the full interpreter for our language:
tokenize
, which is defined asString => list of Token
,parse
, which is defined aslist of Token => Expr
, andeval
, which is defined asExpr => Expr
.
This leaves us with a flow of data that goes from a strings, to tokens, to finally expressions. Since Number
is both a Token
and an Expr
, we will see it in use in all three functions, all in a type safe way which ensures any pattern matching is exhaustive.
Implementation
We can bring this full circle by building our interpreter. Starting with tokenize
and a helper:
def tokenize(input: String): Iterator[Token] = {
val chars = input.toIterator.buffered
for (c <- chars.toIterator.buffered if !c.isWhitespace)
yield c match {
case '+' => Plus
case '-' => Minus
case '*' => Mult
case n if n.isDigit =>
Number((n + takeWhile[Char](chars, { _.isDigit }).mkString).toDouble)
}
}
def takeWhile[T](src: BufferedIterator[T], predicate: (T) => Boolean): List[T] =
if (src.isEmpty)
Nil
else if (!predicate(src.head))
Nil
else
src.next :: takeWhile(src, predicate)
Here is the first use of Number
, a valid Token
. To test things are working as expected:
scala> tokenize("123").toList
res32: List[Token] = List(Number(123))
scala> tokenize("1+2").toList
res33: List[Token] = List(Number(1), Operator(+), Number(2))
Now we can move on to parse
, which we previously defined as list of Token => Expr
. Below is an incomplete implementation since it doesn’t parse arithmetic expressions yet, but it is complete in the sense that it handles every possible input. A list of tokens could only be made up of Operator
’s and Number
’s, both of which are checked for in the code below.
def parse(tokens: List[Token]): Either[String, Expr] =
tokens match {
case Nil => Left("invalid: empty input")
case (_: Operator) :: _ => Left("invalid: cannot start expression with operator")
case (token : Number) :: _ => Right(token)
}
Note that since we’re working with lists we have to pattern match the list itself before we can get to the values it holds. You can read the matching above as follows: first matching Nil
, which represents an empty List
, then match a list with an Operator
as the first element and anything else (including an empty list) afterwards, finally do the same but for lists with a Number
as the first element in the list.
If we wanted to test out the exhaustive checks provided by the compiler, we could comment out any of those cases and the results would be a warning (or error) such as:
<pastie>:18: warning: match may not be exhaustive.
It would fail on the following input: List(Number(_))
tokens match {
^
For an implementation that is able to parse arithmetic expressions, we could do something like:
def parse(tokens: Iterator[Token]): Either[String, Expr] =
parse(tokens.toList)
def parse(tokens: List[Token]): Either[String, Expr] =
tokens match {
// Valid expressions
case (num : Number) :: Nil =>
Right(num)
case (lhs : Number) :: (op : Operator) :: (rhs : Number) :: Nil =>
Right(Arithmetic(lhs, op, rhs))
case (lhs1 : Number) :: (op1 : Operator) :: (rhs1 : Number) :: (op2 : Operator) :: t =>
val rhs2 = parse(t).fold(err => return Left(err), ok => ok)
Right(Arithmetic(Arithmetic(lhs1, op1, rhs1), op2, rhs2))
// Invalid expressions
case Nil => Left("syntax error: empty input")
case _ => Left("syntax error: expressions are binary expressions or single numbers")
}
We overload parse
to take an Iterator
, making it easier to work with the rest of the code.
There are more matches in this expression but they operate on the list of Token
s in the same way that the previous example does – all we are doing is peeking at the values at the start of the list and ignoring what ever values may come afterwards.
At this point we’re handling all possible inputs and outputs in the parsing phase. We can now follow similar patterns for implementing the eval
function, which converts expressions into simpler representations. For the first take, we implementing a version which only handles numbers and arithmetic expressions without nested expressions:
def eval(expr: Expr): Either[String, Expr] =
expr match {
case num : Number => Right(num)
case Arithmetic(Number(lhs), op, Number(rhs)) =>
op match {
case Plus => Right(Number(lhs + rhs))
case Minus => Right(Number(lhs - rhs))
case Mult => Right(Number(lhs * rhs))
}
}
Since the only constructors for Expr
are Number
and Arithmetic
, this is an exhaustive match of all possible inputs. Let’s try it out to make sure things work:
scala> parse(tokenize("40 + 2")).flatMap(eval)
res27: scala.util.Either[String,Expr] = Right(Number(42.0))
And now for the second take where we handle nested expressions:
def eval(expr: Expr): Either[String, Number] =
expr match {
case num : Number => Right(num)
case Arithmetic(lhsExpr, op, rhsExpr) =>
(eval(lhsExpr), eval(rhsExpr)) match {
case (Left(err), _) => Left(err)
case (_, Left(err)) => Left(err)
case (Right(Number(lhs)), Right(Number(rhs))) =>
op match {
case Plus => Right(Number(lhs + rhs))
case Minus => Right(Number(lhs - rhs))
case Mult => Right(Number(lhs * rhs))
}
}
}
The second implementation is needed since Arithmetic
is able to hold Expr
values on the left or right hand side. And since these values have yet to be evaluated, we do so before completing the evaluation of the first expression:
scala> parse(tokenize("10 * 4 + 2")).flatMap(eval)
res3: scala.util.Either[String,Number] = Right(Number(42.0))
Notice the continuous use of the Number
type constructor. Because Number
is both a Token
and an Expr
, we are able to use it throughout the whole process of evaluation, and all in a way where the compiler is there to help us.
Conclusion
Class hierarchies and ADTs are nothing new. We could represent the very same hierarchy in another language, and keep most of our signatures the same as well. The same could be said about ADTs. What’s distinguishing about this is the combination of the two, allowing us to able to represent the flow of data as data structures and do so in a way where the type system has the ability to ensure we’re handling all of the possible cases.
In addition, we are able to reuse types where it makes sense to do so. The Number
type is a valid return value for a tokenizer, a parser, and an evaluator, and we can convey this by making it both a Token
and an Expr
at the same time. If our language were bigger, perhaps the same could be said about other scalar types.
And like with anything else, reusability can be taken too far. Once the use or meaning of a type (or value) starts to change an increase in scope, it makes less sense to continue using the same type. For example, if our language had typing information that we needed to pass around, the data structures used in the parsing phase will not be enough during the type checking phase. And extending the types so that they could be used in the type checker would expand the scope too much, leaving you with the information that you need but completely stripping the constructors of their ergonomics.
With that in mind, there are many instances where types and their semantics overlap, and there is a need to represent the distinct sets, their union, and their intersections. When this is the case, OOP and ADTs are a great mix.