title | date |
---|---|
The Urn Pattern Matching Library | August 2, 2017 |
Efficient compilation of pattern matching is not exactly an open problem in computer science in the way that implementing, say, type systems might be, but there is still a lot of mysticism surrounding it.
In this post I hope to clear up some misconceptions regarding the implementation of pattern matching by demonstrating one such implementation. Do note that our pattern matching engine is strictly linear, in that pattern variables may only appear once in the match head. This is unlike other languages, such as Prolog, in which variables appearing more than once in the pattern are unified together.
Pattern matching always involves a pattern (the match head, as we call it) and a value to be compared against that pattern, the matchee. Sometimes, however, a pattern match will also include a body, to be evaluated in case the pattern does match.
(case 'some-value ; matchee
[some-pattern ; match head
(print! "some body")]) ; match body
As a side note, keep in mind that `case` has linear lookup of match
bodies. Though logarithmic or constant-time lookup might be possible, it
is left as an exercise for the reader.
To simplify the task of compiling patterns to an intermediate form that is free of them, we divide their compilation into two big steps: compiling the pattern's test and compiling the pattern's bindings. We do so inductively: there are a few elementary pattern forms on which the more complicated ones are built.
Most of these elementary forms are very simple, but two are the simplest: atomic forms and pattern variables. An atomic form is the pattern correspondent of a self-evaluating form in Lisp: a string, an integer, a symbol. We compare these for pointer equality. Pattern variables represent unknowns in the structure of the data, and a way to capture these unknowns.
+------------------+----------+-------------+
| Pattern          | Test     | Bindings    |
+:=================+:=========+:============+
| Atomic form      | Equality | Nothing     |
+------------------+----------+-------------+
| Pattern variable | Nothing  | The matchee |
+------------------+----------+-------------+
All compilation forms take as input the pattern to compile along with a symbol representing the matchee. Patterns which involve other patterns (for instance, lists, conses) will call the appropriate compilation forms with the symbol modified to refer to the appropriate component of the matchee.
Let's quickly have a look at compiling these elementary patterns before looking at the more interesting ones.
(defun atomic-pattern-test (pat sym)
`(= ,pat ,sym))
(defun atomic-pattern-bindings (pat sym)
'())
Atomic forms are the simplest to compile: we merely test that the
symbol's value is equal to the pattern and emit no bindings. The
comparison uses `=`, which compares identities, instead of `eq?`, which
checks for equivalence. More complicated checks, such as list equality,
need not be handled by the equality function, as we handle them in the
pattern matching library itself.
(defun variable-pattern-test (pat sym)
`true)
(defun variable-pattern-bindings (pat sym)
(list `(,pat ,sym)))
The converse is true for pattern variables, which have no test and bind
themselves. The returned bindings are in association list format, and
the top-level macro that users invoke will collect these and then bind
them with `let*`.
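To make the two elementary cases concrete, here is what the four functions above would return when called by hand. The matchee symbol `tmp` and the variable name `x` are placeholders picked for this illustration; in practice the symbol would come from `gensym`.

```lisp
;; Return values worked out by hand from the definitions above.
(atomic-pattern-test 1 'tmp)        ; => (= 1 tmp)
(atomic-pattern-bindings 1 'tmp)    ; => ()
(variable-pattern-test 'x 'tmp)     ; => true
(variable-pattern-bindings 'x 'tmp) ; => ((x tmp))
```

Note that both results are plain data: a test form to splice into a conditional, and a binding list to splice into `let*`.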
Composite forms are a bit more interesting: These include list patterns and cons patterns, for instance, and we'll look at implementing both. Let's start with list patterns.
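To determine if a list matches a pattern we need to test for several things:
1. The matchee is a list.
2. The matchee has the same number of elements as the pattern.
3. Every element of the matchee matches the corresponding pattern in the list.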
With the requirements down, here's the implementation.
(defun list-pattern-test (pat sym)
`(and (list? ,sym) ; 1
(= (n ,sym) ,(n pat)) ; 2
,@(map (lambda (index) ; 3
(pattern-test (nth pat index) `(nth ,sym ,index)))
(range :from 1 :to (n pat)))))
To test for the third requirement, we call a generic dispatch function (which is trivial, and thus has been inlined) to compile the $n$th pattern in the list against the $n$th element of the actual list.
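For readers who would like to see the dispatch spelled out, the following is one possible sketch, written for this post and not taken from the Urn standard library. It rests on a few loud assumptions: any bare symbol is a pattern variable, a list headed by the symbol `cons` is a cons pattern, any other list is a list pattern, and everything else is atomic. The helper `cons-pattern?` is hypothetical.

```lisp
;; Hypothetical dispatchers; a sketch under the assumptions stated above.
;; Under this simplification a literal symbol cannot be matched, since
;; every bare symbol is treated as a pattern variable.
(defun cons-pattern? (pat)
  ;; eq? is used so that the head symbol is compared by equivalence,
  ;; not identity.
  (and (list? pat) (eq? (car pat) 'cons)))

(defun pattern-test (pat sym)
  (cond
    [(symbol? pat) (variable-pattern-test pat sym)]
    [(cons-pattern? pat) (cons-pattern-test pat sym)]
    [(list? pat) (list-pattern-test pat sym)]
    [true (atomic-pattern-test pat sym)]))

(defun pattern-bindings (pat sym)
  (cond
    [(symbol? pat) (variable-pattern-bindings pat sym)]
    [(cons-pattern? pat) (cons-pattern-bindings pat sym)]
    [(list? pat) (list-pattern-bindings pat sym)]
    [true (atomic-pattern-bindings pat sym)]))
```

The cons check has to come before the generic list check, because a cons pattern is itself written as a list.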
List pattern bindings are similarly easy:
(defun list-pattern-bindings (pat sym)
(flat-map (lambda (index)
(pattern-bindings (nth pat index) `(nth ,sym ,index)))
(range :from 1 :to (n pat))))
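As a worked example, consider the two-element pattern `(1 x)` compiled against a matchee symbol `tmp`. Both names are placeholders, and the sketch assumes the dispatcher treats `1` as an atomic form and `x` as a pattern variable.

```lisp
;; (list-pattern-test '(1 x) 'tmp) would produce:
(and (list? tmp)
     (= (n tmp) 2)
     (= 1 (nth tmp 1))  ; atomic pattern 1 against the first element
     true)              ; the variable x always matches

;; (list-pattern-bindings '(1 x) 'tmp) would produce:
((x (nth tmp 2)))
```

The atomic pattern contributes a test but no bindings, and the variable contributes a binding but only a trivial test, mirroring the table of elementary forms.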
Compiling cons patterns is similarly easy if your Lisp is proper: We
only need to check for `cons`-ness (or `list`-ness, less generally),
then match the given patterns against the car and the cdr.
(defun cons-pattern-test (pat sym)
`(and (list? ,sym)
,(pattern-test (cadr pat) `(car ,sym))
,(pattern-test (caddr pat) `(cdr ,sym))))
(defun cons-pattern-bindings (pat sym)
(append (pattern-bindings (cadr pat) `(car ,sym))
(pattern-bindings (caddr pat) `(cdr ,sym))))
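Here is the same kind of hand expansion for a cons pattern. It assumes the surface syntax `(cons x xs)`, which is what the `cadr` and `caddr` accesses above imply, and that `x` and `xs` are dispatched as pattern variables; `tmp` is again a placeholder for the matchee symbol.

```lisp
;; (cons-pattern-test '(cons x xs) 'tmp) would produce:
(and (list? tmp) true true)

;; (cons-pattern-bindings '(cons x xs) 'tmp) would produce:
((x (car tmp)) (xs (cdr tmp)))
```

Since both sub-patterns are variables, the generated test only checks `list?`, so this toy version would also accept an empty list.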
Note that, in Urn, `cons` patterns have the more general form
`(pats* . pat)` (with the asterisk carrying its usual meaning of zero or
more repetitions), and can match any number of elements in the head.
Matching them is also less efficient than expected, because `cdr` copies
the list's tail. (Our lists are not linked - rather, they are
implemented over Lua arrays, and as such, removing the first element is
rather inefficient.)
Now that we can compile a wide assortment of patterns, we need a way to
actually use them to scrutinize data. For this, we implement two forms:
an improved version of `destructuring-bind` and `case`.
Implementing `destructuring-bind` is simple: We only have a single
pattern to test against, and thus no search is necessary. We simply
generate the pattern test and the appropriate bindings, and generate an
error if the pattern does not match. Generating a friendly error message
is similarly left as an exercise for the reader.
Note that, as a well-behaved macro, `destructuring-bind` will not evaluate the given variable more than once. It does this by binding it to a temporary name and scrutinizing that name instead.
(defmacro destructuring-bind (pat var &body)
(let* [(variable (gensym 'var))
(test (pattern-test pat variable))
(bindings (pattern-bindings pat variable))]
`(with (,variable ,var)
(if ,test
(let* ,bindings ,@body) ; bind the pattern variables around the body
(error! "pattern matching failure")))))
Implementing `case` is a bit more difficult in a language without
`cond`, since the linear structure of a pattern-matching case statement
would have to be transformed into a tree of `if`-`else` combinations.
Fortunately, this is not our case (pun intended, definitely.)
(defmacro case (var &cases)
(let* [(variable (gensym 'variable))]
`(with (,variable ,var)
(cond ,@(map (lambda (c)
`(,(pattern-test (car c) variable)
(let* ,(pattern-bindings (car c) variable)
,@(cdr c))))
cases)))))
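To see how the pieces fit together, here is an approximate hand expansion of a two-clause `case`. The matchee expression `(f)` is a hypothetical call, `tmp` stands in for the symbol that `gensym` would generate, and the sketch assumes `1` is dispatched as an atomic form while `x` and `y` are pattern variables.

```lisp
;; Source form:
(case (f)
  [(1 x) (print! x)]
  [y (print! y)])

;; Approximate expansion under the toy macro above:
(with (tmp (f))
  (cond
    ((and (list? tmp) (= (n tmp) 2) (= 1 (nth tmp 1)) true)
     (let* ((x (nth tmp 2)))
       (print! x)))
    (true
     (let* ((y tmp))
       (print! y)))))
```

Each clause becomes one `cond` branch: the compiled test guards the branch, and the compiled bindings wrap the clause body in a `let*`.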
Again, we prevent reevaluation of the matchee by binding it to a temporary symbol. This is especially important in an impure, expression-oriented language as evaluating the matchee might have side effects! Consider the following contrived example:
(case (progn (print! "foo")
123)
[1 (print! "it is one")]
[2 (print! "it is two")]
[_ (print! "it is neither")]) ; _ represents a wild card pattern.
If the matchee wasn't bound to a temporary value, `"foo"` would be
printed thrice in this example. Both the toy implementation presented
here and the implementation in the Urn standard library will only
evaluate matchees once, thus preventing effect duplication.
Unlike previous blog posts, this one isn't runnable Urn. If you're interested, I recommend checking out the actual implementation. It gets a bit hairy at times, particularly with handling of structure patterns (which match Lua tables), but it's similar enough to the above that this post should serve as a vague map of how to read it.
In a bit of a meta-statement I want to point out that this is the first (second, technically!) of a series of posts detailing the interesting internals of the Urn standard library: It fixes two things in the sorely lacking category: content in this blag, and standard library documentation.
Hopefully this series is as nice to read as it is for me to write, and here's hoping I don't forget about this blag for a year again.