Parsing Layout, or: Haskell's Syntax is a Mess

September 3rd, 2021

Hello! Today we're going to talk about something I'm actually good at, for a change: writing compilers. Specifically, I'm going to demonstrate how to wrangle Alex and Happy to implement a parser for a simple language with the same indentation sensitive parsing behaviour as Haskell, the layout rule.

Alex and Happy are incredibly important parts of the Haskell ecosystem. If you're a Haskeller, you use a program using an Alex lexer and a Happy parser every single day - every single working day, at least - GHC! Despite this fundamental importance, Alex and Happy are... sparsely documented, to say the least. Hopefully this post can serve as an example of how to do something non-trivial using them.

However! While I'm going to talk about Alex and Happy here, it would be entirely possible to write a layout parser using Alex and whatever flavour of Parsec is popular this week, as long as your combinators are expressed on top of a monad transformer. It's also entirely possible to write a layout parser without Alex at all, but that's beyond my abilities. I am a mere mortal, after all.

Get ready to read the word "layout" a lot. Layout layout layout. How's your semantic satiation going? Should I say layout a couple more times?

The Offside Rule

So, how does Haskell layout work? A small subset of tokens (where, of, let, do¹), called layout keywords, are followed by a laid out block (my terminology). The happiest (hah) case is where one of these keywords is followed by a { token. In this case, layout parsing doesn't happen at all!

main = do { putStrLn
    "foo"
  ; putStrLn "bar"
        ; putStrLn "quux" }

This abomination is perfectly valid Haskell code, since layout is disabled in a context that was started with a {. Great success though, since this is a very simple thing to support in a parser. The unhappy case is when we actually have to do layout parsing. In that case, the starting column of the token immediately following the layout token becomes the reference column (again my terminology), we emit a (virtual) opening brace, and the offside rule applies.

The offside rule says that a player must have at least two opposing players, counting the goalkeep- No no, that's not right. Give me a second. Ah! Yes. The offside rule governs automatic insertion of (virtual) semicolons and closing braces. When we encounter the first token of a new line, we are burdened to compare its starting column with the reference:

  • If it's on the same column as the reference column, we emit a semicolon. This is a new statement/declaration/case.

    do foo
       bar
    -- ^ same column, insert ; before.
    
    do
       foo
       bar
    -- ^ same column, insert ; before.
    -- yes, three spaces
    

    The two token streams above have the same prefix as do { foo; bar }.

  • If it's further indented than the reference column, we... do nothing! Just go back to normal lexing. Tokens indented to the right of the reference column are interpreted as continuing the statement in the previous line. That's why you can do this:

    do
      putStrLn $
        wavy
          function
            application
          please
        don't
          though
    

    All of those tokens are (in addition to being the first token in a line) indented further than putStrLn, which is our reference column. This block has no semicolons at all!

  • If it's less indented than the reference column, we emit a virtual closing } (to end the block) and apply the rule again. This last bit is crucial: it says a single token can end all of the layout contexts it's leaving. For instance:

    foo = do a -- context 1
             do b -- context 2
                do c -- context 3
                   do d -- context 4
                      e
    bar = 123
    

    Assuming there was a layout context at the first column, i.e., we're in a module, then the token bar will be responsible for closing 4 whole layout contexts:

    • It's to the left of d, so it closes context 4;
    • It's to the left of c, so it closes context 3;
    • It's to the left of b, so it closes context 2;
    • It's to the left of a, so it closes context 1.

    With all the semicolons we have a right to, the code above is this:

    ; foo = do { a -- context 1
               ; do { b -- context 2
                    ; do { c -- context 3
                         ; do { d -- context 4
                              ; e
                              }
                         }
                    }
               }
    ; bar = 123
    

    Why do we have semicolons before foo and bar? Why, because they're in the same column as the reference token, which was presumably an import or something.

Laid-out blocks

With that, the parser productions for laid out blocks should be clear - or, at least, easily approximable. Right?
Wrong.

You might think the production for do blocks is something like the following, and you'd be forgiven for doing so. It's clean, it's reasonable, it's not actually Happy syntax, but it's a close enough approximation. Except that it's way incorrect!

expr
  : ...
  | 'do' '{' statement ';' ... '}' { ... }
  | 'do' VOpen statement VSemi ... VClose { ... }

Well, for do you might be able to get away with that. But consider this laid-out code, and what the lexer naïvely produces for us just below it.

foo = let x = 1 in x
; foo = let { x = 1 in x

You see it, right? Since no token was on a column before that of the token x (the reference token for the layout context started by let), no close brace was emitted before in. Woe is us! However, the Haskell report has a way around this. They write it cryptically, like this:

...
L (t : ts) (m : ms) = }  :  (L (t : ts) ms) if m ≠ 0 and parse-error(t) 
...

The side condition parse-error(t) is to be interpreted as follows: if the tokens generated so far by L together with the next token t represent an invalid prefix of the Haskell grammar, and the tokens generated so far by L followed by the token } represent a valid prefix of the Haskell grammar, then parse-error(t) is true.

The test m ≠ 0 checks that an implicitly-added closing brace would match an implicit open brace.

I'll translate, since I'm fluent in standardese: Parse errors are allowed to terminate layout blocks, as long as no explicit { was given. This is the entire reason that Happy has an error token, which "matches parse errors"! For further reference, L is a function [Token] -> [Int] -> [Token] which is responsible for inserting virtual {, ; and } tokens. The [Int] argument is the stack of reference columns.
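To make the offside part of that function concrete, here's a minimal pure sketch of it. The names (Marked, offside) and the column-marker representation are mine, and I've left out both the context-opening rules and the parse-error(t) clause - that clause is exactly what a pure token-stream function can't express, and it's why we'll end up interleaving the lexer and parser later:

-- Newline n plays the role of the report's <n> marker: the first token
-- of the next line starts at column n. A toy model, not our real lexer.
data Marked = Newline Int | Tok Token

offside :: [Marked] -> [Int] -> [Token]
offside (Newline n : ts) (m : ms)
  | n == m    = TkVSemi  : offside ts (m : ms)          -- same column: ';'
  | n <  m    = TkVClose : offside (Newline n : ts) ms  -- offside: '}', then retry
  | otherwise = offside ts (m : ms)                     -- further right: continue
offside (Newline _ : ts) ms = offside ts ms             -- no enclosing layout context
offside (Tok t : ts) ms     = t : offside ts ms         -- ordinary tokens pass through
offside [] (_ : ms)         = TkVClose : offside [] ms  -- EOF closes pending blocks
offside [] []               = []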

So a better approximation of the grammar is:

expr
  : ...
  | 'do' '{' statement ';' ... '}' { ... }
  | 'do' VOpen statement VSemi ... LClose { ... }

LClose
  : VClose {- lexer inserted '}' -}
  | error  {- parse error generated '}' -}

We have unfortunately introduced some dragons, since the parser now needs to finesse the lexer state, meaning they must be interleaved explicitly, instead of being run in sequence (using a lazy list of tokens or similar). They must be in the same Monad.

So. How do we implement this?

How we implement this

Preliminaries

To start with, we create a new Haskell project. I'd normally gloss over this, but in this case, there are adjustments to the Cabal file that must be made to inform our build of the dependencies on alex and happy. I use Stack; you can use whatever.

% stack new layout simple

To our Cabal file, we add a build-tool-depends on Alex and Happy. Cabal (the build system) comes with built-in rules to detect .x and .y files and compile these as Alex and Happy respectively.

  build-tool-depends:  alex:alex   >= 3.2.4 && < 4.0
                    ,  happy:happy >= 1.19.12 && < 2.0
  build-depends:       base >= 4.7 && < 5
                     , array >= 0.5 && < 0.6

This has been the recommended way of depending on build tools since Cabal 2. The syntax of build-tool-depends entries is package:executable [version bound], where the version bound is optional but good style. With this, running stack build (and/or cabal build) will automatically compile parser and lexer specifications listed in your other-modules field to Haskell files.
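Concretely, once all of the modules from this post exist, the executable stanza will list them like so (the generated Lexer and Parser included):

  other-modules:       Syntax
                     , Lexer
                     , Lexer.Support
                     , Parser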

Alex-generated code has a dependency on the array package.

What are we parsing

For the language we're parsing, I've chosen to go with a representative subset of Haskell's grammar: Variables, lambda expressions, let expressions, and application. For the top-level, we'll support function definitions, where the lhs must be a sequence of variables, and the rhs can optionally have a where clause.

module Syntax (Expr(..), Decl(..), Program) where

data Expr
  = Var String
  | App Expr Expr
  | Lam String Expr
  | Let [Decl] Expr
  deriving (Eq, Show)

data Decl
  = Decl { declName  :: String
         , declRhs   :: Expr
         , declWhere :: Maybe [Decl]
         }
  deriving (Eq, Show)

type Program = [Decl]

For simplicity, identifiers will be ASCII only. We're also using strings and lists everywhere, instead of more appropriate data structures (Text and Seq), for clarity. Don't forget to add the Syntax module to the other-modules field in layout.cabal.

The Lexer

Before we can parse, we must lex. But before we can lex, we must know the type of tokens. We create a separate Haskell module to contain the definition of the token type and Lexer monad. This is mostly done because HIE does not support Alex and Happy, and I've become dependent on HIE for writing correct code fast.

We'll call this new module Lexer.Support, just because. Our type of tokens must contain our keywords, but also punctuation (=, {, ;, }, \\, ->) and virtual punctuation (tokens inserted by layout). We declare:

module Lexer.Support where

data Token
  = TkIdent String -- identifiers

  -- Keywords
  | TkLet | TkIn | TkWhere

  -- Punctuation
  | TkEqual | TkOpen | TkSemi | TkClose
  | TkLParen | TkRParen
  | TkBackslash | TkArrow

  -- Layout punctuation
  | TkVOpen | TkVSemi | TkVClose

  -- End of file
  | TkEOF
  deriving (Eq, Show)

An Alex file

Alex modules always start with a Haskell header, between braces. In general, braces in Alex code represent a bit of Haskell we're inserting: The header, lexer actions, and the footer.

{
module Lexer where

import Lexer.Support
}

%encoding "latin1"

After the header, we can also include magical incantations: %wrapper will tell Alex to include a support code template with our lexer, and %encoding will tell it whether to work with bytes or with Unicode. Nobody uses the Unicode support, not even GHC: The community wisdom is to trick Alex into reading Unicode by compressing Unicode classes down into high byte characters. Yeah, yikes.

Our file can then have some macro definitions. Macros with the $ sigil are character classes, and @ macros are complete regular expressions.

$lower = [ a-z ]
$upper = [ A-Z ]

@ident = $lower [ $lower $upper _ ' ]*

And, finally, comes the actual lexer specification. We include the final magic word :- on a line by itself, and then list a bunch of lexing rules. Lexing rules are specified by:

  • A startcode, which names a state. These are written <ident> or <0>, where <0> is taken to be the "default" startcode. Rules are by default enabled in all states, and can be enabled in many;

  • A left context, which is a regular expression matched against the character immediately preceding the token;

  • A regular expression, describing the actual token;

  • A right context, which can be a regular expression to be matched after the token or a fragment of Haskell code, called a predicate. If the predicate is present, it must have the following type:

{ ... } :: user       -- predicate state
        -> AlexInput  -- input stream before the token
        -> Int        -- length of the token
        -> AlexInput  -- input stream after the token
        -> Bool       -- True <=> accept the token
  • An action, which can be ;, causing the lexer to skip the token, or some Haskell code, which can be any expression, as long as every action has the same type.

Here's a couple rules so we can get started. Don't worry - emit is a secret tool that will help us later.

:-
    [\ \t]+ ;

<0> @ident { emit TkIdent }
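For completeness, here's what a rule with every slot filled in could look like - a made-up example we won't add to the file: it fires in the 0 startcode, only at the beginning of a line (the ^ left context), and only if the identifier is followed by whitespace (the / right context; $white is a character set Alex predefines):

<0> ^ @ident / $white { emit TkIdent }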

Alright, let's compile this code and see what we get! Oh, we get some type errors. Okay. Let's see what's up:

    Not in scope: type constructor or class ‘AlexInput’
    |
264 |   | AlexLastSkip     !AlexInput !Int
    |                       ^^^^^^^^^

Making our own wrapper

Right. That's probably related to that %wrapper thing I told you about. You'd be correct: The wrappers solve this problem by including a handful of common patterns pre-made, but we can very well supply our own! The interface to an Alex-generated lexer is documented here, but we're interested in §5.1 specifically. We have to provide the following definitions:

type AlexInput
alexGetByte       :: AlexInput -> Maybe (Word8, AlexInput)
alexInputPrevChar :: AlexInput -> Char

And we get in return a lexing function, whose type and interface I'm not going to copy-paste here. The alexGetByte function is called by the lexer whenever it wants input, so that's the natural place to do position handling, which, yes, we have to do ourselves. Let's fill in these definitions in the Lexer.Support module.

Here's an okay choice for AlexInput:

data AlexInput
  = Input { inpLine   :: {-# UNPACK #-} !Int
          , inpColumn :: {-# UNPACK #-} !Int
          , inpLast   :: {-# UNPACK #-} !Char
          , inpStream :: String
          }
  deriving (Eq, Show)

We can immediately take alexInputPrevChar = inpLast as the definition of that function and be done with it, which is fantastic. alexGetByte, on the other hand, is a bit more involved, since it needs to update the position based on what character was read. The column must be set properly, otherwise layout won't work! The line counter is less important, though.

alexGetByte :: AlexInput -> Maybe (Word8, AlexInput)
alexGetByte inp@Input{inpStream = str} = advance <$> uncons str where
  advance ('\n', rest) =
    ( fromIntegral (ord '\n')
    , Input { inpLine   = inpLine inp + 1
            , inpColumn = 1
            , inpLast   = '\n'
            , inpStream = rest }
    )

  advance (c, rest) =
    ( fromIntegral (ord c)
    , Input { inpLine = inpLine inp
            , inpColumn = inpColumn inp + 1
            , inpLast = c
            , inpStream = rest }
    )
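By the way, if you're following along at home, here's what the top of Lexer.Support ends up looking like - the extension is for the newtype deriving coming up next, and the imports are the ones the snippets in this post use:

{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module Lexer.Support where

import Control.Monad.State   -- StateT, MonadState, gets, modify'
import Control.Monad.Except  -- MonadError, throwError

import Data.Char (ord)
import Data.List (uncons)
import Data.List.NonEmpty (NonEmpty(..))
import qualified Data.List.NonEmpty as NE
import Data.Word (Word8)

import Debug.Trace (traceM)  -- for the lexAll loop, later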

Now, our lexer has a lot of state. We have the start codes, which form a stack. We have the stack of reference columns, and we have the input. Let's use a State monad to keep track of this, with an Either String base to keep track of errors.

newtype Lexer a = Lexer { _getLexer :: StateT LexerState (Either String) a }
  deriving
    ( Functor
    , Applicative
    , Monad
    , MonadState LexerState
    , MonadError String
    )

data Layout = ExplicitLayout | LayoutColumn Int
  deriving (Eq, Show, Ord)

data LexerState
  = LS { lexerInput      :: {-# UNPACK #-} !AlexInput
       , lexerStartCodes :: {-# UNPACK #-} !(NonEmpty Int)
       , lexerLayout     :: [Layout]
       }
  deriving (Eq, Show)

initState :: String -> LexerState
initState str = LS { lexerInput      = Input 0 1 '\n' str
                   , lexerStartCodes = 0 :| []
                   , lexerLayout     = []
                   }


runLexer :: Lexer a -> String -> Either String a
runLexer act s = fst <$> runStateT (_getLexer act) (initState s)

I'll spare you the boring stack manipulation stuff by putting it in one of these <details> elements you can expand:
startCode :: Lexer Int
startCode = gets (NE.head . lexerStartCodes)

pushStartCode :: Int -> Lexer ()
pushStartCode i = modify' $ \st ->
  st { lexerStartCodes = NE.cons i (lexerStartCodes st)
     }

-- If there is no start code to go back to, we go back to the 0 start code.
popStartCode :: Lexer ()
popStartCode = modify' $ \st ->
  st { lexerStartCodes =
         case lexerStartCodes st of
           _ :| [] -> 0 :| []
           _ :| (x:xs) -> x :| xs
     }

layout :: Lexer (Maybe Layout)
layout = gets (fmap fst . uncons . lexerLayout)

pushLayout :: Layout -> Lexer ()
pushLayout i = modify' $ \st ->
  st { lexerLayout = i:lexerLayout st }

popLayout :: Lexer ()
popLayout = modify' $ \st ->
  st { lexerLayout =
         case lexerLayout st of
           _:xs -> xs
           [] -> []
     }

Putting it all together

It's up to us to specify what an action is - remember, the action is the code block following a lexer rule - so we'll go with String -> Lexer Token. The String argument is the lexed token, and we'll have to take this slice ourselves when we implement the interface between the Alex lexer and our Lexer monad. The emit action is simple, and we'll throw in token for no extra cost:

emit :: (String -> Token) -> String -> Lexer Token
emit = (pure .)

token :: Token -> String -> Lexer Token
token = const . pure
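If the point-free definitions read strangely, they're equivalent to:

emit :: (String -> Token) -> String -> Lexer Token
emit mk str = pure (mk str)

token :: Token -> String -> Lexer Token
token tk _ = pure tk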

Back to our Lexer.x, we have to write the function to interpret Alex lexer results as Lexer monad actions. It goes like this:

{
handleEOF = do
  -- TODO: handle layout
  pure TkEOF

scan :: Lexer Token
scan = do
  input@(Input _ _ _ string) <- gets lexerInput
  startcode <- startCode
  case alexScan input startcode of
    AlexEOF -> handleEOF
    AlexError (Input _ _ _ inp) ->
      throwError $ "Lexical error: " ++ show (head inp)
    AlexSkip input' _ -> do
      modify' $ \s -> s { lexerInput = input' }
      scan
    AlexToken input' tokl action -> do
      modify' $ \s -> s { lexerInput = input' }
      action (take tokl string)
}

Now we can do a stack build to compile the lexer and stack repl to play around with it!

λ runLexer scan "abc"
Right (TkIdent "abc")
λ runLexer scan "              abc"
Right (TkIdent "abc")
λ runLexer scan " {"
Left "Lexical error: '{'"

Okay, yeah, let's fill out our lexer a bit more.

<0> in     { token TkIn }
<0> \\     { token TkBackslash }
<0> "->"   { token TkArrow }
<0> \=     { token TkEqual }
<0> \(     { token TkLParen }
<0> \)     { token TkRParen }
<0> \{     { token TkOpen }
<0> \}     { token TkClose }

That's all of the easy rules we can do - all of the others interact with the layout state, which we'll see how to do in the paragraph immediately following this one. I'm writing a bit of padding here so you can take a breather and prepare yourself for the lexer states that we'll deal with now. But, please believe me when I say we're doing this lexer madness so our parser can be sane.

Actually Doing Layout (trademark pending)

We'll need two rules for the layout keywords. Alex rules are matched in order, top-to-bottom, so make sure your keywords are before your identifier rule.

<0> let     { layoutKw TkLet }
<0> where   { layoutKw TkWhere }

And the action for layout keywords, which has to go in the lexer since it'll refer to a startcode. Alex automatically generates definitions for all the startcodes we mention.

layoutKw t _ = do
  pushStartCode layout
  pure t

The interesting rules for handling layout are in the layout startcode, which we'll declare as a block to keep things a bit tidier. When in this startcode, we need to handle either an explicitly laid-out block (that is, {), or the start of a layout context: The indentation of the next token determines where we start.

<layout> {
  -- Skip comments and whitespace
  "--" .* \n ;
  \n       ;

  \{ { openBrace }
  () { startLayout }
}

The openBrace and startLayout lexer actions are also simple:

openBrace _ = do
  popStartCode
  pushLayout ExplicitLayout
  pure TkOpen

startLayout _ = do
  popStartCode

  reference <- Lexer.Support.layout
  col       <- gets (inpColumn . lexerInput)

  if Just (LayoutColumn col) <= reference
    then pushStartCode empty_layout
    else pushLayout (LayoutColumn col)
    
  pure TkVOpen

Here's another rule. Suppose we have:

   foo = bar where
   spam = ham

If we just apply the rule that the next token after a layout keyword determines the column for the layout context, then we're starting another layout context at column 1! That's definitely not what we want.

The fix: A new layout context only starts if the first token is to the right of the previous layout context. That is: a block only starts if its first token is indented further than the enclosing block's reference column; otherwise, the block is empty.

But! We still need to emit a closing } to match the virtual { that startLayout emitted! This is the sole function of the empty_layout startcode:

<empty_layout> () { emptyLayout }

emptyLayout _ = do
  popStartCode
  pushStartCode newline
  pure TkVClose

We're on the home stretch. I mentioned another startcode - newline. It's where we do the offside rule, and our lexer will finally be complete.

The Offside Rule, again

The newline state is entered in two places: After an empty layout block (as a short-circuit), and after, well, a new line character. Comments also count as newline characters, by the way.

<0> "--" .* \n { \_ -> pushStartCode newline *> scan }
<0> \n         { \_ -> pushStartCode newline *> scan }

In the newline state, we again scan for a token, and call for an action, just like for layout. The difference is only in the action: Whenever any token is encountered, we perform the offside rule, if we're in a layout context that mandates it.

<newline> {
  \n         ;
  "--" .* \n ;
  
  () { offsideRule }
}

The code for the offside rule is a bit hairy, but follows from the spec:

offsideRule _ = do
  context <- Lexer.Support.layout
  col <- gets (inpColumn . lexerInput)

  let continue = popStartCode *> scan

  case context of
    Just (LayoutColumn col') -> do
      case col `compare` col' of
        EQ -> do
          popStartCode
          pure TkVSemi
        GT -> continue
        LT -> do
          popLayout
          pure TkVClose
    _ -> continue

Check out how cleanly those three cases map to the rules I described way back when. We compare the current column with the reference, and:

  • If it's EQ, add a semicolon.
  • If it's GT, continue lexing.
  • If it's LT, close as many layout contexts as possible.

**Exercise**: In the `handleEOF` action, close all the pending layout contexts. As a hint, the easiest way to emit a token that doesn't consume any input is using a startcode and a lexer action. Figuring out when we've run out is part of the challenge :)

The rule:

<eof> () { doEOF }

The action:

handleEOF = pushStartCode eof *> scan

doEOF _ = do
  t <- Lexer.Support.layout
  case t of
    Nothing -> do
      popStartCode
      pure TkEOF
    _ -> do
      popLayout
      pure TkVClose

We can write a Lexer action (not a lexer action!) to lex and Debug.Trace.trace - sue me - as many tokens as the lexer wants to give us, until an EOF is reached:

lexAll :: Lexer ()
lexAll = do
  tok <- scan
  case tok of
    TkEOF -> pure ()
    x -> do
      traceM (show x)
      lexAll
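If you'd rather not poke at it from the REPL, a throwaway driver like this one (mine, not part of the post's repo) works too:

main :: IO ()
main = do
  contents <- getContents
  -- the token trace comes out via traceM; a lexical error comes out as Left
  either putStrLn pure (runLexer lexAll contents)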

Now we can actually lex some Haskell code! Well, not much of it. Forget numbers, strings, and most keywords, but we can lex this:

foo = let
        x = let
          y = z
          in y
      in x

And here's the trace we get:

TkIdent "foo"
TkEqual
TkLet
TkVOpen
TkIdent "x"
TkEqual
TkLet
TkVOpen
TkIdent "y"
TkEqual
TkIdent "z"
TkVSemi
TkIn
TkIdent "y"
TkVClose
TkVClose
TkIn
TkIdent "x"

That is, that code is lexed as if it had been written:

foo = let {
        x = let {
          y = z
          ; in y
      }} in x

That's... Yeah. Hmm. That's not right. What are we forgetting? Ah, who am I kidding, you've guessed this bit. I even said it myself!

Parse errors are allowed to terminate layout blocks.

We don't have a parser to get errors from, so our layout blocks are terminating too late. Let's write a parser!

The Parser

Happy is, fortunately, less picky about how to generate code. Instead of appealing to some magic symbols that it just hopes really hard are in scope, Happy asks us how we want it to interface with the lexer. We'll do it ✨ Monadically ✨, of course.

Happy files start the same way as Alex files: A Haskell code block, between braces, and some magic words. You can look up what the magic words do in the documentation, or you can guess - I'm just gonna include all the header here:

{
module Parser where

import Control.Monad.Except
import Lexer.Support
}

%name parseExpr Expr

%tokentype { Token }
%monad { Lexer }
%lexer { lexer } { TkEOF }

%errorhandlertype explist
%error { parseError }

After these magic incantations (by the way, if you can't find the docs for errorhandlertype, that's because the docs you're looking at are out of date. See here), we list our tokens in the %token directive. In the braces we write Haskell - not an expression, but a pattern.

%token
  VAR     { TkIdent $$ }
  'let'   { TkLet }
  'in'    { TkIn }
  'where' { TkWhere }

  '='     { TkEqual }
  '{'     { TkOpen }
  ';'     { TkSemi }
  '}'     { TkClose }
  '\\'    { TkBackslash }
  '->'    { TkArrow }
  '('     { TkLParen }
  ')'     { TkRParen }

  OPEN    { TkVOpen }
  SEMI    { TkVSemi }
  CLOSE   { TkVClose }

%%

The special $$ pattern says that if we use a VAR token in a production, its value should be the string contained in the token, rather than the token itself. We write productions after the %%, and they have this general syntax:

Production :: { Type }
  : rule1 { code1 }
  | rule2 { code2 }
  | ...

For starters, we have these productions. You can see that in the code associated with a rule, we can refer to the tokens parsed using $1, $2, $3, ....

Atom :: { Expr }
  : VAR          { Var $1 }
  | '(' Expr ')' { $2 }

Expr :: { Expr }
  : '\\' VAR '->' Expr { Lam $2 $4 }
  | FuncExpr           { $1 }

-- FuncExpr is left-recursive: that's the natural (and efficient) way to
-- write iteration in an LR parser generator like Happy.
FuncExpr :: { Expr }
  : FuncExpr Atom { App $1 $2 }
  | Atom          { $1 }

In the epilogue, we need to define two functions, since I mentioned them way up there in the directives. The lexer function is a continuation-passing style function that needs to call cont with the next token from the lexer. The parseError function is how we should deal with parser errors.

{
lexer :: (Token -> Lexer a) -> Lexer a
lexer cont = scan >>= cont

-- with %errorhandlertype explist, the handler receives the unexpected
-- token paired with the list of token names Happy expected to see
parseError :: (Token, [String]) -> Lexer a
parseError = throwError . show
}

By using the %name directive we can export a parser production as an action in the Lexer monad (since that's what we told Happy to use). Combining that with our runLexer, we can parse some expressions, yay!

λ runLexer parseExpr "(\\x -> x) (\\y -> y)"
Right (App (Lam "x" (Var "x")) (Lam "y" (Var "y")))

Laid-out productions

Now we'll introduce some productions for parsing laid-out lists of declarations, then we'll circle back and finish with the parser for declarations itself.

DeclBlock :: { [Decl] }
  : '{' DeclListSemi '}'    { $2 }
  | OPEN DeclListSEMI Close { $2 }

DeclListSemi :: { [Decl] }
  : Decl ';' DeclListSemi { $1:$3 }
  | Decl                  { [$1] }
  | {- empty -}           { [] }

DeclListSEMI :: { [Decl] }
  : Decl SEMI DeclListSEMI { $1:$3 }
  | Decl                   { [$1] }
  | {- empty -}            { [] }

That is, a block of declarations is either surrounded by { ... } or by OPEN ... Close. But what's Close? That's right, you've guessed this bit too:

Close
  : CLOSE { () }
  | error {% popLayout }
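(The {% ... } braces mark a monadic rule in Happy: when the parser takes the error alternative, popLayout runs in the Lexer monad, dropping the layout context the lexer never got to close and keeping the two in sync.)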

Say it louder for the folks in the cheap seats - Parse! Errors! Can! End! Layout! Blocks! Isn't that just magical?

Now we can write a production for let (in Expr):

  | 'let' DeclBlock 'in' Expr { Let $2 $4 }

And one for declarations:

Decl
  : VAR '=' Expr { Decl $1 $3 Nothing }
  | VAR '=' Expr 'where' DeclBlock { Decl $1 $3 (Just $5) }

Add a name directive for Decl and...

%name parseDecl Decl

We're done!

No, seriously, that's it.

Yeah, 3000 words is all it takes to implement a parser for Haskell layout. Running this on the example where the lexer dropped the ball from earlier, we can see that the parser has correctly inserted all the missing }s in the right place because of the Close production, and the AST we get is what we expect:

λ runLexer parseDecl <$> readFile "that-code-from-before.hs"
Right
  (Decl { declName = "foo"
        , declRhs =
            Let [ Decl { declName = "x"
                       , declRhs =
                          Let
                            [ Decl { declName = "y", declRhs = Var "z"
                                   , declWhere = Nothing} ]
                            (Var "y")
                       , declWhere = Nothing
                       }
                ]
                (Var "x")
        , declWhere = Nothing
        })

I've thrown the code from this post up in an organised manner on my Gitea. The lexer worked out to be 130 lines, and the parser - just 81.

Here's why I favour this approach:

  • It's maintainable. Apart from the rendezvous in Close, the lexer and the parser are completely independent. They're also entirely declarative - Reading the lexer rules tells you exactly what the lexer does, without having to drop down to how the actions are implemented.

  • It cleanly extends to supporting ASTs with annotations - you'd change our current Token type to a TokenClass type, and a Token would be finished using the line and column from the lexer state (sketched after this list). Annotating the AST with these positions can be done by projecting from $N in the Happy rules.

  • It's performant. Obviously the implementation here, using String, is not, but by changing how the AlexInput type behaves internally, we can optimise by using e.g. a lazy ByteString, a lazy Text, or some other kind of crazy performant stream type. I don't think anyone's ever complained about parsing being their bottleneck with GHC.

  • It's popular! The code implemented here is a simplification (wild simplification) of the approach used in GHC and Agda.
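Here's a sketch of what that second bullet means - the names TokenClass, tokLine and tokCol are hypothetical, mine rather than anything from the repo:

-- The old Token type, renamed:
data TokenClass
  = TkIdent String
  | TkLet | TkIn | TkWhere
  -- ... the rest of the constructors, unchanged
  deriving (Eq, Show)

-- The new Token pairs a TokenClass with the position the lexer had when
-- it finished the token; Happy rules can then project positions from $N.
data Token
  = Token { tokClass :: TokenClass
          , tokLine  :: !Int
          , tokCol   :: !Int
          }
  deriving (Eq, Show)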

Thank you for reading this post. I have no idea what I'm going to write about next!


  1. GHC extends this set to also contain the "token" \case. However, LambdaCase isn't a single token! The ✨ correct ✨ specification is that case is a layout keyword if the preceding token is \.