---
title: "The G-machine In Detail, or How Lazy Evaluation Works"
date: January 31, 2020
maths: true
---

\long\def\ignore#1{}

\ignore{
\begin{code}
{-# LANGUAGE RecordWildCards, NamedFieldPuns, CPP #-}
#if !defined(Section)
#error "You haven't specified a section to load! Re-run with -DSection=1 or -DSection=2"
#endif

#if defined(Section) && (Section != 1 && Section != 2)
#error Section "isn't a valid section to load! Re-run with -DSection=1 or -DSection=2"
#endif
\end{code}
}

With Haskell now more popular than ever, a great deal of programmers deal with lazy evaluation in their daily lives. They're aware of the pitfalls of lazy I/O, know not to use `foldl`, and are masters at introducing bang patterns in the right place. But very few programmers know the magic behind lazy evaluation—graph reduction.

This post is an abridged adaptation of Simon Peyton Jones' and David R. Lester's book, _"Implementing Functional Languages: a tutorial"_, itself a refinement of SPJ's previous work, 1987's _"The Implementation of Functional Programming Languages"_.

The newer book doesn't cover as much material as the previous one: it focuses mostly on the evaluation of functional programs, and indeed that is our focus today as well. For this, it details three abstract machines: the G-machine, the Three Instruction Machine (affectionately called Tim), and a parallel G-machine.

In this post we'll take a look first at a stack-based machine for reducing arithmetic expressions. Armed with the knowledge of how typical stack machines work, we'll take a look at the G-machine, and how graph reduction works (and where the name comes from in the first place!)

This post is written as [a Literate Haskell source file], with CPP conditionals to enable/disable each section. To compile a specific section, use GHC like this:

```bash
ghc -XCPP -DSection=1 2020-01-09.lhs
```

-----

\ignore{
\begin{code}
{-# LANGUAGE CPP #-}
#if Section == 1
\end{code}
}

\begin{code}
module StackArith where
\end{code}

Section 1: Evaluating Arithmetic with a Stack
=============================================

Stack machines are the base for all of the computation models we're going to explore today. To get a better feel for how they work, the first model of computation we're going to describe is stack-based arithmetic, better known as reverse polish notation. This machine also forms the basis of the programming language FORTH.

First, let us define a data type for arithmetic expressions, including the four basic operators (addition, multiplication, subtraction and division.)

\begin{code}
data AExpr
  = Lit Int
  | Add AExpr AExpr
  | Sub AExpr AExpr
  | Mul AExpr AExpr
  | Div AExpr AExpr
  deriving (Eq, Show, Ord)
\end{code}

This language has an 'obvious' denotation, which can be realised using an interpreter function, such as `aInterpret` below.

\begin{code}
aInterpret :: AExpr -> Int
aInterpret (Lit n) = n
aInterpret (Add e1 e2) = aInterpret e1 + aInterpret e2
aInterpret (Sub e1 e2) = aInterpret e1 - aInterpret e2
aInterpret (Mul e1 e2) = aInterpret e1 * aInterpret e2
aInterpret (Div e1 e2) = aInterpret e1 `div` aInterpret e2
\end{code}

Alternatively, we can implement the language through its _operational_ behaviour, by compiling it to a series of instructions that, when executed in an appropriate machine, leave it in a _final state_ from which we can extract the expression's result.

Our abstract machine for arithmetic will be a _stack_ based machine with only a handful of instructions. The type of instructions is `AInstr`{.haskell}.
\begin{code}
data AInstr
  = Push Int
  | IAdd | IMul
  | ISub | IDiv
  deriving (Eq, Show, Ord)
\end{code}

The state of the machine is simply a pair, containing an instruction stream and a stack of values. By our compilation scheme, the machine is never in a state where more values are required on the stack than there are values present; this would not be the case if we let programmers directly write instruction streams.

We can compile a program into a sequence of instructions recursively.

\begin{code}
aCompile :: AExpr -> [AInstr]
aCompile (Lit i)     = [Push i]
aCompile (Add e1 e2) = aCompile e1 ++ aCompile e2 ++ [IAdd]
aCompile (Mul e1 e2) = aCompile e1 ++ aCompile e2 ++ [IMul]
aCompile (Sub e1 e2) = aCompile e1 ++ aCompile e2 ++ [ISub]
aCompile (Div e1 e2) = aCompile e1 ++ aCompile e2 ++ [IDiv]
\end{code}

And we can write a function to represent the state transition rules of the machine. Note that, since the code for `e1` runs before the code for `e2`, the value of `e2` ends up on top of the stack; the non-commutative operators therefore subtract (and divide) the value below the top of the stack by the value on top.

\begin{code}
aEval :: ([AInstr], [Int]) -> ([AInstr], [Int])
aEval (Push i:xs, st)   = (xs, i:st)
aEval (IAdd:xs, x:y:st) = (xs, (x + y):st)
aEval (IMul:xs, x:y:st) = (xs, (x * y):st)
aEval (ISub:xs, x:y:st) = (xs, (y - x):st)
aEval (IDiv:xs, x:y:st) = (xs, (y `div` x):st)
\end{code}

A state is said to be _final_ when it has an empty instruction stream and a single result on the stack. To run a program, we simply repeat `aEval` until a final state is reached.

\begin{code}
aRun :: [AInstr] -> Int
aRun is = go (is, []) where
  go st | Just i <- final st = i
  go st = go (aEval st)

  final ([], [n]) = Just n
  final _ = Nothing
\end{code}

A very important property linking our compiler, abstract machine and interpreter together is that of _compiler correctness_. That is:

```haskell
forall x. aRun (aCompile x) == aInterpret x
```

As an example, the arithmetic expression $2 + 3 \times 4$ produces the following code sequence:

```haskell
[Push 2,Push 3,Push 4,IMul,IAdd]
```

You can interactively follow the execution of this program with the tool below. Pressing the Step button is equivalent to `aEval`. The stack is drawn in boxes to the left, and the instruction sequence is presented on the right, where the `>` marks the currently executing instruction (the "program counter", if you will).
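The same trace can also be produced in GHCi, using a small helper (not part of the post's code) that iterates `aEval` until a final state is reached:

```haskell
-- Iterate `aEval` from the initial state to a final one, keeping every
-- intermediate state.
traceSteps :: [AInstr] -> [([AInstr], [Int])]
traceSteps is = go (is, [])
  where
    go st@([], [_]) = [st]
    go st           = st : go (aEval st)

-- ghci> mapM_ print (traceSteps (aCompile (Add (Lit 2) (Mul (Lit 3) (Lit 4)))))
-- ([Push 2,Push 3,Push 4,IMul,IAdd],[])
-- ([Push 3,Push 4,IMul,IAdd],[2])
-- ([Push 4,IMul,IAdd],[3,2])
-- ([IMul,IAdd],[4,3,2])
-- ([IAdd],[12,2])
-- ([],[14])
```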
This all sounds good when described on paper, but how does one actually wire up (or, well, program) a computer to reduce functional programs? Among the first and most comprehensive answers to this question was the G-machine, whose G stands for "Graph". More specifically, the G-machine is an implementation of _graph reduction_: The expression to be reduced is represented as a graph that might have some redexes. Once the machine has identified some particular redex to reduce, it'll evaluate exactly as much as is needed to reach a head-normal form, and _replace_ (or update) the graph so that the old redex points to its normal form.

To explore the workings of the G-machine, we'll need to choose a functional language. Any will do, but simpler is better. Since I've already written a Lazy ML that compiles as described in this post, we'll go with that. [Rio]'s core language is a very simple functional language, notable only in that _it doesn't have $\lambda$-abstractions_. All functions are defined at top-level, in the form of supercombinators.

Before we get to supercombinators and graphs, though, let's walk through the reduction of a tiny expression by hand.

$$ ( \lambda{x}. x + 2 )\ 5 $$

Evaluation of a functional program starts by identifying a _reducible expression_, that is, an expression that isn't "done" evaluating yet. By convention, we call reducible expressions redexes for short[^1], and expressions that are done evaluating are called _head-normal forms_. Every application is a reducible expression. Here, reduction proceeds by substituting $5$ in the place of every mention of $x$.

Substituting an expression $E_2$ in place of the variable $v$, in a bigger expression $E_1$ is notated $E_1[E_2/v]$ (read "$E_1$ with $E_2$ for $v$").

$$ (x + 2)[5/x] $$

This step of the evaluation isn't exactly an expression, but it serves to illustrate what reducing a $\lambda$ expression does: replacing the bound variable (or the "formal parameter" in fancy-pants speak. I'll stick to bound variable).

$$ (5 + 2) $$

By this step, the function has disappeared entirely. The expression has been replaced entirely with addition between numbers. Of course, addition, when both sides have been evaluated to a number, is _itself_ a redex. This program isn't done yet.

$$ 7 $$

Replacing the addition by its value, our original program has reached its end: The number $7$, and indeed any other number, is a head-normal form.
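To make the bracket notation concrete, here's a small substitution function over a toy expression type. This is purely illustrative (the core language we'll actually use has no lambdas), and it naively ignores variable capture:

```haskell
-- A toy expression type, only for illustrating E1[E2/v].
data Expr
  = Var String
  | Lam String Expr
  | Ap Expr Expr
  | Plus Expr Expr
  | Int Integer
  deriving Show

-- subst e2 v e1 computes e1[e2/v]. It does not handle variable capture.
subst :: Expr -> String -> Expr -> Expr
subst e2 v (Var x)
  | x == v    = e2
  | otherwise = Var x
subst e2 v (Lam x b)
  | x == v    = Lam x b                      -- v is shadowed under this lambda
  | otherwise = Lam x (subst e2 v b)
subst e2 v (Ap f a)   = Ap (subst e2 v f) (subst e2 v a)
subst e2 v (Plus a b) = Plus (subst e2 v a) (subst e2 v b)
subst _  _ (Int n)    = Int n

-- (x + 2)[5/x]:
--   subst (Int 5) "x" (Plus (Var "x") (Int 2))  ==>  Plus (Int 5) (Int 2)
```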
A **supercombinator** is a function that only refers to its arguments or other supercombinators.

There's a data type for terms:

```haskell
data Term
  = Let [(Var, Term)] Term
  | Letrec [(Var, Term)] Term
  | App Term Term
  | Ref Var
  | Num Integer
  deriving Show
```

And one for supercombinators:

```haskell
data SC = SC { name :: Var, args :: [Var], body :: Term }
  deriving Show
```

Consider the reduction of this functional program:

```haskell
double x = x + x
main = double (double 4)
```

Here, `double` and `main` are the supercombinators that constitute the program. By convention, execution starts with the supercombinator `main`.
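As a concrete example of these types, here's roughly how the program above would be represented. Two details are assumed here, since the post doesn't show them: `Var` is a synonym for `String`, and the primitive `+` is referred to by name like any other global.

```haskell
type Var = String  -- assumed; the post doesn't show the definition of Var

doubleSC, mainSC :: SC
doubleSC = SC
  { name = "double"
  , args = ["x"]
  , body = App (App (Ref "+") (Ref "x")) (Ref "x")           -- x + x
  }
mainSC = SC
  { name = "main"
  , args = []
  , body = App (Ref "double") (App (Ref "double") (Num 4))   -- double (double 4)
  }
```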
The initial graph is the trivial graph containing only the node `main` and no edges. Since the node points directly to a supercombinator, we can replace it by a copy of its body:
Now starts the actual work. There are many strategies for selecting a redex, and all of them are equally good, with the caveat that some may not terminate. However, if _any_ evaluation strategy terminates, then so does "always choose the outermost redex". This is called normal order evaluation. It's what the G-machine implements. The outermost redex here is the outer application of `double`, so that's where reduction will happen. To reduce an application, update the redex with a copy of the supercombinator body, and replace the bound variables with pointers to the arguments.
Observe that, since the subexpression `double 4` has two edges leading into it, the _tree_ representing the program has degenerated into a general graph. However, this isn't a bad thing: it means that the work to evaluate `double 4` will only be needed once. The application of $+$ isn't reducible yet because it requires its arguments to be evaluated, so the next reducible expression down the chain is the application node representing `double 4`. The expansion there is similarly simple. Here, it's a bit hard to see what's actually going on, so I'll highlight in blue the _whole_ next redex, `4 + 4`.
The state of the graph after reduction of `double 4`.
... with the entirety of the next redex highlighted for clarity.
And finally, the last redex, `8 + 8`, can be zapped entirely into the number 16[^2].

---

\ignore{
\begin{code}
#endif
#if Section == 2
\end{code}
}

\begin{code}
module Gm where

import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map, (!))

import qualified Data.Set as Set
import Data.Set (Set)
\end{code}

\ignore{
\begin{code}
import Data.Maybe
\end{code}
}

Section 2: The G-machine
========================

After seeing in detail the reduction of a simple expression, one might start to form in their heads an idea of an algorithm to reduce a functional program. As SPJ put it:
1. Find the next redex.
2. Reduce it.
3. Update the root of the redex with its reduct.

With these three easy steps, functional programs are evaluated! Of course, that glosses over three major difficulties:

1. How does one find the next redex?
2. How does one reduce it?
3. How does one update the graph?

Of these, only the answer to 3 is simple: "Overwrite it with an indirection". (We'll get there). To do all of this efficiently, we're going to use an _abstract machine_: The G-machine.
Well, I've used the least subtle blue possible to highlight the _spine_ of the expression graph. By starting at the root (the topmost node), and following every left pointer until reaching a supercombinator, one can find the spine of the graph. Moreover, if we use a stack to remember the addresses that we visited on our way down, we'll have _unwound_ the spine.
There needs to be a way to evaluate the argument `double 4` to head normal form and then continue reducing the application of `+`. Every programming language has to deal with this, and our solution is more of the same: use a stack. The G-machine already has a stack, though, so we need another one. A stack of stacks, and of return addresses, called the _dump_.

When a primitive operation needs the value of one of its arguments, it first saves that argument from the stack, then pushes the stack pointer and program counter onto the dump (this is the G-machine's concept of return address); the saved argument is pushed onto an empty stack, and the graph is unwound starting from that argument. When unwinding encounters a node in head-normal form, and there's a saved return address on the dump, we pop that, restore the stack pointers, and jump to the saved program counter.

The idea behind the G-machine is that we can teach each supercombinator to make an instance of its own body by compiling it to a series of small, atomic instructions. This solves the hardest problem in implementing functional languages, which is the whole "replacing the root of the redex with a copy of the supercombinator body" I glossed over.

An Example
----------

Let's consider the (fragment of a) functional program below.

```haskell
f g x = K (g x)
```

Compiling it into G-machine instructions results in the following code:

```haskell
Push (Arg 1)
Push (Arg 3)
Mkap
Push (Global K)
Mkap
Slide 3
Unwind
```

These diagrams show how the code for `f` would execute.
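The post doesn't reproduce the full definitions of the machine's state, its graph nodes, or its instruction set, but a rough sketch of them makes the examples below much easier to read. The names here match how they're used in the rest of the post; the exact shape of the types (in particular the instruction set) is an educated guess, not the real definitions:

```haskell
-- A sketch of the machine's types, just to make the examples below readable.
import Data.Map.Strict (Map)

type Addr = Int

data Node
  = App Addr Addr              -- an application ('@') node
  | SCo String Int [Instr]     -- a supercombinator: name, arity, compiled code
  | Num Int                    -- a number in head-normal form
  deriving Show

data Operand
  = Arg Int        -- an argument of the current supercombinator
  | Local Int      -- a value already on the stack
  | Global String  -- another supercombinator, by name
  | Value Int      -- an integer literal
  deriving (Eq, Show)

data Instr
  = Push Operand | Mkap | Slide Int | Pop Int
  | Unwind | Eval | Cond [Instr] [Instr]
  | Add | Sub | Mul | Equ
  deriving (Eq, Show)

data GmState = GmState
  { code    :: [Instr]             -- instructions left to execute
  , stack   :: [Addr]              -- addresses of nodes in the heap
  , heap    :: Map Addr Node       -- the graph itself
  , globals :: Map String Addr     -- where each supercombinator lives
  , dump    :: [([Addr], [Instr])] -- saved stacks and return addresses
  }
```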
\begin{code}
factorial10 :: GmState
factorial10 = GmState
  { code    = [Push (Global "main"), Unwind]
  , globals = globals
  , stack   = []
  , heap    = heap
  , dump    = []
  }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "fac" 1
          [ Push (Arg 0), Eval, Push (Local 0), Push (Value 0), Equ
          , Cond [ Push (Value 1), Slide 3, Unwind ] []
          , Push (Global "fac")
          , Push (Local 1), Push (Value 1), Sub
          , Mkap, Eval
          , Push (Local 1), Mul
          , Slide 2, Unwind
          ]
      , SCo "main" 0
          [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("fac", 0), ("main", 1) ]
\end{code}

What you could expect from Rio is more along the lines of this crime against humanity:

\begin{code}
factorial10_dumb :: GmState
factorial10_dumb = GmState
  { code    = [Unwind]
  , globals = globals
  , stack   = [5]
  , heap    = heap
  , dump    = []
  }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "if" 3
          [ Push (Arg 0), Eval, Cond [ Push (Arg 1) ] [ Push (Arg 2) ], Slide 4, Unwind ]
      , SCo "mul" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Mul, Slide 3, Unwind ]
      , SCo "sub" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Sub, Slide 3, Unwind ]
      , SCo "equ" 2
          [ Push (Arg 0), Eval, Push (Arg 2), Eval, Equ, Slide 3, Unwind ]
      , SCo "fac" 1
          [ Push (Global "if"), Push (Global "equ"), Push (Arg 2), Mkap, Push (Value 0), Mkap
          , Mkap, Push (Value 1), Mkap, Push (Global "mul"), Push (Arg 2), Mkap, Push (Global "fac")
          , Push (Global "sub"), Push (Arg 4), Mkap, Push (Value 1), Mkap, Mkap, Mkap
          , Mkap, Slide 2, Unwind
          ]
      , SCo "main" 0
          [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("if", 0), ("mul", 1), ("sub", 2), ("equ", 3), ("fac", 4) ]
\end{code}
Those two red nodes? That's the result of the program, and the top of the stack pointing to it. Yup.

Thankfully, the G-machine makes it easy to write a garbage collector. Well, in theory, at least. The roots can be found on the stack, and all the stacks saved on the dump. Each live supercombinator can also keep other supercombinators alive by referencing them in `Push (Global _)` instructions.

Since traversing each supercombinator every GC cycle to identify global references is expensive, they can each be augmented with a "static reference table", or SRT for short. In our simulator, this would be a `Set` of `Addr`s that each supercombinator keeps alive.

\begin{code}
liveAddrs :: GmState -> Set Addr
liveAddrs GmState{..} = roots <> foldMap explore roots where
  roots = Set.fromList stack <> foldMap (Set.fromList . fst) dump

  explore i = Set.insert i $
    case heap Map.! i of
      App x y      -> explore x <> explore y
      SCo _ _ code -> foldMap globalRefs code
      _            -> mempty

  globalRefs (Push (Global i)) = Set.singleton (globals Map.! i)
  globalRefs _                 = mempty
\end{code}

With the set of live addresses in hand, we can write code to get rid of all the others. (A real collector would probably also re-number the survivors; this toy one just builds an entirely new heap without the dead addresses, getting rid of the old one.)

\begin{code}
scavenge :: GmState -> GmState
scavenge st@GmState{..} = st { heap = Map.filterWithKey (\k _ -> is_live k) heap } where
  live = liveAddrs st
  is_live x = x `Set.member` live
\end{code}

Running scavenge on the final state of `factorial10_dumb` gets us a much better looking graph:
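Coming back to the SRT idea from a few paragraphs ago, here's a sketch of how those tables could be precomputed in this simulator. This isn't part of the post's code (and `srtsFor` is a made-up name); it just reuses the structures above:

```haskell
-- Compute, once per supercombinator, the set of global addresses its code can
-- push. A collector could consult this table instead of re-scanning
-- instruction lists on every cycle.
srtsFor :: GmState -> Map Addr (Set Addr)
srtsFor GmState{..} = Map.mapMaybe srtOf heap
  where
    srtOf (SCo _ _ code) = Just (foldMap globalRefs code)
    srtOf _              = Nothing

    globalRefs (Push (Global g)) = Set.singleton (globals Map.! g)
    globalRefs _                 = mempty
```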
\ignore{
\begin{code}
#endif
\end{code}
}

Possible Extensions
===================

1. Data structures. This is covered in the book, but I didn't have space/time to cover it here. The core idea is that the graph gets a new kind of node, `Constr Int [Addr]`, that stores a tag and some fixed amount of addresses. Pattern-matching `case` expressions can then take apart these `Constr` nodes and branch based on the integer tag.

1. Support I/O. By threading an explicit state variable, a guaranteed order of effects can be achieved even in lazy code. Let me tell you a secret: This is what GHC does.

    ```haskell
    newtype IO a = IO { runIO# :: State# RealWorld -> (# State# RealWorld, a #) }
    ```

    The `State# RealWorld`{.haskell} value is consumed by each foreign function, i.e. everything that _actually_ does I/O, looking a lot like a state monad; in reality, the `RealWorld`{.haskell} is made of lies. `State#`{.haskell} has return kind `TYPE (TupleRep '[])`{.haskell}, i.e., it takes up no bits at runtime. However, by having every foreign function be strict in _some_ variable, no matter how fake it is, we can guarantee the order of effects: each function depends directly on the function "before" it.

1. Parallelism. Lazy graph reduction lends itself nicely to parallelism. One could envision a machine where a number of worker threads are each working on a different redex. To prevent weird parallelism issues from cropping up, graph nodes would need to be lockable. However, only `@` nodes will ever be locked, so that might lead to an optimisation.

    As an alternative to a regular lock, the implementation could replace each node under evaluation by a _black hole_, which doesn't keep alive any more values (thus _possibly_ getting rid of some space leaks). Each black hole would maintain a queue of threads that tried to evaluate it, to be woken up once the result is available.

Conclusion
==========

This post was long. And it _still_ didn't cover a lot of stuff about the G-machine, such as how to compile _to_ the G-machine (expect a follow-up post on that) and how to compile _from_ the G-machine (expect a follow-up post on that too!)

Assembling G-machine instructions is actually simpler than it seems. With the exception of `Eval` and `Unwind`, which are common and large enough to warrant pre-assembled helpers, all G-machine instructions assemble to no more than a handful of x86 instructions. As an entirely contextless example, here's how `Cond` instructions are assembled in Rio:

```haskell
compileGInst (Cond c_then c_else) = do
  pop rbx
  cmp (int64 0) (intv_off `quadOff` rbx)
  rec jne else_label
      traverse_ compileGInst c_then
      jmp exit_label
      else_label <- genLabel
      traverse_ compileGInst c_else
      exit_label <- genLabel
  pure ()
```

This is one of the most complicated instructions to assemble, since the compiler has to do the impedance matching between the G-machine abstraction of "instruction lists" and the assembler's labels. Other instructions, such as `Pop` (not documented here), have a much clearer translation:

```haskell
compileGInst (Pop n) = add (int64 (n * 8)) rsp
```

Keep in mind that the x86 stack grows downwards, so adding corresponds to popping. The only difference between the actual machine and the G-machine here is that the latter works in terms of addresses and the former works in terms of bytes.

The code to make an `App` node is similarly simple, using Haskell almost as a macro assembler.
The variable `hp` is defined in the code generator and RTS headers to be `r10`, such that both the C support code and the generated assembly can agree on where the heap is.

```haskell
compileGInst Mkap = do
  mov (int8 tag_AP) (tag_off `byteOff` hp)
  pop (arg_off `quadOff` hp)
  pop (fun_off `quadOff` hp)
  push hp
  hp += int64 valueSize
```

Allocating in Rio is as simple as writing the value you want, saving `hp` somewhere, then bumping it by the size of a value. We can do this because the amount a given supercombinator allocates is statically known, so we can do a heap satisfaction check once, at the start of the combinator, and then just build our graphs free of worry.
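For comparison, the same operation in the section 2 simulator is just a map insert plus a counter bump. This is only a sketch, not code from the post: it assumes the `GmState` and `Node` shapes sketched earlier, and threads the next free address explicitly instead of keeping it in an `hp` register.

```haskell
-- Pop the argument and function addresses, write an App node at the next free
-- heap address, push that address, and "bump" the heap pointer.
mkapSim :: Addr -> GmState -> (Addr, GmState)
mkapSim hp st@GmState{..} =
  case stack of
    arg : fun : rest ->
      ( hp + 1
      , st { heap = Map.insert hp (App fun arg) heap, stack = hp : rest } )
    _ -> error "Mkap needs two addresses on the stack"
```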