---
title: "The G-machine In Detail, or How Lazy Evaluation Works"
date: January 31, 2020
maths: true
---

\long\def\ignore#1{}

\ignore{
\begin{code}
{-# LANGUAGE RecordWildCards, NamedFieldPuns, CPP #-}
#if !defined(Section)
#error "You haven't specified a section to load! Re-run with -DSection=1 or -DSection=2"
#endif
#if defined(Section) && (Section != 1 && Section != 2)
#error Section "isn't a valid section to load! Re-run with -DSection=1 or -DSection=2"
#endif
\end{code}
}
<script src="https://cdn.jsdelivr.net/npm/@svgdotjs/svg.js/dist/svg.min.js"></script>

<style>
.diagram {
  background-color: #ddd;
  min-height: 10em;
}

.diagram-contained {
  height: 100%;
}

.diagram-container {
  display: flex;
  flex-direction: row;
}

.picture-container {
  display: flex;
  flex-direction: row;
  overflow-x: scroll;
  justify-content: space-between;
}

.picture {
  display: flex;
  flex-direction: column;
  width: 80ch;
  margin-left: 2em;
  margin-right: 2em;
}

.center {
  justify-content: center;
  width: 80%;
  max-width: 50%;
  margin: 0 auto;
}

.instruction {
  font-family: monospace;
  color: #af005f;
}

.operand {
  font-family: monospace;
  color: #268bd2;
}

img.centered {
  width: 10em;
  margin: auto;
}

img.big {
  width: 80ch;
  height: 200px;
  margin: auto;
}

img.absolute-unit {
  width: 80ch;
  height: 500px;
  margin: auto;
}

img.two-img {
  padding-left: 3em;
  padding-right: 2em;
}

p.image {
  text-align: center !important;
}
</style>

<noscript>
This post has several interactive components that won't work without
JavaScript. These will be clearly indicated. Regardless, I hope that you
can still appreciate the prose and code.
</noscript>
With Haskell now more popular than ever, a great many programmers
deal with lazy evaluation in their daily lives. They're aware of the
pitfalls of lazy I/O, know not to use `foldl`, and are masters at
introducing bang patterns in the right place. But very few programmers
know the magic behind lazy evaluation—graph reduction.

This post is an abridged adaptation of Simon Peyton Jones and David R.
Lester's book, _Implementing Functional Languages: a tutorial_,
itself a refinement of SPJ's previous work, 1987's _The Implementation
of Functional Programming Languages_. The newer book doesn't cover as
much material as the previous one: it focuses mostly on the evaluation of
functional programs, and indeed that is our focus today as well. For
this, it details three abstract machines: the G-machine, the Three
Instruction Machine (affectionately called Tim), and a parallel
G-machine.

In this post we'll take a look first at a stack-based machine for
reducing arithmetic expressions. Armed with the knowledge of how typical
stack machines work, we'll take a look at the G-machine, and how graph
reduction works (and where the name comes from in the first place!)

This post is written as [a Literate Haskell source file], with CPP
conditionals to enable/disable each section. To compile a specific
section, use GHC like this:
```bash
ghc -XCPP -DSection=1 2020-01-31-lazy-eval.lhs
```
-----

\ignore{
\begin{code}
{-# LANGUAGE CPP #-}
#if Section == 1
\end{code}
}

\begin{code}
module StackArith where
\end{code}

Section 1: Evaluating Arithmetic with a Stack
=============================================
Stack machines are the basis for all of the computation models we're
going to explore today. To get a better feel for how they work, the first
model of computation we're going to describe is stack-based arithmetic,
better known as reverse Polish notation. This machine also forms the
basis of the programming language FORTH. First, let us define a data
type for arithmetic expressions, including the four basic operators
(addition, subtraction, multiplication, and division).
\begin{code}
data AExpr
  = Lit Int
  | Add AExpr AExpr
  | Sub AExpr AExpr
  | Mul AExpr AExpr
  | Div AExpr AExpr
  deriving (Eq, Show, Ord)
\end{code}
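For instance, the expression $2 + 3 \times 4$, which we'll meet again
when we compile it later in this section, is represented like this:

```haskell
example :: AExpr
example = Add (Lit 2) (Mul (Lit 3) (Lit 4))
```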
This language has an 'obvious' denotation, which can be realised using
an interpreter function, such as `aInterpret` below.

\begin{code}
aInterpret :: AExpr -> Int
aInterpret (Lit n) = n
aInterpret (Add e1 e2) = aInterpret e1 + aInterpret e2
aInterpret (Sub e1 e2) = aInterpret e1 - aInterpret e2
aInterpret (Mul e1 e2) = aInterpret e1 * aInterpret e2
aInterpret (Div e1 e2) = aInterpret e1 `div` aInterpret e2
\end{code}
Alternatively, we can implement the language through its _operational_
behaviour, by compiling it to a series of instructions that, when
executed in an appropriate machine, leave it in a _final state_ from
which we can extract the expression's result.

Our abstract machine for arithmetic will be a _stack_ based machine with
only a handful of instructions. The type of instructions is
`AInstr`{.haskell}.
\begin{code}
data AInstr
  = Push Int
  | IAdd | IMul | ISub | IDiv
  deriving (Eq, Show, Ord)
\end{code}
The state of the machine is simply a pair, containing an instruction
stream and a stack of values. By our compilation scheme, the machine is
never in a state where more values are required on the stack than there
are values present; this would not be the case if we let programmers
write instruction streams directly.

We can compile a program into a sequence of instructions recursively.
\begin{code}
aCompile :: AExpr -> [AInstr]
aCompile (Lit i)     = [Push i]
aCompile (Add e1 e2) = aCompile e1 ++ aCompile e2 ++ [IAdd]
aCompile (Mul e1 e2) = aCompile e1 ++ aCompile e2 ++ [IMul]
aCompile (Sub e1 e2) = aCompile e1 ++ aCompile e2 ++ [ISub]
aCompile (Div e1 e2) = aCompile e1 ++ aCompile e2 ++ [IDiv]
\end{code}
And we can write a function to represent the state transition rules of
the machine. Note that, since the left operand is compiled (and thus
pushed) first, it ends up _below_ the right operand on the stack, so the
non-commutative operators have to account for this.

\begin{code}
aEval :: ([AInstr], [Int]) -> ([AInstr], [Int])
aEval (Push i:xs, st)   = (xs, i:st)
aEval (IAdd:xs, x:y:st) = (xs, (y + x):st)
aEval (IMul:xs, x:y:st) = (xs, (y * x):st)
aEval (ISub:xs, x:y:st) = (xs, (y - x):st)
aEval (IDiv:xs, x:y:st) = (xs, (y `div` x):st)
\end{code}
A state is said to be _final_ when it has an empty instruction stream
and a single result on the stack. To run a program, we simply repeat
`aEval` until a final state is reached.

\begin{code}
aRun :: [AInstr] -> Int
aRun is = go (is, []) where
  go st | Just i <- final st = i
  go st = go (aEval st)

  final ([], [n]) = Just n
  final _ = Nothing
\end{code}
A very important property linking our compiler, abstract machine and
interpreter together is that of _compiler correctness_. That is:

```haskell
forall x. aRun (aCompile x) == aInterpret x
```
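This property is a good candidate for property-based testing. The
sketch below is not part of this post's compiled code: it assumes the
QuickCheck library is available, and leaves `Div` out of the generator
so that neither semantics trips over division by zero.

```haskell
import Test.QuickCheck

instance Arbitrary AExpr where
  arbitrary = sized go where
    go 0 = Lit <$> arbitrary
    go n = oneof [ Lit <$> arbitrary
                 , Add <$> half <*> half
                 , Sub <$> half <*> half
                 , Mul <$> half <*> half ]
      where half = go (n `div` 2)

-- ghci> quickCheck prop_compilerCorrect
prop_compilerCorrect :: AExpr -> Bool
prop_compilerCorrect e = aRun (aCompile e) == aInterpret e
```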
As an example, the arithmetic expression $2 + 3 \times 4$ produces the
following code sequence:

```haskell
[Push 2,Push 3,Push 4,IMul,IAdd]
```
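Stepping this program with `aEval`, one instruction at a time, runs
through the following (instruction stream, stack) states:

```haskell
([Push 2, Push 3, Push 4, IMul, IAdd], [])
([Push 3, Push 4, IMul, IAdd],         [2])
([Push 4, IMul, IAdd],                 [3, 2])
([IMul, IAdd],                         [4, 3, 2])
([IAdd],                               [12, 2])
([],                                   [14])
```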
You can interactively follow the execution of this program with the tool
below. Pressing the Step button is equivalent to `aEval`. The stack is
drawn in boxes to the left, and the instruction sequence is presented on
the right, where the `>` marks the currently executing instruction (the
"program counter", if you will).

<noscript>
You seem to have opted out of the interactive visualisations :(
</noscript>
<div class="center">
<div class="diagram diagram-container">
<div class="diagram-contained">
<div class="diagram" id="forth">
</div>
<button id="step" onclick="step()">Step</button>
<button onclick="reset()">Reset</button>
</div>
<div id="code" style="min-width: 10em;">
</div>
</div>
</div>

<script src="/static/forth_machine.js"></script>
\ignore{
\begin{code}
#elif Section == 2
\end{code}
}

---

Section 1.75: A Functional Program
==================================
In the previous section, we looked at how stack machines can be used to
implement arithmetic. This is nothing exciting, though: FORTH is from
the late 1960s! In this section, we're going to look at a _much_ more
modern idea, only 30-something years old, which uses stack machines to
implement _functional_ languages via _lazy graph reduction_.

But first, we need to understand what that technobabble means in the
first place. We define a functional language to be one in which the
evaluation of a program expression is the same as evaluating a
mathematical function: when you're executing a "function application",
substitute the actual value of the argument wherever the parameter
appears in the body of the function, then reduce any _reducible
expressions_.
<blockquote>
<div style="font-size: 15pt;">
$$
( \lambda{x}. x + 2 )\ 5
$$

Evaluation of a functional program starts by identifying a _reducible
expression_, that is, an expression that isn't "done" evaluating yet. By
convention, we call reducible expressions redexes for short[^1], and
expressions that are done evaluating are called _head-normal forms_.
Every application is a reducible expression. Here, reduction proceeds by
substituting $5$ in the place of every mention of $x$. Substituting an
expression $E_2$ in place of the variable $v$, in a bigger expression
$E_1$, is notated $E_1[E_2/v]$ (read "$E_1$ with $E_2$ for $v$").

$$
(x + 2)[5/x]
$$

This step of the evaluation isn't exactly an expression, but it serves
to illustrate what reducing a $\lambda$ expression does: replacing the
bound variable (the "formal parameter", in fancy-pants speak; I'll
stick to bound variable) with the argument.

$$
(5 + 2)
$$

By this step, the function has disappeared entirely. The expression has
been replaced with addition between numbers.

Of course, addition, when both sides have been evaluated to a number, is
_itself_ a redex. This program isn't done yet.

$$
7
$$

Replacing the addition by its value, our original program has reached
its end: the number $7$, and indeed any other number, is a head-normal
form.
</div>
</blockquote>
This all sounds good when described on paper, but how does one actually
wire up (or, well, program) a computer to reduce functional programs?

Among the first and most comprehensive answers to this question was the
G-machine, whose G stands for "Graph". More specifically, the G-machine
is an implementation of _graph reduction_: the expression to be reduced
is represented as a graph that might have some redexes.

Once the machine has identified some particular redex to reduce, it'll
evaluate exactly as much as is needed to reach a head-normal form, and
_replace_ (or update) the graph so that the old redex points to its
normal form.

To explore the workings of the G-machine, we'll need to choose a
functional language. Any will do, but simpler is better. Since I've
already written a Lazy ML that compiles as described in this post, we'll
go with that.

[Rio]'s core language is a very simple functional language, notable only
in that _it doesn't have $\lambda$-abstractions_. All functions are
defined at top-level, in the form of supercombinators.

<blockquote>
A **supercombinator** is a function that only refers to its arguments or
other supercombinators.
</blockquote>
There's a data type for terms:

```haskell
data Term
  = Let [(Var, Term)] Term
  | Letrec [(Var, Term)] Term
  | App Term Term
  | Ref Var
  | Num Integer
  deriving Show
```

And one for supercombinators:

```haskell
data SC = SC { name :: Var, args :: [Var], body :: Term }
  deriving Show
```
Consider the reduction of this functional program:

```haskell
double x = x + x
main = double (double 4)
```
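Encoded with the data types above, this program would look something
like the sketch below. It assumes `Var` is a synonym for `String`, and
that the primitive `+` can be referred to by name, neither of which is
pinned down by the snippets above.

```haskell
doubleSC, mainSC :: SC
doubleSC = SC { name = "double", args = ["x"]
              , body = App (App (Ref "+") (Ref "x")) (Ref "x") }
mainSC   = SC { name = "main", args = []
              , body = App (Ref "double") (App (Ref "double") (Num 4)) }
```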
Here, `double` and `main` are the supercombinators that constitute the
program. By convention, execution starts with the supercombinator
`main`.

<p class="image">
<img class="centered" src="/diagrams/template/step1.svg" />
</p>

The initial graph is the trivial graph containing only the node `main`
and no edges. Since the node points directly to a supercombinator, we
can replace it by a copy of its body:

<p class="image">
<img class="centered" src="/diagrams/template/step2.svg" />
</p>
Now starts the actual work. There are many strategies for selecting a
redex, and all of them are equally valid, with the caveat that some may
not terminate. However, if _any_ evaluation strategy terminates, then so
does "always choose the outermost redex". This strategy is called normal
order evaluation, and it's what the G-machine implements.

The outermost redex here is the outer application of `double`, so that's
where reduction will happen. To reduce an application, update the redex
with a copy of the supercombinator body, and replace the bound variables
with pointers to the arguments.

<p class="image">
<img class="centered" src="/diagrams/template/step3.svg" />
</p>
Observe that, since the subexpression `double 4` has two edges leading
into it, the _tree_ representing the program has degenerated into a
general graph. However, this isn't a bad thing: it means that the work
to evaluate `double 4` will only be needed once.

The application of $+$ isn't reducible yet because it requires its
arguments to be evaluated, so the next reducible expression down the
chain is the application node representing `double 4`. The expansion
there is similarly simple.

Here, it's a bit hard to see what's actually going on, so I'll highlight
in <span style="color: #0984e3">blue</span> the _whole_ next redex, `4 + 4`.
<div class="image"> <!-- reduction + highlight {{{ -->
<div class="mathpar">
<div style="flex-direction: column; padding-right: 2em;">
<img class="centered two-img" src="/diagrams/template/step4.svg" />
<p style="max-width: 32ch;">
The state of the graph after reduction of `double 4`.
</p>
</div>
<div style="flex-direction: column; padding-left: 2em;">
<img class="centered two-img" src="/diagrams/template/step4red.svg" />
<p style="max-width: 32ch;">
... with the entirety of the next redex highlighted for clarity.
</p>
</div>
</div>
</div> <!-- }}} -->
But, wait. That redex has _two_ application nodes, but the expression it
represents is just `4 + 4` (with the `4`s shared, so more like `let x =
4 in x + x`, but still). What gives?

Most formal treatments of functional languages, this included (to
the extent that you can call Rio and a blog post "formal"), use
_currying_ to represent functions of multiple arguments. That is,
instead of having built-in support for things like

```javascript
let f = function(x, y) {
  /* takes two arguments (arity = 2) */
}
```

we encode a function of many arguments using nested lambda expressions,
as in $\lambda x. \lambda y. x + y$. That's why the application `4 + 4`,
or, better stated, `(+) 4 4`, has two application nodes.
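At the syntax level, the highlighted redex corresponds to a term like
the one below (sharing is really a property of the graph, so `Let` is
only an approximation here; the same `Var`-is-`String` and nameable-`+`
assumptions as before apply):

```haskell
fourPlusFour :: Term
fourPlusFour = Let [("x", Num 4)]
  (App (App (Ref "+") (Ref "x")) (Ref "x"))
```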
With that in mind, the entire blue subgraph can be zapped away to become
the number 8.

<p class="image">
<img class="centered" src="/diagrams/template/step5.svg" />
</p>

And finally, the last redex, `8 + 8`, can be zapped entirely into the
number 16[^2].
---

\begin{code}
module Gm where

import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map, (!))

import qualified Data.Set as Set
import Data.Set (Set)
\end{code}

\ignore{
\begin{code}
import Data.Maybe
\end{code}
}
Section 2: The G-machine
========================

After seeing in detail the reduction of a simple expression, one might
start to form in their head an idea of an algorithm to reduce a
functional program. As SPJ put it:

<blockquote>
1. Find the next redex.
2. Reduce it.
3. Update the root of the redex with its reduct.
</blockquote>

With these three easy steps, functional programs be!

Of course, that glosses over three major difficulties:

1. How does one find the next redex?
2. How does one reduce it?
3. How does one update the graph?

Of these, only the answer to 3 is simple: "Overwrite it with an
indirection" (we'll get there). To answer the first two, and to do the
third efficiently, we're going to use an _abstract machine_: the
G-machine.
<details>
<summary>What's an abstract machine?</summary>

An abstract machine isn't, as the similar-sounding name might imply, a
virtual machine. Indeed, these concepts are so easily confused that the
most popular abstract machine in existence has "virtual machine" in its
name. I'm talking about LLVM, of course.

Abstract machines are simply formalisms used to aid in the
implementation of compilers. Of course, one might write an execution
engine for such a machine (a "simulator", one could say), and even use
that as an actual execution model for your language (like OCaml uses the
ZINC machine).

In this, they are more closely related to intermediate languages than
virtual machines.
</details>

Let's tackle these problems in turn.
How does one find the next redex?
---------------------------------

Consider the following expression graph. It has an interesting feature
in that it (almost certainly) constitutes a redex. How do we know that?

<p class="image">
<img class="centered" src="/diagrams/gm/spine.svg" />
</p>

Well, I've used the least subtle blue possible to highlight the _spine_
of the expression graph. By starting at the root (the topmost node), and
following every left pointer until reaching a supercombinator, one can
find the spine of the graph.

Moreover, if we use a stack to remember the addresses that we visited on
our way down, we'll have _unwound_ the spine.
<p class="image">
<img class="centered big" src="/diagrams/gm/spine+stack.svg" />
</p>

<details>
<summary>A note on stack addressing</summary>

Following x86 convention, our stack grows _downwards_, so that the first
element in the diagram above would be the one pointing to `f`.
</details>

The third address in the stack is the root of the redex, and the first
address points to a supercombinator. If the number of pointers on the
stack is greater than or equal to the number of arguments the
supercombinator expects (plus one, to account for the supercombinator
node itself), we've spotted a redex.
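As a sketch, unwinding can be written against the heap representation
we'll define with the simulator below (`GmHeap`, `Addr` and the node
type are introduced there); the result lists the supercombinator first
and the root last, just like the stack in the diagram:

```haskell
unwindSpine :: GmHeap -> Addr -> [Addr]
unwindSpine heap = go [] where
  go spine addr = case heap Map.! addr of
    App fun _ -> go (addr:spine) fun  -- follow the left pointer down
    _         -> addr:spine           -- a supercombinator: spine found
```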
How does one reduce it?
-----------------------

This depends on the nature of the redex, of course; reducing a
supercombinator is not the same as reducing an arithmetic function, for
example.

**Supercombinator redexes** are easy enough. If the stack has enough
arguments, then we can just replace the root of the redex (in our
addressing model, this coincides with the stack pointer used to fetch
the last argument) with a copy of the body of the supercombinator,
substituting the arguments in the correct places.

**Constant applicative forms**, or CAFs, are supercombinators with no
arguments. Their reduction is much the same as with a normal
supercombinator, except that when the time comes to update the graph, we
need to update the supercombinator _itself_ with an indirection.

**Primitive redexes**, on the other hand, will require a bit more
machinery. For instance, what should we do in the situation above, where
the argument to `+` was itself a redex?

<p class="image">
<img class="centered" src="/diagrams/template/step3.svg" />
</p>
There needs to be a way to evaluate the argument `double 4` to head
normal form, then continue reducing the application of `+`. Every
programming language has to deal with this, and our solution is more of
the same: use a stack.

The G-machine already has a stack, though, so we need another one. A
stack of stacks, and of return addresses, called the _dump_. When a
primitive operation needs the value of one of its arguments, it first
saves that argument from the stack, then pushes the stack pointer and
program counter onto the dump (this is the G-machine's concept of a
return address); the saved argument is pushed onto an empty stack, and
the graph is unwound starting from that argument.

When unwinding encounters a node in head-normal form, and there's a
saved return address on the dump, we pop that, restore the stack
pointers, and jump to the saved program counter.

The idea behind the G-machine is that we can teach each supercombinator
to make an instance of its own body by compiling it to a series of
small, atomic instructions. This solves the hardest problem in
implementing functional languages, which is the whole "replacing the
root of the redex with a copy of the supercombinator body" I glossed
over.
An Example
----------

Let's consider the (fragment of a) functional program below.

```haskell
f g x = K (g x)
```

Compiling it into G-machine instructions results in the following
sequence:

```haskell
Push (Arg 1)
Push (Arg 3)
Mkap
Push (Global K)
Mkap
Slide 3
Unwind
```
These diagrams show how the code for `f` would execute.

<div class="picture-container"> <!-- {{{ -->
<div class="picture" id="fig1.1">
<img class="tikzpicture" src="/diagrams/gm/entry.svg" />
Fig. 1: Diagram of the stack and the heap ("graph") after entering the
$f$ supercombinator.
</div>
<div class="picture" id="fig1.2">
<img class="tikzpicture" src="/diagrams/gm/push_x.svg" />
Fig. 2: State of the machine after executing `Push (Arg 1)`.
</div>
<div class="picture" id="fig1.3">
<img class="tikzpicture" src="/diagrams/gm/push_g.svg" />
Fig. 3: State of the machine after executing `Push (Arg 3)`.
</div>
<div class="picture" id="fig1.4">
<img class="tikzpicture" src="/diagrams/gm/app_gx.svg" />
Fig. 4: State of the machine after executing `Mkap`.
</div>
<div class="picture" id="fig1.5">
<img class="tikzpicture" src="/diagrams/gm/push_k.svg" />
Fig. 5: State of the machine after executing `Push (Global K)`.
</div>
<div class="picture" id="fig1.6">
<img class="tikzpicture" src="/diagrams/gm/app_kgx.svg" />
Fig. 6: State of the machine after executing `Mkap`.
</div>
<div class="picture" id="fig1.7">
<img class="tikzpicture" src="/diagrams/gm/slide_3.svg" />
Fig. 7: State of the machine after executing `Slide 3`.
</div>
</div> <!-- }}} -->
When jumping to the code of `f`, the stack would look as it does in
figure 1. The expression graph has been unwound, and the stack has
pointers to the application nodes that we'll use to fetch the actual
arguments.

The first thing we do is take pointers to the arguments `g` and `x` from
their application nodes and put them on the stack. This is shown in
figures 2 and 3.

Keep in mind that `Arg 0` would refer to the bottom-most stack location,
so (on entry to the function) `Arg 1` refers to the first argument.
However, when we push onto the stack, the offsets to reach the arguments
shift by one, and so what would be `Arg 2` has to become `Arg 3`.

The instruction `Mkap` takes the two newest pointers and makes an
application node, denoted `@`, from them. The newest value on the stack
is taken as the function (the node's left edge) and the value above that
is the argument (the node's right edge).

By figure 4, we're not done yet. `Push (Global K)` has the sole effect
of pushing a pointer to the supercombinator `K` onto the stack, as shown
in figure 5; after yet another `Mkap`, we've finished building the body
of `f`.

The G-machine presented above, unlike the one implemented in Rio, is not
lazy; the abrupt transition between figures 6 and 7 shows that,
instead of updating the graph, we just discard the old stuff that was
there with a `Slide 3` instruction.

"Slide" is a weird little instruction that doesn't correspond to any
conventional stack operation. Its effect (save the newest value on the
stack, discard the `n` values following it, then push the saved value
back) is best described by the Haskell function below:

```haskell
slide n (x:xs) = x:drop n xs
```
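For example:

```haskell
-- ghci> slide 2 [8, 1, 2, 3]
-- [8, 3]
```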
Implementing the G-machine
--------------------------

First and foremost we'll need a type for our machine's instructions.
`GmVal`{.haskell} represents anything that can be pushed onto the stack,
and only exists to avoid having four different `Push` instructions.

\begin{code}
data GmVal
  = Global String
  | Value Int
  | Arg Int
  | Local Int
  deriving (Eq, Show, Ord)
\end{code}

The addressing mode `Global` is only used for statically-allocated
supercombinator nodes; `Value` is used for integer constants, and
allocates an integer node on the heap[^3]. `Arg` and `Local` push a
pointer from the stack back onto the stack, the difference being that
`Arg` expects the indexed value to point to an application node, and
pushes the right pointer of that node.
\begin{code}
data GmInst
  = Push GmVal
  | Slide Int
  | Cond [GmInst] [GmInst]
  | Mkap
  | Eval
  | Add | Sub | Mul | Div | Equ
  | Unwind
  deriving (Eq, Show, Ord)
\end{code}
Here's a quick summary of what the instructions do, in order:

1. `Push`{.haskell} adds something to the stack in one of the ways
described above;

2. `Slide n`{.haskell} does the "save top item, pop `n` items, push top
item" transformation described above;

3. `Cond code_then code_else`{.haskell} expects the top of the stack to
be a pointer to an integer node. If the value pointed to is `0`, it'll
load `code_then` into the program counter; otherwise, it'll load
`code_else`.

4. `Mkap` makes an application node out of the two topmost values on the
stack, and pushes that node's address back onto the stack.

5. `Eval` is one of the most complicated instructions. First, it must
save the topmost element of the stack. In a compiled implementation,
this would be in a scratch register, but in this simulator it's saved as
a local Haskell variable.

    It then saves the stack pointer and program counter onto the dump,
allocates a fresh stack with only the saved value, and loads
`[Unwind]` as the program.

6. `Add`, `Sub`, `Mul`, `Div`, and `Equ` are all self-explanatory. They
all expect the two topmost values on the stack to be numbers in <span
class="definition" title="Weak head-normal form">WHNF</span>.
7. `Unwind`{.haskell} is the most complicated instruction in the
machine. In a compiled implementation, like Rio, the sensible thing to
do for `Unwind`{.haskell} would be to emit a jump to a precompiled
procedure.

    The behaviour of unwinding depends on what's currently on top of
the stack.

    * Unwinding an application node pushes the left pointer (the
function pointer) of the application node onto the stack and
continues unwinding.

    * Unwinding a supercombinator node must check that the stack has
enough pointers to satisfy the combinator's arity. Namely, for a
combinator of arity $N$, the stack must have at least $N + 1$
pointers.

    * Unwinding a number with a non-empty dump must pop the stack
pointer and program counter from the top of the dump and continue
executing, with the number pushed on top of the restored stack.

    * Unwinding a number with an empty dump means the machine is done.
For our simulator, we need to define what the state of the machine
comprises, and implement state transitions corresponding to each of the
instructions above.

\begin{code}
type Addr = Int

data GmNode
  = App Addr Addr
  | SCo String Int [GmInst]
  | Num Int
  deriving (Eq, Show, Ord)

type GmHeap    = Map Addr GmNode
type GmGlobals = Map String Addr
type GmCode    = [GmInst]
type GmDump    = [(GmStack, GmCode)]
type GmStack   = [Addr]
\end{code}
The state of the machine is the pairing (quintupling?) of heap, globals,
code, dump and stack.

<details>
<summary>Support functions for the heap and the state type</summary>

\begin{code}
data GmState =
  GmState { heap    :: GmHeap
          , globals :: GmGlobals
          , stack   :: GmStack
          , code    :: GmCode
          , dump    :: GmDump
          }
  deriving (Eq, Show, Ord)

alloc :: GmNode -> GmHeap -> (Addr, GmHeap)
alloc node heap =
  let (last, _) = Map.findMax heap
   in (last + 1, Map.insert (last + 1) node heap)

num :: GmNode -> Int
num (Num i) = i
num x = error $ "Not a number: " ++ show x

binop :: (Int -> Int -> Int) -> GmState -> GmState
binop fun st@GmState{..} =
  let a:b:xs = stack
      a' = num (heap Map.! a)
      b' = num (heap Map.! b)
      (addr, heap') = alloc (Num (b' `fun` a')) heap
   in st { heap = heap', stack = addr:xs }

reify :: GmState -> GmNode
reify GmState{ stack = addr:_, heap } = heap Map.! addr

graphToDOT :: GmState -> String
graphToDOT GmState{..} = unlines $ "digraph {\n":concatMap go (Map.toList heap)
    ++ [ "stack[color=red]; stack ->" ++ nde (head stack) ++ "; }" ] where
  go (n, node) =
    case node of
      Num i -> [ nde n ++ "[label=" ++ show i ++ "]; " ]
      SCo name _ code -> (nde n ++ "[label=" ++ name ++ "]; "):mapMaybe (codeEdge n) code
      App n' n'' -> [ nde n ++ "[label=\"@\"]", nde n ++ " -> " ++ nde n', nde n ++ " -> " ++ nde n'' ]

  nde i = 'N':show i

  codeEdge i (Push (Global g')) = Just (nde i ++ " -> " ++ nde (globals Map.! g'))
  codeEdge i _ = Nothing
\end{code}
</details>
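To cut down on the boilerplate of spelling states out by hand, here's a
small helper; it's entirely hypothetical (it just mirrors how
`factorial10` is built by hand further down), loading a list of
supercombinators and starting execution at `main`:

```haskell
loadProgram :: [(String, Int, [GmInst])] -> GmState
loadProgram scos =
  GmState { heap    = Map.fromList (zip [0..] [ SCo n a c | (n, a, c) <- scos ])
          , globals = Map.fromList (zip [ n | (n, _, _) <- scos ] [0..])
          , stack   = []
          , dump    = []
          , code    = [Push (Global "main"), Unwind]
          }
```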
Armed with a definition for the machine state, we can implement the main
function `run`, which takes a state to a list of successor states. If
the program represented by some state `initial` terminates, then `last
(run initial)` is the terminal state, containing the single number which
is the result of the program.

\begin{code}
run :: GmState -> [GmState]
run state = state:rest where
  rest
    | final state = []
    | otherwise = run nextState
  nextState = step state
\end{code}
What does it mean for a state to be final, or terminal? Well, if the
machine has no more code to execute, or it's reached WHNF for a value
and has nowhere to return, execution cannot proceed. These are the
final states of our G-machine.

\begin{code}
final :: GmState -> Bool
final GmState{..} = null code || (null dump && whnf) where
  whnf =
    case stack of
      [addr] -> isNum (heap Map.! addr)
      _ -> False

  isNum (Num _) = True
  isNum _ = False
\end{code}
Now we can define the stepper function that takes one state to its
successor:

\begin{code}
step :: GmState -> GmState
step state@GmState{ code = [] } = error "step final state"
step state@GmState{ code = i:is } =
  instruction i state{ code = is }

instruction :: GmInst -> GmState -> GmState
\end{code}
The many cases of the `instruction` function represent the various
transition rules for each instruction we detailed above.

\begin{code}
instruction (Push val) st@GmState{..} =
  case val of
    Global str -> st { stack = globals Map.! str:stack }
    Local i    -> st { stack = (stack !! i):stack }
    Arg i      -> st { stack = getArg (heap Map.! (stack !! (i + 1))):stack }
    Value i    ->
      let (addr, heap') = alloc (Num i) heap
       in st { stack = addr:stack, heap = heap' }
  where getArg (App _ x) = x
\end{code}
Remember that in the `Push (Arg _)`{.haskell} case, the offset points us
to an application node unwound from the spine, so we have to look
through it to find the actual argument.

\begin{code}
instruction Mkap st@GmState{..} =
  let x:f:xs = stack
      (addr, heap') = alloc (App f x) heap
   in st { heap = heap', stack = addr:xs }

instruction (Slide n) st@GmState{..} =
  let a:as = stack in st { stack = a:drop n as }
\end{code}
`Mkap` and `Slide` are very straightforward indeed.

\begin{code}
instruction (Cond t e) st@GmState{..} =
  let a:as = stack
      Num i = heap Map.! a
   in if i == 0
      then st { code = t ++ code, stack = as }
      else st { code = e ++ code, stack = as }
\end{code}

For the `Cond` instruction, we mimic the effect of control flow "joining
up" after an `if` statement by _concatenating_ the given code, instead
of replacing it. Since `Unwind` acts almost like a return statement, one
can skip the join point by ending either branch with an `Unwind`.
\begin{code}
instruction Add st = binop (+) st
instruction Sub st = binop (-) st
instruction Mul st = binop (*) st
instruction Div st = binop div st

instruction Equ st@GmState{..} =
  let a:b:xs = stack
      Num a' = heap Map.! a
      Num b' = heap Map.! b
      (addr, heap') = alloc (Num equal) heap
      equal = if a' == b' then 0 else 1
   in st { heap = heap', stack = addr:xs }
\end{code}
I included `Equ` here as a representative example of all the binary
operations; the rest are defined in terms of a `binop` combinator I hid
in a `<details>`{.html} tag way back when the state type was defined.
Note that `Equ` pushes `0` for "true", matching what `Cond` expects.

The `Eval` instruction needs to save the stack and the code onto the
dump and begin unwinding the top of the stack.

\begin{code}
instruction Eval st@GmState{..} =
  let a:as = stack
   in st { dump = (as, code):dump, code = [Unwind], stack = [a] }
\end{code}
`Unwind` is, by far, the most complicated instruction. We start by
dispatching on the head of the stack.

\begin{code}
instruction Unwind st@GmState{..} =
  case heap Map.! head stack of
\end{code}
If the head is a number, we also have to inspect the dump. If we have
somewhere to return to, we continue there. Otherwise, we're done.

\begin{code}
    Num _ -> case dump of
      (stack', code'):dump' ->
        st { stack = head stack:stack', code = code', dump = dump' }
      [] ->
        st { code = [] }
\end{code}

Application nodes are more interesting. We put the function part of the
app node onto the stack and keep unwinding.

\begin{code}
    App fun _ -> st { stack = fun:stack, code = [Unwind] }
\end{code}

Supercombinator nodes do the arity test and load their code onto the
state if there are enough arguments. Recall that, for a combinator of
arity $N$, the stack needs at least $N + 1$ pointers.

\begin{code}
    SCo _ arity code | length stack >= arity + 1 ->
      st { code = code }

    SCo name _ _ -> error $ "Not enough arguments for supercombinator " ++ name
\end{code}
Here's the code for a factorial program, if you'd like to see it. You can
print the (very non-exciting) result using the functions `reify` and
`run` like this:

```haskell
main = print . reify . last . run $ factorial10
```
<details>
<summary>G-machine code for $10!$, and `factorial10_dumb`</summary>

**Note**: The code below is _much_ better than what I can realistically
implement a compiler for in the space of a blog post. It was hand-tuned
to do the least amount of evaluation necessary. It could, however, be
improved by being made tail-recursive.

**Exercise**: Make the implementation below tail-recursive. That is,
compile the following program:

```haskell
fac 0 acc = acc
fac n !acc = fac (n - 1) (acc * n)

main = fac 10 1
```
<blockquote>
\begin{code}
factorial10 :: GmState
factorial10 =
  GmState { code    = [Push (Global "main"), Unwind]
          , globals = globals
          , stack   = []
          , heap    = heap
          , dump    = []
          }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "fac" 1
          [ Push (Arg 0), Eval, Push (Local 0), Push (Value 0), Equ
          , Cond [ Push (Value 1), Slide 3, Unwind ] []
          , Push (Global "fac")
          , Push (Local 1), Push (Value 1), Sub
          , Mkap, Eval
          , Push (Local 1), Mul
          , Slide 2, Unwind
          ]
      , SCo "main" 0 [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("fac", 0), ("main", 1) ]
\end{code}
What you could expect from Rio is more along the lines of this crime
against humanity:

\begin{code}
factorial10_dumb :: GmState
factorial10_dumb =
  GmState { code    = [Unwind]
          , globals = globals
          , stack   = [5]
          , heap    = heap
          , dump    = []
          }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "if"  3 [ Push (Arg 0), Eval, Cond [ Push (Arg 1) ] [ Push (Arg 2) ], Slide 4, Unwind ]
      , SCo "mul" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Mul, Slide 3, Unwind ]
      , SCo "sub" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Sub, Slide 3, Unwind ]
      , SCo "equ" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Equ, Slide 3, Unwind ]
      , SCo "fac" 1
          [ Push (Global "if"), Push (Global "equ"), Push (Arg 2), Mkap, Push (Value 0), Mkap
          , Mkap, Push (Value 1), Mkap, Push (Global "mul"), Push (Arg 2), Mkap, Push (Global "fac")
          , Push (Global "sub"), Push (Arg 4), Mkap, Push (Value 1), Mkap, Mkap, Mkap
          , Mkap, Slide 2, Unwind ]
      , SCo "main" 0 [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("if", 0), ("mul", 1), ("sub", 2), ("equ", 3), ("fac", 4) ]
\end{code}
</blockquote>
</details>
The G-machine, with no garbage collector, has a tendency to produce
_ridiculously_ large graphs consisting mostly of garbage. For instance,
the graph at the end of reducing `factorial10_dumb` has _271_ nodes,
only one of which isn't garbage. Ouch!

<p class="image">
<img class="centered absolute-unit" src="/static/doom.svg" />
</p>

Those two red nodes? That's the result of the program, and the top of
the stack pointing to it. Yup.

Thankfully, the G-machine makes it easy to write a garbage collector.
Well, in theory, at least. The roots can be found on the stack, and in
all the stacks saved on the dump. Each live supercombinator can also keep
other supercombinators alive by referencing them in `Push (Global _)`
instructions.

Since traversing each supercombinator every GC cycle to identify global
references is expensive, they can each be augmented with a "static
reference table", or SRT for short. In our simulator, this would be a
`Set` of `Addr`s that each supercombinator keeps alive.
\begin{code}
liveAddrs :: GmState -> Set Addr
liveAddrs GmState{..} = roots <> foldMap explore roots where
  roots = Set.fromList stack <> foldMap (Set.fromList . fst) dump

  explore i = Set.insert i $
    case heap Map.! i of
      App x y -> explore x <> explore y
      SCo _ _ code -> foldMap globalRefs code
      _ -> mempty

  globalRefs (Push (Global i)) = Set.singleton (globals Map.! i)
  globalRefs _ = mempty
\end{code}
With the set of live addresses in hand, we can write code to get rid of
all the others. This is a toy collector, in that we conjure up an
entirely new heap, containing only the live nodes, to replace the old
one.

\begin{code}
scavenge :: GmState -> GmState
scavenge st@GmState{..} = st { heap = Map.filterWithKey (\k _ -> is_live k) heap } where
  live = liveAddrs st
  is_live x = x `Set.member` live
\end{code}
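One way to see the effect (and roughly how pictures like the ones here
can be reproduced) is to render the scavenged final state with the
`graphToDOT` helper from the details block above:

```haskell
main :: IO ()
main = putStrLn . graphToDOT . scavenge . last . run $ factorial10_dumb
```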
Running `scavenge` on the final state of `factorial10_dumb` gets us a much
better looking graph:

<p class="image">
<img class="centered" src="/static/not-doom.svg" />
</p>

\ignore{
\begin{code}
#endif
\end{code}
}
Possible Extensions
===================

1. Data structures. This is covered in the book, but I didn't have
space/time to cover it here. The core idea is that the graph gets a new
kind of node, `Constr Int [Addr]`, that stores a tag and some fixed
number of addresses. Pattern-matching `case` expressions can then take
apart these `Constr` nodes and branch based on the integer tag.

1. Support for I/O. By threading an explicit state variable, a guaranteed
order of effects can be achieved even in lazy code. Let me tell you a
secret: this is what GHC does.

    ```haskell
    newtype IO a = IO { runIO# :: State# RealWorld -> (# State# RealWorld, a #) }
    ```

    The `State# RealWorld`{.haskell} value is consumed by each foreign
function, i.e. everything that _actually_ does I/O, looking a lot
like a state monad; in reality, the `RealWorld`{.haskell} is made of
lies. `State#`{.haskell} has return kind `TYPE (TupleRep
'[])`{.haskell}, i.e., it takes up no bits at runtime.

    However, by having every foreign function be strict in _some_
variable, no matter how fake it is, we can guarantee the order of
effects: each function depends directly on the function "before" it.
There's a sketch of this trick, with an ordinary value in place of the
unboxed token, just after this list.

1. Parallelism. Lazy graph reduction lends itself nicely to parallelism.
One could envision a machine where a number of worker threads are each
working on a different redex. To prevent weird parallelism issues from
cropping up, graph nodes would need to be lockable. However, only `@`
nodes will ever be locked, so that might lead to an optimisation.

    As an alternative to a regular lock, the implementation could
replace each node under evaluation by a _black hole_, which doesn't
keep any other values alive (thus _possibly_ getting rid of some
space leaks). Each black hole would maintain a queue of threads that
tried to evaluate it, to be woken up once the result is available.
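Here's that sketch: a toy version of state threading with an ordinary,
boxed token standing in for GHC's unboxed `State# RealWorld`, so it's
purely illustrative and has none of the zero-cost properties of the
real thing.

```haskell
data World = World  -- a pure stand-in for "the real world"

printInt :: Int -> World -> ((), World)
printInt _ w = ((), w)  -- pretend this one performs actual I/O

twoPrints :: World -> ((), World)
twoPrints w0 =
  let (_, w1) = printInt 1 w0  -- must run first: w1 depends on w0
      (_, w2) = printInt 2 w1  -- must run second: it needs w1
   in ((), w2)
```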
Conclusion
==========

This post was long. And it _still_ didn't cover a lot of stuff about the
G-machine, such as how to compile _to_ the G-machine (expect a follow-up
post on that) and how to compile _from_ the G-machine (expect a
follow-up post on that too!)

Assembling G-machine instructions is actually simpler than it seems.
With the exception of `Eval` and `Unwind`, which are common and large
enough to warrant pre-assembled helpers, all G-machine instructions
assemble to no more than a handful of x86 instructions. As an entirely
contextless example, here's how `Cond` instructions are assembled in
Rio:
```haskell
compileGInst (Cond c_then c_else) = do
  pop rbx
  cmp (int64 0) (intv_off `quadOff` rbx)
  rec
    jne else_label
    traverse_ compileGInst c_then
    jmp exit_label
    else_label <- genLabel
    traverse_ compileGInst c_else
    exit_label <- genLabel
  pure ()
```
This is one of the most complicated instructions to assemble, since the
compiler has to do the impedance matching between the G-machine
abstraction of "instruction lists" and the assembler's labels. Other
instructions, such as `Pop` (not documented here), have a much clearer
translation:

```haskell
compileGInst (Pop n) = add (int64 (n * 8)) rsp
```

Keep in mind that the x86 stack grows downwards, so adding corresponds
to popping. The only difference between the actual machine and the
G-machine here is that the latter works in terms of addresses and the
former works in terms of bytes.
The code to make an `App` node is similarly simple, using Haskell almost
as a macro assembler. The variable `hp` is defined in the code generator
and RTS headers to be `r10`, such that both the C support code and the
generated assembly can agree on where the heap is.

```haskell
compileGInst Mkap = do
  mov (int8 tag_AP) (tag_off `byteOff` hp)
  pop (arg_off `quadOff` hp)
  pop (fun_off `quadOff` hp)
  push hp
  hp += int64 valueSize
```
Allocating in Rio is as simple as writing the value you want, saving
`hp` somewhere, then bumping it by the size of a value. We can do this
because the amount a given supercombinator allocates is statically
known, so we can do a heap satisfaction check once, at the start of the
combinator, and then just build our graphs free of worry.
<details>
<summary>A function to count how much a supercombinator allocates is
easy to write using folds.</summary>

```haskell
entry :: Foldable f => f GmCode -> BlockBuilder ()
entry code
  | bytes_alloced > 0
  = do lea (bytes_alloced `quadOff` hp) r10
       cmp hpLim r10
       ja (Label "collect_garbage")
  | otherwise = pure ()
  where
    bytes_alloced = foldl' cntBytes 0 code

    cntBytes x MkAp             = valueSize + x
    cntBytes x (Push (Value _)) = valueSize + x
    cntBytes x (Alloc n)        = n * valueSize + x
    cntBytes x (Cond xs ys)     = foldl' cntBytes 0 xs + foldl' cntBytes 0 ys + x
    cntBytes x _                = x
```
</details>
To sum up, hopefully without dragging around a huge chain of thunks in
memory, I'd like to thank everyone who made it to the end of this
grueling, exceedingly niche article. If you liked it, and were perhaps
inspired to write a G-machine of your own, please let me know!

[^1]: I'd prefer the plural "redices".
[^2]: Which I'm not going to draw here because it's going to be rendered at an absurd size.
[^3]: A real implementation could use pointer tagging instead.

[Rio]: https://github.com/plt-hokusai/rio
[a Literate Haskell source file]: /pages/posts/2020-01-31-lazy-eval.lhs
<!-- vim: fdm=marker
-->