---
title: "The G-machine In Detail, or How Lazy Evaluation Works"
date: January 31, 2020
maths: true
---

\long\def\ignore#1{}

\ignore{
\begin{code}
{-# LANGUAGE RecordWildCards, NamedFieldPuns, CPP #-}
#if !defined(Section)
#error "You haven't specified a section to load! Re-run with -DSection=1 or -DSection=2"
#endif
#if defined(Section) && (Section != 1 && Section != 2)
#error Section "isn't a valid section to load! Re-run with -DSection=1 or -DSection=2"
#endif
\end{code}
}
<script src="https://cdn.jsdelivr.net/npm/@svgdotjs/svg.js/dist/svg.min.js"></script>

<style>
.diagram {
  background-color: #ddd;
  min-height: 10em;
}

.diagram-contained {
  height: 100%;
}

.diagram-container {
  display: flex;
  flex-direction: row;
}

.picture-container {
  display: flex;
  flex-direction: row;
  overflow-x: scroll;
  justify-content: space-between;
}

.picture {
  display: flex;
  flex-direction: column;
  width: 80ch;
  margin-left: 2em;
  margin-right: 2em;
}

.center {
  justify-content: center;
  width: 80%;
  max-width: 50%;
  margin: 0 auto;
}

.instruction {
  font-family: monospace;
  color: #af005f;
}

.operand {
  font-family: monospace;
  color: #268bd2;
}

img.centered {
  width: 10em;
  margin: auto;
}

img.big {
  width: 80ch;
  height: 200px;
  margin: auto;
}

img.absolute-unit {
  width: 80ch;
  height: 500px;
  margin: auto;
}

img.two-img {
  padding-left: 3em;
  padding-right: 2em;
}

p.image {
  text-align: center !important;
}
</style>

<noscript>
This post has several interactive components that won't work without
JavaScript. These will be clearly indicated. Regardless, I hope that you
can still appreciate the prose and code.
</noscript>
With Haskell now more popular than ever, a great many programmers
deal with lazy evaluation in their daily lives. They're aware of the
pitfalls of lazy I/O, know not to use `foldl`, and are masters at
introducing bang patterns in the right place. But very few programmers
know the magic behind lazy evaluation—graph reduction.

This post is an abridged adaptation of Simon Peyton Jones and David R.
Lester's book, _Implementing Functional Languages: a tutorial_,
itself a refinement of SPJ's previous work, 1987's _The Implementation
of Functional Programming Languages_. The newer book doesn't cover as
much material as the previous one: it focuses mostly on the evaluation of
functional programs, and indeed that is our focus today as well. For
this, it details three abstract machines: the G-machine, the Three
Instruction Machine (affectionately called Tim), and a parallel
G-machine.

In this post we'll take a look first at a stack-based machine for
reducing arithmetic expressions. Armed with the knowledge of how typical
stack machines work, we'll take a look at the G-machine, and how graph
reduction works (and where the name comes from in the first place!)

This post is written as [a Literate Haskell source file], with CPP
conditionals to enable/disable each section. To compile a specific
section, use GHC like this:
```bash
ghc -XCPP -DSection=1 2020-01-31-lazy-eval.lhs
```
-----

\ignore{
\begin{code}
{-# LANGUAGE CPP #-}
#if Section == 1
\end{code}
}

\begin{code}
module StackArith where
\end{code}

Section 1: Evaluating Arithmetic with a Stack
=============================================
Stack machines are the basis for all of the computation models we're
going to explore today. To get a better feel for how they work, the first
model of computation we're going to describe is stack-based arithmetic,
better known as reverse Polish notation. This machine also forms the
basis of the programming language FORTH. First, let us define a data
type for arithmetic expressions, including the four basic operators
(addition, subtraction, multiplication, and division).
\begin{code}
data AExpr
  = Lit Int
  | Add AExpr AExpr
  | Sub AExpr AExpr
  | Mul AExpr AExpr
  | Div AExpr AExpr
  deriving (Eq, Show, Ord)
\end{code}
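For instance, the expression $2 + 3 \times 4$, which we'll meet again
when we compile it later in this section, is represented like this:

```haskell
example :: AExpr
example = Add (Lit 2) (Mul (Lit 3) (Lit 4))
```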
This language has an 'obvious' denotation, which can be realised using
an interpreter function, such as `aInterpret` below.

\begin{code}
aInterpret :: AExpr -> Int
aInterpret (Lit n) = n
aInterpret (Add e1 e2) = aInterpret e1 + aInterpret e2
aInterpret (Sub e1 e2) = aInterpret e1 - aInterpret e2
aInterpret (Mul e1 e2) = aInterpret e1 * aInterpret e2
aInterpret (Div e1 e2) = aInterpret e1 `div` aInterpret e2
\end{code}
Alternatively, we can implement the language through its _operational_
behaviour, by compiling it to a series of instructions that, when
executed in an appropriate machine, leave it in a _final state_ from
which we can extract the expression's result.

Our abstract machine for arithmetic will be a _stack_ based machine with
only a handful of instructions. The type of instructions is
`AInstr`{.haskell}.
\begin{code}
data AInstr
  = Push Int
  | IAdd | IMul | ISub | IDiv
  deriving (Eq, Show, Ord)
\end{code}
The state of the machine is simply a pair, containing an instruction
stream and a stack of values. By our compilation scheme, the machine is
never in a state where more values are required on the stack than there
are values present; this would not be the case if we let programmers
write instruction streams directly.

We can compile a program into a sequence of instructions recursively.
\begin{code}
aCompile :: AExpr -> [AInstr]
aCompile (Lit i)     = [Push i]
aCompile (Add e1 e2) = aCompile e1 ++ aCompile e2 ++ [IAdd]
aCompile (Mul e1 e2) = aCompile e1 ++ aCompile e2 ++ [IMul]
aCompile (Sub e1 e2) = aCompile e1 ++ aCompile e2 ++ [ISub]
aCompile (Div e1 e2) = aCompile e1 ++ aCompile e2 ++ [IDiv]
\end{code}
And we can write a function to represent the state transition rules of
the machine. Note that, since the left operand is compiled (and thus
pushed) first, it ends up _below_ the right operand on the stack, so the
non-commutative operators have to account for this.

\begin{code}
aEval :: ([AInstr], [Int]) -> ([AInstr], [Int])
aEval (Push i:xs, st)   = (xs, i:st)
aEval (IAdd:xs, x:y:st) = (xs, (y + x):st)
aEval (IMul:xs, x:y:st) = (xs, (y * x):st)
aEval (ISub:xs, x:y:st) = (xs, (y - x):st)
aEval (IDiv:xs, x:y:st) = (xs, (y `div` x):st)
\end{code}
A state is said to be _final_ when it has an empty instruction stream
and a single result on the stack. To run a program, we simply repeat
`aEval` until a final state is reached.

\begin{code}
aRun :: [AInstr] -> Int
aRun is = go (is, []) where
  go st | Just i <- final st = i
  go st = go (aEval st)

  final ([], [n]) = Just n
  final _ = Nothing
\end{code}
A very important property linking our compiler, abstract machine and
interpreter together is that of _compiler correctness_. That is:

```haskell
forall x. aRun (aCompile x) == aInterpret x
```
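This property is a good candidate for property-based testing. The
sketch below is not part of this post's compiled code: it assumes the
QuickCheck library is available, and leaves `Div` out of the generator
so that neither semantics trips over division by zero.

```haskell
import Test.QuickCheck

instance Arbitrary AExpr where
  arbitrary = sized go where
    go 0 = Lit <$> arbitrary
    go n = oneof [ Lit <$> arbitrary
                 , Add <$> half <*> half
                 , Sub <$> half <*> half
                 , Mul <$> half <*> half ]
      where half = go (n `div` 2)

-- ghci> quickCheck prop_compilerCorrect
prop_compilerCorrect :: AExpr -> Bool
prop_compilerCorrect e = aRun (aCompile e) == aInterpret e
```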
As an example, the arithmetic expression $2 + 3 \times 4$ produces the
following code sequence:

```haskell
[Push 2,Push 3,Push 4,IMul,IAdd]
```
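Stepping this program with `aEval`, one instruction at a time, runs
through the following (instruction stream, stack) states:

```haskell
([Push 2, Push 3, Push 4, IMul, IAdd], [])
([Push 3, Push 4, IMul, IAdd],         [2])
([Push 4, IMul, IAdd],                 [3, 2])
([IMul, IAdd],                         [4, 3, 2])
([IAdd],                               [12, 2])
([],                                   [14])
```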
You can interactively follow the execution of this program with the tool
below. Pressing the Step button is equivalent to `aEval`. The stack is
drawn in boxes to the left, and the instruction sequence is presented on
the right, where the `>` marks the currently executing instruction (the
"program counter", if you will).

<noscript>
You seem to have opted out of the interactive visualisations :(
</noscript>
<div class="center">
<div class="diagram diagram-container">
<div class="diagram-contained">
<div class="diagram" id="forth">
</div>
<button id="step" onclick="step()">Step</button>
<button onclick="reset()">Reset</button>
</div>
<div id="code" style="min-width: 10em;">
</div>
</div>
</div>

<script src="/static/forth_machine.js"></script>
\ignore{
\begin{code}
#elif Section == 2
\end{code}
}

---

Section 1.75: A Functional Program
==================================
In the previous section, we looked at how stack machines can be used to
implement arithmetic. This is nothing exciting, though: FORTH is from
the late 1960s! In this section, we're going to look at a _much_ more
modern idea, only 30-something years old, which uses stack machines to
implement _functional_ languages via _lazy graph reduction_.

But first, we need to understand what that technobabble means in the
first place. We define a functional language to be one in which the
evaluation of a program expression is the same as evaluating a
mathematical function: when you're executing a "function application",
substitute the actual value of the argument wherever the parameter
appears in the body of the function, then reduce any _reducible
expressions_.
<blockquote>
<div style="font-size: 15pt;">
$$
( \lambda{x}. x + 2 )\ 5
$$

Evaluation of a functional program starts by identifying a _reducible
expression_, that is, an expression that isn't "done" evaluating yet. By
convention, we call reducible expressions redexes for short[^1], and
expressions that are done evaluating are called _head-normal forms_.
Every application is a reducible expression. Here, reduction proceeds by
substituting $5$ in the place of every mention of $x$. Substituting an
expression $E_2$ in place of the variable $v$, in a bigger expression
$E_1$, is notated $E_1[E_2/v]$ (read "$E_1$ with $E_2$ for $v$").

$$
(x + 2)[5/x]
$$

This step of the evaluation isn't exactly an expression, but it serves
to illustrate what reducing a $\lambda$ expression does: replacing the
bound variable (the "formal parameter", in fancy-pants speak; I'll
stick to bound variable) with the argument.

$$
(5 + 2)
$$

By this step, the function has disappeared entirely. The expression has
been replaced with addition between numbers.

Of course, addition, when both sides have been evaluated to a number, is
_itself_ a redex. This program isn't done yet.

$$
7
$$

Replacing the addition by its value, our original program has reached
its end: the number $7$, and indeed any other number, is a head-normal
form.
</div>
</blockquote>
This all sounds good when described on paper, but how does one actually
wire up (or, well, program) a computer to reduce functional programs?

Among the first and most comprehensive answers to this question was the
G-machine, whose G stands for "Graph". More specifically, the G-machine
is an implementation of _graph reduction_: the expression to be reduced
is represented as a graph that might have some redexes.

Once the machine has identified some particular redex to reduce, it'll
evaluate exactly as much as is needed to reach a head-normal form, and
_replace_ (or update) the graph so that the old redex points to its
normal form.

To explore the workings of the G-machine, we'll need to choose a
functional language. Any will do, but simpler is better. Since I've
already written a Lazy ML that compiles as described in this post, we'll
go with that.

[Rio]'s core language is a very simple functional language, notable only
in that _it doesn't have $\lambda$-abstractions_. All functions are
defined at top-level, in the form of supercombinators.

<blockquote>
A **supercombinator** is a function that only refers to its arguments or
other supercombinators.
</blockquote>
There's a data type for terms:

```haskell
data Term
  = Let [(Var, Term)] Term
  | Letrec [(Var, Term)] Term
  | App Term Term
  | Ref Var
  | Num Integer
  deriving Show
```

And one for supercombinators:

```haskell
data SC = SC { name :: Var, args :: [Var], body :: Term }
  deriving Show
```
Consider the reduction of this functional program:

```haskell
double x = x + x
main = double (double 4)
```
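Encoded with the data types above, this program would look something
like the sketch below. It assumes `Var` is a synonym for `String`, and
that the primitive `+` can be referred to by name, neither of which is
pinned down by the snippets above.

```haskell
doubleSC, mainSC :: SC
doubleSC = SC { name = "double", args = ["x"]
              , body = App (App (Ref "+") (Ref "x")) (Ref "x") }
mainSC   = SC { name = "main", args = []
              , body = App (Ref "double") (App (Ref "double") (Num 4)) }
```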
Here, `double` and `main` are the supercombinators that constitute the
program. By convention, execution starts with the supercombinator
`main`.

<p class="image">
<img class="centered" src="/diagrams/template/step1.svg" />
</p>

The initial graph is the trivial graph containing only the node `main`
and no edges. Since the node points directly to a supercombinator, we
can replace it by a copy of its body:

<p class="image">
<img class="centered" src="/diagrams/template/step2.svg" />
</p>
Now starts the actual work. There are many strategies for selecting a
redex, and all of them are equally valid, with the caveat that some may
not terminate. However, if _any_ evaluation strategy terminates, then so
does "always choose the outermost redex". This strategy is called normal
order evaluation, and it's what the G-machine implements.

The outermost redex here is the outer application of `double`, so that's
where reduction will happen. To reduce an application, update the redex
with a copy of the supercombinator body, and replace the bound variables
with pointers to the arguments.

<p class="image">
<img class="centered" src="/diagrams/template/step3.svg" />
</p>
Observe that, since the subexpression `double 4` has two edges leading
into it, the _tree_ representing the program has degenerated into a
general graph. However, this isn't a bad thing: it means that the work
to evaluate `double 4` will only be needed once.

The application of $+$ isn't reducible yet because it requires its
arguments to be evaluated, so the next reducible expression down the
chain is the application node representing `double 4`. The expansion
there is similarly simple.

Here, it's a bit hard to see what's actually going on, so I'll highlight
in <span style="color: #0984e3">blue</span> the _whole_ next redex, `4 + 4`.
<div class="image"> <!-- reduction + highlight {{{ -->
<div class="mathpar">
<div style="flex-direction: column; padding-right: 2em;">
<img class="centered two-img" src="/diagrams/template/step4.svg" />
<p style="max-width: 32ch;">
The state of the graph after reduction of `double 4`.
</p>
</div>
<div style="flex-direction: column; padding-left: 2em;">
<img class="centered two-img" src="/diagrams/template/step4red.svg" />
<p style="max-width: 32ch;">
... with the entirety of the next redex highlighted for clarity.
</p>
</div>
</div>
</div> <!-- }}} -->
But, wait. That redex has _two_ application nodes, but the expression it
represents is just `4 + 4` (with the `4`s shared, so more like `let x =
4 in x + x`, but still). What gives?

Most formal treatments of functional languages, this included (to
the extent that you can call Rio and a blog post "formal"), use
_currying_ to represent functions of multiple arguments. That is,
instead of having built-in support for things like

```javascript
let f = function(x, y) {
  /* takes two arguments (arity = 2) */
}
```

we encode a function of many arguments using nested lambda expressions,
as in $\lambda x. \lambda y. x + y$. That's why the application `4 + 4`,
or, better stated, `(+) 4 4`, has two application nodes.
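At the syntax level, the highlighted redex corresponds to a term like
the one below (sharing is really a property of the graph, so `Let` is
only an approximation here; the same `Var`-is-`String` and nameable-`+`
assumptions as before apply):

```haskell
fourPlusFour :: Term
fourPlusFour = Let [("x", Num 4)]
  (App (App (Ref "+") (Ref "x")) (Ref "x"))
```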
With that in mind, the entire blue subgraph can be zapped away to become
the number 8.

<p class="image">
<img class="centered" src="/diagrams/template/step5.svg" />
</p>

And finally, the last redex, `8 + 8`, can be zapped entirely into the
number 16[^2].
---

\begin{code}
module Gm where

import qualified Data.Map.Strict as Map
import Data.Map.Strict (Map, (!))

import qualified Data.Set as Set
import Data.Set (Set)
\end{code}

\ignore{
\begin{code}
import Data.Maybe
\end{code}
}
Section 2: The G-machine
========================

After seeing in detail the reduction of a simple expression, one might
start to form in their head an idea of an algorithm to reduce a
functional program. As SPJ put it:

<blockquote>
1. Find the next redex.
2. Reduce it.
3. Update the root of the redex with its reduct.
</blockquote>

With these three easy steps, functional programs be!

Of course, that glosses over three major difficulties:

1. How does one find the next redex?
2. How does one reduce it?
3. How does one update the graph?

Of these, only the answer to 3 is simple: "Overwrite it with an
indirection" (we'll get there). To answer the first two, and to do the
third efficiently, we're going to use an _abstract machine_: the
G-machine.
<details>
<summary>What's an abstract machine?</summary>

An abstract machine isn't, as the similar-sounding name might imply, a
virtual machine. Indeed, these concepts are so easily confused that the
most popular abstract machine in existence has "virtual machine" in its
name. I'm talking about LLVM, of course.

Abstract machines are simply formalisms used to aid in the
implementation of compilers. Of course, one might write an execution
engine for such a machine (a "simulator", one could say), and even use
that as an actual execution model for your language (like OCaml uses the
ZINC machine).

In this, they are more closely related to intermediate languages than
virtual machines.
</details>

Let's tackle these problems in turn.
How does one find the next redex?
---------------------------------

Consider the following expression graph. It has an interesting feature
in that it (almost certainly) constitutes a redex. How do we know that?

<p class="image">
<img class="centered" src="/diagrams/gm/spine.svg" />
</p>

Well, I've used the least subtle blue possible to highlight the _spine_
of the expression graph. By starting at the root (the topmost node), and
following every left pointer until reaching a supercombinator, one can
find the spine of the graph.

Moreover, if we use a stack to remember the addresses that we visited on
our way down, we'll have _unwound_ the spine.
<p class="image">
<img class="centered big" src="/diagrams/gm/spine+stack.svg" />
</p>

<details>
<summary>A note on stack addressing</summary>

Following x86 convention, our stack grows _downwards_, so that the first
element in the diagram above would be the one pointing to `f`.
</details>

The third address in the stack is the root of the redex, and the first
address points to a supercombinator. If the number of pointers on the
stack is greater than or equal to the number of arguments the
supercombinator expects (plus one, to account for the supercombinator
node itself), we've spotted a redex.
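As a sketch, unwinding can be written against the heap representation
we'll define with the simulator below (`GmHeap`, `Addr` and the node
type are introduced there); the result lists the supercombinator first
and the root last, just like the stack in the diagram:

```haskell
unwindSpine :: GmHeap -> Addr -> [Addr]
unwindSpine heap = go [] where
  go spine addr = case heap Map.! addr of
    App fun _ -> go (addr:spine) fun  -- follow the left pointer down
    _         -> addr:spine           -- a supercombinator: spine found
```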
How does one reduce it?
-----------------------

This depends on the nature of the redex, of course; reducing a
supercombinator is not the same as reducing an arithmetic function, for
example.

**Supercombinator redexes** are easy enough. If the stack has enough
arguments, then we can just replace the root of the redex (in our
addressing model, this coincides with the stack pointer used to fetch
the last argument) with a copy of the body of the supercombinator,
substituting the arguments in the correct places.

**Constant applicative forms**, or CAFs, are supercombinators with no
arguments. Their reduction is much the same as with a normal
supercombinator, except that when the time comes to update the graph, we
need to update the supercombinator _itself_ with an indirection.

**Primitive redexes**, on the other hand, will require a bit more
machinery. For instance, what should we do in the situation above, where
the argument to `+` was itself a redex?

<p class="image">
<img class="centered" src="/diagrams/template/step3.svg" />
</p>
There needs to be a way to evaluate the argument `double 4` to head
normal form, then continue reducing the application of `+`. Every
programming language has to deal with this, and our solution is more of
the same: use a stack.

The G-machine already has a stack, though, so we need another one. A
stack of stacks, and of return addresses, called the _dump_. When a
primitive operation needs the value of one of its arguments, it first
saves that argument from the stack, then pushes the stack pointer and
program counter onto the dump (this is the G-machine's concept of a
return address); the saved argument is pushed onto an empty stack, and
the graph is unwound starting from that argument.

When unwinding encounters a node in head-normal form, and there's a
saved return address on the dump, we pop that, restore the stack
pointers, and jump to the saved program counter.

The idea behind the G-machine is that we can teach each supercombinator
to make an instance of its own body by compiling it to a series of
small, atomic instructions. This solves the hardest problem in
implementing functional languages, which is the whole "replacing the
root of the redex with a copy of the supercombinator body" I glossed
over.
An Example
----------

Let's consider the (fragment of a) functional program below.

```haskell
f g x = K (g x)
```

Compiling it into G-machine instructions results in the following
sequence:

```haskell
Push (Arg 1)
Push (Arg 3)
Mkap
Push (Global K)
Mkap
Slide 3
Unwind
```
These diagrams show how the code for `f` would execute.

<div class="picture-container"> <!-- {{{ -->
<div class="picture" id="fig1.1">
<img class="tikzpicture" src="/diagrams/gm/entry.svg" />
Fig. 1: Diagram of the stack and the heap ("graph") after entering the
$f$ supercombinator.
</div>
<div class="picture" id="fig1.2">
<img class="tikzpicture" src="/diagrams/gm/push_x.svg" />
Fig. 2: State of the machine after executing `Push (Arg 1)`.
</div>
<div class="picture" id="fig1.3">
<img class="tikzpicture" src="/diagrams/gm/push_g.svg" />
Fig. 3: State of the machine after executing `Push (Arg 3)`.
</div>
<div class="picture" id="fig1.4">
<img class="tikzpicture" src="/diagrams/gm/app_gx.svg" />
Fig. 4: State of the machine after executing `Mkap`.
</div>
<div class="picture" id="fig1.5">
<img class="tikzpicture" src="/diagrams/gm/push_k.svg" />
Fig. 5: State of the machine after executing `Push (Global K)`.
</div>
<div class="picture" id="fig1.6">
<img class="tikzpicture" src="/diagrams/gm/app_kgx.svg" />
Fig. 6: State of the machine after executing `Mkap`.
</div>
<div class="picture" id="fig1.7">
<img class="tikzpicture" src="/diagrams/gm/slide_3.svg" />
Fig. 7: State of the machine after executing `Slide 3`.
</div>
</div> <!-- }}} -->
When jumping to the code of `f`, the stack would look as it does in
figure 1. The expression graph has been unwound, and the stack has
pointers to the application nodes that we'll use to fetch the actual
arguments.

The first thing we do is take pointers to the arguments `g` and `x` from
their application nodes and put them on the stack. This is shown in
figures 2 and 3.

Keep in mind that `Arg 0` would refer to the bottom-most stack location,
so (on entry to the function) `Arg 1` refers to the first argument.
However, when we push onto the stack, the offsets to reach the arguments
shift by one, and so what would be `Arg 2` has to become `Arg 3`.

The instruction `Mkap` takes the two newest pointers and makes an
application node, denoted `@`, from them. The newest value on the stack
is taken as the function (the node's left edge) and the value above that
is the argument (the node's right edge).

By figure 4, we're not done yet. `Push (Global K)` has the sole effect
of pushing a pointer to the supercombinator `K` onto the stack, as shown
in figure 5; after yet another `Mkap`, we've finished building the body
of `f`.

The G-machine presented above, unlike the one implemented in Rio, is not
lazy; the abrupt transition between figures 6 and 7 shows that,
instead of updating the graph, we just discard the old stuff that was
there with a `Slide 3` instruction.

"Slide" is a weird little instruction that doesn't correspond to any
conventional stack operation. Its effect (save the newest value on the
stack, discard the `n` values following it, then push the saved value
back) is best described by the Haskell function below:

```haskell
slide n (x:xs) = x:drop n xs
```
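For example:

```haskell
-- ghci> slide 2 [8, 1, 2, 3]
-- [8, 3]
```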
Implementing the G-machine
--------------------------

First and foremost we'll need a type for our machine's instructions.
`GmVal`{.haskell} represents anything that can be pushed onto the stack,
and only exists to avoid having four different `Push` instructions.

\begin{code}
data GmVal
  = Global String
  | Value Int
  | Arg Int
  | Local Int
  deriving (Eq, Show, Ord)
\end{code}

The addressing mode `Global` is only used for statically-allocated
supercombinator nodes; `Value` is used for integer constants, and
allocates an integer node on the heap[^3]. `Arg` and `Local` push a
pointer from the stack back onto the stack, the difference being that
`Arg` expects the indexed value to point to an application node, and
pushes the right pointer of that node.
\begin{code}
data GmInst
  = Push GmVal
  | Slide Int
  | Cond [GmInst] [GmInst]
  | Mkap
  | Eval
  | Add | Sub | Mul | Div | Equ
  | Unwind
  deriving (Eq, Show, Ord)
\end{code}
Here's a quick summary of what the instructions do, in order:

1. `Push`{.haskell} adds something to the stack in one of the ways
described above;

2. `Slide n`{.haskell} does the "save top item, pop `n` items, push top
item" transformation described above;

3. `Cond code_then code_else`{.haskell} expects the top of the stack to
be a pointer to an integer node. If the value pointed to is `0`, it'll
load `code_then` into the program counter; otherwise, it'll load
`code_else`.

4. `Mkap` makes an application node out of the two topmost values on the
stack, and pushes that node's address back onto the stack.

5. `Eval` is one of the most complicated instructions. First, it must
save the topmost element of the stack. In a compiled implementation,
this would be in a scratch register, but in this simulator it's saved as
a local Haskell variable.

    It then saves the stack pointer and program counter onto the dump,
allocates a fresh stack with only the saved value, and loads
`[Unwind]` as the program.

6. `Add`, `Sub`, `Mul`, `Div`, and `Equ` are all self-explanatory. They
all expect the two topmost values on the stack to be numbers in <span
class="definition" title="Weak head-normal form">WHNF</span>.
7. `Unwind`{.haskell} is the most complicated instruction in the
machine. In a compiled implementation, like Rio, the sensible thing to
do for `Unwind`{.haskell} would be to emit a jump to a precompiled
procedure.

    The behaviour of unwinding depends on what's currently on top of
the stack.

    * Unwinding an application node pushes the left pointer (the
function pointer) of the application node onto the stack and
continues unwinding.

    * Unwinding a supercombinator node must check that the stack has
enough pointers to satisfy the combinator's arity. Namely, for a
combinator of arity $N$, the stack must have at least $N + 1$
pointers.

    * Unwinding a number with a non-empty dump must pop the stack
pointer and program counter from the top of the dump and continue
executing, with the number pushed on top of the restored stack.

    * Unwinding a number with an empty dump means the machine is done.
For our simulator, we need to define what the state of the machine
comprises, and implement state transitions corresponding to each of the
instructions above.

\begin{code}
type Addr = Int

data GmNode
  = App Addr Addr
  | SCo String Int [GmInst]
  | Num Int
  deriving (Eq, Show, Ord)

type GmHeap    = Map Addr GmNode
type GmGlobals = Map String Addr
type GmCode    = [GmInst]
type GmDump    = [(GmStack, GmCode)]
type GmStack   = [Addr]
\end{code}
The state of the machine is the pairing (quintupling?) of heap, globals,
code, dump and stack.

<details>
<summary>Support functions for the heap and the state type</summary>

\begin{code}
data GmState =
  GmState { heap    :: GmHeap
          , globals :: GmGlobals
          , stack   :: GmStack
          , code    :: GmCode
          , dump    :: GmDump
          }
  deriving (Eq, Show, Ord)

alloc :: GmNode -> GmHeap -> (Addr, GmHeap)
alloc node heap =
  let (last, _) = Map.findMax heap
   in (last + 1, Map.insert (last + 1) node heap)

num :: GmNode -> Int
num (Num i) = i
num x = error $ "Not a number: " ++ show x

binop :: (Int -> Int -> Int) -> GmState -> GmState
binop fun st@GmState{..} =
  let a:b:xs = stack
      a' = num (heap Map.! a)
      b' = num (heap Map.! b)
      (addr, heap') = alloc (Num (b' `fun` a')) heap
   in st { heap = heap', stack = addr:xs }

reify :: GmState -> GmNode
reify GmState{ stack = addr:_, heap } = heap Map.! addr

graphToDOT :: GmState -> String
graphToDOT GmState{..} = unlines $ "digraph {\n":concatMap go (Map.toList heap)
    ++ [ "stack[color=red]; stack ->" ++ nde (head stack) ++ "; }" ] where
  go (n, node) =
    case node of
      Num i -> [ nde n ++ "[label=" ++ show i ++ "]; " ]
      SCo name _ code -> (nde n ++ "[label=" ++ name ++ "]; "):mapMaybe (codeEdge n) code
      App n' n'' -> [ nde n ++ "[label=\"@\"]", nde n ++ " -> " ++ nde n', nde n ++ " -> " ++ nde n'' ]

  nde i = 'N':show i

  codeEdge i (Push (Global g')) = Just (nde i ++ " -> " ++ nde (globals Map.! g'))
  codeEdge i _ = Nothing
\end{code}
</details>
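To cut down on the boilerplate of spelling states out by hand, here's a
small helper; it's entirely hypothetical (it just mirrors how
`factorial10` is built by hand further down), loading a list of
supercombinators and starting execution at `main`:

```haskell
loadProgram :: [(String, Int, [GmInst])] -> GmState
loadProgram scos =
  GmState { heap    = Map.fromList (zip [0..] [ SCo n a c | (n, a, c) <- scos ])
          , globals = Map.fromList (zip [ n | (n, _, _) <- scos ] [0..])
          , stack   = []
          , dump    = []
          , code    = [Push (Global "main"), Unwind]
          }
```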
Armed with a definition for the machine state, we can implement the main
function `run`, which takes a state to a list of successor states. If
the program represented by some state `initial` terminates, then `last
(run initial)` is the terminal state, containing the single number which
is the result of the program.

\begin{code}
run :: GmState -> [GmState]
run state = state:rest where
  rest
    | final state = []
    | otherwise = run nextState
  nextState = step state
\end{code}
What does it mean for a state to be final, or terminal? Well, if the
machine has no more code to execute, or it's reached WHNF for a value
and has nowhere to return, execution cannot proceed. These are the
final states of our G-machine.

\begin{code}
final :: GmState -> Bool
final GmState{..} = null code || (null dump && whnf) where
  whnf =
    case stack of
      [addr] -> isNum (heap Map.! addr)
      _ -> False

  isNum (Num _) = True
  isNum _ = False
\end{code}
Now we can define the stepper function that takes one state to its
successor:

\begin{code}
step :: GmState -> GmState
step state@GmState{ code = [] } = error "step final state"
step state@GmState{ code = i:is } =
  instruction i state{ code = is }

instruction :: GmInst -> GmState -> GmState
\end{code}
The many cases of the `instruction` function represent the various
transition rules for each instruction we detailed above.

\begin{code}
instruction (Push val) st@GmState{..} =
  case val of
    Global str -> st { stack = globals Map.! str:stack }
    Local i    -> st { stack = (stack !! i):stack }
    Arg i      -> st { stack = getArg (heap Map.! (stack !! (i + 1))):stack }
    Value i    ->
      let (addr, heap') = alloc (Num i) heap
       in st { stack = addr:stack, heap = heap' }
  where getArg (App _ x) = x
\end{code}
Remember that in the `Push (Arg _)`{.haskell} case, the offset points us
to an application node unwound from the spine, so we have to look
through it to find the actual argument.

\begin{code}
instruction Mkap st@GmState{..} =
  let x:f:xs = stack
      (addr, heap') = alloc (App f x) heap
   in st { heap = heap', stack = addr:xs }

instruction (Slide n) st@GmState{..} =
  let a:as = stack in st { stack = a:drop n as }
\end{code}
`Mkap` and `Slide` are very straightforward indeed.

\begin{code}
instruction (Cond t e) st@GmState{..} =
  let a:as = stack
      Num i = heap Map.! a
   in if i == 0
      then st { code = t ++ code, stack = as }
      else st { code = e ++ code, stack = as }
\end{code}

For the `Cond` instruction, we mimic the effect of control flow "joining
up" after an `if` statement by _concatenating_ the given code, instead
of replacing it. Since `Unwind` acts almost like a return statement, one
can skip the join point by ending either branch with an `Unwind`.
\begin{code}
instruction Add st = binop (+) st
instruction Sub st = binop (-) st
instruction Mul st = binop (*) st
instruction Div st = binop div st

instruction Equ st@GmState{..} =
  let a:b:xs = stack
      Num a' = heap Map.! a
      Num b' = heap Map.! b
      (addr, heap') = alloc (Num equal) heap
      equal = if a' == b' then 0 else 1
   in st { heap = heap', stack = addr:xs }
\end{code}
I included `Equ` here as a representative example of all the binary
operations; the rest are defined in terms of a `binop` combinator I hid
in a `<details>`{.html} tag way back when the state type was defined.
Note that `Equ` pushes `0` for "true", matching what `Cond` expects.

The `Eval` instruction needs to save the stack and the code onto the
dump and begin unwinding the top of the stack.

\begin{code}
instruction Eval st@GmState{..} =
  let a:as = stack
   in st { dump = (as, code):dump, code = [Unwind], stack = [a] }
\end{code}
`Unwind` is, by far, the most complicated instruction. We start by
dispatching on the head of the stack.

\begin{code}
instruction Unwind st@GmState{..} =
  case heap Map.! head stack of
\end{code}
If the head is a number, we also have to inspect the dump. If we have
somewhere to return to, we continue there. Otherwise, we're done.

\begin{code}
    Num _ -> case dump of
      (stack', code'):dump' ->
        st { stack = head stack:stack', code = code', dump = dump' }
      [] ->
        st { code = [] }
\end{code}

Application nodes are more interesting. We put the function part of the
app node onto the stack and keep unwinding.

\begin{code}
    App fun _ -> st { stack = fun:stack, code = [Unwind] }
\end{code}

Supercombinator nodes do the arity test and load their code onto the
state if there are enough arguments. Recall that, for a combinator of
arity $N$, the stack needs at least $N + 1$ pointers.

\begin{code}
    SCo _ arity code | length stack >= arity + 1 ->
      st { code = code }

    SCo name _ _ -> error $ "Not enough arguments for supercombinator " ++ name
\end{code}
Here's the code for a factorial program, if you'd like to see it. You can
print the (very non-exciting) result using the functions `reify` and
`run` like this:

```haskell
main = print . reify . last . run $ factorial10
```
<details>
<summary>G-machine code for $10!$, and `factorial10_dumb`</summary>

**Note**: The code below is _much_ better than what I can realistically
implement a compiler for in the space of a blog post. It was hand-tuned
to do the least amount of evaluation necessary. It could, however, be
improved by being made tail-recursive.

**Exercise**: Make the implementation below tail-recursive. That is,
compile the following program:

```haskell
fac 0 acc = acc
fac n !acc = fac (n - 1) (acc * n)

main = fac 10 1
```
<blockquote>
\begin{code}
factorial10 :: GmState
factorial10 =
  GmState { code    = [Push (Global "main"), Unwind]
          , globals = globals
          , stack   = []
          , heap    = heap
          , dump    = []
          }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "fac" 1
          [ Push (Arg 0), Eval, Push (Local 0), Push (Value 0), Equ
          , Cond [ Push (Value 1), Slide 3, Unwind ] []
          , Push (Global "fac")
          , Push (Local 1), Push (Value 1), Sub
          , Mkap, Eval
          , Push (Local 1), Mul
          , Slide 2, Unwind
          ]
      , SCo "main" 0 [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("fac", 0), ("main", 1) ]
\end{code}
What you could expect from Rio is more along the lines of this crime
against humanity:

\begin{code}
factorial10_dumb :: GmState
factorial10_dumb =
  GmState { code    = [Unwind]
          , globals = globals
          , stack   = [5]
          , heap    = heap
          , dump    = []
          }
  where
    heap = Map.fromList . zip [0..] $
      [ SCo "if"  3 [ Push (Arg 0), Eval, Cond [ Push (Arg 1) ] [ Push (Arg 2) ], Slide 4, Unwind ]
      , SCo "mul" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Mul, Slide 3, Unwind ]
      , SCo "sub" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Sub, Slide 3, Unwind ]
      , SCo "equ" 2 [ Push (Arg 0), Eval, Push (Arg 2), Eval, Equ, Slide 3, Unwind ]
      , SCo "fac" 1
          [ Push (Global "if"), Push (Global "equ"), Push (Arg 2), Mkap, Push (Value 0), Mkap
          , Mkap, Push (Value 1), Mkap, Push (Global "mul"), Push (Arg 2), Mkap, Push (Global "fac")
          , Push (Global "sub"), Push (Arg 4), Mkap, Push (Value 1), Mkap, Mkap, Mkap
          , Mkap, Slide 2, Unwind ]
      , SCo "main" 0 [ Push (Global "fac"), Push (Value 10), Mkap, Slide 1, Unwind ]
      ]
    globals = Map.fromList [ ("if", 0), ("mul", 1), ("sub", 2), ("equ", 3), ("fac", 4) ]
\end{code}
</blockquote>
</details>
The G-machine, with no garbage collector, has a tendency to produce
_ridiculously_ large graphs consisting mostly of garbage. For instance,
the graph at the end of reducing `factorial10_dumb` has _271_ nodes,
only one of which isn't garbage. Ouch!

<p class="image">
<img class="centered absolute-unit" src="/static/doom.svg" />
</p>

Those two red nodes? That's the result of the program, and the top of
the stack pointing to it. Yup.

Thankfully, the G-machine makes it easy to write a garbage collector.
Well, in theory, at least. The roots can be found on the stack, and in
all the stacks saved on the dump. Each live supercombinator can also keep
other supercombinators alive by referencing them in `Push (Global _)`
instructions.

Since traversing each supercombinator every GC cycle to identify global
references is expensive, they can each be augmented with a "static
reference table", or SRT for short. In our simulator, this would be a
`Set` of `Addr`s that each supercombinator keeps alive.
\begin{code}
liveAddrs :: GmState -> Set Addr
liveAddrs GmState{..} = roots <> foldMap explore roots where
  roots = Set.fromList stack <> foldMap (Set.fromList . fst) dump

  explore i = Set.insert i $
    case heap Map.! i of
      App x y -> explore x <> explore y
      SCo _ _ code -> foldMap globalRefs code
      _ -> mempty

  globalRefs (Push (Global i)) = Set.singleton (globals Map.! i)
  globalRefs _ = mempty
\end{code}
With the set of live addresses in hand, we can write code to get rid of
all the others. This is a toy collector, in that we conjure up an
entirely new heap, containing only the live nodes, to replace the old
one.

\begin{code}
scavenge :: GmState -> GmState
scavenge st@GmState{..} = st { heap = Map.filterWithKey (\k _ -> is_live k) heap } where
  live = liveAddrs st
  is_live x = x `Set.member` live
\end{code}
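One way to see the effect (and roughly how pictures like the ones here
can be reproduced) is to render the scavenged final state with the
`graphToDOT` helper from the details block above:

```haskell
main :: IO ()
main = putStrLn . graphToDOT . scavenge . last . run $ factorial10_dumb
```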
Running `scavenge` on the final state of `factorial10_dumb` gets us a much
better looking graph:

<p class="image">
<img class="centered" src="/static/not-doom.svg" />
</p>

\ignore{
\begin{code}
#endif
\end{code}
}
Possible Extensions
===================

1. Data structures. This is covered in the book, but I didn't have
space/time to cover it here. The core idea is that the graph gets a new
kind of node, `Constr Int [Addr]`, that stores a tag and some fixed
number of addresses. Pattern-matching `case` expressions can then take
apart these `Constr` nodes and branch based on the integer tag.

1. Support for I/O. By threading an explicit state variable, a guaranteed
order of effects can be achieved even in lazy code. Let me tell you a
secret: this is what GHC does.

    ```haskell
    newtype IO a = IO { runIO# :: State# RealWorld -> (# State# RealWorld, a #) }
    ```

    The `State# RealWorld`{.haskell} value is consumed by each foreign
function, i.e. everything that _actually_ does I/O, looking a lot
like a state monad; in reality, the `RealWorld`{.haskell} is made of
lies. `State#`{.haskell} has return kind `TYPE (TupleRep
'[])`{.haskell}, i.e., it takes up no bits at runtime.

    However, by having every foreign function be strict in _some_
variable, no matter how fake it is, we can guarantee the order of
effects: each function depends directly on the function "before" it.
There's a sketch of this trick, with an ordinary value in place of the
unboxed token, just after this list.

1. Parallelism. Lazy graph reduction lends itself nicely to parallelism.
One could envision a machine where a number of worker threads are each
working on a different redex. To prevent weird parallelism issues from
cropping up, graph nodes would need to be lockable. However, only `@`
nodes will ever be locked, so that might lead to an optimisation.

    As an alternative to a regular lock, the implementation could
replace each node under evaluation by a _black hole_, which doesn't
keep any other values alive (thus _possibly_ getting rid of some
space leaks). Each black hole would maintain a queue of threads that
tried to evaluate it, to be woken up once the result is available.
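Here's that sketch: a toy version of state threading with an ordinary,
boxed token standing in for GHC's unboxed `State# RealWorld`, so it's
purely illustrative and has none of the zero-cost properties of the
real thing.

```haskell
data World = World  -- a pure stand-in for "the real world"

printInt :: Int -> World -> ((), World)
printInt _ w = ((), w)  -- pretend this one performs actual I/O

twoPrints :: World -> ((), World)
twoPrints w0 =
  let (_, w1) = printInt 1 w0  -- must run first: w1 depends on w0
      (_, w2) = printInt 2 w1  -- must run second: it needs w1
   in ((), w2)
```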
Conclusion
==========

This post was long. And it _still_ didn't cover a lot of stuff about the
G-machine, such as how to compile _to_ the G-machine (expect a follow-up
post on that) and how to compile _from_ the G-machine (expect a
follow-up post on that too!)

Assembling G-machine instructions is actually simpler than it seems.
With the exception of `Eval` and `Unwind`, which are common and large
enough to warrant pre-assembled helpers, all G-machine instructions
assemble to no more than a handful of x86 instructions. As an entirely
contextless example, here's how `Cond` instructions are assembled in
Rio:
```haskell
compileGInst (Cond c_then c_else) = do
  pop rbx
  cmp (int64 0) (intv_off `quadOff` rbx)
  rec
    jne else_label
    traverse_ compileGInst c_then
    jmp exit_label
    else_label <- genLabel
    traverse_ compileGInst c_else
    exit_label <- genLabel
  pure ()
```
This is one of the most complicated instructions to assemble, since the
compiler has to do the impedance matching between the G-machine
abstraction of "instruction lists" and the assembler's labels. Other
instructions, such as `Pop` (not documented here), have a much clearer
translation:

```haskell
compileGInst (Pop n) = add (int64 (n * 8)) rsp
```

Keep in mind that the x86 stack grows downwards, so adding corresponds
to popping. The only difference between the actual machine and the
G-machine here is that the latter works in terms of addresses and the
former works in terms of bytes.
The code to make an `App` node is similarly simple, using Haskell almost
as a macro assembler. The variable `hp` is defined in the code generator
and RTS headers to be `r10`, such that both the C support code and the
generated assembly can agree on where the heap is.

```haskell
compileGInst Mkap = do
  mov (int8 tag_AP) (tag_off `byteOff` hp)
  pop (arg_off `quadOff` hp)
  pop (fun_off `quadOff` hp)
  push hp
  hp += int64 valueSize
```
Allocating in Rio is as simple as writing the value you want, saving
`hp` somewhere, then bumping it by the size of a value. We can do this
because the amount a given supercombinator allocates is statically
known, so we can do a heap satisfaction check once, at the start of the
combinator, and then just build our graphs free of worry.
<details>
<summary>A function to count how much a supercombinator allocates is
easy to write using folds.</summary>

```haskell
entry :: Foldable f => f GmCode -> BlockBuilder ()
entry code
  | bytes_alloced > 0
  = do lea (bytes_alloced `quadOff` hp) r10
       cmp hpLim r10
       ja (Label "collect_garbage")
  | otherwise = pure ()
  where
    bytes_alloced = foldl' cntBytes 0 code

    cntBytes x MkAp             = valueSize + x
    cntBytes x (Push (Value _)) = valueSize + x
    cntBytes x (Alloc n)        = n * valueSize + x
    cntBytes x (Cond xs ys)     = foldl' cntBytes 0 xs + foldl' cntBytes 0 ys + x
    cntBytes x _                = x
```
</details>
To sum up, hopefully without dragging around a huge chain of thunks in
memory, I'd like to thank everyone who made it to the end of this
grueling, exceedingly niche article. If you liked it, and were perhaps
inspired to write a G-machine of your own, please let me know!

[^1]: I'd prefer the plural "redices".
[^2]: Which I'm not going to draw here because it's going to be rendered at an absurd size.
[^3]: A real implementation could use pointer tagging instead.

[Rio]: https://github.com/plt-hokusai/rio
[a Literate Haskell source file]: /pages/posts/2020-01-31-lazy-eval.lhs
<!-- vim: fdm=marker
-->