This note is dedicated to compiling self-compilers.
Say that we agree with David MacQueen that ML is a wonderful domain-specific language to write compilers, in that it provides expressive support to implement a compiler that we find excellent. (Note: ML is really a family of programming languages and OCaml is a member of this family.) We also happen to have a brand new computer, for which somehow there is only an executable compiler from C to assembly language. Pictorially:
Finally, we have an executable byte-code interpreter (i.e., a virtual machine) for this assembly language. Pictorially:
We want to write an ML compiler for our new computer. Naturally, we wish to write it in ML, not in C.
Everything should be built top-down,except the first time.—Alan Perlis‘s programming epigram #11
The rest of this section describes how to bootstrap our ML compiler so that we can run it on our new computer. (Bootstraps are straps at the top of cowboy boots: to put on his boots, a cowboy inserts his feet in his boots, sticks a finger in each strap, and pulls the boots towards him. Metaphorically, if the cowboy pulls vigorously enough, he can lift himself off the ground, a paradox that has been illustrated for time travel in Robert Heinlein‘s short stories By His Bootstraps and —All You Zombies— (now a movie), back in the days.)
We write, in C, a throw-away compiler from ML to assembly language. This compiler is quick and dirty in that it does not compile very well (i.e., it does not generate good compiled code), and it is not efficient (i.e., it does not run fast). It must, however, be correct:
We label this quick-and-dirty compiler with the number 0 (zero).
Using the executable C compiler, we compile the quick-and-dirty compiler into assembly language. The result is a compiler from ML to assembly language written in assembly language:
This compiled compiler is still quick and dirty: it does not work fast and it generates inefficient code. We label it with the number 1 (one).
We write, in ML, a gorgeous compiler from ML to assembly language:
This gorgeous compiler is written to work fast and to generate efficient code. We label it with the number 2 (two).
Using the compiled compiler from Phase 1 and the interpreter for A written in x86, we compile the gorgeous compiler:
The result is a compiler from ML to assembly language written in assembly language. This compiler generates the same efficient code as the gorgeous compiler. It is however not fast because it was produced by a quick-and-dirty compiler that does not generate fast code. We label it with the number 3 (three).
Using the new compiler (i.e., the result of Phase 3), we compile the gorgeous compiler:
The result is a compiler from ML to assembly language written in assembly language. This compiler generates the same efficient code as the gorgeous compiler. It is also fast because it was produced by a good (if slow) compiler. We label it with the number 4 (four).
As a verification, if we use the compiler from Phase 4 to compile the gorgeous compiler, we should obtain textually the same compiled code as in Phase 4.
Too much of a good thing can be wonderful.
Indeed, the compiled compilers (4 and 4’) are obtained from the same gorgeous compiler and using the same algorithm that the gorgeous compiler implements. (We assume this algorithm to be deterministic: given two identical source programs, it produces two identical target programs.)
So all in all, the gorgeous ML compiler has been bootstrapped: given
we have obtained a gorgeous compiler from ML to A written in A.
A word about correctness:
If a compiled program behaves incorrectly, it it because:
Finding this error can take an arbitrarily long time.
Alfrothul: And if there are several errors?Harald (sighing): Then it takes even longer to find them.Alfrothul: How do you know there is an error somewhere?Harald: Well, you have to look for it.Loki: Or you use the strategy of the three code monkeys.Harald: The what of the what?Loki: The strategy of the three code monkeys: you don’t look and you don’t listen.Harald: You don’t say.Loki: Exactly.Alfrothul: Er, guys? There is an exercise now and it is mandatory.Loki: Attaboy.
Suppose we don’t write a quick-and-dirty compiler from ML to assembly language written in C but a quick-and-dirty interpreter for ML written in C. Can we still bootstrap the gorgeous ML compiler? (In other words, given an x86 microprocessor, a compiler from C to A written in x86, an interpreter for A written in x86, a quick-and-dirty interpreter for ML written in C, and a gorgeous compiler from ML to A written in ML, can we obtain a gorgeous compiler from ML to A written in A?)