Chapters 2–11 have described the fundamental components of a good compiler: a front end, which does lexical analysis, parsing, construction of abstract syntax, type-checking, and translation to intermediate code; and a back end, which does instruction selection, dataflow analysis, and register allocation.
What lessons have we learned? I hope that the reader has learned about the algorithms used in different components of a compiler and the interfaces used to connect the components. But the author has also learned quite a bit from the exercise.
My goal was to describe a good compiler that is, to use Einstein's phrase, “as simple as possible – but no simpler.” I will now discuss the thorny issues that arose in designing Tiger and its compiler.
Nested functions. Tiger has nested functions, requiring some mechanism (such as static links) for implementing access to nonlocal variables. But many programming languages in widespread use – C, C++, Java – do not have nested functions or static links. The Tiger compiler would become simpler without nested functions, for then variables would not escape, and the FindEscape phase would be unnecessary. But there are two reasons for explaining how to compile nonlocal variables. First, there are programming languages where nested functions are extremely useful – these are the functional languages described in Chapter 15.
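For illustration, here is a minimal sketch – written in ML, the implementation language used in this book, rather than in Tiger itself – of a nested function whose body mentions a variable of the enclosing function. It is exactly this kind of nonlocal access that makes a variable escape and forces it to be reachable through a static link or closure.

(* a minimal sketch: g is nested inside f and refers to f's parameter x,
   so x is a nonlocal (escaping) variable from g's point of view *)
fun f x =
  let
    fun g y = x + y        (* access to x must go through f's frame *)
  in
    g 1 + g 2
  end

val result = f 10          (* 23; both calls to g see the same x *)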
Over the past decade, there have been several shifts in the way compilers are built. New kinds of programming languages are being used: object-oriented languages with dynamic methods, functional languages with nested scope and first-class function closures; and many of these languages require garbage collection. New machines have large register sets and a high penalty for memory access, and can often run much faster with compiler assistance in scheduling instructions and managing instructions and data for cache locality.
This book is intended as a textbook for a one- or two-semester course in compilers. Students will see the theory behind different components of a compiler, the programming techniques used to put the theory into practice, and the interfaces used to modularize the compiler. To make the interfaces and programming examples clear and concrete, I have written them in the ML programming language. Other editions of this book are available that use the C and Java languages.
Implementation project. The “student project compiler” that I have outlined is reasonably simple, but is organized to demonstrate some important techniques that are now in common use: abstract syntax trees to avoid tangling syntax and semantics, separation of instruction selection from register allocation, copy propagation to give flexibility to earlier phases of the compiler, and containment of target-machine dependencies. Unlike many “student compilers” found in textbooks, this one has a simple but sophisticated back end, allowing good register allocation to be done after instruction selection.
A compiler was originally a program that “compiled” subroutines [a link-loader]. When in 1954 the combination “algebraic compiler” came into use, or rather into misuse, the meaning of the term had already shifted into the present one.
Bauer and Eickel [1975]
This book describes techniques, data structures, and algorithms for translating programming languages into executable code. A modern compiler is often organized into many phases, each operating on a different abstract “language.” The chapters of this book follow the organization of a compiler, each covering a successive phase.
To illustrate the issues in compiling real programming languages, I show how to compile Tiger, a simple but nontrivial language of the Algol family, with nested scope and heap-allocated records. Programming exercises in each chapter call for the implementation of the corresponding phase; a student who implements all the phases described in Part I of the book will have a working compiler. Tiger is easily modified to be functional or object-oriented (or both), and exercises in Part II show how to do this. Other chapters in Part II cover advanced techniques in program optimization. Appendix A describes the Tiger language.
The interfaces between modules of the compiler are almost as important as the algorithms inside the modules. To describe the interfaces concretely, it is useful to write them down in a real programming language. This book uses the ML programming language.
lex-i-cal: of or relating to words or the vocabulary of a language as distinguished from its grammar and construction
Webster's Dictionary
To translate a program from one language into another, a compiler must first pull it apart and understand its structure and meaning, then put it together in a different way. The front end of the compiler performs analysis; the back end does synthesis.
The analysis is usually broken up into
Lexical analysis: breaking the input into individual words or “tokens”;
Syntax analysis: parsing the phrase structure of the program; and
Semantic analysis: calculating the program's meaning.
The lexical analyzer takes a stream of characters and produces a stream of names, keywords, and punctuation marks; it discards white space and comments between the tokens. It would unduly complicate the parser to have to account for possible white space and comments at every possible point; this is the main reason for separating lexical analysis from parsing.
Lexical analysis is not very complicated, but we will attack it with high-powered formalisms and tools, because similar formalisms will be useful in the study of parsing and similar tools have many applications in areas other than compilation.
LEXICAL TOKENS
A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language. A programming language classifies lexical tokens into a finite set of token types.
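For example, the fragment below sketches in ML a token type for a small Tiger-like language; the constructor names are illustrative, not the ones used by any particular lexer module.

(* a minimal sketch of a finite set of token types *)
datatype token =
    ID of string           (* identifiers such as x or count *)
  | NUM of int             (* integer literals *)
  | IF | THEN | ELSE       (* reserved words *)
  | PLUS | GT | ASSIGN
  | LPAREN | RPAREN | SEMI
  | EOF

(* the character stream   if x > 3 then y := y + 1
   would become the token stream
   IF, ID "x", GT, NUM 3, THEN, ID "y", ASSIGN, ID "y", PLUS, NUM 1 *)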
Heap-allocated records that are not reachable by any chain of pointers from program variables are garbage. The memory occupied by garbage should be reclaimed for use in allocating new records. This process is called garbage collection, and is performed not by the compiler but by the runtime system (the support programs linked with the compiled code).
Ideally, we would say that any record that is not dynamically live (will not be used in the future of the computation) is garbage. But, as Section 10.1 explains, it is not always possible to know whether a variable is live. So we will use a conservative approximation: we will require the compiler to guarantee that any live record is reachable; we will ask the compiler to minimize the number of reachable records that are not live; and we will preserve all reachable records, even if some of them might not be live.
Figure 13.1 shows a Tiger program ready to undergo garbage collection (at the point marked garbage-collect here). There are only three program variables in scope: p, q, and r.
MARK-AND-SWEEP COLLECTION
Program variables and heap-allocated records form a directed graph. The variables are roots of this graph. A node n is reachable if there is a path of directed edges r → … → n starting at some root r. A graph-search algorithm such as depth-first search (Algorithm 13.2) can mark all the reachable nodes.
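The sketch below gives the flavor of the marking phase in ML, over a toy representation in which heap records are array indices and fields lists each record's pointer fields; it is only an illustration of the idea, not Algorithm 13.2 itself, and a real collector traverses raw heap records.

(* minimal sketch of depth-first marking over an explicit object graph *)
fun mark (fields : int list array, marked : bool array, roots : int list) =
  let
    fun dfs n =
      if Array.sub (marked, n)
      then ()                                       (* already marked: stop *)
      else ( Array.update (marked, n, true)         (* mark node n *)
           ; List.app dfs (Array.sub (fields, n)) ) (* then mark its children *)
  in
    List.app dfs roots                              (* start from the program variables *)
  end

The sweep phase would then scan the whole heap and reclaim every record whose mark bit is still false.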
reg-is-ter: a device for storing small amounts of data
al-lo-cate: to apportion for a specific purpose
Webster's Dictionary
The Translate, Canon, and Codegen phases of the compiler assume that there are an infinite number of registers to hold temporary values and that move instructions cost nothing. The job of the register allocator is to assign the many temporaries to a small number of machine registers, and, where possible, to assign the source and destination of a move to the same register so that the move can be deleted.
From an examination of the control and dataflow graph, we derive an interference graph. Each node in the interference graph represents a temporary value; each edge (t1, t2) indicates a pair of temporaries that cannot be assigned to the same register. The most common reason for an interference edge is that t1 and t2 are live at the same time. Interference edges can also express other constraints; for example, if a certain instruction a ← b ⊕ c cannot produce results in register r12 on our machine, we can make a interfere with r12.
Next we color the interference graph. We want to use as few colors as possible, but no pair of nodes connected by an edge may be assigned the same color. Graph coloring problems derive from the old mapmakers' rule that adjacent countries on a map should be colored with different colors.
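The sketch below shows the idea with a naive greedy coloring in ML, taking the interference graph as an adjacency list and k available colors (registers); a real register allocator uses simplification, coalescing, and spilling rather than a fixed node order.

(* a minimal sketch of greedy coloring: nodes are integers, adj gives each
   node's interference edges, and k is the number of machine registers *)
fun greedyColor (adj : (int * int list) list, k : int) =
  let
    fun colorOf coloring n =
      case List.find (fn (m, _) => m = n) coloring of
          SOME (_, c) => SOME c
        | NONE        => NONE
    fun pick (node, neighbors) coloring =
      let
        val used = List.mapPartial (colorOf coloring) neighbors
        val free = List.filter
                     (fn c => not (List.exists (fn u => u = c) used))
                     (List.tabulate (k, fn i => i))
      in
        case free of
            c :: _ => (node, c) :: coloring   (* assign the lowest free color *)
          | []     => coloring                (* no color left: potential spill *)
      end
  in
    List.foldl (fn (entry, coloring) => pick entry coloring) [] adj
  end

The result is an association list mapping each temporary to a register number; any node missing from it received no color and would have to be spilled to memory.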
mem-o-ry: a device in which information can be inserted and stored and from which it may be extracted when wanted
hi-er-ar-chy: a graded or ranked series
Webster's Dictionary
An idealized random access memory (RAM) has N words indexed by integers such that any word can be fetched or stored – using its integer address – equally quickly. Hardware designers can make a big slow memory, or a small fast memory, but a big fast memory is prohibitively expensive. Also, one thing that speeds up access to memory is its nearness to the processor, and a big memory must have some parts far from the processor no matter how much money might be thrown at the problem.
Almost as good as a big fast memory is the combination of a small fast cache memory and a big slow main memory; the program keeps its frequently used data in cache and the rarely used data in main memory, and when it enters a phase in which datum x will be frequently used it may move x from the slow memory to the fast memory.
It's inconvenient for the programmer to manage multiple memories, so the hardware does it automatically. Whenever the processor wants the datum at address x, it looks first in the cache, and – we hope – usually finds it there. If there is a cache miss – x is not in the cache – then the processor fetches x from main memory and places a copy of x in the cache so that the next reference to x will be a cache hit.
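The sketch below models that lookup in ML for a direct-mapped cache; the line count and block size are made-up parameters for illustration, and in reality both the search and the fill are performed entirely by hardware.

val lines = 256                 (* illustrative number of cache lines *)
val wordsPerLine = 4            (* illustrative block size, in words *)

(* which memory block (if any) currently occupies each cache line *)
val resident = Array.array (lines, NONE : int option)

fun access addr =
  let
    val block = addr div wordsPerLine
    val line  = block mod lines
  in
    if Array.sub (resident, line) = SOME block
    then true                                        (* hit: the datum is in the cache *)
    else ( Array.update (resident, line, SOME block) (* miss: fetch the block from     *)
         ; false )                                   (* main memory and install it     *)
  end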
The front end of the compiler translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register if a and b are never “in use” at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.
Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is live if it holds a value that may be needed in the future, so this analysis is called liveness analysis.
To perform analyses on a program, it is often useful to make a control-flow graph. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y. Graph 10.1 shows the flow graph for a simple loop.
Let us consider the liveness of each variable (Figure 10.2). A variable is live if its current value will be used in the future, so we analyze liveness by working from the future to the past. Variable b is used in statement 4, so b is live on the 3 → 4 edge.
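A compact way to see this “working from the future to the past” is the iterative calculation sketched below in ML; the succ, use, and def functions describing the flow graph are assumed inputs, sets are represented as plain lists of temporary names, and the sets grow until nothing changes.

(* minimal sketch of iterative liveness: n statements numbered 0..n-1;
   succ i = successors of i, use i = temporaries read by statement i,
   def i = temporaries written by statement i *)
fun liveness (n, succ, use, def) =
  let
    val liveIn  = Array.array (n, [] : string list)
    val liveOut = Array.array (n, [] : string list)
    fun member x ys = List.exists (fn y => y = x) ys
    fun union (xs, ys) = xs @ List.filter (fn y => not (member y xs)) ys
    fun minus (xs, ys) = List.filter (fn x => not (member x ys)) xs
    fun subset xs ys = List.all (fn x => member x ys) xs
    fun equalSets (xs, ys) = subset xs ys andalso subset ys xs
    fun round () =
      let
        val changed = ref false
        fun doNode i =
          let
            (* out[i] = union of in[s] over all successors s of i *)
            val out = List.foldl (fn (s, acc) => union (acc, Array.sub (liveIn, s)))
                                 [] (succ i)
            (* in[i] = use[i] together with (out[i] minus def[i]) *)
            val inn = union (use i, minus (out, def i))
          in
            if equalSets (inn, Array.sub (liveIn, i))
               andalso equalSets (out, Array.sub (liveOut, i))
            then ()
            else ( Array.update (liveIn, i, inn)
                 ; Array.update (liveOut, i, out)
                 ; changed := true )
          end
      in
        List.app doNode (List.tabulate (n, fn i => i));
        if !changed then round () else ()
      end
  in
    round ();
    (liveIn, liveOut)
  end

On the loop of Graph 10.1, this computation would discover, among other facts, that b is live on the 3 → 4 edge, just as the hand analysis above concluded.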