dom-i-nate: to exert the supreme determining or guiding influence on
Webster's Dictionary
Many dataflow analyses need to find the use-sites of each defined variable or the definition-sites of each variable used in an expression. The def-use chain is a data structure that makes this efficient: for each statement in the flow graph, the compiler can keep a list of pointers to all the use sites of variables defined there, and a list of pointers to all definition sites of the variables used there. In this way the compiler can hop quickly from use to definition to use to definition.
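In C, one way to sketch this data structure (the names here are illustrative, not from any particular compiler) is to attach two lists of statement pointers to every statement node:

    struct stm;                     /* a statement (node) in the flow graph */

    struct stmList {                /* a linked list of statement pointers */
        struct stm *head;
        struct stmList *tail;
    };

    struct stm {
        /* ... opcode, operands, successors in the flow graph ... */
        struct stmList *useSites;   /* statements that use a variable defined here */
        struct stmList *defSites;   /* statements that define a variable used here */
    };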
An improvement on the idea of def-use chains is static single-assignment form, or SSA form, an intermediate representation in which each variable has only one definition in the program text. The one (static) definition-site may be in a loop that is executed many (dynamic) times, thus the name static single-assignment form instead of single-assignment form (in which variables are never redefined at all).
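For example (an illustrative straight-line fragment), converting to SSA form renames each assignment so that it defines a fresh variable, and renames each use to refer to the definition that reaches it:

    a ← x + y                 a1 ← x + y
    b ← a − 1       becomes   b1 ← a1 − 1
    a ← y + b                 a2 ← y + b1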
The SSA form is useful for several reasons:
Dataflow analysis and optimization algorithms can be made simpler when each variable has only one definition.
If a variable has N uses and M definitions (which occupy about N + M instructions in a program), it takes space (and time) proportional to N · M to represent def-use chains – a quadratic blowup (see Exercise 19.8).
A useful software-engineering principle is information hiding or encapsulation. A module may provide values of a given type, but the representation of that type is known only to the module. Clients of the module may manipulate the values only through operations provided by the module. In this way, the module can assure that the values always meet consistency requirements of its own choosing.
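In C this principle is often realized with an opaque pointer type: the header file declares the type and its operations, and only the implementation file knows the struct layout. A minimal sketch, with made-up names:

    /* table.h -- clients see only this interface */
    typedef struct table *T_table;   /* representation hidden in table.c */

    T_table T_empty(void);                               /* a fresh, empty table */
    T_table T_enter(T_table t, char *key, void *value);  /* bind key to value    */
    void   *T_look(T_table t, char *key);                /* look up key, or NULL */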
Object-oriented programming languages are designed to support information hiding. Because the “values” may have internal state that the operations will modify, it makes sense to call them objects. Because a typical “module” manipulates only one type of object, we can eliminate the notion of module and (syntactically) treat the operations as fields of the objects, where they are called methods.
Another important characteristic of object-oriented languages is the notion of extension or inheritance. If some program context (such as the formal parameter of a function or method) expects an object that supports methods m1,m2,m3, then it will also accept an object that supports m1,m2,m3,m4.
CLASSES
To illustrate the techniques of compiling object-oriented languages I will use a simple class-based object-oriented language called Object-Tiger.
We extend the Tiger language with new declaration syntax to create classes:
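(A sketch of the declaration form; the field and method names here are illustrative, and the bodies are ordinary Tiger declarations and expressions.)

    class B extends A {
        var count := 0
        method add(i: int) = (count := count + i)
    }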
The declaration class B extends A { … } declares a new class B that extends the class A. This declaration must be in the scope of the let expression that declares A. All the fields and methods of A implicitly belong to B.
trans-late: to turn into one's own or another language
Webster's Dictionary
The semantic analysis phase of a compiler must translate abstract syntax into abstract machine code. It can do this after type-checking, or at the same time.
Though it is possible to translate directly to real machine code, this hinders portability and modularity. Suppose we want compilers for N different source languages, targeted to M different machines. In principle this is N · M compilers (Figure 7.1a), a large implementation task.
An intermediate representation (IR) is a kind of abstract machine language that can express the target-machine operations without committing to too much machine-specific detail. But it is also independent of the details of the source language. The front end of the compiler does lexical analysis, parsing, semantic analysis, and translation to intermediate representation. The back end does optimization of the intermediate representation and translation to machine language.
A portable compiler translates the source language into IR and then translates the IR into machine language, as illustrated in Figure 7.1b. Now only N front ends and M back ends are required. Such an implementation task is more reasonable.
Even when only one front end and one back end are being built, a good IR can modularize the task, so that the front end is not complicated with machine-specific details, and the back end is not bothered with information specific to one source language.
Over the past decade, there have been several shifts in the way compilers are built. New kinds of programming languages are being used: object-oriented languages with dynamic methods, functional languages with nested scope and first-class function closures; and many of these languages require garbage collection. New machines have large register sets and a high penalty for memory access, and can often run much faster with compiler assistance in scheduling instructions and managing instructions and data for cache locality.
This book is intended as a textbook for a one- or two-semester course in compilers. Students will see the theory behind different components of a compiler, the programming techniques used to put the theory into practice, and the interfaces used to modularize the compiler. To make the interfaces and programming examples clear and concrete, I have written them in the C programming language. Other editions of this book are available that use the Java and ML languages.
Implementation project. The “student project compiler” that I have outlined is reasonably simple, but is organized to demonstrate some important techniques that are now in common use: abstract syntax trees to avoid tangling syntax and semantics, separation of instruction selection from register allocation, copy propagation to give flexibility to earlier phases of the compiler, and containment of target-machine dependencies. Unlike many “student compilers” found in textbooks, this one has a simple but sophisticated back end, allowing good register allocation to be done after instruction selection.
ca-non-i-cal: reduced to the simplest or clearest schema possible
Webster's Dictionary
The trees generated by the semantic analysis phase must be translated into assembly or machine language. The operators of the Tree language are chosen carefully to match the capabilities of most machines. However, there are certain aspects of the Tree language that do not correspond exactly with machine languages, and some aspects of the Tree language interfere with compile-time optimization analyses.
For example, it's useful to be able to evaluate the subexpressions of an expression in any order. But the subexpressions of Tree.exp can contain side effects – ESEQ and CALL nodes that contain assignment statements and perform input/output. If Tree expressions did not contain ESEQ and CALL nodes, then the order of evaluation would not matter.
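For instance (an illustrative fragment), consider

    BINOP(PLUS, TEMP a, ESEQ(MOVE(TEMP a, CONST 1), TEMP b))

If the left operand TEMP a is evaluated before the MOVE statement inside the ESEQ, the sum uses the old value of a; if it is evaluated afterwards, it uses 1. The two orders yield different results.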
Some of the mismatches between Trees and machine-language programs are:
The CJUMP instruction can jump to either of two labels, but real machines' conditional jump instructions fall through to the next instruction if the condition is false.
ESEQ nodes within expressions are inconvenient, because they make different orders of evaluating subtrees yield different results.
CALL nodes within expressions cause the same problem.
CALL nodes within the argument-expressions of other CALL nodes will cause problems when trying to put arguments into a fixed set of formal-parameter registers.
Why does the Tree language allow ESEQ and two-way CJUMP, if they are so troublesome?
The front end of the compiler translates programs into an intermediate language with an unbounded number of temporaries. This program must run on a machine with a bounded number of registers. Two temporaries a and b can fit into the same register, if a and b are never “in use” at the same time. Thus, many temporaries can fit in few registers; if they don't all fit, the excess temporaries can be kept in memory.
Therefore, the compiler needs to analyze the intermediate-representation program to determine which temporaries are in use at the same time. We say a variable is live if it holds a value that may be needed in the future, so this analysis is called liveness analysis.
To perform analyses on a program, it is often useful to make a control-flow graph. Each statement in the program is a node in the flow graph; if statement x can be followed by statement y, there is an edge from x to y. Graph 10.1 shows the flow graph for a simple loop.
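Graph 10.1 itself is not reproduced here; a loop of roughly the following shape fits the discussion (the statement numbers correspond to flow-graph nodes; this reconstruction is a sketch, not the original figure):

    1   a ← 0
    2   b ← a + 1
    3   c ← c + b
    4   a ← b · 2
    5   if a < N goto 2
    6   return c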
Let us consider the liveness of each variable (Figure 10.2). A variable is live if its current value will be used in the future, so we analyze liveness by working from the future to the past. Variable b is used in statement 4, so b is live on the 3 → 4 edge.
func-tion: a mathematical correspondence that assigns exactly one element of one set to each element of the same or another set
Webster's Dictionary
The mathematical notion of function is that if f(x) = a “this time,” then f(x) = a “next time”; there is no other value equal to f(x). This allows the use of equational reasoning familiar from algebra: that if a = f(x) then g(f(x), f(x)) is equivalent to g(a, a). Pure functional programming languages encourage a kind of programming in which equational reasoning works, as it does in mathematics.
Imperative programming languages have similar syntax: a ← f(x). But if we follow this by b ← f(x) there is no guarantee that a = b; the function f can have side effects on global variables that make it return a different value each time. Furthermore, a program might assign into variable x between calls to f(x), so f(x) really means a different thing each time.
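A small C sketch of the problem (the names are made up):

    int counter = 0;                /* hidden global state */

    int f(int x) {
        counter = counter + 1;      /* side effect: f modifies a global */
        return x + counter;         /* result depends on hidden state  */
    }

    /* After a = f(x); b = f(x); we have a != b even though x never   */
    /* changed, so equational reasoning about f does not apply.       */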
Higher-order functions. Functional programming languages also allow functions to be passed as arguments to other functions, or returned as results. Functions that take functional arguments are called higher-order functions.
Higher-order functions become particularly interesting if the language also supports nested functions with lexical scope (also called block structure). As in Tiger, lexical scope means that each function can refer to variables and parameters of any function in which it is nested. A higher-order functional language is one with nested scope and higher-order functions.
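C has neither nested functions nor closures, but the higher-order part of the idea can at least be sketched with function pointers (an illustrative example, not how a functional language implements it):

    #include <stdio.h>

    /* A higher-order function: takes a function g as an argument. */
    int twice(int (*g)(int), int x) { return g(g(x)); }

    int inc(int n) { return n + 1; }

    int main(void) {
        printf("%d\n", twice(inc, 3));   /* prints 5 */
        return 0;
    }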
sched-ule: a procedural plan that indicates the time and sequence of each operation
Webster's Dictionary
A simple computer can process one instruction at a time. First it fetches the instruction, then decodes it into opcode and operand specifiers, then reads the operands from the register bank (or memory), then performs the arithmetic denoted by the opcode, then writes the result back to the register bank (or memory); and then fetches the next instruction.
Modern computers can execute parts of many different instructions at the same time. At the same time the processor is writing results of two instructions back to registers, it may be doing arithmetic for three other instructions, reading operands for two more instructions, decoding four others, and fetching yet another four. Meanwhile, there may be five instructions delayed, awaiting the results of memory-fetches.
Such a processor usually fetches instructions from a single flow of control; it's not that several programs are running in parallel, but the adjacent instructions of a single program are decoded and executed simultaneously. This is called instruction-level parallelism (ILP), and is the basis for much of the astounding advance in processor speed in the last decade of the twentieth century.
A pipelined machine performs the write-back of one instruction in the same cycle as the arithmetic “execute” of the next instruction and the operand-read of the previous one, and so on.
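The overlap can be pictured schematically (an illustrative five-stage pipeline, using the stages described above: Fetch, Decode, Read operands, Execute, Write back):

    cycle:     1   2   3   4   5   6   7
    instr 1:   F   D   R   E   W
    instr 2:       F   D   R   E   W
    instr 3:           F   D   R   E   W

In steady state one instruction completes every cycle, even though each individual instruction takes five cycles from fetch to write-back.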