Compiler | 1 Introduction & 2 A
This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 1 Introduction and Chapter 2 A Simple Syntax-Directed Translator.
Chapter 1 Introduction
1.1 Language Processors
A language-processing system
1.2 The Structure of a Compiler
Phases of a compiler
1.6.3 Static Scope and Block Structure
A declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B
The static-scope rule for variable declarations in a block-structured languages is a follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.
1.6.5 Dynamic Scope
The term dynamic scope, refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration.
- Declarations tell us about the types of things, while definitions tell us about their values. Thus,
int i
is a declaration of i, whilei = 1
is a definition of i. In C++, a method is declared in a class definition, by giving the types of arguments and result of the method (often called signature for the method). The method is then defined in another place.
Chapter 2 A Simple Syntax-Directed Translator
Translate Java code into three-address code
2.1 Introduction
The syntax of a programming language describes the proper form of its programs, while the semantics of the language defines what its programs mean, what each program does when it executes.
2.2 Syntax Definition
Using the variable expr to denote an expression and the variable stmt to denote a statement, the structuring rule of an if-else statement can be expressed as
stmt -> if ( expr ) stmt else stmt
in which the arrow may be read as "can have the form". Such a rule is called a prduction. In a production, lexical elements like the keyword if and the parentheses are called terminals. Variables like expr and stmt represent sequences of terminals are called nonterminals.
2.2.1 Definition of Grammars
A context-free grammar has four components:
-
A set of terminal symbols, sometimes referred to as "tokens". The terminals are the elementary symbols of the language defined by the grammar.
-
A set of nonterminals, sometimes called "syntactic variables". Each nonterminal represents a set of strings or terminals, in a manner we shall describe.
-
A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of the production. The intuitive intent of a production is to specify one of the written forms of a construct; if the head nonterminal represents a construct, then the body represents a written form of the construct.
-
A designation (指定) of one of the nonterminals as the start symbol.
We specify grammars by listing their productions, with the productions for the start symbol listed first. We assume that digits, signs such as < and <=, and boldface strings such as while are terminals. An italicized (斜体) name is a nonterminal, and any nonitalicized name or symbol may be assumed to be a terminal. For notational convenience, productions with the same nonterminal as the head can have their bodies grouped, with the alternative bodies separated by the symbol |. which we read as "or".
e.g. lists of digits separated by plus or minus
list -> list + digit | list - digit | digit
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
- terminals: + - 0 1 2 3 4 5 6 7 8 9
- nonterminals: list, digit
- start symbol: list
We say a production is for a nonterminal if the nonterminal is the head of the production. A string of terminals is a sequence of zero or more terminals. The string of zero terminals, written as ε , is called the empty string.
2.2.2 Derivations
A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the body of a production for that nonterminal. The terminal strings that can be derived from the start symbol form the language defined by the grammar.
2.2.3 Parse Trees
Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:
-
The root is labeled by the start symbol.
-
Each leaf is labeled by a terminal or by ε.
-
Each interior node is labeled by a nonterminal.
-
If A is the nonterminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then there must be a production A -> X1X2···Xn. Here, X1, X2, ..., Xn each stand for a symbol that is either a terminal or a nonterminal. As a special case, if A -> ε is a production, then a node labeled A may have a single child labeled ε.
The process of finding a parse tree for a given string of terminals is called parsing that string.
2.2.5 Associativity of Operators
- left-associative: a + b + c is equivalent to (a + b) + c. The operand b is taken by the left plus sign, so the plus operator is left-associative.
- right-associative: a = b = c is equivalent to a = (b = c). The operand b is taken by the right assign operator, so assignment is right-associative.
2.2.6 Precedence of Operators
The associativity rules apply to occurrences of the same operator, so they do not resolve the ambiguity of more than one kind of operator. Rules defining the relative precedence of operators are needed when more than one kind of operator is present.
We say *
has higher precedence than +
if *
takes its operands before +
does.
2.3 Syntax-Directed Translation
Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. 语法制导翻译时通过向一个文法 (grammar) 的产生式 (production) 附加 (attach) 一些规则 (rule) 或程序片段 (program fragments) 而得到的。
-
Attributes. An attribute is any quantity associated with a programming construct. Examples of attributes are data types of expressions, the number of instructions in the generated code, or the location of the first instruction in the generated code for a construct, among many other posibilities. Since we use grammar symbols (nonterminals and terminals) to represent programming constructs, we extend the notion of attrubutes from constructs to the symbols that represent them.
-
(Syntax-directed) translation schemes. A translation scheme is a notation for attaching program fragments to the productions of a grammar. The program fragments are executed when the production is used during syntax analysis. The combined result of all these fragment executions, in the order induced by the syntax analysis, produces the translation of the program to which this analysis/ synthesis process is applied.
2.3.2 Synthesized Attributes
A syntax-directed definition associates
- With each grammar symbol, a set of attributes and
- With each production, a set of semantic rules for computing the values of the attributes associated with the symbols appearing in the production.
An attribute is said to be synthesized if its value at a parse-tree node N is determined from attribute values at the children of N and at N itself. Synthesized attributes have the desirable property that they can be evaluated during a single bottom-up traversal of a parse tree.
2.3.5 Translation Schemes
A syntax-directed translation scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar. A translation scheme is like a syntax-directed definition, except that the order of evaluation of the semantic rule is explicitly specified.
Program fragments embedded within production bodies are called semantic actions. The position at which an action is to be executed is shown by enclosing it between curly braces and writing it within the production body.
When drawing a parse tree for a translation scheme, we indicate an action by constructing an extra child for it, connected by a dashed line to the node that corresponds to the head of the production. The node for a semantic action has no children.
E.g. Semantic actions for translating infix expressions to postfix expressions.
The implementation of a translation scheme must ensure that semantic actions are performed in the order they would appear during a postorder traversal of a parse tree.
2.4 Parsing
Parsing is the process of determining how a string of terminals can be generated by a grammar.
2.4.1 Top-Down Parsing
The top-down construction of a parse tree is done by starting with the root, labeled with the starting nonterminal stmt, and repeatedly performing the following two steps.
-
At node N, labeled with nonterminal A, select one of the productions for A and construct children at N for the symbols in the production body.
-
Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded noterminal of the tree.
For some grammars, the above steps can be implemented during a single left-to-right scan of the input string. The current terminal being scanned in the input is frequently referred to as the lookahead symbol.
2.4.2 Predictive Parsing
Predictive parsing relies on information about the first symbols that can be generated by a production body. More precisely, let α be a string of grammar symbols (terminals and / or nonterminals). We define FIRST (α)
to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α. If α is ε or can generate ε, then ε is also in FIRST (α)
.
The FIRST
sets must be considered if there are two productions A -> α and A -> β. Ignoring ε-productions for the moment, predictive parsing requires FIRST (α)
and FIRST (β)
to be disjoint (不相交的). The lookahead symbol can then be used to decide which production to use; if the lookahead symbol is in FIRST (α)
, then α is used. Otherwise, if the lookahead symbol is in FIRST (β)
, then β is used.
2.4.3 Designing a Predictive Parser
Recall that a predictive parser is a program consisting of a procedure for every nonterminal. The procedure for nonterminal A does two things.
-
It decides which A-production to use by examining the lookahead symbol. The production with body α (where α is not ε, the empty string) is used if the lookahead symbol is in
FIRST (α)
. If there is a conflict between two nonempty bodies for any lookahead symbol, then we cannot use this parsing method on this grammar. In addition, the ε-production for A, if it exists, is used if the lookahead symbol is not in theFIRST
set for any other production body for A. -
The procedure then mimics the body of the chosen production. That is ,the symbols of the body are "executed" in turn, from the left. A nonterminal is "executed" by a call to the procedure for that nonterminal, and a terminal matching the lookahead symbol is "executed" by reading the next input symbol. If at some point the terminal in the body does not match the lookahead symbol, a syntax error is reported.
2.4.5 Left Recursion
It is possible for a recursive-descent parser to loop forever. A problem arises with "left-recursive" productions like
expr -> expr + term
where the leftmost symbol of the body is the same as the nonterminal at the head of the production.
A left-recursive production can be eliminated by rewriting the offending production. Consider a nonterminal A with two productions
A -> Aα | β
where α and β are sequences of terminals and nonterminals that do not start with A. For example, in
expr -> expr + term | term
nonterminal A = expr, string α = + term, and string β = term.
The nonterminal A and its production are said to be left recursive, because the production A -> Aα has A itself as the leftmost symbol on the right side. Repeated application of this production builds up a sequence of α's to the right of A. When A is finally replaced by β, we have a β followed by a sequence of zero or more α's.
The left-recursion-elimination technique | The same effect can be achieved by rewriting the productions for A in the following manner, using a new nonterminal R:
A -> βR
R -> αR | ε
Nonterminal R and its production R -> αR are right recursive because this production for R has R itself as the last symbol on the right side.
2.5.1 Abstract and Concrete Syntax
A useful starting point for designing a translator is a data structure called an abstract syntax tree. In an abstract syntax tree for an expression, each interior node represents an operator; the children of the node represent the operands of the operator. More generally, any programming construct can be handled by making up an operator for the construct and treating as operands the semantically meaningful components of that construct.
Abstract syntax trees, or simply syntax tees, resemble parse trees to an extent. However, in the syntax tree, interior nodes represent programming constructs while in the parse tree, the interior node represent nonterminals. Many nonterminals of a grammar represent programming constructs, but others are "helpers" of one sort or another, such as those representing terms, factors, or other variations of expressions. Om the syntax tree, these helpers typically are not needed and are hence dropped. To emphasize the contrast, a parse tree is sometimes called a concrete syntax tree and the underlying grammar is called a concrete syntax for the language.
2.6 Lexical Analysis
A lexical analyzer reads characters from the input and groups them into "token objects".
A sequence of input characters that comprises (组成) a single token is called a lexeme. Thus, we can say that the lexical analyzer insulates (隔离) a parser from the lexeme representation of tokens.
2.6.1 Removal of White Space and Comments
2.6.2 Reading Ahead
2.6.3 Constants
<num, 31>
2.6.4 Recognizing Keywords and Identifiers
Grammars routinely treat identifiers as terminals to simplify the parser, which can then expect the same terminal, say id, each time any identifier appears in the input.
<id, "count">
2.6.5 A Lexical Analyzer
2.7 Symbol Tables
From Section 1.6.1, the scope of a declaration is the portion of a program to which the declaration applies. We shall implement scopes by setting up a separate symbol table for each scope. A program block with declarations will have its own symbol table with an entry for each declaration in the block.
2.7.1 Symbol Table Per Scope
The most-closely nested rule for blocks is that an identifier x is in the scope of the most-closely nested declaration of x; that is, the declaration of x found by examining blocks inside-out, starting with the block in which x appears.
In Java Implementation of chained symbol table in Fig 2.37 defines a class Env
, short for environment.
Class Env
supports three operations:
-
Create a new symbol table. The constructor
Env(p)
on lines 6 through 8 of Fig 2.37 creates anEnv
object with a hash table namedtable
. The object is chained to the environment-valued parameterp
by setting fieldprev
top
. Although it is theEnv
objects that form a chain, it is convenient to talk of the tablesbeing chained. -
Put a new entry in the current table.
-
Get an entry for an identifier by searching the chain of tables, starting with the table for the current block.
Chaining of symbol tables results in a tree structure, since more than on block can be nested inside an enclosing block.
2.7.2 The Use of Symbol Tables
In effect, the role of a symbol table is to pass information from declarations to uses. A semantic action "puts" information about identifier x into the symbol, when the declaration of x is analyzed. Subsequently, a semantic action associated with a production such as factor -> id "gets" information about the identifier from the symbol table.
2.8.1 Two Kinds of Intermediate Representations
The two most important intermediate representations are:
- Trees, including parse trees and (abstract) syntax trees (introduced in Section 2.5.1).
- Linear representations, especially "three-address code".
In addition to creating an intermediate representation, a compiler front end checks that the source program follows the syntactic and semantic rules of the source language. This checking is called static checking; in general "static" means "done by the compiler". Static checking assures that certain kinds of programming errors, including type mismatches, are detected and reported during compilation.
It is common for compilers to emit the three-address code while the parser "goes through the motions" of constructing a syntax tree, without actually constructing the complete tree data structure. Rather, the compiler stores nodes and their attributes needed for semantic checking or other purposes, along with the data structure used for parsing.
2.8.2 Construction of Syntax Trees
We shall first give a translation scheme that constructs syntax trees, and later show how the scheme can be modified to emit three-address code, along with, or instead of, the syntax tree.
All the nonterminals in the translation scheme have an attribute n, which is a node of the syntax tree. Nodes are implemented as objects of class Node.
Class Node has two immediate subclasses: Expr for all kinds of expressions, and Stmt for all kinds of statements. Each type of statement has a corresponding subclass of Stmt; for example, operator while corresponds to subclass While. A syntax-tree node for operator while with children x and y is created by the pseudocode new While (x, y)
which creates an object of class While by calling constructor function While.
2.8.3 Static Checking
Static checks are consistency checks that are done during compilation. Static checks includes:
- Syntactic Checking.
- Type Checking.
The terms l-value and r-value refer to values that are appropriate on the left and right sides of an assignment, respectively.That is , r-values are what we usually think of as "values", while l-values are locations.
The idea of matching actual with expected types continues to apply (type checking), even in the following situations:
-
Coercions ( 强制类型转换 ). A coercion occurs if the type of an operand is automatically converted to the type expected by the operator.
-
Overloading ( 运算符重载 ). A symbol is said to be overloaded if it has different meanings depending on its context.
2.8.4 Three-Address Code
Three-address code is a sequence of instructions of the form
x = y op z
where x
, y
, and z
are names, constants, or compiler-generated temporaries; and op stands for an operator.
Arrays will be handled by using the following two variants of instructions:
x [ y ] = z
x = y [ z ]
The following instructions are for control flow:
-
ifFalse x goto L
. Ifx
is false, next execute the instruction labeled L -
ifTrue x goto L
. If x is true, next execute the instruction labeled L -
goto L
. next execute the instruction labeled L
A label L can be attached to any instruction by prepending a prefix L:
. Any instruction can have more than one label.
Translation of Statements
Statements are translated into three-addrress code by using jump instructions to implement the flow of control through the statement.
Translation of Expressions
2.9 Summary of Chapter 2
The syntax-directed techniques in this chapter can be used to construct compiler front ends, such as those illustrated in Fig 2.46.