Compiler | 1 Introduction & 2 A

2018-08-03  本文已影响0人  shawn233

This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 1 Introduction and Chapter 2 A Simple Syntax-Directed Translator.

Chapter 1 Introduction


1.1 Language Processors

A language-processing system

1.2 The Structure of a Compiler

Phases of a compiler

1.6.3 Static Scope and Block Structure

A declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B

The static-scope rule for variable declarations in a block-structured languages is a follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.

1.6.5 Dynamic Scope

The term dynamic scope, refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration.

Chapter 2 A Simple Syntax-Directed Translator

Translate Java code into three-address code


2.1 Introduction

The syntax of a programming language describes the proper form of its programs, while the semantics of the language defines what its programs mean, what each program does when it executes.


2.2 Syntax Definition

Using the variable expr to denote an expression and the variable stmt to denote a statement, the structuring rule of an if-else statement can be expressed as

stmt -> if ( expr ) stmt else stmt

in which the arrow may be read as "can have the form". Such a rule is called a prduction. In a production, lexical elements like the keyword if and the parentheses are called terminals. Variables like expr and stmt represent sequences of terminals are called nonterminals.

2.2.1 Definition of Grammars

A context-free grammar has four components:

  1. A set of terminal symbols, sometimes referred to as "tokens". The terminals are the elementary symbols of the language defined by the grammar.

  2. A set of nonterminals, sometimes called "syntactic variables". Each nonterminal represents a set of strings or terminals, in a manner we shall describe.

  3. A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of the production. The intuitive intent of a production is to specify one of the written forms of a construct; if the head nonterminal represents a construct, then the body represents a written form of the construct.

  4. A designation (指定) of one of the nonterminals as the start symbol.

We specify grammars by listing their productions, with the productions for the start symbol listed first. We assume that digits, signs such as < and <=, and boldface strings such as while are terminals. An italicized (斜体) name is a nonterminal, and any nonitalicized name or symbol may be assumed to be a terminal. For notational convenience, productions with the same nonterminal as the head can have their bodies grouped, with the alternative bodies separated by the symbol |. which we read as "or".

e.g. lists of digits separated by plus or minus

list -> list + digit | list - digit | digit
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

We say a production is for a nonterminal if the nonterminal is the head of the production. A string of terminals is a sequence of zero or more terminals. The string of zero terminals, written as ε , is called the empty string.

2.2.2 Derivations

A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the body of a production for that nonterminal. The terminal strings that can be derived from the start symbol form the language defined by the grammar.

2.2.3 Parse Trees

Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:

  1. The root is labeled by the start symbol.

  2. Each leaf is labeled by a terminal or by ε.

  3. Each interior node is labeled by a nonterminal.

  4. If A is the nonterminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then there must be a production A -> X1X2···Xn. Here, X1, X2, ..., Xn each stand for a symbol that is either a terminal or a nonterminal. As a special case, if A -> ε is a production, then a node labeled A may have a single child labeled ε.

The process of finding a parse tree for a given string of terminals is called parsing that string.

2.2.5 Associativity of Operators

2.2.6 Precedence of Operators

The associativity rules apply to occurrences of the same operator, so they do not resolve the ambiguity of more than one kind of operator. Rules defining the relative precedence of operators are needed when more than one kind of operator is present.

We say * has higher precedence than + if * takes its operands before + does.


2.3 Syntax-Directed Translation

Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. 语法制导翻译时通过向一个文法 (grammar) 的产生式 (production) 附加 (attach) 一些规则 (rule) 或程序片段 (program fragments) 而得到的。

2.3.2 Synthesized Attributes

A syntax-directed definition associates

  1. With each grammar symbol, a set of attributes and
  2. With each production, a set of semantic rules for computing the values of the attributes associated with the symbols appearing in the production.

An attribute is said to be synthesized if its value at a parse-tree node N is determined from attribute values at the children of N and at N itself. Synthesized attributes have the desirable property that they can be evaluated during a single bottom-up traversal of a parse tree.

2.3.5 Translation Schemes

A syntax-directed translation scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar. A translation scheme is like a syntax-directed definition, except that the order of evaluation of the semantic rule is explicitly specified.

Program fragments embedded within production bodies are called semantic actions. The position at which an action is to be executed is shown by enclosing it between curly braces and writing it within the production body.

When drawing a parse tree for a translation scheme, we indicate an action by constructing an extra child for it, connected by a dashed line to the node that corresponds to the head of the production. The node for a semantic action has no children.

E.g. Semantic actions for translating infix expressions to postfix expressions.

The implementation of a translation scheme must ensure that semantic actions are performed in the order they would appear during a postorder traversal of a parse tree.


2.4 Parsing

Parsing is the process of determining how a string of terminals can be generated by a grammar.

2.4.1 Top-Down Parsing

The top-down construction of a parse tree is done by starting with the root, labeled with the starting nonterminal stmt, and repeatedly performing the following two steps.

  1. At node N, labeled with nonterminal A, select one of the productions for A and construct children at N for the symbols in the production body.

  2. Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded noterminal of the tree.

For some grammars, the above steps can be implemented during a single left-to-right scan of the input string. The current terminal being scanned in the input is frequently referred to as the lookahead symbol.

2.4.2 Predictive Parsing

Predictive parsing relies on information about the first symbols that can be generated by a production body. More precisely, let α be a string of grammar symbols (terminals and / or nonterminals). We define FIRST (α) to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α. If α is ε or can generate ε, then ε is also in FIRST (α).

The FIRST sets must be considered if there are two productions A -> α and A -> β. Ignoring ε-productions for the moment, predictive parsing requires FIRST (α) and FIRST (β) to be disjoint (不相交的). The lookahead symbol can then be used to decide which production to use; if the lookahead symbol is in FIRST (α), then α is used. Otherwise, if the lookahead symbol is in FIRST (β), then β is used.

2.4.3 Designing a Predictive Parser

Recall that a predictive parser is a program consisting of a procedure for every nonterminal. The procedure for nonterminal A does two things.

  1. It decides which A-production to use by examining the lookahead symbol. The production with body α (where α is not ε, the empty string) is used if the lookahead symbol is in FIRST (α). If there is a conflict between two nonempty bodies for any lookahead symbol, then we cannot use this parsing method on this grammar. In addition, the ε-production for A, if it exists, is used if the lookahead symbol is not in the FIRST set for any other production body for A.

  2. The procedure then mimics the body of the chosen production. That is ,the symbols of the body are "executed" in turn, from the left. A nonterminal is "executed" by a call to the procedure for that nonterminal, and a terminal matching the lookahead symbol is "executed" by reading the next input symbol. If at some point the terminal in the body does not match the lookahead symbol, a syntax error is reported.

2.4.5 Left Recursion

It is possible for a recursive-descent parser to loop forever. A problem arises with "left-recursive" productions like

expr -> expr + term

where the leftmost symbol of the body is the same as the nonterminal at the head of the production.

A left-recursive production can be eliminated by rewriting the offending production. Consider a nonterminal A with two productions

A -> Aα | β

where α and β are sequences of terminals and nonterminals that do not start with A. For example, in

expr -> expr + term | term

nonterminal A = expr, string α = + term, and string β = term.

The nonterminal A and its production are said to be left recursive, because the production A -> Aα has A itself as the leftmost symbol on the right side. Repeated application of this production builds up a sequence of α's to the right of A. When A is finally replaced by β, we have a β followed by a sequence of zero or more α's.

The left-recursion-elimination technique | The same effect can be achieved by rewriting the productions for A in the following manner, using a new nonterminal R:

A -> βR
R -> αR | ε

Nonterminal R and its production R -> αR are right recursive because this production for R has R itself as the last symbol on the right side.


2.5.1 Abstract and Concrete Syntax

A useful starting point for designing a translator is a data structure called an abstract syntax tree. In an abstract syntax tree for an expression, each interior node represents an operator; the children of the node represent the operands of the operator. More generally, any programming construct can be handled by making up an operator for the construct and treating as operands the semantically meaningful components of that construct.

Abstract syntax trees, or simply syntax tees, resemble parse trees to an extent. However, in the syntax tree, interior nodes represent programming constructs while in the parse tree, the interior node represent nonterminals. Many nonterminals of a grammar represent programming constructs, but others are "helpers" of one sort or another, such as those representing terms, factors, or other variations of expressions. Om the syntax tree, these helpers typically are not needed and are hence dropped. To emphasize the contrast, a parse tree is sometimes called a concrete syntax tree and the underlying grammar is called a concrete syntax for the language.


2.6 Lexical Analysis

A lexical analyzer reads characters from the input and groups them into "token objects".

A sequence of input characters that comprises (组成) a single token is called a lexeme. Thus, we can say that the lexical analyzer insulates (隔离) a parser from the lexeme representation of tokens.

2.6.1 Removal of White Space and Comments

2.6.2 Reading Ahead

2.6.3 Constants

<num, 31>

2.6.4 Recognizing Keywords and Identifiers

Grammars routinely treat identifiers as terminals to simplify the parser, which can then expect the same terminal, say id, each time any identifier appears in the input.

<id, "count">

2.6.5 A Lexical Analyzer

2.7 Symbol Tables

From Section 1.6.1, the scope of a declaration is the portion of a program to which the declaration applies. We shall implement scopes by setting up a separate symbol table for each scope. A program block with declarations will have its own symbol table with an entry for each declaration in the block.


2.7.1 Symbol Table Per Scope

The most-closely nested rule for blocks is that an identifier x is in the scope of the most-closely nested declaration of x; that is, the declaration of x found by examining blocks inside-out, starting with the block in which x appears.

In Java Implementation of chained symbol table in Fig 2.37 defines a class Env, short for environment.

Class Env supports three operations:

Chaining of symbol tables results in a tree structure, since more than on block can be nested inside an enclosing block.

2.7.2 The Use of Symbol Tables

In effect, the role of a symbol table is to pass information from declarations to uses. A semantic action "puts" information about identifier x into the symbol, when the declaration of x is analyzed. Subsequently, a semantic action associated with a production such as factor -> id "gets" information about the identifier from the symbol table.


2.8.1 Two Kinds of Intermediate Representations

The two most important intermediate representations are:

In addition to creating an intermediate representation, a compiler front end checks that the source program follows the syntactic and semantic rules of the source language. This checking is called static checking; in general "static" means "done by the compiler". Static checking assures that certain kinds of programming errors, including type mismatches, are detected and reported during compilation.

It is common for compilers to emit the three-address code while the parser "goes through the motions" of constructing a syntax tree, without actually constructing the complete tree data structure. Rather, the compiler stores nodes and their attributes needed for semantic checking or other purposes, along with the data structure used for parsing.

2.8.2 Construction of Syntax Trees

We shall first give a translation scheme that constructs syntax trees, and later show how the scheme can be modified to emit three-address code, along with, or instead of, the syntax tree.

All the nonterminals in the translation scheme have an attribute n, which is a node of the syntax tree. Nodes are implemented as objects of class Node.

Class Node has two immediate subclasses: Expr for all kinds of expressions, and Stmt for all kinds of statements. Each type of statement has a corresponding subclass of Stmt; for example, operator while corresponds to subclass While. A syntax-tree node for operator while with children x and y is created by the pseudocode new While (x, y) which creates an object of class While by calling constructor function While.

2.8.3 Static Checking

Static checks are consistency checks that are done during compilation. Static checks includes:

The terms l-value and r-value refer to values that are appropriate on the left and right sides of an assignment, respectively.That is , r-values are what we usually think of as "values", while l-values are locations.

The idea of matching actual with expected types continues to apply (type checking), even in the following situations:

2.8.4 Three-Address Code

Three-address code is a sequence of instructions of the form

x = y op z

where x, y, and z are names, constants, or compiler-generated temporaries; and op stands for an operator.

Arrays will be handled by using the following two variants of instructions:

x [ y ] = z
x = y [ z ] 

The following instructions are for control flow:

A label L can be attached to any instruction by prepending a prefix L:. Any instruction can have more than one label.

Translation of Statements

Statements are translated into three-addrress code by using jump instructions to implement the flow of control through the statement.

Translation of Expressions


2.9 Summary of Chapter 2

The syntax-directed techniques in this chapter can be used to construct compiler front ends, such as those illustrated in Fig 2.46.

上一篇下一篇

猜你喜欢

热点阅读