Published by Connor Snell. Modified over 9 years ago.
1
Chapter 2: Syntax
"A language that is simple to parse for the compiler is also simple to parse for the human programmer." (N. Wirth)
2
2.1 Grammars
  2.1.1 Backus-Naur Form
  2.1.2 Derivations
  2.1.3 Parse Trees
  2.1.4 Associativity and Precedence
  2.1.5 Ambiguous Grammars
2.2 Extended BNF
2.3 Syntax of a Small Language: Clite
  2.3.1 Lexical Syntax
  2.3.2 Concrete Syntax
2.4 Compilers and Interpreters
2.5 Linking Syntax and Semantics
  2.5.1 Abstract Syntax
  2.5.2 Abstract Syntax Trees
  2.5.3 Abstract Syntax of Clite
3
Expr → Expr + Term | Expr - Term | Term
Term → Term * Factor | Term / Factor | Term % Factor | Factor
Factor → Primary ** Factor | Primary
Primary → 0 | ... | 9 | ( Expr )
On the slide, red indicates a terminal of G1 and blue indicates metasymbols (symbols that aren't part of the language but are used to describe the language).
Example: 10 * ( 5 - 3 )
4
Motivation for using a subset of C:

Language   Grammar (pages)   Reference
Pascal     5                 Jensen & Wirth
C          6                 Kernighan & Ritchie
C++        22                Stroustrup
Java       14                Gosling et al.

The Clite grammar fits on one page (next 3 slides), so it's a far better tool for studying language design.
5
Program → int main ( ) { Declarations Statements }
Declarations → { Declaration }
Declaration → Type Identifier [ [ Integer ] ] { , Identifier [ [ Integer ] ] } ;
Type → int | bool | float | char
Statements → { Statement }
Statement → ; | Block | Assignment | IfStatement | WhileStatement
Block → { Statements }
Assignment → Identifier [ [ Expression ] ] = Expression ;
IfStatement → if ( Expression ) Statement [ else Statement ]
WhileStatement → while ( Expression ) Statement
6
Expression → Conjunction { || Conjunction }
Conjunction → Equality { && Equality }
Equality → Relation [ EquOp Relation ]
EquOp → == | !=
Relation → Addition [ RelOp Addition ]
RelOp → < | <= | > | >=
Addition → Term { AddOp Term }
AddOp → + | -
Term → Factor { MulOp Factor }
MulOp → * | / | %
Factor → [ UnaryOp ] Primary
UnaryOp → - | !
Primary → Identifier [ [ Expression ] ] | Literal | ( Expression ) | Type ( Expression )
7
Identifier → Letter { Letter | Digit }
Letter → a | b | ... | z | A | B | ... | Z
Digit → 0 | 1 | ... | 9
Literal → Integer | Boolean | Float | Char
Integer → Digit { Digit }
Boolean → true | false
Float → Integer . Integer
Char → ' ASCIIChar '
(ASCIIChar is the set of ASCII characters)
8
13 grammar rules; compare to 4 pages for C++
Metabraces { } (0 or more) are interpreted to mean left associativity; e.g.,
◦ Addition → Term { AddOp Term }
◦ AddOp → + | -
Metabrackets [ ] (optional) mean that an Addition can be followed by at most one relational operator plus another Addition:
◦ Relation → Addition [ RelOp Addition ]
◦ RelOp → < | <= | > | >=
◦ (no a > b > c, for example)
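The link between metabraces and left associativity can be seen in how such a rule is typically parsed. The following is a minimal sketch, not the book's parser: the class name `AdditionParser` is hypothetical, and a Term is assumed to be a single letter or digit with no whitespace in the input.

```java
// Sketch: the EBNF rule  Addition -> Term { AddOp Term }  parsed with a loop.
// Assumption: a Term is one letter or digit, and the input has no whitespace.
class AdditionParser {
    private final String input;
    private int pos = 0;

    AdditionParser(String input) {
        this.input = input;
    }

    // Addition -> Term { AddOp Term }
    String parseAddition() {
        String left = parseTerm();
        // Each pass consumes one "AddOp Term" and folds it into the tree
        // built so far, so operators group to the left.
        while (pos < input.length() && (peek() == '+' || peek() == '-')) {
            char op = input.charAt(pos++);
            String right = parseTerm();
            left = "(" + left + " " + op + " " + right + ")";
        }
        return left;
    }

    // Term is reduced to a single letter or digit for this sketch.
    private String parseTerm() {
        if (pos >= input.length() || !Character.isLetterOrDigit(peek())) {
            throw new IllegalArgumentException("term expected at position " + pos);
        }
        return String.valueOf(input.charAt(pos++));
    }

    private char peek() {
        return input.charAt(pos);
    }

    public static void main(String[] args) {
        // Prints ((a - b) - c)
        System.out.println(new AdditionParser("a-b-c").parseAddition());
    }
}
```

Because the loop always extends the tree built so far, `a-b-c` groups as `(a - b) - c`. A rule written with metabrackets would use an `if` in place of the `while`, which is why it admits at most one operator.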
9
Comments
The significance of whitespace
Distinguishing one token <= from two tokens < =
Distinguishing identifiers from keywords like if
10
The Clite grammar has two levels:
◦ lexical level (described by lexical syntax)
◦ syntactic level (described by concrete syntax)
They correspond to two separate parts of a compiler.
The issues on the previous slide are lexical issues.
11
Examples of lexical entities (tokens):
◦ Identifiers, e.g., numbr1, X
◦ Literals, e.g., 123, 'x', 3.25, true
◦ Keywords: bool char else false float if int main true while
◦ Operators: = || && == != < <= > >= + - * / ! %
◦ Punctuation: ; , { } ( )
◦ Char, e.g., '?'
12
Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment.
No token may contain embedded whitespace (unless it is a character or string literal).
Example:
◦ >= is one token
◦ > = is two tokens
13
while ( a <= b)    legal: spacing between tokens
while(a<=b)        also legal: spacing not needed
while (a < = b)    no lexical errors, but syntactically illegal; the lexer would identify the tokens while, (, a, <, =, b, )
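The three cases above can be reproduced with a very small scanner. This is a hedged sketch, not Clite's actual lexer: the class `TinyLexer` is hypothetical, it handles only identifiers, one-character symbols, and two-character operators ending in `=`, and it ignores comments and numeric literals.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a scanner that treats "<=" as one token and "< =" as two.
// Assumption: only identifiers, single-character symbols, and the
// two-character operators <= >= == != are recognized.
class TinyLexer {
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace only separates tokens
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < src.length() && Character.isLetterOrDigit(src.charAt(i))) {
                    i++;
                }
                tokens.add(src.substring(start, i));
            } else if ((c == '<' || c == '>' || c == '=' || c == '!')
                    && i + 1 < src.length() && src.charAt(i + 1) == '=') {
                tokens.add(src.substring(i, i + 2)); // two-character operator
                i += 2;
            } else {
                tokens.add(String.valueOf(c)); // one-character token
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("while(a<=b)"));     // [while, (, a, <=, b, )]
        System.out.println(tokenize("while (a < = b)")); // [while, (, a, <, =, b, )]
    }
}
```

The second call produces no lexical error. It is the parser that later rejects `<` followed by `=`, since no Clite rule allows that sequence.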
14
Clite uses the // comment style of C++.
◦ Not defined in the Clite grammar (but could be)
◦ Instead, it's defined outside the grammar
The use of whitespace to differentiate between one- and two-character operators is also defined outside the grammar.
15
Sequence of letters and digits, starting with a letter
"if" is an identifier which is also a keyword
Keywords versus reserved words:
◦ Keyword: predefined by the language
◦ Reserved word: can only be used as defined
◦ In most languages all keywords are also reserved, but in a few (e.g., Pascal) a subset of the keyword identifiers are predefined but not reserved, and can be redefined by the programmer.
Implications? Flexibility, confusion …
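One common way to separate keywords from ordinary identifiers is to scan every word as an identifier first and then look it up in a keyword table. The sketch below assumes that approach. The class `KeywordCheck` and its method names are hypothetical, and the keyword list is the one given earlier for Clite.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch: classify a lexeme as keyword or identifier by table lookup.
class KeywordCheck {
    // Clite keywords as listed on the lexical-entities slide.
    static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
            "bool", "char", "else", "false", "float",
            "if", "int", "main", "true", "while"));

    static boolean isLetter(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Identifier -> Letter { Letter | Digit }
    static boolean hasIdentifierShape(String s) {
        if (s.isEmpty() || !isLetter(s.charAt(0))) {
            return false;
        }
        for (int i = 1; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!isLetter(c) && !isDigit(c)) {
                return false;
            }
        }
        return true;
    }

    static String classify(String lexeme) {
        if (!hasIdentifierShape(lexeme)) {
            return "not an identifier";
        }
        return KEYWORDS.contains(lexeme) ? "keyword" : "identifier";
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));     // keyword
        System.out.println(classify("numbr1")); // identifier
        System.out.println(classify("1abc"));   // not an identifier
    }
}
```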
16
The concrete syntax of a language is the set of rules for writing correct programs.
The structure of a specific program can be represented by a parse tree, based on the concrete syntax of the language, using the stream of tokens identified during lexical analysis.
The root of the parse tree is the start symbol of the language (Program, in Clite).
17
Clite's expression rules are non-ambiguous with respect to precedence and associativity.
◦ Rule ordering defines precedence; rule format defines associativity.
The C/C++ expression grammar definition is ambiguous: precedence and associativity are specified separately.
18
Clite Operator    Associativity
Unary - !         none
* /               left
+ -               left
< <= > >=         none (i.e., no a < b <= c)
== !=             none
&&                left
||                left
19
… are non-associative (an idea borrowed from Ada).
Why is this important? In C and C++, the expression
  if (10 < x < 20)
is not equivalent to
  if (10 < x && x < 20)
But it is error-free! So, what does it mean?
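One way to answer the question: C's relational operators are left-associative and yield the integer 1 or 0, so `10 < x < 20` is evaluated as `(10 < x) < 20`. Java rejects the chained form as a type error, so the sketch below simulates the C behavior with a hypothetical helper `lessThan` that returns an `int`.

```java
// Sketch: simulate how C evaluates 10 < x < 20.
// Assumption: lessThan mimics C's "<", which yields the int 1 or 0.
class ChainedComparison {
    static int lessThan(int a, int b) {
        return a < b ? 1 : 0;
    }

    // C groups the expression as (10 < x) < 20.
    static int chained(int x) {
        return lessThan(lessThan(10, x), 20);
    }

    // What the programmer most likely intended.
    static boolean intended(int x) {
        return 10 < x && x < 20;
    }

    public static void main(String[] args) {
        int[] samples = {5, 15, 50};
        for (int x : samples) {
            System.out.println("x=" + x
                    + " chained=" + chained(x)
                    + " intended=" + intended(x));
        }
    }
}
```

Since `(10 < x)` is always 0 or 1, and both are less than 20, the chained form is true for every `x`. Making relational operators non-associative turns this silent bug into a syntax error.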
20
Grammar rules don't specify the operand types to be used with various operators; e.g.,
◦ Is true + 13 a legal expression?
◦ What is the type of the expression 123.78 + 37?
These are type and semantic issues, not lexical or syntactic ones.
21
[Figure: phases of a compiler]
Source Program → Lexical Analyzer (lexer) → Tokens → Syntactic Analyzer (parser) → Abstract Syntax → Semantic Analyzer → Intermediate Code (IC) → Code Optimizer → Intermediate Code (IC) → Code Generator → Machine Code
22
Input: characters (the program)
Output: tokens and token types
Lexical grammars are simpler than syntax grammars.
Lexers are often generated automatically by lexical-analyzer generators.
23
Often based on a BNF/EBNF grammar
Input: tokens
Output: abstract syntax tree or some other representation of the program
Abstract syntax: similar to a concrete parse tree, but with punctuation and many nonterminals discarded
24
Typical tasks:
◦ Check that all identifiers are declared
◦ Perform type checking for expressions, assignments, …
◦ Insert implied conversion operators (i.e., make them explicit)
Context-free grammars can't express the semantic rules that are needed for this phase of translation.
Output: intermediate code (IC) tree, a modified abstract syntax tree representation
25
Purpose: improve the run-time performance of the object code
◦ Usually, to make it run faster
◦ Other possibilities: reduce the amount of storage required
Drawback: optimization is time-consuming and slows down debugging
Output: intermediate code, similar to abstract syntax notation but closer to machine code
26
Evaluate constant expressions at compile time
In-line expansion (of function calls)
Loop unrolling
Reorder code to improve cache performance
Eliminate common sub-expressions
Eliminate unnecessary code
Store local variables/intermediate results in registers rather than on the stack or elsewhere in memory
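The first optimization in this list, evaluating constant expressions at compile time, might be sketched as a recursive pass over an expression tree. The node classes `Node`, `NumNode`, `VarNode`, and `BinNode` are hypothetical simplifications invented for this example, and only `+`, `-`, and `*` on integers are folded.

```java
// Sketch of constant folding over a tiny expression tree.
// Assumption: integer arithmetic only; other operators are left untouched.
class ConstantFolder {
    static Node fold(Node e) {
        if (!(e instanceof BinNode)) {
            return e; // numbers and variables are already as simple as possible
        }
        BinNode b = (BinNode) e;
        Node left = fold(b.term1);
        Node right = fold(b.term2);
        if (left instanceof NumNode && right instanceof NumNode) {
            int x = ((NumNode) left).value;
            int y = ((NumNode) right).value;
            switch (b.op) {
                case '+': return new NumNode(x + y);
                case '-': return new NumNode(x - y);
                case '*': return new NumNode(x * y);
                default: break; // e.g. division is not folded in this sketch
            }
        }
        return new BinNode(b.op, left, right);
    }

    static String show(Node e) {
        if (e instanceof NumNode) return String.valueOf(((NumNode) e).value);
        if (e instanceof VarNode) return ((VarNode) e).id;
        BinNode b = (BinNode) e;
        return "(" + show(b.term1) + " " + b.op + " " + show(b.term2) + ")";
    }

    public static void main(String[] args) {
        Node tree = new BinNode('+', new VarNode("x"),
                new BinNode('*', new NumNode(2), new NumNode(3)));
        // Prints (x + (2 * 3)) => (x + 6)
        System.out.println(show(tree) + " => " + show(fold(tree)));
    }
}

abstract class Node { }

class NumNode extends Node {
    final int value;
    NumNode(int value) { this.value = value; }
}

class VarNode extends Node {
    final String id;
    VarNode(String id) { this.id = id; }
}

class BinNode extends Node {
    final char op;
    final Node term1, term2;
    BinNode(char op, Node term1, Node term2) {
        this.op = op;
        this.term1 = term1;
        this.term2 = term2;
    }
}
```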
27
Output: machine code
Instruction selection, register management
"Peephole" optimization: look at a small segment of machine code and make it more efficient; e.g.,
◦ x = y;      →  load y, R0
                 store R0, x
  z = x * 2;  →  load x, R0    // redundant code
                 mul R0, #2
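The redundant load in this example can be removed by a pass that inspects adjacent instruction pairs. The following is a sketch under stated assumptions: instructions are plain strings in the slide's `op operand, operand` notation, the code is straight-line (no labels or jumps between the two instructions), and the class `Peephole` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of one peephole rule: "store R, v" followed by "load v, R"
// makes the load redundant, because R already holds the value of v.
// Assumption: straight-line code with no jump targets in between.
class Peephole {
    static List<String> removeRedundantLoads(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code) {
            if (!out.isEmpty() && isReloadOf(out.get(out.size() - 1), instr)) {
                continue; // drop the redundant load
            }
            out.add(instr);
        }
        return out;
    }

    // True when prev is "store R, v" and next is "load v, R".
    static boolean isReloadOf(String prev, String next) {
        String[] p = prev.split("[ ,]+");
        String[] n = next.split("[ ,]+");
        return p.length == 3 && n.length == 3
                && p[0].equals("store") && n[0].equals("load")
                && p[1].equals(n[2])   // same register
                && p[2].equals(n[1]);  // same memory location
    }

    public static void main(String[] args) {
        List<String> code = Arrays.asList(
                "load y, R0", "store R0, x", "load x, R0", "mul R0, #2");
        // Prints [load y, R0, store R0, x, mul R0, #2]
        System.out.println(removeRedundantLoads(code));
    }
}
```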
28
Replaces the last 2 phases of a compiler with direct execution
Input:
◦ Mixed: generates and uses intermediate code/abstract syntax
◦ Pure: starts from a stream of ASCII characters each time a statement is executed
Mixed interpreters:
◦ Java, Perl, Python, Haskell, Scheme
Pure interpreters:
◦ most Basics, shell commands
29
Source code: a = x + y;
Compiler-generated object code (will be executed later, with the remainder of the program):
  load r0, x
  add r0, y
  store r0, a
Interpreter: calls an interpretive routine to actually perform the operation, e.g., add(x, y, a);
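An interpretive routine of this kind could look roughly like the sketch below. It assumes integer variables held in a map keyed by name. The class `TinyInterpreter` and its `set`/`get` helpers are hypothetical, and only the `add(x, y, a)` call shape comes from the slide.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: an interpreter performs "a = x + y;" immediately by calling a
// routine, instead of emitting load/add/store instructions for later.
// Assumption: all variables are integers stored by name.
class TinyInterpreter {
    private final Map<String, Integer> memory = new HashMap<>();

    void set(String name, int value) {
        memory.put(name, value);
    }

    int get(String name) {
        Integer v = memory.get(name);
        if (v == null) {
            throw new IllegalStateException("undefined variable: " + name);
        }
        return v;
    }

    // Interpretive routine for "target = left + right;"
    void add(String left, String right, String target) {
        memory.put(target, get(left) + get(right));
    }

    public static void main(String[] args) {
        TinyInterpreter interp = new TinyInterpreter();
        interp.set("x", 3);
        interp.set("y", 4);
        interp.add("x", "y", "a"); // executes a = x + y right now
        System.out.println(interp.get("a")); // 7
    }
}
```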
30
It's not the case that the lexical analyzer identifies all the tokens, then the parser analyzes all the tokens, and then the type/semantic analysis is performed.
Instead, the parser repeatedly calls the lexer to get another token.
As tokens are received, they either do or don't match the expected syntax.
◦ If there's a match, perform any type or semantic testing, possibly generate intermediate code, and call for another token.
31
Output of parser: the concrete parse tree is large, probably more than is needed for the next phase.
◦ The compiler usually produces some more compact representation of the program.
Example: Fig. 2.9 (page 46)
33
The shape of the parse tree reveals the meaning of the program.
So as output of syntax analysis we want a tree that removes the parse tree's inefficiency but keeps its meaning.
◦ Remove separator/punctuation terminal symbols
◦ Remove all trivial nonterminals
◦ Replace remaining nonterminals with leaf terminals
Example: Fig. 2.10
35
Removes unnecessary details but keeps the essential language elements; e.g., consider the following two equivalent loops:

Pascal                     C++
while j < n do begin       while (j < n) {
  j := j + 1;                j = j + 1;
end;                       }

Essential information: 1) it is a loop, 2) its terminating condition is j >= n, and 3) its body increments the current value of j.
36
Purpose: an intermediate form of the source code
Generated by the parser during syntax analysis
Used during type checking/semantic analysis
Abstract syntax rules are defined by the compiler (or interpreter).
One concrete syntax can have several abstract syntaxes associated with it, depending on the design of the translator.
37
LHS = RHS
◦ LHS names an abstract syntax class
◦ RHS either (1) gives a list of one or more alternatives or (2) lists the essential elements of the syntax class
Compare to production rules in the concrete grammar.
38
Assignment = Variable target ; Expression source
Expression = Variable | Value | Binary | Unary
Variable = String id
Value = Integer value
Binary = Operator op ; Expression term1, term2
Unary = UnaryOp op ; Expression term
Operator = + | - | * | / | !
Priority? Associativity? …
39
Concrete: Assignment → Identifier [ [ Expression ] ] = Expression ;
Abstract: Assignment = Variable target ; Expression source
40
[Figure: abstract syntax tree of an assignment. Labels: target, source, Binary, Operator, Variable, Value. Leaves: z, x, y, 2, +, *]
41
[Figure: a Binary node with fields op, term1, term2; a Unary node with fields op, term]
Binaries and unaries represent information that can be used for later processing.
42
Assignment = Variable target ; Expression source
Expression = VariableRef | Value | Binary | Unary
VariableRef = Variable | ArrayRef
Variable = String id
ArrayRef = String id ; Expression index
Value = IntValue | BoolValue | FloatValue | CharValue
Binary = Operator op ; Expression term1, term2
Unary = UnaryOp op ; Expression term
Operator = ArithmeticOp | RelationalOp | BooleanOp
IntValue = Integer intValue
…
43
abstract class Expression { }
abstract class VariableRef extends Expression { }
class Variable extends VariableRef { String id; }
class ArrayRef extends VariableRef { String id; Expression index; }
class Value extends Expression { … }
class Binary extends Expression { Operator op; Expression term1, term2; }
class Unary extends Expression { UnaryOp op; Expression term; }
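To show how such classes hold a program, the sketch below builds the abstract syntax tree for a hypothetical assignment `z = x + 2 * y`. It is a simplification of the classes above, not the book's code: constructors are added, `Value` holds only an `int`, the operator is a plain `String`, and the `Assignment` class and the `show` helper in `AstDemo` are assumptions made for illustration.

```java
// Sketch: building and printing an abstract syntax tree for z = x + 2 * y.
// Assumption: simplified versions of the slide's classes, with constructors.
class AstDemo {
    static String show(Expression e) {
        if (e instanceof Variable) return ((Variable) e).id;
        if (e instanceof Value) return String.valueOf(((Value) e).intValue);
        Binary b = (Binary) e;
        return "Binary(" + b.op + ", " + show(b.term1) + ", " + show(b.term2) + ")";
    }

    static Assignment example() {
        // Precedence is already encoded in the tree shape:
        // 2 * y is a subtree of the + node.
        Expression product = new Binary("*", new Value(2), new Variable("y"));
        Expression sum = new Binary("+", new Variable("x"), product);
        return new Assignment(new Variable("z"), sum);
    }

    public static void main(String[] args) {
        Assignment a = example();
        System.out.println("target = " + a.target.id);
        System.out.println("source = " + show(a.source));
    }
}

abstract class Expression { }

class Variable extends Expression {
    final String id;
    Variable(String id) { this.id = id; }
}

class Value extends Expression {
    final int intValue;
    Value(int intValue) { this.intValue = intValue; }
}

class Binary extends Expression {
    final String op;
    final Expression term1, term2;
    Binary(String op, Expression term1, Expression term2) {
        this.op = op;
        this.term1 = term1;
        this.term2 = term2;
    }
}

class Assignment {
    final Variable target;
    final Expression source;
    Assignment(Variable target, Expression source) {
        this.target = target;
        this.source = source;
    }
}
```

No parentheses, semicolons, or intermediate nonterminals appear in the tree. Only the target, the operators, and the operands remain.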
45
Lexical syntax: small, simple, defines language tokens
Concrete syntax: detailed, specific, defines correct programs, used to direct parsing algorithms (language specific, not implementation specific)
Abstract syntax: simpler than concrete, used to describe the structure of the intermediate code (implementation specific, not language specific); NOT INTENDED TO BE USED FOR PARSING
Semantics: program "meaning", or runtime behavior
46
A syntax is ambiguous if a portion of a program has two or more possible interpretations (parse trees).
Non-ambiguous grammars can be written, but some ambiguity may be tolerated to reduce grammar size.
Operator associativity and precedence can be defined by the concrete syntax.
Compilers generate code for later execution, while interpreters execute program statements as they are analyzed.