Published by Spencer Potter. Modified over 9 years ago.
1
1 Introduction to Parsing
2
2 Outline
- Regular languages revisited
- Parser overview
- Context-free grammars (CFGs)
- Derivations
3
3 Languages and Automata
- Formal languages are very important in CS
  - Especially in programming languages
- Regular languages
  - The weakest formal languages widely used
  - Many applications
- We will also study context-free languages
4
4 Limitations of Regular Languages
- Intuition: a finite automaton that runs long enough must repeat states
- A finite automaton can't remember the number of times it has visited a particular state
- A finite automaton has finite memory
  - Only enough to store which state it is in
  - Cannot count, except up to a finite limit
- E.g., the language of balanced parentheses is not regular: $\{ (^i )^i \mid i \ge 0 \}$
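The counting argument can be made concrete with a small C sketch. It recognizes the language $\{ (^i )^i \mid i \ge 0 \}$ using an integer counter, which is exactly the unbounded memory a finite automaton lacks. The function name `is_nested_parens` is a hypothetical name chosen for this illustration; it does not come from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not from the slides): returns true iff s is
 * i opening parentheses followed by exactly i closing ones, i >= 0.
 * The counter `open` can grow without bound, which is the memory a
 * finite automaton does not have. */
bool is_nested_parens(const char *s) {
    assert(s != NULL);
    size_t open = 0;
    size_t k = 0;
    while (s[k] == '(') {                 /* count the '(' prefix */
        open++;
        k++;
    }
    while (s[k] == ')' && open > 0) {     /* cancel each '(' with a ')' */
        open--;
        k++;
    }
    /* accept only if the input is used up and every '(' was cancelled */
    return s[k] == '\0' && open == 0;
}
```

Any fixed-size replacement for `open` (say, a 3-bit counter) would correspond to a finite automaton and would misjudge inputs nested deeper than the counter can represent.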
5
5 The Functionality of the Parser
- Input: sequence of tokens from lexer
- Output: parse tree of the program
6
6 Example
- Java expr: x == y ? 1 : 2
- Parser input: ID == ID ? INT : INT
- Parser output: [Figure: parse tree; surviving node labels: ID, ?:, ==, INT]
7
7 Comparison with Lexical Analysis
Phase  | Input                  | Output
Lexer  | Sequence of characters | Sequence of tokens
Parser | Sequence of tokens     | Parse tree
8
8 The Role of the Parser
- Not all sequences of tokens are programs ...
- ... so the parser must distinguish between valid and invalid sequences of tokens
- We need
  - A language for describing valid sequences of tokens
  - A method for distinguishing valid from invalid sequences of tokens
9
9 Context-Free Grammars
- Programming language constructs have recursive structure
- An EXPR is
    if EXPR then EXPR else EXPR fi, or
    while EXPR loop EXPR pool, or
    …
- Context-free grammars are a natural notation for this recursive structure
10
10 CFGs (Cont.)
- A CFG consists of
  - A set of terminals T
  - A set of non-terminals N
  - A start symbol S (a non-terminal)
  - A set of productions. Assuming $X \in N$, each production has the form
      $X \to \varepsilon$, or
      $X \to Y_1 Y_2 \cdots Y_n$ where each $Y_i \in N \cup T$
11
11 Notational Conventions
- In these lecture notes
  - Non-terminals are written upper-case
  - Terminals are written lower-case
  - The start symbol is the left-hand side of the first production
12
12 Examples of CFGs
    Expr → if Expr then Expr else Expr
         | while Expr do Expr
         | id
13
13 Examples of CFGs
Simple arithmetic expressions: [grammar not preserved]
14
14 The Language of a CFG
Read productions as replacement rules:
- $X \to Y_1 \cdots Y_n$ means X can be replaced by $Y_1 \cdots Y_n$
- $X \to \varepsilon$ means X can be erased (replaced with the empty string)
15
15 Key Idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal X in the string by a right-hand side of some production
3. Repeat (2) until there are no non-terminals in the string
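The three steps can be traced mechanically. The sketch below is in C and assumes a small illustrative grammar, S → ( S ) | ε, which is not taken from the slides. The function name `derive` and its buffer convention are likewise assumptions made for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative grammar (an assumption, not taken from the slides):
 *   S -> ( S )  |  epsilon
 * derive() starts from "S", applies S -> ( S ) `depth` times, then
 * S -> epsilon, printing every intermediate string.
 * `out` must have room for 2*depth + 2 bytes. */
void derive(int depth, char *out) {
    assert(depth >= 0 && out != NULL);
    strcpy(out, "S");                        /* step 1: the start symbol */
    printf("%s\n", out);
    for (int step = 0; step <= depth; step++) {
        char *s = strchr(out, 'S');          /* step 2: find the non-terminal */
        const char *rhs = (step < depth) ? "(S)" : "";
        size_t rhs_len = strlen(rhs);
        /* shift the tail, then splice the right-hand side in place of S */
        memmove(s + rhs_len, s + 1, strlen(s + 1) + 1);
        memcpy(s, rhs, rhs_len);
        printf("%s\n", out);
    }                                        /* step 3: no S is left, so stop */
}
```

With a buffer of at least 6 bytes, `derive(2, buf)` prints the strings `S`, `(S)`, `((S))` and `(())`, one per line, and leaves `(())` in `buf`.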
16
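16 The Language of a CFG (Cont.)
More formally, write
    $X_1 \cdots X_i \cdots X_n \to X_1 \cdots X_{i-1} Y_1 \cdots Y_m X_{i+1} \cdots X_n$
if there is a production
    $X_i \to Y_1 \cdots Y_m$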
17
17 The Language of a CFG (Cont.)
Write
    $X_1 \cdots X_n \to^{*} Y_1 \cdots Y_m$
if
    $X_1 \cdots X_n \to \cdots \to Y_1 \cdots Y_m$
in 0 or more steps
18
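18 The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    $L(G) = \{ a_1 \cdots a_n \mid S \to^{*} a_1 \cdots a_n \text{ and every } a_i \text{ is a terminal} \}$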
19
19 Terminals
- Terminals are so called because there are no rules for replacing them
- Once generated, terminals are permanent
- Terminals ought to be tokens of the language
20
20 Examples
L(G) is the language of CFG G
- Strings of balanced parentheses
- Two grammars: [first grammar not preserved] OR [second grammar not preserved]
21
21 Arithmetic Example
- Simple arithmetic expressions: [grammar not preserved]
- Some elements of the language: [examples not preserved]
22
22 Notes
The idea of a CFG is a big step. But:
- Membership in a language is “yes” or “no”; we also need the parse tree of the input
- Must handle errors gracefully
- Need an implementation of CFGs (e.g., Cup)
23
23 More Notes
- Form of the grammar is important
  - Many grammars generate the same language
  - Tools are sensitive to the grammar
  - Note: tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice
24
24 Derivations and Parse Trees
- A derivation is a sequence of productions
- A derivation can be drawn as a tree
  - Start symbol is the tree's root
  - For a production $X \to Y_1 \cdots Y_n$, add children $Y_1 \cdots Y_n$ to node X
25
25 Derivation Example
- Grammar: [not preserved]
- String: [not preserved]
26
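26 Derivation Example (Cont.)
[Figure: derivation and its parse tree; surviving node labels: E, +, *, id]
27
27 Derivation in Detail (1)
[Figure: parse tree so far; surviving node labels: E]
28
28 Derivation in Detail (2)
[Figure: parse tree so far; surviving node labels: E, +]
29
29 Derivation in Detail (3)
[Figure: parse tree so far; surviving node labels: E, +, *]
30
30 Derivation in Detail (4)
[Figure: parse tree so far; surviving node labels: E, +, *, id]
31
31 Derivation in Detail (5)
[Figure: parse tree so far; surviving node labels: E, +, *, id]
32
32 Derivation in Detail (6)
[Figure: completed parse tree; surviving node labels: E, +, *, id]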
33
33 Notes on Derivations
- A parse tree has
  - Terminals at the leaves
  - Non-terminals at the interior nodes
- An in-order traversal of the leaves is the original input
- The parse tree shows the association of operations; the input string does not
34
34 Left-most and Right-most Derivations
- The example is a left-most derivation
  - At each step, replace the left-most non-terminal
- There is an equivalent notion of a right-most derivation
35
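35 Right-most Derivation in Detail (1)
[Figure: parse tree so far; surviving node labels: E]
36
36 Right-most Derivation in Detail (2)
[Figure: parse tree so far; surviving node labels: E, +]
37
37 Right-most Derivation in Detail (3)
[Figure: parse tree so far; surviving node labels: E, +, id]
38
38 Right-most Derivation in Detail (4)
[Figure: parse tree so far; surviving node labels: E, +, *, id]
39
39 Right-most Derivation in Detail (5)
[Figure: parse tree so far; surviving node labels: E, +, *, id]
40
40 Right-most Derivation in Detail (6)
[Figure: completed parse tree; surviving node labels: E, +, *, id]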
41
41 Derivations and Parse Trees
- Note that right-most and left-most derivations have the same parse tree
- The difference is the order in which branches are added
42
42 Summary of Derivations
- We are not just interested in whether s ∈ L(G)
  - We need a parse tree for s
- A derivation defines a parse tree
  - But one parse tree may have many derivations
- Left-most and right-most derivations are important in parser implementation
43
43 Issues
- A parser consumes a sequence of tokens s and produces a parse tree
- Issues:
  - How do we recognize that s ∈ L(G)?
  - A parse tree of s describes how s ∈ L(G)
  - Ambiguity: more than one parse tree (interpretation) for some string s
  - Error: no parse tree for some string s
  - How do we construct the parse tree?
44
44 Ambiguity
- Grammar: E → E + E | E * E | ( E ) | int
- String: int * int + int
45
45 Ambiguity (Cont.)
This string has two parse trees
[Figure: the two parse trees; surviving node labels: E, *, +, int]
46
46 Ambiguity (Cont.)
- A grammar is ambiguous if it has more than one parse tree for some string
  - Equivalently, there is more than one right-most or left-most derivation for some string
- Ambiguity is bad
  - Leaves meaning of some programs ill-defined
- Ambiguity is common in programming languages
  - Arithmetic expressions
  - IF-THEN-ELSE
47
47 Dealing with Ambiguity
- There are several ways to handle ambiguity
- Most direct method is to rewrite the grammar unambiguously
    E → T + E | T
    T → int * T | int | ( E )
- Enforces precedence of * over +
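One way to see the precedence effect is to evaluate expressions while following the rewritten grammar rule by rule. The C sketch below does that, under several assumptions that are not in the slides: the input is a character string rather than a token stream, `int` means an unsigned decimal literal, and the choice between alternatives is made by peeking at the next character rather than by backtracking. All function and variable names (`eval`, `parse_E`, `parse_T`, `cur`, `ok`) are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static const char *cur;   /* next unread character (hypothetical global) */
static bool ok;           /* cleared on the first syntax error */

static long parse_E(void);

static void skip_spaces(void) { while (*cur == ' ') cur++; }

/* T -> int * T | int | ( E ) */
static long parse_T(void) {
    skip_spaces();
    if (*cur == '(') {                        /* T -> ( E ) */
        cur++;
        long inner = parse_E();
        skip_spaces();
        if (*cur == ')') cur++; else ok = false;
        return inner;
    }
    if (!isdigit((unsigned char)*cur)) { ok = false; return 0; }
    long v = 0;
    while (isdigit((unsigned char)*cur)) v = v * 10 + (*cur++ - '0');
    skip_spaces();
    if (*cur == '*') { cur++; v *= parse_T(); }   /* T -> int * T */
    return v;
}

/* E -> T + E | T */
static long parse_E(void) {
    long v = parse_T();
    skip_spaces();
    if (ok && *cur == '+') { cur++; v += parse_E(); }
    return v;
}

/* Returns true and stores the value if s is in the language of E. */
bool eval(const char *s, long *result) {
    assert(s != NULL && result != NULL);
    cur = s;
    ok = true;
    long v = parse_E();
    skip_spaces();
    if (!ok || *cur != '\0') return false;
    *result = v;
    return true;
}
```

Because every `*` is consumed inside `parse_T` before `parse_E` ever looks for a `+`, `2 + 3 * 4` evaluates to 14 and `2 * 3 + 4` to 10. The sketch also follows the grammar literally: `( 2 + 3 ) * 4` is rejected, since T → ( E ) has no alternative that continues with `*`.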
48
48 Ambiguity: The Dangling Else
- Consider the grammar
    E → if E then E
      | if E then E else E
      | OTHER
- This grammar is also ambiguous
49
49 The Dangling Else: Example
- The expression
    if E₁ then if E₂ then E₃ else E₄
  has two parse trees
  [Figure: the two parse trees; surviving node labels: if, E₁, E₂, E₃, E₄]
- Typically we want the second form
50
50 The Dangling Else: A Fix
- else matches the closest unmatched then
- We can describe this in the grammar
    E   → MIF    /* all then are matched */
        | UIF    /* some then are unmatched */
    MIF → if E then MIF else MIF
        | OTHER
    UIF → if E then E
        | if E then MIF else UIF
- Describes the same set of strings
51
51 The Dangling Else: Example Revisited
- The expression if E₁ then if E₂ then E₃ else E₄
  [Figure: two candidate parse trees; surviving node labels: if(UIF), if(MIF), E₁, E₂, E₃, E₄]
  - Not valid because the then expression is not a MIF
  - A valid parse tree (for a UIF)
52
52 Ambiguity
- No general techniques for handling ambiguity
- Impossible to automatically convert an ambiguous grammar to an unambiguous one
- Used with care, ambiguity can simplify the grammar
  - Sometimes allows more natural definitions
  - We need disambiguation mechanisms
53
53 Precedence and Associativity Declarations
- Instead of rewriting the grammar
  - Use the more natural (ambiguous) grammar
  - Along with disambiguating declarations
- Most tools allow precedence and associativity declarations to disambiguate grammars
- Examples …
54
54 Associativity Declarations
- Consider the grammar E → E + E | int
- Ambiguous: two parse trees of int + int + int
  [Figure: the two parse trees; surviving node labels: E, +, int]
- Left-associativity declaration: %left +
55
55 Precedence Declarations
- Consider the grammar E → E + E | E * E | int
  - And the string int + int * int
  [Figure: the two parse trees; surviving node labels: E, +, *, int]
- Precedence declarations:
    %left +
    %left *
56
56 Abstract Syntax Trees
- So far a parser traces the derivation of a sequence of tokens
- The rest of the compiler needs a structural representation of the program
- Abstract syntax trees
  - Like parse trees but ignore some details
  - Abbreviated as AST
57
57 Abstract Syntax Tree (Cont.)
- Consider the grammar E → int | ( E ) | E + E
- And the string 5 + (2 + 3)
- After lexical analysis (a list of tokens)
    int₅ ‘+’ ‘(’ int₂ ‘+’ int₃ ‘)’
- During parsing we build a parse tree …
58
58 Example of Parse Tree
[Figure: parse tree; surviving node labels: E, +, ( E ), int₅, int₂, int₃]
- Traces the operation of the parser
- Does capture the nesting structure
- But too much info
  - Parentheses
  - Single-successor nodes
59
59 Example of Abstract Syntax Tree
[Figure: abstract syntax tree; surviving node labels: PLUS, 5, 2, 3]
- Also captures the nesting structure
- But abstracts from the concrete syntax => more compact and easier to use
- An important data structure in a compiler
60
60 Constructing An AST
- We first define the AST class hierarchy
  - ASTNode, with subclasses IntNode and PlusNode
- Consider an abstract tree type with two constructors:
    new IntNode(n) = [Figure: leaf node labelled n]
    new PlusNode(T₁, T₂) = [Figure: PLUS node with subtrees T₁ and T₂]
61
61 Semantic Actions
This is what we'll use to construct ASTs
- Each grammar symbol may have attributes
  - For terminal symbols (lexical tokens) attributes can be calculated by the lexer
- Each production may have an action
  - Written as: X → Y₁ … Yₙ { action }
  - That can refer to or compute symbol attributes
62
62 Constructing an AST
- We define an attribute ast for non-terminals
  - Values of ast attributes are ASTs
  - We assume that int.lexval is the value of the integer lexeme
  - Computed using semantic actions
      E → int        { E.ast = new IntNode(int.lexval) }
        | E₁ + E₂    { E.ast = new PlusNode(E₁.ast, E₂.ast) }
        | ( E₁ )     { E.ast = E₁.ast }
63
63 Parse Tree Example
- Consider the string int₅ ‘+’ ‘(’ int₂ ‘+’ int₃ ‘)’
- A bottom-up evaluation of the ast attribute:
    E.ast = new PlusNode(new IntNode(5), new PlusNode(new IntNode(2), new IntNode(3)))
  [Figure: resulting AST; surviving node labels: PLUS, 5, 2, 3]
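The two constructors can be sketched in C, the language of the parser code later in this deck. C has no classes, so this sketch models the IntNode/PlusNode hierarchy as a single tagged struct; that modelling choice, the lower-case function names, and the extra `eval_ast` and `free_ast` helpers are all assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stdlib.h>

/* C has no classes, so the IntNode/PlusNode hierarchy is modelled
 * here as one tagged struct (an assumption made for this sketch). */
typedef enum { INT_NODE, PLUS_NODE } NodeKind;

typedef struct ASTNode {
    NodeKind kind;
    int value;                      /* used by INT_NODE only */
    struct ASTNode *left, *right;   /* used by PLUS_NODE only */
} ASTNode;

ASTNode *new_int_node(int n) {
    ASTNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kind = INT_NODE;
    node->value = n;
    node->left = node->right = NULL;
    return node;
}

ASTNode *new_plus_node(ASTNode *t1, ASTNode *t2) {
    assert(t1 != NULL && t2 != NULL);
    ASTNode *node = malloc(sizeof *node);
    assert(node != NULL);
    node->kind = PLUS_NODE;
    node->value = 0;
    node->left = t1;
    node->right = t2;
    return node;
}

/* Hypothetical consumer of the AST: adds up the tree, children first. */
int eval_ast(const ASTNode *node) {
    if (node->kind == INT_NODE) return node->value;
    return eval_ast(node->left) + eval_ast(node->right);
}

void free_ast(ASTNode *node) {
    if (node == NULL) return;
    free_ast(node->left);
    free_ast(node->right);
    free(node);
}

/* AST for 5 + (2 + 3); the parentheses leave no node behind. */
ASTNode *build_example(void) {
    return new_plus_node(new_int_node(5),
                         new_plus_node(new_int_node(2), new_int_node(3)));
}
```

The tree returned by `build_example` has three leaves and two PLUS nodes, and `eval_ast` on it yields 10. The production E → ( E₁ ) contributes nothing because its action simply passes E₁.ast upward.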
64
64 Review
- We can specify language syntax using a CFG
- A parser will answer whether s ∈ L(G)
  - and will build a parse tree
  - which we convert to an AST
  - and pass on to the rest of the compiler
- Next lectures:
  - How do we answer s ∈ L(G) and build a parse tree?
65
65 Introduction to Top-Down Parsing
- Terminals are seen in order of appearance in the token stream: t₂ t₅ t₆ t₈ t₉
- The parse tree is constructed
  - From the top
  - From left to right
[Figure: parse tree; surviving node labels: 1, t₂, 3, 4, t₅, t₆, 7, t₈, t₉]
66
66 Recursive Descent Parsing
- Consider the grammar
    E → T + E | T
    T → int | int * T | ( E )
- Token stream is: int₅ * int₂
- Start with top-level non-terminal E
- Try the rules for E in order
67
67 Recursive Descent Parsing. Example (Cont.)
- Try E₀ → T₁ + E₂
- Then try a rule for T₁ → ( E₃ )
  - But ( does not match input token int₅
- Try T₁ → int. Token matches.
  - But + after T₁ does not match input token *
- Try T₁ → int * T₂
  - This will match but + after T₁ will be unmatched
- Have exhausted the choices for T₁
  - Backtrack to choice for E₀
68
68 Recursive Descent Parsing. Example (Cont.)
- Try E₀ → T₁
- Follow same steps as before for T₁
  - And succeed with T₁ → int * T₂ and T₂ → int
  - With the following parse tree
[Figure: parse tree; surviving node labels: E₀, T₁, int₅, *, T₂, int₂]
69
69 Recursive Descent Parsing. Notes.
- Easy to implement by hand
- But does not always work …
70
70 Implementation of a Recursive Descent Parser
71
71 A Recursive Descent Parser. Preliminaries
- Let TOKEN be the type of tokens
  - Special tokens INT, OPEN, CLOSE, PLUS, TIMES
- Let the global next point to the next token
72
72 A Recursive Descent Parser (2)
- Define boolean functions that check the token string for a match of
  - A given token terminal
      bool term(TOKEN tok) { return *next++ == tok; }
  - A given production of S (the nth)
      bool Sn() { … }   // and test
  - Any production of S:
      bool S() { … }    // or test
- These functions advance next
73
73 A Recursive Descent Parser (3)
- For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
- For production E → T
    bool E2() { return T(); }
- For all productions of E (with backtracking)
    bool E() {
      TOKEN *save = next;
      return (next = save, E1())
          || (next = save, E2());
    }
74
74 A Recursive Descent Parser (4)
- Functions for non-terminal T
    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }
    bool T() {
      TOKEN *save = next;
      return (next = save, T1())
          || (next = save, T2())
          || (next = save, T3());
    }
75
75 Recursive Descent Parsing. Notes.
- To start the parser
  - Initialize next to point to first token
  - Invoke E()
- Notice how this simulates our backtracking example from lecture
- Easy to implement by hand
- But does not always work …
  - Predictive parsing is better
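The fragments from slides 71 to 75 can be assembled into one compilable C file. The sketch below makes a few assumptions the slides leave open: an `END` sentinel token marks the end of the input, `term` advances `next` only on a successful match (the slide version always advances), and a wrapper named `parse` performs the start-up steps and additionally checks that all input was consumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* END is an assumed sentinel marking the end of the token array;
 * the slides do not say how the end of input is represented. */
typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

static const TOKEN *next;            /* points to the next token */

static bool E(void);
static bool T(void);

/* Unlike the slide version (*next++ == tok), advance only on a match,
 * so next never moves past the END sentinel. */
static bool term(TOKEN tok) {
    if (*next != tok) return false;
    next++;
    return true;
}

static bool E1(void) { return T() && term(PLUS) && E(); }   /* E -> T + E */
static bool E2(void) { return T(); }                        /* E -> T     */

/* Try each production of E in order, restoring next before each try. */
static bool E(void) {
    const TOKEN *save = next;
    return (next = save, E1()) || (next = save, E2());
}

static bool T1(void) { return term(OPEN) && E() && term(CLOSE); }  /* T -> ( E )   */
static bool T2(void) { return term(INT) && term(TIMES) && T(); }   /* T -> int * T */
static bool T3(void) { return term(INT); }                         /* T -> int     */

static bool T(void) {
    const TOKEN *save = next;
    return (next = save, T1()) || (next = save, T2()) || (next = save, T3());
}

/* Hypothetical entry point: accept only if E matches and every token
 * before END was consumed. */
bool parse(const TOKEN *tokens) {
    assert(tokens != NULL);
    next = tokens;
    return E() && *next == END;
}
```

For the token stream `INT TIMES INT END` from the worked example, `E1` first matches `T` via `T2`, fails on `term(PLUS)`, and `E` then backtracks to `E2`, which succeeds. Checking `*next == END` in `parse` is what rejects inputs such as `INT INT END`, where `E` succeeds on a prefix only.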