Introduction to Parsing. Outline: Regular languages revisited; Parser overview; Context-free grammars (CFGs); Derivations.

1 Introduction to Parsing

2 Outline
• Regular languages revisited
• Parser overview
• Context-free grammars (CFGs)
• Derivations

3 Languages and Automata
• Formal languages are very important in CS
  » Especially in programming languages
• Regular languages
  » The weakest formal languages widely used
  » Many applications
• We will also study context-free languages

4 Limitations of Regular Languages
• Intuition: a finite automaton that runs long enough must repeat states
• A finite automaton can't remember the number of times it has visited a particular state
• A finite automaton has finite memory
  » Only enough to store which state it is in
  » Cannot count, except up to a finite limit
• E.g., the language of balanced parentheses is not regular: $\{ (^i )^i \mid i \ge 0 \}$
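
The counting argument above can be made concrete. The following is a minimal sketch, not part of the original slides: a recognizer for $\{ (^i )^i \mid i \ge 0 \}$ that relies on a counter, which is exactly the resource a finite automaton does not have. The function name `is_nested_parens` is an illustrative assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true iff s consists of i opening parentheses followed by
   exactly i closing parentheses, for some i >= 0. */
bool is_nested_parens(const char *s)
{
    size_t depth = 0;  /* the counter a finite automaton cannot keep */

    /* Count the leading run of '('. */
    while (*s == '(') {
        depth++;
        s++;
    }
    /* Match each ')' against one counted '('. */
    while (*s == ')' && depth > 0) {
        depth--;
        s++;
    }
    /* Accept only if every '(' was matched and no input is left. */
    return depth == 0 && *s == '\0';
}
```

Because `depth` can grow with the input, no fixed number of states can stand in for it; a finite automaton could only imitate this for nesting up to some fixed limit.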

5 The Functionality of the Parser
• Input: sequence of tokens from the lexer
• Output: parse tree of the program

6 Example
• Java expr: x == y ? 1 : 2
• Parser input: ID == ID ? INT : INT
• Parser output: a tree with ?: at the root and three children: == (over ID and ID), INT, and INT
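
To show where the parser input in this example comes from, here is a small lexer sketch. It is an assumption-laden illustration rather than anything from the slides: the names `TokenKind`, `TOK_*`, and `lex` are hypothetical, and only the token kinds needed for this one expression are handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds; TOK_END marks the end of the input. */
typedef enum {
    TOK_ID, TOK_INT, TOK_EQ, TOK_QUESTION, TOK_COLON, TOK_ERROR, TOK_END
} TokenKind;

/* Scans src and stores at most max token kinds in out.
   Returns the number of entries stored, including the final
   TOK_END (or TOK_ERROR) when there is room for it. */
size_t lex(const char *src, TokenKind *out, size_t max)
{
    size_t n = 0;
    while (n < max) {
        while (isspace((unsigned char)*src))
            src++;                                /* skip whitespace */
        if (*src == '\0') {
            out[n++] = TOK_END;
            break;
        }
        if (isalpha((unsigned char)*src)) {       /* identifier */
            while (isalnum((unsigned char)*src))
                src++;
            out[n++] = TOK_ID;
        } else if (isdigit((unsigned char)*src)) { /* integer literal */
            while (isdigit((unsigned char)*src))
                src++;
            out[n++] = TOK_INT;
        } else if (src[0] == '=' && src[1] == '=') {
            src += 2;
            out[n++] = TOK_EQ;
        } else if (*src == '?') {
            src++;
            out[n++] = TOK_QUESTION;
        } else if (*src == ':') {
            src++;
            out[n++] = TOK_COLON;
        } else {                                  /* unknown character */
            out[n++] = TOK_ERROR;
            break;
        }
    }
    return n;
}
```

Calling `lex("x == y ? 1 : 2", buf, 16)` yields the sequence ID == ID ? INT : INT followed by the end marker, which is the token stream the parser sees.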

7 Comparison with Lexical Analysis
Phase    Input                    Output
Lexer    Sequence of characters   Sequence of tokens
Parser   Sequence of tokens       Parse tree

8 The Role of the Parser
• Not all sequences of tokens are programs
• The parser must distinguish between valid and invalid sequences of tokens
• We need
  » A language for describing valid sequences of tokens
  » A method for distinguishing valid from invalid sequences of tokens

9 Context-Free Grammars
• Programming language constructs have recursive structure
• An EXPR is
    if EXPR then EXPR else EXPR fi, or
    while EXPR loop EXPR pool, or
    …
• Context-free grammars are a natural notation for this recursive structure

10 CFGs (Cont.)
• A CFG consists of
  » A set of terminals T
  » A set of non-terminals N
  » A start symbol S (a non-terminal)
  » A set of productions, each of the form (assuming $X \in N$)
      $X \to \varepsilon$, or
      $X \to Y_1 Y_2 \dots Y_n$ where each $Y_i \in N \cup T$

11 Notational Conventions
• In these lecture notes
  » Non-terminals are written upper-case
  » Terminals are written lower-case
  » The start symbol is the left-hand side of the first production

12 Examples of CFGs
Expr → if Expr then Expr else Expr
     | while Expr do Expr
     | id

13 Examples of CFGs
Simple arithmetic expressions:
    E → E + E | E * E | ( E ) | id

14 The Language of a CFG
Read productions as replacement rules:
• $X \to Y_1 \dots Y_n$ means X can be replaced by $Y_1 \dots Y_n$
• $X \to \varepsilon$ means X can be erased (replaced with the empty string)

15 Key Idea
1. Begin with a string consisting of the start symbol "S"
2. Replace any non-terminal X in the string by a right-hand side of some production for X
3. Repeat (2) until there are no non-terminals in the string
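
The three steps above can be sketched directly as string rewriting. This is an illustration under stated assumptions, not the lecture's code: sentential forms are plain C strings, each grammar symbol is a single character, and the helper name `rewrite_leftmost` is hypothetical. The usage below assumes the grammar S → ( S ) | ε for nested parentheses.

```c
#include <stdbool.h>
#include <string.h>

/* Replaces the left-most occurrence of the non-terminal nt in form
   (a NUL-terminated sentential form in a buffer of cap bytes) by the
   right-hand side rhs. Returns false if nt does not occur or the
   result would not fit in the buffer. */
bool rewrite_leftmost(char *form, size_t cap, char nt, const char *rhs)
{
    char *pos = strchr(form, nt);
    if (pos == NULL)
        return false;              /* no non-terminal left: derivation done */

    size_t len = strlen(form);
    size_t rhs_len = strlen(rhs);
    if (len - 1 + rhs_len + 1 > cap)
        return false;              /* not enough room for the new form */

    /* Shift the tail (including the terminating NUL), then copy rhs in. */
    memmove(pos + rhs_len, pos + 1, strlen(pos + 1) + 1);
    memcpy(pos, rhs, rhs_len);
    return true;
}
```

Starting from the buffer "S", applying S → ( S ) twice and then S → ε (an empty `rhs`) gives the forms "(S)", "((S))", and "(())". A further call returns false because no non-terminal remains, which is the stopping condition of step 3.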

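16 The Language of a CFG (Cont.)
More formally, write
    $X_1 \dots X_i \dots X_n \to X_1 \dots X_{i-1} Y_1 \dots Y_m X_{i+1} \dots X_n$
if there is a production
    $X_i \to Y_1 \dots Y_m$

17 The Language of a CFG (Cont.)
Write
    $X_1 \dots X_n \to^* Y_1 \dots Y_m$
if
    $X_1 \dots X_n \to \dots \to Y_1 \dots Y_m$
in 0 or more steps

18 The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
    $L(G) = \{ a_1 \dots a_n \mid S \to^* a_1 \dots a_n \text{ and every } a_i \text{ is a terminal} \}$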

19 Terminals
• Terminals are so called because there are no rules for replacing them
• Once generated, terminals are permanent
• Terminals ought to be tokens of the language

20 Examples
L(G) is the language of CFG G
Strings of balanced parentheses: $\{ (^i )^i \mid i \ge 0 \}$
Two grammars:
    S → ( S )
    S → ε
OR
    S → ( S ) | ε

21 Arithmetic Example
Simple arithmetic expressions:
    E → E + E | E * E | ( E ) | id
Some elements of the language:
    id    id + id    id * id    ( id )    id * ( id )

22 Notes
The idea of a CFG is a big step. But:
• Membership in a language is "yes" or "no"; we also need the parse tree of the input
• Must handle errors gracefully
• Need an implementation of CFGs (e.g., Cup)

23 More Notes
• Form of the grammar is important
  » Many grammars generate the same language
  » Tools are sensitive to the grammar
  » Note: tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

24 Derivations and Parse Trees
• A derivation is a sequence of productions: $S \to \dots \to \dots$
• A derivation can be drawn as a tree
  » The start symbol is the tree's root
  » For a production $X \to Y_1 \dots Y_n$, add children $Y_1, \dots, Y_n$ to node X

25 Derivation Example
• Grammar: E → E + E | E * E | ( E ) | id
• String: id * id + id

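26 Derivation Example (Cont.)
E → E + E → E * E + E → id * E + E → id * id + E → id * id + id
[Figure: parse tree with root E and children E, +, E; the left E has children E, *, E; each leaf E derives id]

27 Derivation in Detail (1)
E
[Figure: the root E alone]

28 Derivation in Detail (2)
E → E + E
[Figure: the root E gains children E, +, E]

29 Derivation in Detail (3)
→ E * E + E
[Figure: the left E gains children E, *, E]

30 Derivation in Detail (4)
→ id * E + E
[Figure: the left-most E gains the child id]

31 Derivation in Detail (5)
→ id * id + E
[Figure: the next E gains the child id]

32 Derivation in Detail (6)
→ id * id + id
[Figure: the last E gains the child id, completing the tree]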

33 Notes on Derivations
• A parse tree has
  » Terminals at the leaves
  » Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not
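
The claim that the leaves, read left to right, give back the input can be checked with a small tree walk. The sketch below is illustrative only: the `Node` layout (at most three children per node) and the name `collect_leaves` are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* A parse tree node: interior nodes carry a non-terminal,
   leaves carry a terminal. Unused child slots are NULL. */
typedef struct Node {
    const char *label;
    const struct Node *kids[3];
} Node;

/* Appends the labels of the leaves of t, left to right and separated
   by single spaces, to out (a NUL-terminated buffer of cap bytes).
   Leaves that would overflow the buffer are skipped. */
void collect_leaves(const Node *t, char *out, size_t cap)
{
    if (t->kids[0] == NULL) {                 /* leaf: a terminal */
        size_t used = strlen(out);
        size_t need = strlen(t->label) + (used > 0 ? 1 : 0);
        if (used + need + 1 <= cap) {
            if (used > 0)
                strcat(out, " ");
            strcat(out, t->label);
        }
        return;
    }
    for (int i = 0; i < 3 && t->kids[i] != NULL; i++)
        collect_leaves(t->kids[i], out, cap); /* children in order */
}
```

Building the tree whose root E has children E, +, E, with the left E expanded to E * E and every remaining E expanded to id, and then calling `collect_leaves` on the root with an empty buffer produces "id * id + id". The grouping of * below + is visible only in the tree, not in that string.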

34 Left-most and Right-most Derivations
• The example is a left-most derivation
  » At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation

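35 Right-most Derivation in Detail (1)
E
[Figure: the root E alone]

36 Right-most Derivation in Detail (2)
E → E + E
[Figure: the root E gains children E, +, E]

37 Right-most Derivation in Detail (3)
→ E + id
[Figure: the right-most E gains the child id]

38 Right-most Derivation in Detail (4)
→ E * E + id
[Figure: the left E gains children E, *, E]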

41 Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is the order in which branches are added

42 Summary of Derivations
• We are not just interested in whether $s \in L(G)$
  » We need a parse tree for s
• A derivation defines a parse tree
  » But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation

43 Issues
• A parser consumes a sequence of tokens s and produces a parse tree
• Issues:
  » How do we recognize that $s \in L(G)$?
  » A parse tree of s describes how $s \in L(G)$
  » Ambiguity: more than one parse tree (interpretation) for some string s
  » Error: no parse tree for some string s
  » How do we construct the parse tree?

44 Ambiguity
• Grammar: E → E + E | E * E | ( E ) | int
• String: int * int + int

45 Ambiguity (Cont.)
This string has two parse trees
[Figure: two parse trees for int * int + int, one grouping (int * int) + int, the other int * (int + int)]

46 Ambiguity (Cont.)
• A grammar is ambiguous if it has more than one parse tree for some string
  » Equivalently, there is more than one right-most or left-most derivation for some string
• Ambiguity is bad
  » Leaves the meaning of some programs ill-defined
• Ambiguity is common in programming languages
  » Arithmetic expressions
  » IF-THEN-ELSE

47 Dealing with Ambiguity
• There are several ways to handle ambiguity
• The most direct method is to rewrite the grammar unambiguously
    E → T + E | T
    T → int * T | int | ( E )
• Enforces precedence of * over +
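
One way to see that this grammar gives * precedence over + is to evaluate expressions by following its productions. The sketch below is an illustration, not the lecture's parser: it works on characters instead of tokens, it decides between the alternatives by looking at the next character (a left-factored reading of the grammar, with no backtracking), and it assumes well-formed input without whitespace. The names `cur`, `parse_E`, `parse_T`, and `eval` are hypothetical.

```c
#include <ctype.h>

static const char *cur;   /* next unread character of the input */

static long parse_E(void);

/* T -> int * T | int | ( E ) */
static long parse_T(void)
{
    long v = 0;
    if (*cur == '(') {                    /* T -> ( E ) */
        cur++;
        v = parse_E();
        if (*cur == ')')
            cur++;
        return v;
    }
    while (isdigit((unsigned char)*cur))  /* read one int */
        v = v * 10 + (*cur++ - '0');
    if (*cur == '*') {                    /* T -> int * T */
        cur++;
        return v * parse_T();
    }
    return v;                             /* T -> int */
}

/* E -> T + E | T */
static long parse_E(void)
{
    long v = parse_T();
    if (*cur == '+') {                    /* E -> T + E */
        cur++;
        return v + parse_E();
    }
    return v;                             /* E -> T */
}

/* Evaluates a well-formed expression; performs no error reporting. */
long eval(const char *s)
{
    cur = s;
    return parse_E();
}
```

Because a product can only be formed inside T, `eval("2+3*4")` multiplies before adding and returns 14, while `eval("2*3+4")` returns 10. Note that the grammar as written derives 2*(3+4) but not (3+4)*2, since T → int * T requires an int on the left.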

48 Ambiguity: The Dangling Else
• Consider the grammar
    E → if E then E
      | if E then E else E
      | OTHER
• This grammar is also ambiguous

49 The Dangling Else: Example
• The expression
    if E₁ then if E₂ then E₃ else E₄
  has two parse trees
  [Figure: two parse trees; in the first, else E₄ belongs to the outer if; in the second, else E₄ belongs to the inner if]
• Typically we want the second form

50 The Dangling Else: A Fix
• else matches the closest unmatched then
• We can describe this in the grammar
    E   → MIF    /* all then are matched */
        | UIF    /* some then are unmatched */
    MIF → if E then MIF else MIF
        | OTHER
    UIF → if E then E
        | if E then MIF else UIF
• Describes the same set of strings

51 The Dangling Else: Example Revisited
• The expression if E₁ then if E₂ then E₃ else E₄
  [Figure: two candidate parse trees]
  » Outer if is a UIF whose then branch is the MIF "if E₂ then E₃ else E₄": a valid parse tree (for a UIF)
  » Outer if has then branch "if E₂ then E₃" and else branch E₄: not valid because the then expression is not a MIF

52 Ambiguity
• No general techniques for handling ambiguity
• Impossible to automatically convert an ambiguous grammar to an unambiguous one
• Used with care, ambiguity can simplify the grammar
  » Sometimes allows more natural definitions
  » We need disambiguation mechanisms

53 Precedence and Associativity Declarations
• Instead of rewriting the grammar
  » Use the more natural (ambiguous) grammar
  » Along with disambiguating declarations
• Most tools allow precedence and associativity declarations to disambiguate grammars
• Examples …

54 Associativity Declarations
• Consider the grammar E → E + E | int
• Ambiguous: two parse trees of int + int + int
  [Figure: two parse trees, one grouping (int + int) + int, the other int + (int + int)]
• Left-associativity declaration: %left +

55 Precedence Declarations
• Consider the grammar E → E + E | E * E | int
  » And the string int + int * int
  [Figure: two parse trees, one grouping (int + int) * int, the other int + (int * int)]
• Precedence declarations:
    %left +
    %left *

56 Abstract Syntax Trees
• So far a parser traces the derivation of a sequence of tokens
• The rest of the compiler needs a structural representation of the program
• Abstract syntax trees
  » Like parse trees but ignore some details
  » Abbreviated as AST

57 Abstract Syntax Tree (Cont.)
• Consider the grammar E → int | ( E ) | E + E
• And the string 5 + (2 + 3)
• After lexical analysis (a list of tokens):
    int₅ '+' '(' int₂ '+' int₃ ')'
• During parsing we build a parse tree …

58 Example of Parse Tree
[Figure: parse tree for 5 + (2 + 3); the root E has children E, +, E; the left E derives int₅; the right E derives ( E ), whose inner E has children E, +, E deriving int₂ and int₃]
• Traces the operation of the parser
• Does capture the nesting structure
• But too much info
  » Parentheses
  » Single-successor nodes

59 Example of Abstract Syntax Tree
[Figure: AST for 5 + (2 + 3): a PLUS node with children 5 and a second PLUS node, which has children 2 and 3]
• Also captures the nesting structure
• But abstracts from the concrete syntax => more compact and easier to use
• An important data structure in a compiler

60 Constructing An AST
• We first define the AST class hierarchy
  » ASTNode, with subclasses IntNode and PlusNode
• Consider an abstract tree type with two constructors:
    new IntNode(n)          = a leaf node holding n
    new PlusNode(T₁, T₂)    = a PLUS node with subtrees T₁ and T₂

61 Semantic Actions
• This is what we'll use to construct ASTs
• Each grammar symbol may have attributes
  » For terminal symbols (lexical tokens) attributes can be calculated by the lexer
• Each production may have an action
  » Written as: X → Y₁ … Yₙ { action }
  » That can refer to or compute symbol attributes

62 Constructing an AST
• We define an attribute ast for non-terminals
  » Values of ast attributes are ASTs
  » We assume that int.lexval is the value of the integer lexeme
  » Computed using semantic actions
    E → int         E.ast = new IntNode(int.lexval)
      | E₁ + E₂     E.ast = new PlusNode(E₁.ast, E₂.ast)
      | ( E₁ )      E.ast = E₁.ast

63 Parse Tree Example
• Consider the string int₅ '+' '(' int₂ '+' int₃ ')'
• A bottom-up evaluation of the ast attribute:
    E.ast = new PlusNode(new IntNode(5),
                         new PlusNode(new IntNode(2), new IntNode(3)))
  [Figure: the resulting AST, a PLUS node with children 5 and a second PLUS node, which has children 2 and 3]
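
The two constructors can be mirrored in C with a tagged struct. This is a sketch under assumptions, not the course's class hierarchy: the slides use object-style constructors, whereas the names `Ast`, `new_int_node`, `new_plus_node`, `ast_eval`, and `ast_free` below are hypothetical C stand-ins.

```c
#include <stdlib.h>

typedef enum { INT_NODE, PLUS_NODE } NodeKind;

typedef struct Ast {
    NodeKind kind;
    int value;                  /* meaningful for INT_NODE only */
    struct Ast *left, *right;   /* meaningful for PLUS_NODE only */
} Ast;

/* Allocates a node, aborting on allocation failure to keep the sketch short. */
static Ast *alloc_node(NodeKind kind)
{
    Ast *t = malloc(sizeof *t);
    if (t == NULL)
        abort();
    t->kind = kind;
    t->value = 0;
    t->left = t->right = NULL;
    return t;
}

/* Counterpart of new IntNode(n). */
Ast *new_int_node(int n)
{
    Ast *t = alloc_node(INT_NODE);
    t->value = n;
    return t;
}

/* Counterpart of new PlusNode(T1, T2). */
Ast *new_plus_node(Ast *t1, Ast *t2)
{
    Ast *t = alloc_node(PLUS_NODE);
    t->left = t1;
    t->right = t2;
    return t;
}

/* Walks the AST; the parentheses of the source text are already gone. */
int ast_eval(const Ast *t)
{
    if (t->kind == INT_NODE)
        return t->value;
    return ast_eval(t->left) + ast_eval(t->right);
}

/* Frees a whole tree. */
void ast_free(Ast *t)
{
    if (t == NULL)
        return;
    ast_free(t->left);
    ast_free(t->right);
    free(t);
}
```

The bottom-up evaluation for 5 + (2 + 3) then corresponds to `new_plus_node(new_int_node(5), new_plus_node(new_int_node(2), new_int_node(3)))`, and `ast_eval` on that tree returns 10.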

64 Review
• We can specify language syntax using a CFG
• A parser will answer whether $s \in L(G)$
  » and will build a parse tree
  » which we convert to an AST
  » and pass on to the rest of the compiler
• Next lectures:
  » How do we answer $s \in L(G)$ and build a parse tree?

65 Introduction to Top-Down Parsing
• Terminals are seen in order of appearance in the token stream: t₂ t₅ t₆ t₈ t₉
• The parse tree is constructed
  » From the top
  » From left to right
[Figure: parse tree whose nodes are numbered 1 to 9 in the order they are constructed, with terminal leaves t₂, t₅, t₆, t₈, t₉]

66 Recursive Descent Parsing
• Consider the grammar
    E → T + E | T
    T → int | int * T | ( E )
• Token stream is: int₅ * int₂
• Start with top-level non-terminal E
• Try the rules for E in order

67 Recursive Descent Parsing. Example (Cont.)
• Try E₀ → T₁ + E₂
• Then try a rule for T₁: T₁ → ( E₃ )
  » But ( does not match input token int₅
• Try T₁ → int. Token matches.
  » But + after T₁ does not match input token *
• Try T₁ → int * T₂
  » This will match, but + after T₁ will be unmatched
• Have exhausted the choices for T₁
  » Backtrack to choice for E₀

68 Recursive Descent Parsing. Example (Cont.)
• Try E₀ → T₁
• Follow the same steps as before for T₁
  » And succeed with T₁ → int * T₂ and T₂ → int
  » With the following parse tree: E₀ has the single child T₁; T₁ has children int₅, *, T₂; T₂ has the child int₂

69 Recursive Descent Parsing. Notes.
• Easy to implement by hand
• But does not always work …

70 Implementation of a Recursive Descent Parser

71 A Recursive Descent Parser. Preliminaries
• Let TOKEN be the type of tokens
  » Special tokens INT, OPEN, CLOSE, PLUS, TIMES
• Let the global next point to the next token

72 A Recursive Descent Parser (2)
• Define boolean functions that check the token string for a match of
  » A given token terminal
      bool term(TOKEN tok) { return *next++ == tok; }
  » A given production of S (the nth)
      bool Sn() { … }   // and test
  » Any production of S
      bool S() { … }    // or test
• These functions advance next

73 A Recursive Descent Parser (3)
• For production E → T + E
    bool E1() { return T() && term(PLUS) && E(); }
• For production E → T
    bool E2() { return T(); }
• For all productions of E (with backtracking)
    bool E() {
        TOKEN *save = next;
        return (next = save, E1())
            || (next = save, E2());
    }

74 A Recursive Descent Parser (4)
• Functions for non-terminal T
    bool T1() { return term(OPEN) && E() && term(CLOSE); }
    bool T2() { return term(INT) && term(TIMES) && T(); }
    bool T3() { return term(INT); }
    bool T() {
        TOKEN *save = next;
        return (next = save, T1())
            || (next = save, T2())
            || (next = save, T3());
    }

75 Recursive Descent Parsing. Notes.
• To start the parser
  » Initialize next to point to the first token
  » Invoke E()
• Notice how this simulates our backtracking example from lecture
• Easy to implement by hand
• But does not always work …
  » Predictive parsing is better
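
The fragments from the preceding slides can be assembled into one compilable unit. The following is a sketch with a few additions that are assumptions, not part of the slides: an `END` sentinel token that terminates the token array, a `term` that only advances on a match (the slide version advances unconditionally and relies on `save` to undo it), and a hypothetical `parse` entry point that also checks that all input was consumed.

```c
#include <stdbool.h>

/* END is an assumed sentinel marking the end of the token array. */
typedef enum { INT, OPEN, CLOSE, PLUS, TIMES, END } TOKEN;

static const TOKEN *next;   /* points to the next unread token */

static bool E(void);
static bool T(void);

/* Matches one terminal; never advances past the END sentinel. */
static bool term(TOKEN tok)
{
    if (*next != tok)
        return false;
    if (tok != END)
        next++;
    return true;
}

static bool E1(void) { return T() && term(PLUS) && E(); }  /* E -> T + E */
static bool E2(void) { return T(); }                       /* E -> T     */

static bool E(void)
{
    const TOKEN *save = next;   /* restore point for backtracking */
    return (next = save, E1())
        || (next = save, E2());
}

static bool T1(void) { return term(OPEN) && E() && term(CLOSE); } /* T -> ( E )   */
static bool T2(void) { return term(INT) && term(TIMES) && T(); }  /* T -> int * T */
static bool T3(void) { return term(INT); }                        /* T -> int     */

static bool T(void)
{
    const TOKEN *save = next;
    return (next = save, T1())
        || (next = save, T2())
        || (next = save, T3());
}

/* Initializes next, invokes E(), and requires that all tokens were used. */
bool parse(const TOKEN *tokens)
{
    next = tokens;
    return E() && *next == END;
}
```

For the token array { INT, TIMES, INT, END } the call sequence follows the backtracking example: E1 first matches T with int * T, fails on the missing +, restores `next`, and E2 then succeeds. Once T() has returned true, its remaining alternatives are never retried, which is one reason this scheme does not work for every grammar.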
