CMSC 330: Organization of Programming Languages Pushdown Automata Parsing.


Chomsky Hierarchy
Categorization of various languages and grammars
Each is strictly more restrictive than the previous
First described by Noam Chomsky in 1956
– Type-0: Any formal grammar; recognized by Turing machines
– Type-1: Context-sensitive grammars; recognized by linear bounded automata
– Type-2: Context-free grammars; recognized by pushdown automata (PDAs)
– Type-3: Regular expressions; recognized by finite state automata (NFAs/DFAs)

Implementing context-free languages
Problem: enforcing balanced language constructs
Solution: add a stack

Pushdown Automaton (PDA)
A pushdown automaton (PDA) is an abstract machine similar to a DFA
– Has a finite set of states and transitions
– Also has a pushdown stack
Moves of the PDA are as follows:
– An input symbol is read and the top symbol on the stack is read
– Based on both inputs, the machine enters a new state, and either pushes zero or more symbols onto the pushdown stack or pops zero or more symbols from the stack
– The string is accepted if the input has ended AND the stack is empty

Power of PDAs
PDAs are more powerful than DFAs
– The language aⁿbⁿ, which cannot be recognized by a DFA, can easily be recognized by a PDA
– Push all a symbols onto the stack
– For each b, pop an a off the stack
– If the end of input is reached at the same time that the stack becomes empty, the string is accepted
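The push/pop strategy above can be sketched in OCaml as follows. This is a minimal illustration, not code from the slides: the function name accepts_anbn and the use of a list as the stack are assumptions made for the example.

```ocaml
(* Recognize a^n b^n (n >= 0) with an explicit stack, as a PDA would.
   The name accepts_anbn and the list-as-stack encoding are assumptions. *)
let accepts_anbn (s : string) : bool =
  let n = String.length s in
  (* Phase 1: push every leading 'a' onto the stack. *)
  let rec push_as i stack =
    if i < n && s.[i] = 'a' then push_as (i + 1) ('a' :: stack)
    else pop_bs i stack
  (* Phase 2: pop one 'a' for every 'b' that is read. *)
  and pop_bs i stack =
    if i = n then stack = []          (* input ended: accept iff stack is empty *)
    else
      match s.[i], stack with
      | 'b', _ :: rest -> pop_bs (i + 1) rest
      | _ -> false                     (* stray symbol, or more b's than a's *)
  in
  push_as 0 []
```

A DFA has no counterpart to the stack argument, which is why it cannot remember how many a symbols it has seen.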

Formal Definition
Q – finite set of states
Σ – input alphabet
Γ – stack alphabet
δ – transitions from (Q × Γ) to (Q × Γ) on (Σ ∪ {ε})
q₀ – start state (member of Q)
Z – initial stack symbol (member of Γ)
F – set of accepting states (subset of Q)

Parsing
There are many efficient techniques for turning strings into parse trees
– They all have strange names, like LL(k), SLR(k), LR(k)
– They use various forms of PDAs
We will look at one very simple technique: recursive descent parsing
– This is a "top-down" parsing algorithm because we begin at the start symbol and try to produce the string
– We won't actually formally construct any PDAs

Example
E → id = n | { L }
L → E ; L | ε
– Here n is an integer and id is an identifier
One input might be
– { x = 3; { y = 4; }; }
– This would get turned into a list of tokens: { x = 3 ; { y = 4 ; } ; }
– And we want to turn it into a parse tree

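Example (cont'd)
E → id = n | { L }
L → E ; L | ε
Parse tree for { x = 3; { y = 4; }; }, written as the production applied at each node:
E → { L }
  L → E ; L
    E → x = 3
    L → E ; L
      E → { L }
        L → E ; L
          E → y = 4
          L → ε
      L → ε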

Parsing Algorithm
Goal: determine if we can produce a string from the grammar's start symbol
At each step, we'll keep track of two facts
– What tree node are we trying to match?
– What is the next token (lookahead) of the input string?

Parsing Algorithm
There are three cases:
– If we're trying to match a terminal and the next token (lookahead) is that token, then succeed, advance the lookahead, and continue
– If we're trying to match a nonterminal, then pick which production to apply based on the lookahead
– Otherwise, fail with a parsing error

Example (cont'd)
E → id = n | { L }
L → E ; L | ε
[Figure: the parse tree for { x = 3 ; { y = 4 ; } ; } with the lookahead marked on the token stream]

Definition of First(γ)
First(γ), for any terminal or nonterminal γ, is the set of initial terminals of all strings that γ may expand to
– We'll use this to decide what production to apply

Definition of First(γ), cont'd
For a terminal a, First(a) = { a }
For a nonterminal N:
– If N → ε, then add ε to First(N)
– If N → α₁ α₂ ... αₙ (note the αᵢ are all the symbols on the right side of one single production), then add First(α₁ α₂ ... αₙ) to First(N), where First(α₁ α₂ ... αₙ) is defined as
    First(α₁) if ε ∉ First(α₁)
    otherwise (First(α₁) - { ε }) ∪ First(α₂ ... αₙ)
– If ε ∈ First(αᵢ) for all i, 1 ≤ i ≤ n, then add ε to First(N)

Examples
First grammar:
E → id = n | { L }
L → E ; L | ε
First(id) = { id }    First("=") = { "=" }    First(n) = { n }
First("{") = { "{" }    First("}") = { "}" }    First(";") = { ";" }
First(E) = { id, "{" }
First(L) = { id, "{", ε }
Second grammar (E may now also derive ε):
E → id = n | { L } | ε
L → E ; L | ε
First(id) = { id }    First("=") = { "=" }    First(n) = { n }
First("{") = { "{" }    First("}") = { "}" }    First(";") = { ";" }
First(E) = { id, "{", ε }
First(L) = { id, "{", ";", ε }
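The First sets above can be reproduced mechanically by applying the definition repeatedly until nothing changes. The sketch below is one possible way to do that; the symbol type, the string "ε" used as a marker for the empty string, and the names first_seq and first_sets are all assumptions for illustration.

```ocaml
(* Grammar symbols: terminals and nonterminals, both named by strings. *)
type symbol = T of string | N of string

module SS = Set.Make (String)

(* Marker for the empty string inside a First set (an encoding assumption). *)
let eps = "ε"

(* The first example grammar; the empty list [] stands for an ε production. *)
let grammar = [
  ("E", [ [T "id"; T "="; T "n"]; [T "{"; N "L"; T "}"] ]);
  ("L", [ [N "E"; T ";"; N "L"]; [] ]);
]

(* First of a sequence of symbols, using the current table for nonterminals. *)
let rec first_seq table = function
  | [] -> SS.singleton eps
  | T a :: _ -> SS.singleton a
  | N x :: rest ->
    let fx = try List.assoc x table with Not_found -> SS.empty in
    if SS.mem eps fx
    then SS.union (SS.remove eps fx) (first_seq table rest)
    else fx

(* Re-apply the definition until no First set grows any further. *)
let first_sets grammar =
  let step table =
    List.map (fun (x, prods) ->
        let old = List.assoc x table in
        (x, List.fold_left
              (fun acc p -> SS.union acc (first_seq table p)) old prods))
      grammar
  in
  let rec fix table =
    let table' = step table in
    if List.for_all2 (fun (_, a) (_, b) -> SS.equal a b) table table'
    then table
    else fix table'
  in
  fix (List.map (fun (x, _) -> (x, SS.empty)) grammar)
```

Calling first_sets grammar yields a table in which E maps to { id, "{" } and L maps to { id, "{", ε }. Adding an empty production to E gives the sets listed for the second grammar.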

Implementing a Recursive Descent Parser
For each terminal symbol a, create a function parse_a, which:
– If the lookahead is a, consumes the lookahead by advancing to the next token, and returns
– Otherwise fails with a parse error
For each nonterminal N, create a function parse_N
– This function is called when we're trying to parse a part of the input which corresponds to (or can be derived from) N
– parse_S for the start symbol S begins the process

Implementing a Recursive Descent Parser, cont'd
The body of parse_N for a nonterminal N does the following:
– Let N → β₁ | ... | βₖ be the productions of N
    Here βᵢ is the entire right side of a production: a sequence of terminals and nonterminals
– Pick the production N → βᵢ such that the lookahead is in First(βᵢ)
    It must be that First(βᵢ) ∩ First(βⱼ) = ∅ for i ≠ j
    If there is no such production, but N → ε, then return
    Otherwise, fail with a parse error
– Suppose βᵢ = α₁ α₂ ... αₙ. Then call parse_α₁(); ... ; parse_αₙ() to match the expected right-hand side, and return

Example
E → id = n | { L }
L → E ; L | ε
(Pseudocode: <next token> and <parse error> are placeholders for advancing the input and signaling failure.)

    let parse_term t =
      if lookahead = t then lookahead := <next token>
      else raise <parse error>

    let rec parse_E () =
      if lookahead = 'id' then begin
        parse_term 'id'; parse_term '='; parse_term 'n'
      end
      else if lookahead = '{' then begin
        parse_term '{'; parse_L (); parse_term '}'
      end
      else raise <parse error>

Example (cont'd)
E → id = n | { L }
L → E ; L | ε
(parse_L is mutually recursive with the previous let rec.)

    and parse_L () =
      if lookahead = 'id' || lookahead = '{' then begin
        parse_E (); parse_term ';'; parse_L ()
      end
      (* else return (not an error) *)
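The slide code is pseudocode, so here is one way it might be made runnable. The token type, the Parse_error exception, and the helper names lookahead and advance are assumptions introduced for this sketch; only the structure of parse_term, parse_E and parse_L follows the slides.

```ocaml
(* Tokens for the grammar  E -> id = n | { L }   L -> E ; L | ε.
   The token type and exception name are assumptions, not from the slides. *)
type token = Id of string | Num of int | Eq | Semi | LBrace | RBrace | EOF

exception Parse_error of string

(* Returns unit if the whole token list is an E; raises Parse_error otherwise. *)
let parse (input : token list) : unit =
  let toks = ref input in
  let lookahead () = match !toks with [] -> EOF | t :: _ -> t in
  let advance () = match !toks with [] -> () | _ :: rest -> toks := rest in
  (* Match one expected terminal and consume it. *)
  let parse_term t =
    if lookahead () = t then advance ()
    else raise (Parse_error "unexpected token")
  in
  let rec parse_E () =
    match lookahead () with
    | Id _ ->                                   (* E -> id = n *)
      advance (); parse_term Eq;
      (match lookahead () with
       | Num _ -> advance ()
       | _ -> raise (Parse_error "expected a number"))
    | LBrace ->                                 (* E -> { L } *)
      parse_term LBrace; parse_L (); parse_term RBrace
    | _ -> raise (Parse_error "expected id or {")
  and parse_L () =
    match lookahead () with
    | Id _ | LBrace ->                          (* lookahead is in First(E ; L) *)
      parse_E (); parse_term Semi; parse_L ()
    | _ -> ()                                   (* L -> ε: consume nothing *)
  in
  parse_E ();
  parse_term EOF                                (* reject trailing input *)
```

For example, the token list for { x = 3; { y = 4; }; } is accepted, while [Id "x"; Eq] raises Parse_error because no number follows the equals sign.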

Things to Notice
If you draw the execution trace of the parser as a tree, then you get the parse tree
This is a predictive parser because we use the lookahead to determine exactly which production to use

Limitations: Overlapping First Sets
This parsing strategy may fail on certain grammars because the First sets overlap
– This doesn't mean the grammar is not usable in a parser, just not in this type of parser
Consider parsing the grammar E → n + E | n
– First(n + E) = { n } = First(n), so one token of lookahead cannot choose between the two productions and we can't use this technique
Exercise: Rewrite this grammar so it becomes amenable to our parsing technique

Limitations: Left Recursion
How about the grammar S → Sa | ε
– First(Sa) = { a }, so we're OK as far as choosing a production
– But the body of parse_S() has an infinite loop: if (lookahead = "a") then parse_S() calls itself without consuming any input
– This technique cannot handle left recursion
– Exercise: rewrite this grammar to be right-recursive
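One possible answer to the exercise is S → aS | ε, which generates the same strings (zero or more a symbols) but consumes a token before recursing. The sketch below assumes input given as a list of characters; the names parse_S and accepts are illustrative.

```ocaml
(* S -> a S | ε : a right-recursive grammar for the language of S -> S a | ε.
   Returns the unconsumed input; the char-list encoding is an assumption. *)
let rec parse_S (input : char list) : char list =
  match input with
  | 'a' :: rest -> parse_S rest   (* S -> a S: consume one 'a', then recurse *)
  | _ -> input                    (* S -> ε: consume nothing *)

(* A string is accepted when parse_S consumes all of it. *)
let accepts (s : string) : bool =
  parse_S (List.of_seq (String.to_seq s)) = []
```

Every recursive call now works on a strictly shorter input, so the function always terminates.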

Expr Grammar for Top-Down Parsing
E → T E'
E' → ε | + E
T → P T'
T' → ε | * T
P → n | ( E )
– Notice we can always decide what production to choose with only one symbol of lookahead

Interesting Question
Recursive descent parsers are a form of pushdown automata
But where's the stack?

What's Wrong with Parse Trees?
We don't actually use parse trees to do translation
Parse trees contain too much information
– E.g., they have parentheses and they have extra nonterminals for precedence
– This extra stuff is needed for parsing
But when we want to reason about languages, it gets in the way (it's too much detail)

Abstract Syntax Trees (ASTs)
An abstract syntax tree is a more compact, abstract representation of a parse tree, with only the essential parts
[Figure: a parse tree alongside the corresponding AST]

ASTs (cont'd)
Intuitively, ASTs correspond to the data structure you'd use to represent strings in the language
– Note that grammars describe trees (so do OCaml datatypes, which we'll see later)
– E → a | b | c | E+E | E-E | E*E | (E)
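As a rough illustration of how the grammar above maps to an OCaml datatype, the sketch below uses one constructor per production. The constructor names and the to_string helper are assumptions; note that (E) needs no constructor because the tree shape already records grouping.

```ocaml
(* One constructor per production of E -> a | b | c | E+E | E-E | E*E.
   The production (E) has no constructor: nesting in the tree replaces it. *)
type expr =
  | A
  | B
  | C
  | Add of expr * expr
  | Sub of expr * expr
  | Mul of expr * expr

(* Print a tree back as a fully parenthesized string. *)
let rec to_string (e : expr) : string =
  match e with
  | A -> "a"
  | B -> "b"
  | C -> "c"
  | Add (l, r) -> "(" ^ to_string l ^ "+" ^ to_string r ^ ")"
  | Sub (l, r) -> "(" ^ to_string l ^ "-" ^ to_string r ^ ")"
  | Mul (l, r) -> "(" ^ to_string l ^ "*" ^ to_string r ^ ")"
```

The string (a+b)*c corresponds to the value Mul (Add (A, B), C).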

Producing an AST
To produce an AST, we modify the parse() functions to construct the AST along the way
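A minimal sketch of this idea for the expression grammar E → T E', E' → ε | + E, T → P T', T' → ε | * T, P → n | ( E ). It is written in a functional style where each function returns the AST together with the remaining tokens, rather than updating a mutable lookahead as in the slides. The token and ast types and the Parse_error exception are assumptions, and E' and T' are folded into parse_E and parse_T.

```ocaml
type token = Num of int | Plus | Star | LParen | RParen
type ast = Int of int | Add of ast * ast | Mul of ast * ast

exception Parse_error of string

(* Each function takes the remaining tokens and returns (AST, rest). *)
let rec parse_E toks =
  let t, rest = parse_T toks in
  match rest with
  | Plus :: rest' ->                         (* E' -> + E *)
    let e, rest'' = parse_E rest' in
    (Add (t, e), rest'')
  | _ -> (t, rest)                           (* E' -> ε *)
and parse_T toks =
  let p, rest = parse_P toks in
  match rest with
  | Star :: rest' ->                         (* T' -> * T *)
    let t, rest'' = parse_T rest' in
    (Mul (p, t), rest'')
  | _ -> (p, rest)                           (* T' -> ε *)
and parse_P toks =
  match toks with
  | Num n :: rest -> (Int n, rest)           (* P -> n *)
  | LParen :: rest ->                        (* P -> ( E ) *)
    (match parse_E rest with
     | e, RParen :: rest' -> (e, rest')      (* parentheses leave no AST node *)
     | _ -> raise (Parse_error "expected )"))
  | _ -> raise (Parse_error "expected a number or (")

(* Parse a complete token list, rejecting leftover input. *)
let parse toks =
  match parse_E toks with
  | e, [] -> e
  | _ -> raise (Parse_error "trailing input")
```

The tokens for 1 + 2 * 3 produce Add (Int 1, Mul (Int 2, Int 3)), and the tokens for (1 + 2) * 3 produce Mul (Add (Int 1, Int 2), Int 3). Neither result contains parentheses or the helper nonterminals E', T' and P.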

General parsing algorithms
LL parsing
– Scans input Left-to-right
– Builds Leftmost derivations
– Sometimes called "top-down parsing"
– Implemented with tables or a recursive descent algorithm
LR parsing
– Scans input Left-to-right
– Builds Rightmost derivations (in reverse)
– Sometimes called "bottom-up parsing"
– Usually implemented with a shift-reduce algorithm
LL(k) means LL with k-symbol lookahead

General parsing algorithms
Recursive descent parsers are easy to write
– They're unable to handle certain kinds of grammars
More powerful techniques generally require tool support, such as yacc and bison
LR(k), SLR(k) [Simple LR(k)], and LALR(k) [Lookahead LR(k)] are all techniques used today to build efficient parsers
– Recursive descent is a form of LL(k) parsing
– You'll study more about parsing in CMSC 430

Context-free Grammars in Practice
Regular expressions and finite automata are used to turn raw text into a string of tokens
– E.g., "if", "then", "identifier", etc.
– Whitespace and comments are simply skipped
– These tokens are the input for the next phase of compilation
– This process is called lexing
– Lexer generators include lex and flex
Grammars and pushdown automata are used to turn tokens into parse trees and/or ASTs
– This process is called parsing
– Parser generators include yacc and bison
The compiler produces object code from ASTs
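To make the lexing phase concrete, here is a small hand-written lexer for the token language of the earlier { x = 3; } example. It is only a sketch of what a generated lexer does: the token type and the names lex, span, is_digit and is_alpha are assumptions, and comments are not handled.

```ocaml
type token = Id of string | Num of int | Eq | Semi | LBrace | RBrace

let is_digit c = c >= '0' && c <= '9'
let is_alpha c = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')

(* Turn raw text into a list of tokens, skipping whitespace. *)
let lex (s : string) : token list =
  let n = String.length s in
  (* Index just past the longest run of characters satisfying p, from i. *)
  let rec span p i = if i < n && p s.[i] then span p (i + 1) else i in
  let rec go i acc =
    if i >= n then List.rev acc
    else
      match s.[i] with
      | ' ' | '\t' | '\n' -> go (i + 1) acc        (* whitespace is skipped *)
      | '{' -> go (i + 1) (LBrace :: acc)
      | '}' -> go (i + 1) (RBrace :: acc)
      | '=' -> go (i + 1) (Eq :: acc)
      | ';' -> go (i + 1) (Semi :: acc)
      | c when is_digit c ->                       (* integer literal *)
        let j = span is_digit i in
        go j (Num (int_of_string (String.sub s i (j - i))) :: acc)
      | c when is_alpha c ->                       (* identifier *)
        let j = span is_alpha i in
        go j (Id (String.sub s i (j - i)) :: acc)
      | c -> failwith (Printf.sprintf "unexpected character %c" c)
  in
  go 0 []
```

The resulting token list is what a parser would consume in the next phase.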

The Compilation Process
[Figure: diagram of the compilation process]