Compilation 2007 Context-Free Languages Parsers and Scanners Michael I. Schwartzbach BRICS, University of Aarhus.

Slides:



Advertisements
Similar presentations
C O N T E X T - F R E E LANGUAGES ( use a grammar to describe a language) 1.
Advertisements

Translator Architecture Code Generator ParserTokenizer string of characters (source code) string of tokens abstract program string of integers (object.
COMP-421 Compiler Design Presented by Dr Ioanna Dionysiou.
LR-Grammars LR(0), LR(1), and LR(K).
CS5371 Theory of Computation
6/12/2015Prof. Hilfinger CS164 Lecture 111 Bottom-Up Parsing Lecture (From slides by G. Necula & R. Bodik)
Foundations of (Theoretical) Computer Science Chapter 2 Lecture Notes (Section 2.1: Context-Free Grammars) David Martin With some.
By Neng-Fa Zhou Syntax Analysis lexical analyzer syntax analyzer semantic analyzer source program tokens parse tree parser tree.
CS 310 – Fall 2006 Pacific University CS310 Pushdown Automata Sections: 2.2 page 109 October 9, 2006.
CS Master – Introduction to the Theory of Computation Jan Maluszynski - HT Lecture 4 Context-free grammars Jan Maluszynski, IDA, 2007
PZ02A - Language translation
Context-Free Grammars Lecture 7
January 14, 2015CS21 Lecture 51 CS21 Decidability and Tractability Lecture 5 January 14, 2015.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Bottom Up Parsing.
Normal forms for Context-Free Grammars
COS 320 Compilers David Walker. last time context free grammars (Appel 3.1) –terminals, non-terminals, rules –derivations & parse trees –ambiguous grammars.
Table-driven parsing Parsing performed by a finite state machine. Parsing algorithm is language-independent. FSM driven by table (s) generated automatically.
COP4020 Programming Languages
FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY
1 214 review. 2 What we have learnt Generate scanner and parser –We do not program directly –Instead we write the specifications for the scanner and parser.
Lecture 21: Languages and Grammars. Natural Language vs. Formal Language.
Lecture 16 Oct 18 Context-Free Languages (CFL) - basic definitions Examples.
Syntax and Semantics Structure of programming languages.
Chapter 9 Syntax Analysis Winter 2007 SEG2101 Chapter 9.
BİL 744 Derleyici Gerçekleştirimi (Compiler Design)1 Syntax Analyzer Syntax Analyzer creates the syntactic structure of the given source program. This.
LR(k) Parsing CPSC 388 Ellen Walker Hiram College.
Pushdown Automata (PDA) Intro
A New Look at LR(k) Bill McKeeman, MathWorks Fellow for Worcester Polytechnic Computer Science Colloquium February 20, 2004.
Context-free Grammars Example : S   Shortened notation : S  aSaS   | aSa | bSb S  bSb Which strings can be generated from S ? [Section 6.1]
A sentence (S) is composed of a noun phrase (NP) and a verb phrase (VP). A noun phrase may be composed of a determiner (D/DET) and a noun (N). A noun phrase.
Grammars CPSC 5135.
Syntax and Semantics Structure of programming languages.
Parsing Introduction Syntactic Analysis I. Parsing Introduction 2 The Role of the Parser The Syntactic Analyzer, or Parser, is the heart of the front.
CS 321 Programming Languages and Compilers Lectures 16 & 17 Introduction to Formal Languages Regular Languages Lexical Analysis.
Chapter 5: Bottom-Up Parsing (Shift-Reduce)
Prof. Necula CS 164 Lecture 8-91 Bottom-Up Parsing LR Parsing. Parser Generators. Lecture 6.
Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters.
1 Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
CS 208: Computing Theory Assoc. Prof. Dr. Brahim Hnich Faculty of Computer Sciences Izmir University of Economics.
1Computer Sciences Department. Book: INTRODUCTION TO THE THEORY OF COMPUTATION, SECOND EDITION, by: MICHAEL SIPSER Reference 3Computer Sciences Department.
Syntax Analysis - Parsing Compiler Design Lecture (01/28/98) Computer Science Rensselaer Polytechnic.
Unit-3 Parsing Theory (Syntax Analyzer) PREPARED BY: PROF. HARISH I RATHOD COMPUTER ENGINEERING DEPARTMENT GUJARAT POWER ENGINEERING & RESEARCH INSTITUTE.
Introduction Finite Automata accept all regular languages and only regular languages Even very simple languages are non regular (  = {a,b}): - {a n b.
Parsing and Code Generation Set 24. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program,
GRAMMARS & PARSING. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program, referred to as a.
Donghyun (David) Kim Department of Mathematics and Physics North Carolina Central University 1 Chapter 2 Context-Free Languages Some slides are in courtesy.
Bernd Fischer RW713: Compiler and Software Language Engineering.
FORMAL LANGUAGES, AUTOMATA, AND COMPUTABILITY
Bottom Up Parsing CS 671 January 31, CS 671 – Spring Where Are We? Finished Top-Down Parsing Starting Bottom-Up Parsing Lexical Analysis.
CMSC 330: Organization of Programming Languages Pushdown Automata Parsing.
Syntax Analysis By Noor Dhia Syntax analysis:- Syntax analysis or parsing is the most important phase of a compiler. The syntax analyzer considers.
CS 154 Formal Languages and Computability March 22 Class Meeting Department of Computer Science San Jose State University Spring 2016 Instructor: Ron Mak.
COMPILER CONSTRUCTION
2016/7/9Page 1 Lecture 11: Semester Review COMP3100 Dept. Computer Science and Technology United International College.
1 Context-Free Languages & Grammars (CFLs & CFGs) Reading: Chapter 5.
Syntax and Semantics Structure of programming languages.
Programming Languages Translator
Table-driven parsing Parsing performed by a finite state machine.
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Bottom-Up Syntax Analysis
Compilers Principles, Techniques, & Tools Taught by Jing Zhang
Lecture 4: Lexical Analysis & Chomsky Hierarchy
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Language translation Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Sections
Presentation transcript:

Compilation 2007 Context-Free Languages Parsers and Scanners Michael I. Schwartzbach BRICS, University of Aarhus

2 Context-Free Languages Context-Free Grammars Example: sentence  subject verb object subject  person person  John | Joe | Zacharias verb  asked | kicked object  thing | person thing  the football | the computer  Nonterminal symbols: sentence, subject, person, verb, object, thing  Terminal symbols: John, Joe, Zacharias, asked, kicked, the football, the computer  Start symbol: sentence  Example of derivation: sentence  subject verb object ...  John asked the computer

3 Context-Free Languages Formal Definition of CFGs A context-free grammar (CFG) is a 4-tuple G = (V, , S, P) V is a finite set of nonterminal symbols  is an alphabet of terminal symbols and V   = Ø S  V is a start symbol P is a finite set of productions of the form A   where A  V and  (V  )*

4 Context-Free Languages Derivations  “  ” denotes a single derivation step, where a nonterminal is rewritten according to a production  Thus “  ” is a relation on the set (V  )*  If ,  (V  )* then    when  =  1 A  2 where  1,  2  (V  )* and A  V the grammar contains the production A    =  1  2

5 Context-Free Languages The Language of a CFG  Define the relation “  *” as the reflexive transitive closure of “  ”, that is:   *  iff  ... ...    The language of G is defined as L(G) = { x  * | S  * x }  A language L  * is context-free iff there is a CFG G where L(G)=L 0 or more steps

6 Context-Free Languages Example 1 The language L = { a n b n | n≥0 } is described by a CFG G=(V, ,S,P) where V = {S}  = {a,b} P = {S  aSb, S   } That is, L(G) = L alternative notation: S  aSb | 

7 Context-Free Languages Example 2 The language { x  {0,1}* | x=reverse(x) } is described by a CFG G=(V, ,S,P) where V = {S}  = {0,1} P = {S  , S  0, S  1, S  0S0, S  1S1} alternative notation: S   | 0 | 1 | 0S0 | 1S1

8 Context-Free Languages Why “Context-Free”?   1 A  2   1  2 if the grammar contains the production A    Thus,  may substitute for A independently of the context (  1 and  2 )

9 Context-Free Languages Algebraic Expressions A CFG G=(V, ,S,P) for algebraic expressions:  V = { S }   = { +, -, , /, (, ), x, y, z }  productions in P: S  S + S | S - S | S  S | S / S | ( S ) | x | y | z  x*y+z-(z/y+x)

10 Context-Free Languages Derivations Three derivations of the string x  y + z :  S  S + S  S  S + S  x  S + S  x  y + S  x  y + z  S  S + S  S + z  S  S + z  S  y + z  x  y + z  S  S  S  S  S + S  x  S + S  x  y + S  x  y + z

11 Context-Free Languages A derivation tree shows the structure of a derivation, but not the detailed order: A parser finds a derivation tree for a given string. Derivation Trees S + S S S S  xy z S  S S S S + yz x  and  

12 Context-Free Languages Ambiguous CFGs  A CFG G is ambiguous if there exists a string x  L(G) with more than one derivation tree  Thus, the CFG for algebraic expressions is ambiguous

13 Context-Free Languages A Simpler Syntax for Grammars S  S + S | S - S | S  S | S / S | ( S ) | x | y | z  V is inferred from the left-hand sides   contains the remaining symbols  S is the first nonterminal symbol used  P is written explicitly

14 Context-Free Languages Rewriting Grammars  The ambiguous grammar: S  S + S | S - S | S  S | S / S | ( S ) | x | y | z may be rewritten to become unambiguous: S  S + T | S - T | T T  T  F | T / F | F F  x | y | z | ( S )  This imposes an operator precedence

15 Context-Free Languages Unambiguous Parsing  The string x  y + z now only admits a single parse tree: S + ST x y z FT * FT F

16 Context-Free Languages Chomsky Normal Form  A CFG G=(V, ,S,P) is in Chomsky Normal Form (CNF) if every production in P is of the form A  BC or A  a for A,B,C  V and a   Any CFG G can be rewritten to a CFG G’ where G’ is in Chomsky Normal Form and L(G’) = L(G) - {  }  CNF transformation preserves unambiguity

17 Context-Free Languages Parsing Any CNF Grammar  Given a grammar G and a string x of length n  Define an |V| × n × n table P of booleans where P(A,i,j) iff A  * x[i..j]  Using dynamic programming, fill in this table bottom-up in time O(n 3 )  Now, x  L(G) iff P(S,0,n-1) is true  This algorithm can easily be extended to also construct a parse tree

18 Context-Free Languages An Impractical Approach  Cubic time is much too slow in practice: the source code for Windows is 60 million lines  An industrial parser must be close to linear time

19 Context-Free Languages Shift-Reduce Parsing  Extend the grammar with an EOF symbol $  Choose between the following actions: shift: move first input token to the top of a stack reduce: replace  on top of the stack by A for some production A   accept: when S$ is reduced

20 Context-Free Languages Shift-Reduce in Action x*y+z$ shift x *y+z$ reduce F->x F *y+z$ reduce T->F T *y+z$ shift T*y +z$ reduce F->y T*F +z$ reduce T->T*F T +z$ reduce S->T S +z$ shift S+z $ reduce F->z S+F $ reduce T->F S+T $ reduce S->S+T S $ shift S$ accept

21 Context-Free Languages A shift-reduce trace is the same as a backward right- most derivation sequence: x*y+z$ shift x *y+z$ reduce F->x x*y+z F *y+z$ reduce T->F F*y+z T *y+z$ shift T*y +z$ reduce F->y T*y+z T*F +z$ reduce T->T*F T*F+z T +z$ reduce S->T T+z S +z$ shift S+z $ reduce F->z S+z S+F $ reduce T->F S+F S+T $ reduce S->S+T S+T S $ shift S$ accept S Shift-Reduce Always Works        

22 Context-Free Languages Deterministic Parsing  A string is parsed by a grammar iff it is accepted by some run of the shift-reduce parser  We must know when to shift and when to reduce  A deterministic parser uses a table to determine which action to take  Some grammars can be parsed like this

23 Context-Free Languages The LALR(1) Algorithm Enumerate the productions of the grammar: 1: S  S + T 2: S  S - T 3: S  T 4: T  T  F 5: T  T / F 6: T  F 7: F  x 8: F  y 9: F  z 10: F  ( S )

24 Context-Free Languages LALR(1) Setup  A finite set of states (numbered 0, 1, 2,...)  An input string  A stack of states, initially just containing state 0  A magical table

25 Context-Free Languages LALR(1) Table xyz+-*/()$STF 0s1s2s3s4ag5g6g7 1r7 2r8 3r9 4s1s2s3s4g8g6g7 5s10s11s9 6r3 s12s13r3 7r6 8s10s11s14 9aaaaaaaaaa 10s1s2s3s4g15g7 11s1s2s3s4g16g7 12s1s2s3s4g17 13s1s2s3s4g18 14r10 15r1 s12s13r1 16r2 s12s13r2 17r4 18r5

26 Context-Free Languages LALR(1) Actions  s k: shift and push state k  g k: push state k  r i: pop |α| states and lookup another action at place (j,A), where j is the new stack top and A→α is the i'th production  a : accept  an empty table entry indicates a parse error

27 Context-Free Languages LALR(1) Parsing  Keep executing the action at place (j,a), where j is the stack top and a is the next input symbol  Stop at either accept or error  This is completely deterministic  The time complexity is linear in the input string

28 Context-Free Languages LALR(1) Example 0 x*y+z$ s1 0,1 *y+z$ r7 0,7 *y+z$ r6 0,6 *y+z$ s12 0,6,12 y+z$ s2 0,6,12,2 +z$ r8 0,6,12,17 +z$ r4 0,6 +z$ r3 0,5 +z$ s10 0,5,10 z$ s3 0,5,10,3 $ r9 0,5,10,7 $ r6 0,5,10,15 $ r1 0,5 $ s9 0,5,9 a

29 Context-Free Languages LALR(1) Conflicts  The LALR(1) algorithm tries to construct a table  For some grammars, the table becomes perfect  For other grammars, it may contain conflicts: shift/reduce: an entry contains both a shift and a reduce action reduce/reduce: an entry contains two different reduce actions

30 Context-Free Languages Grammar Containments Context-Free Unambiguous LALR(1)

31 Context-Free Languages LALR(1) Conflicts in Action  The ambiguous grammar S  S + S | S - S | S  S | S / S | ( S ) | x | y | z generates 16 LALR(1) shift/reduce conflicts  But the unambiguous version is LALR(1)...

32 Context-Free Languages Tokens  For a Java grammar,  = Unicode  This is not a practical approach  Instead, grammars use an alphabet of tokens: keywords identifiers numerals strings constants comments symbols ( ==, <=, ++,...) whitespace...

33 Context-Free Languages Tokens Are Regular Expressions  Tokens are defined through regular expressions: keyword: class identifier: [a-z][a-z0-9]* numeral: [+]?[0-9]+ symbol: ++ whitespace: [ ]*

34 Context-Free Languages Scanning  A scanner translates a string of characters into a string of tokens  It is defined by an ordered sequence of regular expressions for the tokens: r 1, r 2,..., r k  Let t i be the longest prefix of the input string that is recognized by r i  Let k = max{ |t i | }  Let j = min{ i | |t i | = k}  The next token is then t j

35 Context-Free Languages Scanning with a DFA  Scanning (for each token definition) can be efficiently performed with a minimal DFA  Run the input string through the DFA  At each accept state, record the current prefix  When the DFA crashes, the last prefix is the candidate token

36 Context-Free Languages Scanning in Action (1/3)  The previous collection of tokens: class [a-z][a-z0-9]* [+]?[0-9]+ ++ [ ]*

37 Context-Free Languages Scanning in Action (2/3)  The automata: class a-z a-z \u0020

38 Context-Free Languages Scanning in Action (3/3)  The input string: class foo +17 c++ generates the tokens: keyword: class whitespace identifier: foo whitespace numeral: +17 whitespace identifier: c symbol: ++