Chapter 4 Lexical and Syntax Analysis.



Chapter 4 Outline
- Lexical Analysis
- The Parsing Problem
  - Introduction to Parsing
  - Top-Down Parsers
  - Bottom-Up Parsers
  - The Complexity of Parsing
- Recursive-Descent Parsing
  - The Recursive-Descent Parsing Process
  - The LL Grammar Class
- Bottom-Up Parsing
  - The Parsing Problem for Bottom-Up Parsers
  - Shift-Reduce Algorithms
  - LR Parsers

Introduction
- Language implementation systems analyze source code.
- Syntax analysis is based on a formal description of the syntax of the source language (BNF).
- The syntax analysis portion of a language processor consists of two parts:
  1. A low-level part called a lexical analyzer
  2. A high-level part called a syntax analyzer, or parser

Introduction
- Reasons to use BNF to describe syntax:
  1. It provides a clear and concise syntax description
  2. The parser can be based directly on the BNF
  3. Parsers based on BNF are easy to maintain
- Reasons to separate lexical and syntax analysis:
  1. Simplicity
  2. Efficiency
  3. Portability

Lexical Analysis
- A lexical analyzer is a pattern matcher for character strings.
- A lexical analyzer is a "front-end" for the parser.
- It identifies substrings of the source program that belong together, called lexemes.
- Lexemes match a character pattern, which is associated with a lexical category called a token.
- Example: sum is a lexeme; its token may be IDENT.

Lexical Analysis
- A lexical analyzer is a function that returns the next token in an input string (the program).
- Three approaches to building a lexical analyzer:
  1. Use a software tool to generate a table-driven lexical analyzer
  2. Write a program implementing a state diagram
  3. Hand-construct a table-driven implementation of a state diagram

State Diagram Design
- A naïve state diagram would have a transition from every state on every character in the source language.
- Such a diagram would be very large!
- Transitions can be combined to simplify the state diagram.

Example State Diagram Design
- When recognizing an identifier, all uppercase and lowercase letters are equivalent, so use a character class that includes all letters.
- When recognizing an integer literal, all digits are equivalent, so use a digit class.
- Reserved words and identifiers can be recognized together: a table lookup then determines identifier vs. reserved word.
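The two simplifications above (character classes and a reserved-word table lookup) can be sketched in C as follows. This is only an illustration: the function name classOf, the token code values, and the contents of the reserved-word table are assumptions made for the sketch, not part of the slides.

```c
#include <ctype.h>
#include <string.h>

/* Character classes; the names follow the slides, the values are arbitrary. */
enum { LETTER, DIGIT, UNKNOWN };

/* Token codes; hypothetical values chosen for this sketch. */
enum { IDENT = 11, FOR_CODE = 30, IF_CODE = 31, ELSE_CODE = 32, WHILE_CODE = 33 };

/* Collapse every letter into one class and every digit into another, so the
   state diagram needs one transition per class instead of one per character. */
int classOf(int ch) {
    if (isalpha((unsigned char)ch)) return LETTER;
    if (isdigit((unsigned char)ch)) return DIGIT;
    return UNKNOWN;
}

/* Reserved-word table (hypothetical contents). */
static const struct { const char *word; int code; } reserved[] = {
    { "for", FOR_CODE }, { "if", IF_CODE },
    { "else", ELSE_CODE }, { "while", WHILE_CODE },
};

/* Return the reserved-word code for lexeme, or IDENT if it is not reserved. */
int lookup(const char *lexeme) {
    size_t n = sizeof reserved / sizeof reserved[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].code;
    return IDENT;
}
```

With this table, lookup("while") yields WHILE_CODE and lookup("sum") yields IDENT, so the state diagram itself never needs separate paths for reserved words.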

Example: Convenient Utility Subprograms
- getChar
  - gets the next character of input and puts it in nextChar
  - determines its class and puts the class in charClass
- addChar
  - moves the character from nextChar into the variable where the lexeme is being accumulated: lexeme
- lookup
  - determines whether the string in lexeme is a reserved word (returns a code)


Example Implementation (assume initialization, so charClass already describes the first character)
    int lex() {
        switch (charClass) {
        case LETTER:                 /* identifier or reserved word */
            addChar();               /* nextChar -> lexeme */
            getChar();               /* input -> nextChar */
            while (charClass == LETTER || charClass == DIGIT) {
                addChar();           /* nextChar -> lexeme */
                getChar();           /* input -> nextChar */
            }
            return lookup(lexeme);
        case DIGIT:                  /* integer literal */
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            return INT_LIT;
        } /* End of switch; other character classes are not handled on this slide */
    } /* End of function lex */
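For readers who want to run the routine, here is one way the pieces might fit together. It is a sketch under several assumptions that the slides do not state: input comes from a C string rather than a file, lexInit is a hypothetical name for the initialization step, whitespace is skipped between lexemes, any other single character becomes a token called OTHER, and the reserved-word lookup is left out so that every word is reported as IDENT.

```c
#include <ctype.h>
#include <stddef.h>

enum { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };                 /* character classes */
enum { END_TOKEN = -1, INT_LIT = 10, IDENT = 11, OTHER = 20 }; /* token codes (assumed) */

const char *input = "";   /* remaining source text */
int    charClass;         /* class of nextChar */
char   nextChar;          /* most recently read character */
char   lexeme[100];       /* lexeme being accumulated */
size_t lexLen;

/* getChar: read the next character into nextChar and record its class. */
void getChar(void) {
    nextChar = *input;
    if (nextChar == '\0') { charClass = END_OF_INPUT; return; }
    input++;
    if (isalpha((unsigned char)nextChar))      charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else                                       charClass = UNKNOWN;
}

/* addChar: append nextChar to lexeme; overlong lexemes are truncated. */
void addChar(void) {
    if (lexLen + 1 < sizeof lexeme) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lexInit: the initialization step; points the lexer at new text and
   loads the first character. */
void lexInit(const char *text) { input = text; getChar(); }

/* lex: return the token code of the next lexeme and leave its text in lexeme. */
int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass != END_OF_INPUT && isspace((unsigned char)nextChar))
        getChar();                        /* skip blanks between lexemes */
    switch (charClass) {
    case LETTER:
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) { addChar(); getChar(); }
        return IDENT;                     /* a reserved-word lookup would go here */
    case DIGIT:
        addChar(); getChar();
        while (charClass == DIGIT) { addChar(); getChar(); }
        return INT_LIT;
    case UNKNOWN:
        addChar(); getChar();
        return OTHER;                     /* operators, parentheses, and so on */
    default:
        return END_TOKEN;
    }
}
```

After lexInit("sum1 42 +"), successive calls to lex return IDENT (with lexeme "sum1"), INT_LIT (with lexeme "42"), OTHER, and then END_TOKEN.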

The Parsing Problem
- Goals: find all syntax errors, produce a diagnostic message for each, and recover quickly
- End result: a trace of the parse tree for the program
- Two categories of parsers:
  - Top down: builds the parse tree starting at the root
  - Bottom up: builds the parse tree starting at the leaves
- Parsers look only one token ahead in the input

Top-Down Parsers
- Given a sentential form xAα, where x is a string of terminals, A is a nonterminal, and α is a string of terminals and nonterminals, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A.
- The most common top-down parsing algorithms:
  1. Recursive descent: a coded implementation
  2. LL parsers: a table-driven implementation

Bottom-Up Parsers
- Given a right sentential form α, determine which substring of α is the right-hand side of the grammar rule that must be reduced (a production applied in reverse) to produce the previous sentential form in the rightmost derivation.
- The most common bottom-up parsing algorithms are in the LR family:
  - SLR
  - LALR (parses more grammars than SLR)
  - Canonical LR (parses more grammars than LALR)

The Complexity of Parsing
- Parsers that work for any unambiguous grammar are complex and inefficient: O(n^3), where n is the length of the input.
- Compilers use parsers that work for only a subset of all unambiguous grammars, but do it in linear time: O(n), where n is the length of the input.

Recursive-Descent Parsing
- Use a subprogram for each nonterminal in the grammar to parse sentences containing that nonterminal.
- Since a grammar expressed in EBNF minimizes the number of nonterminals, EBNF is a good basis for recursive-descent parsing.
- Assume a lexical analyzer named lex puts the token code of the next terminal of the input stream into nextToken.
- By convention, every parsing routine leaves the next unparsed input token in nextToken when it returns, having called lex to load it.

Recursive-Descent Parsing
- A grammar for simple expressions:
    <expr> -> <term> {(+ | -) <term>}
    <term> -> <factor> {(* | /) <factor>}
    <factor> -> id | ( <expr> )

Recursive-Descent Parsing
- The coding process when there is only one RHS for a particular nonterminal:
  - For each terminal symbol in the RHS, compare it to nextToken:
    - match? continue
    - no match? error
  - For each nonterminal symbol in the RHS, call its associated parsing subprogram

Recursive-Descent Parsing: function for expression
    /* Function expr
       Parses strings in the language generated by the rule:
       <expr> -> <term> {(+ | -) <term>} */
    void expr() {
        term();                      /* parse the first term */
        /* As long as the next token is + or -, call lex to get the
           next token and parse the next term */
        while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
            lex();
            term();
        }
    } /* End of function expr */
- This particular routine does not detect errors.

Recursive-Descent Parsing
- The coding process when there is more than one RHS for a particular nonterminal:
  - Call a process that determines which RHS to parse.
  - The correct RHS is chosen by comparing nextToken (the lookahead) with the first token generated by each RHS until a match is found.
  - If no match is found, it is a syntax error.
  - Once a particular RHS is chosen, proceed as in the single-RHS case.

Recursive-Descent Parsing: function for factor
    /* Function factor
       Parses strings in the language generated by the rule:
       <factor> -> id | ( <expr> ) */
    void factor() {
        /* Determine which RHS */
        if (nextToken == ID_CODE)                  /* RHS is id */
            lex();                                 /* just call lex to load nextToken */
        else if (nextToken == LEFT_PAREN_CODE) {   /* RHS is ( <expr> ) */
            lex();                                 /* pass over the left paren */
            expr();                                /* then call expr */
            if (nextToken == RIGHT_PAREN_CODE)     /* check for the right paren */
                lex();                             /* call lex to load nextToken */
            else
                error();
        }
        else
            error();                               /* neither RHS matches */
    } /* End of function factor */
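The routines for expr and factor can be combined with a term routine and a small lexer into a complete recognizer for the expression grammar. The sketch below makes several assumptions that go beyond the slides: the token code values, the string-based lex, an error routine that simply counts errors, and the wrapper function parses are all hypothetical choices for illustration.

```c
#include <ctype.h>

/* Token codes; the names follow the slides, the values are arbitrary. */
enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

const char *src = "";   /* remaining input text */
int nextToken;          /* lookahead token */
int errorCount;         /* number of syntax errors seen */

/* lex: load the next token code into nextToken (assumed string-based lexer). */
void lex(void) {
    while (isspace((unsigned char)*src)) src++;
    char c = *src;
    if (c == '\0') { nextToken = END_CODE; return; }
    if (isalpha((unsigned char)c)) {              /* identifier */
        while (isalnum((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    src++;
    switch (c) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    default:  nextToken = BAD_CODE; break;
    }
}

static void error(void) { errorCount++; }   /* placeholder error handler */

void expr(void);

/* <factor> -> id | ( <expr> ) */
void factor(void) {
    if (nextToken == ID_CODE)
        lex();
    else if (nextToken == LEFT_PAREN_CODE) {
        lex();
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();
    }
    else
        error();
}

/* <term> -> <factor> {(* | /) <factor>} */
void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* parses: return 1 if text is a complete <expr>, 0 otherwise. */
int parses(const char *text) {
    src = text;
    errorCount = 0;
    lex();                 /* load the first token before parsing */
    expr();
    return errorCount == 0 && nextToken == END_CODE;
}
```

Under these assumptions, parses("(a + b) * c") returns 1, while parses("a + * b") and parses("(a + b") return 0. Each routine reads only nextToken to decide what to do, which is the one-token lookahead the slides describe.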

The LL Grammar Class
- LL: Left-to-right scan, Leftmost derivation
- Two common characteristics of grammars disallow top-down parsing:
  1. Left recursion, either direct or indirect (the Left Recursion Problem). A grammar can be modified to remove left recursion.
  2. Lack of pairwise disjointness
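The slide says a grammar can be modified to remove left recursion but does not show how. As a worked illustration, the standard textbook transformation for direct left recursion is given below; the primed nonterminals are new names introduced by the transformation, and β is assumed not to begin with A.

```latex
% General form: a directly left-recursive rule and its replacement
\[
A \rightarrow A\,\alpha \mid \beta
\quad\Longrightarrow\quad
A \rightarrow \beta\,A', \qquad
A' \rightarrow \alpha\,A' \mid \varepsilon
\]

% Applied to E -> E + T | T  (here alpha = "+ T" and beta = "T")
\[
E \rightarrow E + T \mid T
\quad\Longrightarrow\quad
E \rightarrow T\,E', \qquad
E' \rightarrow +\,T\,E' \mid \varepsilon
\]
```

Both grammars generate T followed by zero or more occurrences of + T, but a recursive-descent routine for the original E would call itself before consuming any token and never terminate. The EBNF rule <expr> -> <term> {(+ | -) <term>} used for recursive descent expresses the same repetition with braces instead of a new nonterminal.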

Pairwise Disjointness
- The ability to choose a RHS using one token of lookahead
- Definition: FIRST(α) = {a | α =>* aβ} (if α =>* ε, then ε ∈ FIRST(α))
- Pairwise Disjointness Test: for each nonterminal A in the grammar that has more than one RHS, and for each pair of rules A -> αi and A -> αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅
- Examples:
    A -> a | bB | cAb    GOOD!
    A -> a | aB          BAD!
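The pairwise disjointness test is easy to mechanize once the FIRST sets are known. The sketch below assumes the FIRST sets have already been computed by hand and are supplied as bitmasks; the type TermSet, the TERM_ constants, and the function pairwiseDisjoint are hypothetical names for this illustration, and computing FIRST itself is outside its scope.

```c
#include <stddef.h>

/* A set of terminals as a bitmask: bit k set means terminal k is in the set. */
typedef unsigned int TermSet;

/* One bit per terminal of the example grammars (assumed encoding). */
enum { TERM_a = 1u << 0, TERM_b = 1u << 1, TERM_c = 1u << 2 };

/* Given the FIRST set of each RHS of one nonterminal, return 1 if every
   pair of sets is disjoint and 0 if some pair shares a terminal. */
int pairwiseDisjoint(const TermSet *first, size_t count) {
    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (first[i] & first[j])     /* non-empty intersection */
                return 0;
    return 1;
}
```

For A -> a | bB | cAb the FIRST sets are {a}, {b}, and {c}, so the test passes; for A -> a | aB they are {a} and {a}, so it fails, matching the GOOD and BAD verdicts above.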

Pairwise Disjointness
- Left factoring can resolve the problem
- Replace
    <variable> -> ident | ident [<expr>]
  with
    <variable> -> ident <new>
    <new> -> ε | [<expr>]
  or
    <variable> -> ident [[<expr>]]
  (the outer brackets are metasymbols in EBNF)

Bottom-Up Parsing
- The parsing problem: dealing only with right-sentential forms, find the correct RHS to reduce to get the previous step of the rightmost derivation
- reduce = replace a RHS with the LHS of its production rule

LR Parsers
- LR: Left-to-right scan, Rightmost derivation
- Advantages of LR parsers:
  - They work for most grammars that describe programming languages.
  - They work on a larger class of grammars than other bottom-up algorithms, but are just as efficient as any other bottom-up parser.
  - They can detect syntax errors as soon as possible.
  - The LR class of grammars is a superset of the class parsable by LL parsers.


Bottom-Up Parsing Definitions
- β is a phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 =>+ α1βα2 (phrases correspond to the internal nodes of the parse tree)
- β is a simple phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 => α1βα2 (a simple phrase corresponds to a RHS of the grammar)
- β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw

Shift-Reduce Algorithms
- The handle of a right sentential form is its leftmost simple phrase.
- Given a parse tree, it is now easy to find the handle.
- Bottom-up parsing can be thought of as handle pruning.

Figure 4.2 A parse tree for E + T * id
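As a worked illustration of the definitions, consider the sentential form E + T * id from the figure, assuming the arithmetic expression grammar that the chapter uses for its shift-reduce example (E -> E + T | T, T -> T * F | F, F -> ( E ) | id). Its rightmost derivation is:

```latex
\[
E \;\Rightarrow_{rm}\; E + T
  \;\Rightarrow_{rm}\; E + T * F
  \;\Rightarrow_{rm}\; E + T * \mathit{id}
\]
```

The parse tree has three internal nodes, so the sentential form has three phrases: E + T * id (generated from the root E), T * id (generated from T), and id (generated from F). Only id is produced in a single derivation step, so it is the only simple phrase and therefore the handle. Reducing it with F -> id gives the previous right sentential form, E + T * F.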

Shift-Reduce Algorithms
1. Shift: moving the next token to the top of the parse stack
2. Reduce: replacing the handle on the top of the parse stack with its corresponding LHS

Figure 4.3 The structure of an LR parser

LR Parsers
- LR parsers must be constructed with a tool.
- Knuth's insight: a bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions.
- Only a finite and relatively small number of different parse situations can occur, so the history can be stored in a parser state on the parse stack.

Bottom-up Parsing
- An LR configuration stores the state of an LR parser in a pair of data structures, a stack and a queue.
- The stack contains pairs, each a grammar symbol (a terminal or a nonterminal) paired with a state: ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm))
- The queue contains the remaining input string of terminals: (ai ai+1 … an $)
- Initial configuration: ((-,S0)) and (a1 … an $)

LR Configuration
- LR parsers are table driven; the table has two components, an ACTION table and a GOTO table.
- The ACTION table specifies the action of the parser, given the parser state and the next token.
  - Rows = state names
  - Columns = terminals
- The GOTO table specifies which state to put on top of the parse stack after a reduction action is done.
  - Rows = state names
  - Columns = nonterminals

Bottom-up Parsing Parser Actions
- Assume these conditions during the parse:
  - the stack is ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)), where each Xi is a single terminal or nonterminal and each Si is a state name
  - the queue is (ai ai+1 … an $), where all elements are unparsed terminals
- Then fetch the top of the stack (Sm) and the front of the queue (ai) in order to determine ACTION[Sm,ai].

Bottom-up Parsing Parser Actions
    switch (ACTION[Sm,ai]) {
    case Shift(S): the next configuration is:
        stack: ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)(ai,S))
        queue: (ai+1 ai+2 … an $)
    case Reduce(A -> β): the next configuration is:
        stack: ((-,S0)(X1,S1)(X2,S2)…(Xm-r,Sm-r)(A,S))
        queue: (ai ai+1 … an $) (unchanged)
        where r = length(β) and S = GOTO[Sm-r,A]
    case Accept: parse complete; no errors found
    default: Error, call an error-handling routine
    }

Shift-Reduce Parsing Example
Traditional Grammar for Arithmetic Expressions
    1. E -> E + T
    2. E -> T
    3. T -> T * F
    4. T -> F
    5. F -> ( E )
    6. F -> id
Consider the expression: id + id * id $
    STACK                 INPUT             ACTION      GOTO
    0                     id + id * id $    Shift 5
    0 id5                 + id * id $       Reduce 6    [0,F]
    0 F3                  + id * id $       Reduce 4    [0,T]
    0 T2                  + id * id $       Reduce 2    [0,E]
    0 E1                  + id * id $       Shift 6
    0 E1 +6               id * id $         Shift 5
    0 E1 +6 id5           * id $            Reduce 6    [6,F]
    0 E1 +6 F3            * id $            Reduce 4    [6,T]
    0 E1 +6 T9            * id $            Shift 7
    0 E1 +6 T9 *7         id $              Shift 5
    0 E1 +6 T9 *7 id5     $                 Reduce 6    [7,F]
    0 E1 +6 T9 *7 F10     $                 Reduce 3    [6,T]
    0 E1 +6 T9            $                 Reduce 1    [0,E]
    0 E1                  $                 Accept

Figure 4.5 The LR parsing table for an arithmetic expression grammar
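The table-driven algorithm can be sketched in a few dozen lines of C. Because the figure itself is not reproduced in this transcript, the ACTION and GOTO entries below are an assumption: they are the commonly published SLR table for this six-rule grammar, and their state numbers agree with the trace in the shift-reduce example (shift 5 on id, GOTO[0,F] = 3, GOTO[6,T] = 9, and so on). The encoding (positive = shift, negative = reduce, 0 = error), the names lr_parse, reductions, and reductionCount, and the decision to keep only states on the stack are choices made for this sketch.

```c
#include <stddef.h>

enum { T_ID, T_PLUS, T_STAR, T_LPAREN, T_RPAREN, T_END, NUM_TERMINALS };
enum { N_E, N_T, N_F, NUM_NONTERMINALS };
enum { ACC = 99, STACK_MAX = 256, TRACE_MAX = 64 };

/* ACTION[state][terminal]: n > 0 shift to state n, n < 0 reduce by rule -n,
   ACC accept, 0 error.  Entries assumed from the standard SLR table. */
static const int ACTION[12][NUM_TERMINALS] = {
    /*         id    +    *    (    )    $  */
    /*  0 */ {  5,   0,   0,   4,   0,   0 },
    /*  1 */ {  0,   6,   0,   0,   0, ACC },
    /*  2 */ {  0,  -2,   7,   0,  -2,  -2 },
    /*  3 */ {  0,  -4,  -4,   0,  -4,  -4 },
    /*  4 */ {  5,   0,   0,   4,   0,   0 },
    /*  5 */ {  0,  -6,  -6,   0,  -6,  -6 },
    /*  6 */ {  5,   0,   0,   4,   0,   0 },
    /*  7 */ {  5,   0,   0,   4,   0,   0 },
    /*  8 */ {  0,   6,   0,   0,  11,   0 },
    /*  9 */ {  0,  -1,   7,   0,  -1,  -1 },
    /* 10 */ {  0,  -3,  -3,   0,  -3,  -3 },
    /* 11 */ {  0,  -5,  -5,   0,  -5,  -5 },
};

/* GOTO[state][nonterminal]; 0 marks entries that a valid parse never uses. */
static const int GOTO[12][NUM_NONTERMINALS] = {
    { 1, 2, 3 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 },
    { 8, 2, 3 }, { 0, 0, 0 }, { 0, 9, 3 }, { 0, 0, 10 },
    { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 }, { 0, 0, 0 },
};

/* Rules 1..6 of the grammar: LHS nonterminal and RHS length (index 0 unused). */
static const int RULE_LHS[7] = { 0, N_E, N_E, N_T, N_T, N_F, N_F };
static const int RULE_LEN[7] = { 0, 3, 1, 3, 1, 3, 1 };

int reductions[TRACE_MAX];   /* rule numbers applied by the last parse */
int reductionCount;

/* Parse a token array terminated by T_END; return 1 on accept, 0 on error.
   Only states are stacked, since each state determines its grammar symbol. */
int lr_parse(const int *tokens) {
    int stack[STACK_MAX];
    int top = 0;
    size_t i = 0;
    stack[0] = 0;
    reductionCount = 0;
    for (;;) {
        int act = ACTION[stack[top]][tokens[i]];
        if (act == ACC)
            return 1;
        if (act > 0) {                          /* Shift(S) */
            if (top + 1 >= STACK_MAX) return 0;
            stack[++top] = act;
            i++;
        } else if (act < 0) {                   /* Reduce(A -> beta) */
            int rule = -act;
            top -= RULE_LEN[rule];              /* pop r = length(beta) states */
            stack[top + 1] = GOTO[stack[top]][RULE_LHS[rule]];
            top++;
            if (reductionCount < TRACE_MAX)
                reductions[reductionCount++] = rule;
        } else {
            return 0;                           /* error entry */
        }
    }
}
```

Called on the token sequence for id + id * id $, lr_parse accepts and records the reductions 6, 4, 2, 6, 4, 6, 3, 1, the same order as the ACTION column of the worked trace. On id + $ it stops with an error in state 6, where the table has no entry for $.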

Summary
- Syntax analysis is a common part of language implementation.
- A lexical analyzer is a pattern matcher that isolates small-scale parts of a program.
- A parser detects syntax errors and produces a parse tree.
- A recursive-descent parser is an LL parser, naturally written from an EBNF grammar.
- The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced (the handle).
- The LR family of shift-reduce parsers is the most common bottom-up parsing approach.