Chapter 4 Lexical and Syntax Analysis.

Chapter 4 Outline
Lexical Analysis
The Parsing Problem
  Introduction to Parsing
  Top-Down Parsers
  Bottom-Up Parsers
  The Complexity of Parsing
Recursive-Descent Parsing
  The Recursive-Descent Parsing Process
  The LL Grammar Class
Bottom-Up Parsing
  The Parsing Problem for Bottom-Up Parsers
  Shift-Reduce Algorithms
  LR Parsers

Introduction
Language implementation systems analyze source code.
Syntax analysis is based on a formal description of the syntax of the source language (BNF).
The syntax analysis portion of a language processor consists of two parts:
1. A low-level part called a lexical analyzer
2. A high-level part called a syntax analyzer, or parser

Introduction
Reasons to use BNF to describe syntax:
1. It provides a clear and concise syntax description
2. The parser can be based directly on the BNF
3. Parsers based on BNF are easy to maintain
Reasons to separate lexical and syntax analysis:
1. Simplicity
2. Efficiency
3. Portability

Lexical Analysis
A lexical analyzer is a pattern matcher for character strings.
A lexical analyzer is a "front end" for the parser.
It identifies substrings of the source program that belong together, called lexemes.
Lexemes match a character pattern that is associated with a lexical category called a token.
Example: sum is a lexeme; its token may be IDENT.

Lexical Analysis
A lexical analyzer is a function that returns the next token in an input string (the program).
Three approaches to building a lexical analyzer:
1. Use a software-generated, table-driven lexical analyzer
2. Write a program that implements a state diagram
3. Hand-construct a table-driven implementation of a state diagram

State Diagram Design
A naïve state diagram would have a transition from every state on every character in the source language.
Such a diagram would be very large!
Transitions can be combined to simplify the state diagram.

Example State Diagram Design
When recognizing an identifier, all uppercase and lowercase letters are equivalent, so use a character class that includes all letters.
When recognizing an integer literal, all digits are equivalent, so use a digit class.
Reserved words and identifiers can be recognized together: a table lookup then determines whether the lexeme is an identifier or a reserved word.

Example Convenient Utility Subprograms
getChar: gets the next character of input, puts it in nextChar, determines its class, and puts the class in charClass
addChar: moves the character from nextChar into lexeme, the variable where the lexeme is being accumulated
lookup: determines whether the string in lexeme is a reserved word (returns a code)

Example Implementation (assume initialization)
int lex() {
  switch (charClass) {
    case LETTER:
      addChar();   /* nextChar -> lexeme */
      getChar();   /* input -> nextChar */
      while (charClass == LETTER || charClass == DIGIT) {
        addChar();   /* nextChar -> lexeme */
        getChar();   /* input -> nextChar */
      }
      return lookup(lexeme);   /* reserved-word code or identifier code */
    case DIGIT:
      addChar();   /* first digit -> lexeme */
      getChar();
      while (charClass == DIGIT) {
        addChar();
        getChar();
      }
      return INT_LIT;
    /* cases for other character classes are omitted on this slide */
  } /* End of switch */
} /* End of function lex */
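The fragment above leaves out the helper subprograms and the initialization step. The following is a minimal sketch of how they might fit together, not the slide author's actual code. It assumes the input is a NUL-terminated string, and the names initLex, FOR_CODE, IF_CODE, UNKNOWN, and EOF_TOKEN, the token code values, and the two-entry reserved-word list are all hypothetical choices made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Character classes and token codes. The numeric values are arbitrary
   assumptions made for this sketch. */
enum { LETTER, DIGIT, OTHER, END };
enum { IDENT = 10, INT_LIT = 11, FOR_CODE = 30, IF_CODE = 31,
       UNKNOWN = 98, EOF_TOKEN = 99 };

static const char *input;   /* remaining source text */
static char nextChar;       /* most recently read character */
static int charClass;       /* class of nextChar */
static char lexeme[100];    /* lexeme being accumulated */
static int lexLen;

/* getChar: fetch the next input character and record its class. */
static void getChar(void) {
    nextChar = *input;
    if (nextChar == '\0') { charClass = END; return; }
    input++;
    if (isalpha((unsigned char)nextChar)) charClass = LETTER;
    else if (isdigit((unsigned char)nextChar)) charClass = DIGIT;
    else charClass = OTHER;
}

/* addChar: append nextChar to lexeme; extra characters are dropped
   once the buffer is full. */
static void addChar(void) {
    if (lexLen < (int)sizeof lexeme - 1) {
        lexeme[lexLen++] = nextChar;
        lexeme[lexLen] = '\0';
    }
}

/* lookup: return a reserved-word code, or IDENT if s is not reserved.
   The reserved-word list here is a hypothetical two-entry table. */
static int lookup(const char *s) {
    if (strcmp(s, "for") == 0) return FOR_CODE;
    if (strcmp(s, "if") == 0) return IF_CODE;
    return IDENT;
}

/* initLex: the "assume initialization" step; loads the first character. */
void initLex(const char *src) {
    assert(src != NULL);
    input = src;
    getChar();
}

/* lex: return the token code of the next lexeme in the input. */
int lex(void) {
    lexLen = 0;
    lexeme[0] = '\0';
    while (charClass == OTHER && isspace((unsigned char)nextChar))
        getChar();                       /* skip white space */
    switch (charClass) {
    case LETTER:                         /* identifier or reserved word */
        addChar(); getChar();
        while (charClass == LETTER || charClass == DIGIT) {
            addChar(); getChar();
        }
        return lookup(lexeme);
    case DIGIT:                          /* integer literal */
        addChar(); getChar();
        while (charClass == DIGIT) {
            addChar(); getChar();
        }
        return INT_LIT;
    case END:
        return EOF_TOKEN;
    default:                             /* any other single character */
        addChar(); getChar();
        return UNKNOWN;
    }
}
```

A caller would invoke initLex once on the source text and then call lex repeatedly until it returns EOF_TOKEN, reading lexeme after each call if the matched text is needed.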

The Parsing Problem
Goals of the parser: find all syntax errors; for each one, produce a diagnostic message and recover quickly!
End result: a trace of the parse tree for the program
Two categories of parsers:
Top down: builds the parse tree starting at the root
Bottom up: builds the parse tree starting at the leaves
Parsers look only one token ahead in the input.

Top-Down Parsers
Given a sentential form xAα, where x is a string of terminals, A is a nonterminal, and α is a string of terminals and nonterminals, the parser must choose the correct A-rule to get the next sentential form in the leftmost derivation, using only the first token produced by A.
The most common top-down parsing algorithms:
1. Recursive descent: a coded implementation
2. LL parsers: a table-driven implementation

Bottom-Up Parsers
Given a right sentential form α, determine which substring of α is the right-hand side of the grammar rule that must be reduced (a production applied in reverse) to produce the previous sentential form in the rightmost derivation.
The most common bottom-up parsing algorithms are in the LR family:
SLR
LALR (parses more grammars than SLR)
Canonical LR (parses more grammars than LALR)

The Complexity of Parsing
Parsers that work for any unambiguous grammar are complex and inefficient: O(n³), where n is the length of the input.
Compilers use parsers that work for only a subset of all unambiguous grammars, but do so in linear time: O(n), where n is the length of the input.

Recursive-Descent Parsing
Use a subprogram for each nonterminal in the grammar; that subprogram parses sentences that can be generated by the nonterminal.
Since a grammar expressed in EBNF minimizes the number of nonterminals, EBNF is a good basis for recursive-descent parsing.
Assume a lexical analyzer named lex puts the token code of the next terminal of the input stream into nextToken.
By convention, every parsing routine leaves the next unconsumed input token in nextToken when it returns.

A grammar for simple expressions:
<expr> → <term> {(+ | -) <term>}
<term> → <factor> {(* | /) <factor>}
<factor> → id | ( <expr> )

The coding process when there is only one RHS for a particular nonterminal:
For each terminal symbol in the RHS, compare it to nextToken: if they match, continue; if not, report an error.
For each nonterminal symbol in the RHS, call its associated parsing subprogram.

Recursive-Descent Parsing function for expression
/* Function expr
   Parses strings in the language generated by the rule:
   <expr> → <term> {(+ | -) <term>} */
void expr() {
  /* Parse the first term */
  term();
  /* As long as the next token is + or -, call lex to get the
     next token and parse the next term */
  while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
    lex();
    term();
  } /* End of while */
} /* End of function expr */
This particular routine does not detect errors.

The coding process when there is more than one RHS for a particular nonterminal:
First determine which RHS to parse.
The correct RHS is chosen by comparing nextToken (the lookahead) with the first token generated by each RHS until a match is found.
If no match is found, it is a syntax error.
Once a particular RHS is chosen, proceed as in the single-RHS case.

Recursive-Descent Parsing function for factor
/* Function factor
   Parses strings in the language generated by the rule:
   <factor> → id | ( <expr> ) */
void factor() {
  /* Determine which RHS */
  if (nextToken == ID_CODE)
    /* RHS is id: just call lex to load nextToken */
    lex();
  else if (nextToken == LEFT_PAREN_CODE) {
    /* RHS is ( <expr> ): pass over the left paren, then call expr */
    lex();
    expr();
    /* Check for the right paren */
    if (nextToken == RIGHT_PAREN_CODE)
      lex();
    else
      error();
  }
  else
    error();   /* Neither RHS matches */
} /* End of function factor */
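The routines for expr and factor can be combined with a term routine and a small scanner to give a complete recognizer for the expression grammar. What follows is a rough sketch under several assumptions: the input is a NUL-terminated string, an id is a letter followed by letters or digits, error() merely counts errors instead of printing diagnostics, and the names parse, END_CODE, BAD_CODE, MULT_CODE, and DIV_CODE are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { ID_CODE, PLUS_CODE, MINUS_CODE, MULT_CODE, DIV_CODE,
       LEFT_PAREN_CODE, RIGHT_PAREN_CODE, END_CODE, BAD_CODE };

static const char *src;   /* remaining input */
static int nextToken;     /* one-token lookahead */
static int errors;        /* number of syntax errors seen */

/* lex: load the token code of the next lexeme into nextToken. */
static void lex(void) {
    while (isspace((unsigned char)*src)) src++;
    if (isalpha((unsigned char)*src)) {          /* identifier */
        while (isalnum((unsigned char)*src)) src++;
        nextToken = ID_CODE;
        return;
    }
    switch (*src) {
    case '+': nextToken = PLUS_CODE; break;
    case '-': nextToken = MINUS_CODE; break;
    case '*': nextToken = MULT_CODE; break;
    case '/': nextToken = DIV_CODE; break;
    case '(': nextToken = LEFT_PAREN_CODE; break;
    case ')': nextToken = RIGHT_PAREN_CODE; break;
    case '\0': nextToken = END_CODE; return;     /* stay at end of input */
    default:  nextToken = BAD_CODE; break;
    }
    src++;
}

/* error: this sketch only counts errors. */
static void error(void) { errors++; }

static void expr(void);

/* <factor> -> id | ( <expr> ) */
static void factor(void) {
    if (nextToken == ID_CODE) {
        lex();
    } else if (nextToken == LEFT_PAREN_CODE) {
        lex();                                   /* pass over '(' */
        expr();
        if (nextToken == RIGHT_PAREN_CODE) lex();
        else error();                            /* missing ')' */
    } else {
        error();                                 /* neither RHS matches */
    }
}

/* <term> -> <factor> {(* | /) <factor>} */
static void term(void) {
    factor();
    while (nextToken == MULT_CODE || nextToken == DIV_CODE) {
        lex();
        factor();
    }
}

/* <expr> -> <term> {(+ | -) <term>} */
static void expr(void) {
    term();
    while (nextToken == PLUS_CODE || nextToken == MINUS_CODE) {
        lex();
        term();
    }
}

/* parse: return 1 if text is a complete sentence of <expr>, else 0. */
int parse(const char *text) {
    assert(text != NULL);
    src = text;
    errors = 0;
    lex();                                       /* load first token */
    expr();
    return errors == 0 && nextToken == END_CODE;
}
```

Calling parse("(sum + total) * x") exercises both RHSs of factor. The final check against END_CODE rejects inputs such as "a b", where expr stops early with tokens left over.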

The LL Grammar Class
LL: Left-to-right scan of the input, Leftmost derivation
Two common characteristics of grammars disallow top-down parsing:
1. Left recursion, either direct or indirect (the left recursion problem). A grammar can be modified to remove left recursion.
2. Lack of pairwise disjointness

Pairwise Disjointness
Pairwise disjointness is the ability to choose the correct RHS using one token of lookahead.
Definition: FIRST(α) = {a | α =>* aβ} (if α =>* ε, then ε ∈ FIRST(α))
Pairwise Disjointness Test: for each nonterminal A in the grammar that has more than one RHS, and for each pair of rules A → αi and A → αj, it must be true that FIRST(αi) ∩ FIRST(αj) = ∅
Examples:
A → a | bB | cAb   GOOD! (the FIRST sets {a}, {b}, {c} are disjoint)
A → a | aB   BAD! (both RHSs begin with a)
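For the two example rules, the test can be sketched in a few lines of C. This is a deliberately simplified illustration, not a general FIRST-set computation: it assumes terminals are single lowercase letters, nonterminals are uppercase letters, and every RHS is nonempty and begins with a terminal, so that FIRST of a RHS is just its first symbol. The names TermSet, firstOf, and pairwiseDisjoint are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A set of terminals as a bitmask: bit i stands for the terminal 'a' + i. */
typedef unsigned TermSet;

/* firstOf: FIRST set of one RHS, under the assumption that the RHS
   begins with a lowercase terminal, so FIRST(rhs) = { rhs[0] }. */
static TermSet firstOf(const char *rhs) {
    assert(rhs != NULL && islower((unsigned char)rhs[0]));
    return 1u << (rhs[0] - 'a');
}

/* pairwiseDisjoint: return 1 if FIRST(rhs[i]) and FIRST(rhs[j]) have an
   empty intersection for every pair i != j, else 0. */
int pairwiseDisjoint(const char *rhs[], size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (firstOf(rhs[i]) & firstOf(rhs[j]))
                return 0;   /* one lookahead token cannot choose between them */
    return 1;
}
```

Passing the three RHSs "a", "bB", "cAb" yields 1, while "a", "aB" yields 0. A grammar whose RHSs begin with nonterminals or can derive ε would need a real FIRST computation in place of firstOf.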

Pairwise Disjointness
Left factoring can resolve the problem.
Replace
<variable> → ident | ident [<expr>]
with
<variable> → ident <new>
<new> → ε | [<expr>]
or, in EBNF,
<variable> → ident [[<expr>]]
(the outer brackets are EBNF metasymbols)

Bottom-Up Parsing
The parsing problem: dealing only with right sentential forms, find the correct RHS to reduce in order to get the previous step of the rightmost derivation.
Reduce = replace a RHS with the LHS of its production rule

LR Parsers
LR: Left-to-right scan of the input, Rightmost derivation
Advantages of LR parsers:
They work for most grammars that describe programming languages.
They work on a larger class of grammars than other bottom-up algorithms, but are just as efficient as any other bottom-up parser.
They can detect syntax errors as soon as possible.
The LR class of grammars is a superset of the class parsable by LL parsers.

Bottom-Up Parsing Definitions
β is a phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 =>+ α1βα2 (phrases correspond to the internal nodes of the parse tree)
β is a simple phrase of the right sentential form γ if and only if S =>* γ = α1Aα2 => α1βα2 (a simple phrase corresponds to a single RHS of the grammar)
β is the handle of the right sentential form γ = αβw if and only if S =>*rm αAw =>rm αβw

Shift-Reduce Algorithms
The handle of a right sentential form is its leftmost simple phrase.
Given a parse tree, it is easy to find the handle.
Bottom-up parsing can be thought of as handle pruning.

Figure 4.2 A parse tree for E + T * id

Shift-Reduce Algorithms
1. Shift: move the next input token onto the top of the parse stack
2. Reduce: replace the handle on the top of the parse stack with its corresponding LHS

Figure 4.3 The structure of an LR parser

LR Parsers
LR parsers must be constructed with a tool.
Knuth's insight: a bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions. Only a finite and relatively small number of different parse situations can occur, so the history can be stored as a parser state on the parse stack.

Bottom-up Parsing
An LR configuration stores the state of an LR parser in a pair of data structures, a stack and a queue.
The stack contains pairs, each a grammar symbol (terminal or nonterminal) together with a state: ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm))
The queue contains the remaining input terminals: (ai ai+1 … an $)
Initial configuration: ((-,S0)) and (a1 … an $)

LR Configuration
LR parsers are table driven; the table has two components, an ACTION table and a GOTO table.
The ACTION table specifies the action of the parser, given the parser state and the next token.
  Rows = state names; columns = terminals
The GOTO table specifies which state to put on top of the parse stack after a reduction action is done.
  Rows = state names; columns = nonterminals

Bottom-up Parsing Parser Actions
Assume the following conditions during the parse:
The stack is ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)), where each Xi is a single terminal or nonterminal and each Si is a state name.
The queue is (ai ai+1 … an $), where all elements are unparsed terminals.
Then fetch the state on top of the stack (Sm) and the terminal at the front of the queue (ai) to determine ACTION[Sm,ai].

Bottom-up Parsing Parser Actions
switch (ACTION[Sm,ai]) {
  case Shift(S): the next configuration is:
    stack: ((-,S0)(X1,S1)(X2,S2)…(Xm,Sm)(ai,S))
    queue: (ai+1 ai+2 … an $)
  case Reduce(A → β): the next configuration is:
    stack: ((-,S0)(X1,S1)(X2,S2)…(Xm-r,Sm-r)(A,S))
    queue: (ai ai+1 … an $) (unchanged)
    where r = length(β) and S = GOTO[Sm-r,A]
  case Accept: parse complete; no errors found
  default: Error; call an error-handling routine
}

Shift-Reduce Parsing Example
Traditional grammar for arithmetic expressions:
1. E → E + T
2. E → T
3. T → T * F
4. T → F
5. F → ( E )
6. F → id
Consider the expression: id + id * id $

STACK                   INPUT            ACTION     GOTO
0                       id + id * id $   Shift 5
0 id5                   + id * id $      Reduce 6   [0,F]
0 F3                    + id * id $      Reduce 4   [0,T]
0 T2                    + id * id $      Reduce 2   [0,E]
0 E1                    + id * id $      Shift 6
0 E1 +6                 id * id $        Shift 5
0 E1 +6 id5             * id $           Reduce 6   [6,F]
0 E1 +6 F3              * id $           Reduce 4   [6,T]
0 E1 +6 T9              * id $           Shift 7
0 E1 +6 T9 *7           id $             Shift 5
0 E1 +6 T9 *7 id5       $                Reduce 6   [7,F]
0 E1 +6 T9 *7 F10       $                Reduce 3   [6,T]
0 E1 +6 T9              $                Reduce 1   [0,E]
0 E1                    $                Accept

Figure 4.5 The LR parsing table for an arithmetic expression grammar
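The shift, reduce, and accept actions can be driven by a short table-lookup loop. The sketch below is illustrative only. The figure itself is not reproduced in this transcript, so the ACTION and GOTO entries are assumed to be the usual SLR table for this six-rule grammar, which agrees with the state numbers that appear in the parsing example (5, 3, 2, 1, 6, 9, 7, 10). The integer encoding (positive = shift to that state, negative = reduce by that rule, ACC = accept, 0 = error), the terminal and nonterminal codes, and the name lrParse are all hypothetical. Only states are kept on the stack, because the grammar symbols are not needed to make parsing decisions.

```c
#include <assert.h>
#include <stddef.h>

enum { T_ID, T_PLUS, T_STAR, T_LPAREN, T_RPAREN, T_END };  /* terminals; T_END is $ */
enum { N_E, N_T, N_F };                                     /* nonterminals */
enum { ACC = 100, MAX_STACK = 256 };

/* ACTION[state][terminal]: n > 0 shift to state n, n < 0 reduce by
   rule -n, ACC accept, 0 error. Assumed table, see the text above. */
static const int ACTION[12][6] = {
    /*        id   +   *   (   )    $  */
    /* 0 */ {  5,  0,  0,  4,  0,   0 },
    /* 1 */ {  0,  6,  0,  0,  0, ACC },
    /* 2 */ {  0, -2,  7,  0, -2,  -2 },
    /* 3 */ {  0, -4, -4,  0, -4,  -4 },
    /* 4 */ {  5,  0,  0,  4,  0,   0 },
    /* 5 */ {  0, -6, -6,  0, -6,  -6 },
    /* 6 */ {  5,  0,  0,  4,  0,   0 },
    /* 7 */ {  5,  0,  0,  4,  0,   0 },
    /* 8 */ {  0,  6,  0,  0, 11,   0 },
    /* 9 */ {  0, -1,  7,  0, -1,  -1 },
    /* 10 */ { 0, -3, -3,  0, -3,  -3 },
    /* 11 */ { 0, -5, -5,  0, -5,  -5 },
};

/* GOTO[state][nonterminal]: state to push after a reduction (0 = unused). */
static const int GOTO[12][3] = {
    /*        E  T   F */
    /* 0 */ { 1, 2,  3 }, /* 1 */ { 0, 0, 0 }, /* 2 */ { 0, 0, 0 },
    /* 3 */ { 0, 0,  0 }, /* 4 */ { 8, 2, 3 }, /* 5 */ { 0, 0, 0 },
    /* 6 */ { 0, 9,  3 }, /* 7 */ { 0, 0, 10 }, /* 8 */ { 0, 0, 0 },
    /* 9 */ { 0, 0,  0 }, /* 10 */ { 0, 0, 0 }, /* 11 */ { 0, 0, 0 },
};

/* Length of each rule's RHS and the code of its LHS, indexed by rule number. */
static const int RULE_LEN[7] = { 0, 3, 1, 3, 1, 3, 1 };
static const int RULE_LHS[7] = { 0, N_E, N_E, N_T, N_T, N_F, N_F };

/* lrParse: input is an array of terminal codes ending with T_END.
   Each reduction's rule number is appended to rules[] (at most maxRules
   entries) and the count is stored in *nRules.
   Returns 1 on Accept, 0 on a syntax error or stack overflow. */
int lrParse(const int *input, int *rules, size_t maxRules, size_t *nRules) {
    int stack[MAX_STACK];
    int top = 0;
    assert(input != NULL && rules != NULL && nRules != NULL);
    stack[0] = 0;                          /* initial state S0 */
    *nRules = 0;
    for (;;) {
        int act = ACTION[stack[top]][*input];
        if (act == ACC) return 1;
        if (act > 0) {                     /* Shift: push state, consume token */
            if (top + 1 >= MAX_STACK) return 0;
            stack[++top] = act;
            input++;
        } else if (act < 0) {              /* Reduce by rule -act */
            int rule = -act;
            top -= RULE_LEN[rule];         /* pop r states */
            stack[top + 1] = GOTO[stack[top]][RULE_LHS[rule]];
            top++;
            if (*nRules < maxRules) rules[(*nRules)++] = rule;
        } else {
            return 0;                      /* Error entry */
        }
    }
}
```

For the input id + id * id $, the recorded reductions should follow the same order as the Reduce steps in the trace: rules 6, 4, 2, 6, 4, 6, 3, 1.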

Summary
Syntax analysis is a common part of language implementation.
A lexical analyzer is a pattern matcher that isolates the small-scale parts of a program.
A parser detects syntax errors and produces a parse tree.
A recursive-descent parser is an LL parser and is usually based on an EBNF description of the grammar.
The parsing problem for bottom-up parsers: find the substring of the current sentential form that must be reduced.
The LR family of shift-reduce parsers is the most common bottom-up parsing approach.
