Chapter 4 Lexical Analysis.

Lexical Analysis
Why split it from parsing?
- Simplifies design: parsers that must handle whitespace and comments are more awkward
- Efficiency: use only the most powerful technique that works, and nothing more (no parsing sledgehammers for lexical nuts)
- Portability
- More modular code, more code re-use

Source Code Characteristics
- Identifiers: count, max, get_num
- Language keywords (reserved or predefined): switch, if..then..else, printf, return, void
- Mathematical operators: +, *, >>, …, <=, =, !=, …
- Literals: "Hello World"
- Comments
- Whitespace

Reserved words versus predefined identifiers Reserved words cannot be used as the name of anything in a definition (i.e., as an identifier). Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t). Examples of predefined identifiers in Java: anything in java.lang package, such as String, Object, System, Integer.

Language of Lexical Analysis
- Tokens: category
- Patterns: regular expression
- Lexemes: actual string matched

Tokens are not enough… Clearly, if we replaced every occurrence of a variable with a token, we would lose other valuable information (value, name). Such data items are attributes of the tokens, stored in the symbol table.

Token delimiters: when does a token/lexeme end? e.g. xtemp=ytemp

Ambiguity in identifying tokens
A programming language definition will state how to resolve uncertain token assignment. <> : is it 1 token or 2? Reserved keywords (e.g. if) take precedence over identifiers (the matching rules are the same for both). Disambiguating rules state what to do. 'Principle of longest substring': greedy match.

Regular Expressions
To represent patterns of strings of characters (REs)
Alphabet: set of legal symbols
Meta-characters: characters with special meanings; ε is the empty string
3 basic operations:
- Choice: choice1|choice2, e.g. a|b matches either a or b
- Concatenation: firstthing secondthing, e.g. (a|b)c matches the strings { ac, bc }
- Repetition (Kleene closure): repeatme*, e.g. a* matches { ε, a, aa, aaa, aaaa, … }
Precedence: * is highest, | is lowest; thus a|bc* is a|(b(c*))

Regular Expressions…
We can add regular definitions: digit = 0|1|2|…|9, and then use them: digit digit* is a sequence of 1 or more digits
One or more repetitions: (a|b)(a|b)* ≡ (a|b)+
Any character in the alphabet: .   (so .*b.* matches strings containing at least one b)
Ranges: [a-z], [a-zA-Z], [0-9] (assumes a character-set ordering)
Not: ~a or [^a]
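These operators behave the same way in most regex libraries; a quick sanity check of the patterns above using Python's standard `re` module:

```python
import re

# digit digit* : one or more digits (equivalent to [0-9]+)
assert re.fullmatch(r"[0-9][0-9]*", "2014") is not None
assert re.fullmatch(r"[0-9][0-9]*", "") is None

# .*b.* : strings containing at least one b
assert re.fullmatch(r".*b.*", "aab") is not None
assert re.fullmatch(r".*b.*", "aaa") is None

# [^a] : any single character except a
assert re.fullmatch(r"[^a]", "z") is not None
assert re.fullmatch(r"[^a]", "a") is None

# a|bc* parses as a|(b(c*)): * binds tightest, | loosest
assert re.fullmatch(r"a|bc*", "bccc") is not None
assert re.fullmatch(r"a|bc*", "ac") is None
```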

Some exercises Describe the languages denoted by the following regular expressions 0 ( 0 | 1 ) * 0 ( ( 11 | 0 ) * ) * 0* 1 0* 1 0* 1 0 * Write regular definitions for the following regular expressions All strings that contain the five vowels in order (but not necessarily adjacent) aabcaadggge is okay All strings of letters in which the letters are in ascending lexicographic order All strings of 0’s and 1’s that do not contain the substring 011

Limitations of REs
REs can describe many language constructs but not all. For example: Alphabet = {a,b}; describe the set of strings consisting of a single a surrounded by an equal number of b's: S = {a, bab, bbabb, bbbabbb, …}. Another example: nested tags in HTML.

Lookahead: <=, <>, <
When we read a character to establish the end of a token, we need to make sure that it is still available as part of the next token; it is the start of the next token! This is lookahead: decide what to do based on the character we 'haven't read'. Sometimes implemented by reading from a buffer and then pushing the input back into the buffer, then starting recognition of the next token.
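Pushback is often wrapped in a tiny buffered reader. A hypothetical sketch (the class and method names here are made up for illustration, not from the text):

```python
class Reader:
    """One-character pushback buffer: read characters, and push one
    back when the scanner has read past the end of a token."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def get(self):
        if self.pos < len(self.text):
            ch = self.text[self.pos]
            self.pos += 1
            return ch
        return None          # end of input

    def unget(self):
        self.pos -= 1        # push the last character back

r = Reader("<=x")
first = r.get()              # '<'
second = r.get()             # '=' so the token is '<='
# had `second` been anything else, we would call r.unget()
```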

Classic Fortran example: DO 99 I=1,10 becomes DO99I=1,10, versus DO99I=1.10. The first is a do loop, the second an assignment. We need lots of lookahead to distinguish them. When can the lexical analyzer assign a token? Push back into the input buffer, or 'backtracking'.

Finite Automata
A recognizer determines if an input string is a sentence in a language. It uses a regular expression: turn the regular expression into a finite automaton, which could be deterministic or non-deterministic.

Transition diagram for identifiers
RE: Identifier -> letter (letter | digit)*
[Diagram: start state 1; on a letter, move to state 2; state 2 loops on letter or digit; any other character leaves state 2 and accepts.]
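The diagram translates almost line-for-line into code. A minimal sketch in Python (`isalpha`/`isdigit` stand in for the letter and digit classes, so this also accepts Unicode letters):

```python
def scan_identifier(s, i=0):
    """Run the identifier transition diagram on s starting at position i.
    Returns the matched lexeme, or None if no identifier starts there."""
    if i >= len(s) or not s[i].isalpha():
        return None                      # state 1 requires a letter
    j = i + 1
    while j < len(s) and (s[j].isalpha() or s[j].isdigit()):
        j += 1                           # loop in state 2
    return s[i:j]                        # any 'other' character ends the token

print(scan_identifier("count2 = 0"))     # count2
```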

An NFA is similar to a DFA but it also permits multiple transitions over the same character and transitions over ε. In the case of multiple transitions from a state over the same character, when we are at this state and we read this character, we have more than one choice; the NFA succeeds if at least one of these choices succeeds. The ε-transition doesn't consume any input characters, so you may jump to another state for free. Clearly DFAs are a subset of NFAs, but it turns out that DFAs and NFAs have the same expressive power.

From a Regular Expression to an NFA: Thompson's Construction
[Diagram: the NFA Thompson's construction builds for (a|b)*abb, with ε-transitions wiring together the a|b alternation, the Kleene-star loop, and the trailing a, b, b transitions into the accept state.]

[Diagrams: a non-deterministic finite automaton (NFA) for (a|b)*abb with states 0, 1, 2, 3, and the equivalent deterministic finite automaton (DFA) with states 0, 01, 02, 03.]

Transition Table (NFA)

State | a     | b
0     | {0,1} | {0}
1     |       | {2}
2     |       | {3}
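A transition table in this form can be simulated directly: because the automaton is non-deterministic, the simulation tracks the whole set of states it could be in. A sketch in Python using the table above:

```python
# The transition table as a dict: (state, symbol) -> set of next states
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}

def nfa_accepts(word, start=0, accepting=3):
    """Simulate the NFA by tracking every state it could currently be in."""
    states = {start}
    for ch in word:
        states = {t for s in states for t in NFA.get((s, ch), ())}
    return accepting in states

print(nfa_accepts("aabb"))  # True: the word ends in abb
print(nfa_accepts("abab"))  # False
```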

NFA -> DFA (subset construction) Suppose that you assign a number to each NFA state. The DFA states generated by subset construction have sets of numbers, instead of just one number. For example, a DFA state may have been assigned the set {5, 6, 8}. This indicates that arriving at the state labeled {5, 6, 8} in the DFA is the same as arriving at the state 5, the state 6, or the state 8 in the NFA when parsing the same input. Recall that a particular input sequence, when parsed by a DFA, leads to a unique state, while when parsed by an NFA it may lead to multiple states. First we need to handle transitions that lead to other states for free (without consuming any input). These are the ε-transitions. We define the closure of an NFA node as the set of all the nodes reachable from this node using zero, one, or more ε-transitions.

NFA -> DFA (cont) The start state of the constructed DFA is labeled by the closure of the NFA start state. For every DFA state labeled by some set {s1,..., sn} and for every character c in the language alphabet, you find all the states reachable by s1, s2, ..., or sn using c arrows and you union together the closures of these nodes. If this set is not the label of any other node in the DFA constructed so far, you create a new DFA node with this label.
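The construction described above can be sketched in a few lines of Python (the function names and NFA encoding are my own: `delta` maps (state, symbol) to a set of states, `eps` maps a state to its ε-successors):

```python
def eps_closure(states, eps):
    """All NFA states reachable from `states` by zero or more ε-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def subset_construction(start, delta, eps, alphabet):
    """Build the DFA transition map; each DFA state is a set of NFA states."""
    start_state = eps_closure({start}, eps)
    dfa, worklist = {}, [start_state]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for c in alphabet:
            move = {t for s in S for t in delta.get((s, c), ())}
            T = eps_closure(move, eps)
            if T:
                dfa[S][c] = T
                worklist.append(T)
    return start_state, dfa

# The ε-free NFA for (a|b)*abb from the earlier transition table
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
start, dfa = subset_construction(0, delta, {}, "ab")
print(len(dfa))  # 4 DFA states: {0}, {0,1}, {0,2}, {0,3}
```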

Transition Table (DFA)

State | a  | b
0     | 01 | 0
01    | 01 | 02
02    | 01 | 03
03    | 01 | 0

Writing a lexical analyzer The DFA helps us to write the scanner. Figure 4.1 in your text gives a good example of what a scanner might look like.

LEX (FLEX) Tool for generating programs which recognize lexical patterns in text Takes regular expressions and turns them into a program

Lexical Errors Only a small percentage of errors can be recognized during Lexical Analysis Consider if (good == “bad)

Examples from the PERL language:
- Line ends inside literal string
- Illegal character in input file
- Missing semi-colon
- Missing operator
- Missing paren
- Unquoted string
- Unopened file handle

In general, what does a lexical error mean? Strategies for dealing with one:
- 'Panic mode': delete characters from input until something matches
- Inserting characters
- Re-ordering characters
- Replacing characters
For an error like 'illegal character' we should report it sensibly.

Syntax Analysis, also known as Parsing
Grouping together tokens into larger structures; analogous to lexical analysis.
Input: tokens (output of the lexical analyzer). Output: structured representation of the original program.

Parsing Fundamentals Source program: 3 + 4 After Lexical Analysis: ???

A Context Free Grammar
A grammar is a four-tuple (Σ, N, P, S) where Σ is the terminal alphabet, N is the nonterminal alphabet, P is the set of productions, and S is a designated start symbol in N.

Parsing
Expression -> number plus number
Similar to regular definitions: concatenation, choice
Expression -> number Operator number
Operator -> + | - | * | /
Repetition is done differently

BNF Grammar
Expression -> number Operator number
The structure on the left is defined to consist of the choices on the right-hand side
Meta-symbols: -> and |
Different conventions for writing BNF grammars:
<expression> ::= number <operator> number
Expression -> number Operator number

Derivations
Derivation: a sequence of replacements of structure names by choices on the RHS of grammar rules. Begin: the start symbol. End: a string of token symbols. Each step, one replacement is made.
Exp -> Exp Op Exp | number
Op -> + | - | * | /

Example Derivation
Note the different arrows: => applies grammar rules in a derivation; -> is used to define grammar rules.
Nonterminals: Exp, Op. Terminals: number, * ('terminals' because they terminate the derivation).

Derivations (2)
E -> ( E ) | a
What sentences does this grammar generate? An example derivation: E => ( E ) => (( E )) => (( a ))
Note that this is what we couldn't achieve with regular definitions.

Recursive Grammars
E -> ( E ) | a is recursive: E -> ( E ) is the general case, E -> a is the terminating case.
We have no * operator in context-free grammars. Repetition = recursion:
E -> E α | β derives β, βα, βαα, βααα, … : all strings beginning with β followed by zero or more repetitions of α, i.e. βα*

Recursive Grammars (2)
a+ (regular expression): E -> E a | a (1), or E -> a E | a (2)
Two different grammars can derive the same language. (1) is left recursive; (2) is right recursive.
a* implies we need the empty production: E -> E a | ε

Recursive Grammars (3)
Require recursive data structures: parse trees.
Exp -> Exp Op Exp | number
Op -> + | - | * | /
[Parse tree for number * number: root exp (1) with children exp (2), op (3), exp (4); the leaves are number, *, number.]

Parse Trees & Derivations
Leaves = terminals; interior nodes = nonterminals.
If we replace the nonterminals right to left, the parse tree sequence is right to left: a rightmost derivation (a reverse post-order traversal).
If we derive left to right: a leftmost derivation (a pre-order traversal).
Parse trees encode information about the derivation process.

Formal Methods of Describing Syntax
1950: Noam Chomsky (noted linguist) described generative devices which describe four classes of languages (in order of decreasing power):
- Recursively enumerable: x -> y, where x and y can be any string of nonterminals and terminals.
- Context-sensitive: x -> y, where x and y can be strings of terminals and nonterminals but y must be the same length or longer than x. Can recognize a^n b^n c^n. Ex: if you were in the boxing ring and said "Hit me" it would imply a different action than if you were playing cards. Ex: if an IDENTSY between brackets is treated differently, in terms of what it matches, than an IDENTSY between parens, this is context-sensitive.
- Context-free (yacc): nonterminals appear singly on the left side of productions; any nonterminal can be replaced by its right-hand side regardless of the context it appears in. Can recognize a^n b^n, palindromes.
- Regular (lex): can recognize a^n b^m.
Chomsky was interested in the theoretic nature of natural languages.

Abstract Syntax Trees
Parse trees contain surplus information.
Token sequence: number + number (3 + 4)
[Figures: the full parse tree exp -> exp op exp for 3 + 4, versus the abstract syntax tree, a + node with children 3 and 4, which is all the information we actually need.]

An exercise
Consider the grammar: S -> (L) | a ; L -> L,S | S
What are the terminals, nonterminals, and start symbol?
Find leftmost and rightmost derivations and parse trees for the following sentences:
(a,a)
(a, (a,a))
(a, ((a,a), (a,a)))

Parsing token sequence: id + id * id with E -> E + E | E * E | ( E ) | - E | id

Ambiguity
If a sentence has two distinct parse trees, the grammar is ambiguous. Or alternatively: a grammar is ambiguous if there are two different rightmost derivations for the same string.
In English, the phrase "small dogs and cats" is ambiguous, as we aren't sure whether the cats are small or not. "I see flying planes" is also ambiguous.
A language is said to be ambiguous if no unambiguous grammar exists for it.
"Dance is at the old main gym." How is it parsed?

Ambiguous Grammars
Problem: no clear structure is expressed. A grammar that generates a string with 2 distinct parse trees is called an ambiguous grammar.
2+3*4 = 2 + (3*4) = 14, or 2+3*4 = (2+3) * 4 = 20
Our experience of math says interpretation 1 is correct, but the grammar does not express this:
E -> E + E | E * E | ( E ) | - E | id

Example of Ambiguity
Grammar: expr -> expr + expr | expr * expr | ( expr ) | NUMBER
Expression: 2 + 3 * 4
Parse trees: [two distinct trees, one grouping 2 + (3 * 4), the other (2 + 3) * 4]

Removing Ambiguity
Two methods:
1. Disambiguating rules. Positives: leaves the grammar unchanged. Negatives: the grammar is not the sole source of syntactic knowledge.
2. Rewrite the grammar, using knowledge of the meaning that we want to use later in the translation into object code to guide grammar alteration.

Precedence
E -> E Addop Term | Term
Addop -> + | -
Term -> Term * Factor | Term / Factor | Factor
Factor -> ( exp ) | number | id
Operators of equal precedence are grouped together at the same 'level' of the grammar: a 'precedence cascade'. The lowest-level operators have highest precedence. (The first shall be last and the last shall be first.)

Associativity
45-10-5 = ? 30 or 40? Subtraction is left associative, left to right (= 30).
E -> E Addop E | Term does not tell us how to split up 45-10-5.
E -> E Addop Term | Term forces left associativity via left recursion.
Precedence & associativity remove the ambiguity of arithmetic expressions, which is what our math teachers spent years telling us!

Ambiguous grammars Statement -> If-statement | other If-statement -> if (Exp) Statement | if (Exp) Statement else Statement Exp -> 0 | 1 Parse if (0) if (1) other1 else other2

Removing ambiguity Statement -> Matched-stmt | Unmatched-stmt Matched-stmt -> if (Exp) Matched-stmt else Matched-stmt | other Unmatched-stmt ->if (Exp) Statement | if (Exp) Matched-stmt else Unmatched-stmt

Extended BNF Notation
Notation for repetition and optional features.
{…} expresses repetition: expr -> expr + term | term becomes expr -> term { + term }
[…] expresses optional features: if-stmt -> if( expr ) stmt | if( expr ) stmt else stmt becomes if-stmt -> if( expr ) stmt [ else stmt ]

Notes on use of EBNF
Use {…} only for left recursive rules: expr -> term + expr | term should become expr -> term [ + expr ]
Do not start a rule with {…}: write expr -> term { + term }, not expr -> { term + } term
Exception to the previous rule: simple token repetition, e.g. expr -> { - } term …
Square brackets can be used anywhere, however: expr -> expr + term | term | unaryop term should be written as expr -> [ unaryop ] term { + term }

Syntax Diagrams
An alternative to EBNF. Rarely seen any more: EBNF is much more compact. Example (if-statement, p. 101): [diagram not reproduced]

How is Parsing done? Recursive descent (top down). Bottom up – tries to match input with the right hand side of a rule. Sometimes called shift-reduce parsers.

Predictive Parsing
Top-down parsing: LL(1) parsing. Table-driven predictive parsing (no recursion) versus recursive descent parsing, where each nonterminal is associated with a procedure call. No backtracking.
E -> E + T | T
T -> T * F | F
F -> (E) | id

Two grammar problems
Eliminating left recursion (without changing associativity):
A -> Aα | β   becomes   A -> βA' ; A' -> αA' | ε
Example: E -> E + T | T, T -> T * F | F, F -> (E) | id
The general case: A -> Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

Two grammar problems
Eliminating left recursion involving derivations of two or more steps:
S -> Aa | b
A -> Ac | Sd | ε
Substituting for S gives: A -> Ac | Aad | bd | ε

Removing Left Recursion
Before: A -> A x | y
After: A -> y B ; B -> x B | ε
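The Before/After transformation can be mechanized. A sketch for a single nonterminal (the grammar encoding is my own: right-hand sides as tuples of symbols, with the empty tuple () standing for ε):

```python
def eliminate_left_recursion(nt, productions):
    """Remove immediate left recursion for one nonterminal.
    Returns rules for nt and a fresh helper nonterminal nt+"'"."""
    recursive = [p[1:] for p in productions if p and p[0] == nt]  # the α parts
    others = [p for p in productions if not p or p[0] != nt]      # the β parts
    if not recursive:
        return {nt: productions}          # nothing to do
    new = nt + "'"
    return {
        nt: [beta + (new,) for beta in others],            # A -> β A'
        new: [alpha + (new,) for alpha in recursive] + [()],  # A' -> α A' | ε
    }

rules = eliminate_left_recursion("E", [("E", "+", "T"), ("T",)])
print(rules["E"], rules["E'"])  # E -> T E'   and   E' -> + T E' | ε
```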

Two grammar problems…
Left factoring:
Stmt -> if Exp then Stmt else Stmt | if Exp then Stmt
A -> αβ1 | αβ2   becomes   A -> αA' ; A' -> β1 | β2

Exercises
Eliminate left recursion from the following grammars:
S -> (L) | a ; L -> L,S | S
Bexpr -> Bexpr or Bterm | Bterm
Bterm -> Bterm and Bfactor | Bfactor
Bfactor -> not Bfactor | (Bexpr) | true | false

COMP313A Programming Languages Syntax Analysis (3)

Table driven predictive parsing Getting the grammar right Constructing the table

Table Driven Predictive Parsing
[Diagram: a predictive parsing program reads input (e.g. a + b $, or id + id * id), consults a parsing table, maintains a stack X Y Z $, and produces output.]

Table Driven Predictive Parsing

Nonterminal | id     | +        | *        | (      | )     | $
E           | E->TE' |          |          | E->TE' |       |
E'          |        | E'->+TE' |          |        | E'->e | E'->e
T           | T->FT' |          |          | T->FT' |       |
T'          |        | T'->e    | T'->*FT' |        | T'->e | T'->e
F           | F->id  |          |          | F->(E) |       |

Table Driven Predictive Parsing Parse id + id * id Leftmost derivation and parse tree using the grammar E -> TE’ E’ -> +TE’ | e T -> FT’ T’ -> *FT’ | e F -> (E) | id
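The parse requested on this slide can be reproduced with a small driver loop; the sketch below is my own minimal Python rendition, not the program from the text. The table is the one from the previous slide, with e-productions represented as empty lists:

```python
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    """Match terminals against the input, expand nonterminals via the
    table, and report an error on any missing entry or mismatch."""
    tokens = tokens + ["$"]
    stack = ["$", "E"]
    i = 0
    while stack:
        top = stack.pop()
        if top == tokens[i]:
            i += 1                        # match (also consumes $)
        elif top in NONTERMS:
            rhs = TABLE.get((top, tokens[i]))
            if rhs is None:
                return False              # error entry
            stack.extend(reversed(rhs))   # an e-production pushes nothing
        else:
            return False                  # terminal mismatch
    return i == len(tokens)

print(parse(["id", "+", "id", "*", "id"]))  # True
```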

First and Follow Sets
First and Follow sets tell when it is appropriate to put the right-hand side of some production on the stack (i.e., for which input symbols).
E -> TE'
E' -> +TE' | e
T -> FT'
T' -> *FT' | e
F -> (E) | id
Parse: id + id * id

First Sets
1. If X is a terminal, then FIRST(X) is {X}.
2. If X -> e is a production, then add e to FIRST(X).
3. If X is a nonterminal and X -> Y1 Y2 … Yk is a production, then place a in FIRST(X) if, for some i, a is in FIRST(Yi) and e is in all of FIRST(Y1), …, FIRST(Yi-1). If e is in FIRST(Yj) for all j = 1, 2, …, k, then add e to FIRST(X).

FIRST sets
E -> TE'
E' -> +TE' | e
T -> FT'
T' -> *FT' | e
F -> (E) | id
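The three rules amount to a fixed-point computation. A Python sketch over the expression grammar on this slide (the encoding is my own, with "ε" as an explicit symbol):

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(g):
    """Apply the FIRST rules repeatedly until nothing changes."""
    first = {nt: set() for nt in g}

    def first_of(seq):
        """FIRST of a string of symbols, using the sets computed so far."""
        out = set()
        for sym in seq:
            f = {sym} if sym not in g else first[sym]  # terminal: itself
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)                                   # every symbol nullable
        return out

    changed = True
    while changed:
        changed = False
        for nt, prods in g.items():
            for prod in prods:
                new = first_of(prod)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

print(sorted(first_sets(GRAMMAR)["E"]))  # ['(', 'id']
```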

Follow Sets
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A -> αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A -> αB, or a production A -> αBβ where FIRST(β) contains ε (i.e., β =>* ε), then everything in FOLLOW(A) is in FOLLOW(B).

Follow Sets
E -> TE'
E' -> +TE' | e
T -> FT'
T' -> *FT' | e
F -> (E) | id
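The FOLLOW rules can be run to a fixed point the same way. This sketch hardcodes the FIRST sets of the expression grammar (the sets derived on the earlier FIRST-sets slide) rather than recomputing them:

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
# FIRST sets for this grammar, as computed on the earlier slide
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}

def first_of(symbols):
    """FIRST of a string of symbols (rules 2 and 3 need FIRST(β))."""
    out = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})   # a terminal's FIRST is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)                    # the whole string can derive ε
    return out

def follow_sets(g, start="E"):
    follow = {nt: set() for nt in g}
    follow[start].add("$")          # rule 1
    changed = True
    while changed:
        changed = False
        for a, prods in g.items():
            for prod in prods:
                for i, b in enumerate(prod):
                    if b not in g:
                        continue
                    tail = first_of(prod[i + 1:])
                    add = tail - {EPS}
                    if EPS in tail:             # rules 2 and 3
                        add |= follow[a]
                    if not add <= follow[b]:
                        follow[b] |= add
                        changed = True
    return follow

print(sorted(follow_sets(GRAMMAR)["F"]))  # FOLLOW(F) = { $, ), *, + }
```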

FIRST and FOLLOW sets Construct first and follow sets for the following grammar after left recursion has been eliminated S->(L) | a L->L,S | S

Construction of the Predictive Parsing Table
Algorithm from Aho et al.:
1. For each production A -> α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A -> α to M[A, a].
3. If ε is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A, $].
4. Make each undefined entry of M be error.
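Steps 1-3 translate directly into code. A sketch that builds M for the expression grammar, with FIRST and FOLLOW hardcoded to the sets derived on the previous slides (encoding is my own):

```python
EPS = "ε"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], [EPS]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"*", "+", ")", "$"}}

def first_of(symbols):
    """FIRST of a production right-hand side."""
    out = set()
    for sym in symbols:
        f = FIRST.get(sym, {sym})
        out |= f - {EPS}
        if EPS not in f:
            return out
    out.add(EPS)
    return out

def build_table(g):
    M = {}
    for A, prods in g.items():
        for prod in prods:
            f = first_of(prod)
            for a in f - {EPS}:          # step 2
                M[(A, a)] = prod
            if EPS in f:                 # step 3 ($ is in FOLLOW where needed)
                for b in FOLLOW[A]:
                    M[(A, b)] = prod
    return M                             # step 4: entries not in M are errors

M = build_table(GRAMMAR)
print(M[("E'", "+")])   # the production E' -> + T E'
```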

Predictive Parsing Table Construct the parsing table for the grammar S->(L) | a L->L,S | S

COMP313A Programming Languages Syntax Analysis (4)

Lecture Outline A problem for predictive parsers Predictive parsing LL(1) grammars Error recovery in predictive parsing Recursive Descent parsing

Producing code from parse on the fly E -> TE’ E’ -> +TE’ | e T -> FT’ T’ -> *FT’ | e F -> (E) | id

Table Driven Predictive Parsing

Nonterminal | id     | +        | *        | (      | )     | $
E           | E->TE' |          |          | E->TE' |       |
E'          |        | E'->+TE' |          |        | E'->e | E'->e
T           | T->FT' |          |          | T->FT' |       |
T'          |        | T'->e    | T'->*FT' |        | T'->e | T'->e
F           | F->id  |          |          | F->(E) |       |

LL(1) grammars
S -> if E then S S' | a
S' -> else S | e
E -> b
FIRST(S) = {if, a}   FIRST(S') = {else, e}   FIRST(E) = {b}
FOLLOW(S) = {$, else}   FOLLOW(S') = {$, else}   FOLLOW(E) = {then}
Construct the parsing table.

LL(1) grammars
An LL(1) grammar has no multiply-defined entries in its parsing table. Left-recursive and ambiguous grammars are not LL(1).
A grammar G is LL(1) iff whenever A -> α | β are two distinct productions of G:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β =>* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).

Error Recovery in Predictive Parsing
Panic-mode recovery, based on a set of synchronizing tokens. Heuristics for synchronizing sets:
- For nonterminal A, all symbols in FOLLOW(A) and FIRST(A)
- Symbols that begin higher constructs
- If A derives e, then A -> e can be used as the default
- Pop a non-matching terminal from the top of the stack

Recursive Descent Parsers
A function for each nonterminal. Example expression grammar:
Expr -> Term Expr'
Expr' -> +Term Expr' | e
Term -> Factor Term'
Term' -> *Factor Term' | e
Factor -> (Expr) | id
Function Expr: if the next input symbol is a ( or id, then call function Term followed by function Expr'; else error.
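The slide's recipe (one function per nonterminal, branching on the next input symbol) in a minimal Python sketch; the token handling and error reporting here are simplified assumptions, not the text's implementation:

```python
class Parser:
    """Recursive descent for the expression grammar above."""
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # endmarker
        self.i = 0

    def peek(self):
        return self.tokens[self.i]

    def match(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t}, got {self.peek()}")
        self.i += 1

    def expr(self):            # Expr -> Term Expr'
        if self.peek() in ("(", "id"):
            self.term(); self.expr_()
        else:
            raise SyntaxError("expected ( or id")

    def expr_(self):           # Expr' -> + Term Expr' | e
        if self.peek() == "+":
            self.match("+"); self.term(); self.expr_()

    def term(self):            # Term -> Factor Term'
        self.factor(); self.term_()

    def term_(self):           # Term' -> * Factor Term' | e
        if self.peek() == "*":
            self.match("*"); self.factor(); self.term_()

    def factor(self):          # Factor -> ( Expr ) | id
        if self.peek() == "(":
            self.match("("); self.expr(); self.match(")")
        else:
            self.match("id")

p = Parser(["id", "+", "id", "*", "id"])
p.expr()
print(p.peek())   # $ : all input consumed
```

Note how the e-alternatives of Expr' and Term' become "do nothing" branches: the function simply returns when the lookahead is not + or *.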