Lexical Analysis (2 Lectures)
CSE244 Compilers 2 Overview Basic Concepts Regular Expressions –Language Lexical analysis by hand Regular Languages Tools –NFA –DFA Scanning tools –Lex / Flex / JFlex / ANTLR
CSE244 Compilers 3 Scanning Perspective Purpose –Transform a stream of symbols –Into a stream of tokens
CSE244 Compilers 4 Lexical Analyzer Responsibilities Lexical analyzer [Scanner] –Scan input –Remove white spaces –Remove comments –Manufacture tokens –Generate lexical errors –Pass token to parser
CSE244 Compilers 5 Modular design Rationale –Separate the two analyses High cohesion / Low coupling –Improve efficiency –Improve portability / maintainability –Enable integration of third-party lexers [lexer = lexical analysis tool]
CSE244 Compilers 6 Terminology Token –A classification for a common set of strings –Examples: Identifier, Integer, Float, Assign, LeftParen, RightParen,.... Pattern –The rules that characterize the set of strings for a token –Examples: [0-9]+ Lexeme –Actual sequence of characters that matches a pattern and has a given Token class. –Examples: Identifier: Name,Data,x Integer: 345,2,0,629,....
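The three terms can be seen side by side in a small sketch. The class name TokenDemo, the method classify, and the identifier pattern are assumptions made for illustration; only the integer pattern [0-9]+ comes from the slide.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: names and the identifier pattern are illustrative assumptions.
final class TokenDemo {
    // Patterns: the rules describing which strings belong to each token class.
    static final Pattern INTEGER = Pattern.compile("[0-9]+");
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z][A-Za-z0-9]*");

    // Returns the name of the token class whose pattern matches the whole lexeme.
    static String classify(String lexeme) {
        if (INTEGER.matcher(lexeme).matches()) return "Integer";
        if (IDENTIFIER.matcher(lexeme).matches()) return "Identifier";
        return "Unknown";
    }
}
```

Here "345" and "Name" are lexemes, "Integer" and "Identifier" are the tokens, and the two compiled expressions are the patterns.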
CSE244 Compilers 7 Examples
CSE244 Compilers 8 Lexical Errors Error handling is very localized with respect to the input source. Example: fi(a==f(x)) … generates no lexical error in C, because fi is a valid identifier. In what situations do errors occur? –The prefix of the remaining input doesn’t match any defined token Possible error recovery actions: –Deleting or inserting input characters –Replacing or transposing characters –Or, skipping over to the next separator to ignore the problem
CSE244 Compilers 9 Basic Scanning technique Use 1 character of look-ahead –Obtain char with getc() Do a case analysis –Based on lookahead char –Based on current lexeme Outcome –If char can extend lexeme, all is well, go on. –If char cannot extend lexeme: Figure out what the complete lexeme is and return its token Put the lookahead back into the symbol stream
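The one-character lookahead loop might look as follows in Java. This is only a sketch: the slide uses getc(), while this version assumes java.io.PushbackReader, whose unread call plays the role of "put the lookahead back". The class LookaheadDemo and its methods are hypothetical names.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class: scans one integer lexeme with one character of lookahead.
final class LookaheadDemo {
    // Reads a maximal run of digits and leaves the following character unread.
    static String scanInteger(PushbackReader in) throws IOException {
        StringBuilder lexeme = new StringBuilder();
        while (true) {
            int la = in.read();                // one character of lookahead
            if (la >= '0' && la <= '9') {
                lexeme.append((char) la);      // the char extends the lexeme: go on
            } else {
                if (la != -1) in.unread(la);   // cannot extend: put the lookahead back
                return lexeme.toString();
            }
        }
    }

    // Returns the integer lexeme and the character that follows it in the input.
    static String[] demo(String text) {
        try (PushbackReader in = new PushbackReader(new StringReader(text))) {
            String lexeme = scanInteger(in);
            int next = in.read();              // the pushed-back lookahead is read again here
            return new String[] { lexeme, next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling demo on the text 629+x yields the lexeme 629, and the next character available to the scanner is still the + that ended it.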
CSE244 Compilers 10 Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.
Alphabet                  Languages
{0,1}                     {0,10,100,1000,10000,…}; {0,1,100,000,111,…}
{a,b,c}                   {abc,aabbcc,aaabbbccc,…}
{A…Z}                     {TEE,FORE,BALL,…}; {FOR,WHILE,GOTO,…}
{A…Z,a…z,0…9,+,-,…}       {All legal PASCAL progs}; {All grammatically correct English sentences}
Special Languages: ∅ – the EMPTY LANGUAGE; {ε} – contains the empty string ε only
CSE244 Compilers 11 Formal Language Operations
CSE244 Compilers 12 Examples
CSE244 Compilers 13 Regular Languages All examples above are –Quite expressive –Simple languages But also... –Belong to a special class: regular languages A Regular Expression is a Set of Rules / Techniques for Constructing Sequences of Symbols (Strings) From an Alphabet. Let Σ Be an Alphabet, r a Regular Expression Then L(r) is the Language That is Characterized by the Rules of r
CSE244 Compilers 14 Rules
Fix an alphabet Σ.
–ε is a regular expression denoting {ε}
–If a is in Σ, a is a regular expression that denotes {a}
–Let r and s be R.E. for L(r) and L(s). Then
(a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
(b) (r)(s) is a regular expression denoting L(r) L(s)
(c) (r)* is a regular expression denoting (L(r))*
(d) (r) is a regular expression denoting L(r)
All operators are left-associative. Parentheses are dropped as allowed by precedence.
Precedence (highest to lowest): *, concatenation, |
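The effect of dropping parentheses can be checked with java.util.regex, which follows the usual convention that * binds tighter than concatenation, which binds tighter than |. This is an illustration only: Java regexes are a superset of the formal regular expressions defined here, and PrecedenceDemo is a hypothetical name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: checks whether a whole string belongs to L(regex).
final class PrecedenceDemo {
    // Without parentheses, a|bc* is read as (a)|((b)((c)*)).
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }
}
```

With this helper, a|bc* accepts a and bcc but not ac, while the explicitly parenthesized (a|b)c* does accept ac.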
CSE244 Compilers 15 Example revisited
CSE244 Compilers 16 Algebraic Properties
CSE244 Compilers 17 More Examples All Strings that start with “tab” or end with “bat”: tab{A,…,Z,a,...,z}*|{A,…,Z,a,....,z}*bat All Strings in Which {1,2,3} exist in ascending order: {A,…,Z}*1 {A,…,Z}*2 {A,…,Z}*3 {A,…,Z}*
CSE244 Compilers 18 Tokens as R.E. … … … “+” “?” …
CSE244 Compilers 19 Tokens as Patterns Patterns are ??? Tokens are ???
CSE244 Compilers 20 Throw Away Tokens Fact –Some languages define tokens that the parser never needs –Example: in C, whitespace (blanks, tabs, carriage returns) and comments can be discarded by the scanner without affecting the program’s meaning.
CSE244 Compilers 21 Automaton A tool to specify a token
CSE244 Compilers 22 A More Complex Automaton
CSE244 Compilers 23 Two More...
CSE244 Compilers 24 What about keywords ? Easy! –Use the “Identifier” token –After a match, lookup the keyword table If found, return a token for the matched keyword If not, return a token for the true identifier
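A minimal sketch of the keyword-table lookup, assuming a java.util.Map as the table. The three keywords and the token names are placeholders, not taken from the slides.

```java
import java.util.Map;

// Hypothetical demo class: the keyword set below is an illustrative assumption.
final class KeywordDemo {
    static final Map<String, String> KEYWORDS = Map.of(
        "if", "IF",
        "while", "WHILE",
        "for", "FOR");

    // Called after the Identifier automaton has matched a lexeme.
    static String tokenFor(String lexeme) {
        // Found in the table: keyword token. Otherwise: a true identifier.
        return KEYWORDS.getOrDefault(lexeme, "IDENTIFIER");
    }
}
```

The misspelled fi from the lexical-errors slide comes back as IDENTIFIER, which is why the scanner reports no error for it.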
CSE244 Compilers 25 Yes... But how to scan? Remember the algorithm? –Acquire 1 character of lookahead –Case analysis based On lookahead On state of automaton
CSE244 Compilers 26 Scanner code
class Scanner {
  InputStream _in;
  char _la;        // the lookahead character
  char[] _window;  // lexeme window
  int _state;      // current state of the automaton
  Token nextToken() {
    startLexeme();  // reset window at start
    _state = 0;
    while (true) {
      switch (_state) {
        case 0:
          _la = getChar();
          if (_la == '<') _state = 1;
          else if (_la == '=') _state = 5;
          else if (_la == '>') _state = 6;
          else failure(_state);
          break;
        // ... cases 1 to 5 are not shown on the slide
        case 6:
          _la = getChar();
          if (_la == '=') _state = 7;
          else _state = 8;
          break;
        case 7:
          return new Token(GEQUAL);
        case 8:
          pushBack(_la);  // the lookahead is not part of the lexeme
          return new Token(GREATER);
      }
    }
  }
}
CSE244 Compilers 27 Handling Failures Meaning –The automaton for this token failed Solution –If another automaton is available: “rewind” the input to the beginning of the last lexeme, jump to the start state of the next automaton, start recognizing again –If no other automaton is available: this is a true lexical error. Discard the lexeme (or at least the first char of the lexeme) and start from state 0 again
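The failure policy can be sketched with two toy automata tried in order. Everything here is an assumption for illustration: the class FallbackDemo, the two recognizers, and the output format. Rewinding is implicit because each recognizer restarts from the saved lexeme start pos.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: tries automata in order, rewinding to the lexeme start on failure.
final class FallbackDemo {
    // Each "automaton" returns the index just past its match, or -1 on failure.
    static int digits(String s, int start) {
        int i = start;
        while (i < s.length() && Character.isDigit(s.charAt(i))) i++;
        return i > start ? i : -1;
    }

    static int letters(String s, int start) {
        int i = start;
        while (i < s.length() && Character.isLetter(s.charAt(i))) i++;
        return i > start ? i : -1;
    }

    static List<String> scan(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;                                  // beginning of the current lexeme
        while (pos < input.length()) {
            String kind = "Integer";
            int end = digits(input, pos);             // first automaton
            if (end < 0) {                            // failed: rewind to pos, try the next one
                kind = "Identifier";
                end = letters(input, pos);
            }
            if (end < 0) {                            // no automaton left: true lexical error
                out.add("Error(" + input.charAt(pos) + ")");
                pos++;                                // discard the first char and start again
            } else {
                out.add(kind + "(" + input.substring(pos, end) + ")");
                pos = end;
            }
        }
        return out;
    }
}
```

Discarding a single character on error is the most conservative of the recovery options on the slide; skipping to the next separator would be an equally valid choice.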
CSE244 Compilers 28 Overview Basic Concepts Regular Expressions –Language Lexical analysis by hand Regular Languages Tools –NFA / DFA Scanning with DFAs Scanning tools –Lex / Flex / JFlex
CSE244 Compilers 29 Automata & Language Theory Terminology –FSA A recognizer that takes an input string and determines whether it’s a valid string of the language. –Non-Deterministic FSA (NFA) Has several alternative actions for the same input symbol –Deterministic FSA (DFA) Has at most 1 action for any given input symbol Bottom Line –expressive power(NFA) == expressive power(DFA) –Conversion can be automated
CSE244 Compilers 30 NFA An NFA is a mathematical model that consists of:
–S, a set of states
–Σ, the symbols of the input alphabet
–move, a transition function: move(state, symbol) → set of states, i.e. move : S × (Σ ∪ {ε}) → Pow(S)
–A state, s0 ∈ S, the start state
–F ⊆ S, a set of final or accepting states.
CSE244 Compilers 31 Representing NFA Transition Diagrams: numbered states (circles), arcs, final states, … Transition Tables: more suitable for representation within a computer. We’ll see examples of both!
CSE244 Compilers 32 Example NFA
S = { 0, 1, 2, 3 }, s0 = 0, F = { 3 }, Σ = { a, b }
[Transition diagram: states 0 to 3, start state 0, accepting state 3]
What Language is defined? What is the Transition Table?
state    input a     input b
0        { 0, 1 }    { 0 }
1        --          { 2 }
2        --          { 3 }
ε (null) moves are also possible: an edge from state i to state j labeled ε switches state but does not use any input symbol.
CSE244 Compilers 33 Epsilon-Transitions Given the regular expression : (a (b*c)) | (a (b | c+)?) Find a transition diagram NFA that recognizes it. Solution ?
CSE244 Compilers 34 NFA Construction Automatic construction example a(b*c) a(b|c+)? Build a Disjunction
CSE244 Compilers 35 Resulting NFA
CSE244 Compilers 36 Working NFA
[Transition diagram of the example NFA]
Given an input string, we trace moves. If no more input & in final state, ACCEPT.
EXAMPLE: Input: ababb
Trace 1: move(0, a) = 1, move(1, b) = 2, move(2, a) = ? (undefined) → REJECT!
-OR-
Trace 2: move(0, a) = 0, move(0, b) = 0, move(0, a) = 1, move(1, b) = 2, move(2, b) = 3 → ACCEPT!
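Rather than guessing a single path, a program can follow all traces at once by keeping the set of states the NFA could be in. The sketch below hard-codes the example NFA's transition table. The class name NfaDemo and the array encoding are assumptions, and state 3 is assumed to have no outgoing transitions.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class: simulates the example NFA on strings over {a, b}.
final class NfaDemo {
    // MOVE[state][symbol] lists the next states; symbol index 0 is 'a', 1 is 'b'.
    static final int[][][] MOVE = {
        { {0, 1}, {0} },   // state 0: a -> {0,1}, b -> {0}
        { {},     {2} },   // state 1: b -> {2}
        { {},     {3} },   // state 2: b -> {3}
        { {},     {}  },   // state 3: no outgoing transitions (assumed)
    };
    static final int FINAL_STATE = 3;

    static boolean accepts(String input) {
        Set<Integer> current = new TreeSet<>();
        current.add(0);                                  // start state s0 = 0
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;      // symbol outside the alphabet
            Set<Integer> next = new TreeSet<>();
            for (int s : current) {
                for (int t : MOVE[s][c - 'a']) next.add(t);
            }
            current = next;                              // all states reachable so far
        }
        return current.contains(FINAL_STATE);            // accept if some trace ends in F
    }
}
```

On ababb the set of states evolves as {0}, {0,1}, {0,2}, {0,1}, {0,2}, {0,3}, so the input is accepted because at least one trace ends in state 3.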
CSE244 Compilers 37 Handling Undefined Transitions We can handle undefined transitions by defining one more state, a “death” state, and redirecting every previously undefined transition to this death state. [Transition diagram: the example NFA extended with death state 4]
CSE244 Compilers 38 Worse still... Not all paths result in acceptance! [Transition diagram of the example NFA] aabb is accepted along path: 0 → 0 → 1 → 2 → 3 BUT… it is not accepted along the valid path: 0 → 0 → 0 → 0 → 0
CSE244 Compilers 39 The NFA “Problem” Two problems –Valid input may not be accepted –Non-deterministic behavior from run to run... Solution?
CSE244 Compilers 40 The DFA Saves the Day A DFA is an NFA with a few restrictions –No epsilon transitions –For every state s, there is only one transition (s,x) from s for any symbol x in Σ Corollaries –Easy to implement a DFA with an algorithm! –Deterministic behavior
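A table-driven DFA needs only one array lookup per input symbol. The table below is an assumption: it is the DFA one would expect from converting the example NFA, with DFA states 0 to 3 standing for the NFA state sets {0}, {0,1}, {0,2}, {0,3}. DfaDemo is a hypothetical name.

```java
// Hypothetical demo class: a DFA assumed equivalent to the example NFA.
final class DfaDemo {
    // NEXT[state][symbol]; symbol index 0 is 'a', 1 is 'b'.
    static final int[][] NEXT = {
        {1, 0},   // state 0 = {0}
        {1, 2},   // state 1 = {0,1}
        {1, 3},   // state 2 = {0,2}
        {1, 0},   // state 3 = {0,3}, the only accepting state
    };

    static boolean accepts(String input) {
        int state = 0;
        for (char c : input.toCharArray()) {
            if (c != 'a' && c != 'b') return false;   // symbol outside the alphabet
            state = NEXT[state][c - 'a'];             // exactly one move per symbol
        }
        return state == 3;
    }
}
```

There is no set of current states and no choice between paths, which is what makes the behavior deterministic.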
CSE244 Compilers 41 NFA vs. DFA NFA –smaller number of states |Q_nfa| –Simulating it requires computation proportional to |Q_nfa| for each input symbol. DFA –larger number of states |Q_dfa| –Simulating it requires constant computation for each input symbol. Caveat: with the generic NFA => DFA construction, |Q_dfa| can grow up to 2^|Q_nfa|. But: DFAs can be minimized (i.e., you can find the smallest possible Q_dfa)
CSE244 Compilers 42 One catch... NFA-DFA comparison
CSE244 Compilers 43 NFA to DFA Conversion Idea –Look at the states reachable without consuming any input –Aggregate them into macro states
CSE244 Compilers 44 Final Result A macro state is final –IFF at least one of the NFA states it contains is final
CSE244 Compilers 45 Preliminary Definitions
NFA N = ( S, Σ, s0, F, MOVE )
ε-Closure(s), s ∈ S: the set of states in S that are reachable from s via ε-moves of N that originate from s.
ε-Closure(T), T ⊆ S: the NFA states reachable from some t ∈ T on ε-moves only.
move(T, a), T ⊆ S, a ∈ Σ: the set of states to which there is a transition on input a from some t ∈ T.
CSE244 Compilers 46 Algorithm: computing the ε-closure
forall (t in T) push(t);
initialize ε-closure(T) to T;
while stack is not empty do begin
  t = pop();
  for each u ∈ S with edge t→u labeled ε
    if u is not in ε-closure(T) then begin
      add u to ε-closure(T);
      push u onto stack
    end
end
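The stack-based ε-closure algorithm translates almost line for line into Java. In this sketch the ε-edges are assumed to be given as a map from each state to its ε-successors. That representation and the class name EpsilonClosureDemo are illustrative choices.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class: ε-closure of a set of NFA states.
final class EpsilonClosureDemo {
    // eps maps a state to the states reachable from it by a single ε-edge.
    static Set<Integer> closure(Set<Integer> T, Map<Integer, List<Integer>> eps) {
        Deque<Integer> stack = new ArrayDeque<>(T);   // push every t in T
        Set<Integer> result = new TreeSet<>(T);       // initialize ε-closure(T) to T
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int u : eps.getOrDefault(t, List.of())) {
                if (result.add(u)) {                  // add returns true only if u was new
                    stack.push(u);
                }
            }
        }
        return result;
    }
}
```

Because a state is pushed only when it is first added, each state is processed at most once, even if the ε-edges form a cycle.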
CSE244 Compilers 47 DFA construction: computing the set of states and the transitions
let Q = ε-closure(s0);
D = { Q };
enQueue(Q)
while queue not empty do
  X = deQueue();
  for each a ∈ Σ do
    Y := ε-closure(move(X, a));
    T[X, a] := Y
    if Y is not in D then
      D = D ∪ { Y }
      enQueue(Y);
  end
end
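The queue-based construction can be sketched in Java as below. The NFA representation (nested maps filled through edge and epsEdge) and the class name SubsetConstructionDemo are assumptions. The returned table T uses sets of NFA states as DFA states, so its key set plays the role of D.

```java
import java.util.*;

// Hypothetical demo class: subset construction over an NFA stored in maps.
final class SubsetConstructionDemo {
    final Map<Integer, Map<Character, Set<Integer>>> move = new HashMap<>();
    final Map<Integer, Set<Integer>> eps = new HashMap<>();

    void edge(int from, char symbol, int to) {
        move.computeIfAbsent(from, k -> new HashMap<>())
            .computeIfAbsent(symbol, k -> new TreeSet<>()).add(to);
    }

    void epsEdge(int from, int to) {
        eps.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    Set<Integer> closure(Set<Integer> T) {
        Deque<Integer> stack = new ArrayDeque<>(T);
        Set<Integer> result = new TreeSet<>(T);
        while (!stack.isEmpty()) {
            for (int u : eps.getOrDefault(stack.pop(), Set.of())) {
                if (result.add(u)) stack.push(u);
            }
        }
        return result;
    }

    Set<Integer> moveOn(Set<Integer> X, char a) {
        Set<Integer> out = new TreeSet<>();
        for (int s : X) {
            out.addAll(move.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        }
        return out;
    }

    // Returns T: DFA state -> (symbol -> DFA state). The keys of T form the set D.
    Map<Set<Integer>, Map<Character, Set<Integer>>> build(int s0, String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> T = new LinkedHashMap<>();
        Deque<Set<Integer>> queue = new ArrayDeque<>();
        Set<Integer> Q = closure(Set.of(s0));
        T.put(Q, new HashMap<>());
        queue.add(Q);
        while (!queue.isEmpty()) {
            Set<Integer> X = queue.remove();
            for (char a : alphabet.toCharArray()) {
                Set<Integer> Y = closure(moveOn(X, a));
                T.get(X).put(a, Y);
                if (!T.containsKey(Y)) {      // Y is a new DFA state
                    T.put(Y, new HashMap<>());
                    queue.add(Y);
                }
            }
        }
        return T;
    }
}
```

If move(X, a) is empty, Y is the empty set, which then acts as the DFA's dead state.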
CSE244 Compilers 48 Summary We can –Specify tokens with R.E. –Use DFA to scan an input and recognize token –Transform an NFA into a DFA automatically What we are missing –A way to transform an R.E. into an NFA Then, we will have a complete solution –Build a big R.E. –Turn the R.E. into an NFA –Turn the NFA into a DFA –Scan with the obtained DFA
CSE244 Compilers 49 R.E. To NFA Process –Inductive definition: use the structure of the R.E., use atomic automata for atomic R.E., use composition rules for each R.E. operator
Recall
RE ::= ε
RE ::= x in Σ
RE ::= rs
RE ::= r | s
RE ::= r*
CSE244 Compilers 50 Epsilon Construction RE::= ε
CSE244 Compilers 51 Symbol Construction RE::= x in Σ
CSE244 Compilers 52 Chaining Construction RE::= rs
CSE244 Compilers 53 Branching Construction RE::= r | s
CSE244 Compilers 54 Kleene-Closure Construction RE::= r*
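The five constructions can be sketched as functions that build NFA fragments, each with one start and one accepting state. This is a hedged sketch of the standard inductive (Thompson-style) construction, not necessarily the exact diagrams on the slides. In particular, chaining here links the two fragments with an ε-edge instead of merging states. ThompsonDemo, Frag, and the EPS marker are hypothetical names.

```java
import java.util.*;

// Hypothetical demo class: builds NFA fragments inductively and simulates the result.
final class ThompsonDemo {
    static final class Frag {                      // fragment with one start, one accept state
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    static final char EPS = 0;                     // marker for ε-edges (assumption of this sketch)
    final List<int[]> edges = new ArrayList<>();   // each edge is {from, symbol, to}
    int states = 0;

    int fresh() { return states++; }
    void edge(int from, char symbol, int to) { edges.add(new int[] { from, symbol, to }); }

    Frag epsilon() {                               // RE ::= ε
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, EPS, f.accept);
        return f;
    }
    Frag symbol(char x) {                          // RE ::= x in Σ
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, x, f.accept);
        return f;
    }
    Frag concat(Frag r, Frag s) {                  // RE ::= rs
        edge(r.accept, EPS, s.start);
        return new Frag(r.start, s.accept);
    }
    Frag union(Frag r, Frag s) {                   // RE ::= r | s
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, s.start);
        edge(r.accept, EPS, f.accept);
        edge(s.accept, EPS, f.accept);
        return f;
    }
    Frag star(Frag r) {                            // RE ::= r*
        Frag f = new Frag(fresh(), fresh());
        edge(f.start, EPS, r.start);
        edge(f.start, EPS, f.accept);              // zero repetitions
        edge(r.accept, EPS, r.start);              // repeat
        edge(r.accept, EPS, f.accept);
        return f;
    }

    Set<Integer> closure(Set<Integer> T) {
        Deque<Integer> stack = new ArrayDeque<>(T);
        Set<Integer> result = new TreeSet<>(T);
        while (!stack.isEmpty()) {
            int t = stack.pop();
            for (int[] e : edges) {
                if (e[0] == t && e[1] == EPS && result.add(e[2])) stack.push(e[2]);
            }
        }
        return result;
    }

    boolean accepts(Frag nfa, String input) {
        Set<Integer> current = closure(Set.of(nfa.start));
        for (char c : input.toCharArray()) {
            Set<Integer> next = new TreeSet<>();
            for (int[] e : edges) {
                if (e[1] == c && current.contains(e[0])) next.add(e[2]);
            }
            current = closure(next);
        }
        return current.contains(nfa.accept);
    }
}
```

For instance, t.concat(t.symbol('a'), t.star(t.symbol('b'))) builds a fragment for ab*, and accepts can then be used to check strings against it.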
CSE244 Compilers 55 NFA Construction Example R.E. –(ab*c) | (a(b|c*)) [Parse tree: sub-expressions labeled r0 to r13, with r13 the whole expression]
CSE244 Compilers 56 NFA Construction Example 2 [NFA diagrams for the left branch] r0: b, r1: r0*, r2: c, r3: a, r4: r1 r2, r5: r3 r4
CSE244 Compilers 57 NFA Construction Example 3 [NFA diagrams for the right branch] r6: c, r7: b, r8: r6*, r9: r7 | r8, r10: (r9), r11: a, r12: r11 r10
CSE244 Compilers 58 NFA Construction Example 4 [NFA diagram for the whole expression] r13: r5 | r12
CSE244 Compilers 59 Overall Summary How does this all fit together? –Reg. Expr. → NFA construction –NFA → DFA conversion –DFA simulation for lexical analyzer Recall Lex Structure –Pattern Action –… … Each pattern recognizes lexemes. Each pattern is described by a regular expression, e.g. (abc)*ab, (a | b)*abb, etc. [Figure: the pattern automata combined with ε-moves into one Recognizer]
CSE244 Compilers 60 Moral? All of this can be automated with a tool! –LEX: the first lexical analyzer tool for C –FLEX: a newer/faster implementation, C / C++ friendly –JFLEX: a lexer for Java, based on the same principles –JavaCC –ANTLR
CSE244 Compilers 61 Ahead... Grammars Parsing –Bottom Up –Top Down