1
Lecture 2 Lexical Analysis
Compiler Design Lecture 2 Lexical Analysis
2
Lexical Analysis
A Lexical Analyzer (Lexer, or Scanner) groups input characters into tokens.
Input: x = x * (acc+123)
token        value
identifier   x
equal        =
identifier   x
star         *
left-paren   (
identifier   acc
plus         +
integer      123
right-paren  )
Tokens are typically represented by numbers. For example, the token * may be assigned number 35.
Some tokens require extra information, such as the name of an identifier or the digits of an integer (the value column above).
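A minimal Python sketch of this idea, pairing each token number with its value for the example input. Only the number 35 for star comes from the slide; the other codes and the names SPEC and tokenize are hypothetical choices for illustration.

```python
import re

# (code, name, pattern). Only 35 for "star" is taken from the slide;
# every other number is an arbitrary, hypothetical assignment.
SPEC = [
    (1, "identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    (2, "integer", re.compile(r"[0-9]+")),
    (30, "equal", re.compile(r"=")),
    (34, "plus", re.compile(r"\+")),
    (35, "star", re.compile(r"\*")),
    (40, "left-paren", re.compile(r"\(")),
    (41, "right-paren", re.compile(r"\)")),
]


def tokenize(text):
    """Return a list of (code, name, value) triples for the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():          # blanks separate tokens, nothing more
            pos += 1
            continue
        for code, name, pattern in SPEC:
            m = pattern.match(text, pos)
            if m:
                tokens.append((code, name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
    return tokens


for code, name, value in tokenize("x = x * (acc+123)"):
    print(code, name, value)
```

The parser would normally work with the numeric code alone and consult the value only for tokens such as identifiers and integers.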
3
Communication with the Parser
[Diagram: the scanner reads the source file one character at a time ("get next character"); the parser requests tokens from the scanner ("get token"), receives them ("token"), and builds the AST.]
- Each time the parser needs a token, it sends a request to the scanner.
- The scanner reads as many characters from the input stream as necessary to construct a single token.
- When a single token is formed, the scanner is suspended and returns the token to the parser.
- The parser repeatedly calls the scanner until all the tokens have been read from the input stream.
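One way to picture this request-and-suspend protocol is a Python generator, which runs only when the consumer asks for the next item. This is a sketch under simplifying assumptions: the token kinds and the names scanner and parse_all are hypothetical, and parse_all merely collects tokens where a real parser would build an AST.

```python
def scanner(source):
    """Yield one (kind, text) token per request; suspended between requests."""
    pos, n = 0, len(source)
    while pos < n:
        ch = source[pos]
        if ch.isspace():
            pos += 1
        elif ch.isalpha():
            start = pos
            while pos < n and source[pos].isalnum():
                pos += 1
            yield ("identifier", source[start:pos])
        elif ch.isdigit():
            start = pos
            while pos < n and source[pos].isdigit():
                pos += 1
            yield ("integer", source[start:pos])
        else:
            pos += 1
            yield ("symbol", ch)
    yield ("eof", "")


def parse_all(source):
    """Stand-in for a parser: keep requesting tokens until end of input."""
    tokens = scanner(source)
    seen = []
    while True:
        token = next(tokens)     # "get token": resumes the scanner
        if token[0] == "eof":
            break
        seen.append(token)
    return seen


print(parse_all("x = x * (acc+123)"))
```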
4
Tasks of a Scanner
A typical scanner:
- recognizes the keywords of the language; these are the reserved words that have a special meaning in the language, such as the word class in C++
- recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
- recognizes identifiers, integers, reals, decimals, strings, etc.
- ignores whitespace (tabs, blanks, etc.) and comments
- recognizes and processes special directives (such as the #include "file" directive in C) and macros
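Several of these tasks can be sketched in a few lines of Python. The keyword set, the // comment syntax, and the names KEYWORDS, TOKEN_RE and scan are assumptions made for illustration; directives and macros are not handled here.

```python
import re

KEYWORDS = {"class", "for", "if", "while"}   # illustrative subset only

TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*)                # line comment: ignored
  | (?P<space>\s+)                       # blanks, tabs, newlines: ignored
  | (?P<real>[0-9]+\.[0-9]+)
  | (?P<integer>[0-9]+)
  | (?P<string>"[^"\n]*")
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<special>:=|==|[()=+*;])          # two-character groups listed first
""", re.VERBOSE)


def scan(text):
    """Return (kind, lexeme) pairs, dropping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("comment", "space"):
            continue
        if kind == "word":               # keywords are reserved identifiers
            kind = "keyword" if lexeme in KEYWORDS else "identifier"
        tokens.append((kind, lexeme))
    return tokens


print(scan('class x := 3.14 // note\nif (y == "hi")'))
```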
5
Regular Expressions
Regular expressions (REs) are a very convenient form of representing (possibly infinite) sets of strings, called regular sets.
e.g., the RE (a | b)*aa represents the infinite set {“aa”, “aaa”, “baa”, “abaa”, ...}
An RE is one of the following:
name           RE      designation
epsilon        ε       {“”}
symbol         a       {“a”} for some character a
concatenation  AB      the set { rs | r ∈ L(A), s ∈ L(B) }, where rs is string concatenation and L(A), L(B) are the sets designated by the REs A and B
alternation    A | B   the set L(A) ∪ L(B)
repetition     A*      the set designated by ε | A | (AA) | (AAA) | ... (an infinite set)
e.g., the RE (a | b)c designates { rs | r ∈ {“a”} ∪ {“b”}, s ∈ {“c”} }, which is equal to {“ac”, “bc”}
Shortcuts: P+ = PP*, P? = P | ε, [a-z] = (“a” | “b” | ... | “z”), P² = PP
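The set view of REs can be tried out directly with Python sets. This is only a sketch: the function names concat, alt and star are made up here, and because A* is infinite, star enumerates just the strings up to a chosen length.

```python
def concat(A, B):
    """The set { rs | r in A, s in B }."""
    return {r + s for r in A for s in B}


def alt(A, B):
    """The union of A and B."""
    return A | B


def star(A, max_len):
    """The strings of A* that are no longer than max_len."""
    result = {""}                 # zero repetitions: the empty string
    frontier = {""}
    while frontier:
        frontier = {r + s for r in frontier for s in A
                    if len(r + s) <= max_len} - result
        result |= frontier
    return result


# (a | b)c
print(sorted(concat(alt({"a"}, {"b"}), {"c"})))

# (a | b)*aa, restricted to strings of length at most 4
print(sorted(concat(star({"a", "b"}, 2), {"aa"})))
```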
6
Properties
Kleene closure (*) binds tighter than concatenation, which binds tighter than alternation (|).
e.g., a|ab* is equivalent to a|(a(b*))
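Python's re module follows the same precedence, so the equivalence can be checked on a few sample strings. The variable names and the sample list below are arbitrary; the pattern named wrong shows what the RE would mean if * applied to the whole alternation.

```python
import re

implicit = re.compile(r"a|ab*")
explicit = re.compile(r"a|(a(b*))")
wrong = re.compile(r"(a|ab)*")      # a different language, for contrast

samples = ["", "a", "ab", "abbb", "b", "aa", "abab"]
for s in samples:
    print(repr(s),
          bool(implicit.fullmatch(s)),
          bool(explicit.fullmatch(s)),
          bool(wrong.fullmatch(s)))
```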
7
Examples
for-keyword = for
letter = [a-zA-Z]
digit = [0-9]
identifier = letter (letter | digit)*
sign = + | - | ε
integer = sign (0 | [1-9]digit*)
decimal = integer . digit*
real = (integer | decimal) E sign digit+
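These definitions carry over almost literally to Python's re syntax. In this sketch the empty alternative stands in for ε, the literal dot is escaped, and the helper name matches is hypothetical.

```python
import re

letter = r"[a-zA-Z]"
digit = r"[0-9]"
identifier = rf"{letter}(?:{letter}|{digit})*"
sign = r"(?:\+|-|)"                     # empty alternative plays the role of ε
integer = rf"{sign}(?:0|[1-9]{digit}*)"
decimal = rf"{integer}\.{digit}*"
real = rf"(?:{integer}|{decimal})E{sign}{digit}+"


def matches(pattern, text):
    """True if the whole of text belongs to the set designated by pattern."""
    return re.fullmatch(pattern, text) is not None


print(matches(identifier, "acc1"), matches(identifier, "1acc"))
print(matches(integer, "-120"), matches(integer, "007"))
print(matches(decimal, "3.14"), matches(real, "-1.5E+10"))
```

Note that, as written on the slide, integer rejects leading zeros and decimal accepts a trailing dot with no digits after it.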
8
Disambiguation Rules
Problem: one string may match many regular expressions.
1. Longest match rule: from all tokens that match the input prefix, choose the one that matches the most characters.
2. Rule priority: if more than one token has the longest match, choose the one listed first.
Examples:
- for8: is it the for-keyword, the identifier “f”, the identifier “fo”, the identifier “for”, or the identifier “for8”? Use rule 1: “for8” matches the most characters.
- for: is it the for-keyword, the identifier “f”, the identifier “fo”, or the identifier “for”? Use rules 1 and 2: the for-keyword and the “for” identifier have the longest match, but the for-keyword is listed first.
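Both rules fit in a short Python sketch. The rule list and the name next_token are hypothetical. Python's re returns the first match it finds rather than a guaranteed longest one, but for the simple greedy patterns used here the two coincide.

```python
import re

# Order matters: earlier rules win ties (rule priority).
RULES = [
    ("for-keyword", re.compile(r"for")),
    ("identifier", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("integer", re.compile(r"[0-9]+")),
]


def next_token(text, pos=0):
    """Return (rule name, lexeme, end position) for the token at pos, or None."""
    best = None
    for name, pattern in RULES:
        m = pattern.match(text, pos)
        # A strictly longer match replaces the current best (longest match);
        # an equally long one does not, so the earlier rule is kept.
        if m and (best is None or m.end() > best[2]):
            best = (name, m.group(), m.end())
    return best


print(next_token("for8"))
print(next_token("for"))
```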
9
How to write a Scanner?
Write a program with a switch case for the regular expression of each token:
- very difficult and complex
- it will lead to deeply nested switch and if statements
- static, i.e., when a regular expression changes we have to modify the program manually
Use a finite automaton. What is it?
10
Finite Automata
A finite automaton can be used to decide if an input string is a member of some particular set of strings.
A finite automaton consists of:
- a finite set of states
- a set of transitions (moves)
- one start state
- a set of final states (accepting states)
Two types of finite automaton:
- Deterministic Finite Automaton (DFA)
- Non-deterministic Finite Automaton (NFA)
11
Deterministic Finite Automaton (DFA)
A DFA accepts a string if, starting from the start state and moving from state to state, each time following the arrow that corresponds to the current input character, it reaches a final state when the entire input string is consumed.
e.g., the RE (abc+)+ is represented by the DFA:
[Figure: DFA for (abc+)+]
A DFA has a unique transition for every state-character combination.
12
DFA (cont.)
The transition table T gives the next state T[s,c] for a state s and a character c. Ø means error.
Transition table for (abc+)+:
state  a  b  c
1      2  Ø  Ø
2      Ø  3  Ø
3      Ø  Ø  4
4      2  Ø  4
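The table translates directly into a dictionary-driven recognizer. This sketch assumes that state 1 is the start state and state 4 the only final state, which is consistent with the table but comes from the figure that is not reproduced here; missing dictionary entries play the role of Ø.

```python
ERROR = None

# T[s][c] is the next state; an absent entry corresponds to Ø in the table.
T = {
    1: {"a": 2},
    2: {"b": 3},
    3: {"c": 4},
    4: {"a": 2, "c": 4},
}
START, FINAL = 1, {4}      # assumed from the shape of the table


def accepts(text):
    """True if the DFA ends in a final state after consuming all of text."""
    state = START
    for ch in text:
        state = T[state].get(ch, ERROR)
        if state is ERROR:
            return False
    return state in FINAL


for s in ["abc", "abccc", "abcabcc", "", "ab", "abca"]:
    print(repr(s), accepts(s))
```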
13
The DFA of a Scanner
for-keyword = for
identifier = [a-z][a-z0-9]*
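Since the slide's diagram is not reproduced here, the following Python sketch builds one possible DFA for these two tokens. The state numbering, the FINAL mapping and the name scan_token are assumptions, not the slide's own automaton. Each final state is labeled with the token it recognizes, and the scanner remembers the last final state it passed through, which gives the longest match.

```python
import string

LETTERS = string.ascii_lowercase
ALNUM = LETTERS + string.digits


def build_table():
    # Hypothetical states: 0 start, 1 "f", 2 "fo", 3 "for", 4 other identifier.
    table = {s: {} for s in range(5)}
    for ch in LETTERS:
        table[0][ch] = 4
    for state in (1, 2, 3, 4):
        for ch in ALNUM:
            table[state][ch] = 4
    table[0]["f"] = 1          # these three overrides spell out "for"
    table[1]["o"] = 2
    table[2]["r"] = 3
    return table


T = build_table()
FINAL = {1: "identifier", 2: "identifier", 3: "for-keyword", 4: "identifier"}


def scan_token(text, pos=0):
    """Return (token, lexeme) for the longest token starting at pos, or None."""
    state, last, i = 0, None, pos
    while i < len(text) and text[i] in T[state]:
        state = T[state][text[i]]
        i += 1
        if state in FINAL:
            last = (FINAL[state], text[pos:i])
    return last


print(scan_token("for"), scan_token("for8"), scan_token("form"))
```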
14
Scanner Code for DFA For each transition in a DFA generate code:
[Diagram: a transition from state s1 to state s2 on character c]
s1: current_character = get_next_character();
    ...
    if ( current_character == 'c' ) goto s2;
s2: current_character = get_next_character();
15
Scanner Code for DFA using Transition Table T
state = initial_state;
current_character = get_next_character();
while ( true ) {
    next_state = T[state,current_character];
    if ( next_state == ERROR )
        break;
    state = next_state;
    current_character = get_next_character();   /* advance the input */
    if ( current_character == EOF )
        break;
}
if ( is_final_state(state) )
    `we have a valid token'
else
    `report an error'
16
Non-deterministic Finite Automaton (NFA)
A DFA is very difficult to construct directly from an RE.
An NFA is similar to a DFA, but it also permits:
- multiple transitions over the same character, and
- transitions over ε
Transition table for a*(a|b):
state  a      b  ε
1      Ø      3  2
2      {1,3}  Ø  Ø
3      Ø      Ø  Ø
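The NFA table can be simulated by tracking the set of all states the automaton might be in, following ε transitions whenever they are available. This is a sketch of the set-of-states technique, not of backtracking, and it assumes that state 1 is the start state and state 3 the final state, since the figure is not reproduced here.

```python
EPS = None    # key used for ε transitions

# NFA[s][c] is the set of possible next states; absent entries mean Ø.
NFA = {
    1: {"b": {3}, EPS: {2}},
    2: {"a": {1, 3}},
    3: {},
}
START, FINAL = 1, {3}      # assumed from the shape of the table


def closure(states):
    """All states reachable from states using only ε transitions."""
    stack, seen = list(states), set(states)
    while stack:
        for t in NFA[stack.pop()].get(EPS, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def accepts(text):
    current = closure({START})
    for ch in text:
        moved = set()
        for s in current:
            moved |= NFA[s].get(ch, set())
        current = closure(moved)
    return bool(current & FINAL)


for s in ["b", "a", "aa", "aab", "", "ba", "abb"]:
    print(repr(s), accepts(s))
```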
17
Combined NFA for several tokens
18
How Scanner Generators Work
Translate REs into a finite state machine. Done in three steps:
1. translate the REs into a non-deterministic finite automaton (NFA)
2. translate the NFA into a deterministic finite automaton (DFA)
3. optimize the DFA (optional)
We’ll study only step 1.
19
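Converting RE to NFA
The following rules construct NFAs with only one final state:
[Figure: NFA construction rules for a single character a, for ε, for concatenation s t, for alternation s | t, and for repetition s*]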
20
NFA for the regular expression (a|b)*ac
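The construction rules and this example can be sketched in Python as follows. The exact placement of ε edges may differ from the slide's diagrams, which are not reproduced here, and the class name NFABuilder and its methods are hypothetical. Every fragment is a (start, final) pair with exactly one final state.

```python
class NFABuilder:
    def __init__(self):
        self.edges = []      # (source, label, target); label None means ε
        self.count = 0

    def new_state(self):
        self.count += 1
        return self.count - 1

    def symbol(self, a):
        s, f = self.new_state(), self.new_state()
        self.edges.append((s, a, f))
        return s, f

    def epsilon(self):
        return self.symbol(None)

    def concat(self, x, y):
        # Link the final state of x to the start state of y.
        self.edges.append((x[1], None, y[0]))
        return x[0], y[1]

    def alt(self, x, y):
        s, f = self.new_state(), self.new_state()
        for frag in (x, y):
            self.edges.append((s, None, frag[0]))
            self.edges.append((frag[1], None, f))
        return s, f

    def star(self, x):
        s, f = self.new_state(), self.new_state()
        self.edges += [(s, None, x[0]), (x[1], None, f),
                       (s, None, f), (x[1], None, x[0])]
        return s, f

    def closure(self, states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for src, label, dst in self.edges:
                if src == q and label is None and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    def accepts(self, frag, text):
        current = self.closure({frag[0]})
        for ch in text:
            step = {dst for src, label, dst in self.edges
                    if src in current and label == ch}
            current = self.closure(step)
        return frag[1] in current


b = NFABuilder()
# (a|b)*ac
nfa = b.concat(b.concat(b.star(b.alt(b.symbol("a"), b.symbol("b"))),
                        b.symbol("a")),
               b.symbol("c"))
for s in ["ac", "abac", "bbac", "abc", "a"]:
    print(repr(s), b.accepts(nfa, s))
```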
21
Converting RE shorthands to NFA
Example: generates [0..9]
22
Advantages And Disadvantages of NFA
Advantages:
- easy to construct from an RE
Disadvantages:
- big size, i.e., a large number of states
- large memory space
- needs backtracking to recognize a string, since many moves may exist for one input character and for ε transitions
- long recognition time
23
References
- Torben Ægidius Mogensen. Basics of Compiler Design. Published through lulu.com, 2009.
- Seth D. Bergmann. Compiler Design: Theory, Tools and Examples. William C. Brown, 1994.
- Leonidas Fegaras. Course notes, University of Texas at Arlington, CSE, 2005.