CS375 Compilers: Lexical Analysis
4th February, 2010
Outline
Overview of a compiler.
What is lexical analysis?
Writing a Lexer
–Specifying tokens: regular expressions
–Converting regular expressions to NFA, DFA
Optimizations.
How It Works
Source code (character stream) → Lexical Analysis → Token stream → Syntax Analysis (Parsing) → Abstract syntax tree (AST) → Semantic Analysis → Decorated AST → Program representation
Example source: if (b == 0) a = b;
Token stream: if ( b == 0 ) a = b ;
[Figure: AST for the example (an if node over == and =) and the decorated AST annotated with types such as int, boolean, and lvalue.]
What is a lexical analyzer?
What?
–Reads in a stream of characters and groups them into "tokens" or "lexemes".
–Language definition describes what tokens are valid.
Why?
–Makes writing the parser a lot easier; the parser operates on "tokens".
–Handles input-dependent functionality such as character codes, EOF, and newline characters.
First Step: Lexical Analysis
Source code (character stream) → Lexical Analysis → Token stream → Syntax Analysis → Semantic Analysis
Example source: if (b == 0) a = b;
Token stream: if ( b == 0 ) a = b ;
What should it do?
We want some way to describe tokens, and have our Lexer take that description as input and decide if a string is a token or not.
[Figure: input string w and a token description feed a box marked "?", which answers Token? {Yes/No}.]
Tokens
Logical grouping of characters.
Identifiers: x y11 elsen _i00
Keywords: if else while break
Constants:
–Integer: L 0x777
–Floating-point: e5 0.e-10
–String: "x" "He said, \"Are you?\"\n"
–Character: 'c' '\000'
Symbols: + * { } ++ =
Whitespace (typically recognized and discarded):
–Comment: /** don't change this **/
–Space:
–Format characters:
Ad-hoc Lexer
Hand-write code to generate tokens
How to read identifier tokens?
    Token readIdentifier() {
        String id = "";
        while (true) {
            char c = input.read();
            if (!identifierChar(c))
                return new Token(ID, id, lineNumber);
            id = id + c;
        }
    }
Problems
–How to start?
–What to do with the following character (it has already been read)?
–How to avoid quadratic complexity of repeated concatenation?
–How to recognize keywords?
Look-ahead Character
Scan text one character at a time
Use a look-ahead character (next) to determine what kind of token to read and when the current token ends
    char next;
    …
    while (identifierChar(next)) {
        id = id + next;
        next = input.read();
    }
[Figure: the input "else n" with the next (lookahead) pointer.]
Ad-hoc Lexer: Top-level Loop
    class Lexer {
        InputStream s;
        char next;
        Lexer(InputStream _s) { s = _s; next = s.read(); }
        Token nextToken() {
            if (identifierFirstChar(next))   // starts with a letter
                return readIdentifier();     // is an identifier
            if (numericFirstChar(next))      // starts with a digit
                return readNumber();         // is a number
            if (next == '"')
                return readStringConst();
            …
        }
    }
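The top-level loop and the look-ahead idea can be put together in a small runnable sketch. This is only an illustration under assumptions: the class name AdHocLexer, the "KIND:text" string encoding of tokens, and the keyword set are hypothetical choices, not part of the slides. It uses a StringBuilder to avoid repeated string concatenation and a keyword set to tell keywords from identifiers. Multi-character symbols such as == are deliberately left out, so == comes out as two = symbols.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a hand-written lexer with one look-ahead character.
class AdHocLexer {
    private static final Set<String> KEYWORDS =
        new HashSet<>(Arrays.asList("if", "else", "while", "break"));

    private final String src;
    private int pos = 0;   // index of the look-ahead character

    AdHocLexer(String src) { this.src = src; }

    // The look-ahead character, or '\0' at end of input.
    private char next() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    // Returns the next token as "KIND:text", or null at end of input.
    String nextToken() {
        while (Character.isWhitespace(next())) pos++;    // discard whitespace
        if (pos >= src.length()) return null;
        StringBuilder text = new StringBuilder();         // linear-time appends
        char c = next();
        if (Character.isLetter(c) || c == '_') {
            while (Character.isLetterOrDigit(next()) || next() == '_') {
                text.append(next());
                pos++;
            }
            String s = text.toString();
            return (KEYWORDS.contains(s) ? "KEYWORD:" : "ID:") + s;
        }
        if (Character.isDigit(c)) {
            while (Character.isDigit(next())) {
                text.append(next());
                pos++;
            }
            return "NUM:" + text;
        }
        pos++;                        // any other character is a one-char symbol
        return "SYM:" + c;
    }

    static List<String> tokenize(String src) {
        AdHocLexer lexer = new AdHocLexer(src);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) {
            out.add(t);
        }
        return out;
    }
}
```

Calling AdHocLexer.tokenize("if (b == 0) a = b;") yields the keyword if, the identifiers and the number, and one SYM token per punctuation character. Handling == properly would need more look-ahead logic interleaved into nextToken, which is the maintenance problem the following slides describe.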
Problems
Might not know what kind of token we are going to read from seeing the first character
–if a token begins with "i", is it an identifier? (what about int, if)
–if a token begins with "2", is it an integer constant?
–interleaved tokenizer code is hard to write correctly, harder to maintain
–in general, unbounded look-ahead may be needed
Problems (cont.)
How to specify tokens unambiguously?
Once specified, how to implement them in a systematic way?
How to implement them efficiently?
Problems (cont.)
For instance, consider:
–How to describe tokens unambiguously
    2.e0 20.e "" "x" "\\" "\"\'"
–How to break up text into tokens
    if (x == 0) a = x<<1;
    if (x == 0) a = x<1;
–How to tokenize efficiently
    tokens may have similar prefixes
    want to look at each character ~1 time
Principled Approach
Need a principled approach
1. Lexer Generators
–a lexer generator (a.k.a. scanner generator) generates an efficient tokenizer automatically (e.g., lex, flex, JLex)
2. Your own Lexer
–Describe the programming language's tokens with a set of regular expressions
–Generate a scanning automaton from that set of regular expressions
Top level idea…
Have a formal language to describe tokens.
–Use regular expressions.
Have a mechanical way of converting this formal description to code.
–Convert regular expressions to finite automata (acceptors/state machines)
Run the code on actual inputs.
–Simulate the finite automaton.
An Example: Integers
Consider integers.
We can describe integers using the following grammar:
    Num -> '-' Pos
    Num -> Pos
    Pos -> 0 | 1 | … | 9
    Pos -> (0 | 1 | … | 9) Pos
Or in a more compact notation, we have:
–Num -> -?[0-9]+
An Example: Integers
Using Num -> -?[0-9]+ we can generate integers such as -12, 23, 0.
We can also represent the above regular expression as a state machine.
–This is useful for simulating the regular expression.
An Example: Integers
The non-deterministic finite automaton (NFA) is as follows.
[Figure: NFA for -?[0-9]+]
We can verify that -123, 65, and 0 are accepted by the state machine.
But which path to take? When should it follow ε-paths?
An Example: Integers
The NFA can be converted to an equivalent deterministic FA (DFA) as below.
–We shall see later how.
It accepts the same tokens.
–-123
–65
–0
[Figure: DFA with states {0,1}, {1}, and {2,3}, and edges labeled - and 0-9.]
An Example: Integers
The deterministic finite automaton makes implementation much easier, as we shall see later.
So, all we have to do is:
–Express tokens as regular expressions
–Convert RE to NFA
–Convert NFA to DFA
–Simulate the DFA on inputs
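To make the claim about easy implementation concrete, here is a minimal sketch of the integer DFA written by hand. The class name IntegerDfa and the state constants are hypothetical; the three states correspond to the DFA states {0,1}, {1}, and {2,3}, and the edges are assumed from the regular expression -?[0-9]+.

```java
// Hypothetical sketch: the three-state DFA for -?[0-9]+ written by hand.
class IntegerDfa {
    // START ~ {0,1}, AFTER_MINUS ~ {1}, DIGITS ~ {2,3}.
    static final int START = 0, AFTER_MINUS = 1, DIGITS = 2, ERROR = -1;

    // One DFA transition: the next state is fully determined by (state, c).
    static int step(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START:       return c == '-' ? AFTER_MINUS : digit ? DIGITS : ERROR;
            case AFTER_MINUS: return digit ? DIGITS : ERROR;
            case DIGITS:      return digit ? DIGITS : ERROR;
            default:          return ERROR;
        }
    }

    static boolean accepts(String w) {
        int state = START;
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            state = step(state, w.charAt(i));
        }
        return state == DIGITS;   // DIGITS is the only accepting state
    }
}
```

IntegerDfa.accepts returns true for -123, 65, and 0, and false for inputs such as a lone - or the empty string.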
The larger picture…
Regular expression R describing tokens → RE → NFA Conversion → NFA → DFA Conversion → DFA Simulation
Input string w → DFA Simulation
Output: Yes, if w is a valid token; No, if not
Language Theory Review
Let Σ be a finite set, called an alphabet
–a ∈ Σ is called a symbol
Σ* is the set of all finite strings consisting of symbols from Σ
A subset L ⊆ Σ* is called a language
If L1 and L2 are languages, then L1 L2 is the concatenation of L1 and L2, i.e., the set of all pair-wise concatenations of strings from L1 and L2, respectively
Language Theory Review, ctd.
Let L ⊆ Σ* be a language
Then
–L^0 = {ε}, the set containing only the empty string
–L^(n+1) = L L^n for all n ≥ 0
Examples
–if L = {a, b} then
    L^1 = L = {a, b}
    L^2 = {aa, ab, ba, bb}
    L^3 = {aaa, aab, aba, abb, baa, bab, bba, bbb}
    …
Syntax of Regular Expressions
The set of regular expressions (RE) over alphabet Σ is defined inductively by
–Let a ∈ Σ and R, S ∈ RE. Then:
    a ∈ RE
    ε ∈ RE
    ∅ ∈ RE
    R|S ∈ RE
    RS ∈ RE
    R* ∈ RE
In concrete syntactic form, we also use precedence rules, parentheses, and abbreviations
Semantics of Regular Expressions
Regular expression R ∈ RE denotes the language L(R) ⊆ Σ* given according to the inductive structure of R:
–L(a) = {a}                 the string "a"
–L(ε) = {""}                the empty string
–L(∅) = {}                  the empty set
–L(R|S) = L(R) ∪ L(S)       alternation
–L(RS) = L(R) L(S)          concatenation
–L(R*) = {""} ∪ L(R) ∪ L(R^2) ∪ L(R^3) ∪ L(R^4) ∪ …    Kleene closure
Simple Examples
L(R) = the "language" defined by R
–L( abc ) = { abc }
–L( hello|goodbye ) = {hello, goodbye}
    OR operator: L(R|S) contains every string of L(R) and every string of L(S), so L(a|b) = {a, b}.
–L( 1(0|1)* ) = all non-zero binary numerals beginning with 1
    Kleene star: zero or more repetitions of the expression enclosed in the parentheses.
Convenient RE Shorthand
R+          one or more strings from L(R): R(R*)
R?          optional R: (R|ε)
[abce]      one of the listed characters: (a|b|c|e)
[a-z]       one character from this range: (a|b|c|d|e|…|y|z)
[^ab]       anything but one of the listed chars
[^a-z]      one character not from this range
"abc"       the string "abc"
\(          the character '('
...
id=R        named non-recursive regular expressions
More Examples
Regular Expression R                     Strings in L(R)
digit = [0-9]                            "0" "1" "2" "3" …
posint = digit+                          "8" "412" …
int = -? posint                          "-42" "1024" …
real = int ((. posint)?)                 "-1.56" "12" "1.0"
     = (-|ε)([0-9]+)((. [0-9]+)|ε)
[a-zA-Z_][a-zA-Z0-9_]*                   C identifiers
else                                     the keyword "else"
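These definitions can be tried out directly with Java's java.util.regex package. The sketch below is only an illustration: the class name TokenPatterns and its helper methods are hypothetical, and the named abbreviations (digit, posint) are expanded by hand because java.util.regex has no macro definitions. The literal dot in real is written as \\. in the Java string.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: the slide's token definitions as java.util.regex patterns.
class TokenPatterns {
    static final Pattern INT = Pattern.compile("-?[0-9]+");
    static final Pattern REAL = Pattern.compile("-?[0-9]+(\\.[0-9]+)?");
    static final Pattern IDENT = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");

    // Matcher.matches() succeeds only if the whole string is in L(R).
    static boolean isInt(String w) { return INT.matcher(w).matches(); }
    static boolean isReal(String w) { return REAL.matcher(w).matches(); }
    static boolean isIdent(String w) { return IDENT.matcher(w).matches(); }
}
```

Note that "else" also matches the identifier pattern, so a lexer built from these expressions needs some way to prefer the keyword rule over the identifier rule.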
Historical Anomalies
PL/I
–Keywords not reserved
    IF IF THEN THEN ELSE ELSE;
FORTRAN
–Whitespace stripped out prior to scanning
    DO 123 I = 1       (an assignment to the variable DO123I)
    DO 123 I = 1, 2    (the start of a DO loop)
By and large, modern language design intentionally makes scanning easier
Writing a Lexer
Regular expressions can be very useful in describing languages (tokens).
–Use an automatic Lexer generator (Flex, Lex) to generate a Lexer from the language specification.
–Have a systematic way of writing a Lexer from a specification such as regular expressions.
How To Use Regular Expressions
Given R ∈ RE and input string w, need a mechanism to determine if w ∈ L(R)
Such a mechanism is called an acceptor
[Figure: input string w (from the program) and R ∈ RE (that describes a token family) feed a box marked "?", which answers Yes if w is a token and No if w is not a token.]
Acceptors
An acceptor determines if an input string belongs to a language L
Finite automata are acceptors for languages described by regular expressions
[Figure: input string w and a description of language L feed the acceptor (a finite automaton), which answers Yes if w ∈ L and No if w ∉ L.]
Finite Automata
Informally, a finite automaton consists of:
–A finite set of states
–Transitions between states
–An initial state (start state)
–A set of final states (accepting states)
Two kinds of finite automata:
–Deterministic finite automata (DFA): the transition from each state is uniquely determined by the current input character
–Non-deterministic finite automata (NFA): there may be multiple possible choices, and some "spontaneous" transitions without input
DFA Example
Finite automaton that accepts the strings in the language denoted by the regular expression ab*a
Can be represented as a graph or a transition table.
–A graph: read a symbol, follow the outgoing edge.
[Figure: DFA with states 0, 1, 2 and edges labeled a, b, a.]
DFA Example (cont.)
Representing an FA as a transition table makes the implementation very easy.
The above FA can be represented as:
    State   a       b
    0       1       Error
    1       2       1
    2       Error   Error
–Current state and current symbol determine the next state.
–Continue until the Error state or the end of input is reached.
Simulating the DFA
Determine if the DFA accepts an input string
    transition_table[NumSTATES][NumCHARS]
    accept_states[NumSTATES]
    state = INITIAL
    while (state != Error) {
        c = input.read();
        if (c == EOF) break;
        state = transition_table[state][c];
    }
    return (state != Error) && accept_states[state];
[Figure: the DFA for ab*a with states 0, 1, 2 and edges labeled a, b, a.]
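A runnable version of the table-driven simulation for ab*a might look like the sketch below. The class name TableDfa and the choice of -1 for the Error state are assumptions made for illustration; the table has one row per state and one column per symbol (a, b).

```java
// Hypothetical sketch: table-driven simulation of the DFA for ab*a.
class TableDfa {
    static final int ERROR = -1;

    // Rows are states 0..2; columns are the symbols 'a' and 'b'.
    static final int[][] TRANS = {
        { 1, ERROR },       // state 0: a -> 1
        { 2, 1 },           // state 1: a -> 2, b -> 1
        { ERROR, ERROR },   // state 2: no outgoing edges
    };
    static final boolean[] ACCEPT = { false, false, true };

    static boolean accepts(String w) {
        int state = 0;
        for (int i = 0; i < w.length() && state != ERROR; i++) {
            char c = w.charAt(i);
            if (c != 'a' && c != 'b') return false;   // symbol outside the alphabet
            state = TRANS[state][c - 'a'];
        }
        return state != ERROR && ACCEPT[state];
    }
}
```

The loop looks at each input character once, so the running time is linear in the length of the input regardless of how the regular expression was written.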
RE → Finite automaton?
Can we build a finite automaton for every regular expression?
Strategy: build the finite automaton inductively, based on the definition of regular expressions
[Figure: automata for the base cases a (a single edge labeled a) and ε.]
RE → Finite automaton?
Alternation: R|S
Concatenation: RS
Recall: ε implies an optional move.
[Figure: the R automaton and the S automaton combined with ε-edges for R|S, and chained one after the other for RS.]
NFA Definition
A non-deterministic finite automaton (NFA) is an automaton where:
–There may be ε-transitions (transitions that do not consume input characters)
–There may be multiple transitions from the same state on the same input character
[Figure: example NFA with edges labeled a and b.]
RE → NFA intuition
-?[0-9]+
When to take the ε-path?
NFA construction (Thompson)
NFA only needs one stop state (why?)
Canonical NFA form:
[Figure: canonical NFA form.]
Use this canonical form to inductively construct NFAs for regular expressions
Inductive NFA Construction
[Figure: NFA constructions for RS (R followed by S), R|S (R and S joined by ε-edges), and R* (R wrapped in ε-edges).]
DFA vs NFA
DFA: action of automaton on each input symbol is fully determined
–obvious table-driven implementation
NFA:
–automaton may have a choice on each step
–automaton accepts a string if there is any way to make choices to arrive at an accepting state
–every path from the start state to an accept state is a string accepted by the automaton
–not obvious how to implement!
Simulating an NFA
Problem: how to execute an NFA?
"strings accepted are those for which there is some corresponding path from start state to an accept state"
Solution: search all paths in the graph consistent with the string in parallel
–Keep track of the subset of NFA states that the search could be in after seeing a string prefix
–"Multiple fingers" pointing to the graph
Example
Input string: -23
NFA states:
–Start: {0,1}
–"-": {1}
–"2": {2, 3}
–"3": {2, 3}
But this is very difficult to implement directly.
[Figure: the NFA for -?[0-9]+ with states 0 through 3.]
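The "multiple fingers" idea can be sketched in a few lines. The NFA below is an assumption: the slide's figure is not available, so the edges (0 to 1 on - or ε, 1 to 2 on a digit, 2 to 2 on a digit, 2 to 3 on ε, with 3 accepting) were chosen only because they reproduce the state sets listed above. The class name NfaSim is hypothetical as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.TreeSet;

// Hypothetical 4-state NFA for -?[0-9]+, chosen to match the state sets on the slide.
class NfaSim {
    static final int ACCEPT = 3;

    // States reachable from s by a single ε-transition.
    static int[] epsilon(int s) {
        if (s == 0) return new int[] { 1 };   // skip the optional '-'
        if (s == 2) return new int[] { 3 };   // stop after at least one digit
        return new int[0];
    }

    // States reachable from s by consuming the character c.
    static int[] move(int s, char c) {
        boolean digit = c >= '0' && c <= '9';
        if (s == 0 && c == '-') return new int[] { 1 };
        if ((s == 1 || s == 2) && digit) return new int[] { 2 };
        return new int[0];
    }

    // Adds everything reachable through ε-transitions.
    static TreeSet<Integer> closure(TreeSet<Integer> states) {
        TreeSet<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            for (int t : epsilon(work.pop())) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    // Advances every "finger" on c, then takes the ε-closure.
    static TreeSet<Integer> step(TreeSet<Integer> states, char c) {
        TreeSet<Integer> moved = new TreeSet<>();
        for (int s : states) {
            for (int t : move(s, c)) moved.add(t);
        }
        return closure(moved);
    }

    static TreeSet<Integer> start() {
        TreeSet<Integer> s = new TreeSet<>();
        s.add(0);
        return closure(s);
    }

    static boolean accepts(String w) {
        TreeSet<Integer> current = start();
        for (int i = 0; i < w.length(); i++) current = step(current, w.charAt(i));
        return current.contains(ACCEPT);
    }
}
```

On the input -23 the tracked sets are {0,1}, {1}, {2,3}, {2,3}, as in the example. The cost is that every input character requires set operations over NFA states, which is what converting to a DFA ahead of time avoids.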
NFA → DFA conversion
Can convert an NFA directly to a DFA by the same approach
Create one DFA state for each distinct subset of NFA states that could arise
States: {0,1}, {1}, {2, 3}
Called the "subset construction"
[Figure: the NFA with states 0 through 3 and the resulting DFA with states {0,1}, {1}, {2,3}.]
Algorithm
For a set S of states in the NFA, compute ε-closure(S) = the set of states reachable from states in S by zero or more ε-transitions
    T = S
    Repeat
        T = T ∪ {s' | s ∈ T, (s,s') is an ε-transition}
    Until T remains unchanged
    ε-closure(S) = T
For a set S of ε-closed states in the NFA, compute DFAedge(S,c) = the set of states reachable from states in S by transitions on symbol c and ε-transitions
    DFAedge(S,c) = ε-closure({s' | s ∈ S, (s,s') is a c-transition})
Algorithm
    DFA-initial-state = ε-closure(NFA-initial-state)
    DFA states = { DFA-initial-state }
    Worklist = { DFA-initial-state }
    While (Worklist not empty)
        Pick and remove state S from Worklist
        For each character c
            S' = DFAedge(S,c)
            If (S' not in DFA states)
                Add S' to DFA states and Worklist
            Add an edge (S, S') labeled c in DFA
    For each DFA-state S
        If S contains an NFA-final state
            Mark S as DFA-final-state
An empty set S' plays the role of the Error state.
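The worklist algorithm can be sketched as follows. Everything specific here is an assumption made for illustration: the class name SubsetConstruction, the edge list for a 4-state NFA recognizing -?[0-9]+, the use of 'e' to label ε-edges, and 'd' as a stand-in for any digit so that the alphabet stays small. Empty target sets are treated as Error and produce no edge.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the subset construction on a 4-state NFA for -?[0-9]+.
class SubsetConstruction {
    static final char[] SYMBOLS = { '-', 'd' };   // 'd' stands for any digit 0-9
    static final int NFA_FINAL = 3;

    // NFA edges as {from, label, to}; the label 'e' marks an ε-transition.
    static final int[][] EDGES = {
        { 0, 'e', 1 }, { 0, '-', 1 }, { 1, 'd', 2 }, { 2, 'd', 2 }, { 2, 'e', 3 },
    };

    static TreeSet<Integer> closure(TreeSet<Integer> s) {
        TreeSet<Integer> t = new TreeSet<>(s);
        boolean changed = true;
        while (changed) {                          // repeat until T stops growing
            changed = false;
            for (int[] e : EDGES) {
                if (e[1] == 'e' && t.contains(e[0]) && t.add(e[2])) changed = true;
            }
        }
        return t;
    }

    static TreeSet<Integer> dfaEdge(TreeSet<Integer> s, char c) {
        TreeSet<Integer> targets = new TreeSet<>();
        for (int[] e : EDGES) {
            if (e[1] == c && s.contains(e[0])) targets.add(e[2]);
        }
        return closure(targets);
    }

    // Maps each DFA state (a set of NFA states) to its outgoing edges.
    static Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> build() {
        Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> dfa =
            new LinkedHashMap<TreeSet<Integer>, Map<Character, TreeSet<Integer>>>();
        TreeSet<Integer> init = new TreeSet<>();
        init.add(0);
        init = closure(init);
        Deque<TreeSet<Integer>> worklist = new ArrayDeque<>();
        dfa.put(init, new LinkedHashMap<Character, TreeSet<Integer>>());
        worklist.add(init);
        while (!worklist.isEmpty()) {
            TreeSet<Integer> s = worklist.remove();
            for (char c : SYMBOLS) {
                TreeSet<Integer> next = dfaEdge(s, c);
                if (next.isEmpty()) continue;      // empty set = Error, no edge
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new LinkedHashMap<Character, TreeSet<Integer>>());
                    worklist.add(next);
                }
                dfa.get(s).put(c, next);
            }
        }
        return dfa;
    }

    // DFA states that contain the NFA final state.
    static List<TreeSet<Integer>> finalStates() {
        List<TreeSet<Integer>> finals = new ArrayList<>();
        for (TreeSet<Integer> s : build().keySet()) {
            if (s.contains(NFA_FINAL)) finals.add(s);
        }
        return finals;
    }
}
```

For this NFA, build() discovers the three DFA states {0,1}, {1}, and {2,3}, and finalStates() reports {2,3} as the only accepting one.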
Putting the Pieces Together
Regular expression R → RE → NFA Conversion → NFA → DFA Conversion → DFA Simulation
Input string w → DFA Simulation
Output: Yes, if w ∈ L(R); No, if w ∉ L(R)
State minimization
State minimization is an optimization that converts a DFA to another DFA that recognizes the same language and has a minimum number of states.
–Divide all states into "equivalence" groups.
    Two states p and q are equivalent if, for all symbols, the outgoing edges either lead to Error or to the same destination group.
    Collapse the states of a group into a single state (instead of p and q, have a single state).
State Minimization (Equivalence)
More formally, all states in group Gi are equivalent iff for any two states p and q in Gi, and for every symbol σ, transition(p,σ) and transition(q,σ) are either both Error, or are states in the same group Gj (possibly Gi itself).
[Figure: example with states p, q, and r in groups Gi, Gj, and Gk, with edges labeled a, b, and c (or leading to Error).]
State Minimization
Step 1. Partition the states of the original DFA into maximal-sized groups of "equivalent" states S = {G1, …, Gn}
Step 2. Construct the minimized DFA such that there is a state for each group Gi
[Figure: an example DFA with edges labeled a and b, before and after minimization.]
DFA Minimization
Step 1. Partition the states of the original DFA into maximal-sized groups of equivalent states
–Step 1a. Discard states not reachable from the start state
–Step 1b. Initial partition is S = {Final, Non-final}
–Step 1c. Repeatedly refine the partition {G1, …, Gn} while some group Gi contains states p and q such that for some symbol σ, transitions from p and q on σ are to different groups
[Figure: states p and q in group Gi; on symbol a, p goes to Gj and q goes to Gk (or Error), with j ≠ k, so Gi is split.]
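Steps 1b and 1c can be sketched as an iterative refinement in which each state gets a "signature" made of its current group and the groups of its targets. This is one simple way to implement the refinement, not necessarily the one used in any particular lexer generator. The class name DfaMinimizer, the -1 encoding of Error, and the assumption that unreachable states were already discarded (Step 1a) are all choices made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partition refinement for DFA minimization.
class DfaMinimizer {
    static final int ERROR = -1;

    // Returns group[s] = index of the equivalence group of state s.
    // trans[s][c] is the target state or ERROR; all states are assumed reachable.
    static int[] partition(int[][] trans, boolean[] accept) {
        int n = trans.length;
        int[] group = new int[n];
        for (int s = 0; s < n; s++) group[s] = accept[s] ? 1 : 0;   // Step 1b
        int count = 0;                       // number of groups after the last pass
        while (true) {                       // Step 1c
            Map<List<Integer>, Integer> ids = new HashMap<>();
            int[] refined = new int[n];
            for (int s = 0; s < n; s++) {
                // Signature: own group, then the group reached on each symbol.
                List<Integer> signature = new ArrayList<>();
                signature.add(group[s]);
                for (int target : trans[s]) {
                    signature.add(target == ERROR ? ERROR : group[target]);
                }
                Integer id = ids.get(signature);
                if (id == null) {
                    id = ids.size();
                    ids.put(signature, id);
                }
                refined[s] = id;
            }
            if (ids.size() == count) return refined;   // no group was split
            count = ids.size();
            group = refined;
        }
    }
}
```

As a usage example, take a hypothetical DFA over {a, b} with transitions {{1,2},{3,-1},{3,-1},{3,-1}} and state 3 accepting. States 1 and 2 behave identically, and partition returns the groups [0, 1, 1, 2], so the minimized DFA has three states.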
After state minimization, we have an optimized acceptor.
Regular expression R → RE → NFA → NFA → DFA → Minimize DFA → DFA Simulation
Input string w → DFA Simulation
Output: Yes, if w ∈ L(R); No, if w ∉ L(R)
Lexical Analyzers vs Acceptors
We really need a Lexer, not an acceptor.
Lexical analyzers use the same mechanism, but they:
–Have multiple RE descriptions for multiple tokens
–Output a sequence of matching tokens (or an error)
–Always return the longest matching token
–For multiple longest matching tokens, use rule priorities
Lexical Analyzers
REs for all valid tokens R1 … Rn → RE → NFA → NFA → DFA → Minimize DFA → DFA Simulation
Character stream (program) → DFA Simulation
Output: Token stream (and errors)
Handling Multiple REs
Construct one NFA for each RE
Associate the final state of each NFA with the given RE
Combine NFAs for all REs into one NFA
Convert NFA to minimized DFA, associating each final DFA state with the highest priority RE of the corresponding NFA states
[Figure: NFAs for whitespace, identifier, number, and keywords combined into one minimized DFA.]
Using Roll Back
Consider three REs: {aa, ba, aabb} and input: aaba
Reach state 3 with no transition on next character a
Roll input back to the position on entering state 2 (i.e., having read aa)
Emit token for aa
On next call to scanner, start in state 0 again with input ba
[Figure: DFA for the three REs, with edges labeled a and b.]
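The roll-back rule amounts to remembering the last position at which the DFA was in an accepting state. The sketch below assumes a particular state numbering for the DFA of {aa, ba, aabb}: states 2 and 3 follow the slide (2 after reading aa, 3 after reading aab), while the remaining numbers and the class name RollBackScanner are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: longest-match scanning with roll back for {aa, ba, aabb}.
class RollBackScanner {
    static final int ERROR = -1;

    // Columns are the symbols 'a' and 'b'.
    static final int[][] TRANS = {
        { 1, 5 },           // 0: start
        { 2, ERROR },       // 1: read a
        { ERROR, 3 },       // 2: read aa (accepting)
        { ERROR, 4 },       // 3: read aab
        { ERROR, ERROR },   // 4: read aabb (accepting)
        { 6, ERROR },       // 5: read b
        { ERROR, ERROR },   // 6: read ba (accepting)
    };
    static final boolean[] ACCEPT = { false, false, true, false, true, false, true };

    // Splits the input into longest-match tokens; throws on a lexical error.
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < input.length()) {
            int state = 0;
            int lastAcceptEnd = -1;            // end of the longest token seen so far
            for (int pos = start; pos < input.length(); pos++) {
                char c = input.charAt(pos);
                state = (c == 'a' || c == 'b') ? TRANS[state][c - 'a'] : ERROR;
                if (state == ERROR) break;
                if (ACCEPT[state]) lastAcceptEnd = pos + 1;
            }
            if (lastAcceptEnd < 0) {
                throw new IllegalArgumentException("Unexpected character at " + start);
            }
            tokens.add(input.substring(start, lastAcceptEnd));   // roll back to last accept
            start = lastAcceptEnd;
        }
        return tokens;
    }
}
```

On the input aaba the scanner reads aab, gets stuck on the final a, rolls back to the end of aa, emits aa, and then restarts in state 0 to emit ba.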
Automatic Lexer Generators
Input: token specification
–list of regular expressions in priority order
–associated action for each RE (generates appropriate kind of token, other bookkeeping)
Output: lexer program
–program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: "Unexpected character")
Automatic Lexer (C) Generator
Example: JLex (Java)
    %%
    digits = 0|[1-9][0-9]*
    letter = [A-Za-z]
    identifier = {letter}({letter}|[0-9_])*
    whitespace = [\ \t\n\r]+
    %%
    {whitespace}  { /* discard */ }
    {digits}      { return new Token(INT, Integer.parseInt(yytext())); }
    "if"          { return new Token(IF, yytext()); }
    "while"       { return new Token(WHILE, yytext()); }
    …
    {identifier}  { return new Token(ID, yytext()); }
Example Output (Java)
The generated Java Lexer implements the functionality described in the language specification.
For instance:
    case 5: { return new Token(WHILE, yytext()); }
    case 6: break;
    case 2: { return new Token(ID, yytext()); }
    case 7: break;
    case 4: { return new Token(IF, yytext()); }
    case 8: break;
    case 1: { return new Token(INT, Integer.parseInt(yytext())); }
Start States
Mechanism that specifies the state in which to start the execution of the DFA
Declare states in the second section
–%state STATE
Use states as prefixes of regular expressions in the third section:
–<STATE> regex {action}
Set the current state in the actions
–yybegin(STATE)
There is a pre-defined initial state: YYINITIAL
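Example
[Figure: state diagram with states INITIAL and STRING and edges labeled ", ", if, and .]
    %%
    %state STRING
    %%
    <YYINITIAL> "if"  { return new Token(IF, null); }
    <YYINITIAL> "\""  { yybegin(STRING); … }
    <STRING> "\""     { yybegin(YYINITIAL); … }
    <STRING> .        { … }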
Summary
Lexical analyzer converts a text stream to tokens
Ad-hoc lexers are hard to get right and to maintain
For most languages, legal tokens are conveniently and precisely defined using regular expressions
Lexer generators generate the lexer automaton automatically from token REs and their prioritization
Summary
To write your own Lexer:
–Describe tokens using regular expressions.
–Construct NFAs for those tokens. If you have no ambiguities in the NFA, or you have a DFA directly from the regular expressions, you are done.
–Construct a DFA from the NFA using the algorithm described.
–Systematically implement the DFA using transition tables.
Reading
–IC Language spec
–JLEX manual
–CVS manual
–Links on course web home page
–Russ Cox, "Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...)", January 2007
Acknowledgement
The slides are based on similar content by Tim Teitelbaum, Cornell.