Lexical Analysis - An Introduction
The Front End
The purpose of the front end is to deal with the input language.
Perform a membership test: is the code ∈ source language?
Is the program well-formed (semantically)?
Build an IR version of the code for the rest of the compiler.
[Figure: Source code → Front End → IR → Back End → Machine code, with Errors reported]
The Front End: Scanner
Maps a stream of characters into words, the basic unit of syntax.
x = x + y ; becomes a stream of tokens such as <id,x> = <id,x> + <id,y> ;
The characters that form a word are its lexeme.
Its part of speech (or syntactic category) is called its token type.
The scanner discards white space and (often) comments.
Speed is an issue in scanning, so use a specialized recognizer.
[Figure: Source code → Scanner → tokens → Parser → IR, with Errors reported]
The Front End: Parser
Checks the stream of classified words (parts of speech) for grammatical correctness.
Determines if the code is syntactically well-formed.
Guides checking at deeper levels than syntax.
Builds an IR representation of the code.
[Figure: Source code → Scanner → tokens → Parser → IR, with Errors reported]
The Big Picture
Language syntax is specified with parts of speech, not words.
Syntax checking matches parts of speech against a grammar.
  1. goal → expr
  2. expr → expr op term
  3.      | term
  4. term → number
  5.      | id
  6. op   → +
  7.      | -
S = goal
T = { number, id, +, - }
N = { goal, expr, term, op }
P = { 1, 2, 3, 4, 5, 6, 7 }
No words here! Parts of speech, not words!
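To make "parts of speech, not words" concrete, here is a minimal Python sketch that checks a sequence of token types against the grammar above. It relies on the observation that the left-recursive rule expr → expr op term | term derives exactly the sequences term (op term)*. The names TERMS, OPS and is_expr are hypothetical, chosen only for this illustration.

```python
# Hypothetical names; the grammar itself comes from the slide.
TERMS = {"number", "id"}   # rules 4 and 5: term -> number | id
OPS = {"+", "-"}           # rules 6 and 7: op -> + | -

def is_expr(parts_of_speech):
    """Return True if the token types form: term (op term)*."""
    if not parts_of_speech or parts_of_speech[0] not in TERMS:
        return False
    rest = parts_of_speech[1:]
    if len(rest) % 2 != 0:          # every op must be followed by a term
        return False
    for k in range(0, len(rest), 2):
        if rest[k] not in OPS or rest[k + 1] not in TERMS:
            return False
    return True

if __name__ == "__main__":
    # "x + 2 - y" is seen only as its parts of speech.
    print(is_expr(["id", "+", "number", "-", "id"]))
```

The checker never looks at a lexeme such as x or 2, only at its category, which is the point the slide makes.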
The Big Picture
[Figure: specifications → Scanner Generator → tables or code; source code → Scanner → parts of speech & words]
Specifications are written as “regular expressions”.
Words are represented as indices into a global table.
The Big Picture
Why study lexical analysis?
Goals:
To simplify the specification & implementation of scanners
To understand the underlying techniques and technologies
How to implement a scanner
Regular Expressions → NFA → DFA
Regular Expressions
Lexical patterns form a regular language (any finite language is regular).
Regular expressions (REs) describe regular languages.
Ever type “rm *.o a.out”?
Regular Expressions
Regular Expression (over alphabet Σ):
ε is a RE denoting the set {ε}.
If a is in Σ, then a is a RE denoting {a}.
If x and y are REs denoting L(x) and L(y), then
  x|y is an RE denoting L(x) ∪ L(y)
  xy is an RE denoting L(x)L(y)
  x* is an RE denoting L(x)*
Set Operations (review)
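The three operations used in the RE definition (union, concatenation, closure) can be tried out on small finite languages. The sketch below is only an illustration: the function names are hypothetical, and because L* is infinite whenever L contains a non-empty string, closure is cut off at an assumed max_reps repetitions.

```python
def union(L, M):
    """L ∪ M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def closure(L, max_reps):
    """Finite approximation of L*: zero up to max_reps repetitions of L."""
    result = {""}          # zero repetitions gives the empty string ε
    power = {""}
    for _ in range(max_reps):
        power = concat(power, L)   # L^k from L^(k-1)
        result |= power
    return result

if __name__ == "__main__":
    print(sorted(union({"a"}, {"b"})))
    print(sorted(concat({"a", "b"}, {"c"})))
    print(sorted(closure({"a"}, 3)))
```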
Examples of Regular Expressions
Identifiers:
  Letter → (a|b|c| … |z|A|B|C| … |Z)
  Digit → (0|1|2| … |9)
  Identifier → Letter ( Letter | Digit )*
Numbers:
  Integer → (+|-|ε) (0 | (1|2|3| … |9)(Digit*))
  Decimal → Integer . Digit*
  Real → ( Integer | Decimal ) E (+|-|ε) Digit*
  Complex → ( Real , Real )
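As a rough check of these definitions, they can be transcribed into the syntax of Python's standard re module. This is a sketch of one possible transcription, not part of the original slides; the constant names and the matches helper are hypothetical. Note that, read literally, Real allows an empty exponent because Digit* may match nothing.

```python
import re

LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"
IDENTIFIER = rf"{LETTER}({LETTER}|{DIGIT})*"
INTEGER = rf"[+-]?(0|[1-9]{DIGIT}*)"          # optional sign, no leading zeros
DECIMAL = rf"{INTEGER}\.{DIGIT}*"             # the dot is a literal character
REAL = rf"({INTEGER}|{DECIMAL})E[+-]?{DIGIT}*"

def matches(pattern, text):
    """True if the whole text is a word of the language described by pattern."""
    return re.fullmatch(pattern, text) is not None

if __name__ == "__main__":
    for pattern, text in [(IDENTIFIER, "x1"), (IDENTIFIER, "1x"),
                          (INTEGER, "-42"), (INTEGER, "007"),
                          (DECIMAL, "3.14"), (REAL, "3.14E-10")]:
        print(text, matches(pattern, text))
```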
Regular Expressions (the point)
Regular expressions can be used to specify the words to be translated to parts of speech by a lexical analyzer.
Using results from automata theory and the theory of algorithms, we can automatically build recognizers from regular expressions.
We study REs and the associated theory to automate scanner construction!
Example
Consider the problem of recognizing ILOC register names.
Register → r (0|1|2| … |9) (0|1|2| … |9)*
This allows registers of arbitrary number and requires at least one digit.
The RE corresponds to a recognizer (or DFA).
[Figure: Recognizer for Register. S0 → S1 on r; S1 → S2 on (0|1|2| … |9); S2 → S2 on (0|1|2| … |9); S2 is the accepting state]
Example (continued)
DFA operation:
Start in state S0 and take a transition on each input character.
The DFA accepts a word x iff x leaves it in a final state (S2).
So, r17 takes it through S0, S1, S2 and accepts.
r takes it through S0, S1 and fails.
[Figure: Recognizer for Register, repeated from the previous slide]
Example (continued)
To be useful, the recognizer must turn into code.

Skeleton recognizer:
    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        Char ← next character
    if (State is a final state)
        then report success
        else report failure

Table encoding the RE (transition function δ):
    State | r  | 0,1,...,9 | All others
    s0    | s1 | se        | se
    s1    | se | s2        | se
    s2    | se | s2        | se
    se    | se | se        | se
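The skeleton and table above translate almost line for line into a table-driven recognizer. The following Python sketch is one way to write it; the names DELTA, FINAL, char_class and recognize are hypothetical, and the end of the string stands in for EOF.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map a character to its column in the transition table."""
    if ch == "r":
        return "r"
    if ch in DIGITS:
        return "digit"
    return "other"

# Transition function δ, one row per state, as in the table on the slide.
DELTA = {
    "s0": {"r": "s1", "digit": "se", "other": "se"},
    "s1": {"r": "se", "digit": "s2", "other": "se"},
    "s2": {"r": "se", "digit": "s2", "other": "se"},
    "se": {"r": "se", "digit": "se", "other": "se"},
}
FINAL = {"s2"}

def recognize(word):
    """Run the DFA over word; succeed iff it ends in a final state."""
    state = "s0"
    for ch in word:                      # loop ends at "EOF"
        state = DELTA[state][char_class(ch)]
    return state in FINAL

if __name__ == "__main__":
    for w in ["r17", "r", "2a"]:
        print(w, recognize(w))
```

Only the table depends on the RE; the loop stays the same for any DFA, which is why scanner generators can emit tables rather than custom code.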
Example (continued)

Table encoding the RE, with an action for each transition:
    State | r         | 0,1,...,9 | All others
    s0    | s1, start | se, error | se, error
    s1    | se, error | s2, add   | se, error
    s2    | se, error | s2, add   | se, error
    se    | se, error | se, error | se, error

Skeleton recognizer:
    Char ← next character
    State ← s0
    while (Char ≠ EOF)
        State ← δ(State, Char)
        perform specified action
        Char ← next character
    if (State is a final state)
        then report success
        else report failure
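The slide does not spell out what the start, add and error actions do. The sketch below assumes one plausible reading: start resets an accumulator, add appends the current digit to the register number, and error abandons the word. All names (TABLE, scan_register, char_class) are hypothetical.

```python
DIGITS = "0123456789"

def char_class(ch):
    """Map a character to its column in the table."""
    if ch == "r":
        return "r"
    return "digit" if ch in DIGITS else "other"

ERR = ("se", "error")
# Each entry is (next state, action), following the table on the slide.
TABLE = {
    "s0": {"r": ("s1", "start"), "digit": ERR, "other": ERR},
    "s1": {"r": ERR, "digit": ("s2", "add"), "other": ERR},
    "s2": {"r": ERR, "digit": ("s2", "add"), "other": ERR},
    "se": {"r": ERR, "digit": ERR, "other": ERR},
}

def scan_register(word):
    """Return the register number for words like r17, or None on failure."""
    state, value = "s0", None
    for ch in word:
        state, action = TABLE[state][char_class(ch)]
        if action == "start":        # assumed meaning: begin a new number
            value = 0
        elif action == "add":        # assumed meaning: append this digit
            value = value * 10 + int(ch)
        else:                        # error: stop scanning this word
            return None
    return value if state == "s2" else None

if __name__ == "__main__":
    print(scan_register("r17"))
```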
Algorithm: Project 1
1. Open a file to read from.
2. Open a file to write to.
3. Create a scanner object.
4. Call a method from the Scanner class to scan, classify each token, and write to the output file.
Algorithm: scanner
1. Read a line from the input file; let c_i be the first character of this line.
2. If c_i == '#' || (c_i == '/' && c_i+1 == '/'): recognize the meta character; current token = meta character; print out the token with a new line.
3. Otherwise, if c_i == '"': recognize the string (read until you reach another "); current token = string; print out the token.
4. Otherwise, if c_i is a digit: recognize the number; current token = number; print out the token.
5. Otherwise, if c_i is a letter: recognize the id. If the token is not a keyword, it is an ID: print it with a tag. If it is a keyword, print the token.
6. Otherwise, if c_i is a symbol: recognize the symbol and print it.
7. Otherwise: print c_i and advance with i++.
8. After a token is recognized in steps 3 to 6, advance with i += length of the token.
9. If it is not the end of the line, repeat from step 2 with the new c_i. At the end of the line, read another line. At the end of the file, stop.
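The steps above can be sketched as a per-line scanning function. This is a minimal illustration, not the project's reference solution: KEYWORDS and SYMBOLS are assumed sets (the token list in the slides treats printf as a keyword), the function name scan_line is hypothetical, and characters that match no rule (such as blanks) are skipped here instead of being printed.

```python
KEYWORDS = {"void", "int", "printf"}       # assumed keyword set
SYMBOLS = set("(){}[];,=+-*/<>!&|%.")      # assumed symbol set

def scan_line(line):
    """Classify the tokens of one source line as (lexeme, kind) pairs."""
    tokens = []
    i = 0
    while i < len(line):
        c = line[i]
        if c == "#" or line.startswith("//", i):
            # Meta: the rest of the line is one token.
            tokens.append((line[i:].rstrip(), "meta"))
            break
        if c == '"':
            end = line.find('"', i + 1)    # read until the closing quote
            end = len(line) - 1 if end == -1 else end
            lexeme, kind = line[i:end + 1], "string"
        elif c.isdigit():
            j = i
            while j < len(line) and line[j].isdigit():
                j += 1
            lexeme, kind = line[i:j], "number"
        elif c.isalpha():
            j = i
            while j < len(line) and line[j].isalnum():
                j += 1
            lexeme = line[i:j]
            kind = "keyword" if lexeme in KEYWORDS else "id"
        elif c in SYMBOLS:
            lexeme, kind = c, "symbol"
        else:
            i += 1                         # blank or unknown character
            continue
        tokens.append((lexeme, kind))
        i += len(lexeme)                   # i += len of token
    return tokens

if __name__ == "__main__":
    for lexeme, kind in scan_line('printf("Helloworld %d",b);'):
        print(lexeme, "----", kind)
```

A driver for Project 1 would open the input and output files, call scan_line on each line, and write the classified tokens to the output file.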
Sample: test1.c
    #include
    void sample()
    {
        int b=4;
        printf("Helloworld %d",b);
    }
    int main()
    {
        sample();
    }
Token list
    #include ---- meta
    void ---- keyword
    sample ---- id
    ( ---- symbol
    ) ---- symbol
    { ---- symbol
    int ---- keyword
    b ---- id
    = ---- symbol
    4 ---- number
    ; ---- symbol
    printf ---- keyword
    ( ---- symbol
    "Helloworld %d" ---- string
    , ---- symbol
    b ---- id
    ) ---- symbol
    ; ---- symbol
    } ---- symbol
    int ---- keyword
    main ---- id
    ( ---- symbol
    ) ---- symbol
    { ---- symbol
    sample ---- id
    ( ---- symbol
    ) ---- symbol
    ; ---- symbol
    } ---- symbol
Result
    #include
    void CS322sample()
    {
        int CS322b=4;
        printf("Helloworld %d",CS322b);
    }
    int CS322main()
    {
        CS322sample();
    }
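Judging from this result, the tag printed with each ID appears to be the prefix CS322, while keywords, strings, numbers, symbols and meta lines are copied unchanged. The sketch below reproduces that behavior with a single regular expression under those assumptions; tag_identifiers, KEYWORDS and PREFIX are hypothetical names, and a // comment that starts in the middle of a line is not handled.

```python
import re

KEYWORDS = {"void", "int", "printf"}   # assumed; printf is listed as a keyword
PREFIX = "CS322"                       # tag inferred from the Result slide

# Strings and numbers are matched first so that their contents are skipped.
TOKEN = re.compile(r'"[^"]*"|[0-9][A-Za-z0-9]*|[A-Za-z][A-Za-z0-9]*')

def tag_identifiers(line):
    """Return line with every non-keyword identifier prefixed by PREFIX."""
    stripped = line.lstrip()
    if stripped.startswith("#") or stripped.startswith("//"):
        return line                    # meta lines are copied unchanged

    def replace(match):
        lexeme = match.group(0)
        if lexeme[0].isalpha() and lexeme not in KEYWORDS:
            return PREFIX + lexeme     # an ID: print with tag
        return lexeme                  # keyword, string, or number

    return TOKEN.sub(replace, line)

if __name__ == "__main__":
    for src in ["void sample()", "int b=4;", 'printf("Helloworld %d",b);']:
        print(tag_identifiers(src))
```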