Compiler Structures: Lexical Analysis
Objective:
– what is lexical analysis?
– look at a lexical analyzer for a simple 'expressions' language
Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. Regular Expressions (REs)
5. The Expressions Language
6. exprTokens.c
7. From REs to Code Automatically
In this lecture
[Compiler pipeline diagram: Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator (the front end) → Intermediate Code → Code Optimizer → Target Code Generator (the back end) → Target Language Program. This lecture covers the Lexical Analyzer.]
Why Lexical Analysis?
A stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants).
Simplifies the design of the rest of the compiler
– the code uses tokens, not strings or characters
Can be implemented efficiently
– by hand or automatically
Improves portability
– non-standard symbols / foreign characters are translated here, so do not affect the rest of the compiler
Using a Lexical Analyzer
[Diagram: the Syntax Analyzer (using tokens) drives the Lexical Analyzer (using chars), which reads the Source Program. 1. Get next token; 2. Get chars to make a token; 3. Return the token and its token value. Each stage reports its own errors: lexical errors, syntax errors.]
A Source Program is Chars
Consider the program fragment:
if (i==j);
  z=1;
else;
  z=0;
endif;
The lexical analyzer reads it in as a string of characters:
if_(i==j);\n\tz=1;\nelse;\n\tz=0;\nendif;
Lexical analysis divides the string into tokens.
Tokens and Token Values
[Diagram: the Syntax Analyzer gets tokens (one at a time) from the Lexical Analyzer, which gets chars from input text such as "y = *foo"; each token is returned together with its token value.]
Tokens, Lexemes, and Patterns
A token is a lexical type
– e.g. id, int
A lexeme is a token value
– e.g. "abc", 123
A pattern says how to make a token from chars
– e.g. id = letter followed by letters and digits; int = non-empty sequence of digits
– a pattern is defined using regular expressions (REs)
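To make the distinction concrete, a scanner can pair each token (the type) with its lexeme (the matched text). A minimal sketch in C; the struct and names are illustrative, not from exprTokens.c, which (as we'll see) keeps them in separate globals:

/* A token is a lexical type; a lexeme is the text that produced it. */
typedef enum { TOK_ID, TOK_INT } TokType;   /* hypothetical token types */

typedef struct {
    TokType type;         /* the token, e.g. TOK_ID       */
    char    lexeme[31];   /* the token value, e.g. "abc"  */
} TokenInfo;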
Implementing a Lexical Analyzer
Issues:
Lookahead
– how to group chars into tokens
Ignoring whitespace and comments
Separating variables from keywords
– e.g. "if", "else"
(Automatically) translating REs into a lexical analyzer
Lookahead
A token is created by reading in characters and grouping them together. It is not always possible to decide whether a token is finished without looking ahead at the next char. For example:
– is "i" a variable, or the first character of "if"?
– is "=" an assignment or the beginning of "=="?
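One character of lookahead is usually enough: read a char, peek at the next one, and push it back if it belongs to the following token. A sketch using getchar()/ungetc() from <stdio.h> (the same pair exprTokens.c uses later); the function name and return values here are illustrative:

#include <stdio.h>

/* Distinguish '=' (assignment) from "==" (equality),
   after an initial '=' has already been read. */
int scanEquals(void)
{
    int next = getchar();
    if (next == '=')
        return 1;           /* matched "==" : equality              */
    ungetc(next, stdin);    /* push back: char starts the next token */
    return 0;               /* matched "="  : assignment            */
}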
Regular Expressions (REs)
REs are an algebraic way of specifying how to recognise input
– 'algebraic' means that the recognition pattern is defined using RE operands and operators
Covered in more detail in "maths for CoE".
REs in grep
grep searches input lines, a line at a time. If a line contains a string that matches grep's RE (pattern), then the line is output.
[Diagram: input lines (e.g. from a file) such as "hello andy", "my name is andy", "my bye", "byhe" are piped through grep "RE", which outputs the matching lines (e.g. to a file).]
Examples
Input lines:
hello andy
my name is andy
my bye
byhe

grep "and" outputs:
hello andy
my name is andy

grep -E "an|my" ("|" means "or") outputs:
hello andy
my name is andy
my bye
grep "hel*" ("*" means "0 or more") outputs:
hello andy
byhe
The RE Language
A RE defines a pattern which recognises (matches) a set of strings
– e.g. a RE can be defined that recognises the strings {aa, aba, abba, abbba, abbbba, …}
These recognisable strings are sometimes called the RE's language.
RE Operands
There are 4 basic kinds of operands:
– characters (e.g. 'a', '1', '(')
– the symbol ε (means the empty string '')
– the symbol {} (means the empty set)
– variables, which can be assigned a RE: variable = RE
RE Operators
There are three basic operators:
– union '|'
– concatenation
– closure '*'
Union: S | T
– this RE can use the S or the T RE to match strings
Example REs:
a | b      matches the strings {a, b}
a | b | c  matches the strings {a, b, c}
Concatenation: S T
– this RE will use the S RE followed by the T RE to match against strings
Example REs:
a b        matches the string {ab}
w | (a b)  matches the strings {w, ab}
What strings are matched by the RE (a | ab)(c | bc)?
Equivalent to: {a, ab} followed by {c, bc}
=> {ac, abc, abc, abbc}
=> {ac, abc, abbc}
Closure: S*
– this RE can use the S RE 0 or more times to match against strings
Example RE:
a*  matches the strings {ε, a, aa, aaa, aaaa, aaaaa, ...}   (ε is the empty string)
REs for C Identifiers
We define two RE variables, letter and digit:
letter = A | B | C | ... | Z | a | b | c | ... | z
digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
id is defined using letter and digit:
id = letter ( letter | digit )*
Strings matched by id include:
ab345   wh5g
Strings not matched:
2$abc   ****
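The id RE can be read off directly as C code: accept one letter, then zero or more letters or digits. A sketch using <ctype.h> (matchId is a hypothetical helper, not part of exprTokens.c):

#include <ctype.h>

/* Does s match id = letter ( letter | digit )* ? Returns 1 on a match. */
int matchId(const char *s)
{
    if (!isalpha((unsigned char)*s))       /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))   /* then letters/digits only */
            return 0;
    return 1;
}
/* e.g. matchId("ab345") == 1, matchId("2$abc") == 0 */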
RE Summary
Expression   Meaning
ε            Empty pattern
a            Any pattern represented by 'a'
ab           Strings with pattern 'a' followed by 'b'
a|b          Strings consisting of pattern 'a' or 'b'
a*           Zero or more occurrences of patterns in 'a'
a+           One or more occurrences of patterns in 'a'
a^3          Patterns in 'a' repeated exactly 3 times
a?           (a | ε); optional single pattern from 'a'
.            Any single character
More Operators
See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory:
– over 80 operators!
Wild Card Symbol: '.'
The '.' stands for any character except the newline, e.g.
grep 'a..b.$' chapter1.txt
grep 't.*t.*t' /usr/share/dict/words    (the UNIX/Linux 'dictionary')
grep "a..b." /usr/share/dict/words
Input (the start of the dictionary): A, A's, AOL, AOL's, ...
Matching output includes:
adobe
alibi
ameba
REs for Integers and Floats
We redefine digit:
digit = 0|1|2|3|4|5|6|7|8|9    or    digit = [0-9]
int and float:
int   = {digit}+
float = {digit}+ "." {digit}+
Integers and floats with exponents:
number = {digit}+ ( '.' {digit}+ )? ( 'E' ('+'|'-')? {digit}+ )?
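This number RE can also be read off as code: one digit+ run, then an optional fraction, then an optional exponent. A sketch (digits and matchNumber are hypothetical helpers, not part of exprTokens.c):

#include <ctype.h>

/* Advance *s past digit+; return 1 if at least one digit was seen. */
static int digits(const char **s)
{
    const char *start = *s;
    while (isdigit((unsigned char)**s))
        (*s)++;
    return *s > start;
}

/* number = {digit}+ ('.' {digit}+)? ('E' ('+'|'-')? {digit}+)? */
int matchNumber(const char *s)
{
    if (!digits(&s)) return 0;        /* {digit}+                */
    if (*s == '.') {                  /* optional '.' {digit}+   */
        s++;
        if (!digits(&s)) return 0;
    }
    if (*s == 'E') {                  /* optional exponent part  */
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!digits(&s)) return 0;
    }
    return *s == '\0';   /* the whole string must be consumed */
}
/* e.g. matchNumber("3.14E-2") == 1, matchNumber("12.") == 0 */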
More on REs
See the RE summary on the course website: regular_expressions_cheat_sheet.pdf
I have the standard RE book:
– Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly & Associates
There are many websites that explain REs, e.g.:
helpsheets/unix/regex.html
The Expressions Language
In my expressions language, a program is a series of expressions and assignments. Example:
// test2.txt example
let x56 = 2
let bing_BONG = (27 * 2) - x56
5 * (67 / 3)
REs for the Language
alpha    = a | b | c | ... | z | A | B | ... | Z
digit    = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
alphanum = alpha | digit
id       = alpha ( alphanum | '_' )*
int      = digit+
keywords    = "let" | "SCANEOF"
punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
Ignore:
– whitespace (but not newlines)
– comments ("//" to the end of the line)
From REs to Tokens
Using the REs as a guide, we create tokens and token values. How? In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.
Tokens and Token Values
Token      Token Value
ID         "var" and the id string
INT        "num" and the value
LPAREN     '('
RPAREN     ')'
PLUSOP     '+'
MINUSOP    '-'
MULTOP     '*'
DIVOP      '/'
ASSIGNOP   '='
NEWLINE    '\n'
LET        "let"
SCANEOF    eof character
exprTokens.c
exprTokens.c is a lexical analyzer for the expressions language. It reads an expressions program from stdin, and prints out the tokens (and their values).
Usage
> gcc -Wall -o exprTokens exprTokens.c    (or use a Windows C compiler, e.g. lcc-win32)
> ./exprTokens < test2.txt
 1:
 2:
 3:
 4: 'let' var(x56) '=' num(2)
 5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
 6:
 7: num(5) '*' '(' num(67) '/' num(3) ')'
 8: 'eof'
>
Code
// constants for tokens and their values
#define NUMKEYS 2

typedef enum token_types {
    LET, ID, INT, LPAREN, RPAREN, NEWLINE,
    ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
} Token;

char *tokSyms[] = {"let", "var", "num", "(", ")", "\n",
                   "=", "+", "-", "*", "/", "eof"};

char *keywords[NUMKEYS]     = {"let", "SCANEOF"};
Token keywordToks[NUMKEYS]  = {LET, SCANEOF};
Callgraph for exprTokens.c
[Diagram: call graph showing which functions of exprTokens.c call which.]
main() and its globals
Token currToken;
int lineNum = 1;    // num lines read in

int main(void)
{
    printf("%2d: ", lineNum);
    do {
        nextToken();
        printToken();
    } while (currToken != SCANEOF);
    return 0;
}
Printing the Tokens
#define MAX_IDLEN 30
char tokString[MAX_IDLEN];
int currTokValue;    // used when token is an integer

void printToken(void)
{
    if (currToken == ID)             // an ID: show variable name
        printf("%s(%s) ", tokSyms[currToken], tokString);
    else if (currToken == INT)       // a number: show value
        printf("%s(%d) ", tokSyms[currToken], currTokValue);
    else if (currToken == NEWLINE)   // print newline token, then next line number
        printf("%s%2d: ", tokSyms[currToken], lineNum);
    else                             // other tokens
        printf("'%s' ", tokSyms[currToken]);
}  // end of printToken()
Getting a Token
void nextToken(void)
{
    currToken = scanner();
}
scanner() Overview
Token scanner(void)    // converts chars into a token
{
    int inCh;

    clearTokStr();
    if (feof(stdin))
        return SCANEOF;

    while ((inCh = getchar()) != EOF) {    /* EOF is ^D */
        if (inCh == '\n') {
            lineNum++;
            return NEWLINE;
        }
        else if (isspace(inCh))    // skip other whitespace
            continue;
        else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
            // read in chars to make id token
            // return ID or keyword
        }
        else if (isdigit(inCh)) {    // INT = DIGIT+
            // read in chars to make int token
            // convert token string to an int
            return INT;
        }
        // punctuation
        else if (inCh == '(')
            return LPAREN;
        else if ...                  // more tests of inCh
            ...
        else if (inCh == '=')
            return ASSIGNOP;
        else
            lexicalErr(inCh);
    }
    return SCANEOF;
}  // end of scanner()
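The overview above elides comment handling: the language ignores "//" to the end of the line, which needs the same one-char lookahead trick as "==". A sketch of one way to add it inside the while loop (this branch is not shown in the exprTokens.c slides, so treat it as an assumption about how it could be done):

        else if (inCh == '/') {          // '/' may start "//" or be DIVOP
            int next = getchar();
            if (next == '/') {           // comment: discard to end of line
                while ((next = getchar()) != '\n' && next != EOF)
                    ;
                if (next == '\n')
                    ungetc(next, stdin); // let the '\n' become NEWLINE
            } else {
                ungetc(next, stdin);     // not a comment: push back
                return DIVOP;
            }
        }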
Processing an ID (in scanner()):
        else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
            extendTokStr(inCh);
            for (inCh = getchar();
                 (isalnum(inCh) || inCh == '_');
                 inCh = getchar())
                extendTokStr(inCh);
            ungetc(inCh, stdin);     // push back the lookahead char
            return checkKeyword();
        }
Token String Functions
void clearTokStr(void)
// reset the token string to be empty
{
    tokString[0] = '\0';
    tokStrLen = 0;
}  // end of clearTokStr()

void extendTokStr(char ch)
// add ch to the end of the token string
{
    if (tokStrLen == (MAX_IDLEN-1))
        printf("Token string too long for %c\n", ch);
    else {
        tokString[tokStrLen] = ch;
        tokStrLen++;
        tokString[tokStrLen] = '\0';    // terminate string
    }
}  // end of extendTokStr()
Checking for a Keyword
Token checkKeyword(void)
{
    int i;
    for (i = 0; i < NUMKEYS; i++) {
        if (!strcmp(tokString, keywords[i]))
            return keywordToks[i];
    }
    return ID;
}  // end of checkKeyword()
Processing an INT (in scanner()):
        else if (isdigit(inCh)) {    // INT = DIGIT+
            extendTokStr(inCh);
            for (inCh = getchar(); isdigit(inCh); inCh = getchar())
                extendTokStr(inCh);
            ungetc(inCh, stdin);     // push back the lookahead char
            currTokValue = atoi(tokString);    // token string --> int
            return INT;
        }
Reporting an Error
void lexicalErr(char ch)
{
    printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
    exit(1);
}
No recovery is attempted.
Some Good News
Most programming languages use very similar lexical analyzers
– e.g. the same kinds of IDs, INTs, punctuation, and keywords
Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.
From REs to Code Automatically
1. Write the REs for the language.
2. Convert them to a Non-deterministic Finite Automaton (NFA).
3. Convert the NFA to a Deterministic Finite Automaton (DFA).
4. Convert the DFA to a table that can be 'plugged' into an 'empty' lexical analyzer.
There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
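To preview stage 4: the DFA becomes a transition table indexed by (state, character class), and the 'empty' analyzer is just a loop over that table. A minimal sketch for the id RE; the states, classes, and table here are illustrative, not lex's actual output:

#include <ctype.h>

/* DFA for id = letter ( letter | digit )*
   States: 0 = start, 1 = in id (accepting), 2 = dead. */
enum { C_LETTER, C_DIGIT, C_OTHER };    /* character classes */

static int classOf(int ch)
{
    if (isalpha(ch)) return C_LETTER;
    if (isdigit(ch)) return C_DIGIT;
    return C_OTHER;
}

static const int delta[3][3] = {
    /*           letter digit other */
    /* start */ {  1,     2,    2  },
    /* in id */ {  1,     1,    2  },
    /* dead  */ {  2,     2,    2  },
};

int dfaMatchesId(const char *s)    /* 1 if s is an id */
{
    int state = 0;
    for (; *s != '\0'; s++)
        state = delta[state][classOf((unsigned char)*s)];
    return state == 1;             /* did we end in the accepting state? */
}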