Compiler Structures
242-437, Semester 2, 2018-2019
2. Lexical Analysis
Objectives
  what is lexical analysis?
  look at a lexical analyzer for a simple 'expressions' language
Overview
  1. Why Lexical Analysis?
  2. Using a Lexical Analyzer
  3. Implementing a Lexical Analyzer
  4. The Expressions Language
  5. exprTokens.c
  6. From REs to Code Automatically
In this lecture
  Front End: Source Program -> Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Int. Code Generator -> Intermediate Code
  Back End: Intermediate Code -> Code Optimizer -> Target Code Generator -> Target Lang. Prog.
1. Why Lexical Analysis?
  A stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants).
  Simplifies the design of the rest of the compiler:
    the code uses tokens, not strings or characters.
  Can be implemented efficiently, by hand or automatically.
  Improves portability:
    non-standard symbols / foreign characters are translated here, so they do not affect the rest of the compiler.
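2. Using a Lexical Analyzer
  Source Program -> Lexical Analyzer (using chars) -> Syntax Analyzer (using tokens)
  1. Get next token: the syntax analyzer asks the lexical analyzer for a token.
  2. Get chars to make a token: the lexical analyzer reads chars from the source program.
  3. Token, token value: the lexical analyzer returns these to the syntax analyzer.
  The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.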
A Source Program is Chars
  Consider the program fragment:
    if (i==j);
        z=1;
    else;
        z=0;
    endif;
  The lexical analyzer reads it in as a string of characters:
    i f _ ( i = = j ) ; \n \t z = 1 ; \n e l s e ; \n \t z = 0 ; \n e n d i f ;
  Lexical analysis divides the string into tokens.
Tokens and Token Values
  The Lexical Analyzer gets chars:
    "y = 31 + 28*foo"
  The Syntax Analyzer gets tokens (one at a time), each a <token, token value> pair:
    <id, "y"> <=, > <int, 31> <+, > <int, 28> <*, > <id, "foo">
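One way to picture a <token, token value> pair in C is as a small struct. This is only a sketch: the names TokenRec, TokKind, makeId(), makeInt(), makeOp() and formatToken() are invented for illustration and are not part of exprTokens.c, which keeps the current token and its value in global variables instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_LEXEME 30                 /* assumed limit for this sketch */

/* Hypothetical token kinds, just enough for "y = 31 + 28*foo" */
typedef enum { TK_ID, TK_INT, TK_ASSIGN, TK_PLUS, TK_MULT } TokKind;

/* A token paired with its value; only one value field is used per kind */
typedef struct {
    TokKind kind;
    char text[MAX_LEXEME];            /* value of a TK_ID token  */
    int value;                        /* value of a TK_INT token */
} TokenRec;

TokenRec makeId(const char *name)
{
    assert(name != NULL);
    TokenRec t = { TK_ID, "", 0 };
    snprintf(t.text, sizeof t.text, "%s", name);   /* truncates long names */
    return t;
}

TokenRec makeInt(int v)
{
    TokenRec t = { TK_INT, "", v };
    return t;
}

TokenRec makeOp(TokKind kind)         /* operators carry no value */
{
    TokenRec t = { kind, "", 0 };
    return t;
}

/* Write the token into buf in the <token, value> notation of the slide */
void formatToken(const TokenRec *t, char *buf, size_t size)
{
    assert(t != NULL && buf != NULL && size > 0);
    switch (t->kind) {
    case TK_ID:     snprintf(buf, size, "<id, \"%s\">", t->text); break;
    case TK_INT:    snprintf(buf, size, "<int, %d>", t->value);   break;
    case TK_ASSIGN: snprintf(buf, size, "<=, >");                 break;
    case TK_PLUS:   snprintf(buf, size, "<+, >");                 break;
    case TK_MULT:   snprintf(buf, size, "<*, >");                 break;
    }
}
```

A syntax analyzer would receive one TokenRec per request, e.g. makeId("y"), then makeOp(TK_ASSIGN), then makeInt(31), and so on.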
Tokens, Lexemes, and Patterns
  A token is a lexical type
    e.g. id, int
  A lexeme is a token value
    e.g. "abc", 123
  A pattern says how to make a token from chars
    e.g. id = letter followed by letters and digits
         int = non-empty sequence of digits
    a pattern is defined using regular expressions (REs)
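The two example patterns can be checked directly in C. The sketch below assumes the whole lexeme is already in a string; matchesId() and matchesInt() are hypothetical helper names used only here.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if s matches  id = letter (letter | digit)*  */
bool matchesId(const char *s)
{
    assert(s != NULL);
    if (!isalpha((unsigned char)s[0]))
        return false;                 /* must start with a letter; rejects "" */
    for (size_t i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i]))
            return false;             /* only letters and digits may follow */
    return true;
}

/* Hypothetical helper: true if s matches  int = digit+  */
bool matchesInt(const char *s)
{
    assert(s != NULL);
    if (s[0] == '\0')
        return false;                 /* digit+ needs at least one digit */
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!isdigit((unsigned char)s[i]))
            return false;
    return true;
}
```

A real lexical analyzer does not get the lexeme handed to it like this; it has to find where the lexeme ends while reading chars, which is the lookahead issue.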
3. Implementing a Lexical Analyzer
  Issues:
    Lookahead
      how to group chars into tokens
    Ignoring whitespace and comments.
    Separating variables from keywords
      e.g. "if", "else"
    (Automatically) translating REs into a lexical analyzer.
Lookahead A token is created by reading in characters, and grouping them together. It is not always possible to decide if a token is finished without looking ahead at the next char. For example: Is "i" a variable, or the first character of "if"? Is "=" an assignment or the beginning of "=="?
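The "=" versus "==" case can be sketched with one character of lookahead. This example reads from a string and a position index rather than from stdin, and the names EqToken, TOK_ASSIGN, TOK_EQUALS, TOK_OTHER and scanEquals() are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token names for this sketch only */
typedef enum { TOK_ASSIGN, TOK_EQUALS, TOK_OTHER } EqToken;

/* Classify the token starting at src[*pos] and move *pos past it.
   Only '=' and '==' are recognized; anything else is TOK_OTHER. */
EqToken scanEquals(const char *src, size_t *pos)
{
    assert(src != NULL && pos != NULL);
    if (src[*pos] != '=') {
        if (src[*pos] != '\0')
            (*pos)++;                 /* skip one char we do not handle */
        return TOK_OTHER;
    }
    (*pos)++;                         /* consume the first '=' */
    if (src[*pos] == '=') {           /* look ahead at the next char */
        (*pos)++;                     /* it belongs to this token: consume it */
        return TOK_EQUALS;
    }
    return TOK_ASSIGN;                /* lookahead char is left unread */
}
```

When the lookahead char is not part of the token, it must stay available for the next call. Here that means not advancing *pos; a scanner reading from stdin can get the same effect with ungetc().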
4. The Expressions Language
  In my expressions language, a program is a series of expressions and assignments.
  Example:
    // test2.txt example
    let x56 = 2
    let bing_BONG = (27 * 2) - x56
    5 * (67 / 3)
4.1. REs for the Language
  alpha = a | b | c | ... | z | A | B | ... | Z
  digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
  alphanum = alpha | digit
  id = alpha (alphanum | '_' )*
  int = digit+
  keywords = "let" | "SCANEOF"
  punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
  Ignore:
    whitespace (but not newlines)
    comments ("//" to the end of the line)
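The "Ignore" rules can be sketched as a function that skips over ignorable text. The assumptions here are that the input is a string in memory, and that whitespace means space, tab and carriage return; skipIgnored() is a hypothetical name, not a function from exprTokens.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first char at or after pos
   that is not ignorable. Newlines are never skipped, because '\n' is a
   punctuation token in the expressions language. */
size_t skipIgnored(const char *src, size_t pos)
{
    assert(src != NULL);
    for (;;) {
        char c = src[pos];
        if (c == ' ' || c == '\t' || c == '\r') {
            pos++;                                /* plain whitespace */
        } else if (c == '/' && src[pos + 1] == '/') {
            /* comment: skip up to, but not including, the newline */
            while (src[pos] != '\0' && src[pos] != '\n')
                pos++;
        } else {
            return pos;                           /* start of a token, or end */
        }
    }
}
```

Telling a comment from a division needs one char of lookahead: a single '/' is left alone so that it can become a DIVOP token.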
4.2. From REs to Tokens
  Using the REs as a guide, we create tokens and token values.
  How? In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.
Tokens and Token Values
  Token       Token Value
  ID          "var" and the id string
  INT         "num" and the value
  LPAREN      '('
  RPAREN      ')'
  PLUSOP      '+'
  MINUSOP     '-'
  MULTOP      '*'
  DIVOP       '/'
  ASSIGNOP    '='
  NEWLINE     '\n'
  LET         "let"
  SCANEOF     eof character
5. exprTokens.c exprTokens.c is a lexical analyzer for the expressions language. It reads in an expressions program on stdin, and prints out the tokens (and their values).
5.1. Usage
  > gcc -Wall -o exprTokens exprTokens.c
  > ./exprTokens < test2.txt
  1:
  2:
  3:
  4: 'let' var(x56) '=' num(2)
  5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
  6:
  7: num(5) '*' '(' num(67) '/' num(3) ')'
  8: 'eof'
  >
  or a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/
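5.2. Code
  #include <ctype.h>     // isalpha(), isdigit(), isalnum(), isspace()
  #include <stdio.h>     // getchar(), ungetc(), printf()
  #include <stdlib.h>    // atoi(), exit()
  #include <string.h>    // strcmp()

  // constants for tokens and their values
  #define NUMKEYS 2

  typedef enum token_types {
      LET, ID, INT, LPAREN, RPAREN, NEWLINE, ASSIGNOP,
      PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
  } Token;

  // printable symbol for each token, in the same order as the enum
  char *tokSyms[] = {"let", "var", "num", "(", ")", "\n", "=",
                     "+", "-", "*", "/", "eof"};

  char *keywords[NUMKEYS] = {"let", "SCANEOF"};
  Token keywordToks[NUMKEYS] = {LET, SCANEOF};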
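Callgraph for exprTokens.c (diagram not reproduced)
  main() calls nextToken() and printToken()
  nextToken() calls scanner()
  scanner() calls clearTokStr(), extendTokStr(), checkKeyword() and lexicalErr()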
main() and its globals
  Token currToken;
  int lineNum = 1;    // num lines read in

  int main(void)
  {
      printf("%2d: ", lineNum);
      do {
          nextToken();
          printToken();
      } while (currToken != SCANEOF);
      return 0;
  }
Printing the Tokens
  #define MAX_IDLEN 30
  char tokString[MAX_IDLEN];
  int currTokValue;    // used when token is an integer

  void printToken(void)
  {
      if (currToken == ID)              // an ID, variable name
          printf("%s(%s) ", tokSyms[currToken], tokString);
      else if (currToken == INT)        // a number
          printf("%s(%d) ", tokSyms[currToken], currTokValue);    // show value
      else if (currToken == NEWLINE)    // print newline token, then the next line number
          printf("%s%2d: ", tokSyms[currToken], lineNum);
      else
          printf("'%s' ", tokSyms[currToken]);    // other toks
  }   // end of printToken()
Getting a Token
  void nextToken(void)
  {
      currToken = scanner();
  }
scanner() Overview
  Token scanner(void)
  // converts chars into a token
  {
      int inCh;
      clearTokStr();
      if (feof(stdin))
          return SCANEOF;
      while ((inCh = getchar()) != EOF) {    /* EOF is ^D */
          if (inCh == '\n') {
              lineNum++;
              return NEWLINE;
          }
          else if (isspace(inCh))    // do nothing
              continue;
          else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
              // read in chars to make id token
              // return ID or keyword
          }
          else if (isdigit(inCh)) {    // INT = DIGIT+
              // read in chars to make int token
              // change token to int
              return INT;
          }
          else if (inCh == '(')        // punctuation
              return LPAREN;
          else if ...                  // more tests of inCh
          ...
          else if (inCh == '=')
              return ASSIGNOP;
          else
              lexicalErr(inCh);
      }   // end of while
      return SCANEOF;
  }   // end of scanner()
Processing an ID in scanner()
      :
      else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
          extendTokStr(inCh);
          for (inCh = getchar(); (isalnum(inCh) || inCh == '_'); inCh = getchar())
              extendTokStr(inCh);    // collect the rest of the id
          ungetc(inCh, stdin);       // push back the lookahead char
          return checkKeyword();
      }
      :
Token String Functions
  int tokStrLen = 0;    // current length of tokString

  void clearTokStr(void)
  // reset the token string to be empty
  {
      tokString[0] = '\0';
      tokStrLen = 0;
  }   // end of clearTokStr()

  void extendTokStr(char ch)
  // add ch to the end of the token string
  {
      if (tokStrLen == (MAX_IDLEN-1))
          printf("Token string too long for %c\n", ch);
      else {
          tokString[tokStrLen] = ch;
          tokStrLen++;
          tokString[tokStrLen] = '\0';    // terminate string
      }
  }   // end of extendTokStr()
Checking for a Keyword
  Token checkKeyword(void)
  // return the keyword token if tokString is a keyword, otherwise ID
  {
      int i;
      for (i = 0; i < NUMKEYS; i++) {
          if (!strcmp(tokString, keywords[i]))
              return keywordToks[i];
      }
      return ID;
  }   // end of checkKeyword()
Processing an INT in scanner()
      :
      else if (isdigit(inCh)) {    // INT = DIGIT+
          extendTokStr(inCh);
          for (inCh = getchar(); isdigit(inCh); inCh = getchar())
              extendTokStr(inCh);    // collect the remaining digits
          ungetc(inCh, stdin);       // push back the lookahead char
          currTokValue = atoi(tokString);    // token --> int
          return INT;
      }
      :
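atoi() gives no way to detect a digit string that is too large for an int, and MAX_IDLEN allows up to 29 digits. A possible alternative, shown only as a sketch, is to convert with strtol() and check the range; digitsToInt() is a hypothetical helper that is not in exprTokens.c.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: convert a digit string to an int.
   Returns false (leaving *out unchanged) if the string is empty, has
   trailing non-digit chars, or the value does not fit in an int. */
bool digitsToInt(const char *digits, int *out)
{
    assert(digits != NULL && out != NULL);
    char *end;
    errno = 0;
    long v = strtol(digits, &end, 10);
    if (end == digits || *end != '\0')
        return false;                 /* nothing converted, or junk at the end */
    if (errno == ERANGE || v > INT_MAX || v < INT_MIN)
        return false;                 /* out of range for long or for int */
    *out = (int)v;
    return true;
}
```

In the scanner, a false result could be reported through the same route as other lexical errors.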
Reporting an Error
  void lexicalErr(char ch)
  {
      printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
      exit(1);
  }
  No recovery attempted.
5.3. Some Good News
  Most programming languages use very similar lexical analyzers,
    e.g. the same kind of IDs, INTs, punctuation, and keywords.
  Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.
6. From REs to Code Automatically
  1. Write the REs for the language.
  2. Convert to Non-deterministic Finite Automata (NFA).
  3. Convert to Deterministic Finite Automata (DFA).
  4. Convert to a table that can be 'plugged' into an 'empty' lexical analyzer.
  There are tools that will do stages 2-4 automatically.
  We'll look at one such tool, lex, in the next chapter.
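To give a feel for stage 4, here is a small hand-written transition table for the id and int REs of the expressions language, driven by a generic loop. It is a sketch of the idea only: the state names, the character classes and runDfa() are invented here, and a table produced by a tool would be generated from the REs rather than typed in.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

enum { S_START, S_ID, S_INT, S_ERR, NUM_STATES };          /* DFA states   */
enum { C_ALPHA, C_DIGIT, C_UNDER, C_OTHER, NUM_CLASSES };  /* char classes */

/* delta[state][class] = next state; this table is the part that changes
   from language to language, the driver loop below stays the same */
static const int delta[NUM_STATES][NUM_CLASSES] = {
    /*             alpha  digit  '_'    other */
    /* S_START */ { S_ID,  S_INT, S_ERR, S_ERR },
    /* S_ID    */ { S_ID,  S_ID,  S_ID,  S_ERR },   /* alpha (alphanum | '_')* */
    /* S_INT   */ { S_ERR, S_INT, S_ERR, S_ERR },   /* digit+ */
    /* S_ERR   */ { S_ERR, S_ERR, S_ERR, S_ERR },
};

static int charClass(int ch)
{
    if (ch == '_')   return C_UNDER;
    if (isalpha(ch)) return C_ALPHA;
    if (isdigit(ch)) return C_DIGIT;
    return C_OTHER;
}

/* The 'empty' analyzer: run the table over s and return the final state.
   S_ID and S_INT are accepting; S_START (empty input) and S_ERR are not. */
int runDfa(const char *s)
{
    assert(s != NULL);
    int state = S_START;
    for (size_t i = 0; s[i] != '\0'; i++)
        state = delta[state][charClass((unsigned char)s[i])];
    return state;
}
```

For example, runDfa("x56") ends in S_ID and runDfa("27") ends in S_INT, while runDfa("2x") ends in S_ERR.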