Compiler Structures
2. Lexical Analysis
242-437, Semester 2, 2018-2019
Objectives:
what is lexical analysis?
look at a lexical analyzer for a simple 'expressions' language

Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. The Expressions Language
5. exprTokens.c
6. From REs to Code Automatically

In this lecture
(Diagram: the compiler phases, with this lecture's topic, the Lexical Analyzer, highlighted.)
Front End: Source Program --> Lexical Analyzer --> Syntax Analyzer --> Semantic Analyzer --> Int. Code Generator --> Intermediate Code
Back End: Intermediate Code --> Code Optimizer --> Target Code Generator --> Target Lang. Prog.

1. Why Lexical Analysis?
The stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants).
Simplifies the design of the rest of the compiler: the code uses tokens, not strings or characters.
Can be implemented efficiently, by hand or automatically.
Improves portability: non-standard symbols / foreign characters are translated here, so they do not affect the rest of the compiler.

2. Using a Lexical Analyzer
(Diagram) The syntax analyzer (using tokens) drives the lexical analyzer (using chars):
1. The syntax analyzer asks for the next token.
2. The lexical analyzer reads chars from the source program to make a token.
3. The token and its token value are passed back to the syntax analyzer.
The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.

A Source Program is Chars
Consider the program fragment:
if (i==j); z=1; else; z=0; endif;
The lexical analyzer reads it in as a string of characters:
i f _ ( i = = j ) ; \n \t z = 1 ; \n e l s e ; \n \t z = 0 ; \n e n d i f ;
Lexical analysis divides the string into tokens.

Tokens and Token Values
The lexical analyzer gets the chars of "y = 31 + 28*foo" and the syntax analyzer gets its tokens, one at a time:
<id, "y">  <=, >  <int, 31>  <+, >  <int, 28>  <*, >  <id, "foo">
Each pair is a token and its token value (the value is empty for tokens such as = and +).

Tokens, Lexemes, and Patterns
A token is a lexical type, e.g. id, int.
A lexeme is a token value, e.g. "abc", 123.
A pattern says how to make a token from chars, e.g.
id = letter followed by letters and digits
int = non-empty sequence of digits
A pattern is defined using regular expressions (REs).
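To make the distinction concrete, a scanner could pair each token with its lexeme in a struct. This is only an illustrative sketch; exprTokens.c in section 5 uses separate globals (currToken, tokString, currTokValue) rather than a struct:

// Sketch only: one way of packaging a token and its lexeme.
typedef enum { TOK_ID, TOK_INT /* ... other token types ... */ } TokenType;

typedef struct {
    TokenType type;      // the token (lexical type), e.g. TOK_ID
    char lexeme[31];     // the token value (lexeme), e.g. "abc"
    int intValue;        // numeric value when type == TOK_INT
} TokenInfo;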

3. Implementing a Lexical Analyzer
Issues:
Lookahead: how to group chars into tokens.
Ignoring whitespace and comments.
Separating variables from keywords, e.g. "if", "else".
(Automatically) translating REs into a lexical analyzer.

Lookahead A token is created by reading in characters, and grouping them together. It is not always possible to decide if a token is finished without looking ahead at the next char. For example: Is "i" a variable, or the first character of "if"? Is "=" an assignment or the beginning of "=="?
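For instance, after reading a '=' the scanner can peek at the next character and push it back with ungetc() if it is not part of the current token. The sketch below is in the style of exprTokens.c but is not part of it; the expressions language has no "==", so the EQUALSOP token is hypothetical:

#include <stdio.h>

enum { ASSIGNOP, EQUALSOP };   // EQUALSOP is hypothetical (not in exprTokens.c)

// Called after a '=' has been read: one char of lookahead decides the token.
int scanAssignOrEquals(void)
{
    int next = getchar();
    if (next == '=')
        return EQUALSOP;       // saw "=="
    ungetc(next, stdin);       // the char belongs to the next token
    return ASSIGNOP;           // saw a single "="
}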

4. The Expressions Language
In my expressions language, a program is a series of expressions and assignments.
Example:
// test2.txt example
let x56 = 2
let bing_BONG = (27 * 2) - x56
5 * (67 / 3)

4.1. REs for the Language
alpha = a | b | c | ... | z | A | B | ... | Z
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
alphanum = alpha | digit
id = alpha (alphanum | '_')*
int = digit+

keywords = "let" | "SCANEOF"
punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
Ignore:
whitespace (but not newlines)
comments ("//" to the end of the line)
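exprTokens.c must discard these comments somewhere, but that code is not in the extracts shown in section 5, so the following is only an assumed sketch of one way to do it:

#include <stdio.h>

// Assumed helper: called once "//" has been recognised. It discards
// characters up to (but not including) the newline, so the NEWLINE
// token is still produced and the line count stays correct.
void skipComment(void)
{
    int ch;
    while ((ch = getchar()) != EOF && ch != '\n')
        ;                      // drop the comment characters
    if (ch == '\n')
        ungetc(ch, stdin);     // let scanner() handle the newline
}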

4.2. From REs to Tokens
Using the REs as a guide, we create tokens and token values. How?
In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.

Tokens and Token Values
Token     Token Value
ID        "var" and the id string
INT       "num" and the value
LPAREN    '('
RPAREN    ')'
PLUSOP    '+'
MINUSOP   '-'
MULTOP    '*'
DIVOP     '/'

Token     Token Value
ASSIGNOP  '='
NEWLINE   '\n'
LET       "let"
SCANEOF   eof character
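For example, the line "let x56 = 2" from test2.txt becomes the token stream
LET  ID("x56")  ASSIGNOP  INT(2)  NEWLINE
which exprTokens.c prints as 'let' var(x56) '=' num(2) (see section 5.1).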

5. exprTokens.c exprTokens.c is a lexical analyzer for the expressions language. It reads in an expressions program on stdin, and prints out the tokens (and their values).

5.1. Usage
> gcc -Wall -o exprTokens exprTokens.c
> ./exprTokens < test2.txt
 1:
 2:
 3:
 4: 'let' var(x56) '=' num(2)
 5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
 6:
 7: num(5) '*' '(' num(67) '/' num(3) ')'
 8: 'eof'
>
Or use a Windows C compiler: lcc-win32, http://www.cs.virginia.edu/~lcc-win32/

5.2. Code
// constants for tokens and their values
#define NUMKEYS 2

typedef enum token_types {
    LET, ID, INT, LPAREN, RPAREN, NEWLINE,
    ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
} Token;

char *tokSyms[] = {"let", "var", "num", "(", ")", "\n",
                   "=", "+", "-", "*", "/", "eof"};

char *keywords[NUMKEYS] = {"let", "SCANEOF"};
Token keywordToks[NUMKEYS] = {LET, SCANEOF};

Callgraph for exprTokens.c
(Diagram: main() calls nextToken() and printToken(); nextToken() calls scanner(); scanner() calls clearTokStr(), extendTokStr(), checkKeyword(), and lexicalErr().)

main() and its globals
Token currToken;
int lineNum = 1;    // num lines read in

int main(void)
{
    printf("%2d: ", lineNum);
    do {
        nextToken();
        printToken();
    } while (currToken != SCANEOF);
    return 0;
}

Printing the Tokens
#define MAX_IDLEN 30
char tokString[MAX_IDLEN];
int currTokValue;    // used when token is an integer

void printToken(void)
{
    if (currToken == ID)              // an ID, variable name
        printf("%s(%s) ", tokSyms[currToken], tokString);
    else if (currToken == INT)        // a number
        printf("%s(%d) ", tokSyms[currToken], currTokValue);    // show value
    else if (currToken == NEWLINE)
        printf("%s%2d: ", tokSyms[currToken], lineNum);          // print newline token
    else
        printf("'%s' ", tokSyms[currToken]);                     // other toks
}  // end of printToken()

Getting a Token
void nextToken(void)
{
    currToken = scanner();
}

scanner() Overview
Token scanner(void)    // converts chars into a token
{
    int inCh;
    clearTokStr();
    if (feof(stdin))
        return SCANEOF;
    while ((inCh = getchar()) != EOF) {    /* EOF is ^D */
        if (inCh == '\n') {
            lineNum++;
            return NEWLINE;
        }
        else if (isspace(inCh))    // do nothing
            continue;

        else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
            // read in chars to make id token
            // return ID or keyword
        }
        else if (isdigit(inCh)) {    // INT = DIGIT+
            // read in chars to make int token
            // change token to int
            return INT;
        }
        // punctuation
        else if (inCh == '(')
            return LPAREN;
        else if ...                  // more tests of inCh
            ...
        else if (inCh == '=')
            return ASSIGNOP;
        else
            lexicalErr(inCh);
    }
    return SCANEOF;
}  // end of scanner()

Processing an ID
in scanner():
    :
    else if (isalpha(inCh)) {    // ID = ALPHA (ALPHA_NUM | '_')*
        extendTokStr(inCh);
        for (inCh = getchar(); (isalnum(inCh) || inCh == '_'); inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);     // went one char too far, so push it back
        return checkKeyword();
    }
    :

Token String Functions
void clearTokStr(void)
// reset the token string to be empty
{
    tokString[0] = '\0';
    tokStrLen = 0;
}  // end of clearTokStr()

void extendTokStr(char ch)
// add ch to the end of the token string
{
    if (tokStrLen == (MAX_IDLEN-1))
        printf("Token string too long for %c\n", ch);
    else {
        tokString[tokStrLen] = ch;
        tokStrLen++;
        tokString[tokStrLen] = '\0';    // terminate string
    }
}  // end of extendTokStr()

Checking for a Keyword
Token checkKeyword(void)
{
    int i;
    for (i = 0; i < NUMKEYS; i++) {
        if (!strcmp(tokString, keywords[i]))
            return keywordToks[i];
    }
    return ID;
}  // end of checkKeyword()

Processing an INT
in scanner():
    :
    else if (isdigit(inCh)) {    // INT = DIGIT+
        extendTokStr(inCh);
        for (inCh = getchar(); isdigit(inCh); inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);     // push back the non-digit char
        currTokValue = atoi(tokString);    // token string --> int
        return INT;
    }
    :

Reporting an Error
void lexicalErr(char ch)
{
    printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
    exit(1);
}
No recovery is attempted.

5.3. Some Good News Most programming languages use very similar lexical analyzers e.g. the same kind of IDs, INTs, punctuation, and keywords Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.

6. From REs to Code Automatically
1. Write the REs for the language.
2. Convert them to Non-deterministic Finite Automata (NFA).
3. Convert the NFA to Deterministic Finite Automata (DFA).
4. Convert the DFA to a table that can be 'plugged' into an 'empty' lexical analyzer.
There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
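As a rough sketch of stage 4, the table drives a generic scanning loop. The toy table below handles only the int = digit+ pattern, and its states and character classes are invented for illustration; a real generated table covers every token of the language:

#include <stdio.h>
#include <ctype.h>

// Sketch: a 2-state DFA for int = digit+, driven by a transition table.
// Character classes: 0 = digit, 1 = anything else; -1 means "no move".
static int charClass(int ch) { return isdigit(ch) ? 0 : 1; }

static const int delta[2][2] = {
    // digit  other
    {   1,    -1 },    // state 0: start
    {   1,    -1 },    // state 1: accepting (INT)
};

int main(void)
{
    int state = 0, ch;
    while ((ch = getchar()) != EOF && delta[state][charClass(ch)] != -1)
        state = delta[state][charClass(ch)];
    if (ch != EOF)
        ungetc(ch, stdin);    // the char that ended the token
    puts(state == 1 ? "recognised an INT" : "no token recognised");
    return 0;
}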