Compiler Structures: Lexical Analysis
Objective:
– what is lexical analysis?
– look at a lexical analyzer for a simple 'expressions' language

Overview
1. Why Lexical Analysis?
2. Using a Lexical Analyzer
3. Implementing a Lexical Analyzer
4. Regular Expressions (REs)
5. The Expressions Language
6. exprTokens.c
7. From REs to Code Automatically

In this lecture
[Diagram: the phases of a compiler. Front end: Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator (producing Intermediate Code). Back end: Code Optimizer → Target Code Generator. A Source Program goes in; a Target Language Program comes out. This lecture covers the Lexical Analyzer.]

1. Why Lexical Analysis?
A stream of input text (e.g. from a file) is converted to an output stream of tokens (e.g. structs, records, constants).
It simplifies the design of the rest of the compiler
– the code uses tokens, not strings or characters
It can be implemented efficiently
– by hand or automatically
It improves portability
– non-standard symbols / foreign characters are translated here, so they do not affect the rest of the compiler

2. Using a Lexical Analyzer
[Diagram: the syntax analyzer (using tokens) repeatedly asks the lexical analyzer (using chars) for the next token (1); the lexical analyzer gets chars from the source program to make a token (2), and returns the token and token value (3). The lexical analyzer reports lexical errors; the syntax analyzer reports syntax errors.]

A Source Program is Chars
Consider the program fragment:
    if (i==j);
        z=1;
    else;
        z=0;
    endif;
The lexical analyzer reads it in as a string of characters (where _ marks a space):
    if_(i==j);\n\tz=1;\nelse;\tz=0;\nendif;
Lexical analysis divides the string into tokens.

Tokens and Token Values
[Diagram: the lexical analyzer gets chars such as "y = *foo" and passes tokens, each with a token value, to the syntax analyzer, which gets the tokens one at a time.]

Tokens, Lexemes, and Patterns
A token is a lexical type
– e.g. id, int
A lexeme is a token value
– e.g. "abc", 123
A pattern says how to make a token from chars
– e.g. id = letter followed by letters and digits; int = non-empty sequence of digits
– a pattern is defined using regular expressions (REs)
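In C, the token/lexeme distinction is naturally captured by pairing a token type with its value. A minimal sketch (not from this lecture's code; the names are illustrative):

    typedef enum { TOK_ID, TOK_INT } TokenType;  // tokens: lexical types

    typedef struct {
        TokenType type;    // the token, e.g. id or int
        char lexeme[31];   // the token value, e.g. "abc" or "123"
    } TokenInfo;           // e.g. { TOK_INT, "123" }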

3. Implementing a Lexical Analyzer
Issues:
– Lookahead: how to group chars into tokens
– Ignoring whitespace and comments
– Separating variables from keywords, e.g. "if", "else"
– (Automatically) translating REs into a lexical analyzer

Lookahead
A token is created by reading in characters and grouping them together. It is not always possible to decide if a token is finished without looking ahead at the next char. For example:
– Is "i" a variable, or the first character of "if"?
– Is "=" an assignment or the beginning of "=="?
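A minimal sketch of one-character lookahead in C, using ungetc() to push the extra char back onto stdin (the same trick exprTokens.c uses later). The token codes here are hypothetical; the expressions language below has no "==":

    #include <stdio.h>

    enum { ASSIGNTOK, EQUALSTOK };  // illustrative token codes

    // After '=' has been read: "==" is equality, a lone '=' is assignment.
    int scanEquals(void)
    {
        int next = getchar();
        if (next == '=')
            return EQUALSTOK;
        ungetc(next, stdin);  // the lookahead char belongs to the next token
        return ASSIGNTOK;
    }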

4. Regular Expressions (REs)
REs are an algebraic way of specifying how to recognise input
– 'algebraic' means that the recognition pattern is defined using RE operands and operators
REs are covered in more detail in "maths for CoE".

REs in grep
grep searches input lines, a line at a time. If the line contains a string that matches grep's RE (pattern), then the line is output.
[Diagram: grep "RE" reads input lines (e.g. from a file) and outputs the matching lines (e.g. to a file). The example file used below contains:
    hello andy
    my name is andy
    my bye byhe ]
(continued)

Examples
grep "and" outputs the lines containing "and":
    hello andy
    my name is andy
grep -E "an|my" outputs every line, since each contains "an" or "my" ("|" means "or"):
    hello andy
    my name is andy
    my bye byhe
(continued)

grep "hel*" outputs the lines containing "he" followed by zero or more l's ("*" means "0 or more"):
    hello andy
    my bye byhe

The RE Language
A RE defines a pattern which recognises (matches) a set of strings
– e.g. a RE can be defined that recognises the strings {aa, aba, abba, abbba, abbbba, ...}
These recognisable strings are sometimes called the RE's language.

RE Operands
There are 4 basic kinds of operands:
– characters (e.g. 'a', '1', '(')
– the symbol ε (means the empty string '')
– the symbol {} (means the empty set)
– variables, which can be assigned a RE:
    variable = RE

RE Operators
There are three basic operators:
– union '|'
– concatenation
– closure '*'

Union
S | T
– this RE can use the S or the T RE to match strings
Example REs:
    a | b      matches the strings {a, b}
    a | b | c  matches the strings {a, b, c}

Concatenation
S T
– this RE will use the S RE followed by the T RE to match against strings
Example REs:
    a b        matches the string {ab}
    w | (a b)  matches the strings {w, ab}

What strings are matched by the RE
    (a | ab) (c | bc)
It is equivalent to {a, ab} followed by {c, bc}:
    => {ac, abc, abc, abbc}
    => {ac, abc, abbc}

Closure
S*
– this RE can use the S RE 0 or more times to match against strings
Example RE:
    a*  matches the strings {ε, a, aa, aaa, aaaa, aaaaa, ...}
(ε is the empty string)

REs for C Identifiers
We define two RE variables, letter and digit:
    letter = A | B | C | D ... Z | a | b | c | d ... z
    digit  = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
id is defined using letter and digit:
    id = letter ( letter | digit )*
(continued)

Strings matched by id include:
    ab345    wh5g
Strings not matched:
    2$abc    ****
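As a quick check of the pattern, here is a minimal sketch (not part of this lecture's code) that tests whether a whole string matches id = letter ( letter | digit )*:

    #include <ctype.h>

    // Returns 1 if s is a letter followed by letters and digits.
    int isIdentifier(const char *s)
    {
        if (!isalpha((unsigned char)*s))
            return 0;                        // must start with a letter
        for (s++; *s != '\0'; s++)
            if (!isalnum((unsigned char)*s))
                return 0;                    // letters and digits only
        return 1;
    }

For example, isIdentifier("ab345") returns 1, while isIdentifier("2$abc") returns 0.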

RE Summary
    Expression   Meaning
    ε            the empty pattern
    a            any pattern represented by 'a'
    ab           strings with pattern 'a' followed by 'b'
    a|b          strings consisting of pattern 'a' or 'b'
    a*           zero or more occurrences of patterns in 'a'
    a+           one or more occurrences of patterns in 'a'
    a^3          patterns in 'a' repeated exactly 3 times
    a?           (a | ε); optional single pattern from 'a'
    .            any single character

More Operators
See the regular expressions "cheat-sheet" at the course website in the "Useful Info" subdirectory:
– over 80 operators!

Wild Card Symbol: '.'
The '.' stands for any character except the newline, e.g.
    grep 'a..b.$' chapter1.txt
    grep 't.*t.*t' /usr/share/dict/words
(/usr/share/dict/words is the UNIX/Linux 'dictionary')

grep "a..b." /usr/share/dict/words outputs the dictionary words that match, including:
    adobe
    alibi
    ameba
[Diagram: the dictionary file begins: A, A's, AOL, AOL's, ...]

REs for Integers and Floats
We redefine digit:
    digit = 0|1|2|3|4|5|6|7|8|9    or    digit = [0-9]
int and float:
    int   = {digit}+
    float = {digit}+ "." {digit}+

Integers and floats with exponents:
    number = {digit}+ ('.' {digit}+)? ('E' ('+'|'-')? {digit}+)?
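The number pattern can be tried out directly with the POSIX regex library. A sketch, assuming a POSIX system that provides <regex.h> (this is not part of the lecture's code):

    #include <regex.h>
    #include <stddef.h>

    // Returns 1 if s matches: digit+ ('.' digit+)? ('E' ('+'|'-')? digit+)?
    int matchesNumber(const char *s)
    {
        regex_t re;
        int ok;
        if (regcomp(&re, "^[0-9]+(\\.[0-9]+)?(E[+-]?[0-9]+)?$", REG_EXTENDED) != 0)
            return 0;                             // pattern failed to compile
        ok = (regexec(&re, s, 0, NULL, 0) == 0);  // 0 means a match
        regfree(&re);
        return ok;
    }

For example, matchesNumber("3.14E+2") and matchesNumber("42") return 1, while matchesNumber("3.") returns 0.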

More on REs
– See the RE summary on the course website: regular_expressions_cheat_sheet.pdf
– I have the standard RE book: Mastering Regular Expressions, Jeffrey E. F. Friedl, O'Reilly & Associates
(continued)

There are many websites that explain REs, e.g.
    helpsheets/unix/regex.html

5. The Expressions Language
In my expressions language, a program is a series of expressions and assignments.
Example:
    // test2.txt example
    let x56 = 2
    let bing_BONG = (27 * 2) - x56
    5 * (67 / 3)

REs for the Language
    alpha    = a | b | c | ... | z | A | B | ... | Z
    digit    = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    alphanum = alpha | digit
    id       = alpha (alphanum | '_')*
    int      = digit+

    keywords    = "let" | "SCANEOF"
    punctuation = '(' | ')' | '+' | '-' | '*' | '/' | '=' | '\n'
Ignore:
– whitespace (but not newlines)
– comments ("//" to the end of the line); see the sketch below
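A minimal sketch of how comment skipping might be done, assuming stdin input as in exprTokens.c later (this helper is not shown in the lecture's code):

    #include <stdio.h>

    // Called after "//" has been seen: discard chars up to the newline,
    // but push the newline back so the scanner still returns NEWLINE.
    void skipComment(void)
    {
        int ch;
        while ((ch = getchar()) != EOF && ch != '\n')
            ;                   // discard comment characters
        if (ch == '\n')
            ungetc(ch, stdin);  // newlines are tokens, so keep it
    }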

From REs to Tokens
Using the REs as a guide, we create tokens and token values. How? In general, the top-level REs (id, int) become tokens, and so do the punctuation and the keywords.

Tokens and Token Values
    Token     Token Value
    ID        "var" and the id string
    INT       "num" and the value
    LPAREN    '('
    RPAREN    ')'
    PLUSOP    '+'
    MINUSOP   '-'
    MULTOP    '*'
    DIVOP     '/'

    Token     Token Value
    ASSIGNOP  '='
    NEWLINE   '\n'
    LET       "let"
    SCANEOF   eof character

6. exprTokens.c
exprTokens.c is a lexical analyzer for the expressions language. It reads in an expressions program on stdin, and prints out the tokens (and their values).

Usage
    > gcc -Wall -o exprTokens exprTokens.c
    > ./exprTokens < test2.txt
     1:
     2:
     3:
     4: 'let' var(x56) '=' num(2)
     5: 'let' var(bing_BONG) '=' '(' num(27) '*' num(2) ')' '-' var(x56)
     6:
     7: num(5) '*' '(' num(67) '/' num(3) ')'
     8: 'eof'
    >
(or use a Windows C compiler such as lcc-win32)

Code
    // constants for tokens and their values
    #define NUMKEYS 2

    typedef enum token_types {
        LET, ID, INT, LPAREN, RPAREN, NEWLINE,
        ASSIGNOP, PLUSOP, MINUSOP, MULTOP, DIVOP, SCANEOF
    } Token;

    char *tokSyms[] = {"let", "var", "num", "(", ")", "\n",
                       "=", "+", "-", "*", "/", "eof"};

    char *keywords[NUMKEYS] = {"let", "SCANEOF"};
    Token keywordToks[NUMKEYS] = {LET, SCANEOF};

Callgraph for exprTokens.c
[Diagram: main() calls nextToken() and printToken(); nextToken() calls scanner(); scanner() calls clearTokStr(), extendTokStr(), checkKeyword(), and lexicalErr().]

main() and its globals
    Token currToken;
    int lineNum = 1;   // num lines read in

    int main(void)
    {
        printf("%2d: ", lineNum);
        do {
            nextToken();
            printToken();
        } while (currToken != SCANEOF);
        return 0;
    }

Printing the Tokens
    #define MAX_IDLEN 30
    char tokString[MAX_IDLEN];
    int currTokValue;   // used when token is an integer

    void printToken(void)
    {
        if (currToken == ID)            // an ID, variable name
            printf("%s(%s) ", tokSyms[currToken], tokString);
        else if (currToken == INT)      // a number; show value
            printf("%s(%d) ", tokSyms[currToken], currTokValue);
        else if (currToken == NEWLINE)  // print newline token
            printf("%s%2d: ", tokSyms[currToken], lineNum);
        else                            // other toks
            printf("'%s' ", tokSyms[currToken]);
    }  // end of printToken()

Getting a Token
    void nextToken(void)
    {
        currToken = scanner();
    }

scanner() Overview
    Token scanner(void)   // converts chars into a token
    {
        int inCh;
        clearTokStr();
        if (feof(stdin))
            return SCANEOF;
        while ((inCh = getchar()) != EOF) {   /* EOF is ^D */
            if (inCh == '\n') {
                lineNum++;
                return NEWLINE;
            }
            else if (isspace(inCh))   // do nothing
                continue;

            else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
                // read in chars to make id token
                // return ID or keyword
            }
            else if (isdigit(inCh)) {   // INT = DIGIT+
                // read in chars to make int token
                // change token to int
                return INT;
            }
            else if (inCh == '(')   // punctuation
                return LPAREN;
            else if ...             // more tests of inCh
                ...
            else if (inCh == '=')
                return ASSIGNOP;
            else
                lexicalErr(inCh);
        }
        return SCANEOF;
    }  // end of scanner()

Processing an ID
In scanner():
    else if (isalpha(inCh)) {   // ID = ALPHA (ALPHA_NUM | '_')*
        extendTokStr(inCh);
        for (inCh = getchar();
             (isalnum(inCh) || inCh == '_');
             inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);
        return checkKeyword();
    }

Token String Functions
    int tokStrLen;   // current length of tokString (declaration not shown on the original slide)

    void clearTokStr(void)
    // reset the token string to be empty
    {
        tokString[0] = '\0';
        tokStrLen = 0;
    }  // end of clearTokStr()

    void extendTokStr(char ch)
    // add ch to the end of the token string
    {
        if (tokStrLen == (MAX_IDLEN-1))
            printf("Token string too long for %c\n", ch);
        else {
            tokString[tokStrLen] = ch;
            tokStrLen++;
            tokString[tokStrLen] = '\0';   // terminate string
        }
    }  // end of extendTokStr()

Checking for a Keyword
    Token checkKeyword(void)
    {
        int i;
        for (i = 0; i < NUMKEYS; i++) {
            if (!strcmp(tokString, keywords[i]))
                return keywordToks[i];
        }
        return ID;
    }  // end of checkKeyword()

Processing an INT
In scanner():
    else if (isdigit(inCh)) {   // INT = DIGIT+
        extendTokStr(inCh);
        for (inCh = getchar(); isdigit(inCh); inCh = getchar())
            extendTokStr(inCh);
        ungetc(inCh, stdin);
        currTokValue = atoi(tokString);   // token string --> int
        return INT;
    }

Reporting an Error
    void lexicalErr(char ch)
    {
        printf("Lexical error at \"%c\" on line %d\n", ch, lineNum);
        exit(1);
    }
No recovery is attempted.
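If recovery were wanted, one gentle alternative is to report the bad character and let scanner() carry on with the next one. A hypothetical sketch (the lecture's version exits instead):

    #include <stdio.h>

    // Non-fatal variant: warn on stderr and skip the character.
    void lexicalWarn(char ch, int line)
    {
        fprintf(stderr, "Lexical error at \"%c\" on line %d (skipped)\n", ch, line);
    }

In scanner(), the final else would call lexicalWarn(inCh, lineNum); the while loop then simply reads the next char.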

Some Good News
Most programming languages use very similar lexical analyzers
– e.g. the same kinds of IDs, INTs, punctuation, and keywords
Once you've written one lexical analyzer, you can reuse it for other languages with only minor changes.

7. From REs to Code Automatically
1. Write the REs for the language.
2. Convert them to a Non-deterministic Finite Automaton (NFA).
3. Convert the NFA to a Deterministic Finite Automaton (DFA).
4. Convert the DFA to a table that can be 'plugged' into an 'empty' lexical analyzer (see the sketch below).
There are tools that will do stages 2-4 automatically. We'll look at one such tool, lex, in the next chapter.
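To make stage 4 concrete, here is a toy sketch of a table-driven scanner in C: a transition table for a DFA that recognises int = digit+, plugged into a generic matching loop. A generated table would cover all of a language's tokens; every name here is illustrative only.

    #include <ctype.h>

    enum { ST_START, ST_INT, ST_ERR, NUM_STATES };   // DFA states
    enum { CL_DIGIT, CL_OTHER, NUM_CLASSES };        // character classes

    // delta[state][class] = next state
    static const int delta[NUM_STATES][NUM_CLASSES] = {
        /* ST_START */ { ST_INT, ST_ERR },
        /* ST_INT   */ { ST_INT, ST_ERR },
        /* ST_ERR   */ { ST_ERR, ST_ERR },
    };

    // The 'empty' analyzer: run the table over the string;
    // accept only if we finish in the accepting state ST_INT.
    int matchesInt(const char *s)
    {
        int state = ST_START;
        for (; *s != '\0'; s++) {
            int cls = isdigit((unsigned char)*s) ? CL_DIGIT : CL_OTHER;
            state = delta[state][cls];
        }
        return state == ST_INT;
    }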