Chapter 4: Lexical analysis
CMSC 331. Some material © 1998 by Addison Wesley Longman, Inc.

Scanner
Main task: identify tokens
  – Basic building blocks of programs
  – E.g. keywords, identifiers, numbers, punctuation marks
Desk calculator language example:
    read A
    sum := A + 3.45e-3
    write sum
    write sum / 2
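The token categories listed above can be picked out by a small hand-written scanner. The following is a minimal sketch in C, not the course's implementation: the names `TokenKind` and `next_token` are hypothetical, and it assumes that `read` and `write` are the only keywords and that `:=`, `+`, `-`, `*` and `/` are the only operators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Token categories for the desk calculator language (illustrative names). */
enum TokenKind { TOK_KEYWORD, TOK_IDENT, TOK_NUMBER, TOK_OPERATOR, TOK_ERROR, TOK_END };

/* Scan one token starting at *src, copy its text into buf (capacity cap),
   advance *src past the token, and return its category. */
enum TokenKind next_token(const char **src, char *buf, size_t cap)
{
    const char *p, *start;
    size_t len;
    enum TokenKind kind;

    assert(src != NULL && *src != NULL && buf != NULL && cap > 0);
    p = *src;
    while (isspace((unsigned char)*p))           /* skip whitespace */
        p++;
    start = p;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p)) {     /* keyword or identifier */
        while (isalnum((unsigned char)*p))
            p++;
        kind = TOK_IDENT;                        /* refined to keyword below */
    } else if (isdigit((unsigned char)*p)) {     /* digits, optional fraction and exponent */
        while (isdigit((unsigned char)*p))
            p++;
        if (p[0] == '.' && isdigit((unsigned char)p[1])) {
            p++;
            while (isdigit((unsigned char)*p))
                p++;
        }
        if (p[0] == 'e') {
            const char *q = p + 1;               /* look ahead before committing */
            if (*q == '+' || *q == '-')
                q++;
            if (isdigit((unsigned char)*q)) {
                while (isdigit((unsigned char)*q))
                    q++;
                p = q;
            }
        }
        kind = TOK_NUMBER;
    } else if (p[0] == ':' && p[1] == '=') {
        p += 2;
        kind = TOK_OPERATOR;
    } else if (strchr("+-*/", *p) != NULL) {
        p++;
        kind = TOK_OPERATOR;
    } else {                                     /* unknown character */
        p++;
        kind = TOK_ERROR;
    }

    len = (size_t)(p - start);
    if (len >= cap)                              /* truncate overly long tokens */
        len = cap - 1;
    memcpy(buf, start, len);
    buf[len] = '\0';
    if (kind == TOK_IDENT && (strcmp(buf, "read") == 0 || strcmp(buf, "write") == 0))
        kind = TOK_KEYWORD;
    *src = p;
    return kind;
}
```

A caller would invoke `next_token` repeatedly on the same cursor until it returns `TOK_END`, much as a parser repeatedly asks a scanner for the next token.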

Formal definition of tokens
A set of tokens is a set of strings over an alphabet
  – {read, write, +, -, *, /, :=, 1, 2, …, 10, …, 3.45e-3, …}
A set of tokens is a regular set that can be defined by comprehension using a regular expression
For every regular set, there is a deterministic finite automaton (DFA) that can recognize it
  – Also known as a deterministic finite state machine (FSM)
  – I.e. it determines whether a string belongs to the set or not
  – Scanners extract tokens from source code in the same way DFAs determine membership

Regular Expressions
A regular expression (RE) is:
  – A single character
  – The empty string, ε
  – The concatenation of two regular expressions
      Notation: RE1 RE2 (i.e. RE1 followed by RE2)
  – The union of two regular expressions
      Notation: RE1 | RE2
  – The closure of a regular expression
      Notation: RE*
      * is known as the Kleene star
      * represents the concatenation of 0 or more strings
Caution: notations for regular expressions vary
  – Learn the basic concepts and the rest is just syntactic sugar

Token Definition Example
Numeric literals in Pascal, e.g. 1, 123, 10e-3, 3.14e4
Definition of token unsignedNum:
    DIG → 0|1|2|3|4|5|6|7|8|9
    unsignedInt → DIG DIG*
    unsignedNum → unsignedInt ((. unsignedInt) | ε) ((e (+ | - | ε) unsignedInt) | ε)
Notes:
  – Recursion is not allowed!
  – Parentheses are used to avoid ambiguity
  – It's always possible to rewrite the definition to remove the epsilons
[Figure: finite automaton for unsignedNum, with transitions labeled DIG, ".", "e", "+" and "-"]
FAs with epsilons are nondeterministic.
NFAs are much harder to implement (use backtracking)
Every NFA can be rewritten as a DFA (it gets larger, though)
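With the epsilons removed, the definition of unsignedNum corresponds to a small DFA that can be coded directly. The sketch below is one possible encoding, not the one drawn on the slide: the state names and the functions `num_step` and `is_unsigned_num` are assumptions made for illustration, and the minus sign is taken to be the ASCII character '-'.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One state per position in the definition; N_ERROR is a dead state. */
enum NumState { N_START, N_INT, N_DOT, N_FRAC, N_EXP, N_SIGN, N_EXPINT, N_ERROR };

/* Transition function of the DFA: next state for state s on input c. */
static enum NumState num_step(enum NumState s, int c)
{
    int dig = isdigit((unsigned char)c);

    switch (s) {
    case N_START:                      /* nothing read yet */
        return dig ? N_INT : N_ERROR;
    case N_INT:                        /* inside the leading unsignedInt */
        if (dig) return N_INT;
        if (c == '.') return N_DOT;
        if (c == 'e') return N_EXP;
        return N_ERROR;
    case N_DOT:                        /* just read '.', a digit must follow */
        return dig ? N_FRAC : N_ERROR;
    case N_FRAC:                       /* inside the fraction digits */
        if (dig) return N_FRAC;
        if (c == 'e') return N_EXP;
        return N_ERROR;
    case N_EXP:                        /* just read 'e' */
        if (dig) return N_EXPINT;
        if (c == '+' || c == '-') return N_SIGN;
        return N_ERROR;
    case N_SIGN:                       /* just read the exponent sign */
        return dig ? N_EXPINT : N_ERROR;
    case N_EXPINT:                     /* inside the exponent digits */
        return dig ? N_EXPINT : N_ERROR;
    default:                           /* N_ERROR is never left */
        return N_ERROR;
    }
}

/* Return 1 if the whole string s is an unsignedNum, 0 otherwise. */
int is_unsigned_num(const char *s)
{
    enum NumState state = N_START;

    assert(s != NULL);
    for (; *s != '\0'; s++)
        state = num_step(state, *s);
    /* The accepting states are those where an unsignedInt has just ended. */
    return state == N_INT || state == N_FRAC || state == N_EXPINT;
}
```

Because every state has exactly one successor for each input character, no backtracking is needed, which is the practical advantage of a DFA over an NFA.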

Simple Problem
Write a C program which reads in a character string, consisting of a's and b's, one character at a time. If the string contains a double aa, then print "string accepted", else print "string rejected".
An abstract solution to this can be expressed as a DFA.
[Figure: DFA with states 1, 2 and 3. State 1 is the start state and state 3 is an accepting state. Transitions are labeled a and b.]
The state transitions of a DFA can be encoded as a table which specifies the new state for a given current state and input:

    current state    input a    input b
    1 (start)        2          1
    2                3          1
    3 (accepting)    3          3

An approach in C

    #include <stdio.h>

    int main(void)
    {
        enum State {S1, S2, S3};
        enum State currentState = S1;   /* S1 is the start state */
        int c = getchar();

        while (c != EOF) {
            switch (currentState) {
            case S1:
                if (c == 'a') currentState = S2;
                if (c == 'b') currentState = S1;
                break;
            case S2:
                if (c == 'a') currentState = S3;
                if (c == 'b') currentState = S1;
                break;
            case S3:                    /* accepting state: stay here */
                break;
            }
            c = getchar();
        }
        if (currentState == S3)
            printf("string accepted\n");
        else
            printf("string rejected\n");
        return 0;
    }

Using a table simplifies the program

    #include <stdio.h>

    int main(void)
    {
        enum State {S1, S2, S3};
        enum Label {A, B};
        enum State currentState = S1;
        /* table[state][label] gives the next state */
        enum State table[3][2] = {{S2, S1}, {S3, S1}, {S3, S3}};
        enum Label label;
        int c = getchar();

        while (c != EOF) {
            /* ignore any other character (e.g. a trailing newline),
               so that label is never used uninitialized */
            if (c == 'a' || c == 'b') {
                label = (c == 'a') ? A : B;
                currentState = table[currentState][label];
            }
            c = getchar();
        }
        if (currentState == S3)
            printf("string accepted\n");
        else
            printf("string rejected\n");
        return 0;
    }

Lex
Lexical analyzer generator
  – It writes a lexical analyzer
Assumption
  – Each token matches a regular expression
Needs
  – A set of regular expressions
  – For each expression, an action
Produces
  – A C program
Automatically handles many tricky problems
flex is the GNU version of the venerable Unix tool lex
  – Produces highly optimized code

Scanner Generators
E.g. lex, flex
These programs take a table as their input and return a program (i.e. a scanner) that can extract tokens from a stream of characters
A very useful programming utility, especially when coupled with a parser generator (e.g. yacc)
Standard in Unix

Lex example
[Figure: foo.l → lex → foolex.c → cc → foolex; foolex reads input and produces tokens]

    > flex -ofoolex.c foo.l
    > cc -ofoolex foolex.c -lfl
    > more input
    begin
    if size>10
    then size * -3.1415
    end
    > foolex < input
    Keyword: begin
    Keyword: if
    Identifier: size
    Operator: >
    Integer: 10 (10)
    Keyword: then
    Identifier: size
    Operator: *
    Operator: -
    Float: 3.1415 (3.1415)
    Keyword: end

A Lex Program
General layout:

    … definitions …
    %%
    … rules …
    %%
    … subroutines …

Example:

    DIG [0-9]
    ID  [a-z][a-z0-9]*
    %%
    {DIG}+            printf("Integer\n");
    {DIG}+"."{DIG}*   printf("Float\n");
    {ID}              printf("Identifier\n");
    [ \t\n]+          /* skip whitespace */
    .                 printf("Huh?\n");
    %%
    int main(void) { yylex(); return 0; }

Simplest Example

    %%
    .|\n   ECHO;
    %%
    int main(void) { yylex(); return 0; }

Strings containing aa

    %%
    (a|b)*aa(a|b)*   {printf("Accept %s\n", yytext);}
    [ab]+            {printf("Reject %s\n", yytext);}
    .|\n             ECHO;
    %%
    int main(void) { yylex(); return 0; }

Rules
Each rule has a pattern and an action.
Patterns are regular expressions.
Only one action is performed
  – The action corresponding to the pattern matched is performed.
  – If several patterns match the input, the one corresponding to the longest sequence is chosen.
  – Among the rules whose patterns match the same number of characters, the rule given first is preferred.
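The two tie-breaking rules above (longest match first, then earliest rule) can be illustrated with a short C sketch. This is not how lex is implemented internally; the `Rule` structure, the matcher functions and `pick_rule` are hypothetical names, and the three patterns are hand-coded stand-ins for `if`, `[a-z][a-z0-9]*` and `[0-9]+`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule pairs a name with a matcher that returns the length of the
   longest prefix of s matched by its pattern (0 means no match). */
struct Rule {
    const char *name;
    size_t (*match)(const char *s);
};

static size_t match_if(const char *s)        /* pattern: if */
{
    return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

static size_t match_ident(const char *s)     /* pattern: [a-z][a-z0-9]* */
{
    size_t n = 0;

    if (!islower((unsigned char)s[0]))
        return 0;
    while (islower((unsigned char)s[n]) || isdigit((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_int(const char *s)       /* pattern: [0-9]+ */
{
    size_t n = 0;

    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* The order of this array plays the role of the rule order in a lex file. */
static const struct Rule rules[] = {
    { "Keyword",    match_if },
    { "Identifier", match_ident },
    { "Integer",    match_int },
};

/* Return the name of the rule chosen at the start of s and store the
   match length in *len; return NULL if no rule matches. */
const char *pick_rule(const char *s, size_t *len)
{
    const char *best = NULL;
    size_t best_len = 0;
    size_t i;

    assert(s != NULL && len != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        /* Strictly greater: on a tie the earlier rule is kept. */
        if (n > best_len) {
            best_len = n;
            best = rules[i].name;
        }
    }
    *len = best_len;
    return best;
}
```

On the input `if` both the keyword and identifier patterns match two characters, so the keyword rule wins because it is listed first. On `iffy` the identifier pattern matches four characters and wins on length.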

Flex's RE syntax

    x             the character 'x'
    .             any character except newline
    [xyz]         character class; in this case, matches either an 'x', a 'y', or a 'z'
    [abj-oZ]      character class with a range in it; matches 'a', 'b', any letter from 'j' through 'o', or 'Z'
    [^A-Z]        negated character class, i.e. any character but those in the class; here, any character except an uppercase letter
    [^A-Z\n]      any character except an uppercase letter or a newline
    r*            zero or more r's, where r is any regular expression
    r+            one or more r's
    r?            zero or one r's (i.e. an optional r)
    {name}        expansion of the "name" definition (see above)
    "[xy]\"foo"   the literal string: '[xy]"foo' (note the escaped ")
    \x            if x is an 'a', 'b', 'f', 'n', 'r', 't', or 'v', then the ANSI-C interpretation of \x; otherwise, a literal 'x' (i.e. an escape)
    rs            RE r followed by RE s (i.e. concatenation)
    r|s           either an r or an s
    <<EOF>>       end-of-file

Scanner for a toy Pascal-like language

    /* scanner for a toy Pascal-like language */
    %{
    #include <stdlib.h>   /* needed for calls to atoi() and atof() */
    %}
    DIG [0-9]
    ID  [a-z][a-z0-9]*
    %%
    {DIG}+             printf("Integer: %s (%d)\n", yytext, atoi(yytext));
    {DIG}+"."{DIG}*    printf("Float: %s (%g)\n", yytext, atof(yytext));
    if|then|begin|end  printf("Keyword: %s\n", yytext);
    {ID}               printf("Identifier: %s\n", yytext);
    "+"|"-"|"*"|"/"    printf("Operator: %s\n", yytext);
    "{"[^}\n]*"}"      /* skip one-line comments */
    [ \t\n]+           /* skip whitespace */
    .                  printf("Unrecognized: %s\n", yytext);
    %%
    int main(void) { yylex(); return 0; }