CSE 5317/4305 L2: Lexical Analysis. Leonidas Fegaras.

Presentation transcript:

Lexical Analysis
A scanner groups input characters into tokens. For example, the input x = x * (acc+123) is grouped into:

token          value
identifier     x
equal          =
identifier     x
star           *
left-paren     (
identifier     acc
plus           +
integer        123
right-paren    )

Tokens are typically represented by numbers.

Communication with the Parser
Each time the parser needs a token, it sends a request to the scanner:
– the scanner reads as many characters from the input stream as necessary to construct a single token
– when a single token is formed, the scanner is suspended and returns the token to the parser
– the parser repeatedly calls the scanner to read all the tokens from the input stream
[Figure: the scanner gets the next character from the source file; the parser sends "get token" to the scanner, receives a token, and produces the AST]

Tasks of a Scanner
A typical scanner:
– recognizes the keywords of the language: these are the reserved words that have a special meaning in the language, such as the word class in Java
– recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
– recognizes identifiers, integers, reals, decimals, strings, etc
– ignores whitespace (tabs, blanks, etc) and comments
– recognizes and processes special directives (such as the #include "file" directive in C) and macros

Scanner Generators
Input: a scanner specification
– describes every token using Regular Expressions (REs); eg, the RE [a-z][a-zA-Z0-9]* recognizes all identifiers that consist of one or more alphanumeric characters and start with a lower-case letter
– handles whitespace and resolves ambiguities
Output: the actual scanner
Scanner generators compile regular expressions into efficient programs (finite state machines)
You will use a scanner generator for Java, called JLex, for the project

Regular Expressions
Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets
– eg, the RE (a | b)*aa represents the infinite set {“aa”,“aaa”,“baa”,“abaa”,...}
– a RE is one of the following:

name            RE       designation
epsilon         ε        {“”}
symbol          a        {“a”} for some character a
concatenation   AB       the set { rs | r ∈ A, s ∈ B }, where rs is string concatenation, and A and B designate the REs for A and B
alternation     A | B    the set A ∪ B, where A and B designate the REs for A and B
repetition      A*       the set ε | A | (AA) | (AAA) | ... (an infinite set)

– eg, the RE (a | b)c designates { rs | r ∈ {“a”} ∪ {“b”}, s ∈ {“c”} }, which is equal to {“ac”,“bc”}
– Shortcuts: P+ = PP*, P? = P | ε, [a-z] = (“a”|“b”|...|“z”)
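The designations in the table can be computed directly for small finite cases. The following is only a sketch: the class RegularSets and its method names are made up for illustration, and since A* is infinite, repeatUpTo enumerates only the first few repetitions.

```java
import java.util.Set;
import java.util.TreeSet;

final class RegularSets {
    // epsilon designates {""}
    static Set<String> epsilon() {
        Set<String> set = new TreeSet<>();
        set.add("");
        return set;
    }

    // symbol a designates {"a"}
    static Set<String> symbol(char a) {
        Set<String> set = new TreeSet<>();
        set.add(String.valueOf(a));
        return set;
    }

    // alternation A | B designates the union of the two sets
    static Set<String> alt(Set<String> a, Set<String> b) {
        Set<String> set = new TreeSet<>(a);
        set.addAll(b);
        return set;
    }

    // concatenation AB designates { rs | r in A, s in B }
    static Set<String> concat(Set<String> a, Set<String> b) {
        Set<String> set = new TreeSet<>();
        for (String r : a) {
            for (String s : b) {
                set.add(r + s);
            }
        }
        return set;
    }

    // finite approximation of A*: epsilon | A | AA | ... with at most n repetitions
    static Set<String> repeatUpTo(Set<String> a, int n) {
        Set<String> result = epsilon();
        Set<String> power = epsilon();
        for (int i = 0; i < n; i++) {
            power = concat(power, a);
            result.addAll(power);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = symbol('a'), b = symbol('b'), c = symbol('c');
        System.out.println(concat(alt(a, b), c));             // (a | b)c
        System.out.println(alt(concat(a, c), concat(b, c)));  // ac | bc
    }
}
```

Both lines printed by main designate the same set, which is the distributivity of concatenation over alternation.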

Properties
– concatenation and alternation are associative; eg, ABC means (AB)C and is equivalent to A(BC)
– alternation is commutative; eg, A | B = B | A
– repetition is idempotent; eg, A** = A*
– concatenation distributes over alternation; eg, (a | b)c = ac | bc

Examples
for-keyword = for
letter = [a-zA-Z]
digit = [0-9]
identifier = letter (letter | digit)*
sign = + | - | ε
integer = sign (0 | [1-9]digit*)
decimal = integer . digit*
real = (integer | decimal) E sign digit+
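These definitions can be tried out with java.util.regex. This is a sketch rather than how a scanner generator works: java.util.regex is a backtracking matcher, not a compiled DFA, and the class name TokenPatterns and the helper is() are made up for illustration.

```java
import java.util.regex.Pattern;

final class TokenPatterns {
    static final String DIGIT = "[0-9]";
    static final String SIGN = "[+-]?";  // + | - | epsilon
    static final String IDENTIFIER = "[a-zA-Z][a-zA-Z0-9]*";
    static final String INTEGER = SIGN + "(0|[1-9]" + DIGIT + "*)";
    static final String DECIMAL = INTEGER + "\\." + DIGIT + "*";
    static final String REAL = "(" + INTEGER + "|" + DECIMAL + ")E" + SIGN + DIGIT + "+";

    // True when the whole lexeme matches the given definition.
    static boolean is(String definition, String lexeme) {
        return Pattern.matches(definition, lexeme);
    }

    public static void main(String[] args) {
        System.out.println(is(INTEGER, "-12"));     // a signed integer
        System.out.println(is(INTEGER, "012"));     // leading zero is not allowed by the definition
        System.out.println(is(DECIMAL, "3."));      // digit* permits an empty fraction
        System.out.println(is(REAL, "3.14E-2"));    // decimal, E, sign, digits
    }
}
```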

Disambiguation Rules
1) longest match rule: from all tokens that match the input prefix, choose the one that matches the most characters
2) rule priority: if more than one token has the longest match, choose the one listed first
Examples:
– for8: is it the for-keyword, the identifier “f”, the identifier “fo”, the identifier “for”, or the identifier “for8”? Use rule 1: “for8” matches the most characters.
– for: is it the for-keyword, the identifier “f”, the identifier “fo”, or the identifier “for”? Use rules 1 & 2: the for-keyword and the “for” identifier have the longest match, but the for-keyword is listed first.
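The two rules can be sketched in a few lines of Java. The class Disambiguation and the method firstToken are hypothetical names, and the sketch relies on java.util.regex returning the longest prefix for these two simple patterns, which does not hold for arbitrary REs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class Disambiguation {
    // Rules in priority order: the for-keyword is listed before identifier.
    static final String[] NAMES = {"for-keyword", "identifier"};
    static final Pattern[] RULES = {
        Pattern.compile("for"),
        Pattern.compile("[a-z][a-z0-9]*")
    };

    // Returns "name:lexeme" for the token at the start of the input, or null if no rule matches.
    static String firstToken(String input) {
        int bestRule = -1;
        int bestLength = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // lookingAt() matches a prefix of the input.
            // A strictly longer match wins (rule 1); on a tie the earlier rule is kept (rule 2).
            if (m.lookingAt() && m.end() > bestLength) {
                bestRule = i;
                bestLength = m.end();
            }
        }
        if (bestRule < 0) {
            return null;
        }
        return NAMES[bestRule] + ":" + input.substring(0, bestLength);
    }

    public static void main(String[] args) {
        System.out.println(firstToken("for8")); // identifier:for8
        System.out.println(firstToken("for"));  // for-keyword:for
    }
}
```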

How Scanner Generators Work
Translate REs into a finite state machine
Done in three steps:
1) translate REs into a nondeterministic finite automaton (NFA)
2) translate the NFA into a deterministic finite automaton (DFA)
3) optimize the DFA (optional)

Deterministic Finite Automata
A DFA represents a finite state machine that recognizes a RE
– eg, the RE (abc+)+ is represented by the DFA:
[Figure: DFA for (abc+)+]
A finite automaton consists of
– a finite set of states
– a set of transitions (moves)
– one start state
– a set of final states (accepting states)
A DFA has a unique transition for every state-character combination
A DFA accepts a string if, starting from the start state and moving from state to state, each time following the arrow that corresponds to the current input character, it reaches a final state when the entire input string is consumed

DFA (cont.)
The error state 0 is implied:
[Figure: the DFA with the error state 0 drawn explicitly]
The transition table T gives the next state T[s,c] for a state s and a character c
[Table: transition table T with one column per character a, b, c]
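A transition table for the RE (abc+)+ might look like the following sketch. The state numbering (1 = start, 4 = final) is an assumption, since the table on the slide did not survive in the transcript; only the use of 0 as the error state comes from the slide. The class name AbcDfa is made up.

```java
final class AbcDfa {
    static final int ERROR = 0;
    static final int START = 1;
    static final int FINAL = 4;

    // T[state][column]; the columns are the characters a, b, c.
    static final int[][] T = {
        {0, 0, 0}, // 0: error state, never left
        {2, 0, 0}, // 1: start, expects a
        {0, 3, 0}, // 2: seen a, expects b
        {0, 0, 4}, // 3: seen ab, expects c
        {2, 0, 4}, // 4: seen abc+ (final): more c's, or a starts another repetition
    };

    // True when the DFA is in the final state after consuming the entire string.
    static boolean accepts(String s) {
        int state = START;
        for (int i = 0; i < s.length(); i++) {
            int column = s.charAt(i) - 'a';
            if (column < 0 || column > 2) {
                return false; // character outside the alphabet {a, b, c}
            }
            state = T[state][column];
            if (state == ERROR) {
                return false;
            }
        }
        return state == FINAL;
    }

    public static void main(String[] args) {
        System.out.println(accepts("abcabcc")); // true
        System.out.println(accepts("abca"));    // false
    }
}
```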

The DFA of a Scanner
for-keyword = for
identifier = [a-z][a-z0-9]*
[Figure: DFA that recognizes both tokens]

Scanner Code
The scanner code that uses the transition table T:

    state = initial_state;
    current_character = get_next_character();
    while ( true ) {
        next_state = T[state,current_character];
        if (next_state == ERROR) break;
        state = next_state;
        current_character = get_next_character();
        if ( current_character == EOF ) break;
    };
    if ( is_final_state(state) )
        `we have a valid token'
    else
        `report an error'

With Longest Match

    state = initial_state;
    final_state = ERROR;
    current_character = get_next_character();
    while ( true ) {
        next_state = T[state,current_character];
        if (next_state == ERROR) break;
        state = next_state;
        if ( is_final_state(state) ) final_state = state;
        current_character = get_next_character();
        if (current_character == EOF) break;
    };
    if ( final_state == ERROR )
        `report an error'
    else if ( state != final_state )
        `we have a valid token but need to backtrack (to put characters back into the input stream)'
    else
        `we have a valid token'
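The longest-match loop can be sketched in Java as follows. This is not the code a scanner generator emits: the DFA here recognizes DIGIT+ and DIGIT+ "." DIGIT+ (the two number rules of the calculator scanner), the state names are made up, and instead of remembering the last final state it remembers the input position where that state was reached, which makes the backtracking step a simple substring.

```java
final class LongestMatchScanner {
    static final int ERROR = 0, START = 1, INT = 2, DOT = 3, REAL = 4;

    // Transition function of a DFA for DIGIT+ and DIGIT+ "." DIGIT+.
    static int next(int state, char c) {
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START: return digit ? INT : ERROR;
            case INT:   return digit ? INT : (c == '.' ? DOT : ERROR);
            case DOT:   return digit ? REAL : ERROR;
            case REAL:  return digit ? REAL : ERROR;
            default:    return ERROR;
        }
    }

    static boolean isFinal(int state) {
        return state == INT || state == REAL;
    }

    // Returns the longest token that starts at position start, or null if there is none.
    static String scan(String input, int start) {
        int state = START;
        int lastFinalEnd = -1; // position just past the last final state seen
        int pos = start;
        while (pos < input.length()) {
            int nextState = next(state, input.charAt(pos));
            if (nextState == ERROR) break;
            state = nextState;
            pos++;
            if (isFinal(state)) lastFinalEnd = pos;
        }
        if (lastFinalEnd < 0) {
            return null; // report an error
        }
        // Characters in [lastFinalEnd, pos) were read past the token;
        // cutting the lexeme at lastFinalEnd puts them back into the input.
        return input.substring(start, lastFinalEnd);
    }

    public static void main(String[] args) {
        System.out.println(scan("12.5+3", 0)); // 12.5
        System.out.println(scan("12.x", 0));   // 12, after backtracking over "."
    }
}
```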

Alternative Scanner Code
For each transition from a state s1 to a state s2 over a character c in a DFA, generate code:

    s1: current_character = get_next_character();
        ...
        if ( current_character == 'c' ) goto s2;
        ...
    s2: current_character = get_next_character();
        ...

Mapping a RE into an NFA
An NFA is similar to a DFA, but it also permits multiple transitions over the same character and transitions over ε
The following rules construct NFAs with only one final state:
[Figure: NFA construction rules]

Example
The RE (a | b)c is mapped into the NFA:
[Figure: NFA for (a | b)c]

Converting an NFA to a DFA
Subset construction:
– assign a number to each NFA state
– each DFA state will be assigned a set of numbers
– the closure of a DFA state {n1,...,nk} is the DFA state that contains all the NFA states that can be reached by zero or more empty transitions (ie, ε transitions) from the NFA states n1,..., or nk, so the closure of {n1,...,nk} is a superset of or equal to {n1,...,nk}
– the initial DFA state is the closure of the initial NFA state
– for every DFA state labelled by some set {n1,...,nk} and for every character c in the language alphabet, find all the states reachable from n1, n2,..., or nk using c arrows and union together the closures of these states. If this set is not the label of any other node in the DFA constructed so far, create a new DFA node with this label
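The steps above can be sketched in Java. The NFA below is an assumed encoding of the RE (a | b)c, because the NFA figure is not part of the transcript: the state numbers, the EDGES table, the use of label 0 for ε, and all class and method names are illustrative only. Final states are not tracked in this sketch; a DFA state would be final when its set contains the NFA final state (6 here).

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class SubsetConstruction {
    static final char EPS = 0; // label used for ε transitions (assumed encoding)

    // NFA edges as {from, label, to} for the RE (a | b)c; 0 is the start state, 6 the final state.
    static final int[][] EDGES = {
        {0, EPS, 1}, {0, EPS, 3}, {1, 'a', 2}, {3, 'b', 4},
        {2, EPS, 5}, {4, EPS, 5}, {5, 'c', 6}
    };

    // All NFA states reachable from the given states by zero or more ε transitions.
    static TreeSet<Integer> closure(Set<Integer> states) {
        TreeSet<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES) {
                if (e[0] == s && e[1] == EPS && result.add(e[2])) {
                    work.push(e[2]);
                }
            }
        }
        return result;
    }

    // Union of the closures of the states reachable from the given states using c arrows.
    static TreeSet<Integer> move(Set<Integer> states, char c) {
        Set<Integer> targets = new TreeSet<>();
        for (int[] e : EDGES) {
            if (e[1] == c && states.contains(e[0])) {
                targets.add(e[2]);
            }
        }
        return closure(targets);
    }

    // Maps every DFA state (a set of NFA states) to its outgoing transitions.
    static Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> build(int start, String alphabet) {
        Map<TreeSet<Integer>, Map<Character, TreeSet<Integer>>> dfa = new LinkedHashMap<>();
        Deque<TreeSet<Integer>> work = new ArrayDeque<>();
        TreeSet<Integer> initial = closure(Collections.singleton(start));
        dfa.put(initial, new TreeMap<>());
        work.add(initial);
        while (!work.isEmpty()) {
            TreeSet<Integer> current = work.poll();
            for (char c : alphabet.toCharArray()) {
                TreeSet<Integer> target = move(current, c);
                if (target.isEmpty()) continue; // no c arrow: the implied error state
                dfa.get(current).put(c, target);
                if (!dfa.containsKey(target)) { // a label not seen so far: new DFA node
                    dfa.put(target, new TreeMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }

    public static void main(String[] args) {
        System.out.println(build(0, "abc").keySet()); // [[0, 1, 3], [2, 5], [4, 5], [6]]
    }
}
```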

Example

Example
(a | b)*(abb | a+b)

JLex
Regular expressions (where e and f are regular expressions):
– c         any character c other than: ? * + | ( ) ^ $ . [ ] { } " \
– \c        any character c, but \n is newline, \^c is control-c, etc
– .         any character except \n
– "..."     the concatenation of all the characters in the string
– ef        concatenation
– e | f     alternation
– e*        Kleene closure
– e+        ee*
– e?        optional e
– {name}    macro expansion
– [...]     any character enclosed in [ ] (but only one character), from:
      c         a character c (or use \c)
      ef        any character from e or from f
      a-b       any character from a to b
      "..."     any character in the string
– [^...]    any character except those enclosed by [ ]

JLex Rules
A JLex rule: RE { action }, where action is Java code
– typically, the action returns a token
– but you want to skip whitespace and comments
– yytext() returns the part of the input that matches the RE
JLex uses longest match and rule priority
States and state transitions can be used for better control
– the initial (default) state is YYINITIAL
– any other state should be declared using the %state directive
– now a rule can take the form: <s> RE { action }, which can match only if we are in state s
– you jump to a state s using yybegin(s)

Case Study: The Calculator Scanner
The calculator example is available at:
After you download it on gamma, do:

    tar xfz calc.tar.gz
    cd calc
    build
    run

then try it with some input; eg,

    2*(3+8);
    x:=3+4;
    x+3;
    define f(n) = if n=0 then 1 else n*f(n-1);
    f(5);
    quit;

Tokens are Defined in calc.cup

    terminal LP, RP, COMMA, SEMI, ASSIGN, IF, THEN, ELSE, AND, OR, NOT, QUIT,
             PLUS, TIMES, MINUS, DIV, EQ, LT, GT, LE, NE, GE, FALSE, TRUE, DEFINE;
    terminal String ID;
    terminal Integer INT;
    terminal Float REALN;
    terminal String STRINGT;

The class constructor Symbol pairs together a terminal token with an optional value (a Java Object)
– if a terminal is specified with a class (a subtype of Object) then an object of this class should be provided along with the token
– eg, Symbol(sym.ID,"x")
– eg, Symbol(sym.INT,10)

The Calculator Scanner

    import java_cup.runtime.Symbol;
    %%
    %class CalcLex
    %public
    %line
    %char
    %cup
    DIGIT=[0-9]
    ID=[a-zA-Z][a-zA-Z0-9_]*
    %%

The Calculator Scanner (cont.)

    {DIGIT}+                { return new Symbol(sym.INT,new Integer(yytext())); }
    {DIGIT}+"."{DIGIT}+     { return new Symbol(sym.REALN,new Float(yytext())); }
    "("                     { return new Symbol(sym.LP); }
    ")"                     { return new Symbol(sym.RP); }
    ","                     { return new Symbol(sym.COMMA); }
    ";"                     { return new Symbol(sym.SEMI); }
    ":="                    { return new Symbol(sym.ASSIGN); }
    "define"                { return new Symbol(sym.DEFINE); }
    "quit"                  { return new Symbol(sym.QUIT); }
    "if"                    { return new Symbol(sym.IF); }
    "then"                  { return new Symbol(sym.THEN); }
    "else"                  { return new Symbol(sym.ELSE); }
    "and"                   { return new Symbol(sym.AND); }
    "or"                    { return new Symbol(sym.OR); }
    "not"                   { return new Symbol(sym.NOT); }
    "false"                 { return new Symbol(sym.FALSE); }
    "true"                  { return new Symbol(sym.TRUE); }

The Calculator Scanner (cont.)

    "+"                     { return new Symbol(sym.PLUS); }
    "*"                     { return new Symbol(sym.TIMES); }
    "-"                     { return new Symbol(sym.MINUS); }
    "/"                     { return new Symbol(sym.DIV); }
    "="                     { return new Symbol(sym.EQ); }
    "<"                     { return new Symbol(sym.LT); }
    ">"                     { return new Symbol(sym.GT); }
    "<="                    { return new Symbol(sym.LE); }
    "!="                    { return new Symbol(sym.NE); }
    ">="                    { return new Symbol(sym.GE); }
    {ID}                    { return new Symbol(sym.ID,yytext()); }
    \"[^\"]*\"              { return new Symbol(sym.STRINGT, yytext().substring(1,yytext().length()-1)); }
    [ \t\r\n\f]             { /* ignore white spaces. */ }
    .                       { System.err.println("Illegal character: "+yytext()); }