Lecture 2 Lexical Analysis

Compiler Design, Lecture 2: Lexical Analysis

Lexical Analysis
A Lexical Analyzer (Lexer, or Scanner) groups input characters into tokens. For example, the input x = x * (acc+123) yields:

token        value
identifier   x
equal        =
identifier   x
star         *
left-paren   (
identifier   acc
plus         +
integer      123
right-paren  )

Tokens are typically represented by numbers. For example, the token * may be assigned number 35. Some tokens require some extra information, such as the name of an identifier or the value of an integer.
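The grouping of characters into tokens can be sketched in a few lines of Python. This is only an illustration: the rule list TOKEN_RULES, the "skip" pseudo-token, and the function name tokenize are assumptions made here, and tokens are kept as names rather than numbers for readability.

```python
import re

# Hypothetical rule list: (token name, pattern). "skip" marks whitespace.
TOKEN_RULES = [
    ("integer", re.compile(r"[0-9]+")),
    ("identifier", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("equal", re.compile(r"=")),
    ("star", re.compile(r"\*")),
    ("plus", re.compile(r"\+")),
    ("left-paren", re.compile(r"\(")),
    ("right-paren", re.compile(r"\)")),
    ("skip", re.compile(r"\s+")),
]


def tokenize(text):
    """Return the list of (token, value) pairs found in text."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match:
                if name != "skip":  # whitespace produces no token
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens


for token, value in tokenize("x = x * (acc+123)"):
    print(token, value)
```

A real scanner would normally map each token name to a small integer and attach the value only where it is needed.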

Communication with the Parser
[Diagram: source file -> (get next character) -> scanner -> (get token / token) -> parser -> AST]
- Each time the parser needs a token, it sends a request to the scanner.
- The scanner reads as many characters from the input stream as necessary to construct a single token.
- When a single token is formed, the scanner is suspended and returns the token to the parser.
- The parser repeatedly calls the scanner to read all the tokens from the input stream.
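A minimal sketch of this request/response protocol, assuming a hypothetical Scanner class with a get_token method and a stand-in parse_all function in place of a real parser. Only three token kinds are distinguished, to keep the example short.

```python
class Scanner:
    """Hands out one token per request, reading only the characters it needs."""

    def __init__(self, text):
        self.text = text
        self.pos = 0  # index of the next unread character

    def get_token(self):
        text = self.text
        # Skip whitespace before the token.
        while self.pos < len(text) and text[self.pos].isspace():
            self.pos += 1
        if self.pos == len(text):
            return ("eof", "")
        start = self.pos
        ch = text[self.pos]
        if ch.isalpha():
            while self.pos < len(text) and text[self.pos].isalnum():
                self.pos += 1
            return ("identifier", text[start:self.pos])
        if ch.isdigit():
            while self.pos < len(text) and text[self.pos].isdigit():
                self.pos += 1
            return ("integer", text[start:self.pos])
        self.pos += 1
        return ("symbol", ch)


def parse_all(scanner):
    """Stand-in for a parser: keeps requesting tokens until the input ends."""
    tokens = []
    while True:
        token = scanner.get_token()
        if token[0] == "eof":
            return tokens
        tokens.append(token)


print(parse_all(Scanner("x = x * (acc+123)")))
```

The scanner keeps its position between calls, so it is effectively suspended after each token and resumes where it stopped on the next request.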

Tasks of a Scanner
A typical scanner:
- recognizes the keywords of the language (these are the reserved words that have a special meaning in the language, such as the word class in C++)
- recognizes special characters, such as ( and ), or groups of special characters, such as := and ==
- recognizes identifiers, integers, reals, decimals, strings, etc
- ignores whitespace (tabs, blanks, etc) and comments
- recognizes and processes special directives (such as the #include "file" directive in C) and macros

Regular Expressions
Regular expressions (REs) are a very convenient form of representing (possibly infinite) sets of strings, called regular sets. eg, the RE (a | b)*aa represents the infinite set {“aa”,“aaa”,“baa”,“abaa”, ... }

A RE is one of the following:

name           RE      designation
epsilon        ε       {“”}
symbol         a       {“a”} for some character a
concatenation  AB      the set { rs | r ∈ A, s ∈ B }, where rs is string concatenation
alternation    A | B   the set A ∪ B
repetition     A*      the set ε | A | (AA) | (AAA) | ... (an infinite set)

In the designation column, A and B stand for the sets designated by the REs A and B.

eg, the RE (a | b)c designates { rs | r ∈ {“a”} ∪ {“b”}, s ∈ {“c”} }, which is equal to {“ac”,“bc”}

Shortcuts: P+ = PP*, P? = P | ε, [a-z] = (“a”|“b”|...|“z”), P^2 = PP
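The set designations can be tried out directly with Python sets of strings. This is a sketch under one stated assumption: since A* is infinite, the hypothetical star function below takes a max_len bound and returns only the strings of A* up to that length. All function names are chosen here for illustration.

```python
EPSILON = {""}  # the set containing only the empty string


def symbol(a):
    return {a}


def concat(A, B):
    # { rs | r in A, s in B }
    return {r + s for r in A for s in B}


def alt(A, B):
    # A union B
    return A | B


def star(A, max_len):
    """Strings of A* whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more element of A, keep the unseen ones.
        frontier = {r + s for r in frontier for s in A if len(r + s) <= max_len} - result
        result |= frontier
    return result


a, b, c = symbol("a"), symbol("b"), symbol("c")

# (a | b)c
print(sorted(concat(alt(a, b), c)))

# (a | b)*aa, with the repetition cut off at length 2
print(sorted(concat(star(alt(a, b), 2), concat(a, a))))
```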

Properties
Kleene closure (*) binds more tightly than concatenation, which binds more tightly than alternation (|). eg, a|ab* is equivalent to a|(a(b*))
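Python's re module happens to use the same precedence, so the equivalence can be checked by comparing which sample strings each pattern accepts in full. The third pattern, (a|ab)*, is a deliberately different grouping added here for contrast; the sample list is arbitrary.

```python
import re

implicit = re.compile(r"a|ab*")        # relies on precedence
explicit = re.compile(r"a|(a(b*))")    # fully parenthesized
regrouped = re.compile(r"(a|ab)*")     # a different grouping, for contrast

samples = ["", "a", "ab", "abbb", "aab", "abab", "b"]


def accepted(pattern):
    """Sample strings that the pattern matches in their entirety."""
    return [s for s in samples if pattern.fullmatch(s)]


print(accepted(implicit))
print(accepted(explicit))
print(accepted(regrouped))
```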

Examples
for-keyword = for
letter = [a-zA-Z]
digit = [0-9]
identifier = letter (letter | digit)*
sign = + | - | ε
integer = sign (0 | [1-9]digit*)
decimal = integer . digit*
real = (integer | decimal) E sign digit+
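These definitions translate almost one-to-one into Python regular expressions. In this sketch the ε alternative of sign is written as an empty alternative, the literal + and . are escaped, and the helper name matches is an assumption made here.

```python
import re

letter = r"[a-zA-Z]"
digit = r"[0-9]"
identifier = rf"{letter}({letter}|{digit})*"
sign = r"(\+|-|)"                        # + | - | epsilon
integer = rf"{sign}(0|[1-9]{digit}*)"
decimal = rf"{integer}\.{digit}*"
real = rf"({integer}|{decimal})E{sign}{digit}+"


def matches(pattern, text):
    """True if the whole of text is in the set designated by pattern."""
    return re.fullmatch(pattern, text) is not None


for text in ["x1", "-12", "012", "3.14", "2.5E-3"]:
    kinds = [name for name, pattern in
             [("identifier", identifier), ("integer", integer),
              ("decimal", decimal), ("real", real)]
             if matches(pattern, text)]
    print(text, kinds)
```

Note that integer rejects 012 because a leading 0 may only stand alone, and that decimal accepts 3. because digit* may be empty.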

Disambiguation Rules
Problem: One string may match many regular expressions.
1. longest match rule: from all tokens that match the input prefix, choose the one that matches the most characters
2. rule priority: if more than one token has the longest match, choose the one listed first
Examples:
- for8: is it the for-keyword, the identifier “f”, the identifier “fo”, the identifier “for”, or the identifier “for8”? Use rule 1: “for8” matches the most characters.
- for: is it the for-keyword, the identifier “f”, the identifier “fo”, or the identifier “for”? Use rules 1 and 2: the for-keyword and the “for” identifier have the longest match, but the for-keyword is listed first.
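Both rules fit in one comparison, sketched here with a hypothetical RULES list and next_token function. The sketch relies on Python's greedy matching to find the longest match of each individual rule, which holds for these two patterns but is an assumption in general.

```python
import re

# Order matters: earlier rules win ties (rule priority).
RULES = [
    ("for-keyword", re.compile(r"for")),
    ("identifier", re.compile(r"[a-z][a-z0-9]*")),
]


def next_token(text, pos=0):
    """Return (token, lexeme) for the token starting at pos."""
    best_name, best_match = None, None
    for name, pattern in RULES:
        match = pattern.match(text, pos)
        # Strictly longer wins (longest match); on a tie the earlier rule stays.
        if match and (best_match is None or match.end() > best_match.end()):
            best_name, best_match = name, match
    if best_match is None:
        raise ValueError(f"no token matches at position {pos}")
    return best_name, best_match.group()


print(next_token("for8"))
print(next_token("for"))
```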

How to write a Scanner?
- Write a program with a switch case for the regular expression of each token:
  - Very difficult and complex
  - It leads to deeply nested switch and if statements
  - Static, i.e. when a regular expression changes we have to modify the program manually
- Use a finite automaton. What is it?

Finite Automata
A finite automaton can be used to decide if an input string is a member of some particular set of strings.
A finite automaton consists of:
- a finite set of states
- a set of transitions (moves)
- one start state
- a set of final states (accepting states)
Two types of finite automaton:
- Deterministic Finite Automaton (DFA)
- Non-deterministic Finite Automaton (NFA)

Deterministic Finite Automaton (DFA)
A DFA accepts a string if, starting from the start state and moving from state to state, each time following the arrow that corresponds to the current input character, it reaches a final state when the entire input string is consumed.
eg, the RE (abc+)+ is represented by the DFA:
[Figure: DFA for (abc+)+]
A DFA has a unique transition for every state-character combination.

DFA (cont.)
The transition table T gives the next state T[s,c] for a state s and a character c. Ø means the error state. For the RE (abc+)+:

state  a  b  c
1      2  Ø  Ø
2      Ø  3  Ø
3      Ø  Ø  4
4      2  Ø  4

The DFA of a Scanner
for-keyword = for
identifier = [a-z][a-z0-9]*
[Figure: combined DFA recognizing both tokens]
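One possible combined DFA for these two tokens is sketched below. The state numbering is an assumption made for this example: 0 is the start state, 1 to 3 track the prefixes f, fo and for, and 4 is reached by any other identifier.

```python
import string

LOWER = set(string.ascii_lowercase)
ALNUM = LOWER | set(string.digits)

# Which token each accepting state reports. States 1 and 2 ("f", "fo") are
# proper prefixes of the keyword and therefore still plain identifiers.
ACCEPTING = {1: "identifier", 2: "identifier", 3: "for-keyword", 4: "identifier"}


def step(state, ch):
    """Transition function; returns None for the error state."""
    if state == 0:
        if ch == "f":
            return 1
        return 4 if ch in LOWER else None
    expected = {1: "o", 2: "r"}.get(state)  # next keyword letter, if any
    if ch == expected:
        return state + 1
    return 4 if ch in ALNUM else None


def classify(word):
    """Run the DFA over word and report the token it denotes, or None."""
    state = 0
    for ch in word:
        state = step(state, ch)
        if state is None:
            return None
    return ACCEPTING.get(state)


for word in ["for", "for8", "fo", "x9", "9x"]:
    print(word, classify(word))
```

Because state 3 is the only state that reports the keyword, any extra character after for moves the automaton to state 4 and the word becomes an identifier.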

Scanner Code for DFA
For each transition in a DFA, such as the transition from state s1 to state s2 on character c, generate code:

    s1: current_character = get_next_character();
        ...
        if ( current_character == 'c' ) goto s2;
    s2: current_character = get_next_character();

Scanner Code for DFA using Transition Table T

    state = initial_state;
    current_character = get_next_character();
    while ( true ) {
        next_state = T[state,current_character];
        if ( next_state == ERROR )
            break;
        state = next_state;
        current_character = get_next_character();
        if ( current_character == EOF )
            break;
    };
    if ( is_final_state(state) )
        `we have a valid token'
    else
        `report an error'
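The table-driven loop can be written in Python as follows, using the transition table for (abc+)+. This is a sketch with a few assumptions: the table is stored as nested dictionaries, a missing entry plays the role of the error state Ø, state 1 is taken as the start state and state 4 as the only final state, and a string is rejected as soon as the error state is reached.

```python
ERROR = None  # stands for the error state

# T[state][character] -> next state; absent entries mean ERROR.
T = {
    1: {"a": 2},
    2: {"b": 3},
    3: {"c": 4},
    4: {"a": 2, "c": 4},
}
INITIAL_STATE = 1
FINAL_STATES = {4}


def accepts(text):
    """True if the DFA ends in a final state after consuming all of text."""
    state = INITIAL_STATE
    for current_character in text:  # the end of the string plays the role of EOF
        next_state = T[state].get(current_character, ERROR)
        if next_state is ERROR:
            return False
        state = next_state
    return state in FINAL_STATES


for text in ["abc", "abccabc", "ab", "abca"]:
    print(text, accepts(text))
```

Changing the recognized language only requires changing the table, not the loop, which is the main advantage over hand-written switch statements.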

Non-deterministic Finite Automaton (NFA)
A DFA is very difficult to construct directly from an RE. An NFA is similar to a DFA but it also permits:
- multiple transitions over the same character, and
- transitions over ε

For example, the transition table of an NFA for a*(a|b):

state  a      b  ε
1      Ø      3  2
2      {1,3}  Ø  Ø
3      Ø      Ø  Ø
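An NFA can be simulated without backtracking by tracking the set of all states it could be in. The sketch below encodes the transition table for a*(a|b), writing ε as the empty string; it assumes state 1 is the start state and state 3 the final state, which the table does not state explicitly.

```python
# NFA[state][label] -> set of next states; the label "" is an epsilon move.
NFA = {
    1: {"b": {3}, "": {2}},
    2: {"a": {1, 3}},
    3: {},
}
START = 1
FINAL = {3}


def epsilon_closure(states):
    """All states reachable from states using only epsilon moves."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in NFA[state].get("", set()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(text):
    current = epsilon_closure({START})
    for ch in text:
        moved = set()
        for state in current:
            moved |= NFA[state].get(ch, set())
        current = epsilon_closure(moved)
    return bool(current & FINAL)


for text in ["a", "b", "aab", "", "ba"]:
    print(text or "(empty)", accepts(text))
```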

Combined NFA for several tokens

How Scanner Generators Work
Translate REs into a finite state machine. Done in three steps:
1. translate REs into a non-deterministic finite automaton (NFA)
2. translate the NFA into a deterministic finite automaton (DFA)
3. optimize the DFA (optional)
We’ll study only step 1.

Converting RE to NFA
The following rules construct NFAs with only one final state, one rule for each form of RE: a, ε, st, s | t, s*
[Figure: NFA construction for each rule]
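A Thompson-style construction is one common way to realize such rules; the exact shape of the automata in the lecture's diagrams may differ from this sketch. Every builder below returns an NFA fragment with one start state and one final state, and all names (NFA, symbol, concat, alt, star, accepts) are chosen here for illustration. The label "" stands for ε.

```python
import itertools

_state_ids = itertools.count()


class NFA:
    """NFA fragment: one start state, one final state, a list of edges."""

    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges  # list of (source, label, target)


def symbol(a):
    s, f = next(_state_ids), next(_state_ids)
    return NFA(s, f, [(s, a, f)])


def epsilon():
    return symbol("")


def concat(m, n):
    # st: link the final state of s to the start state of t.
    return NFA(m.start, n.final, m.edges + n.edges + [(m.final, "", n.start)])


def alt(m, n):
    # s | t: a new start branches to both, both join in a new final state.
    s, f = next(_state_ids), next(_state_ids)
    extra = [(s, "", m.start), (s, "", n.start), (m.final, "", f), (n.final, "", f)]
    return NFA(s, f, m.edges + n.edges + extra)


def star(m):
    # s*: allow skipping s entirely or looping back for another round.
    s, f = next(_state_ids), next(_state_ids)
    extra = [(s, "", m.start), (s, "", f), (m.final, "", m.start), (m.final, "", f)]
    return NFA(s, f, m.edges + extra)


def closure(nfa, states):
    result, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in nfa.edges:
            if src == q and label == "" and dst not in result:
                result.add(dst)
                stack.append(dst)
    return result


def accepts(nfa, text):
    current = closure(nfa, {nfa.start})
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges if src in current and label == ch}
        current = closure(nfa, moved)
    return nfa.final in current


# (a|b)*ac
nfa = concat(concat(star(alt(symbol("a"), symbol("b"))), symbol("a")), symbol("c"))
for text in ["ac", "abac", "ab"]:
    print(text, accepts(nfa, text))
```

Each rule adds at most two states, so the size of the NFA grows linearly with the size of the RE.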

NFA for the regular expression (a|b)*ac

Converting RE shorthands to NFA
Example: [0-9]+ generates strings such as 0120

Advantages and Disadvantages of NFA
Advantages:
- Easy to construct from an RE
Disadvantages:
- Big size, i.e. a large number of states
- Large memory space
- Needs backtracking to recognize a string, since many moves may exist for one input character and for ε transitions
- Long time for recognition
