Lexical Analyzer (Checker)

Lexical Analyzer
The lexical analyzer reads the source program character by character to produce tokens. Normally a lexical analyzer does not return the whole list of tokens in one shot; it returns a token each time the parser asks for one.

Tokens, Lexemes, and Patterns
Tokens include keywords, operators, identifiers, constants, literal strings, and punctuation symbols, e.g. identifier, number, addop, assgop.
A lexeme is a sequence of characters in the source program representing a token, e.g. newval, oldval.
A pattern is a rule describing the set of lexemes that can represent a particular token, e.g. identifier represents the set of strings that start with a letter and continue with letters and digits.
Regular expressions are widely used to specify patterns.

Attributes
Since a token can represent more than one lexeme, attributes provide additional information about tokens. For simplicity, a token may have a single attribute. For an identifier, the attribute is a pointer to the symbol table.
Examples of some attributes:
<id, attr> where attr is a pointer to the symbol-table entry
<assgop, _> no attribute is needed (there is only one assignment operator)
<num, val> where val is the actual value of the number
A token together with its attribute uniquely identifies a lexeme.
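As a sketch of how a <token, attribute> pair might be represented in C (the type names, enum members, and union fields below are assumptions for illustration, not taken from the slides):

/* Illustrative representation of a token together with its attribute. */
typedef enum { TOK_ID, TOK_ASSGOP, TOK_NUM, TOK_RELOP } TokenName;

typedef struct {
    TokenName name;
    union {
        int    symtab_index;   /* <id, attr>: index of the symbol-table entry */
        double value;          /* <num, val>: actual value of the number */
        int    relop_code;     /* e.g. LT, LE, EQ, NE, GT, GE */
        /* <assgop, _>: no attribute needed */
    } attribute;
} Token;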

Strings and Languages
Alphabet: any finite set of symbols (e.g. ASCII, the binary alphabet, or a set of tokens)
String: a finite sequence of symbols drawn from an alphabet
Language: a set of strings over a fixed alphabet

Operations on Languages
Union: L ∪ M = { s | s is in L or s is in M }
Concatenation: LM = { st | s is in L and t is in M }
Kleene closure: L* = L^0 ∪ L^1 ∪ L^2 ∪ … (zero or more concatenations of L)
Positive closure: L+ = L^1 ∪ L^2 ∪ … (one or more concatenations of L)
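For instance, if L = {a, b} and M = {c}, then L ∪ M = {a, b, c}, LM = {ac, bc}, L* = {ε, a, b, aa, ab, ba, bb, …}, and L+ is the same set without ε.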

Regular Expressions
We can give "names" to regular expressions. Convention: names are written in boldface (to distinguish them from symbols).
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

Notational Shorthands
One or more instances: r+ denotes rr*
Zero or one instance: r? denotes r | ε
Character classes: [a-z] denotes a | b | … | z
digit → [0-9]
digits → digit+
optional_fraction → ( . digits )?
num → digits optional_fraction
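As an illustration only (POSIX regular expressions are not part of the slides, and the exact patterns below are an assumption), the id and num definitions can be tried out with the C library's regcomp/regexec:

#include <regex.h>
#include <stdio.h>

/* Illustrative patterns modeled on the slide's definitions:
   id  = letter ( letter | digit )*
   num = digits optional_fraction                              */
static const char *ID_PATTERN  = "^[A-Za-z][A-Za-z0-9]*$";
static const char *NUM_PATTERN = "^[0-9]+(\\.[0-9]+)?$";

static int matches(const char *pattern, const char *text) {
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;                       /* treat a bad pattern as "no match" */
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}

int main(void) {
    printf("%d\n", matches(ID_PATTERN,  "newval"));   /* 1: valid identifier */
    printf("%d\n", matches(NUM_PATTERN, "3.14"));     /* 1: digits . digits */
    printf("%d\n", matches(NUM_PATTERN, "3."));       /* 0: fraction needs digits */
    return 0;
}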

Limitations
Regular expressions cannot describe balanced or nested constructs, for example the set of all valid strings of balanced parentheses. Such constructs can be described with a context-free grammar (CFG).

Grammar Fragment (Pascal)
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | num

Related Regular Expression Definitions
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
id → letter ( letter | digit )*
num → digit+ ( . digit+ )?
ws → delim+
delim → blank | tab | newline

Tokens and Attributes

Regular Expression    Token    Attribute Value
ws                    -        -
if                    if       -
then                  then     -
else                  else     -
id                    id       pointer to table entry
num                   num      pointer to table entry
<                     relop    LT
<=                    relop    LE
=                     relop    EQ
<>                    relop    NE
>                     relop    GT
>=                    relop    GE

Transition Diagrams
A transition diagram is a stylized flowchart consisting of states connected by edges.
Edges leaving a state s are labeled with the input characters that may occur after reaching state s.
The diagrams are assumed to be deterministic.
There is one start state and at least one accepting (final) state.

Transition Diagram for “relop”
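A minimal hand-coded sketch of that diagram (the enum names, function shape, and lookahead handling are assumptions for illustration, not taken from the slides):

/* Illustrative, self-contained sketch of the "relop" transition diagram.
   It inspects the characters at s, reports which relational operator
   starts there, and how many characters were consumed.                  */
typedef enum { RELOP_NONE, LT, LE, EQ, NE, GT, GE } Relop;

Relop scan_relop(const char *s, int *length) {
    switch (s[0]) {
    case '<':                                          /* saw '<' */
        if (s[1] == '=') { *length = 2; return LE; }   /* "<=" */
        if (s[1] == '>') { *length = 2; return NE; }   /* "<>" */
        *length = 1; return LT;                        /* other: retract one char */
    case '=':
        *length = 1; return EQ;
    case '>':                                          /* saw '>' */
        if (s[1] == '=') { *length = 2; return GE; }   /* ">=" */
        *length = 1; return GT;
    default:
        *length = 0; return RELOP_NONE;                /* not a relational operator */
    }
}

For example, scan_relop("<=", &n) returns LE with n == 2, which corresponds to following the '<' edge and then the '=' edge to an accepting state of the diagram.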

Identifiers and Keywords
Identifiers and keywords share a transition diagram. After reaching the accepting state, the code determines whether the lexeme is a keyword or an identifier, as in the sketch below.
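One common way to make that decision is a small keyword table consulted once the lexeme has been collected; in this sketch the table contents, token codes, and function name are illustrative assumptions:

#include <string.h>

/* Illustrative keyword check performed at the accepting state of the
   shared identifier/keyword diagram.                                  */
enum { TOK_IF = 256, TOK_THEN, TOK_ELSE, TOK_ID };

static const struct { const char *text; int token; } keywords[] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};

int classify_lexeme(const char *lexeme) {
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i].text) == 0)
            return keywords[i].token;   /* reserved word */
    return TOK_ID;                      /* ordinary identifier */
}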

Numbers

Finding the Next Token

token nexttoken(void)
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            if (c == ' ' || c == '\t' || c == '\n') {
                state = 0;
                lexeme_beginning++;
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = next_td();
            break;
        /* ... other cases here ... */
        }
    }
}

Trying Transition Diagrams

int next_td(void)
{
    switch (start) {
    case 0:  start = 9;  break;
    case 9:  start = 20; break;
    case 20: start = 25; break;
    case 25: recover();  break;
    default: error("invalid start state");
    }
    /* Possibly additional actions here */
    return start;
}

Finite Automata
Finite automata are generalized transition diagrams that act as recognizers for a language.
They can be nondeterministic (NFA) or deterministic (DFA).
NFAs can have ε-transitions; DFAs cannot.
NFAs can have multiple edges with the same symbol leaving a state; DFAs cannot.
Both can recognize exactly what regular expressions can denote.

NFAs
A set of states S
A set of input symbols Σ (the input alphabet)
A transition function move that maps (state, symbol) pairs to sets of states
A single start state s0
A set of accepting (or final) states F
An NFA accepts a string s if and only if there exists a path from the start state to an accepting state such that the edge labels along the path spell out s.

NFA (Example)
The language recognized by this NFA is (a|b)*ab.
0 is the start state s0; {2} is the set of final states F; Σ = {a, b}; S = {0, 1, 2}.
(The transition graph of the NFA is shown on the slide.)

Transition Tables

State    Input Symbol
         a        b
0        {0,1}    {0}
1        ---      {2}
2        ---      {3}

DFAs
No state has an ε-transition.
For each state s and input symbol a, there is at most one edge labeled a leaving s.

Example: r = (a|b)*abb

Functions ε-closure and move
ε-closure(s) is the set of NFA states reachable from NFA state s on ε-transitions alone.
move(T, a) is the set of NFA states to which there is a transition on input a from some NFA state s in T.

Constructed DFA

Simulating a DFA

s := s0
c := nextchar
while c != eof do
    s := move(s, c)
    c := nextchar
end
if s is in F then return "yes"
else return "no"
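A concrete sketch of this loop in C (the transition table for (a|b)*abb, the DEAD state, and the function names below are assumptions used only for illustration):

#include <stdio.h>

/* Illustrative DFA for (a|b)*abb with states 0..3, accepting state 3.
   A separate DEAD state (-1) absorbs characters outside {a, b}.        */
#define DEAD -1

static int move_dfa(int s, char c) {
    static const int table[4][2] = {
        /* state:   on 'a'  on 'b' */
        /* 0 */    { 1,      0 },
        /* 1 */    { 1,      2 },
        /* 2 */    { 1,      3 },
        /* 3 */    { 1,      0 },
    };
    if (s == DEAD || (c != 'a' && c != 'b')) return DEAD;
    return table[s][c == 'b'];
}

static int dfa_accepts(const char *input) {
    int s = 0;                              /* s := s0 */
    for (const char *p = input; *p; p++)    /* while c != eof */
        s = move_dfa(s, *p);                /*     s := move(s, c) */
    return s == 3;                          /* is s in F? */
}

int main(void) {
    printf("%d\n", dfa_accepts("aababb"));  /* 1: ends in abb */
    printf("%d\n", dfa_accepts("abab"));    /* 0 */
    return 0;
}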

Simulating an NFA

S := ε-closure({s0})
a := nextchar
while a != eof do
    S := ε-closure(move(S, a))
    a := nextchar
end
if S ∩ F != Ø then return "yes"
else return "no"
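Under the same caveat (illustrative only), state sets can be represented as bit masks, which makes ε-closure and move straightforward to write; the tiny NFA encoded below (Thompson construction for a|b) is an assumption chosen for demonstration, not one of the slide examples:

#include <stdio.h>

/* NFA for a|b:
     0 -eps-> 1, 0 -eps-> 3, 1 -a-> 2, 3 -b-> 4, 2 -eps-> 5, 4 -eps-> 5
   Start state 0, accepting state 5. Bit i of a Set means state i is present. */
#define NSTATES 6
typedef unsigned Set;

static const Set eps_edges[NSTATES] = { 1u<<1 | 1u<<3, 0, 1u<<5, 0, 1u<<5, 0 };
static const Set a_edges[NSTATES]   = { 0, 1u<<2, 0, 0, 0, 0 };
static const Set b_edges[NSTATES]   = { 0, 0, 0, 1u<<4, 0, 0 };

static Set eps_closure(Set T) {           /* states reachable on eps alone */
    Set closure = T, frontier = T;
    while (frontier) {
        Set next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s)) next |= eps_edges[s];
        frontier = next & ~closure;       /* only states not seen yet */
        closure |= next;
    }
    return closure;
}

static Set move_nfa(Set T, char c) {      /* one-symbol transitions from T */
    const Set *edges = (c == 'a') ? a_edges : (c == 'b') ? b_edges : NULL;
    Set result = 0;
    if (edges)
        for (int s = 0; s < NSTATES; s++)
            if (T & (1u << s)) result |= edges[s];
    return result;
}

static int nfa_accepts(const char *input) {
    Set S = eps_closure(1u << 0);               /* S := eps-closure({s0}) */
    for (const char *p = input; *p; p++)
        S = eps_closure(move_nfa(S, *p));       /* S := eps-closure(move(S, a)) */
    return (S & (1u << 5)) != 0;                /* S intersect F nonempty? */
}

int main(void) {
    printf("%d %d %d\n", nfa_accepts("a"), nfa_accepts("b"), nfa_accepts("ab"));  /* 1 1 0 */
    return 0;
}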

Space/Time Tradeoff (Worst Case)

        Space        Time
NFA     O(|r|)       O(|r| * |x|)
DFA     O(2^|r|)     O(|x|)

where |r| is the length of the regular expression and |x| is the length of the input string.

Simulating a Regular Expression
First use Thompson's construction to convert the regular expression to an NFA. Then there are two choices:
1. Use the subset construction to convert the NFA to a DFA, then simulate the DFA.
2. Simulate the NFA directly.

Some Other Issues in Lexical Analyzer
The lexical analyzer has to recognize the longest possible string. Example: for the identifier newval, the scanner passes through the prefixes n, ne, new, newv, newva, newval and must return the whole lexeme newval.
What marks the end of a token? Is there any character which marks the end of a token? In general the scanner only discovers the end of a token when it reads a character that cannot extend the current lexeme, and that character must be pushed back for the next token.
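A minimal sketch of that lookahead-and-retract behavior (the function name and the use of stdio's ungetc are assumptions for illustration):

#include <ctype.h>
#include <stdio.h>

/* Illustrative longest-match scan of an identifier from a stream:
   keep reading while the character can extend the lexeme, then push
   the first non-extending character back for the next token.         */
int scan_identifier(FILE *in, char *lexeme, int max) {
    int c = fgetc(in);
    if (!isalpha(c)) {                 /* identifiers must start with a letter */
        if (c != EOF) ungetc(c, in);
        return 0;
    }
    int len = 0;
    while (c != EOF && (isalpha(c) || isdigit(c)) && len < max - 1) {
        lexeme[len++] = (char)c;       /* n, ne, new, newv, ... */
        c = fgetc(in);
    }
    if (c != EOF) ungetc(c, in);       /* retract the character that ended the token */
    lexeme[len] = '\0';
    return len;                        /* length of the longest identifier found */
}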

Some Other Issues in Lexical Analyzer (cont.)
Skipping comments: normally we do not return a comment as a token. Comments are processed and discarded by the lexical analyzer, so they do not complicate the syntax of the language.
Symbol table interface: the symbol table holds information about tokens (at least the lexemes of identifiers). We must decide how to implement the symbol table and which operations to support. A typical choice is a hash table (open addressing or chaining), supporting insertion of a lexeme and lookup of a token's entry from its lexeme.
Positions of the tokens in the source file should also be recorded, for error handling.
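A small chained hash table sketch for identifier lexemes (the hash function, table size, and names below are assumptions, not prescribed by the slides; error checks are omitted for brevity):

#include <stdlib.h>
#include <string.h>

/* Illustrative chained hash table mapping identifier lexemes to entries. */
#define BUCKETS 211

typedef struct Entry {
    char         *lexeme;
    int           first_line;    /* position info kept for error reporting */
    struct Entry *next;          /* chaining for collisions */
} Entry;

static Entry *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Return the entry for a lexeme, inserting it on first sight. */
Entry *lookup_or_insert(const char *lexeme, int line) {
    unsigned h = hash(lexeme);
    for (Entry *e = table[h]; e != NULL; e = e->next)
        if (strcmp(e->lexeme, lexeme) == 0)
            return e;                      /* already in the symbol table */
    Entry *e = malloc(sizeof *e);
    size_t n = strlen(lexeme) + 1;
    e->lexeme = malloc(n);
    memcpy(e->lexeme, lexeme, n);
    e->first_line = line;
    e->next = table[h];                    /* insert at head of the chain */
    table[h] = e;
    return e;
}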