Lexical Analysis, Regular Expressions & Finite State Machines.


Processing English
Consider the following two sentences:
    Hi, I am 22 years old. I come from Alabama.
    22 come Alabama I, old from am. Hi years I.
Are they both correct? How do you know? Both use the same words, numbers and punctuation.
What did you do first?
    1. Find words, numbers and punctuation
    2. Then, check order (grammar rules)

Finding Words and Numbers
How did you find words, numbers and punctuation? You have a definition of what each is, or looks like. For example, what is a number? A word?
Although you are a bit more agile, the process was:
    1. Start with the first character
    2. If letter, assume word; if digit, assume number
    3. Scan left to right, 1 character at a time, until a punctuation mark (space, comma, etc.)
    4. Recognize the word or number
    5. If no more characters, done; otherwise return to 1
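The five steps above can be sketched in a few lines of Python. This is only an illustration: the function name `scan_words_and_numbers` and the kind labels WORD, NUMBER and PUNCT are assumptions made here, and a run of letters and digits is classified by its first character alone, as step 2 suggests.

```python
def scan_words_and_numbers(text):
    """Split text into (kind, lexeme) pairs, one character at a time."""
    tokens = []
    i = 0
    while i < len(text):                  # step 5: repeat while characters remain
        ch = text[i]                      # step 1: first unread character
        if ch.isspace():                  # spaces only separate tokens
            i += 1
            continue
        if ch.isalpha():                  # step 2: a letter starts a word
            kind = "WORD"
        elif ch.isdigit():                # step 2: a digit starts a number
            kind = "NUMBER"
        else:                             # anything else is punctuation
            tokens.append(("PUNCT", ch))
            i += 1
            continue
        start = i
        while i < len(text) and text[i].isalnum():   # step 3: scan left to right
            i += 1
        tokens.append((kind, text[start:i]))         # step 4: recognize the token
    return tokens


print(scan_words_and_numbers("Hi, I am 22 years old."))
```

Running it on the first sentence yields the words, the number 22 and the two punctuation marks in order; checking their order against grammar rules is a separate, later step.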

Processing Code
How do you process the following? What are the main parts in which to break the input?

    void quote() {
        print( "To iterate is human, to recurse divine." + " - L. Peter Deutsch" );
    }

    Schemes:
      childOf(X,Y)
      marriedTo(X,Y)
    Facts:
      marriedTo('Zed','Bea').
      marriedTo('Jack','Jill').
      childOf('Jill','Zed').
      childOf('Sue','Jack').
    Rules:
      childOf(X,Y) :- childOf(X,Z), marriedTo(Y,Z).
      marriedTo(X,Y) :- marriedTo(Y,X).
    Queries:
      marriedTo('Bea','Zed')?
      childOf('Jill','Bea')?

    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))

Example
    def addABC ( x ) :
        s = "ABC"
        return x + s
    addABC ( input ( "String: " ) )

What are the Parts?
They are called TOKENS. The process is similar to English processing.
Lexical Analysis
    Input: A program in some language
    Output: A list of tokens (type, value, location)

Example Revisited
Sample Input:
    def addABC(x):
        s = "ABC"
        return x + s
    addABC(input("String: "))
Sample Output:
    (FUNDEF,"def",1)
    (ID,"addABC",1)
    (LEFT_PAREN,"(",1)
    (ID,"x",1)
    (RIGHT_PAREN,")",1)
    (COLON,":",1)
    (ID,"s",2)
    (ASSIGN,"=",2)
    (STRING,"'ABC'",2)
    (FUNRET,"return",3)
    (ID,"x",3)
    (OPERATOR,"+",3)
    (ID,"s",3)
    (ID,"addABC",4)
    (LEFT_PAREN,"(",4)
    …

Program Compilation
Lexical analysis is the first step of the process.
[Diagram labels: Program, Compiler, Code; Lexical Analyzer -> Tokens -> Parser (Syntax Analysis) -> Internal Data -> Code Generator -> Code, or Interpreter (executed directly); tokens include keywords, string literals, variables, …; error messages]

Token Specification
Regular Expressions: pattern description for strings
    Concatenation: abc -> "abc"
    Boolean OR: ab|ac -> "ab", "ac"
    Kleene closure: ab* -> "a", "ab", "abbb", etc.
    Optional: ab?c -> "ac", "abc"
    One or more: ab+ -> "ab", "abbb"
    Group using (): (a|b)c -> "ac", "bc"
    (a|b)*c -> "c", "ac", "bc", "bac", "abaaabbbabbaaaaac", etc.
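These operators can be tried directly with Python's standard `re` module, which uses the same notation for them. The sketch below checks each pattern against the strings listed on the slide; the `EXAMPLES` table and the helper `all_match` are names chosen here for illustration.

```python
import re

# Each pattern from the slide with the strings it is said to describe.
EXAMPLES = {
    "abc": ["abc"],
    "ab|ac": ["ab", "ac"],
    "ab*": ["a", "ab", "abbb"],
    "ab?c": ["ac", "abc"],
    "ab+": ["ab", "abbb"],
    "(a|b)c": ["ac", "bc"],
    "(a|b)*c": ["c", "ac", "bc", "bac", "abaaabbbabbaaaaac"],
}


def all_match(pattern, strings):
    """True if the whole of every string matches the pattern."""
    return all(re.fullmatch(pattern, s) is not None for s in strings)


for pattern, strings in EXAMPLES.items():
    print(pattern, all_match(pattern, strings))

# Counter-examples: "+" needs at least one b, "?" allows at most one.
print(re.fullmatch("ab+", "a") is None)
print(re.fullmatch("ab?c", "abbc") is None)
```

`re.fullmatch` is used rather than `re.search` so that the pattern must describe the entire string, which is what token recognition needs.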

RegEx Extensions
    Exactly n: a³b+ -> "aaab", "aaabb", …
    [A-Z] = A|B|…|Z
    [ABC] = A|B|C
    [~aA] = any character but "a" or "A"
    \ = escape character (e.g., \* -> "*")
    Whitespace characters: \s, \t, \n, \v
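Concrete regex engines spell some of these extensions differently from the slide. In Python's `re` module, for instance, "exactly n" is written `{n}` and a negated character class is written `[^...]` rather than `[~...]`. The sketch below uses that Python spelling; the `CHECKS` table and `matches` helper are illustrative names, not part of the course material.

```python
import re

# (pattern in Python syntax, test string, whether it should match)
CHECKS = [
    (r"a{3}b+", "aaab", True),    # exactly three a's, then one or more b's
    (r"a{3}b+", "aab", False),
    (r"[A-Z]", "Q", True),        # any one uppercase letter
    (r"[ABC]", "D", False),       # only A, B or C
    (r"[^aA]", "b", True),        # any character but "a" or "A"
    (r"[^aA]", "A", False),
    (r"\*", "*", True),           # escaped star is a literal "*"
    (r"\s", "\t", True),          # \s covers tab, newline, space, ...
    (r"\s", "x", False),
]


def matches(pattern, text):
    return re.fullmatch(pattern, text) is not None


for pattern, text, expected in CHECKS:
    print(f"{pattern!r} on {text!r}: {matches(pattern, text)} (expected {expected})")
```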

Token Recognition
Finite State Machine: a DFSM is a 5-tuple (Σ, S, s0, δ, F)
    Σ: finite, non-empty set of symbols (input alphabet)
    S: finite, non-empty set of states
    s0: member of S designated as start state
    δ: state-transition function, δ: S x Σ -> S
    F: subset of S (final states, may be empty)
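The 5-tuple maps naturally onto a small data structure. The following sketch assumes a class named `DFSM` with one field per component and an `accepts` method; these names, the state labels and the example machine for ab* are choices made here for illustration. Because δ must be defined for every (state, symbol) pair, the example includes an explicit non-accepting "dead" state.

```python
from dataclasses import dataclass


@dataclass
class DFSM:
    alphabet: frozenset   # Σ
    states: frozenset     # S
    start: str            # s0
    delta: dict           # δ as a table: (state, symbol) -> state
    finals: frozenset     # F

    def accepts(self, text):
        """Run the machine over text; accept if it ends in a final state."""
        state = self.start
        for symbol in text:
            if symbol not in self.alphabet:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals


# Machine for ab*: one a followed by any number of b's.
AB_STAR = DFSM(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "q1", "dead"}),
    start="q0",
    delta={
        ("q0", "a"): "q1", ("q0", "b"): "dead",
        ("q1", "a"): "dead", ("q1", "b"): "q1",
        ("dead", "a"): "dead", ("dead", "b"): "dead",
    },
    finals=frozenset({"q1"}),
)

for text in ["a", "abbb", "", "ba", "aab"]:
    print(repr(text), AB_STAR.accepts(text))
```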

FSM & RegEx
[State diagrams for the regular expressions abc, a(b|c), ab* and (a(b?c))+, with transitions labeled a, b and c]
Note the special double-circle designation of a final/accepting state.
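Since the diagrams did not survive in this transcript, here is one possible set of machines for the four expressions, written as transition tables. The state numbers are arbitrary choices made here, a missing table entry is treated as "reject", and the second element of each pair plays the role of the double-circled states. The sketch also cross-checks each table against Python's `re` module on all short strings over a, b, c.

```python
import re
from itertools import product

# pattern -> (transitions keyed by (state, symbol), set of final states)
MACHINES = {
    "abc": ({(0, "a"): 1, (1, "b"): 2, (2, "c"): 3}, {3}),
    "a(b|c)": ({(0, "a"): 1, (1, "b"): 2, (1, "c"): 2}, {2}),
    "ab*": ({(0, "a"): 1, (1, "b"): 1}, {1}),
    "(a(b?c))+": (
        {(0, "a"): 1, (1, "b"): 2, (1, "c"): 3, (2, "c"): 3, (3, "a"): 1},
        {3},
    ),
}


def accepts(transitions, finals, text, start=0):
    state = start
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:          # no transition: the machine rejects
            return False
    return state in finals


def agrees_with_regex(pattern, max_len=5):
    """Compare the table with re.fullmatch on every string up to max_len."""
    transitions, finals = MACHINES[pattern]
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            text = "".join(chars)
            expected = re.fullmatch(pattern, text) is not None
            if accepts(transitions, finals, text) != expected:
                return False
    return True


for pattern in MACHINES:
    print(pattern, agrees_with_regex(pattern))
```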

Finite State Transducer
Extended FSM:
    Γ: finite, non-empty set of symbols (output alphabet)
    δ: state-transition function, δ: S x Σ -> S x Γ
An FST consumes input symbols and emits output symbols.
A lexical analyzer consumes raw characters and emits tokens.
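A tiny transducer makes the extended δ concrete. In the sketch below, which is an invented example rather than anything from the course, the machine has two states (OUT and IN), reads one character at a time, and emits one output symbol per input symbol: B at the beginning of a word, I inside a word, O for a space. The output alphabet Γ is therefore {B, I, O}.

```python
def delta(state, symbol):
    """δ: S x Σ -> S x Γ, returned as (next_state, output_symbol)."""
    if symbol == " ":
        return "OUT", "O"        # a space always leaves us outside a word
    if state == "OUT":
        return "IN", "B"         # first character of a new word
    return "IN", "I"             # still inside the same word


def transduce(text, start="OUT"):
    state, output = start, []
    for symbol in text:
        state, out = delta(state, symbol)
        output.append(out)
    return "".join(output)


print(transduce("to be"))        # BIOBI
```

A lexical analyzer follows the same scheme at a coarser grain: instead of one output symbol per character, it emits a token each time a lexeme is completed.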

CS 236 Coolness Factor!
Design our own language
    Subset of Datalog (LP-like)
Build an interpreter for our language
    Lexical Analyzer (Project 1)
    Parser (Project 2)
    Interpreter (Projects 3 and 4)
    Optimization (Project 5)

Designing a Language
Define the tokens
    Elements of the language, punctuation, etc.
    For example, what are they in C++?
Recognize the tokens (lexical analysis)
Define the grammar
    Forms of correct sentences
    For example, what are they in C++?
Recognize the grammar (parsing)
Interpret and execute the program
C++ is a bit too complicated for us…

Varied World Views
    fct personlist siblings(person x) { return x's siblings }
    fct int square(int x) { return x * x }
    fct boolean succeeds(person x) { if studies(x) return T else return F }

    fct boolean sibling(person x, person y) { if y is x's sibling return T else return F }
    fct boolean square(int x, int y) { if y == x * x return T else return F }
    fct boolean succeeds(person x) { if studies(x) return T else return F }
Look up table or oracle
No concerns with efficiency

Logic Programming
Assume: all functions are Boolean
Compute using facts and rules
    Facts are the known true values of the functions
    Rules express relations among functions
Example: studies(x), succeeds(x)
    Facts: studies(Matt), studies(Jenny)
    Rule: succeeds(x) :- studies(x)
Closed-world Assumption

Logic Programming
Computing is like issuing queries
    First check if it can be answered with facts
    Second check if rules can be applied
Examples
    studies(Alex)? NO (neither facts nor rules to establish it)
    studies(Matt)? YES (there is a fact about that)
    succeeds(Jenny)? YES (no fact, but there is a rule that if Jenny studies then she succeeds, and a fact that Jenny studies)
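The two-step lookup (facts first, then rules) and the closed-world assumption fit in a few lines. The sketch below hard-codes the facts and the single rule from the slides; the `FACTS` and `RULES` structures and the function `holds` are illustrative assumptions, and the representation only covers one-argument predicates with one-predicate rule bodies. Cyclic rules would make this naive recursion loop forever.

```python
FACTS = {("studies", "Matt"), ("studies", "Jenny")}

# head predicate -> list of body predicates; encodes succeeds(x) :- studies(x)
RULES = {"succeeds": ["studies"]}


def holds(predicate, person):
    """Answer the query predicate(person)? with True (YES) or False (NO)."""
    if (predicate, person) in FACTS:              # first: try the facts
        return True
    for body in RULES.get(predicate, []):         # second: try the rules
        if holds(body, person):
            return True
    return False                                  # closed world: unknown means NO


for query in [("studies", "Alex"), ("studies", "Matt"), ("succeeds", "Jenny")]:
    print(query, "YES" if holds(*query) else "NO")
```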

Functions of Several Arguments
Examples
    loves(x,y), parent(x,y), inclass(x,y)
    loves(x,y) :- married(x,y)
Computing
    parent(Christophe, Samuel)? Yes, if there is a fact that matches
    parent(Christophe, X)? Yes, if there is a value of X that would cause it to match a fact; returns the value of X
    loves(X, Y)? Yes, if there are values of X and Y that would make this true, either by matching a fact or via rules (e.g., married(Christophe, Isabelle)); returns the values of X and Y
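Matching a query that contains variables against facts can be sketched as follows. Everything concrete here is an assumption for illustration: the two facts are the ones the slide hypothesizes, variables are marked with a small `Var` class, and `RULES` only supports rules of the form head(x,y) :- body(x,y), like the loves/married rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    name: str


FACTS = [
    ("parent", ("Christophe", "Samuel")),
    ("married", ("Christophe", "Isabelle")),
]
RULES = {"loves": "married"}      # loves(x,y) :- married(x,y)


def match(args, fact_args):
    """Return variable bindings if args match the fact, else None."""
    bindings = {}
    for arg, value in zip(args, fact_args):
        if isinstance(arg, Var):
            # A repeated variable must bind to the same value each time.
            if bindings.setdefault(arg.name, value) != value:
                return None
        elif arg != value:
            return None
    return bindings


def query(predicate, args):
    """Return one bindings dict per way the query can be satisfied."""
    answers = []
    for name, fact_args in FACTS:
        if name == predicate and len(args) == len(fact_args):
            bindings = match(args, fact_args)
            if bindings is not None:
                answers.append(bindings)
    if predicate in RULES:                         # fall back on the rule body
        answers.extend(query(RULES[predicate], args))
    return answers


print(query("parent", ("Christophe", "Samuel")))      # [{}]
print(query("parent", ("Christophe", Var("X"))))      # [{'X': 'Samuel'}]
print(query("loves", (Var("X"), Var("Y"))))
```

An empty list means "No"; a list containing an empty dict means "Yes" with nothing to return, which is the case for a query without variables.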

When We Are Done
Sample Program:
    Schemes:
      snap(S,N,A,P)
      csg(C,S,G)
      cn(C,N)
      ncg(N,C,G)
    Facts:
      snap('12345','C. Brown','12 Apple St.',' ').
      snap('22222','P. Patty','56 Grape Blvd',' ').
      snap('33333','Snoopy','12 Apple St.',' ').
      csg('CS101','12345','A').
      csg('CS101','22222','B').
      csg('CS101','33333','C').
      csg('EE200','12345','B+').
      csg('EE200','22222','B').
    Rules:
      cn(C,N) :- snap(S,N,A,P),csg(C,S,G).
      ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).
    Queries:
      cn('CS101',Name)?
      ncg('Snoopy',Course,Grade)?
Sample Execution:
    cn('CS101',Name)? Yes(3)
      Name='C. Brown'
      Name='P. Patty'
      Name='Snoopy'
    ncg('Snoopy',Course,Grade)? Yes(1)
      Course='CS101', Grade='C'
Demo…
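To see where the sample answers come from, note that both rules join snap and csg on the shared student id S. The sketch below hard-codes the facts from the sample program as Python tuples and evaluates the two rules with comprehensions; it is a hand-written illustration of the join, not the interpreter the projects build, and the function names simply mirror the rule heads.

```python
# Facts copied from the sample program (last snap field as it appears there).
SNAP = [
    ("12345", "C. Brown", "12 Apple St.", " "),
    ("22222", "P. Patty", "56 Grape Blvd", " "),
    ("33333", "Snoopy", "12 Apple St.", " "),
]
CSG = [
    ("CS101", "12345", "A"),
    ("CS101", "22222", "B"),
    ("CS101", "33333", "C"),
    ("EE200", "12345", "B+"),
    ("EE200", "22222", "B"),
]


def cn():
    """cn(C,N) :- snap(S,N,A,P),csg(C,S,G)."""
    return {(c, n) for (s, n, a, p) in SNAP for (c, s2, g) in CSG if s == s2}


def ncg():
    """ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G)."""
    return {(n, c, g) for (s, n, a, p) in SNAP for (c, s2, g) in CSG if s == s2}


# cn('CS101',Name)?
names = sorted(n for (c, n) in cn() if c == "CS101")
print(f"Yes({len(names)})", names)

# ncg('Snoopy',Course,Grade)?
rows = sorted((c, g) for (n, c, g) in ncg() if n == "Snoopy")
print(f"Yes({len(rows)})", rows)
```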

Project 1: Lexical Analyzer
Define and find the tokens
Sample Input:
    Queries:
    IsInRoomAtDH('Snoopy',R,'M',H)
    #SchemesFactsRules
    .
Sample Output:
    (QUERIES,"Queries",1)
    (COLON,":",1)
    (ID,"IsInRoomAtDH",2)
    (LEFT_PAREN,"(",2)
    (STRING,"'Snoopy'",2)
    (COMMA,",",2)
    (ID,"R",2)
    (COMMA,",",2)
    (STRING,"'M'",2)
    (COMMA,",",2)
    (ID,"H",2)
    (RIGHT_PAREN,")",2)
    (COMMENT,"#SchemesFactsRules",3)
    (PERIOD,".",4)
    Total Tokens = 14
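As a rough prototype of what the project asks for, the following regex-driven sketch reproduces the sample output above. It is not the project specification: the token names SCHEMES, FACTS, RULES, COLON_DASH, Q_MARK and ERROR do not appear in the sample and are guesses, the patterns are assumptions, and strings are assumed not to span lines.

```python
import re

KEYWORDS = {"Schemes": "SCHEMES", "Facts": "FACTS",
            "Rules": "RULES", "Queries": "QUERIES"}

TOKEN_SPEC = [
    ("COMMENT", r"#[^\n]*"),          # from '#' to the end of the line
    ("STRING", r"'[^'\n]*'"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),
    ("COLON_DASH", r":-"),            # must be tried before COLON
    ("COLON", r":"),
    ("COMMA", r","),
    ("PERIOD", r"\."),
    ("Q_MARK", r"\?"),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("NEWLINE", r"\n"),
    ("SKIP", r"[ \t\r]+"),
    ("ERROR", r"."),                  # any other single character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def lex(source):
    """Return (type, value, line) tuples for a Datalog-like source string."""
    tokens, line = [], 1
    for m in MASTER.finditer(source):
        kind, text = m.lastgroup, m.group()
        if kind == "NEWLINE":
            line += 1
        elif kind != "SKIP":
            if kind == "ID":
                kind = KEYWORDS.get(text, "ID")   # special check for keywords
            tokens.append((kind, text, line))
    return tokens


def show(tokens):
    lines = [f'({kind},"{text}",{line})' for kind, text, line in tokens]
    lines.append(f"Total Tokens = {len(tokens)}")
    return "\n".join(lines)


SAMPLE = "Queries:\nIsInRoomAtDH('Snoopy',R,'M',H)\n#SchemesFactsRules\n.\n"
print(show(lex(SAMPLE)))
```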

Basic FST for Project 1
[State diagram; surviving labels: start, string (delimited by single quotes), ':' and ':-', white space, ident. or keywd., eof, error]
Special check for keywords (Schemes, Facts, Rules, Queries)

Implementing an FST

State in Variable

    state = START;
    input = readChar();
    while (state != ACCEPT) {
        if (state == START) {
            if (input == QUOTE) {
                input = readChar();
                state = STRING;
            } else if (input ==...) {
                ... other kinds of tokens ...
            }
        } else if (state == STRING) {
            if (input == QUOTE) {
                input = readChar();
                state = ACCEPT;
            } else {
                input = readChar();
                state = STRING;
            }
        }
    }

State in Position in Code

    input = readChar();          // begin in START state
    if (input == QUOTE) {
        input = readChar();      // now in STRING state
        while (input != QUOTE) {
            input = readChar();  // stay in STRING state
        }
        input = readChar();      // now in ACCEPT state
    } else if (input ==...) {
        ... other kinds of tokens ...
    }
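Both styles can be made runnable for the single case the slide spells out, a string delimited by quotes. The Python sketch below is an adaptation with several assumptions: `make_reader` stands in for `readChar()`, end of input is signaled by an empty string, an ERROR outcome is added so that an unterminated string does not loop forever, and the recognized lexeme is collected and returned.

```python
QUOTE = "'"
EOF = ""


def make_reader(text):
    """Return a read_char() function that yields EOF once text is used up."""
    chars = iter(text)
    return lambda: next(chars, EOF)


def string_with_state_variable(text):
    """State is held in a variable and dispatched on inside one loop."""
    read_char = make_reader(text)
    state, ch, lexeme = "START", read_char(), ""
    while state not in ("ACCEPT", "ERROR"):
        if state == "START":
            if ch == QUOTE:
                lexeme += ch
                ch = read_char()
                state = "STRING"
            else:
                state = "ERROR"          # other kinds of tokens are omitted
        elif state == "STRING":
            if ch == EOF:
                state = "ERROR"          # unterminated string
            else:
                lexeme += ch
                state = "ACCEPT" if ch == QUOTE else "STRING"
                ch = read_char()
    return lexeme if state == "ACCEPT" else None


def string_with_code_position(text):
    """State is implied by which line of code is currently executing."""
    read_char = make_reader(text)
    ch = read_char()                     # begin in START state
    if ch != QUOTE:
        return None                      # other kinds of tokens are omitted
    lexeme = ch
    ch = read_char()                     # now in STRING state
    while ch != QUOTE:
        if ch == EOF:
            return None                  # unterminated string
        lexeme += ch
        ch = read_char()                 # stay in STRING state
    return lexeme + ch                   # now in ACCEPT state


for source in ["'Snoopy',R", "'oops", "R"]:
    print(string_with_state_variable(source), string_with_code_position(source))
```

The state-variable form tends to scale better as more token kinds are added, since each new state is another branch in the same loop, while the position-in-code form is often easier to read for a single token.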