Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.

Slides:



Advertisements
Similar presentations
CSE 311 Foundations of Computing I
Advertisements

4b Lexical analysis Finite Automata
Nondeterministic Finite Automata CS 130: Theory of Computation HMU textbook, Chapter 2 (Sec 2.3 & 2.5)
Finite Automata CPSC 388 Ellen Walker Hiram College.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Exercise: Build Lexical Analyzer Part For these two tokens, using longest match, where first has the priority: binaryToken ::= ( z |1) * ternaryToken ::=
Winter 2007SEG2101 Chapter 81 Chapter 8 Lexical Analysis.
Lexical Analysis III Recognizing Tokens Lecture 4 CS 4318/5331 Apan Qasem Texas State University Spring 2015.
Finite Automata and Non Determinism
1 The scanning process Goal: automate the process Idea: –Start with an RE –Build a DFA How? –We can build a non-deterministic finite automaton (Thompson's.
Automating Construction of Lexers. Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code.
CSC 361Finite Automata1. CSC 361Finite Automata2 Formal Specification of Languages Generators Grammars Context-free Regular Regular Expressions Recognizers.
FORMAL LANGUAGES, AUTOMATA AND COMPUTABILITY
Regular Languages A language is regular over  if it can be built from ;, {  }, and { a } for every a 2 , using operators union ( [ ), concatenation.
CPSC 388 – Compiler Design and Construction Scanners – Finite State Automata.
Finite-State Machines with No Output Longin Jan Latecki Temple University Based on Slides by Elsa L Gunter, NJIT, and by Costas Busch Costas Busch.
Finite-State Machines with No Output
TM Design Universal TM MA/CSSE 474 Theory of Computation.
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Overview of Previous Lesson(s) Over View  An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
4b 4b Lexical analysis Finite Automata. Finite Automata (FA) FA also called Finite State Machine (FSM) –Abstract model of a computing entity. –Decides.
1 Course Overview PART I: overview material 1Introduction 2Language processors (tombstone diagrams, bootstrapping) 3Architecture of a compiler PART II:
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
Lexer input and Output i d 3 = 0 LF w i d 3 = 0 LF w id3 = 0 while ( id3 < 10 ) id3 = 0 while ( id3 < 10 ) lexer Stream of Char-s ( lazy List[Char] ) class.
CMSC 330: Organization of Programming Languages Finite Automata NFAs  DFAs.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
CS 208: Computing Theory Assoc. Prof. Dr. Brahim Hnich Faculty of Computer Sciences Izmir University of Economics.
Exercise 1 Consider a language with the following tokens and token classes: ID ::= letter (letter|digit)* LT ::= " " shiftL ::= " >" dot ::= "." LP ::=
Chapter 3 Regular Expressions, Nondeterminism, and Kleene’s Theorem Copyright © 2011 The McGraw-Hill Companies, Inc. Permission required for reproduction.
Exercise Solution for Exercise (a) {1,2} {3,4} a b {6} a {5,6,1} {6,2} {4} {3} {5,6} { } b a b a a b b a a b a,b b b a.
Donghyun (David) Kim Department of Mathematics and Physics North Carolina Central University 1 Chapter 1 Regular Languages Some slides are in courtesy.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Chapter 2 Scanning. Dr.Manal AbdulazizCS463 Ch22 The Scanning Process Lexical analysis or scanning has the task of reading the source program as a file.
using Deterministic Finite Automata & Nondeterministic Finite Automata
1 Topic 2: Lexing and Flexing COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer.
Overview of Previous Lesson(s) Over View  A token is a pair consisting of a token name and an optional attribute value.  A pattern is a description.
Chapter 5 Finite Automata Finite State Automata n Capable of recognizing numerous symbol patterns, the class of regular languages n Suitable for.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen.
Exercise: (aa)* | (aaa)* Construct automaton and eliminate epsilons.
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
Regular Languages Chapter 1 Giorgi Japaridze Theory of Computability.
1 Section 11.2 Finite Automata Can a machine(i.e., algorithm) recognize a regular language? Yes! Deterministic Finite Automata A deterministic finite automaton.
Set, Alphabets, Strings, and Languages. The regular languages. Clouser properties of regular sets. Finite State Automata. Types of Finite State Automata.
Deterministic Finite Automata Nondeterministic Finite Automata.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
1/29/02CSE460 - MSU1 Nondeterminism-NFA Section 4.1 of Martin Textbook CSE460 – Computability & Formal Language Theory Comp. Science & Engineering Michigan.
CIS Automata and Formal Languages – Pei Wang
Finite automate.
Lecture 2 Lexical Analysis
Finite-State Machines (FSMs)
Lexical analysis Finite Automata
Deterministic Finite Automata
Two issues in lexical analysis
Recognizer for a Language
Chapter 2 FINITE AUTOMATA.
Deterministic Finite Automata
Some slides by Elsa L Gunter, NJIT, and by Costas Busch
Non-Deterministic Finite Automata
Finite Automata.
4b Lexical analysis Finite Automata
4b Lexical analysis Finite Automata
CSE 311 Foundations of Computing I
Chapter 1 Regular Language
Lecture 5 Scanning.
Announcements - P1 part 1 due Today - P1 part 2 due on Friday Feb 1st
Presentation transcript:

Automating Construction of Lexers

Example in javacc TOKEN: { ( | | "_")* > | ( )* > | } SKIP: { " " | "\n" | "\t" } --> get automatically generated code for lexer! But how does javacc do it?

Finite Automaton (Finite State Machine)  - alphabet Q - states (nodes in the graph) q 0 - initial state (with '>' sign in drawing)  - transitions (labeled edges in the graph) F - final states (double circles)

Numbers with Decimal Point digit digit*. digit digit* What if the decimal part is optional?

Exercise Design a DFA which accepts all the numbers written in binary and divisible by 6. For example your automaton should accept the words 0, 110 (6 decimal) and (18 decimal).

Deterministic:  is a function Kinds of Finite State Automata Otherwise: non-deterministic

Interpretation of Non-Determinism For a given word (string), a path in automaton lead to accepting, another to a rejecting state Does the automaton accept in such case? –yes, if there exists an accepting path in the automaton graph whose symbols give that word Epsilon transitions: traversing them does not consume anything (empty word) More generally, transitions labeled by a word: traversing such transition consumes that entire word at a time

Regular Expressions and Automata Theorem: If L is a set of words, then it is a value of a regular expression if and only if it is the set of words accepted by some finite automaton. Algorithms: regular expression  automaton (important!) automaton  regular expression (cool)

Recursive Constructions Union Concatenation Star

Eliminating Epsilon Transitions

Exercise: (aa)* | (aaa)* Construct automaton and eliminate epsilons

Determinization: Subset Construction –keep track of a set of all possible states in which automaton could be –view this finite set as one state of new automaton Apply to (aa)* | (aaa)* –can also eliminate epsilons during determinization

Remark: Relations and Functions Relation r  B x C r = {..., (b,c1), (b,c2),... } Corresponding function: f : B -> P (C)2 C f = {... (b,{c1,c2})... } f(b) = { c | (b,c)  r } Given a state, next-state function returns the set of new states –for deterministic automaton, the set has exactly 1 element

Running NFA in Scala def  (q : State, a : Char) : Set[States] = {... } def  '(S : Set[States], a : Char) : Set[States] = { for (q1 <- S, q2 <-  (q1,a)) yield q2 } def accepts(input : MyStream[Char]) : Boolean = { var S : Set[State] = Set(q0) // current set of states while (!input.EOF) { val a = input.current S =  '(S,a) // next set of states } !(S.intersect(finalStates).isEmpty) }

Minimization: Merge States Only limit the freedom to merge (prove !=) if we have evidence that they behave differently (final/non-final, or leading to states shown !=) When we run out of evidence, merge the rest –merge the states in the previous automaton for (aa)* | (aaa)* Very special case: if successors lead to same states on all symbols, we know immediately we can merge –but there are cases when we can merge even if successors lead to merged states

Minimization for example Start from all accepting disequal all non-accepting. Result: only {1} and {2,4} are merged. Here, the special case is sufficient, but in general, we need the above construction (take two copies of same automaton and union them).

Clarifications Non-deterministic state machines where a transition on some input is not defined We can apply determinization, and we will end up with –singleton sets –empty set (this becomes trap state) Trap state: a state that has self-loops for all symbols, and is non-accepting.

Exercise

Complementation, Inclusion, Equivalence Can compute complement: switch accepting and non-accepting states in deterministic machine (wrong for non-deterministic) We can compute intersection, inclusion, equivalence Intersection: complement union of complements Set difference: intersection with complement Inclusion: emptiness of set difference Equivalence: two inclusions

Exercise: first, nullable For each of the following languages find the first set. Determine if the language is nullable. –(a|b)* (b|d) ((c|a|d)* | a*) –language given by automaton:

Automated Construction of Lexers –let r 1, r 2,..., r n be regular expressions for token classes –consider combined regular expression: (r 1 | r 2 |... | r n )* –recursively map a regular expression to a non-deterministic automaton –eliminate epsilon transitions and determinize –optionally minimize A 3 to reduce its size  A 4 –the result only checks that input can be split into tokens, does not say how to split it

From (r 1 |r 2 |...|r n )* to a Lexer Construct machine for each r i labelling different accepting states differently for each accepting state of r i specify the token class i being recognized longest match rule: remember last token and input position for a last accepted state when no accepting state can be reached (effectively: when we are in a trap state) –revert position to last accepted state –return last accepted token

Exercise: Build Lexical Analyzer Part For these two tokens, using longest match, where first has the priority: binaryToken ::= ( z |1) * ternaryToken ::= (0|1|2) * 1111z1021z1 

Lexical Analyzer binaryToken ::= ( z |1) * ternaryToken ::= (0|1|2) * 1111z1021z1 

Exercise: first, nullable For each of the following languages find the first set. Determine if the language is nullable. –(a|b)*(b|d)((c|a|d)* | a*) –language given by automaton:

Exercise: Realistic Integer Literals Integer literals are in three forms in Scala: decimal, hexadecimal and octal. The compiler discriminates different classes from their beginning. –Decimal integers are started with a non-zero digit. –Hexadecimal numbers begin with 0x or 0X and may contain the digits from 0 through 9 as well as upper or lowercase digits A to F afterwards. –If the integer number starts with zero, it is in octal representation so it can contain only digits 0 through 7. –l or L at the end of the literal shows the number is Long. Draw a single DFA that accepts all the allowable integer literals. Write the corresponding regular expression.

Exercise Let L be the language of strings A = { 2. Construct a DFA that accepts L Describe how the lexical analyzer will tokenize the following inputs. 1) <===== 2) ==<==<==<==<== 3) <=====<

More Questions Find automaton or regular expression for: –Sequence of open and closed parentheses of even length? –as many digits before as after decimal point? –Sequence of balanced parentheses ( ( () ) ())- balanced ( ) ) ( ( ) - not balanced –Comment as a sequence of space,LF,TAB, and comments from // until LF –Nested comments like /*... /* */ … */