Lexer Input and Output

The lexer consumes a stream of characters such as
  i d 3   =   0  LF  w h i l e  ( ...
and produces a stream of tokens such as
  id3 = 0 while ( id3 < 10 )

Input: stream of characters (a lazy List[Char]):

  class CharStream(fileName : String) {
    val file = new BufferedReader(new FileReader(fileName))
    var current : Char = ' '
    var eof : Boolean = false
    def next = {
      if (eof) throw new EndOfInput("reading " + file)
      val c = file.read()
      eof = (c == -1)
      current = c.asInstanceOf[Char]
    }
    next // initialize the first character
  }

Output: stream of tokens:

  sealed abstract class Token
  case class ID(content : String) extends Token  // "id3"
  case class IntConst(value : Int) extends Token // 10
  case class AssignEQ() extends Token   // '='
  case class CompareEQ() extends Token  // '=='
  case class MUL() extends Token        // '*'
  case class PLUS() extends Token       // '+'
  case class LEQ() extends Token        // '<='
  case class OPAREN() extends Token     // '('
  case class CPAREN() extends Token     // ')'
  ...
  case class IF() extends Token         // 'if'
  case class WHILE() extends Token      // 'while'
  case class EOF() extends Token        // end of file

  class Lexer(ch : CharStream) {
    var current : Token
    def next : Unit = {
      // lexer code goes here
    }
  }
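A client such as the parser consumes tokens one at a time through current/next. A minimal driver sketch, assuming the Lexer, CharStream and Token classes above with next reading one token into current (PrintTokens is our own hypothetical name):

  // Hypothetical driver: print every token of a file.
  object PrintTokens {
    def main(args : Array[String]) : Unit = {
      val lexer = new Lexer(new CharStream(args(0)))
      lexer.next                       // read the first token into lexer.current
      while (lexer.current != EOF()) { // EOF token marks the end of input
        println(lexer.current)         // e.g. ID(id3), AssignEQ(), IntConst(0), ...
        lexer.next                     // advance to the next token
      }
    }
  }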

We have seen how to...
–prove things about languages by induction
–use regular expressions to describe tokens
–compute first symbols of regular expressions
–build lexical analyzers (lexers) to recognize
  –identifiers
  –integer literals

Decision Tree to Map Symbols to Tokens

  ch.current match {
    case '(' => { current = OPAREN; ch.next; return }
    case ')' => { current = CPAREN; ch.next; return }
    case '+' => { current = PLUS; ch.next; return }
    case '/' => { current = DIV; ch.next; return }
    case '*' => { current = MUL; ch.next; return }
    case '=' => { // more tricky because there can be =, ==
      ch.next
      if (ch.current == '=') { ch.next; current = CompareEQ; return }
      else { current = AssignEQ; return }
    }
    case '<' => { // more tricky because there can be <, <=
      ch.next
      if (ch.current == '=') { ch.next; current = LEQ; return }
      else { current = LESS; return }
    }
  }

What happens if we omit ch.next? Consider the input '<= '.

Skipping Comments

  if (ch.current == '/') {
    ch.next
    if (ch.current == '/') {
      // single-line comment: skip until end of line
      while (!isEOL && !isEOF) { ch.next }
    } else {
      // not a line comment: '/' was a division sign, or the start of /* ... */
    }
  }

Nested comments?  /* foo /* bar */ baz */
// what do we set as the current token now?
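Nested block comments cannot be skipped by searching for the first */; one standard approach is to keep a nesting counter. A minimal sketch, assuming the CharStream interface above (error handling at end-of-file is elided):

  // Sketch: skip a nested /* ... */ comment by counting nesting depth.
  def skipBlockComment() : Unit = {
    var depth = 1                    // we have already consumed the opening "/*"
    while (depth > 0) {
      if (ch.current == '/') {
        ch.next
        if (ch.current == '*') { depth += 1; ch.next }  // nested "/*"
      } else if (ch.current == '*') {
        ch.next
        if (ch.current == '/') { depth -= 1; ch.next }  // closing "*/"
      } else {
        ch.next
      }
    }
  }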

Longest Match (Maximal Munch) Rule

There are multiple ways to break the input characters into tokens. Consider a language with identifiers (ID) and the operators <=, <, =, and these input characters:

  interpreters <= compilers

These are some of the ways to analyze the input into tokens:

  ID(interpreters) LEQ ID(compilers)
  ID(inter) ID(preters) LESS AssignEQ ID(com) ID(pilers)
  ID(i) ID(nte) ID(rpre) ID(ter) LESS AssignEQ ID(co) ID(mpi) ID(lers)

This ambiguity is resolved by the longest match rule: if multiple tokens could follow, take the longest token possible.

Consequences of the Longest Match Rule

Consider a language with three operators, for example <, <=, and =>. For the sequence '<=>', the lexer will report an error.
–Why? The longest match rule consumes '<=', after which the remaining '>' is not a token; the lexer does not backtrack to try '<' followed by '=>'.

In practice, this is not a problem:
–we can always insert extra spaces

Longest Match Exercise

Recall the maximal munch rule: the lexical analyzer should eagerly accept the longest token that it can recognize from the current point.

Consider the following specification of tokens; the number in parentheses gives the name of the token class defined by each regular expression.

  (1) a(ab)*   (2) b*(ac)*   (3) cba   (4) c+

Use the maximal munch rule to tokenize the following strings according to the specification:
–c a c c a b a c a c c b a b c
–c c c a a b a b a c c b a b c c b a b a c

If we do not use the maximal munch rule, is another tokenization possible?

Give an example of a regular expression and an input string where the regular expression is able to split the input string into tokens, but it is unable to do so if we use the maximal munch rule.

Token Priority

What if our token classes intersect? The longest match rule does not help. Example: a keyword is also an identifier.

Solution - priority: order all tokens; if several overlap, take the one with higher priority.

Example: if it looks both like a keyword and like an identifier, then it is a keyword (we say so).
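In hand-written lexers this priority is often implemented by first munching an identifier-shaped lexeme and then consulting a keyword table. A minimal sketch, assuming the Token classes from the first slide (the keywords map is a hypothetical helper, not from the slides):

  // Sketch: give keywords priority over identifiers by a table lookup
  // after the longest identifier-like lexeme has been read.
  val keywords : Map[String, Token] =
    Map("if" -> IF(), "while" -> WHILE())   // hypothetical keyword table

  def identOrKeyword(lexeme : String) : Token =
    keywords.getOrElse(lexeme, ID(lexeme))  // the keyword wins if both match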

Automating Construction of Lexers

Example in javacc

  TOKEN: {
    <IDENTIFIER: <LETTER> (<LETTER> | <DIGIT> | "_")*> |
    <INTLITERAL: <DIGIT> (<DIGIT>)*> |
    <LETTER: ["a"-"z"] | ["A"-"Z"]> |
    <DIGIT: ["0"-"9"]>
  }
  SKIP: { " " | "\n" | "\t" }

--> we get automatically generated code for the lexer!
But how does javacc do it?

Finite Automaton (Finite State Machine)

Σ - alphabet
Q - states (nodes in the graph)
q0 - initial state (with '>' sign in drawing)
δ - transitions (labeled edges in the graph)
F - final states (double circles)
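As a data structure, a DFA can be represented directly from this five-tuple. A minimal sketch in Scala (the names DFA and accepts are our own, not from the slides):

  // Sketch: a DFA as a transition map, following the (Σ, Q, q0, δ, F) definition.
  case class DFA[Q](
    initial : Q,                  // q0
    delta   : Map[(Q, Char), Q],  // δ (a missing edge means a trap state)
    isFinal : Set[Q]              // F
  ) {
    def accepts(word : String) : Boolean =
      word.foldLeft(Option(initial)) { (st, c) =>
        st.flatMap(q => delta.get((q, c)))  // None once we fall off the graph
      }.exists(isFinal)
  }

For instance, a two-state automaton for words with an even number of a's:
  DFA(0, Map((0, 'a') -> 1, (1, 'a') -> 0), Set(0)).accepts("aa")  // true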

Numbers with Decimal Point

  digit digit* . digit digit*

What if the decimal part is optional?
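One way to write it, assuming the usual '?' operator for optional parts (our notation, not from the slide):

  digit digit* ( . digit digit* )?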

Exercise

Design a DFA which accepts all the numbers written in binary and divisible by 6. For example, your automaton should accept the words 0, 110 (6 decimal) and 10010 (18 decimal).
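The standard construction keeps the value of the digits read so far, modulo 6, as the state: reading bit b takes remainder r to (2r + b) mod 6. A sketch that simulates this automaton (our own code, not from the slides):

  // Sketch: states are remainders mod 6; accept iff the final remainder is 0.
  // Appending a bit doubles the value and adds the bit, hence (2*r + b) % 6.
  def divisibleBy6(word : String) : Boolean =
    word.nonEmpty && word.forall(c => c == '0' || c == '1') && {
      word.foldLeft(0)((r, c) => (2 * r + (c - '0')) % 6) == 0
    }

  // divisibleBy6("0") == true, divisibleBy6("110") == true,
  // divisibleBy6("10010") == true, divisibleBy6("111") == false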

Deterministic:  is a function Kinds of Finite State Automata Otherwise: non-deterministic

Interpretation of Non-Determinism

For a given word (string), one path in the automaton may lead to an accepting state while another leads to a rejecting state. Does the automaton accept in such a case?
–Yes, if there exists an accepting path in the automaton graph whose symbols give that word.

Epsilon transitions: traversing them does not consume anything (the empty word).

More generally, transitions may be labeled by a word: traversing such a transition consumes that entire word at a time.

Regular Expressions and Automata

Theorem: If L is a set of words, then it is the value of a regular expression if and only if it is the set of words accepted by some finite automaton.

Algorithms:
–regular expression → automaton (important!)
–automaton → regular expression (cool)

Recursive Constructions

–Union
–Concatenation
–Star
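The diagrams for these three constructions do not survive in this transcript, but the idea is Thompson's construction: each regular expression maps recursively to an NFA with epsilon transitions, with one start and one final state. A compact sketch under our own representation (all names are ours):

  sealed trait Regex
  case class Sym(c : Char)                 extends Regex
  case class Union(l : Regex, r : Regex)   extends Regex
  case class Concat(l : Regex, r : Regex)  extends Regex
  case class Star(r : Regex)               extends Regex

  // One start and one final state; an edge labeled None is an epsilon edge.
  case class NFA(start : Int, fin : Int, edges : List[(Int, Option[Char], Int)])

  def thompson(r : Regex, fresh : () => Int) : NFA = r match {
    case Sym(c) =>
      val s = fresh(); val f = fresh()
      NFA(s, f, List((s, Some(c), f)))
    case Union(a, b) =>              // new start/final, epsilons into both branches
      val na = thompson(a, fresh); val nb = thompson(b, fresh)
      val s = fresh(); val f = fresh()
      NFA(s, f, (s, None, na.start) :: (s, None, nb.start) ::
                (na.fin, None, f) :: (nb.fin, None, f) :: (na.edges ++ nb.edges))
    case Concat(a, b) =>             // glue a's final state to b's start state
      val na = thompson(a, fresh); val nb = thompson(b, fresh)
      NFA(na.start, nb.fin, (na.fin, None, nb.start) :: (na.edges ++ nb.edges))
    case Star(a) =>                  // loop around the body, and allow the empty word
      val na = thompson(a, fresh)
      val s = fresh(); val f = fresh()
      NFA(s, f, (s, None, na.start) :: (s, None, f) ::
                (na.fin, None, na.start) :: (na.fin, None, f) :: na.edges)
  }

Usage: with a state counter such as  var n = 0; val fresh = () => { n += 1; n },  the call  thompson(Star(Sym('a')), fresh)  builds an NFA for a*.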

Eliminating Epsilon Transitions
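The construction on the original slide is graphical; the usual statement is: add an edge q -a-> q'' whenever q reaches some q' through epsilon edges and q' -a-> q'', mark q final if it reaches a final state by epsilons, then drop the epsilon edges. A sketch of the epsilon-closure computation, reusing the NFA type from the Thompson sketch above:

  // Sketch: all states reachable from `state` using only epsilon edges.
  def epsClosure(nfa : NFA, state : Int) : Set[Int] = {
    var closure = Set(state)
    var frontier = List(state)
    while (frontier.nonEmpty) {
      val q = frontier.head
      frontier = frontier.tail
      for ((src, label, dst) <- nfa.edges)
        if (src == q && label.isEmpty && !closure(dst)) {
          closure += dst               // newly reached by an epsilon edge
          frontier = dst :: frontier
        }
    }
    closure
  }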

Exercise: (aa)* | (aaa)* Construct automaton and eliminate epsilons

Determinization: Subset Construction

–keep track of the set of all possible states in which the automaton could be
–view each such finite set as one state of the new automaton

Apply this to (aa)* | (aaa)*
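A sketch of the subset construction over the NFA representation used above; each DFA state is a Set[Int] of NFA states, and a subset is a final DFA state iff it contains the NFA's final state (helper names are ours; epsilon edges are assumed to have been eliminated already):

  // Sketch: determinize an epsilon-free NFA. DFA states are sets of NFA states.
  def determinize(nfa : NFA, alphabet : Set[Char])
      : (Set[Int], Map[(Set[Int], Char), Set[Int]]) = {
    def step(states : Set[Int], c : Char) : Set[Int] =
      nfa.edges.collect { case (s, Some(`c`), d) if states(s) => d }.toSet

    val start = Set(nfa.start)
    var trans = Map.empty[(Set[Int], Char), Set[Int]]
    var seen = Set(start)
    var todo = List(start)
    while (todo.nonEmpty) {
      val qs = todo.head; todo = todo.tail
      for (c <- alphabet) {
        val next = step(qs, c)          // all NFA states reachable on c
        trans += ((qs, c) -> next)      // the empty set acts as the trap state
        if (!seen(next)) { seen += next; todo = next :: todo }
      }
    }
    (start, trans)
  }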

Exercise

Exercise: first, nullable

For each of the following languages, find the first set and determine if the language is nullable.
–(a|b)*(b|d)((c|a|d)* | a*)
–language given by automaton:

Minimization: Merge States

We should only limit the freedom to merge two states if we have evidence that they behave differently (acceptance, or leading to different states). When we run out of evidence, merge the rest.

–Merge the states in the previous automaton for (aa)* | (aaa)*
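This "merge unless there is evidence of difference" idea is the partition-refinement view of DFA minimization: start from the two-block partition {final, non-final} and split a block whenever two of its states go to different blocks on some symbol. A compact sketch over a total transition function, assuming unreachable states were removed first (representation and names are ours):

  // Sketch of minimization by partition refinement (Moore's algorithm).
  // delta must be total: defined for every (state, symbol) pair.
  def minimize(states : Set[Int], alphabet : Set[Char],
               delta : Map[(Int, Char), Int], fin : Set[Int]) : Set[Set[Int]] = {
    // Start with the only evidence we have: accepting vs non-accepting.
    var partition : Set[Set[Int]] = Set(fin, states -- fin).filter(_.nonEmpty)
    var changed = true
    while (changed) {
      changed = false
      // Signature of a state: which block each symbol leads to.
      def sig(q : Int) : List[Set[Int]] =
        alphabet.toList.sorted.map(c => partition.find(_(delta((q, c)))).get)
      val refined = partition.flatMap(block => block.groupBy(sig).values)
      if (refined != partition) { partition = refined; changed = true }
    }
    partition  // each block is one state of the minimal automaton
  }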

Automated Construction of Lexers

–let r1, r2, ..., rn be regular expressions for the token classes
–consider the combined regular expression: (r1 | r2 | ... | rn)*
–recursively map the regular expression to a non-deterministic automaton A1
–eliminate epsilon transitions from A1 by adding more edges → A2
–determinize A2 using the subset construction → A3
–minimize A3 to reduce its size → A4
–the result only checks that the input can be split into tokens; it does not say how!
–so, we must maintain different final states for different tokens
–and we must remember the last accepting state, to know when to return a token and to implement the longest match rule
–we can implement an automaton using a big array or a map for the transitions

Making (r1 | r2 | ... | rn) Work as a Lexer

Modify the state machine:
–for each accepting state, specify the token recognized (but do not stop)
  –use token class priority if there are multiple options
–for the longest match rule: remember the last token and the input position for the last accepted state
–when no accepting state can be reached (effectively: when we are in a trap state)
  –revert the position to the last accepted state
  –return the last accepted token
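Putting this together, the scanning loop runs the DFA, records the last position at which an accepting state was seen, and rolls back to it when the automaton gets stuck. A minimal sketch assuming a DFA whose accepting states are annotated with the token class they recognize (the representation is ours):

  // Sketch: maximal munch driver. `delta` returns None in the trap state;
  // `accept` maps accepting DFA states to the (highest-priority) token class.
  def nextToken(input : String, start : Int,
                init : Int,
                delta : (Int, Char) => Option[Int],
                accept : Map[Int, String]) : Option[(String, Int)] = {
    var state = init
    var pos = start
    var lastAccept : Option[(String, Int)] = None  // token class, end position
    while (pos < input.length) {
      delta(state, input(pos)) match {
        case Some(next) =>
          state = next
          pos += 1
          for (tok <- accept.get(state))
            lastAccept = Some((tok, pos))  // remember the longest match so far
        case None =>
          return lastAccept                // stuck: roll back to the last accept
      }
    }
    lastAccept
  }

The caller restarts nextToken at the returned end position; a None result signals a lexical error (no prefix of the remaining input is a token).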

Exercise: Build Lexical Analyzer Part

For these two tokens, using longest match, where the first has priority:

  binaryToken ::= (z|1)*
  ternaryToken ::= (0|1|2)*

Input: 1111z1021z1

Lexical Analyzer

  binaryToken ::= (z|1)*
  ternaryToken ::= (0|1|2)*

Input: 1111z1021z1

Exercise: Realistic Integer Literals

Integer literals come in three forms in Scala: decimal, hexadecimal and octal. The compiler discriminates the different classes by their beginning.
–Decimal integers start with a non-zero digit.
–Hexadecimal numbers begin with 0x or 0X and may afterwards contain the digits 0 through 9 as well as upper- or lowercase letters A to F.
–If the integer number starts with zero, it is in octal representation, so it can contain only the digits 0 through 7.
–l or L at the end of the literal shows that the number is Long.

Draw a single DFA that accepts all the allowable integer literals. Write the corresponding regular expression.

Exercise

Let L be the language of strings A = { 2.

Construct a DFA that accepts L.

Describe how the lexical analyzer will tokenize the following inputs:
1) <=====
2) ==<==<==<==<==
3) <=====<

More Questions

Find an automaton or regular expression for:
–a sequence of open and closed parentheses of even length
–as many digits before as after the decimal point
–a sequence of balanced parentheses
  ( ( () ) ()) - balanced
  ( ) ) ( ( )  - not balanced
–a comment: a sequence of space, LF, TAB characters, and comments from // until LF
–nested comments like /* ... /* */ ... */

Automaton that Claims to Recognize { a^n b^n | n >= 0 }

We can make the automaton deterministic. Let the result have K states. Feed it a, aa, aaa, ... and consider the states it ends up in.

Limitations of Regular Languages

Every automaton can be made deterministic. An automaton has finite memory, so it cannot count. From a given state, a deterministic automaton always behaves the same. If a string is too long, a deterministic automaton will repeat its behavior.
–Say A accepted a^n b^n for all n, and has K states. Among the K+1 prefixes a^0, a^1, ..., a^K, two of them, say a^i and a^j with i < j, must end in the same state; but then A would also accept a^i b^j, which is not of the form a^n b^n - a contradiction.

Context-Free Grammars

Σ - terminals
Symbols with recursive definitions - nonterminals
Rules are of the form:  N ::= v
where v is a sequence of terminals and non-terminals.
A derivation starts from the starting symbol and repeatedly replaces non-terminals with a right-hand side, which contains
–terminals and
–non-terminals

Balanced Parentheses Grammar

A sequence of balanced parentheses:
  ( ( () ) ()) - balanced
  ( ) ) ( ( )  - not balanced
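The grammar itself does not survive in this transcript; one grammar generating exactly the balanced-parenthesis strings (a standard choice, not necessarily the one on the original slide) is:

  B ::= ""          // the empty string is balanced
  B ::= B ( B )     // a balanced string followed by a parenthesized balanced string

For example, ()() derives as  B ⇒ B(B) ⇒ B(B)(B) ⇒ (B)(B) ⇒ ()().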

Remember While Syntax

  program ::= statmt*
  statmt  ::= println ( stringConst , ident )
            | ident = expr
            | if ( expr ) statmt (else statmt)?
            | while ( expr ) statmt
            | { statmt* }
  expr    ::= intLiteral | ident
            | expr (&& | < | == | + | - | * | / | %) expr
            | ! expr
            | - expr

Eliminating Additional Notation

–Grouping alternatives:  s ::= P | Q  instead of  s ::= P  and  s ::= Q
–Parenthesis notation:  expr (&& | < | == | + | - | * | / | %) expr
–Kleene star within grammars:  { statmt* }
–Optional parts:  if ( expr ) statmt (else statmt)?