PEGs, Treetop, and Converting Regular Expressions to NFAs Jason Dew and Gary Fredericks.

Slides:



Advertisements
Similar presentations
LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
Advertisements

CPSC Compiler Tutorial 4 Midterm Review. Deterministic Finite Automata (DFA) Q: finite set of states Σ: finite set of “letters” (input alphabet)
Lexical Analysis Lexical analysis is the first phase of compilation: The file is converted from ASCII to tokens. It must be fast!
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture4: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture5: Context Free Languages Prof. Amos Israeli.
Courtesy Costas Busch - RPI1 Non Deterministic Automata.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Regular.
Context-Free Grammars Lecture 7
CS5371 Theory of Computation Lecture 6: Automata Theory IV (Regular Expression = NFA = DFA)
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Fall 2006Costas Busch - RPI1 Non-Deterministic Finite Automata.
Chapter 3: Formal Translation Models
Specifying Languages CS 480/680 – Comparative Languages.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
1 Non-Deterministic Finite Automata. 2 Alphabet = Nondeterministic Finite Automaton (NFA)
1 Syntax and Semantics The Purpose of Syntax Problem of Describing Syntax Formal Methods of Describing Syntax Derivations and Parse Trees Sebesta Chapter.
Chpater 3. Outline The definition of Syntax The Definition of Semantic Most Common Methods of Describing Syntax.
CPSC 388 – Compiler Design and Construction Parsers – Context Free Grammars.
PZ02B Programming Language design and Implementation -4th Edition Copyright©Prentice Hall, PZ02B - Regular grammars Programming Language Design.
Winter 2007SEG2101 Chapter 71 Chapter 7 Introduction to Languages and Compiler.
A sentence (S) is composed of a noun phrase (NP) and a verb phrase (VP). A noun phrase may be composed of a determiner (D/DET) and a noun (N). A noun phrase.
Context-Free Grammars
Grammars CPSC 5135.
PART I: overview material
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong NFA to DFA.
CS412/413 Introduction to Compilers Radu Rugina Lecture 4: Lexical Analyzers 28 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
CMSC 330: Organization of Programming Languages Context-Free Grammars.
Bernd Fischer RW713: Compiler and Software Language Engineering.
CFG1 CSC 4181Compiler Construction Context-Free Grammars Using grammars in parsers.
Lexer input and Output i d 3 = 0 LF w i d 3 = 0 LF w id3 = 0 while ( id3 < 10 ) id3 = 0 while ( id3 < 10 ) lexer Stream of Char-s ( lazy List[Char] ) class.
CPS 506 Comparative Programming Languages Syntax Specification.
Pushdown Automata Chapters Generators vs. Recognizers For Regular Languages: –regular expressions are generators –FAs are recognizers For Context-free.
Syntax The Structure of a Language. Lexical Structure The structure of the tokens of a programming language The scanner takes a sequence of characters.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Lecture # 12. Nondeterministic Finite Automaton (NFA) Definition: An NFA is a TG with a unique start state and a property of having single letter as label.
CMSC 330: Organization of Programming Languages Theory of Regular Expressions Finite Automata.
CSC3315 (Spring 2009)1 CSC 3315 Lexical and Syntax Analysis Hamid Harroud School of Science and Engineering, Akhawayn University
Introduction Finite Automata accept all regular languages and only regular languages Even very simple languages are non regular (  = {a,b}): - {a n b.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
Regular Grammars Reading: 3.3. What we know so far…  FSA = Regular Language  Regular Expression describes a Regular Language  Every Regular Language.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen Department of Computer Science University of Texas-Pan American.
Scribing K SAMPATH KUMAR 11CS10022 scribing. Definition of a Regular Expression R is a regular expression if it is: 1.a for some a in the alphabet ,
CS 404Ahmed Ezzat 1 CS 404 Introduction to Compiler Design Lecture 1 Ahmed Ezzat.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
COMP3190: Principle of Programming Languages DFA and its equivalent, scanner.
Chapter 3 – Describing Syntax CSCE 343. Syntax vs. Semantics Syntax: The form or structure of the expressions, statements, and program units. Semantics:
1 Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
CS 3304 Comparative Languages
CS510 Compiler Lecture 2.
Non Deterministic Automata
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
CSE 105 theory of computation
Formal Language Theory
Context-Free Grammars
Nondeterministic Finite Automata
CSC 4181Compiler Construction Context-Free Grammars
Theory of Computation Languages.
CS 3304 Comparative Languages
CSC 4181 Compiler Construction Context-Free Grammars
Context-Free Grammars
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
Context-Free Grammars
Lecture 5 Scanning.
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
PZ02B - Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section PZ02B.
Presentation transcript:

PEGs, Treetop, and Converting Regular Expressions to NFAs Jason Dew and Gary Fredericks

Parsing Expression Grammars and Treetop Jason Dew

Outline 1.Introduction to PEGs 2.Introduction to Treetop 3.References and Questions

PEGs PEG := parsing expression grammar A generalization of regular expressions Similar to context-free grammars Unlike BNF, parse trees are unambiguous

Formal Definition N: a finite set of non-terminal symbols ∑: a finite set of terminal symbols P: a finite set of parsing rules e s : the starting expression

Formal Definition Parsing rules take the form: A := e non-terminalparsing expression (or some combination)

Parsing Expressions Several ways to combine expressions: sequence: "foo" "bar” ordered choice: "foo" / "bar” zero or more: "foo"* one or more: "foo"+ optional: "foo"?

Parsing Expressions Lookahead assertions (these do not consume any input) : positive lookahead: "foo" &"bar” negative lookahead: "foo" !"baz"

Implementations Java: parboiled, rats! C: peg, leg, pegc C++: boost Python: ppeg, pypeg, pijnu Javascript: kouprey Perl 6: (part of the language) Erlang: neotoma Clojure: clj-peg F#: fparsec and finally... Ruby has Treetop

Treetop A DSL (domain-specific language) written in Ruby for implementing PEGs

Syntax Two main keywords in the DSL: grammar and rule

Semantics Consider the following PEG and the input string (((a))): the resulting parse tree:

And now for the cool part each of the nodes are instances of Treetop::Runtime::SyntaxNode semantics get defined here all of Ruby is available to you

Example

Example (sans code duplication)

Treetop::Runtime::SyntaxNode Methods available: #terminal? : true if this node corresponds to a terminal symbol, false otherwise #non_terminal? : true if this node corresponds to a non-terminal symbol, false otherwise #text_value : returns the matched text #elements : returns the child nodes (only for non- terminal nodes)

References and Questions

RE → εNFA Gary Fredericks

Plan 1.Demonstrate Application 2.Show Treetop Parse Tree 3.Class NFA 1.Simple one-character NFA 2.Combined NFAs 1.Question Mark 2.Kleene Star 3.Concatenation 4.Or 4.Optimizations

Application Temporary shortcut: gfredericks.com/531

Treetop Grammar grammar Regex rule reg expression end rule expression term ("|" term)* end rule term modified_factor+ end

Treetop Grammar (cont) rule modified_factor factor modifier? end rule factor "(" expression ")" / literal / characterClass end rule modifier optional / one_or_more / zero_or_more / specified_number end

Treetop Grammar (cont) rule optional "?" end rule one_or_more "+" end rule zero_or_more "*" end rule specified_number "{" [0-9]+ ("," [0-9]* )? "}" end …

Treetop Example a?d|(bc*){2} matches  ad  d  bb  bcb  bbccccccc  bccccccccbccccccc

Treetop Syntax Tree a?d|(bc*){2}

Simplifying Assumptions Every NFA has a start state with only outgoing transitions Every NFA has a single accepting state with only incoming transitions (This means no self-transitions in either case)

NFA.simple(char) my_simple = NFA.simple(“c”)

NFA.simple(char) def NFA.simple(alphabet,symbol) NFA.new( [:init,:a], # states alphabet, # alphabet lambda do |st,sym| # transition (st==:init and sym==symbol) ? [:a] : [] end, :init, # start state [:a]) # accepting states end

NFA::question_mark my_simple.question_mark

NFA::question_mark def question_mark trans = lambda do |st, sym| original if(st and sym.nil?) original end return original trans, end

What’s going on here? (closures) new_nfa = my_simple.question_mark new_nfa transition my_simple

NFA::star my_simple.star

NFA::star def star a = self.wrap(“m”) states = a.states + [:init] transition = lambda do |st,sym| if(state==:init) if(symbol.nil?) return [a.start]+a.accepting else ret = a.transition.call(st,sym) if(a.accepting.any?{|s|ret.include?(s)}) ret << a.start end return ret end

-- NFA::star transition, :init, a.accepting) end

NFA::concatenate my_simple.concatenate(my_simple)

NFA::concatenate(other) def concatenate(other) a=self.wrap(“m”) b=other.wrap(“n”) states = a.states-a.accepting+b.states transition = lambda do |st, sym| if(a.states.include?(state)) a.transition.call(state,symbol). map{|s|a.accepting.include?(s) ? b.start : s} else b.transition.call(state,symbol) end end # continuing…

-- NFA::concatenate(other) transition, a.start, b.accepting) end

NFA::or NFA.simple(“c”).or(NFA.simple(“d”))

NFA::or(other) def or(other) a = self.wrap(“m”) b = self.wrap(“n”) states = a.states + b.states + [:init, :accept] – [a.start, b.start, a.accepting, b.accepting]

-- NFA::or(other) transition = lambda do |st, sym| ret= if(st==:init) a.transition.call(a.start,sym)+ b.transition.call(b.start,sym) elsif(a.states.include?(st)) a.transition.call(st,sym) else b.transition.call(st,sym) end return ret.map do |s| [a.accepting+b.accepting]. include?(s) ? :accept : s end

-- NFA::or(other) transition, :init, [:accept]) end

Syntax Tree Translation a?d|(bc*){2}

Conclusion I enjoyed making this. Questions?

Optimization: Repetition What do we do for the regular expression b{200} ? Naïve: my_b = NFA.simple(“b”) res=my_b 199.times do res=res.concatenate(my_b) end

Naïve Approach Result res my_b

Better Idea – Divide and Conquer def times(n) return self if(n==1) a = self.times(n/2) b = self.times(n-n/2) return a.concatenate(b) end

Divide and Conquer Result

Conclusion I enjoyed making this. Questions?