Regular Expressions into Finite Automata Anne Bruggemann-Klein Presenting: Rutie Mesing.

Outline
- Building the Glushkov automaton in O((size of E)^2)
- Defining the star normal form
- Building the Glushkov automaton in O(size of E) for deterministic regular expressions
- Strong and weak unambiguity
- Quadratic-time decision algorithm for weak unambiguity

General definitions
- E: a regular expression
- L(E): the language specified by the regular expression E
- The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, ., and *
- The size of an NFA: the number of its transitions

pos(E), χ(x)
(a+b)*a(ab)* → (a1+b2)*a3(a4b5)*
- pos(E): the set of subscripted symbols (positions) in an expression E
- x, y, z denote positions
- a, b, c denote elements of the alphabet Σ
- For a position x, χ(x) is the corresponding symbol of Σ
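The marking step above can be sketched in a few lines of Python. This is only an illustration: it assumes every alphabet symbol is a single letter, and the function name `mark_positions` is hypothetical.

```python
def mark_positions(expr: str):
    """Subscript each alphabet symbol from left to right and record chi.

    Assumption: alphabet symbols are single letters; everything else
    (brackets, +, *) is copied unchanged.
    """
    chi = {}      # position -> symbol, i.e. the map written chi(x) on the slide
    marked = []
    for ch in expr:
        if ch.isalpha():
            x = len(chi) + 1          # positions are numbered 1, 2, 3, ...
            chi[x] = ch
            marked.append(f"{ch}{x}")
        else:
            marked.append(ch)
    return "".join(marked), chi


marked, chi = mark_positions("(a+b)*a(ab)*")
```

Here `marked` is the subscripted expression from the slide and `chi` maps each position back to its symbol.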

Position sets: first(E), last(E) (inductive definition)
first(E) = last(E) = ∅   [E = ε or ∅]
first(E) = last(E) = {x}   [E = x]
first(E) = first(F) ∪ first(G), last(E) = last(F) ∪ last(G)   [E = F + G]
first(E) = first(F) ∪ first(G) if ε ∈ L(F), first(F) otherwise   [E = FG]
last(E) = last(F) ∪ last(G) if ε ∈ L(G), last(G) otherwise   [E = FG]
first(E) = first(F), last(E) = last(F)   [E = F*]

Position sets: follow(E, x) (inductive definition)
E has no positions   [E = ε or ∅]
follow(E, x) = ∅   [E = x]
follow(E, x) = follow(F, x) if x ∈ pos(F); follow(G, x) if x ∈ pos(G)   [E = F + G]
follow(E, x) = follow(F, x) if x ∈ pos(F) \ last(F); follow(F, x) ∪ first(G) if x ∈ last(F); follow(G, x) if x ∈ pos(G)   [E = FG]
follow(E, x) = follow(F, x) if x ∈ pos(F) \ last(F); follow(F, x) ∪ first(F) if x ∈ last(F)   [E = F*]

The Glushkov automaton (NFA)
M_E = (Q_E ∪ {q_I}, Σ, δ_E, q_I, F_E)
- Q_E = pos(E)
- For a ∈ Σ, δ_E(q_I, a) = {x | x ∈ first(E), χ(x) = a}
- For x ∈ pos(E) and a ∈ Σ, δ_E(x, a) = {y | y ∈ follow(E, x), χ(y) = a}
- F_E = last(E) ∪ {q_I} if ε ∈ L(E), last(E) otherwise
Proposition 2.1: L(M_E) = L(E)
Example: (a*+ba)* = (a1*+b2a3)*
[Figure: Glushkov automaton of (a1*+b2a3)* with states q_I, 1, 2, 3]
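As a rough sketch of this definition, the Python below builds the transition table from given position sets and simulates the resulting NFA. The names `glushkov` and `accepts`, and the choice of 0 for q_I, are assumptions made for illustration; the position sets for the example (a1*+b2a3)* were worked out by hand from the inductive definitions.

```python
def glushkov(first, last, follow, chi, nullable):
    """Build (q_I, delta, finals) from the position sets of a marked expression."""
    q_i = 0  # assumption: positions are numbered from 1, so 0 is free for q_I
    delta = {}
    for x in first:
        delta.setdefault((q_i, chi[x]), set()).add(x)
    for x, successors in follow.items():
        for y in successors:
            delta.setdefault((x, chi[y]), set()).add(y)
    finals = set(last) | ({q_i} if nullable else set())
    return q_i, delta, finals


def accepts(automaton, word):
    """Simulate the NFA on word by tracking the set of reachable states."""
    q_i, delta, finals = automaton
    current = {q_i}
    for a in word:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return bool(current & finals)


# Position sets of (a1* + b2 a3)*, worked out by hand from the definitions.
CHI = {1: "a", 2: "b", 3: "a"}
M = glushkov(first={1, 2}, last={1, 3},
             follow={1: {1, 2}, 2: {3}, 3: {1, 2}},
             chi=CHI, nullable=True)
```

With these sets, `accepts(M, "aba")` follows the path q_I, 1, 2, 3 and ends in the final state 3.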

The canonical method (O(n^3)) for computing first, last and follow
Converting E into a syntax tree:
- Leaves are labeled with ε, ∅ or positions of E
- Internal nodes are labeled with +, . or *
- Building time: O(n) (n = size of E)
- Each node v in the syntax tree corresponds to a subexpression E_v of E
Postorder traversal of the syntax tree, computing:
- nullable(v): Boolean, true if and only if ε ∈ L(E_v)
- first(v), last(v): subsets of pos(E)
- follow(x): for each x ∈ pos(E) there is a global variable follow(x), a subset of pos(E)
⇒ O(n^3)

case
  v is a node labeled ∅:
    nullable(v) := false; first(v) := ∅; last(v) := ∅;
  v is a node labeled ε:
    nullable(v) := true; first(v) := ∅; last(v) := ∅;
  v is a node labeled x:
    nullable(v) := false; follow(x) := ∅; first(v) := {x}; last(v) := {x};
  v is a node labeled +:
    nullable(v) := nullable(leftchild) or nullable(rightchild);
    first(v) := first(leftchild) ∪ first(rightchild);   (*)
    last(v) := last(leftchild) ∪ last(rightchild);   (*)
  v is a node labeled . :
    nullable(v) := nullable(leftchild) and nullable(rightchild);
    for each x in last(leftchild) do
      follow(x) := follow(x) ∪ first(rightchild);   (**)
    if nullable(leftchild)
      then first(v) := first(leftchild) ∪ first(rightchild)   (*)
      else first(v) := first(leftchild);
    if nullable(rightchild)
      then last(v) := last(leftchild) ∪ last(rightchild)   (*)
      else last(v) := last(rightchild);
  v is a node labeled *:
    nullable(v) := true;
    for each x in last(child) do
      follow(x) := follow(x) ∪ first(child);   (***)
    first(v) := first(child); last(v) := last(child);
end case;
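The postorder case analysis can be written as one recursive Python function. This is a minimal sketch, not the presenter's code: the tuple encoding of syntax trees (`("sym", a, x)`, `("eps",)`, `("empty",)`, `("+", F, G)`, `(".", F, G)`, `("*", F)`) and the name `analyze` are assumptions, and Python sets are used instead of lists, so the time bounds discussed on the slides do not carry over directly.

```python
def analyze(e, follow):
    """Return (nullable, first, last) for e and fill follow[x] for its positions."""
    tag = e[0]
    if tag == "empty":
        return False, set(), set()
    if tag == "eps":
        return True, set(), set()
    if tag == "sym":                      # e = ("sym", symbol, position)
        follow[e[2]] = set()
        return False, {e[2]}, {e[2]}
    if tag == "*":
        _, f, l = analyze(e[1], follow)
        for x in l:                       # feedback from last to first
            follow[x] |= f
        return True, f, l
    n1, f1, l1 = analyze(e[1], follow)
    n2, f2, l2 = analyze(e[2], follow)
    if tag == "+":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:                          # concatenation: last(F) feeds first(G)
        follow[x] |= f2
    return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)


# (a1* + b2 a3)*, the example expression used for the Glushkov automaton
E = ("*", ("+", ("*", ("sym", "a", 1)),
                (".", ("sym", "b", 2), ("sym", "a", 3))))
FOLLOW = {}
NULLABLE, FIRST, LAST = analyze(E, FOLLOW)
```

After the call, `FOLLOW` holds follow(E, x) for every position x of E, as stated for the root node in Lemma 2.5.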

Lemma 2.5
The following invariant holds after node v has been visited:
1. nullable(v) is true if and only if ε ∈ L(E_v).
2. first(v) = first(E_v), last(v) = last(E_v).
Furthermore, if node v has been visited but the parent of v has not, then
3. follow(x) = follow(E_v, x) for x ∈ pos(E_v).
In particular, for the root node v_0:
1. first(v_0) = first(E), last(v_0) = last(E).
2. follow(x) = follow(E, x) for x ∈ pos(E).

Observations
- All unions labeled (*) or (**) (the first/last unions and the follow update at a concatenation node) are disjoint, since pos(F) ∩ pos(G) = ∅
- Only the unions labeled (***) (the follow update at a star node) are not necessarily disjoint
- Example: E = (a*b*)*, H = a*b*. Elements of first(H) are added to follow(H, x) for x ∈ last(H), but some elements of first(H) may already belong to follow(H, x) for some x ∈ last(H).
⇒ O(n^3) for computing first(E), last(E) and follow(E, x)

Computing first, last and follow in a better time bound (O(n^2))
General strategy:
- We only consider expressions for which all unions, including the ones of type (***) (the follow update at a star node), are disjoint. Such expressions are in star normal form (SNF).
- Then we show that our algorithm runs in time O(size(E) + size(M_E)) for expressions E in star normal form.
- Finally, we show why the restriction to star normal form is justified.

Star normal form: definition
A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions
follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H)
hold.
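The two SNF conditions can be checked during the same postorder traversal that computes the position sets. The sketch below assumes a tuple encoding of syntax trees (`("sym", a, x)`, `("eps",)`, `("empty",)`, `("+", F, G)`, `(".", F, G)`, `("*", F)`); the function name `in_star_normal_form` is hypothetical.

```python
def in_star_normal_form(expr):
    """Check both SNF conditions for every starred subexpression H*."""
    follow = {}
    ok = True

    def visit(e):
        nonlocal ok
        tag = e[0]
        if tag == "empty":
            return False, set(), set()
        if tag == "eps":
            return True, set(), set()
        if tag == "sym":
            follow[e[2]] = set()
            return False, {e[2]}, {e[2]}
        if tag == "*":
            n, f, l = visit(e[1])
            # Here follow[x] still equals follow(H, x) for the body H of the star.
            if n or any(follow[x] & f for x in l):
                ok = False
            for x in l:
                follow[x] |= f
            return True, f, l
        n1, f1, l1 = visit(e[1])
        n2, f2, l2 = visit(e[2])
        if tag == "+":
            return n1 or n2, f1 | f2, l1 | l2
        for x in l1:
            follow[x] |= f2
        return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

    visit(expr)
    return ok


A, B = ("sym", "a", 1), ("sym", "b", 2)
```

For instance, (a*b*)* fails the check because its body a*b* contains the empty word, while (a+b)* passes.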

Lemma 2.7
Let E be a regular expression in star normal form. Then M_E can be computed from E in time O(size(E) + size(M_E)).
Proof
- (*) (the first/last unions) takes constant time (list concatenation).
- (**) or (***) (the follow updates at concatenation and star nodes):
  Observation: for any subexpression F of a subexpression G of E and x ∈ pos(F), follow(F, x) ⊆ follow(G, x) ⊆ follow(E, x).
  The run time for (**) or (***) at a node v and for a position x is proportional to the number of positions in follow(E_v, x) that are not present in follow(F, x) for any proper subexpression F of E_v.
  Total run time spent in instructions (**) or (***): Σ_{x ∈ pos(E)} |follow(E, x)| (the unions are disjoint because E is in SNF), which is less than or equal to the number of transitions in M_E.

Why the restriction to star normal form is justified
Theorem 3.1
For each regular expression E, there is a regular expression E• such that
- M_E• = M_E (Glushkov automaton)
- E• is in star normal form
- E• can be computed from E in linear time.

From a starred expression E* to E°*
Goal: the SNF conditions are fulfilled for E°
Observation: remove from M_E all "feedback" transitions leading from final states (apart from q_I) to states that q_I is directly connected to, and make q_I non-final. The resulting NFA is the Glushkov automaton of an expression E° with follow(E°, last(E°)) ∩ first(E°) = ∅.
Example: E = (a1*b2*)*, E° = (a1+b2)
[Figure: Glushkov automata of E and E°]

E°: inductive definition
E° = ∅   [E = ε or ∅]
E° = E   [E = a]
E° = F° + G°   [E = F + G]
E° = FG if ε ∉ L(F) and ε ∉ L(G)   [E = FG]
E° = F°G if ε ∉ L(F) and ε ∈ L(G)   [E = FG]
E° = FG° if ε ∈ L(F) and ε ∉ L(G)   [E = FG]
E° = F° + G° (!) if ε ∈ L(F) and ε ∈ L(G)   [E = FG]
E° = F° (!)   [E = F*]
Example: E = (a1*b2*)*, E° = (a1+b2)
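The inductive definition translates almost line by line into a recursive function. In this sketch the tuple encoding of expressions and the names `nullable` and `circ` are assumptions; because `nullable` is recomputed recursively, the sketch does not achieve the linear running time discussed on the slides.

```python
def nullable(e):
    """True if and only if the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("eps", "*"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])   # concatenation


def circ(e):
    """Return the expression written E° on the slides."""
    tag = e[0]
    if tag in ("eps", "empty"):
        return ("empty",)
    if tag == "sym":
        return e
    if tag == "+":
        return ("+", circ(e[1]), circ(e[2]))
    if tag == "*":
        return circ(e[1])
    f, g = e[1], e[2]                          # concatenation FG
    nf, ng = nullable(f), nullable(g)
    if not nf and not ng:
        return e
    if not nf:
        return (".", circ(f), g)
    if not ng:
        return (".", f, circ(g))
    return ("+", circ(f), circ(g))


A, B = ("sym", "a"), ("sym", "b")
E = ("*", (".", ("*", A), ("*", B)))           # (a*b*)*
```

Applied to the slide's example, `circ(E)` yields the tree for a + b.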

Lemma
1. size(E°) ≤ size(E).
2. ε ∉ L(E°).
3. pos(E°) = pos(E).
4. first(E°) = first(E), last(E°) = last(E).
5. follow(E°, x) = follow(E, x) for all x ∈ pos(E) \ last(E).
6. follow(E°, x) = follow(E, x) \ first(E) for all x ∈ last(E) ⇒ follow(E°, last(E°)) ∩ first(E°) = ∅.
7. follow(E°*, x) = follow(E*, x) for all x ∈ pos(E).
8. M_E* = M_E°*.
The proof is by induction on E. Claims 7 and 8 follow directly from 5 and 6.

From E° to E•
If we substitute in E each starred subexpression H* with H°*, proceeding bottom-up in E, we can expect to get an expression E• in star normal form with M_E = M_E•.

E•: inductive definition
E• = E   [E = a, ε or ∅]
E• = F• + G•   [E = F + G]
E• = F•G•   [E = FG]
E• = F•°*   [E = F*]
Example: E = (a*b*)*
E• = (a*b*)•°* = (a•°*b•°*)°* = (a•° + b•°)* = (a+b)*

M_E• = M_E
Lemma 3.5
- L(E) = L(E•)
- size(E•) ≤ size(E)
- pos(E•) = pos(E)
- first(E•) = first(E), last(E•) = last(E)
- follow(E•, x) = follow(E, x) for x ∈ pos(E)
- q_I ∈ F_E• if and only if q_I ∈ F_E
These claims imply the first part of Theorem 3.1: M_E• = M_E.

E• is in SNF
The proof is by induction on the size of E.
The star case [E = F*] ⇒ E• = F•°*
- The SNF conditions follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H) hold for H = F•° (Lemma 3.3)
- F•°• is in SNF, by the induction hypothesis
- Need to show that F•°• = F•°

Lemma 3.6
(1) E°° = E°
(2) E•°• = E•°
(3) E•• = E•
Proof: by induction on E. The star case [E = F*]:
(1) E°° = F°° = F° = E°   (def °, induction, def °)
(2) E•°• = F•°*°• = F•°°• = F•°• = F•° = E•°   (def •, def °, (1), induction, def ° and (1))
(3) E•• = F•°*• = F•°•°* = F•°°* = F•°* = E•   (def •, def •, (2) by induction, (1), def •)

Computing E• from E in linear time
- For each subexpression H of E, we need H• and H•° for computing E•.
- H• and H•° are computed simultaneously during the postorder traversal.
- It remains to prove that only a constant amount of time is spent at each node.

Lemma 3.7
ε•° = ∅ = ∅•°   [E = ε or ∅]
E•° = E   [E = a]
E•° = F•° + G•°   [E = F + G]
E•° = F•G• if ε ∉ L(F) and ε ∉ L(G)   [E = FG]
E•° = F•°G• if ε ∉ L(F) and ε ∈ L(G)   [E = FG]
E•° = F•G•° if ε ∈ L(F) and ε ∉ L(G)   [E = FG]
E•° = F•° + G•° if ε ∈ L(F) and ε ∈ L(G)   [E = FG]
E•° = F•°   [E = F*]
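Lemma 3.7 suggests computing E• and E•° together in one postorder pass, with constant work per node. The following Python is a sketch under assumptions: expressions are nested tuples, and the names `snf_pair` and `star_normal_form` are hypothetical.

```python
def snf_pair(e):
    """Return (nullable, E-bullet, E-bullet-circ) for e in a single pass."""
    tag = e[0]
    if tag == "eps":
        return True, e, ("empty",)
    if tag == "empty":
        return False, e, ("empty",)
    if tag == "sym":
        return False, e, e
    if tag == "*":
        _, _, body_circ = snf_pair(e[1])
        # (F*)-bullet = (F-bullet-circ)*, and (F*)-bullet-circ = F-bullet-circ
        return True, ("*", body_circ), body_circ
    nf, fb, fbc = snf_pair(e[1])
    ng, gb, gbc = snf_pair(e[2])
    if tag == "+":
        return nf or ng, ("+", fb, gb), ("+", fbc, gbc)
    bullet = (".", fb, gb)                      # concatenation FG
    if not nf and not ng:
        circ = bullet
    elif not nf:
        circ = (".", fbc, gb)
    elif not ng:
        circ = (".", fb, gbc)
    else:
        circ = ("+", fbc, gbc)
    return nf and ng, bullet, circ


def star_normal_form(e):
    """Return the expression written E-bullet on the slides."""
    return snf_pair(e)[1]


A, B = ("sym", "a"), ("sym", "b")
E = ("*", (".", ("*", A), ("*", B)))            # (a*b*)*
```

On the running example, `star_normal_form(E)` produces the tree for (a+b)*.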

Example
(a*b*)*• = (a*b*)•°* by definition.
Repeated application of Lemma 3.7 yields: (a*b*)•°* = (a*•° + b*•°)* = (a•° + b•°)* = (a+b)*.
(a+b)* is the star normal form of (a*b*)*. Both expressions have the same Glushkov NFA.
[Figure: the common Glushkov NFA of (a*b*)* and (a+b)*]

Conclusions so far
Theorem 3.9
The Glushkov automaton M_E can be computed from a regular expression E in time linear in size(E) + size(M_E).
Proof
- E• is computed from E in linear time.
- E• is in star normal form and M_E• = M_E ⇒ M_E can be computed from E in time O(size(E) + size(M_E)).

Deterministic regular expressions
A regular expression E is deterministic if the corresponding NFA M_E is deterministic.
Theorem
1. It can be decided in linear time whether a regular expression E is deterministic.
2. If E is deterministic, then the deterministic finite automaton M_E can be computed from E in linear time.

Theorem: proof
- E is deterministic if and only if E• is (isomorphic Glushkov automata) ⇒ we can assume that E is in star normal form.
- We start to compute first(E), last(E), and follow(E, x) for x ∈ pos(E) incrementally, keeping track of follow(E, x) in a |pos(E)| × |Σ| matrix.
Example: E = (a1+b2)*
  pos   a    b
  1     a1   b2
  2     a1   b2
E is deterministic.
Example: E = (a1+b2)*a3
  pos   a         b
  1     a1 & a3   b2
  2               b2
  3
E is nondeterministic.
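The matrix idea can be illustrated with a small Python check that fills one row per state and stops at the first cell that would receive two positions. This sketch assumes the position sets are already available (here worked out by hand for the two example expressions), so it does not reproduce the incremental linear-time procedure; the name `is_deterministic` is hypothetical.

```python
def is_deterministic(first, follow, chi):
    """True iff no state has two successors carrying the same symbol."""
    rows = [first] + list(follow.values())   # row of q_I, then one row per position
    for row in rows:
        cell = {}                            # symbol -> position, one matrix row
        for y in row:
            if chi[y] in cell:               # second position for the same symbol
                return False
            cell[chi[y]] = y
    return True


# E = (a1 + b2)*
CHI1 = {1: "a", 2: "b"}
FIRST1 = {1, 2}
FOLLOW1 = {1: {1, 2}, 2: {1, 2}}

# E = (a1 + b2)* a3
CHI2 = {1: "a", 2: "b", 3: "a"}
FIRST2 = {1, 2, 3}
FOLLOW2 = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()}
```

Because each row is abandoned as soon as a conflict appears, at most |Σ| + 1 entries of any row are inspected.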

Ambiguity in automata and expressions
Unambiguous ε-NFA (definition): for each word w, there is at most one path from the initial state to a final state that spells out w.
Weakly unambiguous
- Intuition: each word of L(E) has a unique path through E
- Definition: a regular expression E is weakly unambiguous if and only if the NFA M_E is unambiguous.
Strongly unambiguous
- Intuition: each word of L(E) can be uniquely decomposed into subwords according to the structure of E
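One way to make the definition of weak unambiguity concrete is to search the product of M_E with itself for two distinct accepting paths on the same word. The Python below is a brute-force sketch, not the quadratic algorithm from the talk; it takes hand-computed position sets as input, uses 0 for q_I, and the name `weakly_unambiguous` is an assumption.

```python
def weakly_unambiguous(first, last, follow, chi):
    """True iff no word has two different accepting paths in the Glushkov NFA."""
    def succ(q):
        return first if q == 0 else follow[q]      # 0 stands for q_I

    def step(pair):
        p, q = pair
        return {(y, z) for y in succ(p) for z in succ(q) if chi[y] == chi[z]}

    def reach(starts):
        seen, todo = set(starts), list(starts)
        while todo:
            for nxt in step(todo.pop()):
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen

    # Two paths on the same word differ iff they pass through a pair (p, q), p != q.
    for p, q in reach({(0, 0)}):
        if p != q and any(y in last and z in last for y, z in reach({(p, q)})):
            return False
    return True


CHI = {1: "a", 2: "b", 3: "a"}
# (a1 + b2)* a3: nondeterministic, yet every word has at most one accepting path
EX1 = dict(first={1, 2, 3}, last={3},
           follow={1: {1, 2, 3}, 2: {1, 2, 3}, 3: set()}, chi=CHI)
# a1* a2*: the word "a" is accepted via position 1 and via position 2
EX2 = dict(first={1, 2}, last={1, 2},
           follow={1: {1, 2}, 2: {2}}, chi={1: "a", 2: "a"})
```

The state q_I can be final as well, but it only occurs in the pair (q_I, q_I), so nullability does not affect the search.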

Strongly unambiguous
- E is strongly unambiguous   [E = ε or a]
- E is strongly unambiguous if F and G are strongly unambiguous and L(F) and L(G) are disjoint   [E = F + G]
- E is strongly unambiguous if F and G are strongly unambiguous and the concatenation of L(F) and L(G) is unambiguous   [E = FG]
- E is strongly unambiguous if F is strongly unambiguous and the star of L(F) is unambiguous   [E = F*]
Concatenation: L·L' is unambiguous if, for v, w ∈ L and v', w' ∈ L', vv' = ww' ⇒ v = w and v' = w'.
Star: L* is unambiguous if, for v1, …, vm ∈ L and w1, …, wn ∈ L with m, n ≥ 0, v1…vm = w1…wn ⇒ m = n and vi = wi for 1 ≤ i ≤ m.
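For finite languages, the definition of an unambiguous concatenation can be tested by brute force. The helper below is purely illustrative (the name `concat_unambiguous` is hypothetical) and expects both languages as Python sets of strings.

```python
from itertools import product


def concat_unambiguous(lang1, lang2):
    """True iff every word of lang1 . lang2 has exactly one split v + w."""
    seen = set()
    for v, w in product(lang1, lang2):
        if v + w in seen:        # a second pair (v, w) produced the same word
            return False
        seen.add(v + w)
    return True
```

For example, {a, ab}·{b, bb} is ambiguous because abb = a·bb = ab·b, whereas {a}·{b, bb} is unambiguous.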

Strongly unambiguous: in terms of automata
Let M'_E be the ε-NFA recognizing L(E) according to any of the standard constructions.
Lemma 4.5: E is strongly unambiguous if and only if M'_E is unambiguous.
Lemma 4.6: If E is strongly unambiguous, then E is weakly unambiguous.
Proof
- Elimination of ε-transitions transforms M'_E into M_E.
- Different paths in M_E spelling out a word w correspond to different paths in M'_E doing the same.
- Unambiguity of M'_E (Lemma 4.5) ⇒ unambiguity of M_E

Lemma 4.7: weakly unambiguous
- E is weakly unambiguous   [E = ε or a]
- E is weakly unambiguous if and only if F and G are weakly unambiguous and at most ε is in both L(F) and L(G)   [E = F + G]
- E is weakly unambiguous if and only if F and G are weakly unambiguous and the concatenation of L(F) and L(G) is unambiguous   [E = FG]
- Let follow(F, last(F)) ∩ first(F) = ∅ and ε ∉ L(F). Then E is weakly unambiguous if and only if F is weakly unambiguous and the star of L(F) is unambiguous   [E = F*]

Lemma 4.7: proof
[E = F + G] Since Glushkov automata have no ε-transitions, the only path denoting the empty word is the empty path. Furthermore, any path through F or through G is also a path through E, and any non-empty path through F is different from any path through G.
[E = FG] Assume that E is weakly unambiguous. Since L(F) ≠ ∅ ≠ L(G), each path through F or G can be completed to a path through E. Thus, F and G are weakly unambiguous. Two decompositions of a word u ∈ L(F)L(G), u = vw = v'w' with v, v' ∈ L(F) and w, w' ∈ L(G), correspond to paths x1…xm y1…yn and x'1…x'm' y'1…y'n' through E, where the x-positions belong to F and the y-positions to G. Since E is weakly unambiguous, the two paths through E are identical. Since the positions of F and G are disjoint, we have m = m' and n = n', i.e. v = v' and w = w'. Thus, the concatenation of L(F) and L(G) is unambiguous.
[E = F*] Since ε ∉ L(F), the empty word is uniquely decomposed into a sequence of words in L(F). Any non-empty path through M_E is determined by a sequence of positions x1, …, xn, n ≥ 1, which consists of a sequence of paths through M_F. Because follow(F, last(F)) ∩ first(F) = ∅, the starting positions of those paths are uniquely determined. Hence, if E is weakly unambiguous, then the star of L(F) is unambiguous. The other direction is obvious.

Epsilon normal form
Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously.
- E is in epsilon normal form   [E = ε or a]
- E is in epsilon normal form if F and G are in epsilon normal form and ε ∉ L(F) ∩ L(G)   [E = F + G]
- E is in epsilon normal form if F and G are in epsilon normal form   [E = FG]
- E is in epsilon normal form if F is in epsilon normal form and ε ∉ L(F)   [E = F*]
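The four cases can be turned into a recursive check. The sketch below is illustrative only: the tuple encoding and the names `nullable` and `in_epsilon_normal_form` are assumptions, only the base cases ε and a are handled, and the repeated `nullable` calls make it slower than the linear-time test mentioned under the open problems.

```python
def nullable(e):
    """True if and only if the empty word belongs to L(e)."""
    tag = e[0]
    if tag in ("eps", "*"):
        return True
    if tag == "sym":
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    if tag == ".":
        return nullable(e[1]) and nullable(e[2])
    raise ValueError(f"unsupported node: {tag}")


def in_epsilon_normal_form(e):
    """Check the epsilon normal form conditions case by case."""
    tag = e[0]
    if tag in ("eps", "sym"):
        return True
    if tag == "+":
        both_nullable = nullable(e[1]) and nullable(e[2])
        return (in_epsilon_normal_form(e[1]) and in_epsilon_normal_form(e[2])
                and not both_nullable)
    if tag == ".":
        return in_epsilon_normal_form(e[1]) and in_epsilon_normal_form(e[2])
    if tag == "*":
        return in_epsilon_normal_form(e[1]) and not nullable(e[1])
    raise ValueError(f"unsupported node: {tag}")


A, B = ("sym", "a"), ("sym", "b")
```

For example, a* + b is in epsilon normal form, whereas a* + b* and (a*)* are not.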

Strongly unambiguous expressions are in star and in epsilon normal form
Lemma 4.10: If E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅.
Proof
- Assume that there exist x ∈ last(E), y ∈ follow(E, x) ∩ first(E) and z ∈ last(E) ⇒ x is a final state in M_E (and so is z).
- x1…xn x y y1…ym z is a path through M_E.
- But this path is also the composition of two paths through M_E, namely x1…xn x and y y1…ym z ⇒ this makes the star of L(E) ambiguous.

Theorem 4.9
E is strongly unambiguous if and only if
1. E is weakly unambiguous
2. E is in star normal form
3. E is in epsilon normal form
Proof
- For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7).
- Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10).

Test for weak unambiguity in quadratic time
Theorem 4.11: Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time.
Proof
- Let E be in epsilon normal form.
- E can be transformed into star normal form E• in linear time without changing the Glushkov automaton.
- E• is also in epsilon normal form.
- E is weakly unambiguous if and only if E• is weakly unambiguous, if and only if E• is strongly unambiguous.
- Strong unambiguity of expressions can be decided in quadratic time.

Open problems
It is easy to see that a regular expression can be tested for epsilon normal form in linear time.
- Can a given regular expression be transformed into epsilon normal form in linear time?
Our transformation into star normal form can deal with starred subexpressions. Hence, the crucial point is how expressions E = F + G with ε ∈ L(F) ∩ L(G) can be handled. A straightforward approach would eliminate the empty string either from L(F) or from L(G). This opens up another question:
- Is there a linear-time algorithm transforming a regular expression E into an expression E' with L(E') = L(E) \ {ε}?

The End
