Regular Expressions into Finite Automata Anne Bruggemann-Klein Presenting: Rutie Mesing.


Outline
  Building the Glushkov automaton in O((size of E)^2)
  Defining the star normal form
  Building the Glushkov automaton in O(size of E) for deterministic regular expressions
  Strong and weak unambiguity
  Quadratic-time decision algorithm for weak unambiguity

General definitions
  E: a regular expression.
  L(E): the language specified by the regular expression E.
  The size of a regular expression E: the number of symbols it contains, including syntactic symbols such as brackets, +, ., and *.
  The size of an NFA: the number of its transitions.

pos(E), χ(x)
  (a+b)*a(ab)* → (a_1+b_2)*a_3(a_4 b_5)*
  pos(E): the set of subscripted symbols (positions) in an expression E.
  x, y, z denote positions; a, b, c denote elements of the alphabet Σ.
  For a position x, χ(x) is the corresponding symbol of Σ.
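The marking step can be sketched in a few lines of Python. This is only an illustration, assuming that alphabet symbols are single letters and that positions are numbered from left to right; the function name mark is hypothetical and does not come from the slides.

```python
def mark(expr: str) -> str:
    """Attach a running index to every letter of expr (letters are the alphabet symbols)."""
    out = []
    index = 0
    for ch in expr:
        if ch.isalpha():
            index += 1               # next position number
            out.append(f"{ch}{index}")
        else:
            out.append(ch)           # operators and brackets are copied unchanged
    return "".join(out)

marked = mark("(a+b)*a(ab)*")
print(marked)
```

Here χ simply strips the index again, so the position a3 corresponds to the symbol a.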

Position sets: first(E), last(E) (inductive definition)
  [E = ∅ or ε]  first(E) = last(E) = ∅
  [E = x]       first(E) = last(E) = {x}
  [E = F + G]   first(E) = first(F) ∪ first(G)
                last(E) = last(F) ∪ last(G)
  [E = FG]      first(E) = first(F) ∪ first(G) if ε ∈ L(F), first(F) otherwise
                last(E) = last(F) ∪ last(G) if ε ∈ L(G), last(G) otherwise
  [E = F*]      first(E) = first(F)
                last(E) = last(F)

Position sets: follow(E, x) (inductive definition)
  [E = ∅ or ε]  E has no positions
  [E = x]       follow(E, x) = ∅
  [E = F + G]   follow(E, x) = follow(F, x) if x ∈ pos(F)
                follow(E, x) = follow(G, x) if x ∈ pos(G)
  [E = FG]      follow(E, x) = follow(F, x) if x ∈ pos(F) \ last(F)
                follow(E, x) = follow(F, x) ∪ first(G) if x ∈ last(F)
                follow(E, x) = follow(G, x) if x ∈ pos(G)
  [E = F*]      follow(E, x) = follow(F, x) if x ∈ pos(F) \ last(F)
                follow(E, x) = follow(F, x) ∪ first(F) if x ∈ last(F)
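The inductive definitions of first, last and follow translate almost literally into recursive functions. The following Python sketch assumes a tuple encoding of expressions that is not part of the slides (positions are strings such as "a1"), and it recomputes subresults freely, so it illustrates the definitions rather than an efficient algorithm.

```python
# Assumed encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def nullable(e):
    """True iff the empty word belongs to L(e)."""
    op = e[0]
    if op in ("eps", "*"):
        return True
    if op in ("empty", "pos"):
        return False
    left, right = nullable(e[1]), nullable(e[2])
    return (left or right) if op == "+" else (left and right)

def pos(e):
    """Set of positions occurring in e."""
    if e[0] == "pos":
        return {e[1]}
    return set().union(*(pos(sub) for sub in e[1:]))

def first(e):
    op = e[0]
    if op in ("empty", "eps"):
        return set()
    if op == "pos":
        return {e[1]}
    if op == "*":
        return first(e[1])
    if op == "+" or nullable(e[1]):      # union case of + and of FG with nullable F
        return first(e[1]) | first(e[2])
    return first(e[1])

def last(e):
    op = e[0]
    if op in ("empty", "eps"):
        return set()
    if op == "pos":
        return {e[1]}
    if op == "*":
        return last(e[1])
    if op == "+" or nullable(e[2]):      # union case of + and of FG with nullable G
        return last(e[1]) | last(e[2])
    return last(e[2])

def follow(e, x):
    """follow(e, x); x must be a position of e."""
    op = e[0]
    if op == "pos":
        return set()
    if op == "*":
        f = e[1]
        return follow(f, x) | first(f) if x in last(f) else follow(f, x)
    f, g = e[1], e[2]
    if x in pos(g):
        return follow(g, x)
    if op == "." and x in last(f):
        return follow(f, x) | first(g)
    return follow(f, x)

# E = (a1+b2)* a3 (a4 b5)*
a1, b2, a3, a4, b5 = (("pos", p) for p in ("a1", "b2", "a3", "a4", "b5"))
E = (".", (".", ("*", ("+", a1, b2)), a3), ("*", (".", a4, b5)))
print(first(E), last(E), follow(E, "a1"))
```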

The Glushkov automaton (NFA)
  M_E = (Q_E ∪ {q_I}, Σ, δ_E, q_I, F_E)
  Q_E = pos(E)
  For a ∈ Σ, let δ_E(q_I, a) = {x | x ∈ first(E), χ(x) = a}
  For x ∈ pos(E), a ∈ Σ, let δ_E(x, a) = {y | y ∈ follow(E, x), χ(y) = a}
  F_E = last(E) ∪ {q_I} if ε ∈ L(E), last(E) otherwise
  Proposition 2.1: L(M_E) = L(E)
  Example: (a*+ba)* = (a_1*+b_2 a_3)*
  [Figure: Glushkov automaton of (a_1*+b_2 a_3)* with states q_I, 1, 2, 3 and transitions labeled a and b]
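As an illustration of the construction, the following Python sketch builds the transition function of M_E from given first, last and follow sets and simulates the NFA on a word. The sets for the example (a_1*+b_2 a_3)* are written out by hand from the inductive definitions; the function names and the string encoding of positions ("a1" for a_1, "qI" for q_I) are assumptions made for this sketch.

```python
def glushkov(first, last, follow, nullable, chi=lambda x: x[0]):
    """Return (delta, finals) of M_E; states are the positions plus the initial state 'qI'."""
    delta = {}

    def add(source, targets):
        for y in targets:
            # every transition into position y is labeled with chi(y)
            delta.setdefault((source, chi(y)), set()).add(y)

    add("qI", first)
    for x, successors in follow.items():
        add(x, successors)
    finals = set(last) | ({"qI"} if nullable else set())
    return delta, finals

def accepts(delta, finals, word):
    """Standard subset simulation of the NFA on word."""
    states = {"qI"}
    for a in word:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return bool(states & finals)

# Sets for E = (a1* + b2 a3)*, derived by hand from the definitions
first = {"a1", "b2"}
last = {"a1", "a3"}
follow = {"a1": {"a1", "b2"}, "b2": {"a3"}, "a3": {"a1", "b2"}}
delta, finals = glushkov(first, last, follow, nullable=True)
print(accepts(delta, finals, "aba"), accepts(delta, finals, "ab"))
```

The number of entries added to delta equals |first(E)| plus the sum of |follow(E, x)| over all positions, which is the size of M_E as defined above.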

The canonical method (O(n^3)) for computing first, last and follow
  Convert E into a syntax tree.
    Leaves are labeled with ∅, ε, or positions of E.
    Internal nodes are labeled +, ., or *.
    Building time: O(n), where n = size of E.
    Each node v in the syntax tree corresponds to a subexpression E_v of E.
  Traverse the syntax tree in postorder, computing:
    nullable(v): Boolean, true if L(E_v) contains ε
    first(v), last(v): elements of 2^pos(E)
    follow(x): for each x ∈ pos(E) there is a global variable follow(x), an element of 2^pos(E)
  Total time: O(n^3)

case
  v is a node labeled ∅:
    nullable(v) := false;
    first(v) := ∅; last(v) := ∅;
  v is a node labeled ε:
    nullable(v) := true;
    first(v) := ∅; last(v) := ∅;
  v is a node labeled x:
    nullable(v) := false;
    follow(x) := ∅;
    first(v) := {x}; last(v) := {x};
  v is a node labeled +:
    nullable(v) := nullable(leftchild) or nullable(rightchild);
    first(v) := first(leftchild) ∪ first(rightchild);        (A)
    last(v) := last(leftchild) ∪ last(rightchild);           (A)
  v is a node labeled . :
    nullable(v) := nullable(leftchild) and nullable(rightchild);
    for each x in last(leftchild) do
      follow(x) := follow(x) ∪ first(rightchild);            (B)
    if nullable(leftchild)
      then first(v) := first(leftchild) ∪ first(rightchild)  (A)
      else first(v) := first(leftchild);
    if nullable(rightchild)
      then last(v) := last(leftchild) ∪ last(rightchild)     (A)
      else last(v) := last(rightchild);
  v is a node labeled *:
    nullable(v) := true;
    for each x in last(child) do
      follow(x) := follow(x) ∪ first(child);                 (C)
    first(v) := first(child); last(v) := last(child);
end case;
Union labels: (A) unions of first/last sets, (B) unions into follow sets at . nodes, (C) unions into follow sets at * nodes.
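The case statement can be turned into a short recursive postorder traversal. The Python sketch below assumes a tuple encoding of syntax trees that is not part of the slides; first and last are kept as lists so that their unions are plain concatenations, while follow is a dictionary of sets filled in place. The function name visit is hypothetical.

```python
# Assumed encoding (not from the slides): syntax trees are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def visit(v, follow):
    """Postorder visit of node v; returns (nullable, first, last) and fills follow in place."""
    op = v[0]
    if op == "empty":
        return False, [], []
    if op == "eps":
        return True, [], []
    if op == "pos":
        follow[v[1]] = set()
        return False, [v[1]], [v[1]]
    if op == "*":
        _, f, l = visit(v[1], follow)
        for x in l:
            follow[x] |= set(f)          # follow union at a * node, may overlap
        return True, f, l
    nl, fl, ll = visit(v[1], follow)     # left child first, then right child
    nr, fr, lr = visit(v[2], follow)
    if op == "+":
        return nl or nr, fl + fr, ll + lr
    # concatenation node
    for x in ll:
        follow[x] |= set(fr)             # follow union at a . node, always disjoint
    first = fl + fr if nl else fl
    last = ll + lr if nr else lr
    return nl and nr, first, last

# E = (a1* + b2 a3)*
a1, b2, a3 = ("pos", "a1"), ("pos", "b2"), ("pos", "a3")
E = ("*", ("+", ("*", a1), (".", b2, a3)))
follow = {}
result = visit(E, follow)
print(result, follow)
```

Python sets hide the cost of the possibly overlapping union at * nodes; with plain lists that union would need duplicate elimination, which is where the cubic bound of the canonical method comes from.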

Lemma 2.5
  The following invariant holds after node v has been visited:
    1. nullable(v) is true if and only if ε ∈ L(E_v).
    2. first(v) = first(E_v), last(v) = last(E_v).
  Furthermore, if node v has been visited but the parent of v has not, then
    3. follow(x) = follow(E_v, x) for x ∈ pos(E_v).
  In particular, for the root node v_0:
    1. first(v_0) = first(E), last(v_0) = last(E).
    2. follow(x) = follow(E, x) for x ∈ pos(E).

Observations
  All unions labeled (A) (first/last sets at + and . nodes) or (B) (follow sets at . nodes) are disjoint, because pos(F) ∩ pos(G) = ∅.
  Only unions labeled (C) (follow sets at * nodes) are not necessarily disjoint.
  Example: E = (a*b*)*, H = a*b*. Elements of first(H) are added to follow(H, x) for x ∈ last(H), but some elements of first(H) may already belong to follow(H, x) for some x ∈ last(H).
  Result: O(n^3) for computing first(E), last(E) and follow(E, x).

Computing first, last and follow in a better time bound (O(n^2))
  General strategy:
    First, we only consider expressions for which all unions, including the ones of type (C) (follow sets at * nodes), are disjoint. Such expressions are in star normal form (SNF).
    Then we show that the algorithm runs in time O(size(M_E)) for expressions E in star normal form.
    Finally, we show why the restriction to star normal form is justified.

Star normal form: definition
  A regular expression E is in star normal form if, for each starred subexpression H* of E, the SNF conditions
    follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H)
  hold. Here follow(H, last(H)) is the union of follow(H, x) over all x ∈ last(H).
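The two SNF conditions can be checked during the same kind of postorder traversal that computes first, last and follow. The Python sketch below assumes a tuple encoding of expressions that is not part of the slides; the function names are hypothetical, and the sketch uses Python sets without aiming at the time bounds discussed in the talk.

```python
# Assumed encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def check_snf(v, follow):
    """Return (nullable, first, last, ok); ok is False if some starred subexpression violates SNF."""
    op = v[0]
    if op in ("empty", "eps"):
        return op == "eps", set(), set(), True
    if op == "pos":
        follow[v[1]] = set()
        return False, {v[1]}, {v[1]}, True
    if op == "*":
        n, f, l, ok = check_snf(v[1], follow)
        # at this point follow[x] = follow(H, x) for the child H of the star
        ok = ok and not n and all(not (follow[x] & f) for x in l)
        for x in l:
            follow[x] |= f
        return True, f, l, ok
    nl, fl, ll, okl = check_snf(v[1], follow)
    nr, fr, lr, okr = check_snf(v[2], follow)
    if op == "+":
        return nl or nr, fl | fr, ll | lr, okl and okr
    for x in ll:                          # concatenation node
        follow[x] |= fr
    first = fl | fr if nl else fl
    last = ll | lr if nr else lr
    return nl and nr, first, last, okl and okr

def in_star_normal_form(e):
    return check_snf(e, {})[3]

a, b = ("pos", "a1"), ("pos", "b2")
print(in_star_normal_form(("*", (".", ("*", a), ("*", b)))))   # (a1* b2*)*
print(in_star_normal_form(("*", ("+", a, b))))                 # (a1 + b2)*
```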

Lemma 2.7
  Let E be a regular expression in star normal form. Then M_E can be computed from E in time O(size(E) + size(M_E)).
  Proof
    Unions labeled (A) (first/last sets) take constant time (list concatenation).
    Unions labeled (B) or (C) (follow sets at . and * nodes):
      Observation: for any subexpression F of a subexpression G of E and any x ∈ pos(F), follow(F, x) ⊆ follow(G, x) ⊆ follow(E, x).
      The run time for (B) or (C) in a node v and for a position x is proportional to the number of positions in follow(E_v, x) that are not already present in the follow sets of the subexpressions of E_v.
      Because all unions are disjoint (SNF), the total run time spent in instructions (B) or (C) is Σ_{x ∈ pos(E)} |follow(E, x)|, which is less than or equal to the number of transitions in M_E.

Why the restriction to star normal form is justified
  Theorem 3.1
    For each regular expression E, there is a regular expression E• such that
      1. M_E• = M_E (same Glushkov automaton),
      2. E• is in star normal form,
      3. E• can be computed from E in linear time.

From starred expression E* to E°*
  Goal: the SNF conditions are fulfilled for E°.
  Observation: remove from M_E all "feedback" transitions leading from final states (apart from q_I) to states that q_I is directly connected to, and make q_I non-final. The resulting NFA is the Glushkov automaton of E°, with follow(E°, last(E°)) ∩ first(E°) = ∅.
  Example: E = (a_1*b_2*)*, E° = (a_1+b_2)
  [Figure: Glushkov automata of E and of E°, with transitions labeled a and b]

E°: inductive definition
  [E = ∅ or ε]  E° = ∅
  [E = a]       E° = E
  [E = F + G]   E° = F° + G°
  [E = FG]      E° = FG        if ε ∉ L(F), ε ∉ L(G)
                E° = F°G       if ε ∉ L(F), ε ∈ L(G)
                E° = FG°       if ε ∈ L(F), ε ∉ L(G)
                E° = F° + G°   if ε ∈ L(F), ε ∈ L(G)   (!)
  [E = F*]      E° = F°   (!)
  Example: E = (a_1*b_2*)*, E° = (a_1+b_2)
  [Figure: Glushkov automata of E and of E°, with transitions labeled a and b]
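The inductive definition of E° can be written as a direct recursion. The Python sketch below assumes a tuple encoding of expressions that is not part of the slides, and the name circ for the ° operator is hypothetical. It calls nullable repeatedly, so it illustrates the definition only and is not a linear-time implementation.

```python
# Assumed encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def nullable(e):
    """True iff the empty word belongs to L(e)."""
    op = e[0]
    if op in ("eps", "*"):
        return True
    if op in ("empty", "pos"):
        return False
    left, right = nullable(e[1]), nullable(e[2])
    return (left or right) if op == "+" else (left and right)

def circ(e):
    """E° following the inductive definition, case by case."""
    op = e[0]
    if op in ("empty", "eps"):
        return ("empty",)
    if op == "pos":
        return e
    if op == "+":
        return ("+", circ(e[1]), circ(e[2]))
    if op == "*":
        return circ(e[1])                # the star is dropped
    f, g = e[1], e[2]
    nf, ng = nullable(f), nullable(g)
    if not nf and not ng:
        return e
    if not nf:                           # only G is nullable
        return (".", circ(f), g)
    if not ng:                           # only F is nullable
        return (".", f, circ(g))
    return ("+", circ(f), circ(g))       # both nullable: concatenation becomes union

a1, b2 = ("pos", "a1"), ("pos", "b2")
H = (".", ("*", a1), ("*", b2))          # a1* b2*
print(circ(H))
print(circ(("*", H)))
```

For H = a_1*b_2* both factors are nullable, so the result is the union a_1 + b_2, matching the example.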

Lemma 3.3
  1. size(E°) ≤ size(E).
  2. ε ∉ L(E°).
  3. pos(E°) = pos(E).
  4. first(E°) = first(E), last(E°) = last(E).
  5. follow(E°, x) = follow(E, x) for all x ∈ pos(E) \ last(E).
  6. follow(E°, x) = follow(E, x) \ first(E) for all x ∈ last(E); hence follow(E°, last(E°)) ∩ first(E°) = ∅.
  7. follow(E°*, x) = follow(E*, x) for all x ∈ pos(E).
  8. M_E* = M_E°*.
  The proof is by induction on E. Claims 7 and 8 follow directly from 5 and 6.

From E° to E•
  If we substitute each starred subexpression H* of E with H°*, proceeding bottom-up in E, we can expect to get an expression E• in star normal form with M_E = M_E•.

E•: inductive definition
  [E = a, ∅ or ε]  E• = E
  [E = F + G]      E• = F• + G•
  [E = FG]         E• = F•G•
  [E = F*]         E• = F•°*
  Example: E = (a*b*)*
    E• = (a*b*)•°* = (a•°*b•°*)°* = (a•°+b•°)* = (a+b)*
  [Figure: Glushkov automata of E = (a_1*b_2*)* and of E° = (a_1+b_2), with transitions labeled a and b]

M_E• = M_E
  Lemma 3.5
    L(E) = L(E•)
    size(E•) ≤ size(E)
    pos(E•) = pos(E)
    first(E•) = first(E)
    last(E•) = last(E)
    follow(E•, x) = follow(E, x) for x ∈ pos(E)
    q_I ∈ F_E• if and only if q_I ∈ F_E
  These claims imply the first part of Theorem 3.1, M_E• = M_E.

E• is in SNF
  The proof is by induction on the size of E.
  The star case [E = F*]: E• = F•°*.
    The SNF conditions follow(H, last(H)) ∩ first(H) = ∅ and ε ∉ L(H) hold for H = F•° (Lemma 3.3).
    F°• is in SNF by the induction hypothesis.
    It remains to show that F•° = F°•.

Lemma 3.6
  (1) E°° = E°
  (2) E•° = E°•
  (3) E•• = E•
  Proof: by induction on E. The star case [E = F*]:
  (1) E°° = F°° = F° = E°
      (definition of °; induction hypothesis; definition of °)
  (2) E•° = F•°*° = F•°° = F•° = F°• = E°•
      (definition of •; definition of °; (1); induction hypothesis; definition of °)
  (3) E•• = F•°*• = F•°•°* = F•°°* = F•°* = E•
      (definition of •; definition of •; (2) and induction hypothesis; (1); definition of •)

Computing E• from E in linear time
  For each subexpression H of E, we need H• and H•° for computing E•.
  H• and H•° are computed simultaneously during the postorder traversal.
  It remains to prove that only a constant amount of time is spent at each node.

Lemma 3.7
  [E = ∅ or ε]  E•° = ∅
  [E = a]       E•° = E
  [E = F + G]   E•° = F•° + G•°
  [E = FG]      E•° = F•G•       if ε ∉ L(F), ε ∉ L(G)
                E•° = F•°G•      if ε ∉ L(F), ε ∈ L(G)
                E•° = F•G•°      if ε ∈ L(F), ε ∉ L(G)
                E•° = F•° + G•°  if ε ∈ L(F), ε ∈ L(G)
  [E = F*]      E•° = F•°
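Lemma 3.7 suggests computing E• and E•° together in one postorder pass, building a constant number of new nodes per visited node. The Python sketch below follows that idea under an assumed tuple encoding of expressions that is not part of the slides; the function name star_normal_form_pass is hypothetical. Subtrees are shared rather than copied, so each node contributes constant work.

```python
# Assumed encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def star_normal_form_pass(e):
    """Return (nullable, E_bullet, E_bullet_circ) for e in a single postorder pass."""
    op = e[0]
    if op == "empty":
        return False, e, ("empty",)
    if op == "eps":
        return True, e, ("empty",)
    if op == "pos":
        return False, e, e
    if op == "*":
        _, _, f_bc = star_normal_form_pass(e[1])
        return True, ("*", f_bc), f_bc           # (F*)• = F•°*,  (F*)•° = F•°
    nf, f_b, f_bc = star_normal_form_pass(e[1])
    ng, g_b, g_bc = star_normal_form_pass(e[2])
    if op == "+":
        return nf or ng, ("+", f_b, g_b), ("+", f_bc, g_bc)
    bullet = (".", f_b, g_b)                     # (FG)• = F•G•
    if not nf and not ng:
        bullet_circ = bullet
    elif not nf:                                 # only G is nullable
        bullet_circ = (".", f_bc, g_b)
    elif not ng:                                 # only F is nullable
        bullet_circ = (".", f_b, g_bc)
    else:                                        # both nullable
        bullet_circ = ("+", f_bc, g_bc)
    return nf and ng, bullet, bullet_circ

a1, b2 = ("pos", "a1"), ("pos", "b2")
E = ("*", (".", ("*", a1), ("*", b2)))           # (a1* b2*)*
_, E_bullet, _ = star_normal_form_pass(E)
print(E_bullet)
```

For (a_1*b_2*)* the pass returns the tuple form of (a_1+b_2)*.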

Example
  (a*b*)*• = (a*b*)•°* by definition.
  Repeated application of Lemma 3.7 yields:
    (a*b*)•°* = (a*•° + b*•°)* = (a•° + b•°)* = (a+b)*
  (a+b)* is the star normal form of (a*b*)*. Both expressions have the same Glushkov NFA.
  [Figure: the common Glushkov NFA, with transitions labeled a and b]

Conclusions so far
  Theorem 3.9
    The Glushkov automaton M_E can be computed from a regular expression E in time linear in size(E) + size(M_E).
  Proof
    E• is computed from E in linear time.
    E• is in star normal form, so M_E can be computed from E in time O(size(E) + size(M_E)).

Deterministic regular expressions
  A regular expression E is deterministic if the corresponding NFA M_E is deterministic.
  Theorem 3.11
    1. It can be decided in linear time whether a regular expression E is deterministic.
    2. If E is deterministic, then the deterministic finite automaton M_E can be computed from E in linear time.

Theorem 3.11: proof
  E is deterministic if and only if E• is (their Glushkov automata are isomorphic), so we can assume that E is in star normal form.
  We compute first(E), last(E), and follow(E, x) for x ∈ pos(E) incrementally, keeping track of follow(E, x) in a |pos(E)| × |Σ| matrix.

  E = (a_1+b_2)*
    pos | a   | b
    1   | a_1 | b_2
    2   | a_1 | b_2
  E is deterministic.

  E = (a_1+b_2)*a_3
    pos | a         | b
    1   | a_1 & a_3 | b_2
    2   |           | b_2
    3   |           |
  E is nondeterministic: the cell for position 1 and symbol a receives two entries.
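The matrix idea can be illustrated with a small Python check that rejects as soon as one state would get two successors for the same symbol. The sketch takes already computed first and follow sets as input (written out by hand for the two examples), so it shows the test itself rather than the incremental linear-time computation; the function name and the string encoding of positions are assumptions.

```python
def is_deterministic(first, follow, chi=lambda x: x[0]):
    """True iff no state of M_E has two successors carrying the same symbol."""
    rows = {"qI": first, **follow}       # one row per state, including the initial state
    for state, targets in rows.items():
        cell = {}                        # symbol -> the single position allowed in that cell
        for y in targets:
            if chi(y) in cell:
                return False             # second entry for the same symbol: stop at once
            cell[chi(y)] = y
    return True

# E = (a1 + b2)*
first_1 = {"a1", "b2"}
follow_1 = {"a1": {"a1", "b2"}, "b2": {"a1", "b2"}}

# E = (a1 + b2)* a3
first_2 = {"a1", "b2", "a3"}
follow_2 = {"a1": {"a1", "b2", "a3"}, "b2": {"a1", "b2", "a3"}, "a3": set()}

print(is_deterministic(first_1, follow_1), is_deterministic(first_2, follow_2))
```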

Ambiguity in automata and expressions
  Unambiguous ε-NFA (definition): for each word w, there is at most one path from the initial state to a final state that spells out w.
  Weakly unambiguous
    Intuition: each word of L(E) has a unique path through E.
    Definition: a regular expression E is weakly unambiguous if and only if the NFA M_E is unambiguous.
  Strongly unambiguous
    Intuition: each word of L(E) can be uniquely decomposed into subwords according to the subexpressions of E.

Strongly unambiguous
  [E = ε or a]  E is strongly unambiguous.
  [E = F + G]   E is strongly unambiguous if F and G are strongly unambiguous and L(F) and L(G) are disjoint.
  [E = FG]      E is strongly unambiguous if F and G are strongly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
  [E = F*]      E is strongly unambiguous if F is strongly unambiguous and the star of L(F) is unambiguous.
  Concatenation: L·L' is unambiguous if v, w ∈ L, v', w' ∈ L', vv' = ww' ⇒ v = w and v' = w'.
  Star: L* is unambiguous if v_1, ..., v_m ∈ L, w_1, ..., w_n ∈ L, m, n ≥ 0, v_1...v_m = w_1...w_n ⇒ m = n and v_i = w_i for 1 ≤ i ≤ m.
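For finite languages, the two unambiguity notions can be tested by brute force, which may help to build intuition. The Python sketch below assumes that languages are given as finite sets of strings; the function names are hypothetical, and the star check only counts the decompositions of one given word.

```python
from collections import Counter
from itertools import product

def concat_unambiguous(lang1, lang2):
    """True iff every word of lang1·lang2 has exactly one split v·w with v in lang1, w in lang2."""
    counts = Counter(v + w for v, w in product(lang1, lang2))
    return all(c == 1 for c in counts.values())

def star_decompositions(language, word):
    """Number of ways to write word as v_1...v_m with every v_i a non-empty word of language."""
    ways = [1] + [0] * len(word)         # ways[i]: decompositions of the prefix of length i
    for end in range(1, len(word) + 1):
        for v in language:
            if v and word[:end].endswith(v):
                ways[end] += ways[end - len(v)]
    return ways[len(word)]

print(concat_unambiguous({"a", "ab"}, {"b", "bb"}))   # abb = a·bb = ab·b
print(star_decompositions({"a", "aa"}, "aa"))         # aa = a·a = aa
```

A star L* is ambiguous as soon as one word has two or more decompositions; if ε ∈ L, every word of L* has infinitely many, which is why the sketch skips the empty word.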

Strongly unambiguous: in terms of automata
  Let M'_E be the ε-NFA recognizing L(E) obtained by any of the standard constructions.
  Lemma 4.5: E is strongly unambiguous if and only if M'_E is unambiguous.
  Lemma 4.6: If E is strongly unambiguous, then E is weakly unambiguous.
  Proof
    Elimination of ε-transitions transforms M'_E into M_E.
    Different paths in M_E spelling out a word w correspond to different paths in M'_E spelling out w.
    Hence unambiguity of M'_E (Lemma 4.5) implies unambiguity of M_E.

Lemma 4.7: weakly unambiguous
  [E = ε or a]  E is weakly unambiguous.
  [E = F + G]   E is weakly unambiguous if and only if F and G are weakly unambiguous and at most ε is in both L(F) and L(G), that is, L(F) ∩ L(G) ⊆ {ε}.
  [E = FG]      E is weakly unambiguous if and only if F and G are weakly unambiguous and the concatenation of L(F) and L(G) is unambiguous.
  [E = F*]      Let follow(F, last(F)) ∩ first(F) = ∅ and ε ∉ L(F). Then E is weakly unambiguous if and only if F is weakly unambiguous and the star of L(F) is unambiguous.

Lemma 4.7: proof
  [E = F + G]
    Since Glushkov automata have no ε-transitions, the only path denoting the empty word is the empty path. Furthermore, any path through F or through G is also a path through E, and any non-empty path through F is different from any path through G.
  [E = FG]
    Assume that E is weakly unambiguous. Since L(F) ≠ ∅ ≠ L(G), each path through F or G can be completed to a path through E. Thus, F and G are weakly unambiguous.
    Two decompositions vw = v'w' of a word in L(F)L(G), with v, v' ∈ L(F) and w, w' ∈ L(G), correspond to paths x_1...x_m y_1...y_n and x'_1...x'_m' y'_1...y'_n' of E, where the x positions belong to F and the y positions to G. Since E is weakly unambiguous, the two paths through E are identical. Since the positions of F and G are disjoint, we have m = m' and n = n', i.e. v = v' and w = w'. Thus, the concatenation of L(F) and L(G) is unambiguous.
  [E = F*]
    Since ε ∉ L(F), the empty word is uniquely decomposed into a sequence of words in L(F). Any non-empty path through M_E is determined by a sequence of positions x_1, ..., x_n, n ≥ 1, which consists of a sequence of paths through M_F. Because follow(F, last(F)) ∩ first(F) = ∅, the starting positions of those paths are uniquely determined. Hence, if E is weakly unambiguous, then the star of L(F) is unambiguous. The other direction is obvious.

Epsilon normal form
  Epsilon normal form condition: no subexpression of E denotes the empty word ambiguously.
  [E = ε or a]  E is in epsilon normal form.
  [E = F + G]   E is in epsilon normal form if F and G are in epsilon normal form and ε ∉ L(F) ∩ L(G).
  [E = FG]      E is in epsilon normal form if F and G are in epsilon normal form.
  [E = F*]      E is in epsilon normal form if F is in epsilon normal form and ε ∉ L(F).
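The inductive definition yields a test that visits every node once. The Python sketch below assumes a tuple encoding of expressions that is not part of the slides; it also treats ∅ as being in epsilon normal form, which is an assumption of this sketch. The function names are hypothetical.

```python
# Assumed encoding (not from the slides): expressions are nested tuples
#   ("empty",)  ("eps",)  ("pos", "a1")  ("+", F, G)  (".", F, G)  ("*", F)

def enf_pass(e):
    """Return (nullable, in_epsilon_normal_form) for e in one postorder pass."""
    op = e[0]
    if op == "empty":
        return False, True               # assumption: the empty set counts as a base case
    if op == "eps":
        return True, True
    if op == "pos":
        return False, True
    if op == "*":
        n, ok = enf_pass(e[1])
        return True, ok and not n        # the starred subexpression must not be nullable
    nf, okf = enf_pass(e[1])
    ng, okg = enf_pass(e[2])
    if op == "+":
        return nf or ng, okf and okg and not (nf and ng)
    return nf and ng, okf and okg        # concatenation

def in_epsilon_normal_form(e):
    return enf_pass(e)[1]

a, b = ("pos", "a1"), ("pos", "b2")
print(in_epsilon_normal_form(("*", (".", ("*", a), ("*", b)))))   # (a1* b2*)*
print(in_epsilon_normal_form(("*", ("+", a, b))))                 # (a1 + b2)*
```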

Strongly unambiguous expressions are in star and in epsilon normal form
  Lemma 4.10
    If E* is strongly unambiguous, then follow(E, last(E)) ∩ first(E) = ∅.
  Proof
    Assume that there exist x ∈ last(E), y ∈ follow(E, x) ∩ first(E), and z ∈ last(E). Then x is a final state in M_E (and so is z).
    x_1...x_n x y y_1...y_m z is a path through M_E.
    But this path is also the composition of two paths through M_E, namely x_1...x_n x and y y_1...y_m z.
    This makes L(E)* ambiguous.

Theorem 4.9
  E is strongly unambiguous if and only if
    1. E is weakly unambiguous,
    2. E is in star normal form,
    3. E is in epsilon normal form.
  Proof
    For expressions in star and epsilon normal form, weak and strong unambiguity are identical (using Lemma 4.7).
    Strongly unambiguous expressions are in star and in epsilon normal form (Lemma 4.10).

Test for weak unambiguity in quadratic time
  Theorem 4.11
    Regular expressions in epsilon normal form can be tested for weak unambiguity in quadratic time.
  Proof
    Let E be in epsilon normal form.
    E can be transformed into star normal form E• in linear time without changing the Glushkov automaton.
    E• is also in epsilon normal form.
    E is weakly unambiguous if and only if E• is weakly unambiguous, if and only if E• is strongly unambiguous.
    Strong unambiguity of expressions can be decided in quadratic time.

Open problems
  It is easy to see that a regular expression can be tested for epsilon normal form in linear time.
  Can a given regular expression be transformed into epsilon normal form in linear time?
    Our transformation into star normal form can deal with starred subexpressions. Hence, the crucial point is how expressions E = F + G with ε ∈ L(F) ∩ L(G) can be handled.
    A straightforward approach would eliminate the empty string either from L(F) or from L(G). This opens up another question:
  Is there a linear-time algorithm transforming a regular expression E into an expression E' with L(E') = L(E) \ {ε}?

The End
