languages & relations regular expressions finite-state networks Lauri Karttunen XRCE
overview
regular expressions and networks: simple examples
common regular expression operators: negation, union, intersection, composition
symbols: interpretation of ?
Xerox operators: contains, restriction, replacement
A REGULAR EXPRESSION (a:b) denotes a LANGUAGE/RELATION ({<“a”, “b”>}) and compiles into a FINITE-STATE NETWORK (a single arc labeled a:b), which encodes that same language or relation.
simple expressions and networks
Regular expression and what its network encodes:
0        the empty-string language
?*       the universal language (the universal identity relation)
~[?*]    the empty language
common regular expression operators
(juxtaposition)   concatenation
* +               iteration
|                 union
&                 intersection (*)
~ \ -             complementation (*), term negation, minus (*)
.x.               crossproduct
.o.               composition
(*) Not applicable to regular relations, because the result may not be encodable by a finite-state network.
Xerox extensions
$         containment
=>        restriction
-> @->    replacement
These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
\ term negation
\A is equivalent to [? - A].
[Figure: the network for \a has a single arc labeled ?, with sigma alphabet {?, a}.]
The interpretation of ? in a network is determined by the sigma alphabet of the network.
~ complementation
~A is equivalent to [?* - A].
[Figure: the network for a in three stages: original, completed, negated.]
union
a | b
[Figure: the networks for a and b joined under a new start state, then the result after epsilon removal and minimization.]
intersection
A = [a | b] [c | d]
B = c => b _
Each state in the intersection A & B corresponds to a pair of states in the original networks: <0,0>, <1,0>, <1,1>, <2,0>.
[Figure: the networks for A, B, and A & B.]
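The pairing of states can be made concrete with a small product construction. This is only a sketch: the dictionaries `A` and `B`, the four-letter `ALPHABET`, and the function names are illustrative assumptions, and the unknown-symbol arc ? of the network for B is expanded by hand into explicit arcs for a and d.

```python
from itertools import product

ALPHABET = "abcd"  # assumption: a closed alphabet instead of the open symbol ?

# A = [a | b] [c | d]: 0 --a|b--> 1 --c|d--> 2 (final)
A = {"start": 0, "finals": {2},
     "delta": {(0, "a"): 1, (0, "b"): 1, (1, "c"): 2, (1, "d"): 2}}

# B = c => b _ : state 1 means "the previous symbol was b"; c is allowed only there.
B = {"start": 0, "finals": {0, 1},
     "delta": {(0, "a"): 0, (0, "b"): 1, (0, "d"): 0,
               (1, "a"): 0, (1, "b"): 1, (1, "c"): 0, (1, "d"): 0}}

def intersect(m1, m2):
    """Product construction that explores only the reachable state pairs."""
    start = (m1["start"], m2["start"])
    delta, seen, todo = {}, {start}, [start]
    while todo:
        p, q = todo.pop()
        for sym in ALPHABET:
            if (p, sym) in m1["delta"] and (q, sym) in m2["delta"]:
                nxt = (m1["delta"][(p, sym)], m2["delta"][(q, sym)])
                delta[((p, q), sym)] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    finals = {s for s in seen if s[0] in m1["finals"] and s[1] in m2["finals"]}
    return {"start": start, "finals": finals, "delta": delta, "states": seen}

def accepts(machine, word):
    """Run a deterministic machine over word; a missing arc means rejection."""
    state = machine["start"]
    for sym in word:
        if (state, sym) not in machine["delta"]:
            return False
        state = machine["delta"][(state, sym)]
    return state in machine["finals"]

AB = intersect(A, B)
accepted = sorted("".join(w) for w in product(ALPHABET, repeat=2)
                  if accepts(AB, "".join(w)))
print(sorted(AB["states"]))  # the reachable state pairs
print(accepted)              # the two-symbol strings in A & B
```

The reachable pairs are the same four states listed on the slide, and the string ac drops out of the language because its c is not preceded by b.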
composition
A = [a | b] [c | d]
B = a d -> d a
A .o. B is computed in a way similar to intersection, by matching the lower side of A with the upper side of B. The states of the result are again pairs of states: <0,0>, <1,0>, <1,1>, <1,2>, <2,0>.
[Figure: the networks for A, B, and A .o. B.]
ordinary symbols
single-character symbols: a, b, c, …
multicharacter symbols: [Verb], [Sg], +Sg, +Verb, ^Fin, @U.H.b@, …
Be careful to distinguish between them! The single multicharacter symbol ab is not the same as the concatenation a b.
In general, it’s a bad idea to have symbols like ab.
special symbols in regular expressions
0 (EPSILON) represents the empty string.
? (ANY) represents any known symbol and any unknown single-character symbol. Because the set of possible symbols is open-ended, ? denotes an infinite language.
symbols vs. symbol pairs
In general, no distinction is made between
a      the language {“a”}
a:a    the identity relation {<“a”, “a”>}
but we have to make a distinction between
?      a language or identity relation
?:?    any mapping between any two symbols
The term label is a common name for symbols and symbol pairs.
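more about ?
In regular expressions such as a:? and ?:?, ? represents any symbol whatsoever.
In networks, ? represents any unknown symbol.
[Figure: the network for a:? has the arcs a:? and a; the network for ?:? has the arcs ?:? and ?.]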
why is ? so complicated?
Consider the concatenation of [a] and [?] ...
[Figure: in the network for [a ?], the first arc is labeled a and the second position has two arcs, ? and a.]
… vs. the concatenation of [a] and [\a].
[Figure: the network for [a \a] has an arc a followed by an arc ?; sigma: {?, a}.]
The interpretation of ? in a network is determined by the sigma alphabet of the network.
… vs. the union [a | \a].
redundant result: two arcs, a and ?, with sigma {?, a}
best result, after xfst[1]> compact sigma: a single arc ?
some equivalencies [0 a] is equivalent to a.
Xerox extensions (abbreviations)
$         containment
=>        restriction
-> @->    replacement
Containment operator $
$a
Equivalent expression: [?* a ?*]
Restriction operator =>
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
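A quick way to convince oneself that the two formulations agree is to test both on many strings. The sketch below is an illustration in Python, not xfst: `direct_check` follows the English paraphrase, `via_complements` mirrors the two negated conjuncts of the equivalent expression with lookaround assertions, and both function names are made up for this example.

```python
import re
from itertools import product

def direct_check(s):
    """Every a must be preceded by b and followed by c."""
    for i, ch in enumerate(s):
        if ch == "a":
            if i == 0 or s[i - 1] != "b":
                return False
            if i + 1 >= len(s) or s[i + 1] != "c":
                return False
    return True

def via_complements(s):
    """~[~[?* b] a ?*] & ~[?* a ~[c ?*]], read as two 'no bad a' tests."""
    bad_left = re.search(r"(?<!b)a", s) is not None   # an a not preceded by b
    bad_right = re.search(r"a(?!c)", s) is not None   # an a not followed by c
    return not bad_left and not bad_right

# Compare the two on every string over a small test alphabet, up to length 4.
agree = all(direct_check("".join(w)) == via_complements("".join(w))
            for n in range(5) for w in product("abcx", repeat=n))
print(agree, direct_check("xbacx"), direct_check("ba"))
```

Strings without any a, including the empty string, satisfy the restriction vacuously.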
Replacement operator ->
a b -> b a
“Replace ‘ab’ by ‘ba’.”
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
[Figure: the network for a b -> b a, with arcs a:b, b:a, a, b, and ?.]
Marking (a kind of double insertion)
a|e|i|o|u -> %[ ... %]
p o t a t o
p[o]t[a]t[o]
[Figure: the network inserts 0:[ before and 0:] after each of a, e, i, o, u; every other symbol passes through on ?.]
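For readers more used to string substitution than to transducers, the effect of the marking rule on a single input can be imitated as follows. This is a sketch of the downward mapping only; `mark_vowels` is a hypothetical helper name.

```python
import re

def mark_vowels(word):
    """Wrap each vowel in square brackets, leaving everything else unchanged."""
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", word)

print(mark_vowels("potato"))
```

Unlike this function, the compiled network is a relation and can also be applied upward to strip the brackets again.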
observations
Transducers derived from -> expressions are generally meant to be applied in a given direction. For example, a b -> b a yields a transducer suitable for downward application:
apply down> ab
ba
apply up> ba
ab
ba
The input language (upper in this case) is the universal language; that is, the transducer never fails on any input in this direction. If the rule does not match, the input is mapped without change to the output.
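The asymmetry between the two directions can be reproduced with a brute-force sketch. It assumes a two-letter alphabet and relies on the fact that this particular rule preserves string length; `apply_down` and `apply_up` are illustrative Python names, not the xfst commands themselves.

```python
from itertools import product

def apply_down(s):
    """Replace every 'ab' by 'ba'. Occurrences of 'ab' cannot overlap,
    so a single left-to-right pass gives the only possible output."""
    return s.replace("ab", "ba")

def apply_up(target, alphabet="ab"):
    """Find every input of the same length that maps down to target."""
    candidates = ("".join(t) for t in product(alphabet, repeat=len(target)))
    return sorted(c for c in candidates if apply_down(c) == target)

print(apply_down("ab"))
print(apply_up("ba"))
```

Downward application always returns exactly one string, while upward application of the same relation to ba finds two sources: ab, which was rewritten, and ba, which passed through unchanged.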
variants
Optional replacement: (->)
Inverse replacement (more upward-oriented): <-
Parallel replacement: a -> b, b -> a
Directed replacement (conceptually left-to-right): @-> @>
Conditional replacement (context-dependent)
multiple results
a b | b | b a | a b a -> x
applied in a downward direction to the string “aba” yields four results, one for each of the four “factorizations” of the input string:
a [b] a      gives  a x a
a [b a]      gives  a x
[a b] a      gives  x a
[a b a]      gives  x
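The four results can be enumerated mechanically. The sketch below assumes the obligatory reading of ->: matched spans may not overlap, and no unreplaced stretch of the input may still contain a match. The function name `outputs` and its parameters are hypothetical.

```python
def outputs(s, patterns, repl="x"):
    """Enumerate every output of the obligatory rule patterns -> repl on s."""
    results = set()

    def has_match(text):
        return any(p in text for p in patterns)

    def walk(pos, gap, out):
        # gap is the unreplaced text collected since the last replacement
        if pos == len(s):
            if not has_match(gap):
                results.add(out + gap)
            return
        # Option 1: leave s[pos] unreplaced for now.
        walk(pos + 1, gap + s[pos], out)
        # Option 2: replace some pattern that starts here, provided the
        # unreplaced stretch before it is free of matches.
        if not has_match(gap):
            for p in patterns:
                if s.startswith(p, pos):
                    walk(pos + len(p), "", out + gap + repl)

    walk(0, "", "")
    return results

print(sorted(outputs("aba", ["ab", "b", "ba", "aba"])))
```

Leaving the whole input untouched is not among the results, because the unreplaced string aba itself contains matches.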
directed replace operators
Guarantee a unique result by constraining the factorization of the input string by
direction of the match (rightward or leftward)
length (longest or shortest)
N.B. Such rules seem to work algorithmically, but like all regular expressions, they compile into finite-state networks.
@-> left-to-right, longest match replacement
a b | b | b a | a b a @-> x
applied to “aba”, effectively prefers the longest match, as if the matching were being done left-to-right. Of the four candidate results (a x a, a x, x a, x), only x is produced.
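The "as if" algorithm can be spelled out directly. This is a procedural imitation of the constraint, written under the assumption that matching proceeds from the left and always takes the longest pattern available at the current position; the real operator compiles into a network instead. The function name is made up for the example.

```python
def longest_match_replace(s, patterns, repl="x"):
    """Scan left to right; at each position replace the longest matching
    pattern, or copy one symbol if nothing matches."""
    out, pos = [], 0
    while pos < len(s):
        matches = [p for p in patterns if s.startswith(p, pos)]
        if matches:
            out.append(repl)
            pos += len(max(matches, key=len))
        else:
            out.append(s[pos])
            pos += 1
    return "".join(out)

print(longest_match_replace("aba", ["ab", "b", "ba", "aba"]))
```

Because the choice at every position is forced, the function returns a single string where the unconstrained rule returned four.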
conditional replacement
A -> B (replacement) in L _ R (context)
The relation that replaces A by B between L and R, leaving everything else unchanged.
Sources of complexity:
Replacements and contexts may overlap.
There are alternative ways of interpreting “between left and right.”
|| operator: both contexts on the input side
A -> B || L _ R
a b a b a b a .o. [a b -> x || a b _ a] yields
a b a b a b a
a b x x a
In practice, this is the most-used type of Replace Rule.
// operator: L on the output side, R on the input side
A -> B // L _ R
a b a b a b a .o. [a b -> x // a b _ a] yields
a b a b a b a
a b x a b a
// rules can be useful for handling vowel harmony.
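The difference between || and // comes down to which string the left context is checked against. The following sketch imitates both with a left-to-right scan; the function `conditional_replace` and its `left_on_output` flag are assumptions made for illustration, and the scan is a simplification of the relational definition.

```python
def conditional_replace(s, target, repl, left, right, left_on_output):
    """Replace target by repl when preceded by left and followed by right.
    The right context is always read from the input; the left context is
    read from the output built so far (//) or from the input (||)."""
    out, pos = "", 0
    while pos < len(s):
        left_side = out if left_on_output else s[:pos]
        if (s.startswith(target, pos)
                and left_side.endswith(left)
                and s.startswith(right, pos + len(target))):
            out += repl
            pos += len(target)
        else:
            out += s[pos]
            pos += 1
    return out

double_bar = conditional_replace("abababa", "ab", "x", "ab", "a", left_on_output=False)
double_slash = conditional_replace("abababa", "ab", "x", "ab", "a", left_on_output=True)
print(double_bar, double_slash)
```

With //, the first replacement destroys the left context that the second candidate would have needed, so only one ab is rewritten.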
Two languages with vowel shortening
Slovak: V: -> V || V: C* _ (left context on the input side)
v o l + a: v + a: m e:
v o l + a: v + a m e
“we call often”
Gidabal: V: -> V // V: C* _ (left context on the output side)
g u n u: m + b a: + d a: ng + b e: +
g u n u: m + b a + d a: ng + b e +
“is certainly right on the stump”
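Both shortening patterns can be reproduced with one small function that differs only in where it looks for the preceding long vowel. This is a sketch under stated assumptions: words are lists of symbols, a long vowel is a symbol ending in a colon, and every non-vowel symbol (including + and ng) is treated as part of C*. The name `shorten` is hypothetical.

```python
VOWELS = set("aeiou")

def shorten(symbols, left_on_output):
    """Shorten a long vowel if the nearest preceding vowel is long.
    The preceding vowel is looked up in the output built so far
    (left_on_output=True, like //) or in the input (False, like ||)."""
    out = []
    for i, sym in enumerate(symbols):
        if sym.endswith(":"):
            side = out if left_on_output else symbols[:i]
            j = len(side) - 1
            while j >= 0 and side[j][0] not in VOWELS:
                j -= 1                      # skip consonants and boundaries
            if j >= 0 and side[j].endswith(":"):
                sym = sym[:-1]              # drop the length mark
        out.append(sym)
    return out

slovak = "v o l + a: v + a: m e:".split()
gidabal = "g u n u: m + b a: + d a: ng + b e: +".split()
print(" ".join(shorten(slovak, left_on_output=False)))
print(" ".join(shorten(gidabal, left_on_output=True)))
```

In the Slovak form both later vowels shorten, since each follows a vowel that is long in the input. In the Gidabal form shortening alternates, since a vowel that has just been shortened no longer triggers the rule.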
syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
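A backtracking regular expression can approximate this rule for a single word. The sketch assumes a consonant set `C` that fills in the elided definition with the remaining letters of the Latin alphabet, which the slide does not spell out. Greedy matching in Python's `re` is not longest-match in general, although it gives the same answer for this pattern and this input.

```python
import re

V = "aeiou"
C = "bcdfghjklmnpqrstvwxyz"  # assumption: the slide elides the full list

# [C* V+ C*] followed by a lookahead for [C V], mirroring "|| _ [C V]"
SYLLABLE = re.compile(rf"([{C}]*[{V}]+[{C}]*)(?=[{C}][{V}])")

def syllabify(word):
    """Insert a hyphen after each matched C* V+ C* stretch."""
    return SYLLABLE.sub(r"\1-", word)

print(syllabify("strukturalismi"))
```

The lookahead plays the role of the right context: it constrains where a match may end without consuming the next syllable's onset.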
Syntactic “chunking”: recognizing dates
Best result:
Today is [Wednesday, March 14, 2000].
Bad results:
Today is Wednesday, [March 14, 2000].
Today is [Wednesday, March 14], 2000.
Today is Wednesday, [March 14], 2000.
Today is [Wednesday], March 14, 2000.
Need left-to-right, longest-match constraints.
Defining the language of dates
Day = Monday | Tuesday | ... | Sunday
Month = January | February | ... | December
Date = 1 | 2 | 3 | ... | 3 1
Year = 1To9 (%0To9 (%0To9 (%0To9))) (from 1 to 9999)
AllDates = Day | (Day “, ”) Month “ ” Date (“, ” Year)
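The size of this language follows directly from the definitions, since each optional part multiplies the number of choices. The short calculation below uses illustrative variable names and counts a bare weekday name separately from the forms that contain a month.

```python
days, months, dates, years = 7, 12, 31, 9999

# (Day ", ") is optional: 7 weekdays or nothing. (", " Year) is optional too.
with_month = (days + 1) * months * dates * (years + 1)

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
all_dates = days + with_month
print(all_dates)
```

The total agrees with the figure of 29 760 007 date expressions quoted for the AllDates network.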
All dates from 1.1.1 to 31.12.9999
[Figure: the AllDates network, with arcs labeled by weekday names, month names, and digits.]
13 states, 96 arcs, 29 760 007 date expressions
Parser for dates
AllDates @-> “[DE ” ... “]”
Compiles into an unambiguous transducer (23 states, 332 arcs).
Today is [DE Wednesday, March 15, 2000] because yesterday was [DE Tuesday] and it was [DE March 14] so tomorrow must be [DE Thursday, March 16] and not [DE March 15] as it says on the program.
How would we write a similar rule to do XML markup?
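The bracketing behavior can be imitated with an ordinary regular expression, as a sketch of what the transducer does to one input. The names `ALL_DATES` and `mark_dates` are illustrative. Python tries alternatives in order rather than picking the longest, so the alternatives are deliberately ordered to put the longer forms first.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "3[01]|[12][0-9]|[1-9]"   # two-digit forms first, so 15 is not read as 1
YEAR = "[1-9][0-9]{0,3}"         # 1 to 9999

ALL_DATES = re.compile(
    rf"(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, (?:{YEAR}))?|(?:{DAY})"
)

def mark_dates(text):
    """Wrap each date expression in [DE ...], scanning left to right."""
    return ALL_DATES.sub(lambda m: "[DE " + m.group(0) + "]", text)

sentence = ("Today is Wednesday, March 15, 2000 because yesterday was Tuesday "
            "and it was March 14 so tomorrow must be Thursday, March 16 "
            "and not March 15 as it says on the program.")
print(mark_dates(sentence))
```

Changing the two strings in the lambda to an opening and a closing tag is one way to approach the XML question in this imitation.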
Problem of reference
Valid:
Wednesday, March 15, 2000
Tuesday, February 29, 2000
Monday, September 16, 1996
Invalid:
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, September 16, 1996
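The three ways a date expression can fail (too many days for the month, February 29 in a non-leap year, wrong weekday) can be checked against a calendar library. This sketch uses Python's `datetime.date`, which assumes the proleptic Gregorian calendar for all years from 1 to 9999; whether the slides make the same assumption for early years is not stated. The function name `is_valid` is hypothetical.

```python
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# date.weekday() numbers the days from Monday == 0
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
        "Saturday", "Sunday"]

def is_valid(weekday, month, day, year):
    """True if the date exists and falls on the stated weekday."""
    try:
        d = date(year, MONTHS.index(month) + 1, day)
    except ValueError:      # April 31, or February 29 in a non-leap year
        return False
    return DAYS[d.weekday()] == weekday

print(is_valid("Wednesday", "March", 15, 2000))
print(is_valid("Thursday", "February", 29, 1900))
```

The finite-state approach encodes the same knowledge as constraints on strings, so no calendar arithmetic is needed at parse time.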
refinement by intersection
AllDates is intersected with three constraints to give ValidDates:
MaxDaysInMonth: 31 => Jan | Mar … _ and 30 => Apr | Jun … _
LeapYears: Feb 29 => _ ...
WeekdayDate
defining valid dates
AllDates & MaxDaysInMonth & LeapYears & WeekdayDates = ValidDates
AllDates: 13 states, 96 arcs, 29 760 007 date expressions
ValidDates: 805 states, 6472 arcs, 7 307 053 date expressions
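The count of valid date expressions can be reproduced arithmetically, which also shows what the constraints must allow. The breakdown below is an inference rather than something the slides state: it assumes that expressions without a year are constrained only by the month lengths (so February 29 and any weekday are accepted), and that a year fixes the weekday under the proleptic Gregorian calendar.

```python
import calendar

# One valid (month, date, year) combination per calendar day in years 1..9999.
days_in_all_years = sum(366 if calendar.isleap(y) else 365
                        for y in range(1, 10000))

with_year = 2 * days_in_all_years   # with the one correct weekday, or without
month_date_only = 366               # assumption: Feb 29 is fine without a year
weekday_month_date = 7 * 366        # assumption: any weekday without a year
weekday_only = 7

valid_dates = with_year + month_date_only + weekday_month_date + weekday_only
print(valid_dates)
```

Under these assumptions the total matches the 7 307 053 expressions reported for ValidDates.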
Parser for valid and invalid dates
[AllDates - ValidDates] @-> “[DE ” ... “]” , ValidDates @-> “[DT ” ... “]”
2688 states, 20439 arcs
Today is [DT Wednesday, March 15, 2000], not [DE Tuesday, March 17, 2000].
DT marks a valid date, DE an invalid date.
observations For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar. Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them. This is a fundamental advantage over higher-level formalisms.