languages & relations regular expressions finite-state networks

Lauri Karttunen XRCE

overview regular expressions and networks
Simple examples
Common regular expression operators: negation, union, intersection, composition
Symbols
Interpretation of ?
Xerox operators: contains, restriction, replacement

The REGULAR EXPRESSION a:b denotes the LANGUAGE/RELATION {<“a”, “b”>}.
The REGULAR EXPRESSION a:b compiles into a FINITE-STATE NETWORK with the single arc a:b.
The FINITE-STATE NETWORK encodes the LANGUAGE/RELATION.

simple expressions and networks
Regular expression    Denotes
0                     the empty string language
?*                    the universal language (the universal identity relation)
~[?*]                 the empty language
[Figure: the finite-state network for each expression.]

common regular expression operators
A B        concatenation
* +        iteration
|          union
&          intersection*
~ \ -      complementation*, minus*
.x.        crossproduct
.o.        composition
* = not applicable to regular relations because the result may not be encodable by a finite-state network.

Xerox extensions
$     containment
=>    restriction
->    replacement
These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

\ term negation
\A is equivalent to [? - A].
The interpretation of ? in a network is determined by the sigma alphabet of the network.
[Figure: networks for a and \a; the \a network has Sigma: ? a.]

~ complementation
~A is equivalent to [?* - A].
[Figure: the network for a in three stages: original, completed, negated.]

union
a | b
[Figure: the networks for a and b joined by a new start state, and the result after epsilon removal and minimization.]

intersection
A = [a | b] [c | d]
B = c => b _
C = A & B
Each state in the intersection corresponds to a pair of states in the original networks.
[Figure: networks A, B, and C; the states of C are labeled <0,0>, <1,0>, <1,1>, <2,0>.]
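The pairing of states can be sketched in Python as a product construction. This is only an illustration, not the xfst algorithm: the transition tables for A = [a | b] [c | d] and B = c => b _ are hand-written assumptions, and B is limited to the alphabet {a, b, c, d} instead of using ?.

```python
from collections import deque

# Hand-built deterministic networks over {a, b, c, d} (assumed encodings).
# A = [a | b] [c | d]
A = {"start": 0, "final": {2},
     "arcs": {(0, "a"): 1, (0, "b"): 1, (1, "c"): 2, (1, "d"): 2}}
# B = c => b _   (state 1 means "the previous symbol was b"; c is allowed only there)
B = {"start": 0, "final": {0, 1},
     "arcs": {(0, "a"): 0, (0, "d"): 0, (0, "b"): 1,
              (1, "a"): 0, (1, "d"): 0, (1, "b"): 1, (1, "c"): 0}}

def intersect(m1, m2):
    """Build the reachable part of the product network; states are pairs."""
    start = (m1["start"], m2["start"])
    arcs, final, seen, queue = {}, set(), {start}, deque([start])
    while queue:
        p, q = queue.popleft()
        if p in m1["final"] and q in m2["final"]:
            final.add((p, q))
        for (src, sym), dst1 in m1["arcs"].items():
            # follow a symbol only if both networks have an arc for it
            if src != p or (q, sym) not in m2["arcs"]:
                continue
            target = (dst1, m2["arcs"][(q, sym)])
            arcs[((p, q), sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return {"start": start, "final": final, "arcs": arcs, "states": seen}

def accepts(m, s):
    state = m["start"]
    for sym in s:
        if (state, sym) not in m["arcs"]:
            return False
        state = m["arcs"][(state, sym)]
    return state in m["final"]

C = intersect(A, B)
print(sorted(C["states"]))
print([w for w in ["ac", "ad", "bc", "bd"] if accepts(C, w)])
```

Only the reachable pairs are built, which is why C has four states rather than the full 3 x 2 product.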

composition
A = [a | b] [c | d]
B = a d -> d a
C = A .o. B
Similar to intersection: the lower side of A is matched with the upper side of B.
[Figure: networks A, B, and C; the states of C are labeled <0,0>, <1,0>, <1,1>, <1,2>, <2,0>.]

ordinary symbols
single character symbols: a, b, c, …
multicharacter symbols: [Verb], [Sg], +Sg, +Verb, ^Fin, …
Be careful to distinguish between them! The single symbol ab is not the same as the two-symbol sequence a b.
In general, it’s a bad idea to have symbols like ab.
[Figure: a network with one arc labeled ab next to a network with two arcs, a and b.]

special symbols in regular expressions
0 (EPSILON) represents the empty string.
? (ANY) represents any known symbol and any unknown single character symbol.
? denotes an infinite language.

symbols vs. symbol pairs
In general, no distinction is made between
a      the language {“a”}
a:a    the identity relation {<“a”, “a”>}
but we have to make a distinction between
?      a language or identity relation
?:?    any mapping between any two symbols
The term label is a common name for symbols and symbol pairs.

more about ?
In regular expressions, ? represents any symbol whatsoever.
In networks, ? represents any unknown symbol.
[Figure: the expressions a:? and ?:? and the networks they compile into, with arcs a:? a and ?:? ? respectively.]

why is ? so complicated?
Consider the concatenation of [a] and [?] ...
[Figure: the networks for [a] and [?], and the network for their concatenation.]

… vs. the concatenation of [a] and [\a].
The interpretation of ? in a network is determined by the sigma alphabet of the network.
[Figure: the networks for [a] and [\a] (Sigma: ? a), and the network for their concatenation.]

… vs. the union [a | \a].
Redundant result: a network with arcs a and ?, Sigma: ? a
xfst[1]> compact sigma
Best result: a network with the single arc ?

some equivalencies [0 a] is equivalent to a.

Xerox extensions (abbreviations)
$     containment
=>    restriction
->    replacement
These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment operator $
$a
Equivalent expression: [?* a ?*]

Restriction operator =>
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
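The equivalence for a => b _ c can be checked by brute force on short strings. The sketch below assumes a three-letter alphabet and a length bound N, both chosen only for the check, and models each language as a finite Python set. Truncating at N is safe here because every piece of a string no longer than N is itself no longer than N.

```python
from itertools import product

SIGMA = "abc"   # assumed alphabet for the check
N = 4           # every string up to this length is tested

# U stands in for ?* restricted to length <= N
U = {"".join(p) for n in range(N + 1) for p in product(SIGMA, repeat=n)}

def comp(lang):
    """~L, restricted to strings of length <= N."""
    return U - lang

def cat(*langs):
    """Concatenation, keeping only strings of length <= N."""
    result = {""}
    for lang in langs:
        result = {x + y for x in result for y in lang if len(x + y) <= N}
    return result

ends_in_b = {s for s in U if s.endswith("b")}          # [?* b]
starts_with_c = {s for s in U if s.startswith("c")}    # [c ?*]

# ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
equivalent = (comp(cat(comp(ends_in_b), {"a"}, U))
              & comp(cat(U, {"a"}, comp(starts_with_c))))

def restricted(s):
    """Direct reading: every a is preceded by b and followed by c."""
    return all(s[i - 1:i] == "b" and s[i + 1:i + 2] == "c"
               for i, ch in enumerate(s) if ch == "a")

direct = {s for s in U if restricted(s)}
print(equivalent == direct)
```

The two conjuncts mirror the two halves of the paraphrase: the first rules out an a with a bad left side, the second an a with a bad right side.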

Replacement operator ->
a b -> b a
“Replace ‘ab’ by ‘ba’.”
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
[Figure: the network for the rule, with arcs a:b, b:a, a, b, and ?.]

Marking (a kind of double insertion)
a|e|i|o|u -> %[ ... %]
p o t a t o
p[o]t[a]t[o]
[Figure: the network for the rule, with arcs 0:[, 0:], a, e, i, o, u, and ?.]
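For comparison, the same marking can be written with Python's `re` module. This is an analogy rather than a translation of the network: the function name is made up for the example, and `\g<0>` (the matched text) plays the role of the `...` in the rule.

```python
import re

def mark_vowels(word):
    # wrap every vowel in brackets; \g<0> re-inserts the matched vowel
    return re.sub(r"[aeiou]", r"[\g<0>]", word)

print(mark_vowels("potato"))  # p[o]t[a]t[o]
```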

observations
Transducers derived from -> expressions are generally meant to be applied in a given direction. For example, a b -> b a yields a transducer suitable for downward application:
apply down> ab
ba
apply up> ba
ab
ba
The input language (upper in this case) is the universal language; that is, the transducer never fails on any input in this direction. If the rule does not match, the input is mapped without change to the output.
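The asymmetry between the two directions of a b -> b a can be imitated with plain string functions. This is a rough sketch, not how xfst applies a network: `apply_down` and `apply_up` are names chosen for the example, and `apply_up` searches only over an assumed two-letter alphabet.

```python
from itertools import product

def apply_down(upper):
    # "ab" cannot overlap itself, so replacing every occurrence
    # left to right gives the single output of the rule
    return upper.replace("ab", "ba")

def apply_up(lower, alphabet="ab"):
    # brute-force inverse: the rule preserves length, so only
    # same-length candidates over the assumed alphabet are tried
    candidates = ("".join(p) for p in product(alphabet, repeat=len(lower)))
    return sorted(u for u in candidates if apply_down(u) == lower)

print(apply_down("ab"))   # ba
print(apply_up("ba"))     # ['ab', 'ba']
```

Downward application is a function on any input, while upward application returns every upper string that the rule maps to the given lower string.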

variants
Optional replacement: (->)
Inverse replacement (more upward-oriented): <-
Parallel replacement: a -> b, b -> a
Directed replacement (conceptually left-to-right): @-> @>
Conditional replacement (context-dependent)

multiple results
a b | b | b a | a b a -> x
applied in a downward direction to the string “aba” yields four results, one for each of the four “factorizations” of the input string:
a [b] a      a x a
a [b a]      a x
[a b] a      x a
[a b a]      x
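The four results can be reproduced by enumerating factorizations directly. The sketch follows the shape of the equivalent expression for ->: replaced pieces alternate with unreplaced stretches that must not contain any instance of the pattern. The function name and the recursive search are illustrative assumptions, not the compiled network.

```python
def replace_all(s, patterns, repl):
    """All outputs of an unconditional, undirected replacement."""
    def free(gap):
        # an unreplaced stretch may not contain any pattern instance
        return not any(p in gap for p in patterns)

    results = set()

    def go(i, acc):
        # option 1: the rest of the string is the final unreplaced stretch
        if free(s[i:]):
            results.add(acc + s[i:])
        # option 2: an unreplaced stretch s[i:j], then a replaced match at j
        for j in range(i, len(s)):
            if not free(s[i:j]):
                break
            for p in patterns:
                if s.startswith(p, j):
                    go(j + len(p), acc + s[i:j] + repl)

    go(0, "")
    return results

print(sorted(replace_all("aba", ["ab", "b", "ba", "aba"], "x")))
```

Because no unreplaced stretch may contain b, the b must fall inside a replaced piece, and there are exactly four ways to choose that piece.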

directed replace operators
Guarantee a unique result by constraining the factorization of the input string by
Direction of the match (rightward or leftward)
Length (longest or shortest)
N.B. Such rules seem to work algorithmically, but like all regular expressions, they compile into finite-state networks.

@-> left-to-right, longest match replacement
a b | b | b a | a b a @-> x
applied to “aba”, effectively prefers the longest match, as if the matching were being done left-to-right. Of the four factorizations, only the last one survives:
a [b] a      a x a
a [b a]      a x
[a b] a      x a
[a b a]      x      (the only result)
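A procedural imitation of left-to-right, longest-match replacement is possible with Python's `re` module, under the assumption that the patterns are literal strings. Python tries alternatives in the order written, so listing the longest literals first makes the leftmost match also the longest one. The function name is hypothetical.

```python
import re

def longest_match_replace(s, patterns, repl):
    # longest literals first, so the first alternative that matches
    # at a position is also the longest one available there
    ordered = sorted(patterns, key=len, reverse=True)
    alternation = "|".join(re.escape(p) for p in ordered)
    return re.sub(alternation, repl, s)

print(longest_match_replace("aba", ["ab", "b", "ba", "aba"], "x"))  # x
```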

conditional replacement
A -> B (Replacement)    L _ R (Context)
The relation that replaces A by B between L and R, leaving everything else unchanged.
Sources of complexity:
Replacements and contexts may overlap
Alternative ways of interpreting “between left and right”

both contexts on the input side, || operator
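A -> B || L _ R
a b -> x || a b _ a
Applied to a b a b a b a, the rule yields a b x x a: the left context is checked on the input side, so both the second and the third a b are replaced.
In practice, this is the most-used type of Replace Rule.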

L on input side, R on the output side, // operator
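A -> B // L _ R
a b -> x // a b _ a
Applied to a b a b a b a, the rule yields a b x a b a: the left context is checked on the output side, so the third a b, now preceded by x, is not replaced.
// rules can be useful for handling vowel harmony.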
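The difference between || and // on the string a b a b a b a can be simulated with a simple left-to-right scan. This is a sketch under strong assumptions (literal strings for A, B, L, and R, and a single scan) and does not reproduce the general semantics of the compiled transducers; the function and parameter names are made up for the example.

```python
def apply_rule(s, target, repl, left, right, left_on_output):
    """Scan left to right, replacing target between left and right.

    left_on_output=False imitates ||  (left context read from the input)
    left_on_output=True  imitates //  (left context read from the output)
    """
    out = ""
    i = 0
    while i < len(s):
        left_side = out if left_on_output else s[:i]
        if (s.startswith(target, i)
                and left_side.endswith(left)
                and s.startswith(right, i + len(target))):
            out += repl
            i += len(target)
        else:
            out += s[i]
            i += 1
    return out

print(apply_rule("abababa", "ab", "x", "ab", "a", left_on_output=False))  # abxxa
print(apply_rule("abababa", "ab", "x", "ab", "a", left_on_output=True))   # abxaba
```

The only difference between the two calls is which string the left context is matched against, which is enough to produce the two outputs shown on the slides.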

Two languages with vowel shortening
Slovak: V: -> V || V: C* _    (left context on the input side)
v o l + a: v + a: m e:
v o l + a: v + a m e
“we call often”
Gidabal: V: -> V // V: C* _    (left context on the output side)
g u n u: m + b a: + d a: ng + b e: +
g u n u: m + b a + d a: ng + b e +
“is certainly right on the stump”

syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i
s t r u k - t u - r a - l i s - m i
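The hyphenation rule has a close analogue in Python's `re` module, with a lookahead standing in for the right context `_ [C V]`. The consonant class below is an assumption (every lowercase ASCII letter other than a, e, i, o, u), since the definition of C is elided in the slide. Greedy backtracking is not longest-match in general, although it gives the same answer for this pattern on the example word.

```python
import re

C = "[b-df-hj-np-tv-z]"   # assumed consonant set: lowercase letters minus a e i o u
V = "[aeiou]"

# C* V+ C*, but only when followed by C V (lookahead = right context)
SYLLABLE = re.compile(f"{C}*{V}+{C}*(?={C}{V})")

def syllabify(word):
    # insert a hyphen after each matched syllable
    return SYLLABLE.sub(lambda m: m.group(0) + "-", word)

print(syllabify("strukturalismi"))  # struk-tu-ra-lis-mi
```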

Syntactic “chunking” recognizing dates
Best result:
Today is [Wednesday, March 14, 2000].
Bad results:
Today is Wednesday, [March 14, 2000].
Today is [Wednesday, March 14], 2000.
Today is Wednesday, [March 14], 2000.
Today is [Wednesday], March 14, 2000.
Need left-to-right, longest-match constraints.

Defining the language of dates
Day = Monday | Tuesday | ... | Sunday
Month = January | February | ... | December
Date = 1 | 2 | 3 | ... | 3 1
Year = 1To9 (%0To9 (%0To9 (%0To9)))    from 1 to 9999
AllDates = Day | (Day ", ") Month " " Date (", " Year)

All dates from 1.1.1 to 31.12.9999
[Figure: the AllDates network of date expressions, with arcs for day names, month names, digits, and the comma; 13 states, 96 arcs.]

Parser for dates
AllDates @-> "[DE " ... "]"
Compiles into an unambiguous transducer (23 states, 332 arcs).
Today is [DE Wednesday, March 15, 2000] because yesterday was [DE Tuesday] and it was [DE March 14] so tomorrow must be [DE Thursday, March 16] and not [DE March 15] as it says on the program.
How would we write a similar rule to do XML markup?
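The date parser can be approximated with a Python regular expression built from the same pieces as AllDates. This is a sketch, not the compiled transducer: the constant and function names are chosen for the example, and the longer alternative is listed before the bare day name so that Python's ordered alternation behaves like longest match on this input.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "[12][0-9]|3[01]|[1-9]"     # 1 .. 31
YEAR = "[1-9][0-9]{0,3}"           # 1 .. 9999

# (Day ", ") Month " " Date (", " Year)  |  Day
ALL_DATES = re.compile(
    rf"(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, (?:{YEAR}))?|(?:{DAY})"
)

def mark_dates(text):
    # wrap each date expression, like "[DE " ... "]" in the rule
    return ALL_DATES.sub(lambda m: f"[DE {m.group(0)}]", text)

print(mark_dates("Today is Wednesday, March 15, 2000 because "
                 "yesterday was Tuesday and it was March 14."))
```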

Problem of reference
Valid:
Wednesday, March 15, 2000
Tuesday, February 29, 2000
Monday, September 16, 1996
Invalid:
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, September 16, 1996

refinement by intersection
AllDates
MaxDaysInMonth
    31 => Jan | Mar … _
    30 => Apr | Jun … _
LeapYears
    Feb 29 => _ ...
WeekdayDate
Valid Dates

defining valid dates
AllDates & MaxDaysInMonth & LeapYears & WeekdayDates = ValidDates
AllDates: 13 states, 96 arcs
ValidDates: 805 states, 6472 arcs
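The effect of the three constraints can be cross-checked against Python's `datetime` module, which uses the same range of years, 1 to 9999. This is a procedural stand-in, not an intersection of networks: the function name and the argument layout are assumptions made for the example.

```python
import datetime

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]   # date.weekday(): Monday == 0

def is_valid(weekday, month, day, year):
    try:
        # rejects April 31 (MaxDaysInMonth) and February 29, 1900 (LeapYears)
        d = datetime.date(year, MONTHS.index(month) + 1, day)
    except ValueError:
        return False
    # rejects a day name that does not match the date (WeekdayDates)
    return DAYS[d.weekday()] == weekday

print(is_valid("Wednesday", "March", 15, 2000))     # True
print(is_valid("Thursday", "February", 29, 1900))   # False
```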

Parser for valid and invalid dates
[AllDates - ValidDates] @-> "[DE " ... "]" , ValidDates @-> "[DT " ... "]"
2688 states, 20439 arcs
Today is [DT Wednesday, March 15, 2000], not [DE Tuesday, March 17, 2000].
[DT ...] marks a valid date; [DE ...] marks an invalid date.

observations For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar. Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them. This is a fundamental advantage over higher-level formalisms.
