languages & relations regular expressions finite-state networks


languages & relations regular expressions finite-state networks Lauri Karttunen XRCE

overview
regular expressions and networks: simple examples
common regular expression operators: negation, union, intersection, composition
symbols: interpretation of ?
Xerox operators: contains, restriction, replacement

A REGULAR EXPRESSION (a:b) denotes a LANGUAGE/RELATION ({<“a”, “b”>}) and compiles into a FINITE-STATE NETWORK (a:b); the network encodes the language/relation.

simple expressions and networks
0 denotes the empty string language.
?* denotes the universal language (the universal identity relation).
~[?*] denotes the empty language.

common regular expression operators
concatenation
* + iteration
| union
& intersection*
~ \ - complementation*, minus*
.x. crossproduct
.o. composition
* = not applicable to regular relations because the result may not be encodable by a finite-state network.

Xerox extensions
$ containment
=> restriction
-> @-> replacement
These make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

\ term negation
\A is equivalent to [? - A]. Example: the network for \a has Sigma: ? a. The interpretation of ? in a network is determined by the sigma alphabet of the network.

~ complementation
~A is equivalent to [?* - A]. The network for ~a is derived from the original network for a by first completing it and then negating it.

union
a | b: the networks for a and b are joined under a new start state. The result is shown after epsilon removal, and again after epsilon removal and minimization.

intersection
A = [a | b] [c | d], B = c => b _, C = A & B.
Each state in the intersection corresponds to a pair of states in the original networks. The states of C are <0,0>, <1,0>, <1,1> and <2,0>.
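The pairing of states can be sketched with a small product construction. This is a minimal illustration, not the xfst implementation: the tuple representation `(start, finals, delta)` and the helper names are assumptions, and the symbol ? is replaced by the explicit alphabet {a, b, c, d}.

```python
from collections import deque

def intersect(dfa1, dfa2):
    """Product construction: each result state is a pair of original states."""
    (start1, finals1, delta1), (start2, finals2, delta2) = dfa1, dfa2
    start = (start1, start2)
    delta, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        p, q = state = queue.popleft()
        if p in finals1 and q in finals2:
            finals.add(state)
        for (src, sym), dst1 in delta1.items():
            # Follow a symbol only if both networks can read it.
            if src != p or (q, sym) not in delta2:
                continue
            target = (dst1, delta2[(q, sym)])
            delta[(state, sym)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, finals, delta

def accepts(dfa, word):
    state, finals, delta = dfa
    for sym in word:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return state in finals

# A = [a | b] [c | d]
A = (0, {2}, {(0, "a"): 1, (0, "b"): 1, (1, "c"): 2, (1, "d"): 2})
# B = c => b _ ("any c must be preceded by b"), alphabet {a, b, c, d} assumed
B = (0, {0, 1}, {
    (0, "a"): 0, (0, "b"): 1, (0, "d"): 0,   # no c arc: c is not allowed here
    (1, "a"): 0, (1, "b"): 1, (1, "c"): 0, (1, "d"): 0,
})
C = intersect(A, B)
states_of_C = {C[0]} | set(C[2].values())
```

Under these assumptions `states_of_C` contains exactly the four pairs listed on the slide, and `C` accepts ad, bc and bd but not ac.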

composition
A = [a | b] [c | d], B = a d -> d a, C = A .o. B.
Similar to intersection: the lower side of A is matched with the upper side of B. The states of C are the pairs <0,0>, <1,0>, <1,1>, <1,2> and <2,0>.

ordinary symbols
single character symbols: a, b, c, …
multicharacter symbols: [Verb], [Sg], +Sg, +Verb, ^Fin, @U.H.b@, …
Be careful to distinguish between them: ab is a single multicharacter symbol, whereas a b is a sequence of two symbols. In general, it’s a bad idea to have symbols like ab.

special symbols in regular expressions
0 (EPSILON) represents the empty string.
? (ANY) represents any known symbol and any unknown single character symbol. ? denotes an infinite language.

symbols vs. symbol pairs
In general, no distinction is made between a (the language {“a”}) and a:a (the identity relation {<“a”, “a”>}), but we have to make a distinction between ? (a language or identity relation) and ?:? (any mapping between any two symbols). The term label is a common name for symbols and symbol pairs.

more about ?
In regular expressions (a:?, ?:?, ?), ? represents any symbol whatsoever.
In networks (a:?, ?:?, ?), ? represents any unknown symbol.

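why is ? so complicated?
Consider the concatenation of [a] and [?]... In the resulting network the second arc must carry both ? and a, because a is now a known symbol that ? alone no longer covers.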

… vs. the concatenation of [a] and [\a]. Here the second arc carries only ?, with Sigma: ? a. The interpretation of ? in a network is determined by the sigma alphabet of the network.

… vs. the union [a | \a].
Redundant result: arcs a and ?, with Sigma: ? a
xfst[1]> compact sigma
Best result: a single ? arc, with Sigma: ?

some equivalencies [0 a] is equivalent to a.

Xerox extensions (abbreviations)
$ containment
=> restriction
-> @-> replacement
These make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment operator $
$a is equivalent to [?* a ?*].

Restriction operator =>
a => b _ c: “Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]

Replacement operator ->
a b -> b a: “Replace ‘ab’ by ‘ba’.”
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]

Marking (a kind of double insertion)
a|e|i|o|u -> %[ ... %]
p o t a t o becomes p[o]t[a]t[o]
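For readers more familiar with string libraries than with xfst, the effect of the marking rule can be imitated in Python. This is only an analogy using the standard `re` module; the function name `mark_vowels` is made up for illustration, and no transducer is built.

```python
import re

def mark_vowels(text):
    """Wrap every vowel in square brackets, leaving everything else unchanged."""
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", text)

marked = mark_vowels("potato")   # "p[o]t[a]t[o]"
```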

observations
Transducers derived from -> expressions are generally meant to be applied in a given direction. For example, a b -> b a yields a transducer suitable for downward application:
apply down> ab yields ba
apply up> ba yields ab and ba
The input language (upper in this case) is the universal language; that is, the transducer never fails on any input in this direction. If the rule does not match, the input is mapped without change to the output.

variants
Optional replacement operator: (->)
Inverse replacement (more upward-oriented): <-
Parallel replacement: a -> b, b -> a
Directed replacement (conceptually left-to-right): @-> @>
Conditional replacement (context-dependent)

multiple results
a b | b | b a | a b a -> x applied in a downward direction to the string “aba” yields four results, one for each of the four “factorizations” of the input string (the replaced part is shown in brackets):
a [b] a gives a x a
a [b a] gives a x
[a b] a gives x a
[a b a] gives x
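The four results can be reproduced by brute-force enumeration. The sketch below is an illustration in plain Python, not how xfst compiles the rule; it assumes that the patterns are non-empty literal strings and that the rule is obligatory, so no unreplaced stretch may still contain a pattern. The function name `all_replacements` is hypothetical.

```python
def all_replacements(text, patterns, new):
    """Enumerate every output of an undirected, obligatory replacement."""
    def contains_match(segment):
        return any(p in segment for p in patterns)

    results = set()

    def walk(pos, out):
        rest = text[pos:]
        # Stop replacing here only if nothing replaceable is left behind.
        if not contains_match(rest):
            results.add(out + rest)
        for i in range(pos, len(text)):
            if contains_match(text[pos:i]):
                break  # the skipped stretch would contain a pattern
            for p in patterns:
                if text.startswith(p, i):
                    walk(i + len(p), out + text[pos:i] + new)

    walk(0, "")
    return results

outputs = all_replacements("aba", ["ab", "b", "ba", "aba"], "x")
```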

directed replace operators
Guarantee a unique result by constraining the factorization of the input string by
direction of the match (rightward or leftward)
length (longest or shortest)
N.B. Such rules seem to work algorithmically, but like all regular expressions, they compile into finite-state networks.

@-> left-to-right, longest match replacement
a b | b | b a | a b a @-> x applied to “aba” effectively prefers the longest match, as if the matching were being done left-to-right. Of the four factorizations (a x a, a x, x a, x), only the last one, x, is produced.
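Procedurally, the left-to-right, longest-match behavior looks like the loop below. This is a sketch of the "as if" reading only, assuming literal non-empty patterns; the compiled transducer does not scan the string this way, and `replace_longest` is an illustrative name.

```python
def replace_longest(text, patterns, new):
    """Scan left to right; at each position replace the longest matching pattern."""
    out, pos = [], 0
    while pos < len(text):
        matches = [p for p in patterns if text.startswith(p, pos)]
        if matches:
            out.append(new)
            pos += len(max(matches, key=len))  # prefer the longest match
        else:
            out.append(text[pos])              # copy unmatched symbols unchanged
            pos += 1
    return "".join(out)

result = replace_longest("aba", ["ab", "b", "ba", "aba"], "x")   # "x"
```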

conditional replacement
A -> B (the replacement) is paired with a context: the relation that replaces A by B between L and R, leaving everything else unchanged.
Sources of complexity:
Replacements and contexts may overlap.
There are alternative ways of interpreting “between left and right.”

both contexts on the input side, || operator
A -> B || L _ R
a b -> x || a b _ a applied to a b a b a b a yields a b x x a
In practice, this is the most-used type of Replace Rule.
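A rough Python analogy for the || rule uses lookbehind and lookahead assertions, which inspect the original (input) string rather than the partially rewritten output. This mirrors the example only; it is not a general translation of replace rules, and the function name is made up.

```python
import re

def replace_input_context(text):
    """Imitate  a b -> x || a b _ a : both contexts are checked on the input."""
    return re.sub(r"(?<=ab)ab(?=a)", "x", text)

result = replace_input_context("abababa")   # "abxxa"
```

Both the second and the third ab are replaced, because each of them is preceded by ab in the input, regardless of what that ab turns into.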

L on input side, R on the output side, // operator
A -> B // L _ R
a b -> x // a b _ a applied to a b a b a b a yields a b x a b a
// rules can be useful for handling vowel harmony.

Two languages with vowel shortening
Slovak: V: -> V || V: C* _ (left context on the input side)
v o l + a: v + a: m e: becomes v o l + a: v + a m e (“we call often”)
Gidabal: V: -> V // V: C* _ (left context on the output side)
g u n u: m + b a: + d a: ng + b e: + becomes g u n u: m + b a + d a: ng + b e + (“is certainly right on the stump”)
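The difference between the two context sides can be made concrete with a small token-based simulation. It is a sketch under stated assumptions: words are lists of tokens, a long vowel is written with a trailing colon, and every token that does not start with a vowel letter (including + and ng) counts as C. The function and parameter names are illustrative.

```python
VOWELS = set("aeiou")

def is_vowel(token):
    return token[0] in VOWELS

def is_long(token):
    return is_vowel(token) and token.endswith(":")

def shorten(tokens, context_side):
    """Apply V: -> V after V: C*, checking the left context on the given side."""
    out = []
    for i, token in enumerate(tokens):
        new = token
        if is_long(token):
            # "input" looks at the original string, "output" at what was produced.
            side = tokens[:i] if context_side == "input" else out
            previous_vowel = next((t for t in reversed(side) if is_vowel(t)), None)
            if previous_vowel is not None and is_long(previous_vowel):
                new = token[:-1]   # drop the length mark
        out.append(new)
    return out

slovak = shorten("v o l + a: v + a: m e:".split(), "input")
gidabal = shorten("g u n u: m + b a: + d a: ng + b e: +".split(), "output")
```

With the input-side context every long vowel after a long vowel is shortened; with the output-side context a shortened vowel no longer triggers shortening of the next one, which gives the alternating pattern.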

syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i becomes s t r u k - t u - r a - l i s - m i
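The same hyphenation can be approximated with Python's `re` module. Two assumptions are made here: C is taken to be any character that is not one of the five vowels (the slide elides the full definition of C), and Python's greedy, backtracking matcher is relied on to find the longest match, which holds for this pattern because the C and V classes are disjoint but is not a general property of `re`.

```python
import re

# [C* V+ C*] followed by (but not including) a C V sequence
SYLLABLE = re.compile(r"([^aeiou]*[aeiou]+[^aeiou]*)(?=[^aeiou][aeiou])")

def syllabify(word):
    """Insert a hyphen after each matched C* V+ C* stretch."""
    return SYLLABLE.sub(r"\1-", word)

hyphenated = syllabify("strukturalismi")   # "struk-tu-ra-lis-mi"
```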

Syntactic “chunking”: recognizing dates
Best result: Today is [Wednesday, March 14, 2000].
Bad results:
Today is Wednesday, [March 14, 2000].
Today is [Wednesday, March 14], 2000.
Today is Wednesday, [March 14], 2000.
Today is [Wednesday], March 14, 2000.
Need left-to-right, longest-match constraints.

Defining the language of dates
Day = Monday | Tuesday | ... | Sunday
Month = January | February | ... | December
Date = 1 | 2 | 3 | ... | 3 1
Year = 1To9 (%0To9 (%0To9 (%0To9))) (from 1 to 9999)
AllDates = Day | (Day “, ”) Month “ ” Date (“, ” Year)

All dates from 1.1.1 to 31.12.9999
13 states, 96 arcs; 29 760 007 date expressions
(Network diagram: arcs labeled with the day names Mon to Sun, the month names Jan to Dec, the digits, and the comma.)
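The figure of 29 760 007 date expressions follows directly from the definition of AllDates. The arithmetic below is a worked check, with variable names chosen for illustration: a date expression is either a bare day name, or an optional day name, a month, a date, and an optional year.

```python
days = 7        # Monday .. Sunday
months = 12     # January .. December
dates = 31      # 1 .. 31, regardless of month
years = 9999    # 1 .. 9999

# (Day ", ") Month " " Date (", " Year): the day and the year are optional.
with_month = (1 + days) * months * dates * (1 + years)

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
all_dates = days + with_month
```

This gives 8 × 12 × 31 × 10 000 + 7 = 29 760 007.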

Parser for dates
AllDates @-> “[DE ” ... “]”
Compiles into an unambiguous transducer (23 states, 332 arcs).
Today is [DE Wednesday, March 15, 2000] because yesterday was [DE Tuesday] and it was [DE March 14] so tomorrow must be [DE Thursday, March 16] and not [DE March 15] as it says on the program.
How would we write a similar rule to do XML markup?
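As a point of comparison, the markup produced by this parser can be imitated with an ordinary Python regular expression. This is a sketch, not a translation of the transducer: the names `ALL_DATES` and `mark_dates` are made up, full day and month names are assumed, and the alternatives are ordered by hand so that Python's leftmost, greedy matching picks the longest date expression.

```python
import re

DAY = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
MONTH = ("January|February|March|April|May|June|July|August|"
         "September|October|November|December")
DATE = "3[01]|[12][0-9]|[1-9]"   # 1 .. 31, two-digit alternatives first
YEAR = "[1-9][0-9]{0,3}"         # 1 .. 9999

# AllDates = Day | (Day ", ") Month " " Date (", " Year)
ALL_DATES = re.compile(
    rf"(?:(?:{DAY}), )?(?:{MONTH}) (?:{DATE})(?:, {YEAR})?|(?:{DAY})"
)

def mark_dates(text):
    """Wrap every date expression in [DE ...]."""
    return ALL_DATES.sub(lambda m: "[DE " + m.group(0) + "]", text)

sentence = ("Today is Wednesday, March 15, 2000 because yesterday was Tuesday "
            "and it was March 14 so tomorrow must be Thursday, March 16 "
            "and not March 15 as it says on the program.")
marked = mark_dates(sentence)
```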

Problem of reference
Valid:
Wednesday, March 15, 2000
Tuesday, February 29, 2000
Monday, September 16, 1996
Invalid:
Wednesday, April 31, 1996
Thursday, February 29, 1900
Tuesday, September 16, 1996

refinement by intersection
AllDates is intersected with three constraints to give ValidDates:
MaxDaysInMonth: 31 => Jan | Mar … _ and 30 => Apr | Jun … _
LeapYears: Feb 29 => _ ...
WeekdayDate
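The three constraints can be checked procedurally with Python's standard `datetime` module, which serves here as an independent reference for the valid and invalid examples. This is an assumption-laden stand-in rather than the finite-state construction: `datetime.date` uses the proleptic Gregorian calendar for years 1 to 9999, and the function name `is_valid` is illustrative.

```python
import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]          # date.weekday(): Monday is 0
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

def is_valid(day_name, month_name, day, year):
    """True if the date exists and falls on the stated weekday."""
    try:
        # Raises ValueError for April 31 or for February 29 in a non-leap year.
        date = datetime.date(year, MONTHS.index(month_name) + 1, day)
    except ValueError:
        return False
    return DAYS[date.weekday()] == day_name

valid = [("Wednesday", "March", 15, 2000),
         ("Tuesday", "February", 29, 2000),
         ("Monday", "September", 16, 1996)]
invalid = [("Wednesday", "April", 31, 1996),
           ("Thursday", "February", 29, 1900),
           ("Tuesday", "September", 16, 1996)]
```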

defining valid dates
AllDates: 13 states, 96 arcs; 29 760 007 date expressions
AllDates & MaxDaysInMonth & LeapYears & WeekdayDates = ValidDates
ValidDates: 805 states, 6472 arcs; 7 307 053 date expressions
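The count of 7 307 053 valid date expressions can be reproduced by hand under three assumptions, none of which is stated on the slide: leap years follow the Gregorian rule for all years from 1 to 9999, February 29 is accepted when no year is given, and any weekday is accepted when no year is given.

```python
def is_leap(year):
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

leap_years = sum(is_leap(y) for y in range(1, 10000))
dated = 9999 * 365 + leap_years      # every calendar day in years 1 .. 9999

month_day = 366                      # Month Date pairs, including Feb 29

valid_dates = (
    7                 # Day alone
    + month_day       # Month Date
    + 7 * month_day   # Day, Month Date (no year, so any weekday)
    + dated           # Month Date, Year
    + dated           # Day, Month Date, Year (exactly one weekday fits)
)
```

With these assumptions the total matches the figure on the slide.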

Parser for valid and invalid dates
[AllDates - ValidDates] @-> “[DE ” ... “]” , ValidDates @-> “[DT ” ... “]”
2688 states, 20439 arcs
Today is [DT Wednesday, March 15, 2000], not [DE Tuesday, March 17, 2000].
DT marks a valid date; DE marks an invalid date.

observations
For some subsets of natural language, such as dates, a finite-state description is more appropriate than a phrase structure grammar.
Regular languages and relations can be modified directly with the finite-state calculus without rewriting the grammars that describe them. This is a fundamental advantage over higher-level formalisms.