Building Finite-State Machines

Building Finite-State Machines
600.465 - Intro to NLP - J. Eisner

Finite-State Toolkits

In these slides, we'll use Xerox's regexp notation. Their tool is XFST; a free alternative is foma.

Usage:
- Enter a regular expression; the tool builds an FSA or FST.
- Now type in an input string.
  - FSA: it tells you whether the string is accepted.
  - FST: it tells you all the output strings (if any).
- You can also invert an FST to map outputs back to inputs.
- You could hook it up to other NLP tools that need finite-state processing of their input or output.

There are other tools for weighted FSMs (Thrax, OpenFST).

Common Regular Expression Operators (in XFST notation)

        concatenation            EF
  * +   iteration                E*, E+
  |     union                    E | F
  &     intersection             E & F
  ~ \ - complementation, minus   ~E, \x, F-E
  .x.   crossproduct             E .x. F
  .o.   composition              E .o. F
  .u    upper (input) language   E.u   "domain"
  .l    lower (output) language  E.l   "range"

Concatenation: EF

EF = {ef : e ∈ E, f ∈ F}
- ef denotes the concatenation of 2 strings; EF denotes the concatenation of 2 languages.
- To pick a string in EF, pick e ∈ E and f ∈ F and concatenate them.
- To find out whether w ∈ EF, look for at least one way to split w into two "halves," w = ef, such that e ∈ E and f ∈ F.
- A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
- If E and F denote regular languages, then so does EF. (We will have to prove this by finding the FSA for EF!)
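A quick illustration of the definitions above, with finite languages modeled as Python sets of strings (a sketch only; XFST operates on automata, not explicit sets, and these function names are hypothetical):

```python
def concat(E, F):
    """EF = {ef : e in E, f in F}."""
    return {e + f for e in E for f in F}

def in_concat(w, E, F):
    """w is in EF iff some split w = e+f has e in E and f in F."""
    return any(w[:i] in E and w[i:] in F for i in range(len(w) + 1))

E = {"a", "ab"}
F = {"c", "bc"}
print(sorted(concat(E, F)))     # ['abbc', 'abc', 'ac']
print(in_concat("abc", E, F))   # True: "a"+"bc" or "ab"+"c"
```

Note that "abc" has two witnessing splits; membership only requires at least one.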

Iteration: E*, E+

E* = {e1e2…en : n ≥ 0, e1 ∈ E, …, en ∈ E}
- To pick a string in E*, pick any number of strings in E and concatenate them.
- To find out whether w ∈ E*, look for at least one way to split w into 0 or more sections, e1e2…en, all of which are in E.
E+ = {e1e2…en : n > 0, e1 ∈ E, …, en ∈ E} = EE*

Union: E | F

E | F = {w : w ∈ E or w ∈ F} = E ∪ F
- To pick a string in E | F, pick a string from either E or F.
- To find out whether w ∈ E | F, check whether w ∈ E or w ∈ F.

Intersection: E & F

E & F = {w : w ∈ E and w ∈ F} = E ∩ F
- To pick a string in E & F, pick a string from E that is also in F.
- To find out whether w ∈ E & F, check whether w ∈ E and w ∈ F.

Complementation, minus: ~E, \x, F-E

~E = {e : e ∉ E} = Σ* − E
E − F = {e : e ∈ E and e ∉ F} = E & ~F
\E = Σ − E (any single character not in E)
Σ is the set of all letters, so Σ* is the set of all strings.

Regular Expressions

- A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
- If E and F denote regular languages, then so do EF, etc.
- Regular expression: EF*|(F & G)+
- Syntax: a tree whose internal nodes are the operators (concat, *, &, +, |) and whose leaves are E, F, G.
- Semantics: denotes a regular language. As usual, we can build the semantics compositionally, bottom-up. E, F, G must themselves be regular languages.
- As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.

Regular Expressions for Regular Relations

- A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings. If E and F denote regular languages, then so do EF, etc.
- A relation is a set of pairs (here, pairs of strings). It is a regular relation if there exists an FST that accepts all the pairs in the relation, and no other pairs. If E and F denote regular relations, then so do EF, etc.
  EF = {(ef, e′f′) : (e, e′) ∈ E, (f, f′) ∈ F}
- Can you guess the definitions for E*, E+, E | F, E & F when E and F are regular relations?
- Surprise: E & F isn't necessarily regular in the case of relations, so it is not supported.

Crossproduct: E .x. F

E .x. F = {(e, f) : e ∈ E, f ∈ F}
Combines two regular languages into a regular relation.

Composition: E .o. F

E .o. F = {(e, f) : ∃m. (e, m) ∈ E, (m, f) ∈ F}
Composes two regular relations into a regular relation.
As we've seen, this generalizes ordinary function composition.
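The definition of .o. can be sketched directly on relations represented as explicit sets of string pairs (illustrative only; the Greek and No names echo the "3 Uses of Set Composition" slide later in the deck):

```python
def compose(E, F):
    """E .o. F = {(e, f) : there is an m with (e, m) in E and (m, f) in F}."""
    return {(e, f) for (e, m) in E for (m2, f) in F if m == m2}

Greek = {("abcd", "abgd"), ("abed", "abed"), ("abed", "abd")}
No = {("abgd", "abgd"), ("abd", "abd")}
print(sorted(compose(Greek, No)))
# [('abcd', 'abgd'), ('abed', 'abd')]
```

The intermediate string m is existentially quantified away: only the (input, final output) pairs survive.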

Upper (input) language: E.u ("domain")

E.u = {e : ∃m. (e, m) ∈ E}
Similarly, E.l = {m : ∃e. (e, m) ∈ E} is the lower (output) language ("range").
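Domain and range of a relation-as-set-of-pairs, as a small sketch (hypothetical helper names):

```python
def upper(E):
    """E.u = {e : (e, m) in E for some m} -- the input language."""
    return {e for (e, m) in E}

def lower(E):
    """E.l = {m : (e, m) in E for some e} -- the output language."""
    return {m for (e, m) in E}

Greek = {("abed", "abed"), ("abed", "abd")}
print(upper(Greek))          # {'abed'}
print(sorted(lower(Greek)))  # ['abd', 'abed']
```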


Function from strings to ...

- Acceptors (FSAs): map strings to {false, true}.              [arcs a, c, e]
- Transducers (FSTs): map strings to strings.                  [arcs a:x, c:z, e:y]
- Weighted acceptors: map strings to numbers.                  [arcs a/.5, c/.7, e/.5; final weight .3]
- Weighted transducers: map strings to (string, num) pairs.    [arcs a:x/.5, c:z/.7, e:y/.5; final weight .3]

How to implement?
(slide courtesy of L. Karttunen, modified)

        concatenation            EF
  * +   iteration                E*, E+
  |     union                    E | F
  ~ \ - complementation, minus   ~E, \x, E-F
  &     intersection             E & F
  .x.   crossproduct             E .x. F
  .o.   composition              E .o. F
  .u    upper (input) language   E.u   "domain"
  .l    lower (output) language  E.l   "range"

Concatenation
(example courtesy of M. Mohri)
[figure: two weighted automata and their concatenation; not reproduced]

Union
(example courtesy of M. Mohri)
[figure: two weighted automata and their union; not reproduced]

Closure (this example has outputs too)
(example courtesy of M. Mohri)
[figure: a transducer and its closure; not reproduced]
The loop creates (red machine)+. Then we add a state to get ε | (red machine)+.
Why do it this way? Why not just make state 0 final?

Upper language (domain)
(example courtesy of M. Mohri)
[figure not reproduced]
.u extracts the upper language; similarly construct the lower language .l.
Also called the input and output languages.

Reversal
(example courtesy of M. Mohri)
[figure: a transducer and its reversal .r; not reproduced]

Inversion
(example courtesy of M. Mohri)
[figure: a transducer and its inversion .i; not reproduced]

Complementation

- Given a machine M, construct a machine that accepts all strings not accepted by M.
- Just change final states to non-final and vice versa.
- This works only if the machine has been determinized and completed first. (Why?)
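Flipping final states, sketched on a tiny hypothetical DFA encoding (states, transition dict, start, finals). As the slide says, the trick is only correct for a deterministic, complete machine: otherwise a string could be accepted on one path and rejected on another, or die mid-input in both M and its "complement":

```python
def complement(dfa):
    """Swap final and non-final states of a complete DFA."""
    states, delta, start, finals = dfa
    return (states, delta, start, states - finals)

def accepts(dfa, w):
    states, delta, start, finals = dfa
    q = start
    for c in w:
        if (q, c) not in delta:   # would mean the DFA is incomplete
            return False
        q = delta[(q, c)]
    return q in finals

# complete DFA over {a, b} accepting strings that end in 'a'
dfa = ({0, 1},
       {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0},
       0, {1})
print(accepts(dfa, "ba"))              # True
print(accepts(complement(dfa), "ba"))  # False
print(accepts(complement(dfa), "ab"))  # True
```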

Intersection
(example adapted from M. Mohri)

Machine 1: fat/0.5, pig/0.3, eats/0, sleeps/0.6; final state 2/0.8.
Machine 2: fat/0.2, pig/0.4, eats/0.6, sleeps/1.3; final state 2/0.5.
Intersection: fat/0.7, pig/0.7, eats/0.6, sleeps/1.9; final states 2,0/0.8 and 2,2/1.3.

Intersection, continued

Paths 0012 and 0110 both accept "fat pig eats", so the new machine must too, along path (0,0) (0,1) (1,1) (2,0).

Intersection, continued

Paths 00 and 01 both accept "fat", so the new machine must too, along path (0,0) (0,1), with arc fat/0.7.

Intersection, continued

Paths 00 and 11 both accept "pig", so the new machine must too, along path (0,1) (1,1), with arc pig/0.7.

Intersection, continued

Paths 12 and 12 both accept "sleeps", so the new machine must too, along path (1,1) (2,2), with arc sleeps/1.9.

Intersection, completed

The finished machine: fat/0.7, pig/0.7, eats/0.6 to final state (2,0)/0.8, and sleeps/1.9 to final state (2,2)/1.3.
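The product construction walked through above can be sketched for weighted acceptors in the tropical semiring (weights add along a path, and intersection adds the two machines' arc weights). The encoding here is hypothetical: each machine is (start, finals, arcs) with finals mapping state to stopping weight and arcs mapping (state, label) to (next_state, weight); deterministic machines only, for brevity:

```python
from collections import deque

def intersect(m1, m2):
    """Weighted intersection by exploring reachable state pairs."""
    (s1, f1, a1), (s2, f2, a2) = m1, m2
    start = (s1, s2)
    arcs, finals, seen, queue = {}, {}, {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        if q1 in f1 and q2 in f2:            # both halves final: add stop weights
            finals[(q1, q2)] = f1[q1] + f2[q2]
        for (p1, lab), (r1, w1) in a1.items():
            if p1 != q1 or (q2, lab) not in a2:
                continue                     # labels must match in both machines
            r2, w2 = a2[(q2, lab)]
            arcs[((q1, q2), lab)] = ((r1, r2), w1 + w2)
            if (r1, r2) not in seen:
                seen.add((r1, r2))
                queue.append((r1, r2))
    return start, finals, arcs

# a small fragment of the fat-pig example above
m1 = (0, {1: 0.8}, {(0, "fat"): (0, 0.5), (0, "pig"): (1, 0.3)})
m2 = (0, {1: 0.5}, {(0, "fat"): (0, 0.2), (0, "pig"): (1, 0.4)})
start, finals, arcs = intersect(m1, m2)
print(round(arcs[((0, 0), "fat")][1], 3))   # 0.7  (= 0.5 + 0.2)
print(round(finals[(1, 1)], 3))             # 1.3  (= 0.8 + 0.5)
```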

What Composition Means

[figure: one transducer maps ab?d nondeterministically to abcd, abed, abjd (weights 3, 2, 6); a second maps abcd to abgd (4), and abed to abed (2) and to abd (8); abjd has no output]

What Composition Means

Relation composition f ∘ g: the intermediate strings disappear and the weights along each two-step path add, so ab?d maps directly to abgd (3+4), abed (2+2), and abd (6+8).

Relation = set of pairs

g = { ab?d → abcd, ab?d → abed, ab?d → abjd, … }
f = { abcd → abgd, abed → abed, abed → abd, … }
f does not contain any pair of the form abjd → …

f  g = {xz: y (xy  f and yz  g)} Relation = set of pairs ab?d  abcd ab?d  abed ab?d  abjd … abcd  abgd abed  abed abed  abd … f  g = {xz: y (xy  f and yz  g)} where x, y, z are strings f  g ab?d  abgd ab?d  abed ab?d  abd … 4 ab?d abgd 2 abed 8 abd ... 600.465 - Intro to NLP - J. Eisner

Intersection vs. Composition

Intersection: matching arc labels combine, e.g. pig/0.3 & pig/0.4 = pig/0.7 along path (0,1) (1,1).
Composition: machine 1's output symbol must match machine 2's input symbol, e.g. Wilbur:pig/0.3 .o. pig:pink/0.4 = Wilbur:pink/0.7.

Intersection vs. Composition

Intersection mismatch: pig/0.3 & elephant/0.4 yields no arc (the labels differ), so no pig/0.7 here.
Composition mismatch: Wilbur:pig/0.3 .o. elephant:gray/0.4 yields no arc (the intermediate symbols differ), so no Wilbur:gray/0.7.

Composition
(example courtesy of M. Mohri)

First machine: an odd number of a's (which become b's), then a b that is left alone, then any number of a's, the first of which is optionally replaced by b.
Second machine: one or more b's (all but the first of which become a's), then a:b, then an optional b:a.

Composition, arc by arc:
- a:b .o. b:b = a:b
- a:b .o. b:a = a:a
- a:b .o. b:a = a:a
- b:b .o. b:a = b:a
- a:b .o. b:a = a:a
- a:a .o. a:b = a:b
- b:b .o. a:b = nothing (since the intermediate symbol doesn't match)
- b:b .o. b:a = b:a
- a:b .o. a:b = a:b

Composition in Dyna

start = &pair( start1, start2 ).

final(&pair(Q1,Q2)) :- final1(Q1), final2(Q2).

edge(U, L, &pair(Q1,Q2), &pair(R1,R2)) min=
    edge1(U, Mid, Q1, R1) + edge2(Mid, L, Q2, R2).
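A Python reading of the Dyna rules above (with a hypothetical machine encoding): the composed machine's states are pairs, a pair state is final when both halves are, and an arc exists when machine 1's lower symbol equals machine 2's upper symbol; the `min=` aggregation becomes a min over parallel derivations:

```python
def compose_fst(m1, m2):
    """Machine = (start, finals, arcs), where arcs is a list of
    (src, upper, lower, dst, weight) tuples; weights add."""
    (s1, f1, a1), (s2, f2, a2) = m1, m2
    arcs = {}
    for (p1, u, mid1, r1, w1) in a1:
        for (p2, mid2, l, r2, w2) in a2:
            if mid1 != mid2:
                continue  # intermediate symbols must match
            key = ((p1, p2), u, l, (r1, r2))
            arcs[key] = min(arcs.get(key, float("inf")), w1 + w2)
    finals = {(q1, q2) for q1 in f1 for q2 in f2}
    return (s1, s2), finals, arcs

# one a:b arc composed with one b:a arc gives an a:a arc,
# as in the arc-by-arc composition examples above
m1 = (0, {0}, [(0, "a", "b", 0, 1.0)])
m2 = (0, {0}, [(0, "b", "a", 0, 2.0)])
start, finals, arcs = compose_fst(m1, m2)
print(arcs)  # {((0, 0), 'a', 'a', (0, 0)): 3.0}
```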

f  g = {xz: y (xy  f and yz  g)} Relation = set of pairs ab?d  abcd ab?d  abed ab?d  abjd … abcd  abgd abed  abed abed  abd … f  g = {xz: y (xy  f and yz  g)} where x, y, z are strings f  g ab?d  abgd ab?d  abed ab?d  abd … 4 ab?d abgd 2 abed 8 abd ... 600.465 - Intro to NLP - J. Eisner

3 Uses of Set Composition

- Feed a string into the Greek transducer:
  {abed} .o. Greek = {abed→abed, abed→abd}
  [{abed} .o. Greek].l = {abed, abd}
- Feed several strings in parallel:
  {abcd, abed} .o. Greek = {abcd→abgd, abed→abed, abed→abd}
  [{abcd, abed} .o. Greek].l = {abgd, abed, abd}
- Filter the result via No = {abgd, abd, …}:
  {abcd, abed} .o. Greek .o. No = {abcd→abgd, abed→abd}

What are the "basic" transducers?

- The operations on the previous slides combine transducers into bigger ones.
- But where do we start?
  - a:ε for a ∈ Σ
  - ε:x for x ∈ Δ
- Q: Do we also need a:x? How about ε:ε?

Some Xerox Extensions
(slide courtesy of L. Karttunen, modified)

  $        containment
  =>       restriction
  ->  @->  replacement

These make it easier to describe complex languages and relations without extending the formal power of finite-state systems.

Containment

$[ab*c]
"Must contain a substring that matches ab*c."
Accepts xxxacyy; rejects bcba.
Equivalent expression: ?* [ab*c] ?*
Warning: ? in regexps means "any character at all," but ? in machines means "any character not explicitly mentioned anywhere in the machine."
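The containment idiom carries over to Python's regex notation, where `.` plays the role XFST's `?` plays here (a sketch, not the automaton construction):

```python
import re

contains = re.compile(r".*ab*c.*")   # ?* [a b* c] ?*
print(bool(contains.fullmatch("xxxacyy")))  # True ("ac" matches ab*c)
print(bool(contains.fullmatch("bcba")))     # False (no a...c substring)
```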

Restriction
(slide courtesy of L. Karttunen, modified)

a => b _ c
"Any a must be preceded by b and followed by c."
Accepts bacbbacde; rejects baca.
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
  (the first conjunct forbids an a not preceded by b; the second, an a not followed by c)
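The restriction a => b _ c can be checked directly on strings (a sketch of the *semantics* only; the operator itself compiles to the pure-regex expression on this slide):

```python
def restriction_ok(w):
    """Every 'a' must be immediately preceded by 'b' and followed by 'c'."""
    return all(w[i - 1:i] == "b" and w[i + 1:i + 2] == "c"
               for i, ch in enumerate(w) if ch == "a")

print(restriction_ok("bacbbacde"))  # True
print(restriction_ok("baca"))       # False (final 'a' lacks a following 'c')
```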

Replacement
(slide courtesy of L. Karttunen, modified)

a b -> b a
"Replace 'ab' by 'ba'."
Transduces abcdbaba to bacdbbaa.
Equivalent expression: [~$[a b] [[a b] .x. [b a]]]* ~$[a b]
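The unconditional replacement a b -> b a, sketched as a left-to-right scan (this models the transduction's behavior on one input, not the FST itself):

```python
def replace_ab_ba(w):
    out, i = [], 0
    while i < len(w):
        if w.startswith("ab", i):  # replace a match
            out.append("ba")
            i += 2
        else:                      # otherwise copy one character
            out.append(w[i])
            i += 1
    return "".join(out)

print(replace_ab_ba("abcdbaba"))  # bacdbbaa
```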

Replacement is Nondeterministic

a b -> b a | x
"Replace 'ab' by 'ba' or 'x', nondeterministically."
Transduces abcdbaba to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}.

Replacement is Nondeterministic

[ a b -> b a | x ] .o. [ x => _ c ]
"Replace 'ab' by 'ba' or 'x', nondeterministically," composed with a restriction that any x must be followed by c.
Without the restriction, abcdbaba transduces to {bacdbbaa, bacdbxa, xcdbbaa, xcdbxa}; the restriction keeps only the outputs in which every x is followed by c.

Replacement is Nondeterministic
(slide courtesy of L. Karttunen, modified)

a b | b | b a | a b a -> x applied to "aba":
Four overlapping substrings match; we haven't told it which one to replace, so it chooses nondeterministically:
  a(b)a  →  a x a
  a(ba)  →  a x
  (ab)a  →  x a
  (aba)  →  x

More Replace Operators
(slide courtesy of L. Karttunen)

- Optional replacement: a b (->) b a
- Directed replacement guarantees a unique result by constraining the factorization of the input string by:
  - direction of the match (rightward or leftward)
  - length (longest or shortest)

@-> Left-to-right, Longest-match Replacement
(slide courtesy of L. Karttunen)

a b | b | b a | a b a @-> x applied to "aba" now yields only x: the leftmost match is chosen, and among matches starting there, the longest (aba).

  @->  left-to-right, longest match
  @>   left-to-right, shortest match
  ->@  right-to-left, longest match
  >@   right-to-left, shortest match
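The left-to-right, longest-match policy can be sketched for a set of literal alternatives, reproducing the "aba" example (illustrative only; @-> is compiled into a transducer, not run as a scan):

```python
def ltr_longest_replace(w, patterns, repl):
    out, i = [], 0
    while i < len(w):
        best = max((p for p in patterns if w.startswith(p, i)),
                   key=len, default=None)
        if best:                  # replace the longest match starting here
            out.append(repl)
            i += len(best)
        else:                     # no match: copy one character
            out.append(w[i])
            i += 1
    return "".join(out)

print(ltr_longest_replace("aba", ["ab", "b", "ba", "aba"], "x"))  # x
```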

Using "..." for marking
(slide courtesy of L. Karttunen, modified)

a|e|i|o|u -> [ ... ]
  p o t a t o  →  p [ o ] t [ a ] t [ o ]
(The transducer inserts the brackets via 0:[ and 0:] arcs around each vowel.)
Note: we actually have to write -> %[ ... %] or -> "[" ... "]", since [ and ] are parens in the regexp language.
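The marking rule rendered with a Python regex: brackets are inserted around each vowel, and "..." corresponds to keeping the matched material itself (`\g<0>` is the whole match):

```python
import re

print(re.sub(r"[aeiou]", r"[\g<0>]", "potato"))  # p[o]t[a]t[o]
```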

Using "..." for marking
(slide courtesy of L. Karttunen, modified)

Which way does the FST transduce potatoe?
  p o t a t o e  →  p [ o ] t [ a ] t [ o ] [ e ]
  vs.
  p o t a t o e  →  p [ o ] t [ a ] t [ o e ]
How would you change it to get the other answer?

Example: Finnish Syllabification
(slide courtesy of L. Karttunen)

define C [ b | c | d | f ... ];
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
"Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern."  (Why?)
  s t r u k t u r a l i s m i  →  s t r u k - t u - r a - l i s - m i
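A regex sketch of the syllabification rule, with simplified consonant and vowel classes (an assumption; the real alphabet is larger). Python's greedy matching plus the lookahead happens to reproduce the longest-match behavior on this example:

```python
import re

C, V = "[bcdfghjklmnpqrstvz]", "[aeiouy]"
# longest C* V+ C* chunk, provided a C V pattern follows
rule = re.compile(f"({C}*{V}+{C}*)(?={C}{V})")
print(rule.sub(r"\1-", "strukturalismi"))  # struk-tu-ra-lis-mi
```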

Conditional Replacement
(slide courtesy of L. Karttunen)

The relation that replaces A by B between L and R, leaving everything else unchanged:
  A -> B   Replacement
  L _ R    Context

Sources of complexity:
- Replacements and contexts may overlap.
- There are alternative ways of interpreting "between left and right."

Hand-Coded Example: Parsing Dates
(slide courtesy of L. Karttunen)

Best result:
  Today is [Tuesday, July 25, 2000].
Bad results:
  Today is Tuesday, [July 25, 2000].
  Today is [Tuesday, July 25], 2000.
  Today is Tuesday, [July 25], 2000.
  Today is [Tuesday], July 25, 2000.
Need left-to-right, longest-match constraints.

Source code: Language of Dates
(slide courtesy of L. Karttunen)

Day      = Monday | Tuesday | ... | Sunday
Month    = January | February | ... | December
Date     = 1 | 2 | 3 | ... | 31
Year     = %0To9 (%0To9 (%0To9 (%0To9))) - %0?*    (from 1 to 9999)
AllDates = Day | (Day ", ") Month " " Date (", " Year)

Object code: All Dates from 1/1/1 to 12/31/9999
(slide courtesy of L. Karttunen)

[figure: the compiled network; the single arc drawn for the day names actually represents 7 arcs, each labeled by a string]
13 states, 96 arcs; 29 760 007 date expressions.

Parser for Dates
(slide courtesy of L. Karttunen, modified)

AllDates @-> "[DT " ... "]"    (Xerox left-to-right replacement operator)
Compiles into an unambiguous transducer (23 states, 332 arcs).

Today is [DT Tuesday, July 25, 2000] because yesterday was [DT Monday] and it was [DT July 24] so tomorrow must be [DT Wednesday, July 26] and not [DT July 27] as it says on the program.
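A toy analogue of AllDates @-> "[DT " ... "]" with a deliberately tiny date grammar (only a few month names; the pattern and names are hypothetical, and a regex substitution stands in for the compiled transducer):

```python
import re

day = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
month = r"(?:January|July|September)"   # a small subset, for the sketch
date = rf"(?:{day}, )?{month} [1-9]\d?(?:, \d{{1,4}})?"
tagger = re.compile(date)

s = "Today is Tuesday, July 25, 2000."
print(tagger.sub(lambda m: "[DT " + m.group(0) + "]", s))
# Today is [DT Tuesday, July 25, 2000].
```

Note that Python's greedy optional groups give the longest-match bracketing here, which is exactly the behavior @-> guarantees in general.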

Problem of Reference
(slide courtesy of L. Karttunen)

Valid dates:
  Tuesday, July 25, 2000
  Tuesday, February 29, 2000
  Monday, September 16, 1996
Invalid dates:
  Wednesday, April 31, 1996    (April has only 30 days)
  Thursday, February 29, 1900  (1900 was not a leap year)
  Tuesday, July 26, 2000       (July 26, 2000 was a Wednesday)

Refinement by Intersection
(slide courtesy of L. Karttunen, modified)

Valid Dates = AllDates refined by intersection with:
  LeapYears:       Feb 29, => _ …       (Xerox contextual restriction operator)
  MaxDaysInMonth:  " 31" => Jan|Mar|May|… _
                   " 30" => Jan|Mar|Apr|… _
  WeekdayDate

Q: Why does the LeapYears rule end with a comma? Q: Can we write the whole rule?
Q: Why do the MaxDaysInMonth rules start with spaces? (And is it enough?)

Defining Valid Dates
(slide courtesy of L. Karttunen)

AllDates:    13 states, 96 arcs; 29 760 007 date expressions
AllDates & MaxDaysInMonth & LeapYears & WeekdayDates = ValidDates
ValidDates:  805 states, 6472 arcs; 7 307 053 date expressions

Parser for Valid and Invalid Dates
(slide courtesy of L. Karttunen)

[AllDates - ValidDates] @-> "[ID " ... "]" ,
ValidDates @-> "[VD " ... "]"
(2688 states, 20439 arcs)

The comma creates a single FST that does left-to-right longest match against either pattern.

Today is [VD Tuesday, July 25, 2000], not [ID Tuesday, July 26, 2000].
(VD = valid date; ID = invalid date)