1
Finite-State Methods in Natural Language Processing
Lauri Karttunen
LSA 2005 Summer Institute
July 20, 2005
2
Course Outline
July 18: Intro to computational morphology; XFST
  Readings:
    Lauri Karttunen, “Finite-State Constraints”, The Last Phonological Rule, J. Goldsmith (ed.), pages 173-194, University of Chicago Press, 1993.
    Karttunen and Beesley, “25 Years of Finite-State Morphology”
    Chapter 1: “Gentle Introduction” (B&K)
July 20: Regular expressions; more on XFST
  Readings:
    Chapter 2: “Systematic Introduction”
    Chapter 3: “The XFST interface”
3
July 25: Concatenative morphotactics; constraining non-local dependencies
  Readings:
    Chapter 4. “The LEXC Language”
    Chapter 5. “Flag Diacritics”
July 27: Non-concatenative morphotactics; reduplication, interdigitation
  Readings:
    Chapter 8. “Non-Concatenative Morphotactics”
4
August 1: Realizational morphology
  Readings:
    Gregory T. Stump. Inflectional Morphology: A Theory of Paradigm Structure. Cambridge U. Press, 2001. (An excerpt)
    Lauri Karttunen, “Computing with Realizational Morphology”, Lecture Notes in Computer Science, Volume 2588, Alexander Gelbukh (ed.), 205-216, Springer Verlag, 2003.
August 3: Optimality theory
  Readings:
    Paul Kiparsky, “Finnish Noun Inflection”, Generative Approaches to Finnic and Saami Linguistics, Diane Nelson and Satu Manninen (eds.), pp. 109-161, CSLI Publications, 2003.
    Nine Elenbaas and René Kager, “Ternary rhythm and the lapse constraint”, Phonology 16, 273-329.
5
Scripting xfst
xfst -l myscript
    Start xfst, execute myscript, and wait for more commands from the command line.
xfst -f myscript
    Execute myscript and exit.
xfst -e "echo Welcome" \
     -e "regex a b c;" \
     -e "save foo" \
     -stop
    Execute the commands in the given order. The commands must be on the same line. The -stop at the end is required to make xfst quit.
6
Numeral Script
# This script constructs the language of English
# numerals from "one" to "ninety-nine".
# This is a comment.
# From "one" through "nine":
define OneToNine [{one} | {two} | {three} | {four} | {five} |
                  {six} | {seven} | {eight} | {nine}];
# It is convenient to define a set of prefixes that
# can be followed either by "teen" or by "ty".
define TeenTyStem [{thir} | {fif} | {six} | {seven} | {eigh} | {nine}];
7
Numeral Script (Continued)
# From "ten" to "nineteen"
define Teens [{ten} | {eleven} | {twelve} |
              [TeenTyStem | {four}] {teen}];
# Let's define stems that can be followed by "ty".
define TyStem [TeenTyStem | {twen} | {for}];
# TyStem is followed either by "ty" or by "ty-"
# and a number from OneToNine.
define Tens [TyStem [{ty} | {ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
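As a rough cross-check of the script above, the same language can be enumerated in Python. This is only a sketch that mirrors the definitions with plain string lists; the lowercase variable names are illustrative and not part of xfst.

```python
# Each list mirrors one "define" in the xfst script.
one_to_nine = ["one", "two", "three", "four", "five",
               "six", "seven", "eight", "nine"]
teen_ty_stem = ["thir", "fif", "six", "seven", "eigh", "nine"]

# Teens: ten | eleven | twelve | [TeenTyStem | four] teen
teens = ["ten", "eleven", "twelve"] + [
    stem + "teen" for stem in teen_ty_stem + ["four"]
]

# TyStem: TeenTyStem | twen | for
ty_stem = teen_ty_stem + ["twen", "for"]

# Tens: TyStem followed by "ty", or by "ty-" and a numeral from one_to_nine
tens = [stem + "ty" for stem in ty_stem] + [
    stem + "ty-" + unit for stem in ty_stem for unit in one_to_nine
]

# OneToNinetyNine is the union of the three sets.
one_to_ninety_nine = set(one_to_nine) | set(teens) | set(tens)
```

The union should contain exactly 99 distinct strings, one per numeral from "one" to "ninety-nine", which is a quick way to confirm that the stem sets neither overgenerate nor leave gaps.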
8
Number to Numeral
Generation: 105 --> "hundred five", "hundred and five", "one hundred and five"
Analysis: "hundred five" --> 105
9
NumberToNumeral script
# This script constructs a transducer that relates the
# English numerals "one", "two", ..., "ninety-nine"
# to the corresponding numbers "1", "2", ..., "99".
define OneToNine [1:{one} | 2:{two} | 3:{three} | 4:{four} | 5:{five} |
                  6:{six} | 7:{seven} | 8:{eight} | 9:{nine}];
define TeenTyStem [3:{thir} | 5:{fif} | 6:{six} | 7:{seven} |
                   8:{eigh} | 9:{nine}];
define Teens [1:0 [{0}:{ten} | 1:{eleven} | 2:{twelve} |
                   [TeenTyStem | 4:{four}] 0:{teen}]];
10
NumberToNumeral (Continued)
define TyStem [2:{twen} | TeenTyStem | 4:{for}];
# TyStem is followed either by "ty" paired with a zero
# or by "ty-" mapped to an epsilon and followed by a
# number. Note that {0} means zero and not epsilon.
define Tens [TyStem [{0}:{ty} | 0:{ty-} OneToNine]];
define OneToNinetyNine [ OneToNine | Teens | Tens ];
push OneToNinetyNine
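The relation built by this script can be imitated in Python with dictionaries, which makes the digit-to-stem pairings easy to inspect. This is a sketch under the assumption that each symbol pair `x:y` contributes `x` to the number side and `y` to the numeral side, with epsilon contributing nothing; the dictionary names are illustrative.

```python
# Number side (keys) and numeral side (values) of each definition.
one_to_nine = {"1": "one", "2": "two", "3": "three", "4": "four", "5": "five",
               "6": "six", "7": "seven", "8": "eight", "9": "nine"}
teen_ty_stem = {"3": "thir", "5": "fif", "6": "six",
                "7": "seven", "8": "eigh", "9": "nine"}

# Teens: the leading 1 maps to epsilon, so it appears only on the number side.
teens = {"10": "ten", "11": "eleven", "12": "twelve"}
for digit, stem in {**teen_ty_stem, "4": "four"}.items():
    teens["1" + digit] = stem + "teen"

# TyStem: 2:twen | TeenTyStem | 4:for
ty_stem = {"2": "twen", **teen_ty_stem, "4": "for"}

# Tens: "ty" is paired with the digit zero; "ty-" is paired with epsilon
# and is followed by a OneToNine pair.
tens = {}
for digit, stem in ty_stem.items():
    tens[digit + "0"] = stem + "ty"
    for unit_digit, unit_word in one_to_nine.items():
        tens[digit + unit_digit] = stem + "ty-" + unit_word

number_to_numeral = {**one_to_nine, **teens, **tens}          # generation
numeral_to_number = {w: n for n, w in number_to_numeral.items()}  # analysis
```

Looking up `number_to_numeral["42"]` corresponds to applying the transducer downward, and `numeral_to_number["forty-two"]` to applying it upward.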
11
Xerox RE Operators
$         containment
=>        restriction
-> @->    replacement
These operators make it easier to describe complex languages and relations without extending the formal power of finite-state systems.
12
Containment
$a
Equivalent expression: [?* a ?*]
[Figure: the corresponding network, with arcs labeled a and ?]
13
Restriction
a => b _ c
“Any a must be preceded by b and followed by c.”
Equivalent expression: ~[~[?* b] a ?*] & ~[?* a ~[c ?*]]
[Figure: the corresponding network, with arcs labeled a, b, c, and ?]
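The restriction and its equivalent expression can be compared in Python. The sketch below checks the constraint directly and then restates the two negated conjuncts as a regular expression that searches for a violation; both function names are illustrative, and Python's `re` syntax only approximates the xfst notation.

```python
import re
from itertools import product

def obeys_restriction(s):
    """True if every 'a' is immediately preceded by 'b' and followed by 'c'."""
    return all(
        s[i - 1:i] == "b" and s[i + 1:i + 2] == "c"
        for i, ch in enumerate(s)
        if ch == "a"
    )

# A violation is an 'a' not preceded by 'b' (first conjunct)
# or an 'a' not followed by 'c' (second conjunct).
VIOLATION = re.compile(r"(?<!b)a|a(?!c)")

def obeys_restriction_regex(s):
    """Same language, phrased as 'contains no violating a'."""
    return VIOLATION.search(s) is None

# Brute-force comparison on all strings up to length 5 over a small alphabet.
all_agree = all(
    obeys_restriction("".join(t)) == obeys_restriction_regex("".join(t))
    for n in range(6)
    for t in product("abcx", repeat=n)
)
```

Strings with no `a` at all, including the empty string, satisfy the restriction vacuously, just as they belong to the language of `a => b _ c`.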
14
Replacement
a b -> b a
“Replace ‘ab’ by ‘ba’.”
Equivalent expression: [[~$[a b] [[a b] .x. [b a]]]* ~$[a b]]
[Figure: the corresponding transducer, with arcs labeled a, b, ?, a:b, and b:a]
15
Marking
a|e|i|o|u -> %[ ... %]
p o t a t o --> p[o]t[a]t[o]
[Figure: the corresponding transducer, with arcs labeled 0:[, 0:], ?, a, e, i, o, u]
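The effect of the marking rule on "potato" can be reproduced with a regular-expression substitution in Python. This is only an analogy: `re.sub` stands in for the `...` marking operator, and the function name is illustrative.

```python
import re

def mark_vowels(word):
    """Wrap every vowel in square brackets, leaving other symbols unchanged."""
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", word)

marked = mark_vowels("potato")  # "p[o]t[a]t[o]"
```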
16
Multiple Results
a b | b | b a | a b a -> x
(a) b (a) -> x
Applied to “aba”, there are four factorizations of the input string, and so four results:
  a b a --> a x a
  a b a --> a x
  a b a --> x a
  a b a --> x
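The four results can be enumerated with a small Python sketch. It assumes the factorization view of unconditional replacement: the input is split into replaced chunks that match the pattern and untouched gaps that contain no instance of the pattern. The function name and the representation of the pattern as a list of non-empty strings are illustrative choices.

```python
def all_replacements(s, patterns, repl):
    """Return every output of an unconditional 'patterns -> repl' replacement.

    Assumes every pattern is a non-empty string.
    """
    results = set()

    def contains_match(chunk):
        return any(p in chunk for p in patterns)

    def walk(pos, out):
        # Choose where the current unreplaced gap ends.
        for end in range(pos, len(s) + 1):
            gap = s[pos:end]
            if contains_match(gap):
                break  # any longer gap would contain the same match
            if end == len(s):
                results.add(out + gap)
            # Start a replaced chunk right after the gap.
            for p in patterns:
                if s.startswith(p, end):
                    walk(end + len(p), out + gap + repl)

    walk(0, "")
    return results

outputs = all_replacements("aba", ["ab", "b", "ba", "aba"], "x")
```

For "aba" the set contains the four strings shown on the slide; with a single pattern such as `["ab"]` the result is unique because only one factorization is possible.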
17
Directed Replace Operators
Guarantee a unique result by constraining the factorization of the input string by
  Direction of the match (rightward or leftward)
  Length (longest or shortest)
18
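@-> Left-to-right, Longest-match Replacement
(a) b (a) @-> x
Applied to “aba”, only the factorization with the leftmost, longest match survives:
  a b a --> x
The other candidates (a x a, a x, x a) are excluded.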
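A left-to-right, longest-match replacement can be sketched in Python as a single scan over the input. The function below is an illustration of the matching strategy, not of how xfst compiles `@->`; the pattern is given as a list of strings standing for the language of `(a) b (a)`.

```python
def replace_longest_leftmost(s, patterns, repl):
    """Scan left to right; at each position replace the longest match, if any."""
    out = []
    pos = 0
    while pos < len(s):
        matches = [p for p in patterns if p and s.startswith(p, pos)]
        if matches:
            out.append(repl)
            pos += len(max(matches, key=len))  # consume the longest match
        else:
            out.append(s[pos])  # no match starts here: copy the symbol
            pos += 1
    return "".join(out)

result = replace_longest_leftmost("aba", ["ab", "b", "ba", "aba"], "x")
```

Because each position commits to one choice, the result is always a single string, in contrast to the several outputs that plain `->` can produce for the same pattern.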
19
Conditional Replacement
A -> B    Replacement
L _ R     Context
The relation that replaces A by B between L and R, leaving everything else unchanged.
Sources of complexity:
  Replacements and contexts may overlap.
  Alternative ways of interpreting “between left and right.”
A -> B || L _ R    both contexts on the input
A -> B // L _ R    left context on the output
A -> B \\ L _ R    right context on the output
20
Vowel shortening after a long vowel
V %: -> V || V %: C* _    Left context on the input side
  Slovak: vol+a:v+a: me: --> vol+a:v+a me ‘we call often’
V %: -> V // V %: C* _    Left context on the output side
  Gidabal: gunu:m+ ba:+da:ng+be: --> gunu:m+ ba +da:ng+be ‘is certainly right on the stump’
21
Shortening script
define V [ a | e | i | o | u | a ];
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];
define SlovakShortening %: -> 0 || V %: C* V _ ;
define GidabalShortening %: -> 0 // V %: C* V _ ;
push SlovakShortening
down vola:va:me:
vola:vame
push GidabalShortening
down gunu:mba:da:ngbe:
gunu:mbada:ngbe
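The difference between `||` and `//` can be imitated in Python by choosing which string the left context is checked against. This sketch assumes a simple left-to-right pass, where the `//` variant looks at the output built so far and the `||` variant looks at the untouched input; it is meant to reproduce the two examples, not to define the operators in general. The names `LEFT_CONTEXT` and `shorten` are illustrative.

```python
import re

# Left context "V %: C* V" anchored at the end of the material to the left.
LEFT_CONTEXT = re.compile(r"[aeiou]:[bcdfghjklmnpqrstvxyz]*[aeiou]$")

def shorten(word, left_context_on_output):
    """Delete the length mark ':' after 'V : C* V'.

    left_context_on_output=False imitates '||' (context read from the input);
    left_context_on_output=True imitates '//' (context read from the output).
    """
    out = []
    for i, ch in enumerate(word):
        left = "".join(out) if left_context_on_output else word[:i]
        if ch == ":" and LEFT_CONTEXT.search(left):
            continue  # drop the length mark: the vowel is shortened
        out.append(ch)
    return "".join(out)

slovak = shorten("vola:va:me:", left_context_on_output=False)
gidabal = shorten("gunu:mba:da:ngbe:", left_context_on_output=True)
```

With the context on the input side, every long vowel after a long vowel is shortened; with the context on the output side, a shortened vowel no longer counts as long, so shortening alternates.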
22
Palatalization and Vowel Raising
Palatalization: tim --> cim
Vowel Raising: memi --> mimi
Interaction: temi --> cimi
             tememi --> cimimi
23
Vowel Raising & Palatalization
define C [ b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | x | y | z ];
define Raising e -> i \\ _ C* i ;
define Palatalization t -> c || _ i ;
regex Raising .o. Palatalization ;
down memi
mimi
down tim
cim
down temi
cimi
down tememi
cimimi
Derivation: t e m e m i --> t i m i m i --> c i m i m i
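The interaction of the two rules can be traced in Python. The sketch assumes that a right context on the output side (`\\`) can be imitated by processing the word from right to left, so that an already raised vowel can trigger raising further to the left; composition is imitated by applying raising first and palatalization second. All names are illustrative.

```python
import re

CONSONANTS = "bcdfghjklmnpqrstvxyz"
# Right context "C* i", checked at the start of the material to the right.
RIGHT_CONTEXT = re.compile("[" + CONSONANTS + "]*i")

def raise_vowels(word):
    """e -> i before C* i, with the right context read from the output."""
    out_reversed = []
    for ch in reversed(word):
        right_output = "".join(reversed(out_reversed))
        if ch == "e" and RIGHT_CONTEXT.match(right_output):
            out_reversed.append("i")
        else:
            out_reversed.append(ch)
    return "".join(reversed(out_reversed))

def palatalize(word):
    """t -> c immediately before i."""
    return re.sub(r"t(?=i)", "c", word)

def derive(word):
    """Imitates 'Raising .o. Palatalization': raising feeds palatalization."""
    return palatalize(raise_vowels(word))

intermediate = raise_vowels("tememi")  # "timimi"
surface = derive("tememi")             # "cimimi"
```

The intermediate form shows why the order matters: the `t` is palatalized only because raising has already turned the following `e` into `i`.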
24
Making a lexical transducer
Morphotactics: Lexicon Regular Expression --> Compiler --> Lexicon FST
Alternations: Rules Regular Expressions --> Compiler --> Rule FSTs
Lexicon FST and Rule FSTs --> composition --> Lexical Transducer (a single FST)
25
Finnish Gradation Script
define Stems [ {tukka} | {kakku} | {pappi} | {tippa} | {katto} | {juttu} |
               {tikka} | {huppu} | {rotta} | {nahka} | {lika} | {maku} |
               {rako} | {tuke} | {halko} | {jalka} | {virka} | {lanka} |
               {linko} | {puku} | {suku} | {tiuku} | {raaka} | {ripa} |
               {sopu} | {tapa} | {kampa} | {rumpu} | {sampe} | {sota} |
               {pata} | {kita} | {rinta} | {kanto} | {ranta} | {ilta} |
               {kulta} | {parta} | {kerta} ];
define Case [ "+Part":a | "+Gen":n ];
define Finnish [Stems Case];
26
Auxiliary definitions
define V [a | e | i | o | u | y | ä | ö];
define C [b | c | d | f | g | h | j | k | l | m | n | p | q | r | s | t | v | w | x | z];
define Coda [ C [C | .#.] ];
define ClosedSyll [V Coda];
27
Weak form of k
define WeakK k -> ' || V a _ a Coda, V u _ u Coda
        .o. k -> j || r _ e Coda
        .o. k -> v || u _ u Coda
        .o. k -> g || n _ V Coda
        .o. k -> 0 || \[s|h] _ V Coda ;   # kiskon 'rail', nahkan 'skin'
28
Weak form of p
define WeakP p -> m || m _ V Coda
        .o. p -> v || \[s|p] _ V Coda   # piispan 'bishop'
        .o. p -> 0 || p _ V Coda ;
29
Weak form of t
define WeakT t -> n || n _ V Coda
        .o. t -> l || l _ V Coda
        .o. t -> r || r _ V Coda
        .o. t -> d || \[s|t] _ V Coda   # koston 'revenge'
        .o. t -> 0 || t _ V Coda ;
30
Putting it all together
define Gradation WeakK .o. WeakP .o. WeakT ;
regex Finnish .o. Gradation ;
print lower-words
echo *** Size of Finnish .o. Gradation
print size
echo *** Size of Finnish
push Finnish
print size
echo *** Size of Gradation
push Gradation
print size
31
Syllabification
define C [ b | c | d | f ...
define V [ a | e | i | o | u ];
[C* V+ C*] @-> ... "-" || _ [C V]
“Insert a hyphen after the longest instance of the C* V+ C* pattern in front of a C V pattern.”
s t r u k t u r a l i s m i --> s t r u k - t u - r a - l i s - m i
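The syllabification rule can be approximated with a Python regular expression. The sketch relies on two assumptions: the consonant set, which the slide elides, is taken to be every lowercase ASCII letter other than a, e, i, o, u; and Python's leftmost, greedy matching is used as a stand-in for the left-to-right, longest-match behavior of `@->`, which holds for this particular pattern.

```python
import re

VOWELS = "aeiou"
# Assumption: the slide's definition of C is elided, so this set is a guess.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"

# C* V+ C* followed by (but not including) C V.
SYLLABLE = re.compile(
    f"([{CONSONANTS}]*[{VOWELS}]+[{CONSONANTS}]*)(?=[{CONSONANTS}][{VOWELS}])"
)

def syllabify(word):
    """Insert a hyphen after each longest C* V+ C* stretch that precedes C V."""
    return SYLLABLE.sub(r"\1-", word)

syllabified = syllabify("strukturalismi")  # "struk-tu-ra-lis-mi"
```

The lookahead plays the role of the right context `_ [C V]`: it must be present for a match but is left in place to start the next syllable.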