Finite-State Transducers
Shallow Processing Techniques for NLP
Ling570
October 10, 2011
Announcements
- Wednesday online GP meeting scheduling
- Seminar on Friday: Luke Zettlemoyer (CSE), Automatic grammar induction
- Treehouse Friday: Classifiers – Memory Lane
Roadmap
- Motivation: FST applications
- FST perspectives
- FSTs and Regular Relations
- FST Operations
FSTs
- Finite automaton that maps between two strings
- Automaton with two labels per arc, written input:output
FST Applications
- Tokenization
- Segmentation
- Morphological analysis
- Transliteration
- Parsing
- Translation
- Speech recognition
- Spoken language understanding
- …
Approaches to FSTs
- FST as recognizer: takes a pair of input:output strings; accepts if the pair is in the language, otherwise rejects
- FST as generator: outputs pairs of strings in the language
- FST as translator: reads an input string and prints an output string
- FST as set relater: computes relations between sets
FSTs & Regular Relations
- FSAs: equivalent to regular languages
- FSTs: equivalent to regular relations
  - Sets of pairs of strings
- Regular relations:
  - For all (x, y) in Σ₁ × Σ₂, {(x, y)} is a regular relation
  - The empty set is a regular relation
  - If R₁, R₂ are regular relations, then R₁ R₂, R₁ ∪ R₂, and R₁* are regular relations
Regular Relation Closures
- By definition, regular relations are closed under (like regular languages):
  - Concatenation: R₁ R₂
  - Union: R₁ ∪ R₂
  - Kleene *: R₁*
- Unlike regular languages, they are NOT closed under:
  - Intersection: R₁ = {(aⁿb*, cⁿ)} and R₂ = {(a*bⁿ, cⁿ)}; the intersection is {(aⁿbⁿ, cⁿ)}, which is not regular
  - Difference
  - Complementation
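
The intersection counterexample can be checked on finite slices of the two relations. The sketch below is only an illustration: the bound `N` and the names `R1`, `R2`, `both`, and `expected` are assumptions introduced here, and a finite check does not replace the argument that aⁿbⁿ is not regular.

```python
# Finite slices of the two relations from the slide (the bound N is an assumption).
N = 5

# R1 = {(a^n b^*, c^n)}: the number of c's matches the number of a's.
R1 = {("a" * n + "b" * m, "c" * n) for n in range(N) for m in range(N)}

# R2 = {(a^* b^n, c^n)}: the number of c's matches the number of b's.
R2 = {("a" * k + "b" * n, "c" * n) for k in range(N) for n in range(N)}

# A pair in both relations must balance the a's, b's, and c's.
both = R1 & R2
expected = {("a" * n + "b" * n, "c" * n) for n in range(N)}
print(both == expected)
```

The input side of the intersection is {aⁿbⁿ}, which no finite automaton recognizes, so no FST can realize the intersection.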
Regular Relation Closures
- Regular relations are also closed under:
  - Composition
  - Inversion
- Operations:
  - Projection
  - Identity & cross-product of regular languages
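
For finite relations, these operations can be written directly on Python sets of string pairs. This is a minimal sketch under that assumption; the function names are illustrative and not taken from the slides or from any FST toolkit.

```python
def invert(rel):
    """Swap the two sides of every pair."""
    return {(y, x) for (x, y) in rel}


def compose(r1, r2):
    """(x, z) is in the result when some y links (x, y) in r1 to (y, z) in r2."""
    return {(x, z) for (x, y) in r1 for (y2, z) in r2 if y == y2}


def project_input(rel):
    """First projection: the language of input strings."""
    return {x for (x, _) in rel}


def project_output(rel):
    """Second projection: the language of output strings."""
    return {y for (_, y) in rel}


def identity(lang):
    """Turn a language into the relation that maps each string to itself."""
    return {(w, w) for w in lang}


def cross_product(lang1, lang2):
    """Relate every string of lang1 to every string of lang2."""
    return {(x, y) for x in lang1 for y in lang2}


upper = {("a", "b"), ("aa", "bb")}
lower = {("b", "c"), ("bb", "cc")}
print(compose(upper, lower))
```

Projection is what takes a relation back to a language, which is why an FST can always be projected to an FSA.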
FST Formal Definition
- A Finite-State Transducer is a 6-tuple:
  - A finite set of states: Q
  - A finite set of input symbols: Σ
  - A finite set of output symbols: Γ
  - A finite set of initial states: I ⊆ Q
  - A finite set of final states: F ⊆ Q
  - A set of transitions between states: δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
- FSAs are a special case of FSTs
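
One way to make the definition concrete is a small record type. The sketch below is an assumption-laden illustration: the class `FST`, its field names, the helper `fsa_as_fst`, and the use of the empty string for ε are all choices made here, not part of the slides.

```python
from dataclasses import dataclass

EPS = ""  # assumption: the empty string stands in for epsilon


@dataclass(frozen=True)
class FST:
    states: frozenset       # Q
    in_symbols: frozenset   # Sigma
    out_symbols: frozenset  # Gamma
    initial: frozenset      # I
    final: frozenset        # F
    delta: frozenset        # arcs (p, a, b, q)

    def is_well_formed(self):
        """Check I and F lie in Q and every arc lies in
        Q x (Sigma + eps) x (Gamma + eps) x Q."""
        ins = self.in_symbols | {EPS}
        outs = self.out_symbols | {EPS}
        return (
            self.initial <= self.states
            and self.final <= self.states
            and all(
                p in self.states and a in ins and b in outs and q in self.states
                for (p, a, b, q) in self.delta
            )
        )


def fsa_as_fst(states, symbols, initial, final, arcs):
    """View an FSA as an FST by relabeling each arc (p, a, q) as a:a."""
    delta = frozenset((p, a, a, q) for (p, a, q) in arcs)
    return FST(frozenset(states), frozenset(symbols), frozenset(symbols),
               frozenset(initial), frozenset(final), delta)


t = fsa_as_fst({0}, {"a"}, {0}, {0}, {(0, "a", 0)})
print(t.is_well_formed())
```

The helper shows the last bullet in code: an FSA is the FST whose arcs all carry identical input and output labels.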
FST Operations
- Union
- Concatenation
FST Operations
- Inversion: switching input and output labels
  - If T maps from I to O, T⁻¹ maps from O to I
- Composition:
  - If T₁ is a transducer from I₁ to O₁ and T₂ is a transducer from O₁ to O₂, then T₁ ∘ T₂ is a transducer from I₁ to O₂
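
Both operations can be sketched at the level of arcs. The code below assumes epsilon-free transducers stored as plain dicts with keys `"arcs"`, `"initial"`, and `"final"`, where each arc is a tuple `(source, input, output, target)`; this representation and the function names are assumptions made for illustration. Handling ε arcs in composition needs extra care and is left out.

```python
def invert(arcs):
    """Swap the input and output label on every arc (p, a, b, q)."""
    return {(p, b, a, q) for (p, a, b, q) in arcs}


def compose(t1, t2):
    """Product construction for epsilon-free transducers.

    An arc a:c exists between paired states when t1 has a:b and t2 has b:c.
    """
    arcs = {
        ((p1, p2), a, c, (q1, q2))
        for (p1, a, b, q1) in t1["arcs"]
        for (p2, b2, c, q2) in t2["arcs"]
        if b == b2
    }
    initial = {(i1, i2) for i1 in t1["initial"] for i2 in t2["initial"]}
    final = {(f1, f2) for f1 in t1["final"] for f2 in t2["final"]}
    return {"arcs": arcs, "initial": initial, "final": final}


# t1 rewrites a as b, t2 rewrites b as c, so the composition rewrites a as c.
t1 = {"arcs": {(0, "a", "b", 0)}, "initial": {0}, "final": {0}}
t2 = {"arcs": {(0, "b", "c", 0)}, "initial": {0}, "final": {0}}
t12 = compose(t1, t2)
print(t12["arcs"])
```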
FST Examples
- R(T) = {(ε, ε), (a, b), (aa, bb), (aaa, bbb), …}
- R(T) = {(a, x), (ab, xy), (abb, xyy), …}
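
The two relations can be enumerated from small transducers in the generator view. The slide diagrams are not preserved in this text, so the arc sets below are assumptions chosen to be consistent with the listed pairs; the function name `pairs_up_to` is also hypothetical.

```python
def pairs_up_to(arcs, initial, final, max_arcs):
    """Collect every (input, output) pair spelled by an accepting path
    that uses at most max_arcs arcs. Arcs are (source, in, out, target)."""
    found = set()
    frontier = {(q, "", "") for q in initial}
    for _ in range(max_arcs + 1):
        found |= {(x, y) for (q, x, y) in frontier if q in final}
        frontier = {
            (r, x + a, y + b)
            for (q, x, y) in frontier
            for (p, a, b, r) in arcs
            if p == q
        }
    return found


# Assumed machine for the first relation: one state with a loop a:b.
first = pairs_up_to({(0, "a", "b", 0)}, {0}, {0}, 3)

# Assumed machine for the second relation: a:x into state 1, then a loop b:y.
second = pairs_up_to({(0, "a", "x", 1), (1, "b", "y", 1)}, {0}, {1}, 3)

print(sorted(first))
print(sorted(second))
```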
FST Application Examples
- Case folding: He said → he said
- Tokenization: “He ran.” → “ He ran. ”
- POS tagging: They can fish → PRO VERB NOUN
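
Case folding is the simplest of these to write as a transducer: one state, with an arc X:x for every symbol. The following is a minimal sketch assuming a small hand-picked alphabet; the function names and the choice of symbols are illustrative only.

```python
import string


def case_folding_arcs():
    """One-state transducer: every arc maps a symbol to its lowercase form."""
    symbols = string.ascii_letters + " .'"
    return {(0, c, c.lower(), 0) for c in symbols}


def apply_deterministic(arcs, start, finals, text):
    """Run a transducer that has at most one arc per (state, input symbol)."""
    table = {(p, a): (b, q) for (p, a, b, q) in arcs}
    state, out = start, []
    for ch in text:
        b, state = table[(state, ch)]  # KeyError if the symbol is not covered
        out.append(b)
    if state not in finals:
        raise ValueError("input ended in a non-final state")
    return "".join(out)


print(apply_deterministic(case_folding_arcs(), 0, {0}, "He said"))
```

Tokenization and POS tagging need more states and, for tagging, a way to handle ambiguity, so they do not reduce to a single lookup table like this one.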
FST Application Examples
- Pronunciation: B AH T EH R → B AH DX EH R
- Morphological generation: Fox s → Foxes
- Morphological analysis: cats → cat s
FST Algorithms
- Recognition: is a given string pair (x, y) accepted by the FST?
  - (x, y) → yes/no
- Composition: given a pair of transducers T1 and T2, create a new transducer T1 ∘ T2
- Transduction: given an input string and an FST, compute the output string
  - x → y
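
Recognition and transduction can both be sketched as a search over configurations. The code below assumes arcs are tuples `(source, input, output, target)` with the empty string for ε, and that `transduce` is only run on machines with no cycle of ε-input arcs (otherwise it would not terminate). The function names and the example machine are assumptions for illustration.

```python
def accepts(arcs, initial, final, x, y):
    """Recognition: search configurations (state, position in x, position in y)."""
    stack = [(q, 0, 0) for q in initial]
    seen = set(stack)
    while stack:
        q, i, j = stack.pop()
        if q in final and i == len(x) and j == len(y):
            return True
        for (p, a, b, r) in arcs:
            if p != q or x[i:i + len(a)] != a or y[j:j + len(b)] != b:
                continue
            nxt = (r, i + len(a), j + len(b))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def transduce(arcs, initial, final, x):
    """Transduction: return the set of all outputs the machine pairs with x."""
    results = set()
    stack = [(q, 0, "") for q in initial]
    while stack:
        q, i, out = stack.pop()
        if q in final and i == len(x):
            results.add(out)
        for (p, a, b, r) in arcs:
            if p == q and x[i:i + len(a)] == a:
                stack.append((r, i + len(a), out + b))
    return results


# Assumed example machine for the relation {(a, x), (ab, xy), (abb, xyy), ...}.
ARCS = {(0, "a", "x", 1), (1, "b", "y", 1)}
print(accepts(ARCS, {0}, {1}, "ab", "xy"))
print(transduce(ARCS, {0}, {1}, "abb"))
```

Transduction returns a set because an FST defines a relation, so one input may have zero, one, or several outputs.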
WFST Definition
- A Probabilistic Finite-State Transducer is a 9-tuple:
  - A finite set of states: Q
  - A finite set of input symbols: Σ
  - A finite set of output symbols: Γ
  - A finite set of initial states: I ⊆ Q
  - A finite set of final states: F ⊆ Q
  - A set of transitions: δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
  - Initial state probabilities: Q → ℝ⁺
  - Transition probabilities: δ → ℝ⁺
  - Final state probabilities: Q → ℝ⁺
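
Under this definition, the weight of a string pair is the sum, over accepting paths, of the initial weight times the transition weights times the final weight. The sketch below assumes every arc consumes a non-empty input symbol, which guarantees termination; the function name, the dict-based representation, and the toy weights are all assumptions made for illustration.

```python
def pair_weight(arcs, init_w, final_w, x, y):
    """Sum over accepting paths of initial * transitions * final weight.

    arcs maps (source, input, output, target) to a weight; init_w and
    final_w map states to weights. Arcs with empty input are skipped.
    """
    def walk(q, i, j):
        done = i == len(x) and j == len(y)
        total = final_w.get(q, 0.0) if done else 0.0
        for (p, a, b, r), w in arcs.items():
            if p == q and a and x[i:i + len(a)] == a and y[j:j + len(b)] == b:
                total += w * walk(r, i + len(a), j + len(b))
        return total

    return sum(w0 * walk(q, 0, 0) for q, w0 in init_w.items())


# Toy weights (assumed): input t surfaces as t with 0.25 and as d with 0.75.
ARCS = {(0, "t", "t", 1): 0.25, (0, "t", "d", 1): 0.75}
INIT = {0: 1.0}
FINAL = {1: 1.0}
print(pair_weight(ARCS, INIT, FINAL, "t", "d"))
```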
Summary
- FSTs
  - Equivalent to regular relations
  - Transduce strings to strings
  - Useful for a range of applications
- Closed under union, concatenation, Kleene *, inversion, composition
- Project to FSAs
- Not closed under intersection, complementation, difference
- Algorithms: recognition, composition, transduction
Morphology and FSTs
Roadmap
- Motivation: Representing words
- A little (mostly English) Morphology
- Stemming
- FSTs & Morphology
- FSTs & Phonology
Surface Variation & Morphology
- Searching (a la Google) for documents about: televised sports
- Many possible surface forms:
  - Televised, television, televise, …
  - Sports, sport, sporting, …
- How can we match?
  - Convert surface forms to a common base form
  - Stemming or morphological analysis
The Lexicon
- Goal: represent all the words in a language
- Approach?
  - Enumerate all words?
    - Doable for English
    - Typical for ASR (Automatic Speech Recognition)
    - English is morphologically relatively impoverished
  - Other languages?
    - Wildly impractical
    - Turkish: 40,000 forms/verb
    - uygarlaştıramadıklarımızdanmışsınızcasına: “(behaving) as if you are among those whom we could not civilize”
Morphological Parsing
- Goal: take a surface word form and generate a linguistic structure of component morphemes
- A morpheme is the minimal meaning-bearing unit in a language
  - Stem: the morpheme that forms the central meaning unit in a word
  - Affix: prefix, suffix, infix, circumfix
    - Prefix: e.g., possible → impossible
    - Suffix: e.g., walk → walking
    - Infix: e.g., hingi → humingi (Tagalog)
    - Circumfix: e.g., sagen → gesagt (German)
Two Perspectives
- Stemming:
  - writing → write (or writ)
  - Beijing → Beije
- Morphological Analysis:
  - writing → write+V+prog
  - cats → cat+N+pl
  - writes → write+V+3rdpers+Sg
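
The stemming perspective can be illustrated with a deliberately crude suffix stripper. This is a toy sketch, not the Porter stemmer or any published algorithm: the function name, the suffix list, and the minimum stem length are assumptions chosen only to show the behavior.

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters.

    No lexicon is consulted, so proper nouns are stripped like any other word.
    """
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


for w in ("writing", "writes", "cats", "Beijing"):
    print(w, toy_stem(w))
```

Because the stemmer only looks at the string, it conflates `writing` and `writes` as intended but also truncates `Beijing`. Morphological analysis avoids this by consulting a lexicon and returning a stem plus features rather than a bare substring.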
Ambiguity in Morphology
- Alternative analyses:
  - Flies → fly+N+Pl
  - Flies → fly+V+3rdpers+Sg
  - Saw → see+V+past
  - Saw → saw+N
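
An analyzer therefore has to return a list of analyses rather than a single one. The sketch below uses a tiny hand-built lexicon and rule table; `LEXICON`, `IRREGULAR`, `RULES`, and `analyze` are hypothetical names, and the entries cover only the examples on this slide.

```python
# Stems with their parts of speech (toy data, assumed for illustration).
LEXICON = {"fly": ("N", "V"), "saw": ("N",), "see": ("V",), "cat": ("N",)}

# Irregular surface forms listed whole.
IRREGULAR = {"saw": ["see+V+past"]}

# (surface suffix, stem ending it replaces, feature string per category)
RULES = [
    ("ies", "y", {"N": "+N+Pl", "V": "+V+3rdpers+Sg"}),
    ("s", "", {"N": "+N+Pl", "V": "+V+3rdpers+Sg"}),
    ("", "", {"N": "+N"}),
]


def analyze(word):
    """Return every analysis licensed by the lexicon and the rules."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))
    for suffix, restore, tags in RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)] + restore
        for cat in LEXICON.get(stem, ()):
            if cat in tags:
                analyses.append(stem + tags[cat])
    return analyses


print(analyze("Flies"))
print(analyze("Saw"))
```

Choosing among the returned analyses is a separate disambiguation step that needs context, for example a POS tagger.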
Multi-linguality in Morphology
- Morphologically impoverished languages: e.g., English
- Isolating languages: e.g., Chinese
- Morphologically rich languages: e.g., Turkish
Combining Morphemes
- Inflection: stem + grammatical morpheme → same class
  - E.g., help + ed → helped
- Derivation: stem + grammatical morpheme → new class
  - E.g., walk + er → walker (N)
- Compounding: multiple stems → new word
  - E.g., doghouse, catwalk, …
- Clitics: stem + clitic
  - I + ll → I’ll; he + is → he’s