Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.


Announcements
Wednesday: online GP meeting scheduling. Seminar on Friday: Luke Zettlemoyer (CSE), automatic grammar induction. Treehouse Friday: Classifiers – Memory Lane.

Roadmap
Motivation: FST applications; FST perspectives; FSTs and regular relations; FST operations.

FSTs
A finite automaton that maps between two strings: an automaton with two labels per arc, input:output.

FST Applications
Tokenization, segmentation, morphological analysis, transliteration, parsing, translation, speech recognition, spoken language understanding, …

Approaches to FSTs
FST as recognizer: takes a pair of input:output strings; accepts if the pair is in the language, otherwise rejects.
FST as generator: outputs pairs of strings in the language.
FST as translator: reads an input string and prints an output string.
FST as set relator: computes relations between sets.
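The three string-level modes can be sketched for the toy relation {(aⁿ, bⁿ)}. This is illustrative code with made-up names, not part of the course materials:

```python
# A toy FST for the relation {(a^n, b^n)}: one state, arc a:b, state 0 final.
# Transitions: (state, input symbol) -> list of (output symbol, next state).
TRANS = {(0, "a"): [("b", 0)]}
START, FINALS = 0, {0}

def translate(x):
    """Translator mode: map input string x to the set of output strings."""
    results, stack = set(), [(START, 0, "")]
    while stack:
        state, i, out = stack.pop()
        if i == len(x):
            if state in FINALS:
                results.add(out)
            continue
        for out_sym, nxt in TRANS.get((state, x[i]), []):
            stack.append((nxt, i + 1, out + out_sym))
    return results

def recognize(x, y):
    """Recognizer mode: is the pair (x, y) in the relation?"""
    return y in translate(x)

def generate(max_len):
    """Generator mode: enumerate string pairs in the relation, by length."""
    pairs, frontier = [], [(START, "", "")]
    for _ in range(max_len + 1):
        new = []
        for state, x, y in frontier:
            if state in FINALS:
                pairs.append((x, y))
            for (s, a), arcs in TRANS.items():
                if s == state:
                    for b, nxt in arcs:
                        new.append((nxt, x + a, y + b))
        frontier = new
    return pairs
```

The same machine serves all three modes; only the driver changes.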

FSTs & Regular Relations
FSAs are equivalent to regular languages; FSTs are equivalent to regular relations: sets of pairs of strings.
Regular relations are defined inductively:
- For all (x, y) in Σ1 × Σ2, {(x, y)} is a regular relation.
- The empty set is a regular relation.
- If R1 and R2 are regular relations, then R1 · R2, R1 ∪ R2, and R1* are regular relations.

Regular Relation Closures
By definition, regular relations are closed under concatenation (R1 · R2), union (R1 ∪ R2), and Kleene star (R1*), like regular languages.
Unlike regular languages, they are NOT closed under intersection, difference, or complementation.
Intersection counterexample: R1 = {(a^n b*, c^n)} and R2 = {(a* b^m, c^m)}; their intersection is {(a^n b^n, c^n)}, and a^n b^n is not a regular language.

Regular Relation Closures
Regular relations are also closed under composition and inversion.
Related operations: projection (taking just the input side or just the output side, each of which is a regular language), and the identity and cross-product of regular languages (each of which yields a regular relation).

FST Formal Definition
A finite-state transducer is a tuple (Q, Σ, Γ, I, F, δ):
- Q: a finite set of states
- Σ: a finite set of input symbols
- Γ: a finite set of output symbols
- I: a set of initial states
- F: a set of final states
- δ: a set of transitions, δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
FSAs are a special case of FSTs (take Γ = Σ and let every arc output its input symbol).
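A direct rendering of this definition, including the FSA special case, might look as follows (field names are my own; a sketch, not a library API):

```python
from dataclasses import dataclass

@dataclass
class FST:
    Q: set       # finite set of states
    Sigma: set   # input alphabet
    Gamma: set   # output alphabet
    I: set       # initial states
    F: set       # final states
    delta: set   # transitions: subset of Q x (Sigma u {eps}) x (Gamma u {eps}) x Q

def fsa_as_fst(states, alphabet, initials, finals, arcs):
    """An FSA is the special case where every arc outputs its input symbol."""
    return FST(Q=set(states), Sigma=set(alphabet), Gamma=set(alphabet),
               I=set(initials), F=set(finals),
               delta={(q, a, a, r) for (q, a, r) in arcs})
```

Here epsilon would be represented by the empty string, so it never collides with a real symbol.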

FST Operations
Union and concatenation, as for FSAs.
Inversion: switch the input and output labels; if T maps from I to O, T^-1 maps from O to I.
Composition: if T1 is a transducer from I1 to O2 and T2 is a transducer from O2 to O3, then T1 ∘ T2 is a transducer from I1 to O3.
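On the arc representation, inversion and (epsilon-free) composition are both one-liners; initial and final states pair up in the same product fashion. A minimal sketch under my own naming:

```python
def invert(delta):
    """Inversion: swap the input and output labels on every arc."""
    return {(q, b, a, r) for (q, a, b, r) in delta}

def compose(d1, d2):
    """Composition of epsilon-free transducers: T2 reads T1's output.
    States of T1 o T2 are pairs (q1, q2); an arc exists when T1's output
    symbol matches T2's input symbol."""
    return {((q1, q2), a, c, (r1, r2))
            for (q1, a, b, r1) in d1
            for (q2, b2, c, r2) in d2
            if b == b2}
```

Handling epsilon arcs correctly requires an extra filter construction, which this sketch omits.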

FST Examples
R(T) = {(ε,ε), (a,b), (aa,bb), (aaa,bbb), …} (each input a maps to an output b)
R(T) = {(a,x), (ab,xy), (abb,xyy), …} (an initial a:x arc followed by any number of b:y arcs)

FST Application Examples
Case folding: He said → he said
Tokenization: “He ran.” → “ He ran . ”
POS tagging: They can fish → PRO VERB NOUN
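The first two of these are character-level transductions simple enough to sketch directly, without building an explicit machine (toy code, assuming only ASCII punctuation):

```python
def case_fold(text):
    """Case folding as a character-level transduction: each uppercase
    letter maps to its lowercase form; every other symbol maps to itself."""
    return "".join(c.lower() for c in text)

def tokenize(text):
    """A crude tokenizer in the same spirit: insert spaces around
    punctuation marks, then split on whitespace."""
    for p in '.,!?"':
        text = text.replace(p, " " + p + " ")
    return text.split()
```

A real tokenizer FST would also handle abbreviations, numbers, and clitics; this only illustrates the string-to-string view.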

FST Application Examples
Pronunciation: B AH T EH R → B AH DX EH R (t-flapping, as in “butter”)
Morphological generation: fox + s → foxes
Morphological analysis: cats → cat + s
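The generation and analysis directions can be sketched for the English e-insertion rule (a toy approximation of the rule an FST would encode; the function names are my own):

```python
SIBILANTS = ("s", "z", "x", "ch", "sh")

def generate(stem, suffix):
    """Morphological generation with e-insertion: insert 'e' before the
    plural suffix -s after stems ending in s, z, x, ch, or sh."""
    if suffix == "s" and stem.endswith(SIBILANTS):
        return stem + "es"
    return stem + suffix

def analyze(word):
    """Crude plural-noun analysis: strip -es / -s back to stem + suffix."""
    if word.endswith("es") and word[:-2].endswith(SIBILANTS):
        return word[:-2] + " s"
    if word.endswith("s"):
        return word[:-1] + " s"
    return word
```

Note the two functions are (approximate) inverses, mirroring FST inversion.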

FST Algorithms
Recognition: is a given string pair (x, y) accepted by the FST? (x, y) → yes/no
Composition: given a pair of transducers T1 and T2, create a new transducer T1 ∘ T2.
Transduction: given an input string and an FST, compute the output string: x → y
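For an epsilon-free FST, pair recognition is just a lockstep walk over x and y, tracking the set of reachable states. A sketch using the a:b loop from the examples above (illustrative names, not course code):

```python
# Arcs of an epsilon-free FST: (state, in_sym, out_sym) -> set of next states.
ARCS = {(0, "a", "b"): {0}}
START, FINALS = 0, {0}

def recognize(x, y):
    """Advance through x and y in lockstep; accept if some reachable
    state at the end is final."""
    if len(x) != len(y):          # epsilon-free: every arc consumes one of each
        return False
    states = {START}
    for a, b in zip(x, y):
        states = {r for q in states for r in ARCS.get((q, a, b), set())}
    return bool(states & FINALS)
```

With epsilon arcs the same idea works but requires an epsilon-closure step between symbols.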

WFST Definition
A weighted finite-state transducer (WFST) extends the FST definition with weights:
- Q: a finite set of states
- Σ: a finite set of input symbols
- Γ: a finite set of output symbols
- I: a set of initial states
- F: a set of final states
- δ: a set of transitions, δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q
- initial state probabilities: Q → R+
- transition probabilities: δ → R+
- final state probabilities: Q → R+
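With probabilities attached, the natural question becomes: what total probability does the machine assign to a pair (x, y)? A minimal epsilon-free sketch (the machine below is invented for illustration):

```python
# A toy weighted FST: (state, in_sym, out_sym) -> list of (next state, prob).
W_ARCS = {
    (0, "a", "x"): [(1, 0.6)],
    (0, "a", "y"): [(1, 0.4)],
    (1, "b", "z"): [(1, 1.0)],
}
INIT = {0: 1.0}    # initial state probabilities
FINAL = {1: 1.0}   # final state probabilities

def pair_prob(x, y):
    """Sum, over all accepting paths for (x, y), of the product of the
    initial, transition, and final probabilities along the path."""
    probs = dict(INIT)
    for a, b in zip(x, y):
        nxt = {}
        for q, p in probs.items():
            for r, w in W_ARCS.get((q, a, b), []):
                nxt[r] = nxt.get(r, 0.0) + p * w
        probs = nxt
    return sum(p * FINAL.get(q, 0.0) for q, p in probs.items())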

Summary
FSTs are equivalent to regular relations: they transduce strings to strings and are useful for a range of applications.
Closed under union, concatenation, Kleene star, inversion, and composition; they project to FSAs.
Not closed under intersection, complementation, or difference.
Algorithms: recognition, composition, transduction.

Morphology and FSTs

Roadmap
Motivation: representing words; a little (mostly English) morphology; stemming; FSTs & morphology; FSTs & phonology.

Surface Variation & Morphology
Searching (a la Google) for documents about “televised sports”: there are many possible surface forms (televised, television, televise, …; sports, sport, sporting, …).
How can we match? Convert surface forms to a common base form, via stemming or morphological analysis.
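A naive suffix-stripping stemmer already collapses these surface forms onto shared stems. The suffix list below is invented for this example; a real system would use something like the Porter stemmer:

```python
# A toy suffix inventory, longest-match first. Hypothetical, for illustration.
SUFFIXES = sorted(["ision", "ised", "ise", "ing", "ed", "ion", "s"],
                  key=len, reverse=True)

def stem(word):
    """Strip the longest matching suffix, keeping at least 3 stem letters."""
    word = word.lower()
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: len(word) - len(suf)]
    return word
```

With it, the query and document forms above all match on their stems, which is exactly the point of the slide.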

The Lexicon
Goal: represent all the words in a language. Approach?
Enumerate all words? Doable for English, and typical for ASR (Automatic Speech Recognition); English is morphologically relatively impoverished.
Other languages? Wildly impractical. Turkish has roughly 40,000 forms per verb, e.g. uygarlaştıramadıklarımızdanmışsınızcasına, “(behaving) as if you are among those whom we could not civilize”.

Morphological Parsing
Goal: take a surface word form and generate a linguistic structure of component morphemes.
A morpheme is the minimal meaning-bearing unit in a language.
Stem: the morpheme that forms the central meaning unit in a word.
Affix: prefix, suffix, infix, or circumfix.
- Prefix: e.g., possible → impossible
- Suffix: e.g., walk → walking
- Infix: e.g., hingi → humingi (Tagalog)
- Circumfix: e.g., sagen → gesagt (German)
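A toy parser can peel one prefix and one suffix off a surface form and emit the stem plus gloss tags. The affix inventories and tag names here are invented; a real morphological parser consults a lexicon of stems rather than guessing:

```python
# Hypothetical affix inventories mapping surface affixes to gloss tags.
PREFIXES = {"im": "+Neg", "un": "+Neg"}
SUFFIXES = {"ing": "+V+Prog", "ed": "+V+Past", "s": "+N+Pl"}

def parse(word):
    """Strip at most one prefix and one suffix, appending their tags."""
    tags = ""
    for pre, tag in PREFIXES.items():
        if word.startswith(pre):
            word, tags = word[len(pre):], tags + tag
            break
    for suf, tag in SUFFIXES.items():
        if word.endswith(suf):
            word, tags = word[: -len(suf)], tags + tag
            break
    return word + tags
```

Infixes and circumfixes do not fit this strip-from-the-edges scheme, which is one reason FSTs are the preferred model.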

Two Perspectives
Stemming: writing → write (or writ); Beijing → Beije
Morphological analysis: writing → write+V+prog; cats → cat+N+pl; writes → write+V+3rdpers+sg

Ambiguity in Morphology
Alternative analyses:
flies → fly+N+Pl or fly+V+3rdpers+Sg
saw → see+V+past or saw+N
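Because a surface form can have several analyses, an analyzer should return all of them and leave disambiguation to later processing (e.g. a POS tagger). A tiny lexicon-backed sketch, with invented entries:

```python
# A toy analysis lexicon: surface form -> all of its analyses.
ANALYSES = {
    "flies": ["fly+N+Pl", "fly+V+3rdpers+Sg"],
    "saw":   ["see+V+Past", "saw+N"],
}

def analyses(word):
    """Return every known analysis of a surface form; fall back to the
    word itself when it is not in the lexicon."""
    return ANALYSES.get(word.lower(), [word])
```

An FST analyzer behaves the same way: transducing a surface string yields the full set of lexical strings it relates to.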

Multi-linguality in Morphology
Morphologically impoverished languages, e.g. English; isolating languages, e.g. Chinese; morphologically rich languages, e.g. Turkish.

Combining Morphemes
Inflection: stem + grammatical morpheme → same class, e.g. help + ed → helped
Derivation: stem + grammatical morpheme → new class, e.g. walk + er → walker (N)
Compounding: multiple stems → new word, e.g. doghouse, catwalk, …
Clitics: stem + clitic, e.g. I + ’ll → I’ll; he + is → he’s
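Clitic handling is often run in the expansion direction during tokenization. A sketch with an invented clitic table (note that 's is genuinely ambiguous between "is", "has", and the possessive, which this toy ignores):

```python
# Hypothetical clitic table: surface clitic -> expanded full form.
CLITICS = {"'ll": "will", "n't": "not", "'s": "is"}

def expand_clitics(token):
    """Split a cliticized token into its stem and the clitic's full form."""
    for clitic, full in CLITICS.items():
        if token.endswith(clitic) and len(token) > len(clitic):
            return [token[: -len(clitic)], full]
    return [token]
```

Run in reverse, the same table performs contraction (I + will → I’ll), again mirroring FST inversion.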