Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011.

Roadmap Motivation: representing words. A little (mostly English) morphology. Stemming. FSTs & morphology: stemming, morphological analysis. FSTs & phonology.

Lexicon Goal: Compact representation of all surface forms in a language. Enumeration: impractical for morphologically rich languages, and descriptively unsatisfying for most languages. Orthographic variation: fly + er → flier. Morphological variation: saw + s → saws; fish + s → fish; goose + s → geese. Phonological variation: dog + s → dog + /z/; fox + s → fox + /IH Z/

Morphological Parsing Goal: Take a surface word form and generate a linguistic structure of component morphemes. A morpheme is the minimal meaning-bearing unit in a language. Stem: the morpheme that forms the central meaning unit in a word. Affix: prefix, suffix, infix, circumfix. Prefix: e.g., possible → impossible. Suffix: e.g., walk → walking. Infix: e.g., hingi → humingi (Tagalog). Circumfix: e.g., sagen → gesagt (German)

Combining Morphemes Inflection: stem + gram. morpheme → same class. E.g.: help + ed → helped. Derivation: stem + gram. morpheme → new class. E.g.: walk + er → walker (N). Compounding: multiple stems → new word. E.g.: doghouse, catwalk, … Clitics: stem + clitic. E.g.: I + ll → I'll; he + is → he's

Inflectional Morphology (Mostly English) Relatively simple inflectional system: nouns, verbs, some adjectives. Noun inflection: only plural and possessive. Non-English??? Plural: mostly stem + 's'; 'es' after s, z, sh, ch, x. Possessive: sg & irreg pl: +'s; reg pl and after s, z: '

           Regular            Irregular
Singular   cat    thrush      goose    ox
Plural     cats   thrushes    geese    oxen

Verb Inflectional Morphology Classes: main (eat, hit), modal (can, should), primary (be, have); only main and primary are inflected. Regular verbs: forms predictable from stem, productive. Irregular verbs: only about 250, but very frequent.

Form         Regular verbs
Stem         walk      merge     try       map
-s form      walks     merges    tries     maps
-ing part    walking   merging   trying    mapping
past (-ed)   walked    merged    tried     mapped

Irregular:   eat     eats      eating     ate      eaten
             catch   catches   catching   caught
             cut     cuts      cutting    cut

Derivational Morphology Relatively complex, common in English. Nominalization: verb or adj + affix → noun. Adjectives: verb or noun + affix → adj.

Suffix    Base          Derived Noun
-ation    computerize   computerization
-ee       appoint       appointee
-er       kill          killer
-ness     fuzzy         fuzziness

Suffix    Base          Derived Adjective
-al       computation   computational
-able     embrace       embraceable
-less     clue          clueless

Cliticization Clitics: between affix and word. Affix-like: short, reduced. Word-like: act as pronouns, articles, conjunctions, verbs. In English: presence is (mostly) unambiguous: '. Meaning is often ambiguous: e.g. he's. More complex in other languages, e.g. Arabic: articles, prepositions, and conjunctions can attach as prefixes (proclitics), with no markers. Removal of such clitics is often referred to as light stemming.

Stemming Simple type of morphological analysis. Commonly used in information retrieval (IR): supports matching using the base form, e.g. television, televised, televising → televise. Typically improves retrieval of short documents – why? Most popular: Porter stemmer (snowball.tartarus.org). Task: given a surface form, produce the base form; typically removes suffixes. Model: rule cascade. No lexicon!

Porter Stemmer Rule cascade. Rule form: (condition) PATT1 → PATT2. E.g.: (stem contains vowel) ING → ε; ATIONAL → ATE. Rule partial order: Step 1a: -s; Step 1b: -ed, -ing; Steps 2–4: derivational suffixes; Step 5: cleanup. Pros: simple, fast, buildable for a variety of languages. Cons: overaggressive and underaggressive; limited in application.
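The ordered-rule idea can be sketched in a few lines. The following is a toy illustration only, not the actual Porter algorithm: the function name `toy_stem` and the small rule subset (Step 1a, the vowel-conditioned ING rule, and ATIONAL → ATE) are chosen for illustration.

```python
import re

def toy_stem(word):
    """Toy Porter-style rule cascade: ordered rewrites, some conditioned."""
    # Step 1a: plural -s endings (longest matching pattern wins)
    if word.endswith("sses"):
        word = word[:-2]                    # caresses -> caress
    elif word.endswith("ies"):
        word = word[:-2]                    # ponies -> poni
    elif word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]                    # cats -> cat
    # Step 1b: (stem contains vowel) ING -> epsilon
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        word = word[:-3]                    # walking -> walk, but sing stays
    # A Step-2 style derivational rule: ATIONAL -> ATE
    if word.endswith("ational"):
        word = word[:-7] + "ate"            # relational -> relate
    return word
```

Note how the vowel condition keeps `sing` intact while stripping `walking`, which is exactly what the (condition) part of a Porter rule is for.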

FST Morphological Analysis Focus on English morphology. FSA acceptor: cats → yes; foxes → yes; childs → no. FST morphological analyzer: fox + N + Pl → fox^s#. FST for orthographic rules: fox^s# → foxes#

Morphological Analysis Components Lexicon: list of stems and affixes. E.g.: cat: N; -s: Pl. Morphotactics: model of morpheme ordering; association with classes, affix ordering. E.g.: Pl follows N. Orthographic rules: spelling rules; changes when morphemes combine. E.g.: y → ie in try + s
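The three components can be wired together in a toy noun analyzer. Everything below (the `analyze` helper, the tiny lexicon, the inverted e-insertion check) is invented for illustration; a real system would use the FST machinery the following slides develop.

```python
# Toy analyzer combining lexicon, morphotactics, and an orthographic rule.
LEXICON = {"cat": "N", "fox": "N", "dog": "N"}   # stems and their classes
SUFFIXES = {"s": "Pl"}                           # affixes and their features

def analyze(surface):
    """Return a 'stem +Class +Feature' analysis, or None if rejected."""
    if surface in LEXICON:                       # bare stem
        return f"{surface} +{LEXICON[surface]} +Sg"
    # Orthographic rule, run in reverse: 'es' after s/z/x hides a plain -s
    if surface.endswith("es") and len(surface) > 2 and surface[-3] in "szx":
        stem, suffix = surface[:-2], "s"
    elif surface.endswith("s"):
        stem, suffix = surface[:-1], "s"
    else:
        return None
    # Morphotactics: Pl may only follow N
    if stem in LEXICON and LEXICON[stem] == "N":
        return f"{stem} +N +{SUFFIXES[suffix]}"
    return None
```

For example, `analyze("foxes")` undoes the spelling rule first, then checks the stem and affix ordering.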

Example Goal: foxes → fox + N + Pl. Surface: foxes. (Orthographic rules) Intermediate: fox^s. (Lexicon + morphotactics) Lexical: fox + N + Pl

Multiple Levels Generation and Analysis Generation: fox + N + Pl → fox^s#; fox^s# → foxes#. Analysis: foxes# → fox^s#; fox^s# → fox + N + Pl

The Lexicon Repository for words. Simplest would be enumeration; impractical (at least) for many languages. Includes stems, affixes, some morphotactics. E.g.: cat: N, +sg; fly: V, +base. What about flies: V, +sg, +3rd? Common model of morphotactics: FSA

Basic Noun Lexicon (J&M, Ch. 3) As an FSA:

reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep
dog        mice            mouse
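The class-based FSA can be written down as a transition table. The state names (`q0`, `q1`, `q2`) and the `accepts` helper below are illustrative, loosely following the J&M figure: reg-noun stems may take the plural `-s` arc, while irregular forms go straight to the final state.

```python
# Stems grouped by class, from the noun lexicon table.
CLASSES = {
    "fox": "reg-noun", "cat": "reg-noun", "dog": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
}

# FSA over class labels: q0 is the start state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",        # the -s arc, available to reg-nouns only
}
ACCEPTING = {"q1", "q2"}           # a bare reg-noun stem is also a word

def accepts(morphemes):
    """morphemes: e.g. ['fox', '-s'] (before orthographic rules apply)."""
    state = "q0"
    for m in morphemes:
        label = "plural" if m == "-s" else CLASSES.get(m)
        state = TRANSITIONS.get((state, label))
        if state is None:
            return False
    return state in ACCEPTING
```

Because there is no plural arc out of q2, `['goose', '-s']` is correctly rejected while `['geese']` is accepted.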

FSA Lexicon with Words What’s up with the ‘s’ arc? Orthographic rules will fix ‘es’

Lexicon for English Verbs Verbs and classes:

reg-v-stem   irreg-v-stem   irreg-past-v-form   past/past-part   pres-part   3sg
walk         cut            caught              -ed              -ing        -s
fry          speak          ate
talk         sing           eaten
impeach      sang

FSA for Derivational Morphology Complex….

FSAs for Morphotactics We have: stems (and stem class identities), e.g. cat: reg-noun, goose: irreg-noun; walk: reg-verb-stem, cut: irreg-verb-stem. Affixes (by form and class), e.g. -s: plural; -ed: past, past-part. Morphotactic FSAs: accept combinations of stems & affixes in the language; reject otherwise.

Recognition vs Analysis/Generation An FSA can validate a morphological sequence, but recognition is not usually the main goal. Analysis: given a surface form, produce the component morphemes. Generation: given a morphological structure, produce the full surface form. Both require translation from one form to another: FSTs

Multilevel Tape Machines Cascade: Lexicon FST, then orthographic-rule FSTs (FST1 … FSTn)

Noun Morphology FSA Remember:

Schematic FST cat + N + Pl → cat^s#. Map morphological features to the empty string if there is no corresponding output.

Updating the Lexicon Need words, not just classes, as an FST: fox → fox. Need: geese → goose + N + Pl. Assume f:f is written as f.

reg-noun   irreg-pl-noun     irreg-sg-noun
fox        g o:e o:e s e     g o o s e
cat        sheep
aardvark   m o:i u:ε s:c e   mouse
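The symbol-pair notation can be read as two tapes: each arc carries a lexical:surface pair, with a bare letter standing for f:f. The `PATHS` data below just spells out the table's notation for two irregular nouns (names are illustrative).

```python
# Each path is a list of (lexical, surface) symbol pairs; '' plays epsilon.
PATHS = {
    "goose": [("g", "g"), ("o", "e"), ("o", "e"), ("s", "s"), ("e", "e")],
    "mouse": [("m", "m"), ("o", "i"), ("u", ""), ("s", "c"), ("e", "e")],
}

def lexical_of(path):
    return "".join(lex for lex, _ in path)   # read the upper (lexical) tape

def surface_of(path):
    return "".join(srf for _, srf in path)   # read the lower (surface) tape
```

Reading the goose path downward gives the surface form geese; reading it upward recovers the lexical form goose, which is exactly the analysis/generation symmetry of an FST.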

Integrating the Lexicon Replace classes with stems

Adding Orthographic Rules Current transducer concatenates morphemes. Should work for cats, aardvarks, mice, … but foxs? Problem: spelling changes at morpheme boundaries. Many such rules: consonant doubling before -ing, -ed; e-deletion: silent e dropped before -ing, -ed; y-replacement: y → ie before -s, y → i before -ed; etc. Approach: transducers for orthographic rules.
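Two of these spelling rules can be sketched as ordered regex rewrites over the intermediate form (with `^` as the morpheme boundary and `#` as the word boundary). The `spell` helper is an invented illustration, not the FST formulation the slides go on to build.

```python
import re

def spell(form):
    """Apply y-replacement and e-deletion, then drop boundary markers."""
    form = re.sub(r"y\^s#", "ies#", form)          # y -> ie before -s: try^s -> tries
    form = re.sub(r"e\^(ing|ed)#", r"\1#", form)   # silent e dropped: merge^ing -> merging
    form = form.replace("^", "").replace("#", "")  # cleanup: remove boundaries
    return form
```

Forms that match no rule, like walk^ed#, simply pass through with the boundaries stripped.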

Creating an Orthographic Rule Goal: correct e-insertion in plurals. E.g. fox^s# → foxes. Approach 1: ε → e. Gives foxes, but also cates, doges, etc. Only apply in context: after s, z, x, etc., before s. Approach 2: ε → e / (s|z|x|…)_s. Issue? glass → glases. Approach 3: ε → e / (s|z|x|…)^_s#
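Approach 3 can be tried directly as a regex over the intermediate form (an illustrative sketch; `e_insert` is an invented name). The `^` context is what blocks the Approach-2 problem: inside glass there is no morpheme boundary between the two s's, so no spurious e is inserted.

```python
import re

def e_insert(form):
    """Insert e only after s/z/x, at a morpheme boundary, before a final s#."""
    return re.sub(r"([szx])\^s#", r"\1es#", form)
```

So fox^s# becomes foxes#, glass^s# becomes glasses#, and cat^s# is left alone for the plain concatenation rule.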

Rewrite Rules Format: a → b / c_d. Rewrite rules can be optional or obligatory. Rewrite rules can be ordered to reduce ambiguity. Under some conditions, rewrite rules are equivalent to FSTs: a is not allowed to match material introduced by a prior rule application.

E-insertion Rule Transducer ε → e / (s|z|x)^_s#. Input: …(s|z|x)^s# (intermediate level). Output: …(s|z|x)es# (surface level)

Using the E-insertion FST (fox, fox): q0, q0, q0, q1 – accept. (fox#, fox#): q0, q0, q0, q1, q0 – accept. (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0 – accept. (fox^s, foxs): q0, q0, q0, q1, q2, q5 – reject. (fox^z#, foxz#)?
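These traces can be reproduced with a small transition table over (intermediate, surface) symbol pairs. The table below is one possible machine consistent with the traces; the actual machine in J&M Ch. 3 may differ in its details, so treat this as a sketch.

```python
# States q0..q5; '' plays the role of epsilon on either tape.
T = {
    ("q0", "s"): "q1",   ("q0", "zx"): "q1",  ("q0", "other"): "q0",
    ("q0", "#"): "q0",   ("q0", "^"): "q0",
    ("q1", "s"): "q1",   ("q1", "zx"): "q1",  ("q1", "other"): "q0",
    ("q1", "#"): "q0",   ("q1", "^"): "q2",
    ("q2", "ins"): "q3", ("q2", "s"): "q5",   ("q2", "zx"): "q1",
    ("q2", "other"): "q0",
    ("q3", "s"): "q4",
    ("q4", "#"): "q0",
    ("q5", "other"): "q0",   # an s NOT followed by # is unconstrained
}
ACCEPT = {"q0", "q1"}

def classify(pair):
    """Label a symbol pair for the transition table."""
    i, s = pair
    if (i, s) == ("^", ""):  return "^"      # morpheme boundary, deleted
    if (i, s) == ("", "e"):  return "ins"    # the inserted e
    if i != s:               return None     # no other changes allowed
    if i == "#":             return "#"
    if i == "s":             return "s"
    if i in "zx":            return "zx"
    return "other"

def run(pairs):
    state = "q0"
    for p in pairs:
        state = T.get((state, classify(p)))
        if state is None:
            return False
    return state in ACCEPT
```

Running the machine on the four traced pairs gives accept, accept, accept, reject, matching the traces above; (fox^z#, foxz#) is accepted, since the rule only constrains an s after (s|z|x)^.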

What will it accept? (f, f); (fox#, fox#); (fox^s#, foxes#); (fox^z#, foxz#). Goal: write rules that capture only those constraints; let all other input pass through.

Combining FST Lexicon & Rules Two-level morphological system: a 'cascade'. Transducer from the lexical to the intermediate level; rule transducers from the intermediate to the surface level.

Generation & Parsing Generation: given the lexical tape, cascade to produce the surface form: fox + N + PL → foxes#. Parsing: given the surface form, generate the analysis: foxes# → fox + N + PL, or → fox + V + 3Sg. How can we disambiguate? We can't here – we need outside information. What about 'assess'? Needs the same sort of search as NFAs.

FST Morphological Analysis Summary Main components: lexicon, morphotactics, orthographic rules. Morphotactics as FSTs, expanded with the FST lexicon; orthographic rules as FSTs. Combine FSTs, e.g. in a cascade.

Issues What do you think of creating all the rules for a language – by hand? Time-consuming, complicated. Proposed approach: unsupervised morphology induction. Potentially useful for many applications: IR, MT

Unsupervised Morphology Start from tokenized text (or word frequencies): talk 60, talked 120, walked 40, walk 30. Treat as a coding/compression problem: find the most compact representation of the lexicon. Popular model: MDL (Minimum Description Length). Smallest total encoding: weighted combination of lexicon size & 'rules'

Approach Generate initial model: base set of words, compute MDL length. Iterate: generate a new set of words + some model to create a smaller description size. E.g. for talk, talked, walk, walked: 4 words → 2 words (talk, walk) + 1 affix (-ed) + combination info → 2 words (t, w) + 2 affixes (alk, -ed) + combination info
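The shrinking description size can be made concrete with a naive character-count cost. This is purely illustrative: real MDL uses bit lengths under a probabilistic code, and the one-character-per-combination charge for the "combination info" is an invented simplification.

```python
def cost(items, combinations=0):
    """Naive description length: total characters in the word/stem/affix
    lists, plus one unit per (stem, affix) combination flag."""
    return sum(len(x) for x in items) + combinations

# Model 0: four whole words, no structure.
model0 = cost(["talk", "talked", "walk", "walked"])        # 4+6+4+6 = 20

# Model 1: two stems + one affix + 4 combination flags
# (talk, talk+ed, walk, walk+ed).
model1 = cost(["talk", "walk", "ed"], combinations=4)      # 4+4+2+4 = 14
```

Model 1 is cheaper, so an MDL-style search would prefer splitting off -ed, which is exactly the iteration step described above.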

Homework #3