Morphology, Phonology & FSTs
  Shallow Processing Techniques for NLP
  Ling570
  October 12, 2011
Roadmap
  Motivation: Representing words
  A little (mostly English) Morphology
  Stemming
  FSTs & Morphology
  Stemming
  Morphological analysis
  FSTs & Phonology
Lexicon
  Goal: Compact representation of all surface forms in a language
  Enumeration:
    Impractical for morphologically rich languages
    Descriptively unsatisfying for most languages
  Orthographic variation: fly + er → flier
  Morphological variation: saw + s → saws; fish + s → fish; goose + s → geese
  Phonological variation: dog + s → dog + /z/; fox + s → fox + /IH Z/
Morphological Parsing
  Goal: Take a surface word form and generate a linguistic structure of component morphemes
  A morpheme is the minimal meaning-bearing unit in a language.
    Stem: the morpheme that forms the central meaning unit in a word
    Affix: prefix, suffix, infix, circumfix
      Prefix: e.g., possible → impossible
      Suffix: e.g., walk → walking
      Infix: e.g., hingi → humingi (Tagalog)
      Circumfix: e.g., sagen → gesagt (German)
Combining Morphemes
  Inflection: stem + grammatical morpheme → same class
    E.g. help + ed → helped
  Derivation: stem + grammatical morpheme → new class
    E.g. walk + er → walker (N)
  Compounding: multiple stems → new word
    E.g. doghouse, catwalk, …
  Clitics: stem + clitic
    E.g. I + ll → I’ll; he + is → he’s
Inflectional Morphology (Mostly English)
  Relatively simple inflectional system
    Nouns, verbs, some adjectives
  Noun inflection:
    Only plural, possessive
    Non-English???
    Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x
    Possessive: singular and irregular plural: + ’s; regular plural and after s, z: + ’

                Regular             Irregular
    Singular    cat     thrush      goose    ox
    Plural      cats    thrushes    geese    oxen
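The noun rules above can be sketched in a few lines of Python. This is only an illustration of the slide's rules, not a complete treatment of English: the function names and the small irregular-plural table are assumptions made for the example, and spelling changes such as y → ie are deliberately left out.

```python
# Irregular plurals cannot be predicted from the stem, so they are listed.
# (Illustrative table; a real lexicon would be much larger.)
IRREGULAR_PLURALS = {"goose": "geese", "ox": "oxen"}


def pluralize(noun: str) -> str:
    """Plural: listed irregular form, else 'es' after s, z, sh, ch, x, else 's'."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    return noun + "s"


def possessive(form: str) -> str:
    """Possessive: bare apostrophe after s or z (covers regular plurals), else 's."""
    if form.endswith(("s", "z")):
        return form + "'"
    return form + "'s"


if __name__ == "__main__":
    for noun in ["cat", "thrush", "goose", "ox"]:
        plural = pluralize(noun)
        print(noun, plural, possessive(noun), possessive(plural))
```

Because regular plurals always end in s, the single "ends in s or z" test covers both cases listed on the slide.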
Verb Inflectional Morphology
  Classes: main (eat, hit), modal (can, should), primary (be, have)
    Only main and primary verbs are inflected
  Regular verbs: forms predictable from stem, productive
  Irregular verbs: only about 250, but very frequent

  Form          Regular verbs
  Stem          walk       merge      try       map
  -s form       walks      merges     tries     maps
  -ing part.    walking    merging    trying    mapping
  Past (-ed)    walked     merged     tried     mapped

  Irregular verbs
  Stem     -s form    -ing part.    Past      Past part.
  eat      eats       eating        ate       eaten
  catch    catches    catching      caught    caught
  cut      cuts       cutting       cut       cut
Derivational Morphology
  Relatively complex, common in English
  Nominalization: verb or adjective + affix → noun
  Adjectives: verb or noun + affix → adjective

  Suffix    Base           Derived noun
  -ation    computerize    computerization
  -ee       appoint        appointee
  -er       kill           killer
  -ness     fuzzy          fuzziness

  Suffix    Base           Derived adjective
  -al       computation    computational
  -able     embrace        embraceable
  -less     clue           clueless
Cliticization
  Clitics: between affix and word
    Like an affix: short, reduced
    Like a word: act as pronouns, articles, conjunctions, verbs
  In English:
    Presence is (mostly) unambiguous: marked by the apostrophe (’)
    Meaning is often ambiguous: e.g. he’s (he is or he has)
  More complex in other languages: e.g. Arabic
    Can prefix (proclitic) article, preposition, conjunction
    No markers
    Removal of such clitics is often referred to as light stemming
Stemming
  Simple type of morphological analysis
  Commonly used in information retrieval (IR)
    Supports matching using base form
      e.g. television, televised, televising → televise
    Typically improves retrieval of short documents. Why?
  Most popular: Porter stemmer (snowball.tartarus.org)
  Task: Given surface form, produce base form
    Typically removes suffixes
  Model:
    Rule cascade
    No lexicon!
Porter Stemmer
  Rule cascade:
    Rule form: (condition) PATT1 → PATT2
      E.g. (stem contains vowel) ING → ε
           ATIONAL → ATE
    Rule partial order:
      Step 1a: -s
      Step 1b: -ed, -ing
      Steps 2-4: derivational suffixes
      Step 5: cleanup
  Pros:
    Simple, fast, buildable for a variety of languages
  Cons:
    Overaggressive and underaggressive
    Limited in application
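A rule cascade of this shape can be sketched as follows. This is not the Porter algorithm itself: it is a toy cascade with a handful of hand-picked rules in the (condition) PATT1 → PATT2 form, and the names `STEPS`, `has_vowel`, and `stem` are assumptions made for the example. In particular, the real stemmer attaches a measure condition to rules such as ATIONAL → ATE, which is omitted here.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def has_vowel(base: str) -> bool:
    """Condition used by the -ing / -ed rules: the remaining stem contains a vowel."""
    return bool(VOWEL.search(base))


# Each step is an ordered list of (suffix, replacement, condition) rules.
# Longer suffixes come first, so the first match is the longest match.
STEPS = [
    # Step 1a: plural -s
    [("sses", "ss", None), ("ies", "i", None), ("ss", "ss", None), ("s", "", None)],
    # Step 1b: -ed, -ing (only if the remaining stem contains a vowel)
    [("ing", "", has_vowel), ("ed", "", has_vowel)],
    # Fragment of the derivational steps
    [("ational", "ate", None)],
]


def stem(word: str) -> str:
    """Run the word through every step; at most one rule fires per step."""
    word = word.lower()
    for rules in STEPS:
        for suffix, replacement, condition in rules:
            if word.endswith(suffix):
                base = word[: len(word) - len(suffix)]
                if condition is None or condition(base):
                    word = base + replacement
                break  # the first matching suffix decides the step
    return word


if __name__ == "__main__":
    for w in ["cats", "caresses", "walking", "sing", "relational", "bus", "ate"]:
        print(w, "->", stem(w))
```

Even this small sketch shows both failure modes from the slide: "bus" loses its final s (overaggressive), while "ate" is never related to "eat" because there is no lexicon (underaggressive).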
FST Morphological Analysis
  Focus on English morphology
  FSA acceptor:
    cats → yes; foxes → yes; childs → no
  FST morphological analyzer:
    fox + N + Pl → fox^s#
  FST for orthographic rules:
    fox^s# → foxes#
Morphological Analysis Components
  Lexicon: List of stems and affixes
    E.g. cat: N; -s: Pl
  Morphotactics: Model of morpheme ordering
    Association with classes, affix ordering
    E.g. Pl follows N
  Orthographic rules: Spelling rules
    Changes when morphemes combine
    E.g. y → ie in try + s
Example
  Goal: foxes → fox + N + Pl
  Surface: foxes
    Orthographic rules
  Intermediate: fox^s#
    Lexicon + morphotactics
  Lexical: fox + N + Pl
Multiple Levels
  Generation and Analysis
    Generation: fox + N + Pl → fox^s#; fox^s# → foxes#
    Analysis: foxes# → fox^s#; fox^s# → fox + N + Pl
The Lexicon
  Repository for words:
    Simplest would be enumeration
      Impractical (at least) for many languages
    Includes stems, affixes, some morphotactics
      E.g. cat: N, +sg; fly: V, +base
      What about: flies: V, +sg, +3rd?
  Common model of morphotactics: FSA
Basic Noun Lexicon (J&M, Ch. 3)
  As an FSA

  reg-noun    irreg-pl-noun    irreg-sg-noun    plural
  fox         geese            goose            -s
  cat         sheep            sheep
  dog         mice             mouse
FSA Lexicon with Words
  What’s up with the ‘s’ arc?
    Orthographic rules will fix ‘es’
Lexicon for English Verbs
  Verbs and classes:

  reg-v-stem    irreg-v-stem    irreg-past-v-form    past    past-part    pres-part    3sg
  walk          cut             caught               -ed     -ed          -ing         -s
  fry           speak           ate
  talk          sing            eaten
  impeach                       sang
FSA for Derivational Morphology Complex….
FSAs for Morphotactics
  We have:
    stems (and stem class identities)
      e.g. cat: reg-noun; goose: irreg-noun
      e.g. walk: reg-verb-stem; cut: irreg-verb-stem
    affixes (by form and class)
      e.g. -s: Plural
      e.g. -ed: past, past-part
    morphotactic FSAs:
      Accept combinations of stems & affixes in the language
      Reject otherwise
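As a rough sketch of a morphotactic FSA, the recognizer below works over stem classes for the noun lexicon from J&M Ch. 3. The state names (`q0`, `q1`, `q2`), the transition table layout, and the assumption that the input is already segmented into morphemes are all choices made for this example, not something the slides specify.

```python
# Stem and affix classes (a tiny illustrative lexicon).
CLASSES = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural": {"-s"},
}

# Morphotactics: which class may follow which state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}


def classes_of(morpheme: str) -> set:
    """A morpheme may belong to several classes (e.g. 'sheep')."""
    return {name for name, members in CLASSES.items() if morpheme in members}


def accepts(morphemes: list) -> bool:
    """Simulate the FSA on a morpheme sequence, tracking a set of states."""
    states = {"q0"}
    for morpheme in morphemes:
        states = {
            TRANSITIONS[(state, cls)]
            for state in states
            for cls in classes_of(morpheme)
            if (state, cls) in TRANSITIONS
        }
    return bool(states & ACCEPTING)


if __name__ == "__main__":
    for seq in (["cat", "-s"], ["geese"], ["goose", "-s"], ["child", "-s"]):
        print(seq, accepts(seq))
```

Tracking a set of states rather than a single state keeps the simulation correct when a morpheme such as "sheep" belongs to more than one class.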
Recognition vs Analysis/Generation
  Recognition: can validate a morphological sequence
    Not usually the main goal
  Analysis: Given a surface form, produce component morphemes
  Generation: Given some morphological structure, produce full surface form
  Both require translation from one form to another
    FSTs
Multilevel Tape Machines
  Lexicon FST
  Orthographic rules: FST1 … FSTn
Noun Morphology FSA Remember:
Schematic FST
  cat + N + Pl → cat^s#
  Map morphological features to the empty string if there is no corresponding output
Updating the Lexicon
  Need words, not just classes, as FST
    fox → fox
  Need: geese → goose + N + Pl
  Assume f:f written as f

  reg-noun    irreg-pl-noun        irreg-sg-noun
  fox         g o:e o:e s e        g o o s e
  cat         sheep                sheep
  aardvark    m o:i u:ε s:c e      mouse
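The pair notation in the table can be made concrete with a small sketch. Because this lexicon is finite, the code simply enumerates every path through the lexicon transducer as a list of lexical:surface pairs, rather than building an explicit state machine. The helper names, the feature symbols `+N`, `+Sg`, `+Pl`, and the outputs attached to them follow the schematic FST (cat + N + Pl → cat^s#) but are otherwise assumptions for illustration.

```python
def parse_pairs(spec: str) -> list:
    """'g o:e o:e s e' -> [('g','g'), ('o','e'), ...]; 'f' abbreviates 'f:f', ε is empty."""
    pairs = []
    for token in spec.split():
        lexical, surface = token.split(":") if ":" in token else (token, token)
        pairs.append((lexical, "" if surface == "ε" else surface))
    return pairs


REG_NOUN = ["f o x", "c a t", "a a r d v a r k"]
IRREG_SG_NOUN = ["g o o s e", "s h e e p", "m o u s e"]
IRREG_PL_NOUN = ["g o:e o:e s e", "s h e e p", "m o:i u:ε s:c e"]


def paths():
    """Yield every lexical:intermediate path licensed by the noun morphotactics."""
    for spec in REG_NOUN:
        yield parse_pairs(spec) + [("+N", ""), ("+Sg", "#")]
        yield parse_pairs(spec) + [("+N", ""), ("+Pl", "^s#")]
    for spec in IRREG_SG_NOUN:
        yield parse_pairs(spec) + [("+N", ""), ("+Sg", "#")]
    for spec in IRREG_PL_NOUN:
        yield parse_pairs(spec) + [("+N", ""), ("+Pl", "#")]


def generate(lexical: str) -> list:
    """Lexical tape -> intermediate tape(s)."""
    return ["".join(s for _, s in p) for p in paths()
            if "".join(l for l, _ in p) == lexical]


def analyze(intermediate: str) -> list:
    """Intermediate tape -> lexical tape(s): the same pairs read in the other direction."""
    return ["".join(l for l, _ in p) for p in paths()
            if "".join(s for _, s in p) == intermediate]


if __name__ == "__main__":
    print(generate("goose+N+Pl"))
    print(analyze("mice#"))
    print(analyze("sheep#"))
```

The same list of pairs serves both directions, which is the property that makes transducers attractive for analysis and generation alike.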
Integrating the Lexicon Replace classes with stems
Adding Orthographic Rules
  Current transducer concatenates morphemes
    Should work for cats, aardvarks, mice, ...
    foxs?
  Problem: spelling changes at morpheme boundaries
  Many such rules
    Consonant doubling before -ing, -ed
    E-deletion: silent e dropped before -ing, -ed
    Y replacement: y → ie before -s, y → i before -ed, etc.
  Approach: Transducers for orthographic rules
Creating an Orthographic Rule
  Goal: Correct e insertion in plurals
    E.g. fox^s# → foxes
  Approach 1: ε → e
    foxes, but also cates, doges, etc.
    Only apply in context: after s, z, x, etc., before s
  Approach 2: ε → e / (s|z|x) _ s
    Issue? glass → glases
  Approach 3: ε → e / (s|z|x)^ _ s#
Rewrite Rules
  Format: a → b / c _ d
  Rewrite rules can be optional or obligatory
  Rewrite rules can be ordered to reduce ambiguity
  Under some conditions, rewrite rules are equivalent to FSTs
    a is not allowed to match something introduced in a prior application of the rule
E-insertion Rule Transducer
  ε → e / (s|z|x)^ _ s#
  Input: …(s|z|x)^s# (intermediate level)
  Output: …(s|z|x)es# (surface level)
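The effect of the e-insertion rule can be approximated with a regular-expression rewrite. This is a sketch of the rule's input/output behavior, not the state-by-state transducer: the function name `to_surface` is an assumption, and the final step that deletes the morpheme boundary `^` is added here so the output matches the surface forms used on these slides.

```python
import re

# ε → e / (s|z|x)^ _ s#
# Group 1 is the left context including the morpheme boundary,
# group 2 is the right context; 'e' is inserted between them.
E_INSERTION = re.compile(r"([szx]\^)(s#)")


def to_surface(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to a surface form such as 'foxes#'."""
    with_e = E_INSERTION.sub(r"\1e\2", intermediate)
    return with_e.replace("^", "")  # morpheme boundaries do not appear on the surface


if __name__ == "__main__":
    for form in ["fox^s#", "cat^s#", "fox#", "glass#", "glass^s#", "fox^z#"]:
        print(form, "->", to_surface(form))
```

Requiring the boundary symbol in the left context is what keeps "glass#" intact, which is the problem Approach 2 runs into. All other input passes through unchanged apart from boundary deletion.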
Using the E-insertion FST
  (fox, fox): q0, q0, q0, q1, accept
  (fox#, fox#): q0, q0, q0, q1, q0, accept
  (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0, accept
  (fox^s, foxs): q0, q0, q0, q1, q2, q5, reject
  (fox^z#, foxz#)?
What will it accept?
  (f, f)
  (fox#, fox#)
  (fox^s#, foxes#)
  (fox^z#, foxz#)
  Goal: write rules that capture only those constraints
    Let all other input pass through
Combining FST Lexicon & Rules
  Two-level morphological system: ‘Cascade’
    Transducer from Lexicon to Intermediate
    Rule transducers from Intermediate to Surface
Generation & Parsing
  Generation: Given lexicon tape, cascade to produce surface form
    fox + N + Pl → foxes#
  Parsing: Given surface form, generate analysis
    foxes# → fox + N + Pl or fox + V + 3Sg
    How can we disambiguate?
      We can’t here; need outside information
    What about ‘assess’?
      Need same sort of search as NFAs
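One simple way to see why parsing returns several analyses is to run generation for every lexical candidate and keep those whose surface form matches. The sketch below does exactly that. It is a brute-force generate-and-test stand-in for the NFA-style search mentioned above, not an implementation of it, and the tiny lexicon, the feature names, and the function names are assumptions made for the example.

```python
import re

# Toy lexicon: stem -> parts of speech it can take (illustrative only).
LEXICON = {"fox": {"N", "V"}, "cat": {"N"}, "assess": {"V"}}

# Feature emitted for (part of speech, has -s suffix).
FEATURES = {
    ("N", False): "+Sg", ("N", True): "+Pl",
    ("V", False): "+Base", ("V", True): "+3Sg",
}


def generate(stem: str, suffixed: bool) -> str:
    """Lexicon step (stem plus optional ^s) followed by e-insertion and boundary removal."""
    intermediate = f"{stem}^s#" if suffixed else f"{stem}#"
    with_e = re.sub(r"([szx]\^)(s#)", r"\1e\2", intermediate)
    return with_e.replace("^", "")


def parse(surface: str) -> list:
    """Return every lexical analysis whose generated surface form equals the input."""
    analyses = []
    for stem in sorted(LEXICON):
        for pos in sorted(LEXICON[stem]):
            for suffixed in (False, True):
                if generate(stem, suffixed) == surface:
                    analyses.append(f"{stem}+{pos}{FEATURES[(pos, suffixed)]}")
    return analyses


if __name__ == "__main__":
    print(parse("foxes#"))   # two analyses: the FST alone cannot choose
    print(parse("assess#"))  # only the bare verb; no stem 'asse' or 'ass' is licensed
```

For "foxes#" both the noun and the verb reading survive, so choosing between them needs information from outside the morphological analyzer. For "assess#" the final "ss" looks locally like a suffix context, but only the whole-stem analysis is consistent with the lexicon.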
FST Morphological Analysis
  Summary:
    Main components:
      Lexicon
      Morphotactics
      Orthographic rules
    Morphotactics as FSTs, expanded with FST Lexicon
    Orthographic rules as FSTs
    Combine FSTs, e.g. in cascade
Issues
  What do you think of creating all the rules for a language by hand?
    Time-consuming, complicated
  Proposed approach:
    Unsupervised morphology induction
    Potentially useful for many applications
      IR, MT
Unsupervised Morphology
  Start from tokenized text (or word frequencies)
    talk 60; talked 120; walked 40; walk 30
  Treat as coding/compression problem
    Find most compact representation of lexicon
    Popular model: MDL (Minimum Description Length)
      Smallest total encoding:
        Weighted combination of lexicon size & ‘rules’
Approach
  Generate initial model:
    Base set of words, compute MDL length
  Iterate:
    Generate a new set of words + some model to create a smaller description size
  E.g. for talk, talked, walk, walked
    4 words
    2 words (talk, walk) + 1 affix (-ed) + combination info
    2 words (t, w) + 2 affixes (alk, -ed) + combination info
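The three candidate descriptions can be compared with a toy description-length score. The cost function here is an assumption chosen for illustration: one unit per character stored in the morpheme inventory, plus a fixed `pointer_cost` per morpheme reference in the combination info. Real MDL models measure cost in bits and take the word frequencies into account.

```python
def description_length(segmentation: dict, pointer_cost: float = 1.0) -> float:
    """Toy MDL score: characters in the morpheme inventory + cost of combination info."""
    # Every segmentation must reconstruct its word exactly.
    for word, parts in segmentation.items():
        assert "".join(parts) == word

    inventory = {m for parts in segmentation.values() for m in parts}
    lexicon_cost = sum(len(m) for m in inventory)
    combination_cost = pointer_cost * sum(len(parts) for parts in segmentation.values())
    return lexicon_cost + combination_cost


MODELS = {
    "4 words": {
        "talk": ["talk"], "talked": ["talked"],
        "walk": ["walk"], "walked": ["walked"],
    },
    "2 words + -ed": {
        "talk": ["talk"], "talked": ["talk", "ed"],
        "walk": ["walk"], "walked": ["walk", "ed"],
    },
    "t, w + alk, -ed": {
        "talk": ["t", "alk"], "talked": ["t", "alk", "ed"],
        "walk": ["w", "alk"], "walked": ["w", "alk", "ed"],
    },
}

if __name__ == "__main__":
    for name, model in MODELS.items():
        print(name, description_length(model))
    print("best:", min(MODELS, key=lambda name: description_length(MODELS[name])))
```

Under this particular cost function the stem-plus-suffix model scores lowest: splitting further into t, w + alk shrinks the inventory but pays more in combination info. Changing `pointer_cost` shifts that balance, which is the "weighted combination" the slides refer to.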
Homework #3