Published by Prosper Hodges. Modified over 9 years ago.
1
Morphology, Phonology & FSTs
Shallow Processing Techniques for NLP
Ling570
October 12, 2011
2
Roadmap
  Motivation: Representing words
  A little (mostly English) Morphology
  Stemming
  FSTs & Morphology
    Stemming
    Morphological analysis
  FSTs & Phonology
3
Words Goal: Compact representation of all surface forms in a language
7
Lexicon
  Goal: Compact representation of all surface forms in a language
  Enumeration:
    Impractical for morphologically rich languages
    Descriptively unsatisfying for most languages
  Orthographic variation: fly + er -> flier
  Morphological variation: saw + s -> saws; fish + s -> fish; goose + s -> geese
  Phonological variation: dog + s -> dog + /z/; fox + s -> fox + /IH Z/
8
Morphological Parsing
  Goal: Take a surface word form and generate a linguistic structure of component morphemes
  A morpheme is the minimal meaning-bearing unit in a language.
    Stem: the morpheme that forms the central meaning unit in a word
    Affix: prefix, suffix, infix, circumfix
      Prefix: e.g., possible -> impossible
      Suffix: e.g., walk -> walking
      Infix: e.g., hingi -> humingi (Tagalog)
      Circumfix: e.g., sagen -> gesagt (German)
12
Combining Morphemes
  Inflection: stem + grammatical morpheme -> same class
    E.g. help + ed -> helped
  Derivation: stem + grammatical morpheme -> new class
    E.g. walk + er -> walker (N)
  Compounding: multiple stems -> new word
    E.g. doghouse, catwalk, …
  Clitics: stem + clitic
    E.g. I + ’ll -> I’ll; he + is -> he’s
16
Inflectional Morphology (Mostly English)
  Relatively simple inflectional system
    Nouns, verbs, some adjectives
  Noun inflection: Only plural, possessive
    Non-English???
    Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x
    Possessive: singular and irregular plural: +’s; regular plural and after s, z: +’

                Regular              Irregular
    Singular    cat     thrush       goose    ox
    Plural      cats    thrushes     geese    oxen
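The plural and possessive rules on this slide can be sketched in a few lines of Python. This is only an illustration of the rules as stated: the function names and the tiny `IRREGULAR_PLURALS` table are assumptions made for the example, and the table holds just the two irregular nouns shown above.

```python
# Irregular plurals taken from the slide's table; a real lexicon would list many more.
IRREGULAR_PLURALS = {"goose": "geese", "ox": "oxen"}


def pluralize(noun: str) -> str:
    """Regular plural: 'es' after s, z, sh, ch, x; otherwise 's'."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    return noun + "s"


def possessive(noun: str) -> str:
    """Bare apostrophe after a final s or z (e.g. regular plurals); otherwise 's."""
    if noun.endswith(("s", "z")):
        return noun + "'"
    return noun + "'s"


for singular in ["cat", "thrush", "goose", "ox"]:
    plural = pluralize(singular)
    print(singular, plural, possessive(singular), possessive(plural))
```

The irregular lookup runs before the suffix test, so `ox` yields `oxen` rather than `oxes`.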
19
Verb Inflectional Morphology
  Classes: Main (eat, hit), modal (can, should), primary (be, have)
    Only main, primary inflected
  Regular verbs: Forms predictable from stem, productive

    Form          Regular Verbs
    Stem          walk       merge      try       map
    -s form       walks      merges     tries     maps
    -ing part     walking    merging    trying    mapping
    past (-ed)    walked     merged     tried     mapped

  Irregular verbs: Only about 250, but very frequent

    eat      eats       eating      ate       eaten
    catch    catches    catching    caught
    cut      cuts       cutting     cut
22
Derivational Morphology
  Relatively complex, common in English
  Nominalization: Verb or Adj + affix -> Noun

    Suffix    Base           Derived Noun
    -ation    computerize    computerization
    -ee       appoint        appointee
    -er       kill           killer
    -ness     fuzzy          fuzziness

  Adjectives: Verb or Noun + affix -> Adj

    Suffix    Base           Derived Adjective
    -al       computation    computational
    -able     embrace        embraceable
    -less     clue           clueless
26
Cliticization
  Clitics: between affix and word
    Like an affix: short, reduced
    Like a word: act as pronouns, articles, conjunctions, verbs
  In English:
    Presence is (mostly) unambiguous: marked by ’
    Meaning is often ambiguous: e.g. he’s (he is or he has)
  More complex in other languages: e.g. Arabic
    Can prefix (proclitic) article, preposition, conjunction
    No markers
    Removal of such clitics is often referred to as light stemming
31
Stemming
  Simple type of morphological analysis
  Commonly used in information retrieval (IR)
    Supports matching using base form
      e.g. television, televised, televising -> televise
    Typically improves retrieval of short documents – why?
  Most popular: Porter stemmer (snowball.tartarus.org)
  Task: Given surface form, produce base form
    Typically, removes suffixes
  Model: Rule cascade
    No lexicon!
39
Porter Stemmer
  Rule cascade:
    Rule form: (condition) PATT1 -> PATT2
      E.g. (stem contains vowel) ING -> ε
           ATIONAL -> ATE
    Rule partial order:
      Step 1a: -s
      Step 1b: -ed, -ing
      Steps 2-4: derivational suffixes
      Step 5: cleanup
  Pros: Simple, fast, buildable for a variety of languages
  Cons: Overaggressive and underaggressive
    Limited in application
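To make the rule-cascade idea concrete, here is a heavily simplified sketch in Python. It is not the Porter stemmer: it contains only a handful of rules in the slide's `(condition) PATT1 -> PATT2` form, the step contents are an assumption chosen for illustration, and steps 3 to 5 are omitted entirely.

```python
import re

VOWEL = re.compile(r"[aeiou]")

# Each step is an ordered list of (suffix, replacement, condition) rules.
# condition is None or a pattern the remaining stem must contain.
# Only the first rule whose suffix matches is tried in a step; then the next step runs.
STEPS = [
    # Step 1a: plural -s
    [("sses", "ss", None), ("ies", "i", None), ("ss", "ss", None), ("s", "", None)],
    # Step 1b: -ing / -ed, only if the remaining stem contains a vowel
    [("ing", "", VOWEL), ("ed", "", VOWEL)],
    # One sample derivational rule standing in for steps 2-4
    [("ational", "ate", None)],
]


def stem(word: str) -> str:
    """Run the word through each step of the cascade in order."""
    word = word.lower()
    for rules in STEPS:
        for suffix, replacement, condition in rules:
            if word.endswith(suffix):
                base = word[: len(word) - len(suffix)]
                if condition is None or condition.search(base):
                    word = base + replacement
                break  # at most one rule per step
    return word


for w in ["cats", "caresses", "walking", "sing", "relational"]:
    print(w, "->", stem(w))
```

Note how the vowel condition keeps `sing` intact (the remaining stem `s` has no vowel) while `walking` loses its suffix. No lexicon is consulted at any point, which is why the output of a real stemmer is not guaranteed to be a word.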
42
FST Morphological Analysis
  Focus on English morphology
  FSA acceptor:
    cats -> yes; foxes -> yes; childs -> no
  FST morphological analyzer:
    fox + N + pl -> fox^s#
  FST for orthographic rules:
    fox^s# -> foxes#
45
Morphological Analysis Components
  Lexicon: List of stems and affixes
    E.g. cat: N; -s: Pl
  Morphotactics: Model of morpheme ordering
    Association with classes, affix ordering
    E.g. Pl follows N
  Orthographic rules: Spelling rules
    Changes when morphemes combine
    E.g. y -> ie in try + s
49
Example
  Goal: foxes -> fox + N + Pl
  Surface: foxes
    (orthographic rules)
  Intermediate: fox^s
    (lexicon + morphotactics)
  Lexical: fox + N + Pl
50
Multiple Levels
  Generation and Analysis
    Generation: fox + N + Pl -> fox^s#; fox^s# -> foxes#
    Analysis: foxes# -> fox^s#; fox^s# -> fox + N + Pl
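The generation direction can be sketched as two small functions composed in sequence, one per level. This is an illustrative stand-in using string operations rather than real transducers; the feature tokens (`+N`, `+Sg`, `+Pl`) and the function names are assumptions for the example.

```python
import re

# Hypothetical mapping from lexical-level features to intermediate-level material.
# Features with no corresponding output map to the empty string.
FEATURE_TO_MORPH = {"+N": "", "+Sg": "#", "+Pl": "^s#"}


def lexical_to_intermediate(lexical: str) -> str:
    """'fox +N +Pl' -> 'fox^s#'"""
    stem, *features = lexical.split()
    return stem + "".join(FEATURE_TO_MORPH[f] for f in features)


def intermediate_to_surface(intermediate: str) -> str:
    """'fox^s#' -> 'foxes#': insert e after s/z/x before ^s#, then drop boundaries."""
    with_e = re.sub(r"([szx])\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "")


def generate(lexical: str) -> str:
    """Cascade: lexical level -> intermediate level -> surface level."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for form in ["fox +N +Pl", "cat +N +Pl", "cat +N +Sg"]:
    print(form, "->", lexical_to_intermediate(form), "->", generate(form))
```

Analysis runs the same two mappings in the opposite order and direction, which is what makes a transducer (rather than a one-way function like these) the natural model.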
53
The Lexicon
  Repository for words:
    Simplest would be enumeration
      Impractical (at least) for many languages
    Includes stems, affixes, some morphotactics
      E.g. cat: N, +sg; fly: v, +base
      What about: flies: v, +sg, +3rd?
  Common model of morphotactics: FSA
54
Basic Noun Lexicon (J&M, CH3)

  reg-noun    irreg-pl-noun    irreg-sg-noun    plural
  fox         geese            goose            -s
  cat         sheep
  dog         mice             mouse
55
Basic Noun Lexicon (J&M, CH3)
  As an FSA
  [Figure: noun lexicon FSA over the classes reg-noun, irreg-pl-noun, irreg-sg-noun, plural]
58
FSA Lexicon with Words
  What’s up with the ‘s’ arc?
    Orthographic rules will fix ‘es’
59
Lexicon for English Verbs
  Verbs and classes:

  reg-v-stem    irreg-v-stem    irreg-past-v-form    past    past-part    pres-part    3sg
  walk          cut             caught               -ed     -ed          -ing         -s
  fry           speak           ate
  talk          sing            eaten
  impeach                       sang
61
FSA for Derivational Morphology Complex….
64
FSAs for Morphotactics
  We have:
    stems (and stem class identities)
      e.g. cat: reg-noun; goose: irreg-noun
      e.g. walk: reg-verb-stem; cut: irreg-verb-stem
    affixes (by form and class)
      e.g. –s: Plural
      e.g. –ed: past, past-part
    morphotactic FSAs:
      Accept combinations of stems & affixes in the language
      Reject otherwise
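A morphotactic FSA for nouns can be sketched as a transition table over morpheme classes. The state names (`q0`, `q1`, `q2`), the dictionary layout, and the small lexicon below are assumptions for illustration; the classes follow the noun lexicon shown earlier in the deck.

```python
# Toy lexicon: morpheme -> class (only a few entries, for illustration).
LEXICON = {
    "fox": "reg-noun", "cat": "reg-noun", "dog": "reg-noun",
    "goose": "irreg-sg-noun", "mouse": "irreg-sg-noun",
    "geese": "irreg-pl-noun", "mice": "irreg-pl-noun",
    "-s": "plural",
}

# (state, morpheme class) -> next state
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}


def accepts(morphemes):
    """Return True if the morpheme sequence is a legal noun under the FSA."""
    state = "q0"
    for morpheme in morphemes:
        morpheme_class = LEXICON.get(morpheme)
        state = TRANSITIONS.get((state, morpheme_class))
        if state is None:  # unknown morpheme or no arc: reject
            return False
    return state in ACCEPTING


print(accepts(["cat", "-s"]))    # regular noun + plural
print(accepts(["geese"]))        # irregular plural on its own
print(accepts(["goose", "-s"]))  # no plural arc after an irregular noun
```

The FSA happily accepts `fox` + `-s` as well; turning that into the spelling `foxes` is left to the orthographic rules, not to the morphotactics.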
68
Recognition vs Analysis/Generation
  Can validate a morphological sequence
  Recognition not usually main goal
    Analysis: Given a surface form, produce component morphemes
    Generation: Given some morphological structure, produce full surface form
  Requires translation from one form to another: FSTs
69
Multilevel Tape Machines
  Lexicon FST
  Orthographic rules: FST1 … FSTn
70
Noun Morphology FSA Remember:
71
Schematic FST
  cat + N + Pl -> cat^s#
  Map morph features to empty string if there is no corresponding output
75
Updating the Lexicon
  Need words, not just classes, as FST
    fox -> fox
  Need: geese -> goose + N + Pl
  Assume f:f written as f

  reg-noun    irreg-pl-noun        irreg-sg-noun
  fox         g o:e o:e s e        g o o s e
  cat         sheep
  aardvark    m o:i u:ε s:c e      mouse
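The arc notation in the table can be unpacked mechanically. The sketch below, with hypothetical helper names `parse_arcs` and `tapes`, reads a string such as `g o:e o:e s e`, treats a bare symbol `f` as `f:f`, and returns the lexical (upper) and surface (lower) strings; `ε` stands for the empty string.

```python
EPSILON = "ε"


def parse_arcs(spec: str):
    """'g o:e o:e s e' -> [('g', 'g'), ('o', 'e'), ('o', 'e'), ('s', 's'), ('e', 'e')]"""
    pairs = []
    for arc in spec.split():
        upper, sep, lower = arc.partition(":")
        if not sep:  # a bare symbol f abbreviates f:f
            lower = upper
        pairs.append((upper, lower))
    return pairs


def tapes(spec: str):
    """Return (lexical string, surface string) spelled out by a path of arcs."""
    pairs = parse_arcs(spec)
    lexical = "".join(u for u, _ in pairs if u != EPSILON)
    surface = "".join(l for _, l in pairs if l != EPSILON)
    return lexical, surface


# Build a surface -> analysis table for the irregular plurals in the slide.
IRREG_PLURAL_ARCS = ["g o:e o:e s e", "m o:i u:ε s:c e"]
ANALYSES = {}
for spec in IRREG_PLURAL_ARCS:
    lexical, surface = tapes(spec)
    ANALYSES[surface] = lexical + " +N +Pl"

print(ANALYSES)
```

Each arc pairs one lexical symbol with one surface symbol, so a single path through the lexicon FST relates `goose` on one tape to `geese` on the other.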
76
Integrating the Lexicon Replace classes with stems
80
Adding Orthographic Rules
  Current transducer concatenates morphemes
    Should work for cats, aardvarks, mice, ..
    foxs?
  Problem: spelling changes at morpheme boundaries
  Many such rules:
    Consonant doubling before –ing, -ed
    E-deletion: silent e dropped before –ing, -ed
    Y replacement: y -> ie before –s, i before -ed, etc.
  Approach: Transducers for orthographic rules
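As a rough illustration of the rules just listed, the following Python sketch attaches `-s`, `-ing`, or `-ed` to a verb stem. The function name `attach` and the consonant-vowel-consonant test used for doubling are assumptions for the example; the doubling test in particular is a crude heuristic that ignores stress, so it would wrongly double the final consonant of a word like `visit`.

```python
VOWELS = "aeiou"
NO_DOUBLING = VOWELS + "wxy"  # final letters that are never doubled here


def attach(stem: str, suffix: str) -> str:
    """Attach 's', 'ing', or 'ed' to a verb stem using rough spelling rules."""
    last = stem[-1]
    before_last = stem[-2] if len(stem) > 1 else ""
    consonant_y = last == "y" and before_last != "" and before_last not in VOWELS

    if suffix == "s":
        if consonant_y:
            return stem[:-1] + "ies"      # y -> ie before -s
        if stem.endswith(("s", "z", "x", "sh", "ch")):
            return stem + "es"            # e-insertion
        return stem + "s"

    if suffix in ("ing", "ed"):
        if consonant_y and suffix == "ed":
            return stem[:-1] + "ied"      # y -> i before -ed
        if last == "e" and before_last != "" and before_last not in VOWELS:
            return stem[:-1] + suffix     # silent e dropped
        if (len(stem) >= 3 and last not in NO_DOUBLING
                and before_last in VOWELS and stem[-3] not in VOWELS):
            return stem + last + suffix   # consonant doubling (CVC heuristic)
        return stem + suffix

    raise ValueError(f"unsupported suffix: {suffix}")


for verb in ["walk", "merge", "try", "map"]:
    print(verb, attach(verb, "s"), attach(verb, "ing"), attach(verb, "ed"))
```

The four verbs are the regular examples from the verb inflection table, and the sketch reproduces that table's forms. Writing each rule as its own transducer, as the slide proposes, keeps the rules independent instead of interleaving them in one function.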
85
Creating an Orthographic Rule
  Goal: Correct e insertion in plurals
    E.g. fox^s# -> foxes
  Approach 1: ε -> e
    foxes, but also cates, doges, etc…
    Only apply in context: after s, z, x, etc., before s
  Approach 2: ε -> e / (s|z|x)_s
    Issue? glass -> glases
  Approach 3: ε -> e / (s|z|x)^_s#
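The three approaches can be compared side by side with regular-expression substitutions. This is a sketch using Python's `re.sub` as a stand-in for rule transducers; the function names are assumptions, and `^` and `#` are the morpheme and word boundary symbols used in the slides.

```python
import re


def approach_1(intermediate: str) -> str:
    """ε -> e with no context: insert e before every plural s."""
    return re.sub(r"\^s#", "es#", intermediate)


def approach_2(word: str) -> str:
    """ε -> e / (s|z|x)_s: context on letters only, no boundary symbols."""
    return re.sub(r"([szx])s", r"\1es", word)


def approach_3(intermediate: str) -> str:
    """ε -> e / (s|z|x)^_s#: only at a morpheme boundary before word-final s."""
    with_e = re.sub(r"([szx])\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "")


print(approach_1("fox^s#"), approach_1("cat^s#"))  # overgenerates on cat
print(approach_2("foxs"), approach_2("glass"))     # damages the stem glass
print(approach_3("fox^s#"), approach_3("cat^s#"), approach_3("glass#"))
```

Only the third version leaves both `cat^s#` and `glass#` alone, because the boundary symbols tie the rule to a plural suffix rather than to any `s` that happens to follow `s`, `z`, or `x`.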
86
Rewrite Rules
  Format: a -> b / c_d
  Rewrite rules can be optional or obligatory
  Rewrite rules can be ordered to reduce ambiguity.
  Under some conditions, rewrite rules are equivalent to FSTs.
    a is not allowed to match something introduced in a prior rule application
87
E-insertion Rule Transducer
  ε -> e / (s|z|x)^_s#
  Input: …(s|z|x)^s# (intermediate level)
  Output: …(s|z|x)es# (surface level)
93
Using the E-insertion FST
  (fox,fox): q0,q0,q0,q1, accept
  (fox#,fox#): q0,q0,q0,q1,q0, accept
  (fox^s#,foxes#): q0,q0,q0,q1,q2,q3,q4,q0, accept
  (fox^s,foxs): q0,q0,q0,q1,q2,q5, reject
  (fox^z#,foxz#)?
95
What will it accept?
  (f,f)
  (fox#,fox#)
  (fox^s#,foxes#)
  (fox^z#,foxz#)
  Goal: write rules that capture only those constraints
    Let all other input pass through
96
Combining FST Lexicon & Rules
  Two-level morphological system: ‘Cascade’
    Transducer from Lexicon to Intermediate
    Rule transducers from Intermediate to Surface
102
Generation & Parsing
  Generation: Given lexicon tape, cascade to produce surface form
    fox + N + PL -> foxes#
  Parsing: Given surface form, generate analysis
    foxes# -> fox + N + PL or fox + V + 3Sg
    How can we disambiguate?
      We can’t here – need outside information
    What about ‘assess’?
      Need same sort of search as NFAs
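The ambiguity and the need for search can both be seen in a small analyzer that tries every candidate split and keeps those the lexicon supports. The sketch below is a toy stand-in for running the cascade in reverse: the `STEMS` lexicon, the feature labels, and the function name `analyze` are all assumptions for illustration.

```python
# Hypothetical toy lexicon: stem -> categories it can belong to.
STEMS = {"fox": {"N", "V"}, "cat": {"N"}, "assess": {"V"}}
SUFFIX_FEATURE = {"N": "+Pl", "V": "+3Sg"}


def analyze(surface: str):
    """Return every analysis of a surface form such as 'foxes#'."""
    word = surface.rstrip("#")
    analyses = []

    # Hypothesis 1: the whole word is a bare stem.
    for category in sorted(STEMS.get(word, ())):
        analyses.append(f"{word} +{category}")

    # Hypothesis 2: stem + s, possibly with an inserted e to undo.
    candidates = []
    if word.endswith("es") and word[:-2].endswith(("s", "z", "x")):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    for stem in candidates:
        # Candidates missing from the lexicon are dead ends and contribute nothing.
        for category in sorted(STEMS.get(stem, ())):
            analyses.append(f"{stem} +{category} {SUFFIX_FEATURE[category]}")
    return analyses


print(analyze("foxes#"))   # two analyses: the form alone cannot disambiguate
print(analyze("assess#"))  # the 'asses' + s hypothesis is tried and abandoned
```

For `assess`, the final `s` looks like a suffix, so the analyzer has to pursue that path before discovering it fails, which is the same kind of backtracking search needed to recognize with an NFA.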
103
FST Morphological Analysis
  Summary: Main components
    Lexicon
    Morphotactics
    Orthographic rules
  Morphotactics as FSTs, expanded with FST Lexicon
  Orthographic rules as FSTs
  Combine FSTs, e.g. in cascade
106
Issues
  What do you think of creating all the rules for a language – by hand?
    Time-consuming, complicated
  Proposed approach: Unsupervised morphology induction
  Potentially useful for many applications
    IR, MT
108
Unsupervised Morphology
  Start from tokenized text (or word frequencies)
    talk      60
    talked    120
    walked    40
    walk      30
  Treat as coding/compression problem
    Find most compact representation of lexicon
    Popular model: MDL (Minimum Description Length)
      Smallest total encoding: weighted combination of lexicon size & ‘rules’
112
Approach
  Generate initial model:
    Base set of words, compute MDL length
  Iterate:
    Generate a new set of words + some model to create a smaller description size
  E.g. for talk, talked, walk, walked:
    4 words
    2 words (talk, walk) + 1 affix (-ed) + combination info
    2 words (t, w) + 2 affixes (alk, -ed) + combination info
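To see how the three candidate models compare, here is a toy description-length calculation in Python. The cost function is an assumption made purely for illustration: one unit per letter stored in the model plus one unit per pointer needed to spell each word out of the stored pieces. A real MDL model would measure both parts in bits and weight them.

```python
WORDS = ["talk", "talked", "walk", "walked"]

# Each model: the pieces it stores, and how each word is spelled from them.
MODELS = {
    "4 words": {
        "pieces": ["talk", "talked", "walk", "walked"],
        "encodings": {w: [w] for w in WORDS},
    },
    "2 stems + -ed": {
        "pieces": ["talk", "walk", "ed"],
        "encodings": {
            "talk": ["talk"], "talked": ["talk", "ed"],
            "walk": ["walk"], "walked": ["walk", "ed"],
        },
    },
    "t, w + alk, -ed": {
        "pieces": ["t", "w", "alk", "ed"],
        "encodings": {
            "talk": ["t", "alk"], "talked": ["t", "alk", "ed"],
            "walk": ["w", "alk"], "walked": ["w", "alk", "ed"],
        },
    },
}


def description_length(model) -> int:
    """Toy cost: letters stored in the model + pointers used to encode the words."""
    for word, pieces in model["encodings"].items():
        assert "".join(pieces) == word  # every encoding must rebuild its word
    letters = sum(len(piece) for piece in model["pieces"])
    pointers = sum(len(pieces) for pieces in model["encodings"].values())
    return letters + pointers


for name, model in MODELS.items():
    print(name, description_length(model))
```

Under this particular toy cost the stem-plus-affix model is the shortest, while the over-segmented model saves letters but pays for them in extra combination information. Different weightings of the two parts could change the ranking.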
113
Homework #3