CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot
Plan for Today’s Lecture Morphology: Definitions and Problems –What is Morphology? –Topology of Morphologies Approaches to Computational Morphology –Lexicons and Rules –Computational Morphology Approaches Assignment 2
Morphology
The study of the way words are built up from smaller meaning units called morphemes
Syntax       Lexeme / Inflected Lexeme   Grammars        sentences
Morphology   Morpheme / Allomorph        Morphotactics   words
Phonology    Phoneme / Allophone         Phonotactics    letters
Abstract versus Realized: HOP +PAST -> hop +ed -> hopped -> /hapt/ (each mapping is conditioned by Context)
Phonology and Morphology Phonology vs. Orthography Historical spelling –night, nite –attention, mission, fish Script Limitations –Spoken English has 14 vowels: heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough –English Alphabet has 5 Use vowel combinations: far fair fare Consonantal doubling (hopping vs. hoping)
Syntax and Morphology Phrase-level agreement –Subject-Verb John studies hard (STUDY+3SG) –Noun-Adjective Las vacas hermosas Sub-word phrasal structures –שבספרינו –ש+ב+ספר+ים+נו –That+in+book+PL+Poss:1PL –Which are in our books –Morpheme labels: conj, prep, noun, poss, plural, article
Topology of Morphologies Concatenative vs. Templatic Derivational vs. Inflectional Regular vs. Irregular
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
– hope+ing -> hoping
– hop+ing -> hopping
Affixes
– Prefixes: Antidisestablishmentarianism (anti-, dis-)
– Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism)
– Infixes: hingi (borrow), humingi (borrower) in Tagalog
– Circumfixes: sagen (say), gesagt (said) in German
Agglutinative Languages
– uygarlaştıramadıklarımızdanmışsınızcasına
– uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
– Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology Roots and Patterns
Root: K T B
Arabic: root + pattern ma??uu? -> مكتوب maktuub (written)
Hebrew: root + pattern ??uu? -> כתוב ktuuv (written)
Templatic Morphology: Root Meaning
KTB: writing “stuff”
Hebrew: כתב write, כתב writer, מכתב letter, כתיב spelling, כתובת address
Arabic: كتب write, كاتب writer, مكتوب letter, كتاب book, مكتبة library, مكتب office
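The root-and-pattern idea can be sketched in a few lines of Python. This is only an illustration: the transliterations are Latin-script approximations, and the convention of writing root slots as the digits 1, 2, 3 inside a pattern is an assumption made here, not a notation from the slides.

```python
def interdigitate(root: str, pattern: str) -> str:
    """Fill the numbered slots of a pattern with the consonants of a root.

    Digits 1..9 in the pattern select root consonants (hypothetical notation);
    every other character is copied through unchanged.
    """
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)


# Arabic patterns applied to the root k-t-b (transliterated).
ARABIC_PATTERNS = {
    "written / letter": "ma12uu3",  # maktuub
    "writer": "1aa2i3",             # kaatib
    "book": "1i2aa3",               # kitaab
    "office": "ma12a3",             # maktab
    "library": "ma12a3a",           # maktaba
}

for gloss, pattern in ARABIC_PATTERNS.items():
    print(f"{gloss:18} {interdigitate('ktb', pattern)}")
```

The same patterns can be applied to other roots, which is the sense in which the root carries the core meaning and the pattern carries the word class.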
Inflectional vs. Derivational Word Classes –Parts of speech: noun, verb, adjectives, etc. –Word class dictates how a word combines with morphemes to form new words
Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, clueless, embraceable CatVar: Categorial Variation Database
Inflectional morphology Adds: Tense, number, person, mood, aspect Word class doesn’t change Word serves new grammatical role Five verb forms in English Other languages have (lots more)
Nouns and Verbs (in English) Nouns have simple inflectional morphology –cat –cat+s, cat+’s Verbs have more complex morphology
Regulars and Irregulars Nouns –Cat/Cats –Mouse/Mice, Ox/Oxen, Goose/Geese Verbs –Walk/Walked –Go/Went, Fly/Flew
Regular (English) Verbs
Morphological Form Classes    Regularly Inflected Verbs
Stem                          walk      merge     try      map
-s form                       walks     merges    tries    maps
-ing form                     walking   merging   trying   mapping
Past form or -ed participle   walked    merged    tried    mapped
Irregular (English) Verbs
Morphological Form Classes    Irregularly Inflected Verbs
Stem                          eat       catch      cut
-s form                       eats      catches    cuts
-ing form                     eating    catching   cutting
Past form                     ate       caught     cut
-ed participle                eaten     caught     cut
“To love” in Spanish
Computational Morphology Finite State Morphology –Finite State Transducers (FST) Input/Output Analysis/Generation
Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)
Computational Morphology The Rules and the Lexicon –General versus Specific –Regular versus Irregular –Accuracy, speed, space –The Morphology of a language Approaches –Lexicon only –Rules only –Lexicon and Rules Finite-state Automata Finite-state Transducers
Lexicon-only Morphology
acclaim        acclaim $N$
acclaim        acclaim $V+0$
acclaimed      acclaim $V+ed$
acclaimed      acclaim $V+en$
acclaiming     acclaim $V+ing$
acclaims       acclaim $N+s$
acclaims       acclaim $V+s$
acclamation    acclamation $N$
acclamations   acclamation $N+s$
acclimate      acclimate $V+0$
acclimated     acclimate $V+ed$
acclimated     acclimate $V+en$
acclimates     acclimate $V+s$
acclimating    acclimate $V+ing$
The lexicon lists all surface level and lexical level pairs
No rules …?
Analysis/Generation is easy
Very large for English
What about Arabic or Turkish? Chinese?
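A minimal sketch of why analysis and generation are easy in the lexicon-only approach: both are plain table lookups over the listed pairs. The function names `analyze` and `generate` are illustrative choices, and only a handful of the pairs from the slide are included.

```python
from collections import defaultdict

# (surface form, lexical form) pairs copied from the slide.
PAIRS = [
    ("acclaim", "acclaim $N$"),
    ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"),
    ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"),
    ("acclaims", "acclaim $V+s$"),
]

# Index the same pairs in both directions.
SURFACE_TO_LEXICAL = defaultdict(list)
LEXICAL_TO_SURFACE = defaultdict(list)
for surface, lexical in PAIRS:
    SURFACE_TO_LEXICAL[surface].append(lexical)
    LEXICAL_TO_SURFACE[lexical].append(surface)


def analyze(surface: str) -> list:
    """Return every lexical form listed for a surface word (empty if unlisted)."""
    return SURFACE_TO_LEXICAL.get(surface, [])


def generate(lexical: str) -> list:
    """Return every surface word listed for a lexical form (empty if unlisted)."""
    return LEXICAL_TO_SURFACE.get(lexical, [])


print(analyze("acclaims"))
print(generate("acclaim $V+ing$"))
print(analyze("acclaimer"))  # unlisted word: no analysis at all
```

The last call shows the cost of the approach: a word that is not listed gets no analysis, so the table has to grow with every inflected form.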
Lexicon and Rules: FSA Inflectional Morphology
English Noun Lexicon
reg-noun   irreg-pl-noun   irreg-sg-noun   plural
fox        geese           goose           -s
cat        sheep           sheep
dog        mice            mouse
English Noun Rule
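The noun lexicon and rule can be sketched as a small recognizer. The state names q0, q1, q2 and the exact arcs are assumptions, since the FSA figure itself is not reproduced in the text; the sketch also works on plain concatenation, so it knows nothing about spelling changes.

```python
# Morpheme classes from the English noun lexicon on the slide.
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "plural": {"s"},
}

# Assumed automaton: a regular noun may be followed by the plural -s,
# irregular singular and plural forms go straight to the final state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-pl-noun"): "q2",
    ("q0", "irreg-sg-noun"): "q2",
    ("q1", "plural"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def accepts(word: str, state: str = "q0") -> bool:
    """Return True if the remaining input can be consumed ending in a final state."""
    if not word:
        return state in FINAL_STATES
    for (source, morph_class), target in TRANSITIONS.items():
        if source != state:
            continue
        for morph in LEXICON[morph_class]:
            if word.startswith(morph) and accepts(word[len(morph):], target):
                return True
    return False


for candidate in ["cat", "cats", "geese", "gooses", "sheeps", "foxes"]:
    print(candidate, accepts(candidate))
```

Note that "foxes" is rejected here: the recognizer only concatenates morphemes, and the e-insertion spelling change needs a separate rule component.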
FSA English Verb Inflectional Morphology
Lexicon classes: reg-verb-stem, irreg-verb-stem, irreg-past-verb, past, past-part, pres-part, 3sg
Stems and irregular forms: walk, fry, talk, impeach, cut, speak, spoken, sing, sang, caught, ate, eaten
Affixes: -ed, -ing, -s
FSA for Derivational Morphology: Adjectival Formation
More Complex Derivational Morphology
Using FSAs for Recognition: English Nouns and their Inflection
Morphological Parsing Finite-state automata (FSA) –Recognizer –One-level morphology Finite-state transducers (FST) –Two-level morphology PC-Kimmo (Koskenniemi 83) –input-output pair
Terminology for PC-Kimmo Upper = lexical tape Lower = surface tape Characters correspond to pairs, written a:b If “a:a”, write “a” for shorthand Two-level lexical entries # = word boundary ^ = morpheme boundary Other = “any feasible pair that is not in this transducer”
Four-Fold View of FSTs As a recognizer As a generator As a translator As a set relater
Nominal Inflection FST
Lexical and Intermediate Tapes
Spelling Rules
Name                 Rule Description                              Example
Consonant Doubling   1-letter consonant doubled before -ing/-ed    beg/begging
E-deletion           Silent e dropped before -ing and -ed          make/making
E-insertion          e added after s, z, x, ch, sh before s        watch/watches
Y-replacement        -y changes to -ie before -s, -i before -ed    try/tries
K-insertion          verbs ending with vowel + -c add -k           panic/panicked
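Three of the spelling rules above can be sketched as regular-expression rewrites over an intermediate form in which "^" marks a morpheme boundary and "#" the end of the word. Several things here are assumptions for illustration: the rule order, the restriction of Y-replacement to a y that follows a consonant (so that play + s stays plays), and the treatment of every stem-final e as silent. Consonant doubling is left out because it needs information about stress that a plain string does not carry.

```python
import re

# (pattern over the intermediate form, replacement); applied in this order.
SPELLING_RULES = [
    # E-deletion: make^ing# -> mak^ing#
    (re.compile(r"e\^(?=(?:ing|ed)#)"), "^"),
    # Y-replacement before -s: try^s# -> trie^s#
    (re.compile(r"(?<=[^aeiou])y\^(?=s#)"), "ie^"),
    # Y-replacement before -ed: try^ed# -> tri^ed#
    (re.compile(r"(?<=[^aeiou])y\^(?=ed#)"), "i^"),
    # K-insertion: panic^ed# -> panick^ed#
    (re.compile(r"([aeiou]c)\^(?=(?:ing|ed)#)"), r"\1k^"),
]


def to_surface(intermediate: str) -> str:
    """Apply each spelling rule once, then erase the boundary symbols."""
    form = intermediate
    for pattern, replacement in SPELLING_RULES:
        form = pattern.sub(replacement, form)
    return form.replace("^", "").replace("#", "")


for form in ["make^ing#", "try^s#", "try^ed#", "panic^ed#", "play^s#", "walk^ed#"]:
    print(form, "->", to_surface(form))
```

Because the e-deletion pattern cannot tell a silent e from a pronounced one, a form such as see^ing# would come out wrong; a fuller treatment would need more context than this sketch uses.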
Chomsky and Halle Notation
ε → e / {x, s, z} ^ __ s #
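Read left to right, the rule says: insert an e (rewrite the empty string as e) when the left context is x, s, or z followed by a morpheme boundary, and the right context is s followed by the word boundary. A minimal sketch as a zero-width regular-expression substitution follows; the helper names are illustrative, and the rule as written covers only x, s, and z, not the ch and sh cases.

```python
import re

# Left context [xsz]^ as a lookbehind, right context s# as a lookahead.
# The match itself is empty, mirroring the ε on the left of the arrow.
E_INSERTION = re.compile(r"(?<=[xsz]\^)(?=s#)")


def apply_e_insertion(intermediate: str) -> str:
    """Insert e at every position licensed by the rule."""
    return E_INSERTION.sub("e", intermediate)


def erase_boundaries(form: str) -> str:
    """Drop the morpheme (^) and word (#) boundary symbols."""
    return form.replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "buzz^s#", "glass^s#"]:
    rewritten = apply_e_insertion(form)
    print(form, "->", rewritten, "->", erase_boundaries(rewritten))
```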
Intermediate-to-Surface Transducer
State Transition Table
Two-Level Morphology
Sample Run
FST Properties
Inversion
– T^-1 = inversion of T
– Input/Output switched
Composition
– T_1 maps I_1 to O_1
– T_2 maps I_2 to O_2
– T_1 ∘ T_2 maps I_1 to O_2
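Inversion and composition can be illustrated on finite relations, treating a transducer as the set of (input, output) pairs it relates. This is a simplification: real FSTs usually encode infinite relations, and both operations are carried out on the machines rather than on enumerated pairs. The example strings are illustrative.

```python
def invert(relation: set) -> set:
    """Swap input and output in every pair."""
    return {(output, inp) for inp, output in relation}


def compose(first: set, second: set) -> set:
    """Feed the outputs of `first` into `second` as inputs."""
    return {
        (inp, final)
        for inp, middle in first
        for middle_in, final in second
        if middle == middle_in
    }


# T_1: lexical tape -> intermediate tape; T_2: intermediate tape -> surface tape.
T1 = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
T2 = {("fox^s#", "foxes"), ("cat^s#", "cats")}

generator = compose(T1, T2)   # lexical -> surface
analyzer = invert(generator)  # surface -> lexical

print(sorted(generator))
print(sorted(analyzer))
```

The same composed relation serves generation in one direction and analysis in the other, which is why inversion matters for morphological parsing.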
FSTs and ambiguity
Kimmo Demo
Parse Example 1: unionizable
– union +ize +able
– un+ ion +ize +able
Parse Example 2: assess
– assess V
– ass N +ess N
Parse Example 3: tender
– tender AJ
– ten Num +d AJ +er CMP
What to do about Global Ambiguity? Accept first successful structure Run parser through all possible paths Bias the search in some manner
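The three strategies can be contrasted on a toy segmenter that enumerates every way of splitting a word into listed morphs. The morph list and the "fewest morphemes" bias are assumptions for illustration; this is not how PC-Kimmo itself searches.

```python
# Surface morphs taken from the assess and tender examples.
MORPHS = {"assess", "ass", "ess", "tender", "ten", "d", "er"}


def all_parses(word: str) -> list:
    """Return every segmentation of `word` into morphs from MORPHS."""
    if not word:
        return [[]]
    parses = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in MORPHS:
            for rest in all_parses(word[end:]):
                parses.append([prefix] + rest)
    return parses


for word in ["assess", "tender"]:
    parses = all_parses(word)          # run through all possible paths
    first = parses[0]                  # accept the first successful structure
    biased = min(parses, key=len)      # bias: prefer the fewest morphemes
    print(word, parses, first, biased)
```

With this search order the first parse found is the over-segmented one, which shows why accepting the first success can be a poor default.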
Lexicon-Free Morphology: Porter Stemmer Lexicon-Free FST Approach By Martin Porter (1980) Cascade of substitutions given specific conditions GENERALIZATIONS -> GENERALIZATION -> GENERALIZE -> GENERAL -> GENER Porter Stemmer Game
Porter Stemmer Definitions
C = consonant = any letter other than A, E, I, O, U, and other than Y preceded by a consonant
V = vowel = not C
M = Measure: Words = C*(V+C+){M}V*
– M=0 TR, EE, TREE, Y, BY
– M=1 TROUBLE, OATS, TREES, IVY
– M=2 TROUBLES, PRIVATE, OATEN, ORRERY
Conditions
– *S - stem ends with S
– *v* - stem contains a V
– *d - stem ends with double C (-DD, -ZZ)
– *o - stem ends CVC, where the second C is not W, X or Y (-WIL, -SOB)
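The measure can be computed by mapping each letter to C or V, collapsing runs, and counting VC transitions. The sketch below follows the definitions above; the function names are illustrative and not taken from any particular Porter implementation.

```python
import re

VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """A, E, I, O, U are vowels; Y is a vowel only when preceded by a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the VC sequences once runs of consonants and vowels are collapsed."""
    stem = stem.lower()
    shape = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"V+", "V", re.sub(r"C+", "C", shape))
    return collapsed.count("VC")


for word in ["TR", "EE", "TREE", "Y", "BY",
             "TROUBLE", "OATS", "TREES", "IVY",
             "TROUBLES", "PRIVATE", "OATEN", "ORRERY"]:
    print(word, measure(word))
```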
Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
SSES -> SS caresses -> caress
IES -> I ponies -> poni, ties -> ti
SS -> SS caress -> caress
S -> (null) cats -> cat
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED -> EE feed -> feed, agreed -> agree
i (*v*) ED -> (null) plastered -> plaster, bled -> bled
ii (*v*) ING -> (null) motoring -> motor, sing -> sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT -> ATE conflat(ed) -> conflate
BL -> BLE troubl(ed) -> trouble
IZ -> IZE siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter hopp(ing) -> hop, tann(ed) -> tan, hiss(ing) -> hiss, fizz(ed) -> fizz
(M=1 and *o) -> E fail(ing) -> fail, fil(ing) -> file
* = ends with, *v* = contains a V, *d = ends with double C, *o = ends with CVC where the second C is not W, X or Y
Porter Stemmer
Step 3: Y -> I
(*v*) Y -> I happy -> happi, sky -> sky
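Step 1 can be sketched as an ordered list of suffix rewrites in which the first (longest) matching suffix wins and no later rule is tried. Only Step 1 is covered here; the later steps also need the measure and the *v*, *d, *o conditions.

```python
# (suffix, replacement) pairs for Step 1, longest suffix first.
STEP1_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1(word: str) -> str:
    """Apply the first Step 1 rule whose suffix matches, if any."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for word in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(word, "->", step1(word))
```

The SS -> SS rule looks redundant but is what keeps "caress" from losing its final s to the last rule.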
Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE relational -> relate
(m>0) TIONAL -> TION conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE valenci -> valence
(m>0) ANCI -> ANCE hesitanci -> hesitance
(m>0) IZER -> IZE digitizer -> digitize
(m>0) ABLI -> ABLE conformabli -> conformable
(m>0) ALLI -> AL radicalli -> radical
(m>0) ENTLI -> ENT differentli -> different
(m>0) ELI -> E vileli -> vile
(m>0) OUSLI -> OUS analogousli -> analogous
(m>0) IZATION -> IZE vietnamization -> vietnamize
(m>0) ATION -> ATE predication -> predicate
(m>0) ATOR -> ATE operator -> operate
(m>0) ALISM -> AL feudalism -> feudal
(m>0) IVENESS -> IVE decisiveness -> decisive
(m>0) FULNESS -> FUL hopefulness -> hopeful
(m>0) OUSNESS -> OUS callousness -> callous
(m>0) ALITI -> AL formaliti -> formal
(m>0) IVITI -> IVE sensitiviti -> sensitive
(m>0) BILITI -> BLE sensibiliti -> sensible
Porter Stemmer
Step 5: Derivational Morphology II: More Multiple Suffixes
(m>0) ICATE -> IC triplicate -> triplic
(m>0) ATIVE -> (null) formative -> form
(m>0) ALIZE -> AL formalize -> formal
(m>0) ICITI -> IC electriciti -> electric
(m>0) ICAL -> IC electrical -> electric
(m>0) FUL -> (null) hopeful -> hope
(m>0) NESS -> (null) goodness -> good
Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL -> (null) revival -> reviv
(m>1) ANCE -> (null) allowance -> allow
(m>1) ENCE -> (null) inference -> infer
(m>1) ER -> (null) airliner -> airlin
(m>1) IC -> (null) gyroscopic -> gyroscop
(m>1) ABLE -> (null) adjustable -> adjust
(m>1) IBLE -> (null) defensible -> defens
(m>1) ANT -> (null) irritant -> irrit
(m>1) EMENT -> (null) replacement -> replac
(m>1) MENT -> (null) adjustment -> adjust
(m>1) ENT -> (null) dependent -> depend
(m>1 and (*S or *T)) ION -> (null) adoption -> adopt
(m>1) OU -> (null) homologou -> homolog
(m>1) ISM -> (null) communism -> commun
(m>1) ATE -> (null) activate -> activ
(m>1) ITI -> (null) angulariti -> angular
(m>1) OUS -> (null) homologous -> homolog
(m>1) IVE -> (null) effective -> effect
(m>1) IZE -> (null) bowdlerize -> bowdler
Porter Stemmer
Step 7a: Cleanup
(m>1) E -> (null) probate -> probat, rate -> rate
(m=1 and not *o) E -> (null) cease -> ceas
Step 7b: More Cleanup
(m > 1 and *d and *L) -> single letter controll -> control, roll -> roll
Porter Stemmer
Errors of Omission
– European / Europe
– analysis / analyzes
– matrices / matrix
– noise / noisy
– explain / explanation
Errors of Commission
– organization / organ
– doing / doe
– generalization / generic
– numerical / numerous
– university / universe
Readings for next time J&M Chapter 6
Assignment 2 Due Date is Midnight 2/25/2004