CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin.


1 CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot

2 Plan for Today’s Lecture Morphology: Definitions and Problems –What is Morphology? –Topology of Morphologies Approaches to Computational Morphology –Lexicons and Rules –Computational Morphology Approaches Assignment 2

3 Morphology The study of the way words are built up from smaller meaning units called morphemes
  Syntax:      Lexeme / Inflected Lexeme   Grammars        sentences
  Morphology:  Morpheme / Allomorph        Morphotactics   words
  Phonology:   Phoneme / Allophone         Phonotactics    letters
Abstract versus Realized: HOP +PAST → hop +ed → hopped → /hapt/ (each mapping is conditioned by context)

4 Phonology and Morphology Phonology vs. Orthography Historical spelling –night, nite –attention, mission, fish Script Limitations –Spoken English has 14 vowels: heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough –English alphabet has 5 vowel letters Use vowel combinations: far fair fare Consonantal doubling (hopping vs. hoping)

5 Syntax and Morphology Phrase-level agreement –Subject-Verb John studies hard (STUDY+3SG) –Noun-Adjective Las vacas hermosas Sub-word phrasal structures –שבספרינו –ש+ב+ספר+ים+נו –That+in+book+PL+Poss:1PL –Which are in our books Morpheme labels: conj, prep, noun, poss, plural, article

6 Topology of Morphologies Concatenative vs. Templatic Derivational vs. Inflectional Regular vs. Irregular

7 Concatenative Morphology Morpheme+Morpheme+Morpheme+… Stems: also called lemma, base form, root, lexeme –hope+ing → hoping, hop → hopping Affixes –Prefixes: Antidisestablishmentarianism (anti-, dis-) –Suffixes: Antidisestablishmentarianism (-ment, -arian, -ism) –Infixes: hingi (borrow) – humingi (borrower) in Tagalog –Circumfixes: sagen (say) – gesagt (said) in German Agglutinative Languages –uygarlaştıramadıklarımızdanmışsınızcasına –uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına –Behaving as if you are among those whom we could not cause to become civilized

8 Templatic Morphology Roots and Patterns
  Root: K T B (Arabic ك ت ب, Hebrew כ ת ב)
  Arabic: مكتوب maktuub "written" (pattern ma??uu?)
  Hebrew: כתוב ktuuv "written" (pattern ??uu?)

9 Templatic Morphology: Root Meaning KTB: writing “stuff”
  Hebrew: כתב (write), כתב (writer), מכתב (letter), כתיב (spelling), כתובת (address)
  Arabic: كتب (write), كاتب (writer), مكتوب (letter), كتاب (book), مكتبة (library), مكتب (office)

10 Inflectional vs. Derivational Word Classes –Parts of speech: noun, verb, adjectives, etc. –Word class dictates how a word combines with morphemes to form new words

11 Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, clueless, embraceable CatVar: Categorial Variation Database http://clipdemos.umiacs.umd.edu/catvar/

12 Inflectional morphology Adds: Tense, number, person, mood, aspect Word class doesn’t change Word serves new grammatical role Five verb forms in English Other languages have (lots more)

13 Nouns and Verbs (in English) Nouns have simple inflectional morphology –cat –cat+s, cat+’s Verbs have more complex morphology

14 Regulars and Irregulars Nouns –Cat/Cats –Mouse/Mice, Ox/Oxen, Goose/Geese Verbs –Walk/Walked –Go/Went, Fly/Flew

15 Regular (English) Verbs
  Morphological Form Classes     Regularly Inflected Verbs
  Stem                           walk      merge     try      map
  -s form                        walks     merges    tries    maps
  -ing form                      walking   merging   trying   mapping
  Past form or -ed participle    walked    merged    tried    mapped

16 Irregular (English) Verbs
  Morphological Form Classes     Irregularly Inflected Verbs
  Stem                           eat       catch      cut
  -s form                        eats      catches    cuts
  -ing form                      eating    catching   cutting
  Past form                      ate       caught     cut
  -ed participle                 eaten     caught     cut

17 “To love” in Spanish

18 Computational Morphology Finite State Morphology –Finite State Transducers (FST) Input/Output Analysis/Generation

19 Computational Morphology
  WORD      STEM (+FEATURES)*
  cats      cat +N +PL
  cat       cat +N +SG
  cities    city +N +PL
  geese     goose +N +PL
  ducks     (duck +N +PL) or (duck +V +3SG)
  merging   merge +V +PRES-PART
  caught    (catch +V +PAST-PART) or (catch +V +PAST)

20 Computational Morphology The Rules and the Lexicon –General versus Specific –Regular versus Irregular –Accuracy, speed, space –The Morphology of a language Approaches –Lexicon only –Rules only –Lexicon and Rules Finite-state Automata Finite-state Transducers

21 Lexicon-only Morphology
  acclaim        acclaim $N$
  acclaim        acclaim $V+0$
  acclaimed      acclaim $V+ed$
  acclaimed      acclaim $V+en$
  acclaiming     acclaim $V+ing$
  acclaims       acclaim $N+s$
  acclaims       acclaim $V+s$
  acclamation    acclamation $N$
  acclamations   acclamation $N+s$
  acclimate      acclimate $V+0$
  acclimated     acclimate $V+ed$
  acclimated     acclimate $V+en$
  acclimates     acclimate $V+s$
  acclimating    acclimate $V+ing$
The lexicon lists all surface level and lexical level pairs No rules …? Analysis/Generation is easy Very large for English What about Arabic or Turkish? Chinese?
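
The lexicon-only approach on this slide can be sketched as a pair of lookup tables. This is a minimal illustration, not code from the lecture: the names `LEXICON`, `analyze`, and `generate` are hypothetical, and only the first seven pairs from the slide are included.

```python
from collections import defaultdict

# (surface form, lexical form) pairs copied from the slide's lexicon excerpt.
LEXICON = [
    ("acclaim", "acclaim $N$"),
    ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"),
    ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"),
    ("acclaims", "acclaim $V+s$"),
]

# Analysis maps one surface form to every lexical form listed for it.
ANALYZE = defaultdict(list)
# Generation maps one lexical form back to its surface form.
GENERATE = {}
for surface, lexical in LEXICON:
    ANALYZE[surface].append(lexical)
    GENERATE[lexical] = surface


def analyze(surface):
    """Return all lexical forms for a surface word (empty list if unlisted)."""
    return list(ANALYZE.get(surface, []))


def generate(lexical):
    """Return the surface word for a lexical form (None if unlisted)."""
    return GENERATE.get(lexical)


print(analyze("acclaims"))
print(generate("acclaim $V+ing$"))
```

Both directions are a single dictionary lookup, which is why analysis and generation are easy here. The cost is that every inflected form must be stored, and an unlisted word such as "acclimatize" simply has no analysis.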

22 Lexicon and Rules FSA Inflectional Morphology
  English Noun Lexicon
    reg-noun:        fox, cat, dog
    irreg-pl-noun:   geese, sheep, mice
    irreg-sg-noun:   goose, sheep, mouse
  English Noun Rule
    plural:          -s

23 FSA English Verb Inflectional Morphology
  Classes: reg-verb-stem, irreg-verb-stem, irreg-past-verb, past, past-part, pres-part, 3sg
  Example entries: walk, fry, talk, impeach, cut, speak, spoken, sing, sang, caught, ate, eaten
  Affixes: -ed, -ing, -s
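
The noun lexicon and plural rule on this slide can be read as a small recognizer. The sketch below is an assumption about how the pieces fit together (the function name `recognize_noun` is hypothetical): irregular forms are accepted as listed, and regular nouns are accepted bare or with the plural `-s`.

```python
# Word classes taken from the slide's English noun lexicon.
REG_NOUN = {"fox", "cat", "dog"}
IRREG_SG_NOUN = {"goose", "sheep", "mouse"}
IRREG_PL_NOUN = {"geese", "sheep", "mice"}


def recognize_noun(word):
    """Return True if the word is accepted by the lexicon plus the plural rule."""
    # Listed forms are accepted as they are.
    if word in REG_NOUN or word in IRREG_SG_NOUN or word in IRREG_PL_NOUN:
        return True
    # Plural rule: only regular nouns may take -s.
    if word.endswith("s") and word[:-1] in REG_NOUN:
        return True
    return False


for w in ["cats", "geese", "gooses", "foxs", "foxes"]:
    print(w, recognize_noun(w))
```

Because this recognizer only concatenates morphemes, it accepts the misspelling "foxs" and rejects "foxes". Handling that case is the job of the spelling rules introduced later in the lecture.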

24 FSA for Derivational Morphology: Adjectival Formation

25 More Complex Derivational Morphology

26 Using FSAs for Recognition: English Nouns and their Inflection

27 Morphological Parsing Finite-state automata (FSA) –Recognizer –One-level morphology Finite-state transducers (FST) –Two-level morphology PC-Kimmo (Koskenniemi 83) –input-output pair

28 Terminology for PC-Kimmo Upper = lexical tape Lower = surface tape Characters correspond to pairs, written a:b If “a:a”, write “a” for shorthand Two-level lexical entries # = word boundary ^ = morpheme boundary Other = “any feasible pair that is not in this transducer”

29 Four-Fold View of FSTs As a recognizer As a generator As a translator As a set relater

30 Nominal Inflection FST

31 Lexical and Intermediate Tapes

32 Spelling Rules
  Name                 Rule Description                                Example
  Consonant Doubling   1-letter consonant doubled before -ing/-ed      beg/begging
  E-deletion           Silent e dropped before -ing and -ed            make/making
  E-insertion          e added after s, z, x, ch, sh before s          watch/watches
  Y-replacement        -y changes to -ie before -s, -i before -ed      try/tries
  K-insertion          verbs ending with vowel + -c add -k             panic/panicked

33 Chomsky and Halle Notation ε → e / {x, s, z} ^ __ s #
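
The E-insertion rule in this notation says: insert an e after x, s, or z at a morpheme boundary (^) when the next morpheme is s followed by the word boundary (#). A rough way to try the rule out is a regular-expression rewrite over intermediate-tape strings. This is only a sketch with hypothetical function names; it is not the transducer from the following slides, and it covers only the x/s/z context of this rule, not the ch/sh cases listed in the spelling-rule table.

```python
import re


def e_insertion(intermediate):
    """Apply  ε → e / {x, s, z} ^ __ s #  to an intermediate-tape string."""
    # Match x, s, or z followed by the morpheme boundary, but only when
    # the plural s and the word boundary come next (lookahead).
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)


def to_surface(intermediate):
    """Apply the rule, then erase the boundary symbols ^ and #."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "fox#"]:
    print(form, "->", to_surface(form))
```

The lookahead keeps the rule from firing when the context is not met, so "cat^s#" passes through unchanged apart from boundary removal.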

34 Intermediate-to-Surface Transducer

35 State Transition Table

36 Two-Level Morphology

37 Sample Run

38 FST Properties
  Inversion: T⁻¹ = inversion of T (input/output switched)
  Composition: T₁ maps I₁ to O₁; T₂ maps I₂ to O₂; T₁ ° T₂ maps I₁ to O₂
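
Inversion and composition can be illustrated on finite relations stored as sets of (input, output) pairs. Real transducers encode possibly infinite relations as state machines, so this is only a toy model; the two example relations and the function names are assumptions made up for the illustration.

```python
def invert(relation):
    """T^-1: swap the input and output of every pair."""
    return {(out, inp) for inp, out in relation}


def compose(t1, t2):
    """T1 composed with T2: feed each output of T1 into T2 as input."""
    return {(a, c) for a, b in t1 for b2, c in t2 if b == b2}


# Hypothetical lexical-to-intermediate and intermediate-to-surface relations.
LEX_TO_INTER = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTER_TO_SURF = {("fox^s#", "foxes"), ("cat^s#", "cats")}

generator = compose(LEX_TO_INTER, INTER_TO_SURF)  # lexical -> surface
analyzer = invert(generator)                      # surface -> lexical

print(sorted(generator))
print(sorted(analyzer))
```

Composition removes the intermediate tape from view, and inversion turns the resulting generator into an analyzer without writing a second set of rules.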

39 FSTs and ambiguity Kimmo Demo Parse Example 1: unionizable union +ize +able un+ ion +ize +able Parse Example 2: assess assess v ass N +ess N Parse Example 3: tender tender AJ ten Num +d AJ +er CMP

40 What to do about Global Ambiguity? Accept first successful structure Run parser through all possible paths Bias the search in some manner
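
The second option, running the parser through all possible paths, can be sketched as an exhaustive segmentation search. The morph table below is a hypothetical toy lexicon chosen to reproduce the "unionizable" example from the previous slide; it maps surface pieces to lexical morphemes and ignores morphotactics, so it would overgenerate on other inputs.

```python
# Hypothetical surface-morph table: surface piece -> lexical morpheme.
MORPHS = {
    "un": "un+",
    "union": "union",
    "ion": "ion",
    "iz": "+ize",   # surface form of +ize after e-deletion
    "able": "+able",
}


def all_parses(surface):
    """Return every way to split `surface` into pieces found in MORPHS."""
    if not surface:
        return [[]]  # one parse of the empty string: no morphemes
    parses = []
    for end in range(1, len(surface) + 1):
        piece = surface[:end]
        if piece in MORPHS:
            for rest in all_parses(surface[end:]):
                parses.append([MORPHS[piece]] + rest)
    return parses


for parse in all_parses("unionizable"):
    print(" ".join(parse))
```

Taking only the first element of the result corresponds to the "accept first successful structure" option, and sorting the list by some score would be one way to bias the search.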

41 Computational Morphology The Rules and the Lexicon –General versus Specific –Regular versus Irregular –Accuracy, speed, space –The Morphology of a language Approaches –Lexicon only –Rules only –Lexicon and Rules Finite-state Automata Finite-state Transducers

42 Lexicon-Free Morphology: Porter Stemmer Lexicon-Free FST Approach By Martin Porter (1980) http://www.tartarus.org/%7Emartin/PorterStemmer/ Cascade of substitutions given specific conditions: GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER Porter Stemmer Game

43 Porter Stemmer Definitions
  C = consonant = any letter other than A, E, I, O, U, and other than Y preceded by a consonant
  V = vowel = not C
  M = Measure: every word has the form C*(V+C+){M}V*
    M=0: TR, EE, TREE, Y, BY
    M=1: TROUBLE, OATS, TREES, IVY
    M=2: TROUBLES, PRIVATE, OATEN, ORRERY
  Conditions
    *S: stem ends with S
    *v*: stem contains a V
    *d: stem ends with double C (-DD, -ZZ)
    *o: stem ends CVC, where the second C is not W, X or Y (-WIL, -SOB)
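
The measure M counts how many times a run of vowels is followed by a run of consonants. A minimal sketch of that computation, assuming lowercase ASCII input and using hypothetical helper names, is:

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the Porter definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant unless it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Number of vowel-run to consonant-run transitions in the word."""
    word = word.lower()
    pattern = "".join(
        "C" if is_consonant(word, i) else "V" for i in range(len(word))
    )
    # Each place where a V is directly followed by a C closes one VC group.
    return pattern.count("VC")


for w in ["tree", "by", "trouble", "ivy", "troubles", "orrery"]:
    print(w, measure(w))
```

Mapping the word to a C/V pattern first keeps the Y rule in one place, and counting "VC" boundaries gives the same value as matching the full C*(V+C+){M}V* form.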

44 Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
  SSES → SS     caresses → caress
  IES → I       ponies → poni, ties → ti
  SS → SS       caress → caress
  S → (null)    cats → cat
Step 2a: Verbal Past Tense and Progressive Forms
  (M>0) EED → EE            feed → feed, agreed → agree
  i.  (*v*) ED → (null)     plastered → plaster, bled → bled
  ii. (*v*) ING → (null)    motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
  AT → ATE      conflat(ed) → conflate
  BL → BLE      troubl(ed) → trouble
  IZ → IZE      siz(ed) → size
  (*d and not (*L or *S or *Z)) → single letter    hopp(ing) → hop, tann(ed) → tan, hiss(ing) → hiss, fizz(ed) → fizz
  (M=1 and *o) → E          fail(ing) → fail, fil(ing) → file
Legend: *X = ends with X; *v* = contains a V; *d = ends with double C; *o = ends with CVC, second C is not W, X or Y

45 Porter Stemmer
Step 3: Y → I
  (*v*) Y → I   happy → happi, sky → sky
Legend: *X = ends with X; *v* = contains a V; *d = ends with double C; *o = ends with CVC, second C is not W, X or Y
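
Step 1 is a small ordered list of suffix rewrites in which the longest matching suffix wins. A minimal sketch, assuming lowercase input and a hypothetical function name, is:

```python
def step1_plural(word):
    """Apply the Step 1 plural rules; the longest matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (blocks the plain S rule)
    if word.endswith("s"):
        return word[:-1]      # S -> (null)
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1_plural(w))
```

The SS → SS rule changes nothing by itself, but testing it before the S rule is what keeps "caress" from being cut down to "cares".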
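
The Y → I step depends on the *v* condition, which is checked on the stem that remains after removing the final Y. The following sketch assumes lowercase input and uses hypothetical helper names:

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the Porter definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a consonant unless it is preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def contains_vowel(stem):
    """The *v* condition: at least one letter of the stem is a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step3(word):
    """(*v*) Y -> I: replace a final y with i if the stem has a vowel."""
    if word.endswith("y") and contains_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


for w in ["happy", "sky"]:
    print(w, "->", step3(w))
```

For "sky" the stem "sk" has no vowel, so the rule does not apply and the word is returned unchanged.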

46 Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
  (m>0) ATIONAL -> ATE    relational -> relate
  (m>0) TIONAL -> TION    conditional -> condition, rational -> rational
  (m>0) ENCI -> ENCE      valenci -> valence
  (m>0) ANCI -> ANCE      hesitanci -> hesitance
  (m>0) IZER -> IZE       digitizer -> digitize
  (m>0) ABLI -> ABLE      conformabli -> conformable
  (m>0) ALLI -> AL        radicalli -> radical
  (m>0) ENTLI -> ENT      differentli -> different
  (m>0) ELI -> E          vileli -> vile
  (m>0) OUSLI -> OUS      analogousli -> analogous
  (m>0) IZATION -> IZE    vietnamization -> vietnamize
  (m>0) ATION -> ATE      predication -> predicate
  (m>0) ATOR -> ATE       operator -> operate
  (m>0) ALISM -> AL       feudalism -> feudal
  (m>0) IVENESS -> IVE    decisiveness -> decisive
  (m>0) FULNESS -> FUL    hopefulness -> hopeful
  (m>0) OUSNESS -> OUS    callousness -> callous
  (m>0) ALITI -> AL       formaliti -> formal
  (m>0) IVITI -> IVE      sensitiviti -> sensitive
  (m>0) BILITI -> BLE     sensibiliti -> sensible
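
The Step 4 table lends itself to a table-driven implementation: find the longest suffix that matches, then rewrite it only if the remaining stem has m > 0. The sketch below uses a subset of the rules from the slide; the names `STEP4_RULES` and `step4` are hypothetical, and lowercase input is assumed.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under the Porter definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-run to consonant-run transitions in the stem."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    return pattern.count("VC")


# A subset of the slide's Step 4 rules: (suffix, replacement).
STEP4_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("izer", "ize"),
    ("ization", "ize"), ("ation", "ate"), ("fulness", "ful"),
    ("biliti", "ble"),
]
# Try longer suffixes first so the longest match decides which rule is used.
STEP4_RULES.sort(key=lambda rule: len(rule[0]), reverse=True)


def step4(word):
    """Rewrite the longest matching suffix if the stem satisfies m > 0."""
    for suffix, replacement in STEP4_RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # longest suffix matched, but its condition failed
    return word


for w in ["relational", "conditional", "rational", "vietnamization"]:
    print(w, "->", step4(w))
```

Returning the word unchanged when the condition fails is what leaves "rational" alone: its longest matching suffix is ATIONAL, the stem "r" has m = 0, and no shorter rule is tried afterwards.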

47 Porter Stemmer
Step 5: Derivational Morphology II: More Multiple Suffixes
  (m>0) ICATE -> IC       triplicate -> triplic
  (m>0) ATIVE -> (null)   formative -> form
  (m>0) ALIZE -> AL       formalize -> formal
  (m>0) ICITI -> IC       electriciti -> electric
  (m>0) ICAL -> IC        electrical -> electric
  (m>0) FUL -> (null)     hopeful -> hope
  (m>0) NESS -> (null)    goodness -> good

48 Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes (each suffix is deleted)
  (m>1) AL       revival -> reviv
  (m>1) ANCE     allowance -> allow
  (m>1) ENCE     inference -> infer
  (m>1) ER       airliner -> airlin
  (m>1) IC       gyroscopic -> gyroscop
  (m>1) ABLE     adjustable -> adjust
  (m>1) IBLE     defensible -> defens
  (m>1) ANT      irritant -> irrit
  (m>1) EMENT    replacement -> replac
  (m>1) MENT     adjustment -> adjust
  (m>1) ENT      dependent -> depend
  (m>1 and (*S or *T)) ION    adoption -> adopt
  (m>1) OU       homologou -> homolog
  (m>1) ISM      communism -> commun
  (m>1) ATE      activate -> activ
  (m>1) ITI      angulariti -> angular
  (m>1) OUS      homologous -> homolog
  (m>1) IVE      effective -> effect
  (m>1) IZE      bowdlerize -> bowdler
Legend: *X = ends with X; *v* = contains a V; *d = ends with double C; *o = ends with CVC, second C is not W, X or Y

49 Porter Stemmer
Step 7a: Cleanup
  (m>1) E → (null)               probate → probat, rate → rate
  (m=1 and not *o) E → (null)    cease → ceas
Step 7b: More Cleanup
  (m>1 and *d and *L) → single letter    controll → control, roll → roll
Legend: *X = ends with X; *v* = contains a V; *d = ends with double C; *o = ends with CVC, second C is not W, X or Y

50 Porter Stemmer
Errors of Omission
  European / Europe
  analysis / analyzes
  matrices / matrix
  noise / noisy
  explain / explanation
Errors of Commission
  organization / organ
  doing / doe
  generalization / generic
  numerical / numerous
  university / universe

51 Readings for next time J&M Chapter 6

52 Text from Assignment here… Assignment 2 Due Date is Midnight 2/25/2004

