CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin.


CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin Madnani and Nate Waisbrot

Plan for Today’s Lecture
Morphology: Definitions and Problems
–What is Morphology?
–Topology of Morphologies
Approaches to Computational Morphology
–Lexicons and Rules
–Computational Morphology Approaches
Assignment 2

Morphology
The study of the way words are built up from smaller meaning units called morphemes
–Syntax: Lexeme/Inflected Lexeme; Grammars; sentences
–Morphology: Morpheme/Allomorph; Morphotactics; words
–Phonology: Phoneme/Allophone; Phonotactics; letters
Abstract versus Realized
–HOP +PAST → hop +ed → hopped → /hapt/ (each step depends on context)

Phonology and Morphology
Phonology vs. Orthography
Historical spelling
–night, nite
–attention, mission, fish
Script Limitations
–Spoken English has 14 vowels: heed, hid, hayed, head, had, hoed, hood, who’d, hide, how’d, taught, Tut, toy, enough
–English Alphabet has 5
Use vowel combinations: far, fair, fare
Consonantal doubling (hopping vs. hoping)

Syntax and Morphology
Phrase-level agreement
–Subject-Verb: John studies hard (STUDY+3SG)
–Noun-Adjective: Las vacas hermosas
Sub-word phrasal structures
–שבספרינו
–ש+ב+ספר+ים+נו
–That+in+book+PL+Poss:1PL
–Which are in our books
Labels: conj, prep, noun, poss, plural, article

Topology of Morphologies
Concatenative vs. Templatic
Derivational vs. Inflectional
Regular vs. Irregular

Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
–hope+ing → hoping
–hop+ing → hopping
Affixes
–Prefixes: Antidisestablishmentarianism
–Suffixes: Antidisestablishmentarianism
–Infixes: hingi (borrow) – humingi (borrower) in Tagalog
–Circumfixes: sagen (say) – gesagt (said) in German
Agglutinative Languages
–uygarlaştıramadıklarımızdanmışsınızcasına
–uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
–Behaving as if you are among those whom we could not cause to become civilized

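Templatic Morphology
Roots and Patterns
–Root: K T B (Arabic ك ت ب, Hebrew כ ת ב)
–Arabic: root K T B + pattern ma??uu? → مكتوب (maktuub, "written")
–Hebrew: root K T B + pattern ??uu? → כתוב (ktuuv, "written")
–Each ? in a pattern is a slot filled by a root consonant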

Templatic Morphology: Root Meaning
KTB: writing “stuff”
Hebrew
–כתב (write)
–כתב (writer)
–מכתב (letter)
–כתיב (spelling)
–כתובת (address)
Arabic
–كتب (write)
–كاتب (writer)
–مكتوب (letter)
–كتاب (book)
–مكتبة (library)
–مكتب (office)

Inflectional vs. Derivational
Word Classes
–Parts of speech: noun, verb, adjectives, etc.
–Word class dictates how a word combines with morphemes to form new words

Derivational morphology
Nominalization: computerization, appointee, killer, fuzziness
Formation of adjectives: computational, clueless, embraceable
CatVar: Categorial Variation Database

Inflectional morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Five verb forms in English
Other languages have lots more

Nouns and Verbs (in English)
Nouns have simple inflectional morphology
–cat
–cat+s, cat+’s
Verbs have more complex morphology

Regulars and Irregulars
Nouns
–Cat/Cats
–Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
–Walk/Walked
–Go/Went, Fly/Flew

Regular (English) Verbs
Morphological Form Classes     Regularly Inflected Verbs
Stem                           walk      merge     try      map
-s form                        walks     merges    tries    maps
-ing form                      walking   merging   trying   mapping
Past form or -ed participle    walked    merged    tried    mapped

Irregular (English) Verbs
Morphological Form Classes     Irregularly Inflected Verbs
Stem                           eat       catch      cut
-s form                        eats      catches    cuts
-ing form                      eating    catching   cutting
Past form                      ate       caught     cut
-ed participle                 eaten     caught     cut

“To love” in Spanish

Computational Morphology
Finite State Morphology
–Finite State Transducers (FST)
Input/Output
Analysis/Generation

Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)

Computational Morphology
The Rules and the Lexicon
–General versus Specific
–Regular versus Irregular
–Accuracy, speed, space
–The Morphology of a language
Approaches
–Lexicon only
–Rules only
–Lexicon and Rules: Finite-state Automata, Finite-state Transducers

Lexicon-only Morphology
acclaim         acclaim $N$
acclaim         acclaim $V+0$
acclaimed       acclaim $V+ed$
acclaimed       acclaim $V+en$
acclaiming      acclaim $V+ing$
acclaims        acclaim $N+s$
acclaims        acclaim $V+s$
acclamation     acclamation $N$
acclamations    acclamation $N+s$
acclimate       acclimate $V+0$
acclimated      acclimate $V+ed$
acclimated      acclimate $V+en$
acclimates      acclimate $V+s$
acclimating     acclimate $V+ing$
The lexicon lists all surface level and lexical level pairs
No rules …?
Analysis/Generation is easy
Very large for English
What about Arabic or Turkish? Chinese?
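
A minimal sketch of the lexicon-only approach, assuming the listed pairs are stored in two plain dictionaries. The names `PAIRS`, `analyze`, and `generate` are illustrative and do not come from the lecture. It shows why analysis and generation are easy here (both are lookups) and why the table has to enumerate every inflected form.

```python
from collections import defaultdict

# (surface form, lexical form) pairs copied from the slide; a real
# lexicon would list every inflected form of every word.
PAIRS = [
    ("acclaim", "acclaim $N$"),
    ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"),
    ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"),
    ("acclaims", "acclaim $V+s$"),
]

analyses = defaultdict(list)     # surface form -> lexical forms
generations = defaultdict(list)  # lexical form -> surface forms
for surface, lexical in PAIRS:
    analyses[surface].append(lexical)
    generations[lexical].append(surface)


def analyze(surface):
    """Return every lexical form listed for a surface form."""
    return analyses.get(surface, [])


def generate(lexical):
    """Return every surface form listed for a lexical form."""
    return generations.get(lexical, [])


print(analyze("acclaims"))
print(generate("acclaim $V+ing$"))
```

An unlisted word such as "acclaimer" simply returns an empty list, which is the weakness of having no rules.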

Lexicon and Rules: FSA Inflectional Morphology
English Noun Lexicon
reg-noun         fox, cat, dog
irreg-pl-noun    geese, sheep, mice
irreg-sg-noun    goose, sheep, mouse
plural           -s
[figure: English Noun Rule]
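
The noun lexicon above can be combined with a small automaton to recognize inflected nouns. The sketch below is one possible encoding, not the lecture's own: the state names `q0`, `q1`, `q2` and the function `recognize` are assumptions, and arcs are labeled with morpheme classes that are expanded from the lexicon at run time.

```python
# Arcs are labeled with morpheme classes; state names are hypothetical.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPTING = {"q1", "q2"}

# Sub-lexicons from the slide.
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural": {"s"},
}


def recognize(word, state="q0"):
    """Return True if some path through the FSA consumes the whole word."""
    if word == "":
        return state in ACCEPTING
    for (src, label), dst in TRANSITIONS.items():
        if src != state:
            continue
        for morpheme in LEXICON[label]:
            if word.startswith(morpheme) and recognize(word[len(morpheme):], dst):
                return True
    return False


for w in ["cats", "geese", "gooses", "foxs", "foxes"]:
    print(w, recognize(w))
```

Because this automaton only concatenates morphemes, it accepts "foxs" and rejects "foxes"; getting the spelling right is the job of the spelling rules and transducers covered later.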

FSA English Verb Inflectional Morphology
reg-verb-stem      walk, fry, talk, impeach
irreg-verb-stem    cut, speak, spoken, sing, sang
irreg-past-verb    caught, ate, eaten
past               -ed
past-part          -ed
pres-part          -ing
3sg                -s

FSA for Derivational Morphology: Adjectival Formation

More Complex Derivational Morphology

Using FSAs for Recognition: English Nouns and their Inflection

Morphological Parsing
Finite-state automata (FSA)
–Recognizer
–One-level morphology
Finite-state transducers (FST)
–Two-level morphology: PC-Kimmo (Koskenniemi 83)
–input-output pair

Terminology for PC-Kimmo
Upper = lexical tape
Lower = surface tape
Characters correspond to pairs, written a:b
If “a:a”, write “a” for shorthand
Two-level lexical entries
# = word boundary
^ = morpheme boundary
Other = “any feasible pair that is not in this transducer”

Four-Fold View of FSTs
As a recognizer
As a generator
As a translator
As a set relater

Nominal Inflection FST

Lexical and Intermediate Tapes

Spelling Rules
Name                  Rule Description                                 Example
Consonant Doubling    1-letter consonant doubled before -ing/-ed       beg/begging
E-deletion            Silent e dropped before -ing and -ed             make/making
E-insertion           e added after s, z, x, ch, sh before s           watch/watches
Y-replacement         -y changes to -ie before -s, -i before -ed       try/tries
K-insertion           verbs ending with vowel + -c add -k              panic/panicked

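Chomsky and Halle Notation
ε → e / {x, s, z} ^ __ s #
Read: insert e when the preceding morpheme ends in x, s, or z and the following morpheme is the word-final suffix s.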
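
The E-insertion rule can be approximated with a regular-expression rewrite on the intermediate tape. This is a sketch under assumptions: intermediate strings use ^ for morpheme boundaries and # for the word boundary, and the function names `e_insertion` and `to_surface` are made up for illustration. A regex substitution stands in for the transducer; it is not how PC-Kimmo implements the rule.

```python
import re


def e_insertion(intermediate):
    """Apply e-insertion: add e after x/s/z + '^' when 's#' follows."""
    # ([xsz]) is the left context, \^ the morpheme boundary,
    # (?=s#) checks the right context without consuming it.
    return re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)


def to_surface(intermediate):
    """Apply the rule, then erase the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for form in ["fox^s#", "cat^s#", "kiss^s#", "fox#"]:
    print(form, "->", to_surface(form))
```

The rule as written covers only x, s, and z; the ch and sh cases from the spelling-rule table (watch/watches) would need a wider left context.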

Intermediate-to-Surface Transducer

State Transition Table

Two-Level Morphology

Sample Run

FST Properties
Inversion
–$T^{-1}$ = inversion of $T$
–Input/Output switched
Composition
–$T_1$ maps $I_1$ to $O_1$
–$T_2$ maps $I_2$ to $O_2$
–$T_1 \circ T_2$ maps $I_1$ to $O_2$
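
Inversion and composition are easiest to see on a finite relation, that is, a set of (input, output) pairs. The sketch below is only an illustration of the two operations and not a transducer implementation; `invert`, `compose`, and the two toy relations are hypothetical names and data.

```python
def invert(relation):
    """Swap input and output of every pair (T becomes T^-1)."""
    return {(out, inp) for inp, out in relation}


def compose(t1, t2):
    """Pair each input of t1 with the outputs t2 gives for t1's outputs."""
    return {(a, c) for a, b in t1 for b2, c in t2 if b == b2}


# Toy relations: lexical -> intermediate, intermediate -> surface.
LEX_TO_INTER = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTER_TO_SURF = {("fox^s#", "foxes"), ("cat^s#", "cats")}

generator = compose(LEX_TO_INTER, INTER_TO_SURF)  # lexical -> surface
parser = invert(generator)                        # surface -> lexical

print(sorted(generator))
print(sorted(parser))
```

Composition only produces a pair when an output of the first relation is also an input of the second, which is why the two tapes in the middle must use the same notation.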

FSTs and ambiguity
Kimmo Demo
Parse Example 1: unionizable
–union +ize +able
–un+ ion +ize +able
Parse Example 2: assess
–assess v
–ass N +ess N
Parse Example 3: tender
–tender AJ
–ten Num +d AJ +er CMP

What to do about Global Ambiguity?
Accept first successful structure
Run parser through all possible paths
Bias the search in some manner
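
The first two strategies can be compared with a small exhaustive segmenter. This is a sketch, assuming a toy set of surface morphemes (`TOY_LEXICON` is invented for the example, and category labels are left out); it enumerates every way of splitting a word into listed pieces, so "accept first" is just the first element of the result.

```python
# Hypothetical surface-morpheme list covering two of the parse examples.
TOY_LEXICON = {"assess", "ass", "ess", "tender", "ten", "d", "er"}


def segmentations(word, lexicon):
    """Return all ways to split word into morphemes from lexicon."""
    if word == "":
        return [[]]
    results = []
    # Try every prefix, shortest first, and recurse on the remainder.
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            for rest in segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


all_paths = segmentations("tender", TOY_LEXICON)
first_only = all_paths[0] if all_paths else None

print(all_paths)
print(first_only)
```

With shortest-prefix-first search, the first successful structure for "tender" is the implausible ten+d+er, which illustrates why biasing the search can matter.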

Computational Morphology
The Rules and the Lexicon
–General versus Specific
–Regular versus Irregular
–Accuracy, speed, space
–The Morphology of a language
Approaches
–Lexicon only
–Rules only
–Lexicon and Rules: Finite-state Automata, Finite-state Transducers

Lexicon-Free Morphology: Porter Stemmer
Lexicon-Free FST Approach
By Martin Porter (1980)
Cascade of substitutions given specific conditions
GENERALIZATIONS → GENERALIZATION → GENERALIZE → GENERAL → GENER
Porter Stemmer Game

Porter Stemmer Definitions
C = consonant = a letter other than A, E, I, O, U, and other than Y preceded by a consonant
V = vowel = not C
M = Measure: Words = C*(V+C+){M}V*
–M=0: TR, EE, TREE, Y, BY
–M=1: TROUBLE, OATS, TREES, IVY
–M=2: TROUBLES, PRIVATE, OATEN, ORRERY
Conditions
–*S: stem ends with S
–*v*: stem contains a V
–*d: stem ends with double C (-DD, -ZZ)
–*o: stem ends CVC, where the second C is not W, X or Y (-WIL, -SOB)
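
The measure M can be computed by collapsing a word into alternating C and V blocks and counting the VC pairs. The sketch below follows the definitions on this slide; the helper names `is_consonant` and `measure` are illustrative and are not taken from any particular stemmer implementation.

```python
def is_consonant(word, i):
    """True if the letter at position i counts as a consonant."""
    ch = word[i].lower()
    if ch in "aeiou":
        return False
    if ch == "y":
        # Y is a vowel only when the previous letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(word):
    """Count the VC blocks in the word's collapsed C/V pattern."""
    pattern = ""
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != tag:
            pattern += tag  # e.g. "troubles" collapses to "CVCVC"
    return pattern.count("VC")


for w in ["tr", "tree", "by", "trouble", "ivy", "private", "orrery"]:
    print(w, measure(w))
```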

Porter Stemmer
Step 1: Plural Nouns and Third Person Singular Verbs
SSES → SS    caresses → caress
IES → I      ponies → poni, ties → ti
SS → SS      caress → caress
S → ε        cats → cat
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED → EE        feed → feed, agreed → agree
i. (*v*) ED → ε       plastered → plaster, bled → bled
ii. (*v*) ING → ε     motoring → motor, sing → sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT → ATE    conflat(ed) → conflate
BL → BLE    troubl(ed) → trouble
IZ → IZE    siz(ed) → size
(*d and not (*L or *S or *Z)) → single letter    hopp(ing) → hop, tann(ed) → tan, hiss(ing) → hiss, fizz(ed) → fizz
(M=1 and *o) → E    fail(ing) → fail, fil(ing) → file
Legend: *X = ends with X; *v* = contains a V; *d = ends with double C; *o = ends with CVC, second C is not W, X or Y
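
Step 1 has no conditions, so it reduces to an ordered suffix table in which the first (longest) matching suffix wins. A minimal sketch, assuming lowercase input; `STEP1_RULES` and `step1` are illustrative names.

```python
# Ordered longest-first so that SSES is tried before SS and S.
STEP1_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step1(word):
    """Apply the first matching Step 1 rule, or return the word unchanged."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step1(w))
```

The SS → SS rule looks redundant but it blocks the S rule, so "caress" is not reduced to "cares".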

Porter Stemmer
Step 3: Y → I
(*v*) Y → I    happy → happi, sky → sky

Porter Stemmer
Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE      valenci -> valence
(m>0) ANCI -> ANCE      hesitanci -> hesitance
(m>0) IZER -> IZE       digitizer -> digitize
(m>0) ABLI -> ABLE      conformabli -> conformable
(m>0) ALLI -> AL        radicalli -> radical
(m>0) ENTLI -> ENT      differentli -> different
(m>0) ELI -> E          vileli -> vile
(m>0) OUSLI -> OUS      analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE      predication -> predicate
(m>0) ATOR -> ATE       operator -> operate
(m>0) ALISM -> AL       feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL       formaliti -> formal
(m>0) IVITI -> IVE      sensitiviti -> sensitive
(m>0) BILITI -> BLE     sensibiliti -> sensible
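
A conditioned rule such as (m>0) ATIONAL -> ATE tests the measure of the stem that remains after the suffix is removed. The sketch below covers only a few of the Step 4 rules and repeats a measure helper so that it runs on its own; all names are illustrative. It assumes, following Porter's published description, that only the longest matching suffix is tested and that the word is left alone if that rule's condition fails.

```python
def is_consonant(word, i):
    """True if the letter at position i counts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count VC blocks in the stem's collapsed C/V pattern."""
    pattern = ""
    for i in range(len(stem)):
        tag = "C" if is_consonant(stem, i) else "V"
        if not pattern or pattern[-1] != tag:
            pattern += tag
    return pattern.count("VC")


# A subset of the Step 4 table, ordered longest suffix first.
STEP4_RULES = [
    ("ational", "ate"),
    ("ization", "ize"),
    ("tional", "tion"),
    ("izer", "ize"),
    ("alism", "al"),
]


def step4(word):
    """Rewrite the longest matching suffix if the stem has m > 0."""
    for suffix, replacement in STEP4_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if measure(stem) > 0:
                return stem + replacement
            return word  # condition failed; no other rule is tried
    return word


for w in ["relational", "conditional", "rational", "digitizer"]:
    print(w, "->", step4(w))
```

"rational" stays unchanged because the stem left after removing the suffix has measure 0, which is what keeps short words from being over-stemmed.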

Porter Stemmer
Step 5: Derivational Morphology II: More Multiple Suffixes
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE -> ε     formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC     electrical -> electric
(m>0) FUL -> ε       hopeful -> hope
(m>0) NESS -> ε      goodness -> good

Porter Stemmer
Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL -> ε       revival -> reviv
(m>1) ANCE -> ε     allowance -> allow
(m>1) ENCE -> ε     inference -> infer
(m>1) ER -> ε       airliner -> airlin
(m>1) IC -> ε       gyroscopic -> gyroscop
(m>1) ABLE -> ε     adjustable -> adjust
(m>1) IBLE -> ε     defensible -> defens
(m>1) ANT -> ε      irritant -> irrit
(m>1) EMENT -> ε    replacement -> replac
(m>1) MENT -> ε     adjustment -> adjust
(m>1) ENT -> ε      dependent -> depend
(m>1 and (*S or *T)) ION -> ε    adoption -> adopt
(m>1) OU -> ε       homologou -> homolog
(m>1) ISM -> ε      communism -> commun
(m>1) ATE -> ε      activate -> activ
(m>1) ITI -> ε      angulariti -> angular
(m>1) OUS -> ε      homologous -> homolog
(m>1) IVE -> ε      effective -> effect
(m>1) IZE -> ε      bowdlerize -> bowdler

Porter Stemmer
Step 7a: Cleanup
(m>1) E → ε                 probate → probat, rate → rate
(m=1 and not *o) E → ε      cease → ceas
Step 7b: More Cleanup
(m > 1 and *d and *L) → single letter    controll → control, roll → roll

Porter Stemmer
Errors of Omission (related forms that are not conflated)
–European / Europe
–analysis / analyzes
–matrices / matrix
–noise / noisy
–explain / explanation
Errors of Commission (unrelated forms that are conflated)
–organization / organ
–doing / doe
–generalization / generic
–numerical / numerous
–university / universe

Readings for next time J&M Chapter 6

Text from Assignment here… Assignment 2 Due Date is Midnight 2/25/2004