LING 388 Language and Computers Lecture 21 11/13/03 Sandiway FONG.

Slides:



Advertisements
Similar presentations
Special Topics in Computer Science Advanced Topics in Information Retrieval Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming.
Advertisements

Srihari-CSE535-Spring2008 CSE 535 Information Retrieval Chapter 2: Text Operations.
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Morphology.
Objectives 1.You will identify what is meant by the term ‘suffix’. 2.You will identify and demonstrate the function of the most common suffixes.
LING 388: Language and Computers Sandiway Fong Lecture 21: 11/8.
Intelligent Information Retrieval CS 336 –Lecture 3: Text Operations Xiaoyan Li Spring 2006.
1 Morphological Analysis difficulties Noun, plural, countable 3 morphemes: difficult FREE MORPHEME -y BOUND DERIVATIONAL (-y: suffix to form nouns from.
Srihari-CSE635-Fall 2002 CSE 635 Multimedia Information Retrieval Chapter 7: Text Preprocessing.
October 2009HLT: Conflation Algorithms1 Human Language Technology Conflation Algorithms.
(C) 2003, The University of Michigan1 Information Retrieval Handout #4 January 28, 2005.
CMSC 723: Intro to Computational Linguistics February 11, 2003 Lecture 3: Finite-State Morphology Prof. Bonnie J. Dorr and Dr. Nizar Habash TAs: Nitin.
December 2007NLP: Conflation Algorithms1 Natural Language Processing Conflation Algorithms.
Term Processing & Normalization Major goal: Find the best possible representation Minor goals: Improve storage and speed First: Need to transform sequence.
Morphology Chapter 7 Prepared by Alaa Al Mohammadi.
Rules Always answer in the form of a question 50 points deducted for wrong answer.
Brief introduction to morphology
CS 430 / INFO 430 Information Retrieval
1 Discussion Class 3 The Porter Stemmer. 2 Course Administration No class on Thursday.
Spring 2002NLE1 CC 384: Natural Language Engineering Week 2, Lecture 2 - Lemmatization and Stemming; the Porter Stemmer.
CS 430 / INFO 430 Information Retrieval
Vocabulary size and term distribution: tokenization, text normalization and stemming Lecture 2.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 11/1.
Morphology I. Basic concepts and terms Derivational processes
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
LING 388: Language and Computers Sandiway Fong Lecture 25: 11/22.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Some Basic Concepts: Morphology.
McGraw-Hill © 2007 The McGraw-Hill Companies, Inc. All rights reserved. Objectives This chapter will show you how to improve your spelling by using: A.
LING/C SC/PSYC 438/538 Lecture 17 Sandiway Fong. Administrivia Grading – Midterm grading not finished yet – Homework 3 graded Reminder – Next Monday:
WORD-FORMATION (Word derivation) Lecture # 4
LING 388: Language and Computers Sandiway Fong Lecture 17.
Prof. Erik Lu. MORPHOLOGY GRAMMAR MORPHOLOGY MORPHEMES BOUND FREE WORDS LEXICAL GRAMMATICAL NOUNS VERBS ADJECTIVES (ADVERBS) PRONOUNS ARTICLES ADVERBS.
LING 388: Language and Computers Sandiway Fong Lecture 22: 11/10.
Ch4 – Features Consider the following data from Mokilese
WORD-FORMATION (Word derivation) Lecture # 4
DCU meets MET: Bengali and Hindi Morpheme Extraction Debasis Ganguly, Johannes Leveling, Gareth J.F. Jones CNGL, School of Computing, Dublin City University,
Phonemes A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning. These units are identified within.
Text Preprocessing. Preprocessing step Aims to create a correct text representation, according to the adopted model. Step: –Lexical analysis; –Case folding,
1 CS 430 / INFO 430 Information Retrieval Lecture 5 Searching Full Text 5.
Morphological Analysis Lim Kay Yie Kong Moon Moon Rosaida bt ibrahim Nor hayati bt jamaludin.
Chapter III morphology by WJQ. Morphology Morphology refers to the study of the internal structure of words, and the rules by which words are formed.
Natural Language Processing Chapter 2 : Morphology.
LING/C SC/PSYC 438/538 Lecture 25 Sandiway Fong 1.
(C) 2003, The University of Michigan1 Information Retrieval Handout #2 February 3, 2003.
1 Discussion Class 3 Stemming Algorithms. 2 Discussion Classes Format: Question Ask a member of the class to answer Provide opportunity for others to.
NLP. Text similarity People can express the same concept (or related concepts) in many different ways. For example, “the plane leaves at 12pm” vs “the.
MORPHOLOGY : THE STRUCTURE OF WORDS. MORPHOLOGY Morphology deals with the syntax of complex words and parts of words, also called morphemes, as well as.
Chapter 3 Word Formation I This chapter aims to analyze the morphological structures of words and gain a working knowledge of the different word forming.
Derivational morphemes
Morphology 1 : the Morpheme
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
MORPHOLOGY COURSE SPRING TERM 2016
Stemming Algorithms 資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 黃哲修 張家豪 From
Basic Text Processing: Morphology Word Stemming
Inflectional Morphology
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Chapter 3 Morphology Without grammar, little can be conveyed. Without vocabulary, nothing can be conveyed. (David Wilkins ,1972) Morphology refers to.
WORD-FORMATION PRODUCTIVITY
LING/C SC/PSYC 438/538 Lecture 26 Sandiway Fong.
עיבוד שפות טבעיות מבוא פרופ' עידו דגן המחלקה למדעי המחשב
Token generation - stemming
資訊擷取與推薦技術:期中報告 指導教授:黃三益 老師 學生: 黃哲修 張家豪
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Basic Text Processing Word tokenization.
Discussion Class 3 Stemming Algorithms.
Introduction to English morphology
Year 3 Spelling Rules.
Added to the end of a word.
Basic Text Processing: Morphology Word Stemming
Spelling Scheme of Work
Presentation transcript:

LING 388 Language and Computers Lecture 21 11/13/03 Sandiway FONG

Administrivia Room change next Tuesday Room change next Tuesday  One time deal :-)  PAS 224  Physics and Atmospheric Sciences Building

Morphology Inflectional Morphology: Inflectional Morphology:  Phi-features (person, number, gender)  Examples: movies, blonde, actress  Irregular examples: appendices, geese  Case  Examples: he/him, who/whom  Comparatives and superlatives  Examples: happier/happiest  Tense  Examples: drive/drives/drove (-ed)/driven

Morphology Derivational Morphology Derivational Morphology  Nominalization  Examples: formalization, informant, informer, refusal, lossage  Deadjectivals  Examples: weaken, happiness, simplify, formalize, slowly, calm  Deverbals  Examples: see nominalizations, readable, employee  Denominals  Examples: formal, bridge, ski, cowardly, useful

Morphology and Semantics Suffixation Suffixation  Examples:  x employ y employee: picks out yemployee: picks out y employer: picks out xemployer: picks out x  x read y readable: picks out yreadable: picks out y Prefixation Prefixation  Examples:  undo, redo, un-redo, encode, defrost, asymmetric, malformed, ill-formed, pro-Chomsky

Stemming Normalization procedure: Normalization procedure:  Inflectional morphology:  cities -> city, improves/improved -> improve  Derivational morphology:  transformation/transformational -> transform Criterion: preserve meaning Criterion: preserve meaning Primary application: information retrieval (IR) Primary application: information retrieval (IR)  Efficacy questioned: Harman (1991)

Stemming IR-centric view: IR-centric view:  Applies to open-class lexical items only:  Stop-words: the, below, being, does Not full morphology: Not full morphology:  prefixes generally excluded  (not meaning preserving)  Examples: asymmetric, undo,. encoding

Stemming: Methods Use a dictionary (look-up) Use a dictionary (look-up)  OK for English, not for languages with more productive morphology, e.g. Japanese Write rules, e.g. Porter Algorithm (Porter, 1980) Write rules, e.g. Porter Algorithm (Porter, 1980)  Example:  Ends in doubled consonant (not “l”, “s” or “z”), remove last character hopping -> hop, hissing -> hisshopping -> hop, hissing -> hiss

Stemming: Methods Dictionary approach not enough Dictionary approach not enough  Example: (Porter, 1991)  routed -> route/rout At Waterloo, Napoleon’s forces were routedAt Waterloo, Napoleon’s forces were routed The cars were routed off the highwayThe cars were routed off the highway  Here, the (inflected) verb form is polysemous

Stemming: Errors Understemming: failure to merge Understemming: failure to merge  Adhere/adhesion Overstemming: incorrect merge Overstemming: incorrect merge  Probe/probable  Claim: -able irregular suffix, root: probare (Lat.) Mis-stemming: removing a non-suffix (Porter, 1991) Mis-stemming: removing a non-suffix (Porter, 1991)  reply -> rep

Stemming: Interaction Interacts with noun compounding: Interacts with noun compounding:  Example:  operating systems  negative polarity items  For IR, compounds need to be identified first…

Stemming: Porter Algorithm The Porter Stemmer (Porter, 1980) The Porter Stemmer (Porter, 1980)  For English  Most widely used system  Manually written rules  5 stage approach to extracting roots  Considers suffixes only  May produce non-word roots

Stemming: Porter Algorithm Rule format: Rule format:  (condition on stem) suffix 1 -> suffix 2  In case of conflict, prefer longest suffix match “Measure” of a word is m in: “Measure” of a word is m in:  (C) (VC) m (V)  C = sequence of one or more consonants  V = sequence of one or more vowels  Examples:  tree C(VC) 0 V  troubles C(VC) 2

Stemming: Porter Algorithm Step 1a: remove plural suffixation Step 1a: remove plural suffixation  SSES -> SS (caresses)  IES -> I (ponies)  SS -> SS (caress)  S -> (cats) Step 1b: remove verbal inflection Step 1b: remove verbal inflection  (m>0) EED -> EE (agreed, feed)  (*v*) ED -> (plastered, bled)  (*v*) ING -> (motoring, sing)

Stemming: Porter Algorithm Step 1b: (contd. for -ed and -ing rules) Step 1b: (contd. for -ed and -ing rules)  AT -> ATE (conflated)  BL -> BLE (troubled)  IZ -> IZE (sized)  (*doubled c & ¬(*L v *S v *Z)) -> single c (hopping, hissing, falling, fizzing)  (m=1 & *cvc) -> E (filing, failing, slowing) Step 1c: Y and I Step 1c: Y and I  (*v*) Y -> I (happy, sky)

Stemming: Porter Algorithm Step 2: Peel one suffix off for multiple suffixes Step 2: Peel one suffix off for multiple suffixes  (m>0) ATIONAL -> ATE (relational)  (m>0) TIONAL -> TION (conditional, rational)  (m>0) ENCI -> ENCE (valenci)  (m>0) ANCI -> ANCE (hesitanci)  (m>0) IZER -> IZE (digitizer)  (m>0) ABLI -> ABLE (conformabli) - able (step 4)  …  (m>0) IZATION -> IZE (vietnamization)  (m>0) ATION -> ATE (predication)  …  (m>0) IVITI -> IVE (sensitiviti)

Stemming: Porter Algorithm Step 3 Step 3  (m>0) ICATE -> IC (triplicate)  (m>0) ATIVE -> (formative)  (m>0) ALIZE -> AL (formalize)  (m>0) ICITI -> IC (electriciti)  (m>0) ICAL -> IC (electrical, chemical)  (m>0) FUL -> (hopeful)  (m>0) NESS -> (goodness)

Stemming: Porter Algorithm Step 4: Delete last suffix Step 4: Delete last suffix  (m>1) AL -> (revival) - revive, see step 5  (m>1) ANCE -> (allowance, dance)  (m>1) ENCE -> (inference, fence)  (m>1) ER -> (airliner, employer)  (m>1) IC -> (gyroscopic, electric)  (m>1) ABLE -> (adjustable, mov(e)able)  (m>1) IBLE -> (defensible,bible)  (m>1) ANT -> (irritant,ant)  (m>1) EMENT -> (replacement)  (m>1) MENT -> (adjustment)  …

Stemming: Porter Algorithm Step 5a: remove e Step 5a: remove e  (m>1) E -> (probate, rate)  (m>1 & ¬*cvc) E -> (cease) Step 5b: ll reduction Step 5b: ll reduction  (m>1 & *LL) -> L (controller, roll)

Stemming: Porter Algorithm Misses (understemming) Misses (understemming)  Unaffected:  agreement (VC) 1 VCC - step 4 (m>1)  adhesion  Irregular morphology:  drove, geese Overstemming Overstemming  relativity - step 2 Mis-stemming Mis-stemming  wander C(VC) 1 VC

Stemming: Porter Algorithm Possible Term Project Possible Term Project  The Porter Stemmer is a rule-based system  We know how to write rules  Implement the Porter Stemmer in SWI-Prolog