LING/C SC/PSYC 438/538 Lecture 26 Sandiway Fong.

Today's Topics
  Optional Homework 14 graded
  Reminder: 538 Presentations start next time (see next slide)
  Last topic of the semester: Morphology and Stemming
    Chapter 3 of JM
    Morphology: Introduction
    Stemming

538 Presentations

Morphology
  Inflectional Morphology
    basically: no change in category
    φ-features (person, number, gender)
      Examples: movies, blonde, actress
      Irregular examples: appendices (from appendix), geese (from goose)
    case
      Examples: he/him, who/whom
    comparatives and superlatives
      Examples: happier/happiest
    tense
      Examples: drive/drives/drove (-ed)/driven
  Derivational Morphology
    basically: category changing
    nominalization
      Examples: formalization, informant, informer, refusal, lossage
    deadjectivals
      Examples: weaken, happiness, simplify, formalize, slowly, calm
    deverbals
      Examples: see nominalization, readable, employee
    denominals
      Examples: formal, bridge, ski, cowardly, useful

Morphology and Semantics
  Morphemes: units of meaning
  suffixation
    Examples:
      x employ y
        employee: picks out y
        employer: picks out x
      x read y
        readable: picks out y
  prefixation
    undo, redo, un-redo, encode, defrost, asymmetric, malformed, ill-formed, pro-Chomsky

Stemming
  Normalization Procedure
    inflectional morphology: cities → city, improves/improved → improve
    derivational morphology: transformation/transformational → transform
  criterion
    preserve meaning (word senses)
    counterexample: organization → organ (meaning not preserved)
  primary application
    information retrieval (IR)
    efficacy questioned: Harman (1991)
  Google didn’t use stemming
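As a rough illustration of why an IR system might normalize terms, the sketch below builds a toy inverted index in Python. The `NORMALIZE` table is a hypothetical stand-in for a real stemmer and holds only the word pairs from the slide; the document texts and function names are made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for a stemmer: only the slide's example pairs.
NORMALIZE = {
    "cities": "city",
    "improves": "improve",
    "improved": "improve",
    "transformation": "transform",
    "transformational": "transform",
}

def normalize(term):
    """Map a surface form to its index term (identity if not in the table)."""
    term = term.lower()
    return NORMALIZE.get(term, term)

def build_index(docs):
    """Return {index term: set of ids of the documents containing it}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index

def search(index, query):
    """Ids of documents that contain every (normalized) query term."""
    postings = [index.get(normalize(t), set()) for t in query.split()]
    return set.intersection(*postings) if postings else set()

# Made-up documents for illustration.
docs = {
    1: "cities improved transport",
    2: "the city improves roads",
    3: "organ music",
}
index = build_index(docs)
print(search(index, "city improve"))  # documents 1 and 2 both match
```

Without the normalization step, the query would match neither document, since neither contains both exact forms. Note that "organization" is deliberately left out of the table, so it is not conflated with "organ".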

Stemming and Search
  up until recently ...
    Word Variations (Stemming)
    To provide the most accurate results, Google does not use "stemming" or support "wildcard" searches. In other words, Google searches for exactly the words that you enter in the search box. Searching for "book" or "book*" will not yield "books" or "bookstore". If in doubt, try both forms: "airline" and "airlines," for instance.
  Google didn’t use stemming

Stemming and Search
  Google is more successful than other search engines in part because it returns “better”, i.e. more relevant, information
    its algorithm (a trade secret) is called PageRank
  SEO (Search Engine Optimization) is a topic of considerable commercial interest
    Goal: How to get your webpage listed higher by PageRank
    Techniques:
      e.g. by writing keyword-rich text in your page
      e.g. by listing morphological variants of keywords
  Google does not use stemming everywhere, and it does not reveal its algorithm, to prevent people “optimizing” their pages

Stemming and Search
  search on: diet requirements (results from 2004)
  note: can’t use quotes around the phrase (quoting blocks stemming)
  6th-ranked page

Stemming and Search
  search on: diet requirements
  notes:
    Top-ranked page has the words diet, dietary and requirements but not the phrase “diet requirements”
  <html>
  <head>
  <title>BBC - Health - Healthy living - Dietary requirements</title>
  <meta name="description" content="The foods that can help to prevent and treat certain conditions"/>
  <meta name="keywords" content="health, cancer, diabetes, cardiovascular disease, osteoporosis, restricted diets, vegetarian, vegan"/>

Stemming and Search
  search on: diet requirements
  notes:
    5th-ranked page does not have the word diet in it
      an academic conference
      unlikely to be “optimized” for page hits
    ranks above the 6th-ranked page, which does have the exact phrase “diet requirements” in it
  <HTML>
  <HEAD>
  <TITLE>PATLIB 2002 - Dietary requirements</TITLE>
  <LINK REL="stylesheet" HREF="styles.css" TYPE="text/css">
  </HEAD>

Stemming and Search The internet is still growing quickly… (2008) (2011) (2013)

Stemming
  IR-centric view
  Applies to open-class lexical items only:
    stop-word list: the, below, being, does
    exclude: determiners, prepositions, auxiliary verbs
  not full morphology
    prefixes generally excluded (not meaning preserving)
    Examples: asymmetric, undo, encoding
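A minimal sketch of the stop-word step, assuming whitespace tokenization and a stop-word list containing only the four words named on the slide (a real list would be much longer). The function name `index_terms` is invented for illustration.

```python
# Only the stop words named on the slide; real lists are far longer.
STOP_WORDS = {"the", "below", "being", "does"}

def index_terms(text):
    """Lower-case, split on whitespace, and drop closed-class stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

print(index_terms("The noise below does bother the neighbours"))
```

Only the tokens that survive this filter would then be passed on to a stemmer.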

Stemming: Methods
  use a dictionary (look-up)
    OK for English, not for languages with more productive morphology, e.g. Japanese, Turkish, Hungarian etc.
  write ad-hoc rules, e.g. Porter Algorithm (Porter, 1980)
    Example: Ends in doubled consonant (not l, s or z), remove last character
      hopping → hop
      hissing → hiss
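The doubled-consonant rule can be sketched in a few lines of Python. This is a simplified illustration, not the Porter code: it assumes the rule applies to what is left after removing -ing, treats only a, e, i, o, u as vowels, and uses invented function names.

```python
VOWELS = "aeiou"

def undouble(stem):
    """Drop the last letter if the stem ends in a doubled consonant other than l, s or z."""
    if (len(stem) >= 2 and stem[-1] == stem[-2]
            and stem[-1] not in VOWELS and stem[-1] not in "lsz"):
        return stem[:-1]
    return stem

def strip_ing(word):
    """Remove -ing when the remainder still contains a vowel, then apply the rule."""
    stem = word[:-3]
    if word.endswith("ing") and any(ch in VOWELS for ch in stem):
        return undouble(stem)
    return word

for w in ("hopping", "hissing", "sing"):
    print(w, "->", strip_ing(w))
```

The exception for s is what keeps "hiss" from being shortened to "his", and the vowel check is what leaves "sing" alone.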

Stemming: Methods
  dictionary approach not enough
  Example: (Porter, 1991)
    routed → route/rout
      At Waterloo, Napoleon’s forces were routed
      The cars were routed off the highway
  notes
    here, the (inflected) verb form is ambiguous
    preceding word (context) does not disambiguate, i.e. bigram not sufficient

Stemming: Errors
  Understemming: failure to merge
    Example: adhere/adhesion
  Overstemming: incorrect merge
    Example: probe/probable
    Claim: -able irregular suffix, root: probare (Lat.)
  Mis-stemming: removing a non-suffix (Porter, 1991)
    Example: reply → rep
  Page 68

Stemming: Interaction
  interacts with noun compounding
    Examples: operating systems, negative polarity items
  for IR, compounds need to be identified first…

Stemming: Porter Algorithm
  The Porter Stemmer (Porter, 1980)
  Textbook (page 68, Chapter 3 JM)
    3.8 Lexicon-Free FSTs: The Porter Stemmer
  URL
    http://www.tartarus.org/~martin/PorterStemmer/
  for English
    most widely used system (claimed)
  manually written rules
    5 stage approach to extracting roots
    considers suffixes only
    may produce non-word roots
  BCPL: extinct programming language

Stemming: Porter Algorithm

Stemming: Porter Algorithm

Stemming: Porter Algorithm
  Test it on…
  Stuff it should work on…
    regular morphology
      agreed, agrees, agreeing, feed
      care, cares, cared, caring
      cancellation, cancelation, canceled, cancelled
      wares, spyware
  New words
    obamacare, cancelization, phablet(s)
  Understemming
    agreement (VC)^1VCC - step 4 (m>1)
    adhesion
    geese
  Overstemming
    relativity - step 2
  Mis-stemming
    wander C(VC)^1VC

Porter Stemmer: Perl http://tartarus.org/~martin/PorterStemmer/perl.txt

Porter Stemmer: Perl

Porter Stemmer: Perl
  Just 3 pages of code
  You should be familiar enough with Perl programming and regular expressions to be able to read and understand the code

Stemming: Porter Algorithm
  rule format: (condition on stem) suffix1 → suffix2
  In case of conflict, prefer longest suffix match
  “Measure” of a word is m in: (C)(VC)^m(V)
    C = sequence of one or more consonants
    V = sequence of one or more vowels
  examples:
    tree: C(VC)^0V, (tr)()(ee)
    troubles: C(VC)^2, (tr)(ou)(bl)(e)(s)
  Question: is this a regular expression?
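One way to compute the measure is sketched below in Python. It follows Porter's (1980) definition of a consonant (any letter other than a, e, i, o, u, and other than y preceded by a consonant); the function names are invented for this example. It also bears on the question above: the template (C)(VC)^m(V) is regular, and m can be read off by counting the non-overlapping matches of `V+C+`.

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Rewrite a word as a string of C's and V's, e.g. 'tree' -> 'CCVV'."""
    return "".join("C" if is_consonant(word, i) else "V" for i in range(len(word)))

def measure(word):
    """Number of VC blocks in the form (C)(VC)^m(V)."""
    return len(re.findall(r"V+C+", cv_pattern(word)))

for w in ("tree", "troubles"):
    print(w, cv_pattern(w), measure(w))
```

For "tree" the pattern is CCVV, which contains no vowel run followed by a consonant run, so m = 0. For "troubles" the pattern is CCVVCCVC, which contains two such blocks, so m = 2.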

Stemming: Porter Algorithm
  Step 1a: remove plural suffixation
    SSES → SS (caresses)
    IES → I (ponies)
    SS → SS (caress)
    S → (cats)
  Step 1b: remove verbal inflection
    (m>0) EED → EE (agreed, feed)
    (*v*) ED → (plastered, bled)
    (*v*) ING → (motoring, sing)
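Step 1a has no condition on the stem, so it can be sketched as an ordered list of suffix tests in Python. The rules are tried longest first, which is why the identity rule SS → SS matters: it stops "caress" from reaching the bare S rule. The function name is invented for this example, and lower-case input is assumed.

```python
def step1a(word):
    """Plural removal, trying the longest matching suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]   # IES -> I
    if word.endswith("ss"):
        return word        # SS -> SS (blocks the S rule)
    if word.endswith("s"):
        return word[:-1]   # S -> (empty)
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step1a(w))
```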

Stemming: Porter Algorithm
  Step 1b: (contd. for -ed and -ing rules)
    AT → ATE (conflated)
    BL → BLE (troubled)
    IZ → IZE (sized)
    (*doubled c & ¬(*L v *S v *Z)) → single c (hopping, hissing, falling, fizzing)
    (m=1 & *cvc) → E (filing, failing, slowing)
  Step 1c: Y and I
    (*v*) Y → I (happy, sky)
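Steps 1b and 1c can be sketched together in Python as follows. This is an illustration under stated assumptions rather than the reference implementation: helper and function names are invented, input is assumed lower-case, and `*cvc` is read as in Porter (1980), where the final consonant must not be w, x or y (which is why "slowing" yields "slow" and not "slowe").

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", pattern))

def has_vowel(stem):                       # the *v* condition
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                        # the *cvc condition
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def cleanup(stem):
    """Repairs applied only after -ed or -ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(word):
    if word.endswith("eed"):               # longest match: never falls through to ED
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return cleanup(stem) if has_vowel(stem) else word
    return word

def step1c(word):
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

for w in ("agreed", "feed", "bled", "hopping", "hissing", "filing", "failing", "slowing"):
    print(w, "->", step1b(w))
```

The second word in each pair on the slide is the one the condition blocks: "feed" fails m>0, "bled" and "sing" have no vowel in the stem, and "sky" has no vowel before the y.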

Stemming: Porter Algorithm
  Step 2: Peel one suffix off for multiple suffixes
    (m>0) ATIONAL → ATE (relational)
    (m>0) TIONAL → TION (conditional, rational)
    (m>0) ENCI → ENCE (valenci)
    (m>0) ANCI → ANCE (hesitanci)
    (m>0) IZER → IZE (digitizer)
    (m>0) ABLI → ABLE (conformabli) - able (step 4)
    …
    (m>0) IZATION → IZE (vietnamization)
    (m>0) ATION → ATE (predication)
    (m>0) IVITI → IVE (sensitiviti)
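Step 2 lends itself to a table-driven sketch in Python. The table below holds only the rules listed on the slide (the full step has more, as the "…" indicates), and the names are invented for this example. The longest matching suffix is chosen first, and if its condition fails no shorter suffix is tried.

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", pattern))

# Partial rule table: only the rules shown on the slide.
STEP2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "ization": "ize", "ation": "ate",
    "iviti": "ive",
}

def step2(word):
    """Rewrite the longest matching suffix if the remaining stem has m > 0."""
    for suffix in sorted(STEP2_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > 0:
                return stem + STEP2_RULES[suffix]
            return word  # longest match wins even when its condition fails
    return word

for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", step2(w))
```

"rational" shows the longest-match policy at work: it matches ATIONAL, the stem "r" has m = 0, and so the word is left alone instead of falling back to TIONAL.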

Stemming: Porter Algorithm
  Step 3
    (m>0) ICATE → IC (triplicate)
    (m>0) ATIVE → (formative)
    (m>0) ALIZE → AL (formalize)
    (m>0) ICITI → IC (electriciti)
    (m>0) ICAL → IC (electrical, chemical)
    (m>0) FUL → (hopeful)
    (m>0) NESS → (goodness)

Stemming: Porter Algorithm
  Step 4: Delete last suffix
    (m>1) AL → (revival) - revive, see step 5
    (m>1) ANCE → (allowance, dance)
    (m>1) ENCE → (inference, fence)
    (m>1) ER → (airliner, employer)
    (m>1) IC → (gyroscopic, electric)
    (m>1) ABLE → (adjustable, mov(e)able)
    (m>1) IBLE → (defensible, bible)
    (m>1) ANT → (irritant, ant)
    (m>1) EMENT → (replacement)
    (m>1) MENT → (adjustment)
    …
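A sketch of Step 4 in Python, restricted to the suffixes listed on the slide (the real step has more) and using invented names. The stricter m>1 condition is what protects short words such as "dance", "fence", "bible" and "ant", whose would-be stems are too short.

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", pattern))

# Partial list: only the suffixes shown on the slide.
STEP4_SUFFIXES = ("al", "ance", "ence", "er", "ic", "able", "ible",
                  "ant", "ement", "ment")

def step4(word):
    """Delete the longest matching suffix if the remaining stem has m > 1."""
    for suffix in sorted(STEP4_SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return stem if measure(stem) > 1 else word
    return word

for w in ("revival", "allowance", "dance", "replacement", "agreement"):
    print(w, "->", step4(w))
```

"agreement" matches EMENT, but the stem "agre" has m = 1, so the word passes through unchanged; under the longest-match policy MENT is not tried.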

Stemming: Porter Algorithm
  Step 5a: remove e
    (m>1) E → (probate, rate)
    (m=1 & ¬*cvc) E → (cease)
  Step 5b: ll reduction
    (m>1 & *LL) → L (controller, roll)
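Step 5 can be sketched in Python as below. The second part of Step 5a follows Porter (1980), where a final e is also removed when m = 1 and the stem does not end in cvc; this is the case that turns "cease" into "ceas" while leaving "rate" intact. Names are invented for the example, and the input for Step 5b is assumed to be the output of the earlier steps (for instance "controll", which is what remains of "controller" once ER is removed).

```python
import re

VOWELS = set("aeiou")

def is_consonant(word, i):
    """'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    return len(re.findall(r"V+C+", pattern))

def ends_cvc(stem):
    """*cvc: consonant-vowel-consonant, final consonant not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step5(word):
    # Step 5a: drop a final e when the stem is long enough
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    # Step 5b: reduce a final ll to l when m > 1
    if word.endswith("ll") and measure(word) > 1:
        word = word[:-1]
    return word

for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step5(w))
```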

Stemming: Porter Algorithm
  Misses (understemming)
    SWI-Prolog/Perl/Python NLTK
    Unaffected:
      agreement (VC)^1VCC - step 4 (m>1)
      adhesion: adhes vs. adher(e)
    Irregular morphology: drove, geese
  Overstemming
    relativity - step 2: rel
  Mis-stemming
    wander C(VC)^1VC: wander

Stemming: Porter Algorithm
  Extensions?
    ad hoc and English-specific
    errors can be fixed on a case-by-case basis, i.e. use a look-up table (dictionary)
    also there are a finite number of irregularities, e.g. drive/drove, goose/geese
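The look-up table idea might be layered on top of a rule-based stemmer roughly as follows. This Python sketch is hypothetical throughout: the table holds only the two irregular pairs named above, and `strip_plural_s` is a trivial stand-in for a real rule-based stemmer such as Porter's.

```python
# Exception table: only the irregular pairs named on the slide.
IRREGULAR = {"drove": "drive", "geese": "goose"}

def strip_plural_s(word):
    """Hypothetical stand-in for a rule-based stemmer: strips a final -s only."""
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def stem(word, rules=strip_plural_s):
    """Consult the exception table first, then fall back to the rules."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    return rules(word)

for w in ("geese", "drove", "cats"):
    print(w, "->", stem(w))
```

Because the table is checked first, each irregular form costs one dictionary entry and the rules themselves stay untouched.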