INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


Lecture # 10: Lemmatization, Stemming

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
- “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
- “Managing gigabytes” by Ian H. Witten, Alistair Moffat, Timothy C. Bell
- “Modern information retrieval” by Ricardo Baeza-Yates
- “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, Marco Brambilla

Outline
- Lemmatization
- Stemming
- Porter’s algorithm
- Language-specificity

5. Lemmatization
- An NLP tool: it uses dictionaries and morphological analysis of words to return the base or dictionary form of a word.
- Reduces inflectional/variant forms to the base form, e.g.:
    am, are, is → be
    car, cars, car's, cars' → car
    the boy's cars are different colors → the boy car be different color
- Proper nouns are not changed, e.g. Pakistan remains the same.
- Lemmatization implies doing “proper” reduction to the dictionary headword form.
- Example: lemmatizing “saw” attempts to return “see” or “saw” depending on whether the token is used as a verb or a noun.
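The dictionary lookup described on this slide can be sketched in a few lines of Python. This is only an illustration: the `LEXICON` table, the part-of-speech tags (`VERB`, `NOUN`, `PROPN`) and the `lemmatize` function are assumptions made for this example, and a real lemmatizer would use a full dictionary plus a tagger to decide the part of speech.

```python
# Toy lexicon: (surface form, part of speech) -> lemma.
# The entries are illustrative assumptions, not a real dictionary resource.
LEXICON = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("cars", "NOUN"): "car",
    ("car's", "NOUN"): "car",
    ("cars'", "NOUN"): "car",
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
}


def lemmatize(token, pos):
    """Return the dictionary headword for a token, given its part of speech."""
    if pos == "PROPN":
        return token  # proper nouns are left unchanged
    # Unknown words fall back to their lowercased surface form.
    return LEXICON.get((token.lower(), pos), token.lower())


print(lemmatize("saw", "VERB"))        # verb reading of "saw"
print(lemmatize("saw", "NOUN"))        # noun reading of "saw"
print(lemmatize("Pakistan", "PROPN"))  # proper noun, unchanged
```

The key point is that the lookup is keyed on the part of speech as well as the word, which is what lets “saw” map to two different headwords.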

6. Stemming
- Reduce terms to their “roots” before indexing.
- “Stemming” suggests crude affix chopping; it is language dependent.
- e.g., automate(s), automatic, automation are all reduced to automat.
- e.g., computation, computing, computer all reduce to comput.
- Example sentence: for example compressed and compression are both accepted as equivalent to compress.
  After stemming: for exampl compress and compress ar both accept as equival to compress
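To make “crude affix chopping” concrete, here is a minimal sketch that strips the longest matching suffix from a word. The suffix list and the `min_stem` threshold are hypothetical choices made so that the automat example works; this is not Porter's algorithm, and a list this small will over-stem many other words.

```python
# Deliberately tiny, hypothetical suffix list chosen for the "automat" example.
SUFFIXES = ["ion", "ic", "es", "e", "s"]


def crude_stem(word, min_stem=3):
    """Chop the longest matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["automate", "automates", "automatic", "automation"]:
    print(w, "->", crude_stem(w))
```

No dictionary is consulted, which is why the result (“automat”) need not be a real word.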

Porter’s algorithm
- The most common algorithm for stemming English.
- Results suggest it is at least as good as other stemming options.
- Conventions + 5 phases of reductions:
    - phases are applied sequentially
    - each phase consists of a set of commands
    - sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Typical rules in Porter
    sses → ss        processes → process
    ies → i          skies → ski; ponies → poni
    ational → ate    rotational → rotate
    tional → tion    national → nation
    s → “”           cats → cat
Rules sensitive to the measure of the word:
    (m>1) EMENT → (null)
    If whatever comes before "ement" has measure m greater than 1, replace "ement" with null. Here m is the measure of the stem, roughly the number of vowel-consonant sequences it contains, not its length in characters.
    replacement → replac
    cement → cement
Further examples: caresses, parties, separational → separate, factional → faction
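The two ideas on this slide, longest-suffix selection within a rule group and a rule guarded by the measure m, can be sketched as follows. This is a partial illustration under stated assumptions, not a full Porter stemmer: the function names are invented for this example, only the plural rule group and the EMENT rule are covered, and the `ss → ss` entry is taken from Porter's original step 1a rather than from the slide.

```python
def is_consonant(word, i):
    """Porter's definition: a, e, i, o, u are vowels; 'y' is a vowel after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-consonant sequences, i.e. m in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if prev_vowel and not vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = vowel
    return m


# Plural rule group; 'ss -> ss' keeps words such as "caress" intact.
STEP_1A = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest(word, rules):
    """Apply only the rule whose suffix is the longest one matching the word."""
    for suffix, replacement in sorted(rules, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def strip_ement(word):
    """(m>1) EMENT -> null: drop 'ement' only if the remaining stem has m > 1."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


for w in ["processes", "ponies", "cats", "caress"]:
    print(w, "->", apply_longest(w, STEP_1A))
for w in ["replacement", "cement"]:
    print(w, "->", strip_ement(w))
```

For “replacement” the stem “replac” has m = 2, so the suffix is removed; for “cement” the stem “c” has m = 0, so the word is left alone.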

Other stemmers
Other stemmers exist:
- Lovins stemmer: http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm
    Single-pass, longest suffix removal (about 250 rules)
- Paice/Husk stemmer
- Snowball
Full morphological analysis (lemmatization):
- At most modest benefits for retrieval

Text Processing: Stemming Example

Language-specificity
- The above methods embody transformations that are language-specific and often application-specific.
- These are “plug-in” addenda to the indexing process.
- Both open source and commercial plug-ins are available for handling these.
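One way to picture the “plug-in” idea is a registry that maps a language code to a normalization function, which the indexer then looks up. Everything in this sketch is hypothetical: the `NORMALIZERS` registry, the `register` decorator, and the toy English rule are assumptions for illustration and do not correspond to any particular indexing product.

```python
# Hypothetical registry: language code -> token normalizer.
NORMALIZERS = {}


def register(language):
    """Decorator that plugs a normalizer in for one language."""
    def decorator(func):
        NORMALIZERS[language] = func
        return func
    return decorator


@register("en")
def normalize_english(token):
    # Toy rule for illustration only: lowercase and drop a plural "s".
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def index_terms(tokens, language):
    """Normalize tokens with the plug-in for the language, if one is registered."""
    normalize = NORMALIZERS.get(language, str.lower)
    return [normalize(t) for t in tokens]


print(index_terms(["Cats", "Dogs"], "en"))
print(index_terms(["Katzen"], "de"))  # no plug-in registered: lowercase only
```

The indexing code itself stays unchanged when a new language is supported; only a new normalizer is registered.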

Does stemming/lemmatization help?
- English: very mixed results. It helps recall for some queries but harms precision on others, e.g.:
    operative (dentistry) → oper
    operational (research) → oper
    operating (systems) → oper
- Because it increases recall but reduces precision, such normalization is not very useful for the English language.
- Definitely useful for Spanish, German, Finnish, …
    30% performance gains for Finnish!
    The reason is that these languages have very clear morphological rules for forming words.
- Domain-specific normalization may also be helpful, e.g. normalizing words with respect to their usage in a particular domain.
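The recall/precision trade-off can be illustrated with a toy collection. The four documents, the relevance judgments, and the `STEM` lookup table below are invented for this sketch; the table simply hard-codes the “oper” conflation from the slide instead of running a real stemmer.

```python
# Hypothetical stem table reproducing the slide's "oper" conflation.
STEM = {"operative": "oper", "operational": "oper",
        "operating": "oper", "operates": "oper"}

# Toy collection; d3 and d4 are assumed relevant to a query about operating systems.
DOCS = {
    "d1": ["operative", "dentistry"],
    "d2": ["operational", "research"],
    "d3": ["operating", "systems"],
    "d4": ["kernel", "operates", "hardware"],
}
RELEVANT = {"d3", "d4"}


def stem(term):
    return STEM.get(term, term)


def search(query_term, normalize):
    """Return ids of documents containing the query term after normalization."""
    q = normalize(query_term)
    return {doc_id for doc_id, terms in DOCS.items()
            if q in {normalize(t) for t in terms}}


def precision_recall(retrieved, relevant):
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


exact = search("operating", lambda t: t)
stemmed = search("operating", stem)
print("no stemming:", precision_recall(exact, RELEVANT))
print("stemming:   ", precision_recall(stemmed, RELEVANT))
```

In this toy setting, stemming retrieves the relevant document that uses “operates”, but it also pulls in the dentistry and operations-research documents, so recall rises while precision falls.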

Resources
- MG 3.6, 4.3; MIR 7.2
- Porter’s stemmer: http://www.sims.berkeley.edu/~hearst/irbook/porter.html
- H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems. http://www.seg.rmit.edu.au/research/research.php?author=4