Published by Dan Carl-Johan Åberg. Modified over 5 years ago.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture # 10 Lemmatization Stemming
ACKNOWLEDGEMENTS
  The material in this lecture has been taken from the following sources:
    “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
    “Managing Gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
    “Modern Information Retrieval” by Ricardo Baeza-Yates
    “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla
Outline
  Lemmatization
  Stemming
  Porter’s algorithm
  Language-specificity
Lemmatization
  An NLP tool: it uses dictionaries and morphological analysis of words to return the base or dictionary form of a word.
  Reduces inflectional/variant forms to the base form, e.g.:
    am, are, is → be
    car, cars, car's, cars' → car
    the boy's cars are different colors → the boy car be different color
  Proper nouns are not changed, e.g. Pakistan remains the same.
  Lemmatization implies doing “proper” reduction to the dictionary headword form.
  Example: lemmatization of “saw” attempts to return “see” or “saw”, depending on whether the token is used as a verb or a noun.
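The dependence on part of speech can be sketched with a toy lookup table. This is only an illustration: `LEMMAS` and `lemmatize` are hypothetical names, the table is hand-written, and a real lemmatizer would rely on a full dictionary plus morphological analysis.

```python
# Toy lemmatizer: a hand-written dictionary keyed by (token, part of speech).
# LEMMAS and lemmatize are hypothetical; the entries come from the slide's examples.
LEMMAS = {
    ("am", "VERB"): "be",
    ("are", "VERB"): "be",
    ("is", "VERB"): "be",
    ("cars", "NOUN"): "car",
    ("car's", "NOUN"): "car",
    ("cars'", "NOUN"): "car",
    ("saw", "VERB"): "see",   # "I saw the film"
    ("saw", "NOUN"): "saw",   # "a rusty saw"
}


def lemmatize(token, pos):
    """Return the headword for (token, pos); unknown tokens are returned unchanged."""
    return LEMMAS.get((token.lower(), pos), token)


print(lemmatize("saw", "VERB"))        # see
print(lemmatize("saw", "NOUN"))        # saw
print(lemmatize("Pakistan", "PROPN"))  # Pakistan (proper noun, not in the table)
```

Because unknown tokens fall through unchanged, proper nouns such as Pakistan keep their form, which matches the behaviour described above.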
Stemming
  Reduce terms to their “roots” before indexing.
  “Stemming” suggests crude affix chopping; it is language dependent.
    e.g., automate(s), automatic, automation are all reduced to automat.
    e.g., computation, computing, computer are all reduced to comput.
  Example of stemmed text:
    for example compressed and compression are both accepted as equivalent to compress.
    → for exampl compress and compress ar both accept as equival to compress
Porter’s algorithm
  The commonest algorithm for stemming English.
  Results suggest it is at least as good as other stemming options.
  Conventions + 5 phases of reductions:
    the phases are applied sequentially
    each phase consists of a set of commands
    sample convention: of the rules in a compound command, select the one that applies to the longest suffix.
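The longest-suffix convention can be sketched as follows, using the plural rules from step 1a of Porter's algorithm (sses → ss, ies → i, ss → ss, s → nothing). The names `STEP_RULES` and `apply_longest` are hypothetical, and this covers a single compound command only, not the full five-phase stemmer.

```python
# One compound command: a list of (suffix, replacement) rules.
# STEP_RULES and apply_longest are illustrative names, not part of any library.
STEP_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest(word, rules):
    """Apply the single rule whose suffix is the longest match for word."""
    matches = [(suffix, repl) for suffix, repl in rules if word.endswith(suffix)]
    if not matches:
        return word  # no rule in this command applies
    suffix, repl = max(matches, key=lambda rule: len(rule[0]))
    return word[: len(word) - len(suffix)] + repl


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", apply_longest(w, STEP_RULES))
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```

The word "caress" shows why the convention matters: both "ss" and "s" match, and choosing the longer suffix keeps the word intact instead of chopping it to "cares".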
Typical rules in Porter
  sses → ss        processes → process
  ies → i          skies → ski; ponies → poni
  ational → ate    rotational → rotate
  tional → tion    national → nation
  s → “”           cats → cat
  Rules sensitive to the weight (measure) of the word:
    (m>1) EMENT → “” (if the stem before “ement” has measure m greater than 1, where m counts the vowel-consonant sequences in the stem, delete “ement”)
    replacement → replac
    cement → cement
  Further examples: caresses → caress; parties → parti; separational → separate; factional → faction
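The measure condition can be sketched as below. In Porter's definition a stem is viewed as [C](VC)^m[V], where C and V are runs of consonants and vowels, and "y" counts as a vowel when it follows a consonant. The function names are hypothetical, and only the single (m>1) EMENT rule is shown.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # "y" is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def strip_ement(word):
    """Apply (m>1) EMENT -> "": drop "ement" only if the remaining stem has m > 1."""
    if word.endswith("ement"):
        stem = word[: -len("ement")]
        if measure(stem) > 1:
            return stem
    return word


print(strip_ement("replacement"))  # replac  (stem "replac" has m = 2)
print(strip_ement("cement"))       # cement  (stem "c" has m = 0)
```

Note that the condition is on the measure, not the raw length: the stem "el" of "element" has two letters but m = 1, so "element" is left unchanged by this rule.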
Other stemmers
  Other stemmers exist:
    Lovins stemmer: single-pass, longest suffix removal (about 250 rules)
    Paice/Husk stemmer
    Snowball
  Full morphological analysis (lemmatization): at most modest benefits for retrieval
Text Processing Stemming Example
Language-specificity
  The above methods embody transformations that are
    language-specific, and often
    application-specific.
  These are “plug-in” addenda to the indexing process.
  Both open-source and commercial plug-ins are available for handling these.
Does stemming/lemmatization help?
  English: very mixed results. It helps recall for some queries but harms precision on others, e.g.:
    operative (dentistry) → oper
    operational (research) → oper
    operating (systems) → oper
  Because it increases recall but reduces precision, such normalization is not very useful for English.
  Definitely useful for Spanish, German, Finnish, …
    30% performance gains for Finnish!
    The reason is that these languages have very clear morphological rules for forming words.
  Domain-specific normalization may also be helpful, e.g. normalizing words with respect to their usage in a particular domain.
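The recall/precision trade-off can be illustrated with a made-up four-document collection. Everything here is an assumption for illustration: the documents, the relevance judgments, and the hand-written `STEM` table (which stands in for a real stemmer) are all hypothetical.

```python
# Hypothetical stem table standing in for a real stemmer.
STEM = {"operating": "oper", "operate": "oper",
        "operative": "oper", "operational": "oper"}

# Made-up collection; d1 and d2 are assumed relevant to the query "operating".
DOCS = {
    "d1": "operating systems kernel",
    "d2": "how schedulers operate",
    "d3": "operative dentistry",
    "d4": "operational research",
}
RELEVANT = {"d1", "d2"}


def stem(token):
    return STEM.get(token, token)


def retrieve(query, use_stemming):
    """Return ids of documents containing the (optionally stemmed) query term."""
    norm = stem if use_stemming else (lambda t: t)
    q = norm(query)
    return {doc_id for doc_id, text in DOCS.items()
            if q in {norm(t) for t in text.split()}}


def precision_recall(retrieved):
    hits = len(retrieved & RELEVANT)
    precision = hits / len(retrieved) if retrieved else 0.0
    return precision, hits / len(RELEVANT)


print(precision_recall(retrieve("operating", use_stemming=False)))  # (1.0, 0.5)
print(precision_recall(retrieve("operating", use_stemming=True)))   # (0.5, 1.0)
```

In this toy setting stemming finds the second relevant document (recall rises from 0.5 to 1.0) but also pulls in the dentistry and research documents (precision falls from 1.0 to 0.5).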
Resources
  MG (Managing Gigabytes) 3.6, 4.3; MIR (Modern Information Retrieval) 7.2
  Porter’s stemmer: http//
  H.E. Williams, J. Zobel, and D. Bahle, “Fast Phrase Querying with Combined Indexes”, ACM Transactions on Information Systems.