1
IR Data Structures
Making Matching Queries and Documents Effective and Efficient
2
Lecture Objectives
• Learn an algorithm to stem without a dictionary
• Know the principles of other stemming systems
• Understand other data structures which facilitate rapid access from keywords to documents
3
Stemming
• Reducing morphological variants of words to a standard underlying form
  – e.g. calculate, calculates, calculations to calculat-
• Improves recall at the expense of precision
4
Porter Stemming Algorithm
• Well-known, effective stemmer which does not use a dictionary
• Uses the measure m of a word written in the form [C](VC)^m[V]
  – where
    » C is a sequence of consonants
    » V is a sequence of vowels
    » the leading [C] and the trailing [V] are optional
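The measure can be computed by mapping each letter to C or V and counting how often a vowel run is followed by a consonant run. The sketch below is a minimal illustration, not a full stemmer; the function names `is_consonant` and `measure` are chosen here for illustration, and input is assumed to be lower-case ASCII. It follows Porter's convention that "y" counts as a vowel only when it follows a consonant.

```python
def is_consonant(word: str, i: int) -> bool:
    """Return True if word[i] acts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count the VC pairs in the form [C](VC)^m[V]."""
    pattern = "".join(
        "C" if is_consonant(stem, i) else "V" for i in range(len(stem))
    )
    # Every V-to-C transition closes one (VC) group.
    return pattern.count("VC")


for w in ["tree", "trouble", "troubles"]:
    print(w, measure(w))
```

With this definition "tree" has measure 0, "trouble" has measure 1 and "troubles" has measure 2.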
5
Porter Algorithm Step 1
• -sses → -ss
• -ing → (removed)
• -at → -ate
• -y → -i
• Condition (for -ing and -y): the stem contains a vowel
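The Step 1 rules listed on this slide can be sketched as a chain of suffix checks. This is a simplified illustration covering only these four rules, not the whole of Porter's Step 1; the name `step1_sketch` is hypothetical, and the vowel test deliberately ignores the special treatment of "y".

```python
import re


def has_vowel(stem: str) -> bool:
    # Simplification: only a, e, i, o, u are treated as vowels here.
    return re.search(r"[aeiou]", stem) is not None


def step1_sketch(word: str) -> str:
    """Apply the four Step 1 rules shown on the slide."""
    if word.endswith("sses"):
        return word[:-2]                 # -sses -> -ss
    if word.endswith("ing") and has_vowel(word[:-3]):
        stem = word[:-3]                 # -ing -> (removed)
        if stem.endswith("at"):
            stem += "e"                  # -at -> -ate (tidy-up after removal)
        return stem
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"           # -y -> -i
    return word


for w in ["caresses", "motoring", "conflating", "sing", "happy", "sky"]:
    print(w, "->", step1_sketch(w))
```

The vowel condition is what keeps short words such as "sing" and "sky" intact: their candidate stems ("s", "sk") contain no vowel, so the rule does not fire.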
6
Porter Algorithm Steps 2-4
• -aliti → -al (measure > 0)
• -icate → -ic (measure > 0)
• -able → (removed) (measure > 1)
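The measure conditions decide whether a suffix rule may fire. The following sketch applies just the three rules on this slide in a single pass, which is an assumption made for brevity (the real algorithm runs Steps 2, 3 and 4 in sequence with many more rules). The names `RULES`, `simple_measure` and `apply_rules` are illustrative, and "y" is treated as a consonant throughout.

```python
import re

# (suffix, replacement, measure must be greater than this value)
RULES = [
    ("aliti", "al", 0),
    ("icate", "ic", 0),
    ("able", "", 1),
]


def simple_measure(stem: str) -> int:
    # Each vowel run followed by a consonant run is one (VC) group.
    return len(re.findall(r"[aeiou]+[^aeiou]+", stem))


def apply_rules(word: str) -> str:
    for suffix, replacement, min_measure in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if simple_measure(stem) > min_measure:
                return stem + replacement
            return word  # suffix matched, but the stem is too short
    return word


for w in ["formaliti", "triplicate", "adjustable", "table"]:
    print(w, "->", apply_rules(w))
```

The word "table" shows why the condition matters: its stem "t" has measure 0, so -able is left in place.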
7
Dictionary Based Stemmers
• Dictionary of stems
  – cf. vector based methods
• Dictionary of words
  – effective handling of irregular forms
• Proper Name/Controlled Vocabulary Lists
• Equivalent Term/Thesauri
8
Problems with stemming
• Always worsens precision in the hope of improving recall
• Causes (sometimes odd) misretrieval
  – “bled” vs “bleeding”
  – incorrect term conflation: “plastered” to “plaster”
• Do we really want to improve recall on the web?
9
N-Gram structures
• Store keywords broken down into fixed length segments
  – e.g. trigrams: “sea colony” to
    » sea + col + olo + lon + ony
• Useful as an index structure, for stemming and for spelling correction
  – “compuuter”
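The trigram breakdown and its use for spelling correction can be sketched as follows. This is one possible approach, assuming a Dice coefficient over trigram sets as the similarity score; the vocabulary list is hypothetical and the function names are illustrative.

```python
def trigrams(text: str, n: int = 3) -> list[str]:
    """Split each word of the text into overlapping n-character segments."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def dice(a: str, b: str) -> float:
    """Dice similarity between the trigram sets of two strings."""
    ga, gb = set(trigrams(a)), set(trigrams(b))
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


print(trigrams("sea colony"))  # ['sea', 'col', 'olo', 'lon', 'ony']

# Hypothetical vocabulary, used only to illustrate spelling correction.
vocabulary = ["computer", "commuter", "colony"]
best = max(vocabulary, key=lambda term: dice("compuuter", term))
print(best)  # computer
```

The misspelling “compuuter” still shares five of its seven trigrams with “computer”, which is why the correct term scores highest.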
10
Index Data Structures
• Inverted Files
• PAT Data Structure
  – tree based substrings
• Signature Files
• Hypertext Data Structure
11
Inverted Files
[Figure: the keyword “Alice” points to its list of document numbers 1, 5, 51182; the documents shown are 1, 5, 2, 887, 42 and 51182]
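An inverted file maps each keyword to the list of documents that contain it. The sketch below builds one in memory; the document numbers follow the slide, but the document texts are invented placeholders and the names `DOCS` and `build_inverted_index` are illustrative.

```python
import re
from collections import defaultdict

# Hypothetical toy collection: numbers as on the slide, texts invented.
DOCS = {
    1: "Alice was beginning to get very tired",
    2: "The rabbit took a watch out of its pocket",
    5: "Down the hole went Alice after it",
    42: "The queen shouted at the gardeners",
    887: "A grin without a cat",
    51182: "Alice opened the little door",
}


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map every lower-cased term to the sorted list of documents using it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(DOCS)
print(index["alice"])  # [1, 5, 51182]
```

A keyword query then becomes a dictionary lookup rather than a scan of every document.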
12
Inverted Files Supporting Proximity
Alice: 1, 5, 51182
“while Alice was sitting curled up in a corner of the great arm-chair, half talking to herself and half asleep, the kitten had been having a grand game of romps with the ball of worsted Alice had”
[Figure: word positions 167, 201,... are stored with the entry for Alice]
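Proximity queries need word positions as well as document numbers. A minimal sketch, assuming positions are word offsets counted from the start of the text: the excerpt from the slide is stored as document 1, so the offsets here are relative to the excerpt and will not match the 167, 201 of the full document. The names `build_positional_index` and `near` are illustrative.

```python
import re
from collections import defaultdict

EXCERPT = (
    "while Alice was sitting curled up in a corner of the great arm-chair, "
    "half talking to herself and half asleep, the kitten had been having a "
    "grand game of romps with the ball of worsted Alice had"
)


def build_positional_index(docs: dict[int, str]) -> dict[str, dict[int, list[int]]]:
    """Map term -> {document number -> list of word offsets}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term][doc_id].append(pos)
    return {term: dict(postings) for term, postings in index.items()}


def near(index, doc_id: int, a: str, b: str, k: int) -> bool:
    """True if terms a and b occur within k words of each other in doc_id."""
    pos_a = index.get(a, {}).get(doc_id, [])
    pos_b = index.get(b, {}).get(doc_id, [])
    return any(abs(p - q) <= k for p in pos_a for q in pos_b)


index = build_positional_index({1: EXCERPT})
print(index["alice"][1])
print(near(index, 1, "alice", "sitting", 2))
print(near(index, 1, "alice", "kitten", 5))
```

Storing positions makes the index larger, but lets phrase and "within k words" queries be answered without rereading the documents.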
13
Hypertext Data Structure
• Nodes and Links
• File types imply a program to interpret (display/play) the data
• Tags in HTML imply how to load referenced data:
  – protocol
  – server
  – location at server
14
URL Example
http://www.cet.sunderland.ac.uk/~cs0jel/teaching/com268/Lglass.asc
  – protocol: http://
  – server: www.cet.sunderland.ac.uk/
  – location: ~cs0jel/teaching/com268/Lglass.asc
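The same three parts can be recovered programmatically. A short sketch using `urlsplit` from Python's standard `urllib.parse` module, which calls the parts scheme, netloc and path:

```python
from urllib.parse import urlsplit

url = "http://www.cet.sunderland.ac.uk/~cs0jel/teaching/com268/Lglass.asc"
parts = urlsplit(url)

print("protocol:", parts.scheme)   # http
print("server:  ", parts.netloc)   # www.cet.sunderland.ac.uk
print("location:", parts.path)     # /~cs0jel/teaching/com268/Lglass.asc
```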
15
The Web
16
Conclusions
• Stemmers
  – Porter's Algorithm
  – Dictionary Based
  – disadvantages
• Inverted Files
• Hypertext
• N-grams and other Data Structures