Ranked Retrieval
INST 734 Module 3
Doug Oard
Agenda
Ranked retrieval
Similarity-based ranking
Probability-based ranking
What’s a Model?
Model
–A simplification that describes something complex
–A particular way of “looking at things”
Computational model
–A simplification of reality that facilitates computation
Similarity-Based Queries
Treat the query as if it were a document
–Create a query bag-of-words
Find the similarity of each document
–Using the coordination measure, for example (a sketch follows this slide)
Rank order the documents by similarity
–Most similar to the query first
Surprisingly, this works pretty well!
–Especially for very short queries
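A minimal sketch of coordination-measure ranking, assuming whitespace tokenization; the function and variable names are illustrative, not from the deck.

def coordination(query, doc):
    # Coordination measure: how many distinct query terms appear in the document.
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["rabbits eat carrots daily",
        "the carrot is orange",
        "rabbits dig burrows near carrots"]
query = "rabbits carrots"

# Rank document indices, most similar to the query first.
ranking = sorted(range(len(docs)), key=lambda i: coordination(query, docs[i]), reverse=True)
print(ranking)  # [0, 2, 1]: docs 0 and 2 match both query terms, doc 1 matches neither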
Counting Terms
Terms tell us about documents
–If “rabbit” appears a lot, it may be about rabbits
Documents tell us about terms
–“the” is in every document -- not discriminating
Documents are most likely described well by rare terms that occur in them frequently
–Higher “term frequency” is stronger evidence
–Low “document frequency” makes it stronger still
A Partial Solution: TF*IDF
High TF is evidence of meaning
Low DF is evidence of term importance
–Equivalently, high “IDF”
Multiply them to get a “term weight”
Add up the weights for each query term (a sketch follows this slide)
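A minimal sketch of this TF*IDF scheme, assuming raw term counts for TF and base-10 logs for IDF (consistent with the worked example on the next slide); all names are illustrative.

import math
from collections import Counter

def tfidf_weights(docs):
    # TF*IDF weight per (document, term): raw TF times log10(N / DF).
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    return [{t: tf * math.log10(N / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenized]

def score(query, doc_weights):
    # Add up the weights for each query term, as the slide describes.
    return sum(doc_weights.get(t, 0.0) for t in query.lower().split())

Note that a term appearing in every document gets IDF log10(N/N) = 0, so it contributes nothing to any score, matching the “the” example above.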
TF*IDF Example
[table: TF counts, IDF values, and the resulting TF*IDF term weights for eight terms (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval) across four documents]
query: contaminated retrieval
Result: 2, 3, 1, 4
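The IDF values visible in the original table are consistent with a base-10 formulation over N = 4 documents:

\mathrm{IDF}(t) = \log_{10}\frac{N}{\mathrm{DF}(t)}: \quad \log_{10}\tfrac{4}{1} = 0.602, \quad \log_{10}\tfrac{4}{2} = 0.301, \quad \log_{10}\tfrac{4}{3} = 0.125, \quad \log_{10}\tfrac{4}{4} = 0.000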
The Document Length Effect
Document lengths vary in many collections
Long documents have an unfair advantage
–They use a lot of terms, so they get more matches than short documents
–They use the same terms repeatedly, so they have much higher term frequencies
Two strategies
–Adjust term frequencies for document length
–Divide the documents into equal “passages”
Passage Retrieval
Break long documents up somehow
–On chapter or section boundaries
–On topic boundaries (“text tiling”)
–Overlapping 300-word passages (“sliding window”; a sketch follows this slide)
Use the best passage’s rank as the document’s rank
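A minimal sliding-window sketch. The slide specifies 300-word passages but not the overlap, so the 150-word stride (50% overlap) is an assumption, and all names are illustrative.

def sliding_passages(text, size=300, stride=150):
    # Split a document into overlapping fixed-size word windows.
    words = text.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reaches the end of the document
    return passages or [text]

def document_score(text, score_passage):
    # Use the best passage's score as the document's score.
    return max(score_passage(p) for p in sliding_passages(text))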
“Cosine” Normalization
Compute the length of each document vector
–Multiply each weight by itself
–Add all the resulting values
–Take the square root of that sum
Divide each weight by that length (a sketch follows this slide)
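A minimal sketch that follows the slide’s steps literally, assuming a term-to-weight dict such as the TF*IDF vectors sketched earlier; names are illustrative.

import math

def cosine_normalize(weights):
    # Length of the vector: square each weight, sum the results, take the square root.
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0.0:
        return dict(weights)  # all-zero vector: nothing to normalize
    # Divide each weight by that length.
    return {t: w / length for t, w in weights.items()}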
Cosine Normalization Example
[table: the TF*IDF weights from the previous example, the resulting document vector lengths (1.70, 0.97, 2.67, 0.87), and the cosine-normalized weights for the four documents]
query: contaminated retrieval
Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4)
Why Call It “Cosine”?
[figure: two document vectors, d1 and d2, drawn in term space with the angle between them]
Formally …
\mathrm{sim}(\vec{d_j}, \vec{q}) = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
where \vec{d_j} is the document vector, \vec{q} is the query vector, the numerator is the inner product, and the denominator is the length normalization.
Interpreting the Cosine Measure
Think of the query and the document as vectors
–Query normalization does not change the ranking
–Square root does not change the ranking
Similarity is the cosine of the angle between the two vectors
–Small angle = very similar
–Large angle = little similarity
Passes some key sanity checks
–Depends on pattern of word use but not on length
–Every document is most similar to itself
“Okapi BM-25” Term Weights
w_{i,j} = \underbrace{\frac{tf_{i,j}}{k_1\left((1-b) + b\,\frac{dl_j}{avdl}\right) + tf_{i,j}}}_{\text{TF component}} \times \underbrace{\log\frac{N - df_i + 0.5}{df_i + 0.5}}_{\text{IDF component}}
where tf_{i,j} is the term frequency, df_i the document frequency, dl_j the document length, avdl the average document length, N the collection size, and k_1 and b are tuning constants.
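The same weight as a minimal sketch, using the conventional defaults k1 = 1.2 and b = 0.75 (assumptions; the deck does not give values). Some BM25 variants also scale the TF component by (k1 + 1); this sketch uses the plain form above.

import math

def bm25_weight(tf, df, doc_len, avg_doc_len, N, k1=1.2, b=0.75):
    # TF component: grows with tf but saturates, damped for long documents.
    tf_part = tf / (k1 * ((1.0 - b) + b * doc_len / avg_doc_len) + tf)
    # IDF component: rare terms (low df) get higher weight.
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    return tf_part * idf_part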
Summary
Goal: find the documents most similar to the query
Compute normalized document term weights
–Some combination of TF, DF, and length
Sum the weights for each query term
–In linear algebra, this is an “inner product” operation (a sketch follows this slide)
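A minimal sketch of that inner product over sparse weight vectors (dicts mapping term to weight, as in the earlier sketches); names are illustrative.

def inner_product(query_weights, doc_weights):
    # Sum, over the query terms, of query weight times document weight.
    return sum(wq * doc_weights.get(t, 0.0) for t, wq in query_weights.items())

def rank(query_weights, doc_vectors):
    # Rank document indices by inner-product score, highest first.
    return sorted(range(len(doc_vectors)),
                  key=lambda i: inner_product(query_weights, doc_vectors[i]),
                  reverse=True)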
Agenda
Ranked retrieval
Similarity-based ranking
Probability-based ranking