Introduction to Information Retrieval
Hinrich Schütze and Christina Lioma
Lecture 7: Scores in a Complete Search System
Efficient cosine ranking
Find the K docs in the collection "nearest" to the query, i.e., the K largest query-doc cosines.
Efficient ranking:
– Computing a single cosine efficiently.
– Choosing the K largest cosine values efficiently.
Sec. 7.1
Why is ranking so important?
Users want to look at a few results – not thousands.
It's very hard to write queries that produce only a few results.
There is more data supporting the observation that users only look at a few results.
Computing cosine scores
Sec. 6.3.3
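Below is a minimal Python sketch of term-at-a-time cosine scoring in the spirit of the CosineScore procedure of Section 6.3.3: one score accumulator per candidate document, followed by length normalization and top-K selection. The in-memory index of (doc_id, weight) postings and the doc_lengths table of document vector lengths are illustrative assumptions, not structures defined in the lecture.

```python
from collections import defaultdict
import heapq

def cosine_score(query_weights, index, doc_lengths, k):
    """Term-at-a-time cosine scoring with one accumulator per candidate doc.

    query_weights: dict term -> w_{t,q}
    index:         dict term -> list of (doc_id, w_{t,d}) postings
    doc_lengths:   dict doc_id -> Euclidean length of the doc's weight vector
    """
    scores = defaultdict(float)
    for term, w_tq in query_weights.items():
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] += w_tq * w_td            # accumulate dot product
    for doc_id in scores:
        scores[doc_id] /= doc_lengths[doc_id]        # length-normalize docs
    # the K docs with the largest cosine scores
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```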
Special case – unweighted queries
No weighting on query terms
– Assume each query term occurs only once
Then for ranking, we don't need to normalize the query vector
– Slight simplification of the algorithm from Lecture 6
Sec. 7.1
Faster cosine: unweighted query
Sec. 7.1
Computing the K largest cosines: selection vs. sorting
Typically we want to retrieve the top K docs (in the cosine ranking for the query).
Can we pick off the docs with the K highest cosines?
Let J = number of docs with nonzero cosines
– We seek the K best of these J
Sec. 7.1
Use heap for selecting top K
Binary tree in which each node's value is greater than the values of its children (a max-heap)
Takes 2J operations to construct; then each of the K "winners" is read off in log J steps.
Sec. 7.1
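A sketch of this selection step: build a max-heap over the J nonzero scores (Python's heapq is a min-heap, so scores are negated), then pop off the K winners. The scores dict is assumed to come from an accumulator pass such as the sketch above.

```python
import heapq

def top_k(scores, k):
    """Select the K largest of the J nonzero scores without sorting all J.

    scores: dict doc_id -> cosine score
    Returns up to K (doc_id, score) pairs, best first.
    """
    # Build a max-heap in linear time (negate scores for Python's min-heap),
    # then read off each of the K winners in O(log J).
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)
    winners = []
    for _ in range(min(k, len(heap))):
        neg_score, doc_id = heapq.heappop(heap)
        winners.append((doc_id, -neg_score))
    return winners
```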
Generic approach
Find a set A of contenders, with K < |A| << N
– A does not necessarily contain the top K, but has many docs from among the top K
– Return the top K docs in A
The same approach is also used for other (non-cosine) scoring functions
Sec. 7.1.1
Index elimination
The basic cosine computation only considers docs containing at least one query term.
Take this further:
– Only consider high-idf query terms
– Only consider docs containing many query terms
Sec. 7.1.2
High-idf query terms only
For a query such as catcher in the rye
Only accumulate scores from catcher and rye
Benefit:
– Postings of low-idf terms have many docs; these (many) docs get eliminated from the set A of contenders
Sec. 7.1.2
Docs containing many query terms
Any doc with at least one query term is a candidate for the top K output list
For multi-term queries, only compute scores for docs containing several of the query terms
– Say, at least 3 out of 4
Easy to implement in postings traversal
Sec. 7.1.2
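A minimal sketch of this filter, assuming each term's postings is a list of doc IDs: count how many query terms each doc contains and keep only docs reaching the threshold. The example data reproduces the postings shown on the next slide.

```python
from collections import Counter

def candidate_docs(query_terms, index, min_terms=3):
    """Docs containing at least `min_terms` of the query terms.

    index: dict term -> list of doc IDs (postings)
    """
    counts = Counter()
    for term in query_terms:
        for doc_id in index.get(term, []):
            counts[doc_id] += 1
    return {doc_id for doc_id, c in counts.items() if c >= min_terms}

# Postings from the next slide: requiring 3 of the 4 terms leaves docs 8, 16, 32.
index = {
    "antony":    [3, 4, 8, 16, 32, 64, 128],
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "calpurnia": [13, 16, 32],
}
print(candidate_docs(["antony", "brutus", "caesar", "calpurnia"], index))  # {8, 16, 32}
```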
3 of 4 query terms
Antony:    3  4  8  16  32  64  128
Brutus:    2  4  8  16  32  64  128
Caesar:    1  2  3  5  8  13  21  34
Calpurnia: 13 16 32
Scores only computed for docs 8, 16 and 32.
Sec. 7.1.2
Champion lists
Precompute, for each dictionary term t, the r docs of highest weight in t's postings
– Call this the champion list for t
– (also called the fancy list or top docs for t)
Note that r has to be chosen at index build time
– Thus, it's possible that r < K
At query time, only compute scores for docs in the champion list of some query term
Sec. 7.1.3
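A sketch of champion lists under the same assumed (doc_id, weight) postings layout as above: at build time keep the r highest-weight docs per term, and at query time restrict scoring to the union of the query terms' champion lists. The parameter r is illustrative.

```python
import heapq

def build_champion_lists(index, r):
    """At index build time: keep, for each term, the r highest-weight postings.

    index: dict term -> list of (doc_id, weight)
    """
    return {term: heapq.nlargest(r, postings, key=lambda p: p[1])
            for term, postings in index.items()}

def champion_candidates(query_terms, champion_lists):
    """At query time: only docs in some query term's champion list get scored."""
    return {doc_id
            for term in query_terms
            for doc_id, _ in champion_lists.get(term, [])}
```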
Static quality scores
In many search engines, we have available a measure of quality for each document that is query-independent and thus static. This quality measure may be viewed as a number between zero and one.
Assign to each document d a query-independent quality score g(d) in [0,1].
For instance, in the context of news stories on the web, g(d) may be derived from the number of favorable reviews of the story by web surfers.
The discussion of other types of indexes provides further coverage of this topic, as does Chapter 21 in the context of web search.
Sec. 7.1.4
Net score
The net score for a document is some combination of the static quality g(d) together with the query-dependent cosine score.
In this simple form, the static quality and the query-dependent score from (6.10) have equal contributions, assuming each is between 0 and 1.
Sec. 7.1.4
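The "simple form" referred to here is the additive combination discussed in Section 7.1.4:

net-score(q, d) = g(d) + score(q, d)

where score(q, d) is the query-dependent (e.g., cosine) score and g(d) is the static quality score.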
Champion lists in g(d)-ordering
Can combine champion lists with g(d)-ordering
Maintain for each term a champion list of the r docs with highest g(d) + tf-idf_{t,d}
Seek top-K results from only the docs in these champion lists
Sec. 7.1.4
High and low lists
For each term, we maintain two postings lists called high and low
– Think of high as the champion list
When traversing postings on a query, traverse only the high lists first
– If we get more than K docs, select the top K and stop
– Else proceed to get docs from the low lists
Can be used even for simple cosine scores, without the quality score g(d)
A means for segmenting the index into two tiers
Sec. 7.1.4
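A short sketch of this traversal order, assuming each term has separate high and low postings lists in the same (doc_id, weight) format as above: gather candidates from the high lists, and fall back to the low lists only if that yields fewer than K docs.

```python
def high_low_candidates(query_terms, high_index, low_index, k):
    """Traverse the high lists first; fall back to the low lists only if the
    high lists alone yield fewer than k candidate docs.

    high_index, low_index: dict term -> list of (doc_id, weight)
    """
    candidates = {doc_id
                  for term in query_terms
                  for doc_id, _ in high_index.get(term, [])}
    if len(candidates) < k:
        candidates |= {doc_id
                       for term in query_terms
                       for doc_id, _ in low_index.get(term, [])}
    return candidates
```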
Impact-ordered postings
We only want to compute scores for docs for which tf_{t,d} is high enough
We sort each postings list by decreasing tf_{t,d}
Now: not all postings are in a common order!
How do we compute scores in order to pick off the top K?
Sec. 7.1.5
1. Early termination
When traversing a term's postings, stop early after either
– a fixed number of docs r, or
– tf_{t,d} drops below some threshold
Take the union of the resulting sets of docs
– One set from the postings of each query term
Compute scores only for docs in this union
Sec. 7.1.5
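A sketch of early termination over impact-ordered postings. It assumes each term's postings list is stored as (doc_id, tf) pairs sorted by decreasing tf; the values of r and the tf threshold are illustrative.

```python
def early_termination_candidates(query_terms, impact_index, r=100, tf_threshold=5):
    """Union of docs gathered by traversing each term's impact-ordered postings,
    stopping after r docs or once tf drops below tf_threshold.

    impact_index: dict term -> list of (doc_id, tf) sorted by decreasing tf
    """
    union = set()
    for term in query_terms:
        for i, (doc_id, tf) in enumerate(impact_index.get(term, [])):
            if i >= r or tf < tf_threshold:
                break                      # stop early for this term
            union.add(doc_id)
    return union
```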
2. idf-ordered terms
When considering the postings of the query terms,
look at them in order of decreasing idf
– High-idf terms are likely to contribute most to the score
As we get to query terms with lower idf, we can determine whether to proceed based on the changes in document scores from processing the previous query term. If these changes are minimal, we may omit the accumulation from the remaining query terms.
Sec. 7.1.5
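One possible rendering of this heuristic as code, assuming a precomputed idf table and tf-idf-weighted postings; the cutoff on the total score change contributed by a term is an illustrative choice, not something specified in the lecture.

```python
from collections import defaultdict

def idf_ordered_scores(query_terms, index, idf, min_change=0.05):
    """Accumulate scores term by term in decreasing idf order; stop once a
    term's total contribution to the scores becomes negligible.

    index: dict term -> list of (doc_id, w_{t,d}) tf-idf weighted postings
    idf:   dict term -> idf value
    """
    scores = defaultdict(float)
    for term in sorted(query_terms, key=lambda t: idf.get(t, 0.0), reverse=True):
        change = 0.0
        for doc_id, w_td in index.get(term, []):
            scores[doc_id] += w_td
            change += w_td
        if change < min_change:            # lower-idf terms will add even less
            break
    return scores
```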
Cluster pruning: preprocessing
In cluster pruning we have a preprocessing step during which we cluster the document vectors.
Pick √N docs at random: call these leaders
For every other doc, pre-compute its nearest leader
– Docs attached to a leader: its followers
– Likely: each leader has ~ √N followers
Sec. 7.1.6
Visualization
(Figure: query point, leaders, and their followers)
Sec. 7.1.6
Cluster pruning: query processing
Process a query as follows:
– Given query Q, find its nearest leader L.
– Seek the K nearest docs from among L's followers.
Sec. 7.1.6
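A compact sketch covering both the preprocessing and the query-processing step, assuming documents and the query are plain Python lists of floats (dense vectors) and cosine similarity as the proximity measure; the leader sampling and data layout are illustrative.

```python
import math
import random
from collections import defaultdict

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def build_clusters(doc_vectors, seed=0):
    """Preprocessing: pick ~sqrt(N) random leaders; attach every doc to its
    nearest leader (the leader's 'followers')."""
    random.seed(seed)
    doc_ids = list(doc_vectors)
    leaders = random.sample(doc_ids, max(1, math.isqrt(len(doc_ids))))
    followers = defaultdict(list)
    for doc_id in doc_ids:
        nearest = max(leaders, key=lambda L: cosine(doc_vectors[doc_id], doc_vectors[L]))
        followers[nearest].append(doc_id)
    return leaders, followers

def cluster_pruned_search(query_vec, doc_vectors, leaders, followers, k):
    """Query processing: find the nearest leader L, then rank only L's followers."""
    L = max(leaders, key=lambda l: cosine(query_vec, doc_vectors[l]))
    ranked = sorted(followers[L],
                    key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
    return ranked[:k]
```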
Tiered indexes
Break postings up into a hierarchy of lists
In this example (the tf values of Figure 6.9) we set a tf threshold of 20 for tier 1 and 10 for tier 2, meaning that the tier 1 index only has postings entries with tf values exceeding 20, the tier 2 index only has postings entries with tf values exceeding 10, and tier 3 has the remaining postings entries.

Term        Doc1  Doc2  Doc3
Car           27     4    24
Auto           3    33     0
Insurance      0    33    29
Best          14     0    17

Sec. 7.2.1
Example tiered index
Sec. 7.2.1
Tiered indexes
Basic idea:
– Create several tiers of indexes, corresponding to the importance of indexing terms
– During query processing, start with the highest-tier index
– If the highest-tier index returns at least k (e.g., k = 100) results: stop and return the results to the user
– If we've only found < k hits: repeat for the next index in the tier cascade
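A sketch of this query-time cascade, assuming the tiers are given as a list of separate indexes (highest tier first) and reusing the cosine_score sketch from earlier as the per-tier scoring routine; the names and the default k are illustrative.

```python
def tiered_search(query_weights, tiers, doc_lengths, k=100):
    """Query the highest tier first; fall through to lower tiers only while
    fewer than k results have been found.

    tiers: list of indexes (dict term -> list of (doc_id, weight)), highest first
    """
    results = []
    for tier_index in tiers:
        results = cosine_score(query_weights, tier_index, doc_lengths, k)
        if len(results) >= k:
            break                          # enough hits: stop the cascade
    return results
```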
Query term proximity
Free text queries: just a set of terms typed into the query box – common on the web
Users prefer docs in which the query terms occur within close proximity of each other.
Let w be the smallest window in a doc containing all query terms, e.g., for the query strained mercy, the smallest window in the doc "The quality of mercy is not strained" is 4 (words).
In cases where the document does not contain all of the query terms, we can set w to be some enormous number.
We could also consider variants in which only words that are not stop words are considered in computing w.
Sec. 7.2.2
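A sketch of computing w for a single document, given as a list of tokens: a standard sliding minimum-window computation over the query terms (not necessarily how any particular engine implements proximity).

```python
def smallest_window(doc_tokens, query_terms):
    """Width w (in words) of the smallest window of doc_tokens containing all
    query terms; returns infinity if some query term is missing."""
    query = set(query_terms)
    if not query.issubset(doc_tokens):
        return float("inf")                     # doc lacks some query term
    best = len(doc_tokens)
    counts = {}
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in query:
            counts[token] = counts.get(token, 0) + 1
        while all(counts.get(t, 0) >= 1 for t in query):
            best = min(best, right - left + 1)  # shrink the window from the left
            if doc_tokens[left] in query:
                counts[doc_tokens[left]] -= 1
            left += 1
    return best

doc = "The quality of mercy is not strained".lower().split()
print(smallest_window(doc, ["strained", "mercy"]))   # 4
```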
Query parsers
A free text query from the user may in fact spawn one or more queries to the indexes, e.g., for the query rising interest rates:
1. Run the query as a phrase query
2. If fewer than K docs contain the phrase rising interest rates, run the two phrase queries rising interest and interest rates
3. If we still have fewer than K results, run the vector space query (Section 6.3.2) consisting of the three individual query terms
Sec. 7.2.3
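A sketch of this cascade. The helpers phrase_search and vector_space_search are hypothetical placeholders standing in for the phrase-query and vector-space machinery of earlier lectures; they are assumed to return ranked (doc_id, score) lists.

```python
def parse_and_run(query, phrase_search, vector_space_search, k):
    """Spawn successively broader queries until at least K results are found.

    phrase_search(phrase) and vector_space_search(terms) are assumed to return
    lists of (doc_id, score) pairs, best first; deduplication is omitted here.
    """
    terms = query.split()                        # e.g. "rising interest rates"
    results = phrase_search(query)               # 1. the full phrase query
    if len(results) < k and len(terms) >= 2:
        # 2. overlapping two-word phrase queries, e.g. "rising interest", "interest rates"
        for biword in (" ".join(pair) for pair in zip(terms, terms[1:])):
            results += phrase_search(biword)
    if len(results) < k:
        results += vector_space_search(terms)    # 3. individual terms, vector space
    return results[:k]
```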
Aggregate scores
Each of these steps yields a list of documents, and for each of them we compute a score.
We've seen that score functions can combine cosine, static quality, proximity, etc.
How do we know the best combination?
– Some applications – expert-tuned
– Increasingly common: machine-learned
– See the 15th lecture
Sec. 7.2.3
Putting it all together
Sec. 7.2.4
Components we have introduced thus far
– Parsing and linguistic processing (tokenization and stemming)
– Metadata in zone and field indexes
– Tiered inverted positional indexes
– Inexact top-K retrieval
– k-gram indexes for wildcard queries and spelling correction
– Spelling correction
– Free text query parser
– Scoring and ranking
Components we haven't covered yet
– Document cache: we need this for generating snippets (= dynamic summaries)
– Machine-learned ranking: a technique that builds on Section 6.1.2 (to be further developed in Section 15.4.1) for scoring and ranking documents