Hinrich Schütze and Christina Lioma

Hinrich Schütze and Christina Lioma Lecture 7: Scores in a Complete Search System

Overview Recap Why rank? More on cosine Implementation of ranking The complete search system

Outline Recap Why rank? More on cosine Implementation of ranking The complete search system

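Term frequency weight
The log frequency weight of term $t$ in document $d$ is defined as follows:
$w_{t,d} = 1 + \log_{10} \mathrm{tf}_{t,d}$ if $\mathrm{tf}_{t,d} > 0$, and $w_{t,d} = 0$ otherwise.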

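idf weight
The document frequency $\mathrm{df}_t$ is defined as the number of documents that $t$ occurs in.
We define the idf weight of term $t$ as follows: $\mathrm{idf}_t = \log_{10} (N / \mathrm{df}_t)$, where $N$ is the number of documents in the collection.
idf is a measure of the informativeness of the term.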

tf-idf weight
The tf-idf weight of a term is the product of its tf weight and its idf weight:
$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \cdot \log_{10} (N / \mathrm{df}_t)$
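The three weights from the recap can be sketched in a few lines. This is a minimal illustration, assuming the standard IIR definitions with base-10 logarithms (log frequency weight 1 + log10 tf for tf > 0, idf = log10(N/df)); the function names and the example numbers are hypothetical.

```python
import math


def log_tf_weight(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def idf_weight(n_docs: int, df: int) -> float:
    """Inverse document frequency: log10(N / df)."""
    return math.log10(n_docs / df)


def tf_idf_weight(tf: int, df: int, n_docs: int) -> float:
    """tf-idf weight: product of the tf weight and the idf weight."""
    return log_tf_weight(tf) * idf_weight(n_docs, df)


# Hypothetical numbers: a term that occurs 10 times in a document and
# appears in 1,000 of 1,000,000 documents.
example_weight = tf_idf_weight(tf=10, df=1_000, n_docs=1_000_000)
print(example_weight)
```

A term that is absent from the document (tf = 0) gets weight 0 regardless of how rare it is.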

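Cosine similarity between query and document
$\cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| \, |\vec{d}|} = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^2} \, \sqrt{\sum_{i} d_i^2}}$
$q_i$ is the tf-idf weight of term $i$ in the query.
$d_i$ is the tf-idf weight of term $i$ in the document.
$|\vec{q}|$ and $|\vec{d}|$ are the lengths of $\vec{q}$ and $\vec{d}$.
$\vec{q} / |\vec{q}|$ and $\vec{d} / |\vec{d}|$ are length-1 vectors (= normalized).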

[Figure: Cosine similarity illustrated]

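tf-idf example: lnc.ltn
Query: “best car insurance”. Document: “car insurance auto insurance”.
Scheme: lnc for the document (log tf, no idf, cosine normalization) and ltn for the query (log tf, idf, no normalization).
Key to columns: tf: term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n’lized: document weights after cosine normalization; product: the product of final query weight and final document weight.
[Table of per-term weights not reproduced]
Document length normalization: $1/1.92 \approx 0.52$ and $1.3/1.92 \approx 0.68$.
Final similarity score between query and document: $\sum_i w_{qi} \cdot w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08$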
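The lnc.ltn computation can be retraced in code. The sketch below is an illustration under stated assumptions: the idf values are not given in this transcript, so `ASSUMED_IDF` is hypothetical. The values for "car" and "insurance" are chosen to be consistent with the products 1.04 and 2.04 above, and the value for "best" does not affect the score because the document does not contain that term.

```python
import math
from collections import Counter


def log_tf(tf: int) -> float:
    """Log frequency weight: 1 + log10(tf) if tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


def lnc_document_weights(tokens):
    """lnc: log tf, no idf, cosine (length) normalization."""
    raw = {term: log_tf(count) for term, count in Counter(tokens).items()}
    length = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / length for term, w in raw.items()}


def ltn_query_weights(tokens, idf):
    """ltn: log tf, idf, no normalization."""
    return {term: log_tf(count) * idf[term]
            for term, count in Counter(tokens).items()}


# Assumed idf values (hypothetical, see the note above the code).
ASSUMED_IDF = {"best": 1.3, "car": 2.0, "insurance": 3.0}

doc = lnc_document_weights("car insurance auto insurance".split())
query = ltn_query_weights("best car insurance".split(), ASSUMED_IDF)

# Dot product over the query terms; terms missing from the document add 0.
score = sum(w * doc.get(term, 0.0) for term, w in query.items())
print(round(score, 2))
```

Without intermediate rounding the score comes out at roughly 3.07; the 3.08 on the slide results from rounding the normalized document weights to 0.52 and 0.68 before multiplying.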

Take-away today
The importance of ranking: User studies at Google
Length normalization: Pivot normalization
Implementation of ranking
The complete search system

Outline Recap Why rank? More on cosine Implementation of ranking The complete search system

Why is ranking so important?
Last lecture: Problems with unranked retrieval.
Users want to look at a few results, not thousands.
It’s very hard to write queries that produce a few results, even for expert searchers.
→ Ranking is important because it effectively reduces a large set of results to a very small one.
Next: More data on “users only look at a few results”.
Actually, in the vast majority of cases they only examine 1, 2, or 3 results.

Empirical investigation of the effect of ranking
How can we measure how important ranking is?
Observe what searchers do when they are searching in a controlled setting: videotape them, ask them to “think aloud”, interview them, eye-track them, time them, and record and count their clicks.
The following slides are from Dan Russell’s JCDL talk. Dan Russell is the “Über Tech Lead for Search Quality & User Happiness” at Google.

[Six image-only slides; figures not reproduced]

Importance of ranking: Summary
Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
Clicking: The distribution is even more skewed for clicking. In 1 out of 2 cases, users click on the top-ranked page. Even if the top-ranked page is not relevant, 30% of users will click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.

Outline Recap Why rank? More on cosine Implementation of ranking The complete search system

Why distance is a bad idea
The Euclidean distance of $\vec{q}$ and $\vec{d}_2$ is large although the distribution of terms in the query $q$ and the distribution of terms in the document $d_2$ are very similar.
That’s why we do length normalization or, equivalently, use cosine to compute query-document matching scores.
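A small numeric illustration of this point, using made-up two-dimensional vectors (the values are hypothetical and not taken from the slide's figure): a document with the same term distribution as the query but ten times the length is far away in Euclidean terms, yet has cosine similarity 1.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    len_u = math.sqrt(sum(a * a for a in u))
    len_v = math.sqrt(sum(b * b for b in v))
    return dot / (len_u * len_v)


q = [1.0, 1.0]          # hypothetical query vector
d_long = [10.0, 10.0]   # same term distribution as q, ten times longer
d_other = [1.0, 0.0]    # short document with a different distribution

print(euclidean(q, d_long), euclidean(q, d_other))
print(cosine(q, d_long), cosine(q, d_other))
```

Ranking by distance would prefer `d_other`, while ranking by cosine prefers `d_long`, which is the behavior length normalization is meant to provide.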

Exercise: A problem for cosine normalization
Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents:
d1: a short document on anti-doping rules at 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics
What ranking do we expect in the vector space model?
What can we do about this?

Pivot normalization
Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by linear adjustment: “turning” the average normalization on the pivot.
Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase.
This removes the unfair advantage that short documents have.
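The slide does not state the adjustment formula. A minimal sketch, assuming the linear form commonly attributed to Singhal et al.'s pivoted document length normalization, where the new normalization factor is (1 - slope) * pivot + slope * (old cosine normalization factor). The pivot and slope values below are hypothetical.

```python
def pivoted_norm(cosine_norm: float, pivot: float, slope: float) -> float:
    """Pivoted normalization factor (assumed linear form).

    cosine_norm: the document's ordinary cosine normalization factor
                 (its vector length).
    pivot:       the point at which old and new normalization agree,
                 e.g. the average cosine normalization factor.
    slope:       a value below 1 that tilts the line around the pivot.
    """
    return (1.0 - slope) * pivot + slope * cosine_norm


PIVOT, SLOPE = 10.0, 0.75   # hypothetical settings

short_doc = pivoted_norm(4.0, PIVOT, SLOPE)    # larger divisor than 4.0
long_doc = pivoted_norm(40.0, PIVOT, SLOPE)    # smaller divisor than 40.0
print(short_doc, long_doc)
```

Because term weights are divided by the normalization factor, a larger factor for short documents lowers their similarities and a smaller factor for long documents raises theirs, matching the effect described above.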

[Figure: Predicted and true probability of relevance. Source: Lillian Lee]

[Figure: Pivot normalization. Source: Lillian Lee]

[Figure: Pivoted normalization: Amit Singhal’s experiments (relevant documents retrieved and (change in) average precision)]

Outline Recap Why rank? More on cosine Implementation of ranking The complete search system

Now we also need term frequencies in the index
[Figure label: term frequencies]
We also need positions (not shown here).

Term frequencies in the inverted index
In each posting, store $\mathrm{tf}_{t,d}$ in addition to docID $d$.
Store it as an integer frequency, not as a (log-)weighted real number, because real numbers are difficult to compress.
Unary code is effective for encoding term frequencies. Why?
Overall, additional space requirements are small: less than a byte per posting with bitwise compression, or a byte per posting with variable byte code.
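One way to see why unary code suits term frequencies: most postings have a small tf, and unary code spends only a few bits on small numbers. The sketch below assumes the convention in which n is written as n ones followed by a terminating zero; the list of term frequencies is hypothetical.

```python
def unary_encode(n: int) -> str:
    """Unary code of n: n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("unary code is defined for non-negative integers")
    return "1" * n + "0"


def unary_decode_all(bits: str) -> list:
    """Decode a concatenation of unary codes back into integers."""
    values, run = [], 0
    for bit in bits:
        if bit == "1":
            run += 1
        else:               # a zero terminates the current code
            values.append(run)
            run = 0
    return values


# Hypothetical term frequencies for six postings; most are 1.
tfs = [1, 1, 2, 1, 3, 1]
encoded = "".join(unary_encode(tf) for tf in tfs)
bits_per_posting = len(encoded) / len(tfs)
print(encoded, bits_per_posting)
```

With these values the six frequencies take 15 bits in total, well under a byte per posting.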

Exercise: How do we compute the top k in ranking?
In many applications, we don’t need a complete ranking. We just need the top k for a small k (e.g., k = 100).
If we don’t need a complete ranking, is there an efficient way of computing just the top k?
Naive: Compute scores for all N documents, sort, and return the top k.
What’s bad about this? Alternative?

Use min heap for selecting top k out of N
Use a binary min heap. A binary min heap is a binary tree in which each node’s value is less than (or equal to) the values of its children.
It takes O(N log k) operations to construct (where N is the number of documents), then we read off the k winners in O(k log k) steps.

[Figure: Binary min heap]

Selecting top k scoring documents in O(N log k)
Goal: Keep the top k documents seen so far.
Use a binary min heap.
To process a new document d′ with score s′:
Get the current minimum h_m of the heap (O(1)).
If s′ ≤ h_m, skip to the next document.
If s′ > h_m, heap-delete-root (O(log k)) and heap-add d′/s′ (O(log k)).
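The procedure above can be sketched with Python's standard `heapq` module, which implements a binary min heap on a list. The function name and the example scores are hypothetical; the sketch also fills the heap with the first k documents before it starts comparing against the minimum.

```python
import heapq


def top_k(scored_docs, k):
    """Return the k highest-scoring (doc_id, score) pairs, best first.

    scored_docs: iterable of (doc_id, score) pairs.
    """
    heap = []  # min heap of (score, doc_id); heap[0] is the current minimum
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))      # O(log k)
        elif score > heap[0][0]:
            # Delete the root and add the new entry in one O(log k) step.
            heapq.heapreplace(heap, (score, doc_id))
        # Otherwise the score is <= the current minimum: skip the document.
    # Read off the k winners in O(k log k).
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


# Hypothetical scores for five documents.
scores = [("d1", 0.3), ("d2", 0.9), ("d3", 0.1), ("d4", 0.7), ("d5", 0.5)]
print(top_k(scores, 3))
```

The heap never holds more than k entries, which is where the O(N log k) bound comes from.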

[Figure: Priority queue example]

Even more efficient computation of top k?
Ranking has time complexity O(N), where N is the number of documents.
Optimizations reduce the constant factor, but they are still O(N), and $N > 10^{10}$.
Are there sublinear algorithms?
What we’re doing in effect: solving the k-nearest neighbor (kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are sublinear.
We will revisit this issue when we do kNN classification in IIR 14.

More efficient computation of top k: Heuristics
Idea 1: Reorder postings lists. Instead of ordering according to docID, order according to some measure of “expected relevance”.
Idea 2: Heuristics to prune the search space. Not guaranteed to be correct, but fails rarely. In practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.

Non-docID ordering of postings lists
So far: postings lists have been ordered according to docID.
Alternative: a query-independent measure of “goodness” of a page.
Example: PageRank g(d) of page d, a measure of how many “good” pages hyperlink to d (chapter 21).
Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > . . .
Define composite score of a document: net-score(q, d) = g(d) + cos(q, d)
This scheme supports early termination: We do not have to process postings lists in their entirety to find the top k.

Non-docID ordering of postings lists (2)
Order documents in postings lists according to PageRank: g(d1) > g(d2) > g(d3) > . . .
Define composite score of a document: net-score(q, d) = g(d) + cos(q, d)
Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we’re currently processing; (iii) the smallest top k score we’ve found so far is 1.2.
Then all subsequent scores will be < 1.1, because cos(q, d) ≤ 1 and every later document has an even smaller g.
So we’ve already found the top k and can stop processing the remainder of the postings lists.
Questions?
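The early-termination argument can be sketched as follows. This is a simplified illustration, not a full postings-list merge: it assumes a single stream of candidate documents already sorted by decreasing g(d) in [0, 1] and a cosine function bounded by 1. The function name and the example values are hypothetical.

```python
import heapq


def top_k_early_termination(docs_by_goodness, cos_score, k):
    """Top k by net-score(q, d) = g(d) + cos(q, d) with early termination.

    docs_by_goodness: list of (doc_id, g) sorted by decreasing g in [0, 1].
    cos_score:        function mapping doc_id to a cosine score in [0, 1].
    Returns (ranked list of (net_score, doc_id), number of docs scored).
    """
    heap = []        # min heap of (net_score, doc_id)
    processed = 0
    for doc_id, g in docs_by_goodness:
        # Upper bound for this and all later documents is g + 1.
        if len(heap) == k and g + 1.0 <= heap[0][0]:
            break    # nothing further can beat the current top k
        processed += 1
        net = g + cos_score(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (net, doc_id))
        elif net > heap[0][0]:
            heapq.heapreplace(heap, (net, doc_id))
    return sorted(heap, reverse=True), processed


# Hypothetical goodness values and cosine scores.
docs = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.05), ("e", 0.01)]
cosines = {"a": 0.9, "b": 0.6, "c": 0.8, "d": 1.0, "e": 1.0}
ranked, processed = top_k_early_termination(docs, cosines.get, k=2)
print([doc_id for _, doc_id in ranked], processed)
```

In this example documents "d" and "e" are never scored, even though their cosine scores are maximal, because their goodness values are too low for them to reach the top 2.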

Document-at-a-time processing
Both docID-ordering and PageRank-ordering impose a consistent ordering on documents in postings lists.
Computing cosines in this scheme is document-at-a-time.
We complete computation of the query-document similarity score of document $d_i$ before starting to compute the query-document similarity score of $d_{i+1}$.
Alternative: term-at-a-time processing

Weight-sorted postings lists
Idea: don’t process postings that contribute little to the final score.
Order documents in the postings list according to weight.
Simplest case: normalized tf-idf weight (rarely done: hard to compress).
Documents in the top k are likely to occur early in these ordered lists.
→ Early termination while processing postings lists is unlikely to change the top k.
But: We no longer have a consistent ordering of documents in postings lists, so we can no longer employ document-at-a-time processing.

Term-at-a-time processing
Simplest case: completely process the postings list of the first query term.
Create an accumulator for each docID you encounter.
Then completely process the postings list of the second query term, and so forth.

[Figure: Term-at-a-time processing]

Computing cosine scores
For the web (20 billion documents), an array of accumulators A in memory is infeasible.
Thus: Only create accumulators for docs occurring in postings lists.
This is equivalent to: Do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms).

Accumulators: Example
For query [Brutus Caesar]: We only need accumulators for 1, 5, 7, 13, 17, 83, 87.
We don’t need accumulators for 8, 40, 85.
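Term-at-a-time scoring with accumulators created on demand can be sketched with a dictionary. The postings below are hypothetical: the docIDs are arranged to agree with the example (docs 8, 40 and 85 appear only under a third term that is not in the query), while the term frequencies and term weights are invented. The score is a plain weighted sum without length normalization, which is a simplification.

```python
from collections import defaultdict

# Hypothetical postings lists: term -> list of (docID, tf).
POSTINGS = {
    "brutus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "caesar": [(1, 1), (5, 1), (13, 1), (17, 1)],
    "gold": [(8, 1), (40, 1), (85, 1)],
}


def term_at_a_time(query_terms, postings, term_weight):
    """Process one postings list completely before moving to the next.

    An accumulator is created only when a docID is first encountered,
    so documents containing no query term never get one.
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, tf in postings.get(term, []):
            accumulators[doc_id] += term_weight[term] * tf
    return dict(accumulators)


# Hypothetical query-term weights.
weights = {"brutus": 1.0, "caesar": 1.0}
acc = term_at_a_time(["brutus", "caesar"], POSTINGS, weights)
print(sorted(acc))
```

The accumulator for docID 1 receives contributions from both postings lists; every other accumulator is touched exactly once.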

Removing bottlenecks
Use heap / priority queue as discussed earlier.
Can further limit to docs with non-zero cosines on rare (high idf) words.
Or enforce conjunctive search (à la Google): non-zero cosines on all words in the query.
Example: just one accumulator for [Brutus Caesar] in the example above, because only d1 contains both words.

Outline Recap Why rank? More on cosine Implementation of ranking The complete search system

[Figure: Complete search system]

Tiered indexes
Basic idea: Create several tiers of indexes, corresponding to the importance of indexing terms.
During query processing, start with the highest-tier index.
If the highest-tier index returns at least k (e.g., k = 100) results: stop and return results to the user.
If we’ve only found < k hits: repeat for the next index in the tier cascade.
Example: two-tier system. Tier 1: Index of all titles. Tier 2: Index of the rest of the documents.
Pages containing the search words in the title are better hits than pages containing the search words in the body of the text.
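The tier cascade can be sketched as a loop over indexes. This is a minimal illustration under assumptions not stated on the slide: each tier is a dictionary from term to a set of docIDs, matching is conjunctive (a document must contain all query terms within the tier), and hits from higher tiers are listed first. All names and data are hypothetical.

```python
def tiered_search(query_terms, tiers, k):
    """Query tiers from highest to lowest until at least k hits are found.

    tiers: list of indexes (dict: term -> set of docIDs), highest tier first.
    Returns docIDs, with hits from higher tiers listed first.
    """
    hits, seen = [], set()
    for index in tiers:
        postings = [index.get(term, set()) for term in query_terms]
        matches = set.intersection(*postings) if postings else set()
        for doc_id in sorted(matches - seen):
            hits.append(doc_id)
            seen.add(doc_id)
        if len(hits) >= k:
            break   # enough results: stop before touching lower tiers
    return hits


# Hypothetical two-tier system: titles first, then document bodies.
title_index = {"search": {1, 2}, "engine": {2}}
body_index = {"search": {1, 2, 3, 4}, "engine": {2, 3, 4}}
tiers = [title_index, body_index]

print(tiered_search(["search", "engine"], tiers, k=1))
print(tiered_search(["search", "engine"], tiers, k=2))
```

With k = 1 the title tier alone suffices; with k = 2 the cascade falls through to the body tier and appends its additional matches after the title hit.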

[Figure: Tiered index]

Tiered indexes
The use of tiered indexes is believed to be one of the reasons that Google search quality was significantly higher initially (2000/01) than that of competitors (along with PageRank, use of anchor text and proximity constraints).

Exercise
Design criteria for a tiered system:
Each tier should be an order of magnitude smaller than the next tier.
The top 100 hits for most queries should be in tier 1, the top 100 hits for most of the remaining queries in tier 2, etc.
We need a simple test for “can I stop at this tier or do I have to go to the next one?” There is no advantage to tiering if we have to hit most tiers for most queries anyway.
Question 1: Consider a two-tier system where the first tier indexes titles and the second tier everything. What are potential problems with this type of tiering?
Question 2: Can you think of a better way of setting up a multitier system? Which “zones” of a document should be indexed in the different tiers (title, body of document, others)? What criterion do you want to use for including a document in tier 1?

[Figure: Complete search system]

Components we have introduced thus far
Document preprocessing (linguistic and otherwise)
Positional indexes
Tiered indexes
Spelling correction
k-gram indexes for wildcard queries and spelling correction
Query processing
Document scoring
Term-at-a-time processing

Components we haven’t covered yet
Document cache: we need this for generating snippets (= dynamic summaries).
Zone indexes: They separate the indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.
Machine-learned ranking functions
Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other)
Query parser

Vector space retrieval: Interactions
How do we combine phrase retrieval with vector space retrieval? We do not want to compute document frequency / idf for every possible phrase. Why?
How do we combine Boolean retrieval with vector space retrieval? For example: “+”-constraints and “-”-constraints. Postfiltering is simple, but can be very inefficient. There is no easy answer.
How do we combine wild cards with vector space retrieval? Again, no easy answer.

Take-away today
The importance of ranking: User studies at Google
Length normalization: Pivot normalization
Implementation of ranking
The complete search system

Resources
Chapters 6 and 7 of IIR
Resources at http://ifnlp.org/ir:
How Google tweaks its ranking function
Interview with Google search guru Udi Manber
Yahoo Search BOSS: Opens up the search engine to developers. For example, you can rerank search results.
Compare Google and Yahoo ranking for a query
How Google uses eye tracking for improving search