INF 141: Information Retrieval
Discussion Session, Week 8 – Winter 2011
TA: Sara Javanmardi
Evaluation
- Evaluation is key to building effective and efficient search engines
  - measurement usually carried out in controlled laboratory experiments
  - online testing can also be done
- Effectiveness, efficiency, and cost are related
  - e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration
  - efficiency and cost targets may impact effectiveness
Evaluation
- Precision
- Recall
- NDCG
Effectiveness Measures
- A is the set of relevant documents, B is the set of retrieved documents
- Precision = |A ∩ B| / |B| (fraction of retrieved documents that are relevant)
- Recall = |A ∩ B| / |A| (fraction of relevant documents that are retrieved)
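A minimal sketch of these set-based definitions (the function name and the document IDs are illustrative, not from the slides):

```python
def precision_recall(relevant, retrieved):
    """Set-based precision and recall: A = relevant, B = retrieved."""
    hits = len(relevant & retrieved)  # |A ∩ B|
    return hits / len(retrieved), hits / len(relevant)

# hypothetical document IDs
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 5})
print(p, r)  # 0.666..., 0.5
```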
Problems
- Users look at only the top part of the ranked results
  - so measure precision at rank p (e.g., p = 10)
- Problem: precision at rank p ignores the order of the results within the top p
Discounted Cumulative Gain
- Popular measure for evaluating web search and related tasks
- Two assumptions:
  - highly relevant documents are more useful than marginally relevant documents
  - the lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined
Discounted Cumulative Gain
- Uses graded relevance as a measure of the usefulness, or gain, from examining a document
- Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks
- Typical discount is 1/log(rank)
  - with base 2, the discount at rank 4 is 1/2, and at rank 8 it is 1/3
Discounted Cumulative Gain
- DCG is the total gain accumulated at a particular rank p:
  DCG_p = rel_1 + Σ_{i=2}^{p} rel_i / log2(i)
- Alternative formulation:
  DCG_p = Σ_{i=1}^{p} (2^{rel_i} − 1) / log2(1 + i)
  - used by some web search companies
  - emphasis on retrieving highly relevant documents
DCG Example
- 10 ranked documents judged on a 0–3 relevance scale:
  3, 2, 3, 0, 0, 1, 2, 2, 3, 0
- discounted gain:
  3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0
  = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
- DCG:
  3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
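A small sketch that reproduces the numbers above (the function name dcg is my own; the formula is the first DCG definition from the previous slide):

```python
import math

def dcg(rels):
    """Cumulative DCG at each rank: rel_1 + sum of rel_i / log2(i) for i >= 2."""
    total, scores = 0.0, []
    for i, rel in enumerate(rels, start=1):
        total += rel if i == 1 else rel / math.log2(i)
        scores.append(round(total, 2))
    return scores

print(dcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]))
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```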
Normalized DCG
- DCG numbers are averaged across a set of queries at specific rank values
  - e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61
- DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking
  - makes averaging easier for queries with different numbers of relevant documents
NDCG Example
- Perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0
- ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
- NDCG values (divide actual by ideal): 1, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
- NDCG ≤ 1 at any rank position
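The normalization step as a self-contained sketch (the dcg helper is redefined from the previous sketch; the relevance list is the one from the example):

```python
import math

def dcg(rels):
    """Cumulative DCG at each rank."""
    total, scores = 0.0, []
    for i, rel in enumerate(rels, start=1):
        total += rel if i == 1 else rel / math.log2(i)
        scores.append(total)
    return scores

rels = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
actual = dcg(rels)
ideal = dcg(sorted(rels, reverse=True))  # perfect ranking: 3,3,3,2,2,2,1,0,0,0
ndcg = [round(a / b, 2) for a, b in zip(actual, ideal)]
print(ndcg)  # [1.0, 0.83, 0.87, 0.78, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88]
```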
Retrieval Model Overview
- Older models
  - Boolean retrieval
  - Vector Space model
- Probabilistic Models
  - BM25
  - Language models
- Combining evidence
  - Inference networks
  - Learning to Rank
Assignment 5
- Read chapter 7: http://www.search-engines-book.com/slides/
- Cosine Similarity: http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html
- Creating a Word Cloud: http://worditout.com/ or http://www.wordle.net/
- Word Proximity: http://nlp.stanford.edu/IR-book/html/htmledition/query-term-proximity-1.html
Vector Space Model
- 3-d pictures useful, but can be misleading for high-dimensional space
Vector Space Model
- Documents ranked by distance between points representing query and documents
- Similarity measure more common than a distance or dissimilarity measure
  - e.g., cosine correlation
Similarity Calculation
- Consider two documents D1, D2 and a query Q
  - D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0)
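A quick sketch of the cosine correlation for these vectors (the helper name is mine; the formula is the standard dot product divided by the product of the vector lengths):

```python
import math

def cosine(x, y):
    """Cosine correlation between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

D1, D2, Q = (0.5, 0.8, 0.3), (0.9, 0.4, 0.2), (1.5, 1.0, 0)
print(round(cosine(D1, Q), 2))  # 0.87
print(round(cosine(D2, Q), 2))  # 0.97 -> D2 ranks above D1
```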
Term Weights
- tf.idf weight
- Term frequency weight measures importance in the document:
  tf_{ik} = f_{ik}, the number of occurrences of term k in document D_i (sometimes normalized by document length)
- Inverse document frequency measures importance in the collection:
  idf_k = log(N / n_k), where N is the number of documents in the collection and n_k is the number of documents containing term k
- Some heuristic modifications
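A minimal sketch of the combined weight under these definitions (the function name and the example counts are hypothetical):

```python
import math

def tfidf(f_ik, N, n_k):
    """tf.idf weight: raw term frequency times log(N / n_k)."""
    return f_ik * math.log(N / n_k)

# hypothetical numbers: term occurs 5 times in the document,
# and appears in 100 of 500,000 documents
print(tfidf(5, 500_000, 100))  # 5 * log(5000) = ~42.6
```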
Language Model
- Unigram language model
  - probability distribution over the words in a language
  - generation of text consists of pulling words out of a “bucket” according to the probability distribution and replacing them
- N-gram language model
  - some applications use bigram and trigram language models, where probabilities depend on previous words
LMs for Retrieval
- 3 possibilities:
  - probability of generating the query text from a document language model
  - probability of generating the document text from a query language model
  - comparing the language models representing the query and document topics
- All are models of topical relevance
Query-Likelihood Model
- Rank documents by the probability that the query could be generated by the document model (i.e., same topic)
- Given a query, start with P(D|Q)
- Using Bayes’ Rule: P(D|Q) ∝ P(Q|D) P(D)
- Assuming the prior P(D) is uniform, and a unigram model:
  P(Q|D) = Π_{i=1}^{n} P(q_i|D)
Estimating Probabilities
- Obvious estimate for unigram probabilities is the maximum likelihood estimate:
  P(q_i|D) = f_{q_i,D} / |D|
  - makes the observed value of f_{q_i,D} most likely
- If query words are missing from the document, the score will be zero
  - missing 1 out of 4 query words is the same as missing 3 out of 4
Smoothing
- Document texts are a sample from the language model
  - missing words should not have zero probability of occurring
- Smoothing is a technique for estimating probabilities for missing (or unseen) words
  - lower (or discount) the probability estimates for words that are seen in the document text
  - assign that “left-over” probability to the estimates for the words that are not seen in the text
Estimating Probabilities
- Estimate for unseen words is α_D P(q_i|C)
  - P(q_i|C) is the probability for query word i in the collection language model for collection C (background probability)
  - α_D is a parameter
- Estimate for words that occur is (1 − α_D) P(q_i|D) + α_D P(q_i|C)
- Different forms of estimation come from different choices of α_D
Jelinek-Mercer Smoothing
- α_D is a constant, λ
- Gives estimate of:
  p(q_i|D) = (1 − λ) f_{q_i,D} / |D| + λ P(q_i|C)
- Ranking score:
  log P(Q|D) = Σ_{i=1}^{n} log p(q_i|D)
- Use logs for convenience
  - multiplying many small numbers causes accuracy problems
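A sketch of this ranking score under the definitions above (the function and argument names are mine, and λ = 0.5 is an arbitrary choice; the counts below are hypothetical):

```python
import math

def jm_score(query, doc_tf, doc_len, coll_tf, coll_len, lam=0.5):
    """log P(Q|D) with Jelinek-Mercer smoothing:
    p(q_i|D) = (1 - lam) * f_{q_i,D}/|D| + lam * c_{q_i}/|C|."""
    score = 0.0
    for q in query:
        p = ((1 - lam) * doc_tf.get(q, 0) / doc_len
             + lam * coll_tf.get(q, 0) / coll_len)
        score += math.log(p)
    return score

# a query word missing from the document no longer zeroes the score:
print(jm_score(["president", "lincoln"], {"lincoln": 2}, 100,
               {"president": 160_000, "lincoln": 2_400}, 10**9))  # ~ -14.04
```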
Where is the tf.idf Weight?
- Rewriting the Jelinek-Mercer ranking score shows that each matching query term contributes a weight proportional to the term frequency and inversely proportional to the collection frequency
Dirichlet Smoothing
- α_D depends on document length:
  α_D = μ / (|D| + μ)
- Gives probability estimation of:
  p(q_i|D) = (f_{q_i,D} + μ P(q_i|C)) / (|D| + μ)
- and document score:
  log P(Q|D) = Σ_{i=1}^{n} log p(q_i|D)
Query Likelihood Example
- For the term “president”: f_{q_i,D} = 15, c_{q_i} = 160,000
- For the term “lincoln”: f_{q_i,D} = 25, c_{q_i} = 2,400
- Number of word occurrences in the document |D| is assumed to be 1,800
- Number of word occurrences in the collection is 10^9
  - 500,000 documents times an average of 2,000 words
- μ = 2,000
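Plugging the slide’s numbers into the Dirichlet-smoothed score, as a sketch (function and argument names are mine; the formula is the one from the previous slide):

```python
import math

def dirichlet_score(query, doc_tf, doc_len, coll_tf, coll_len, mu=2000):
    """log P(Q|D) with Dirichlet smoothing:
    p(q_i|D) = (f_{q_i,D} + mu * c_{q_i}/|C|) / (|D| + mu)."""
    score = 0.0
    for q in query:
        p = (doc_tf.get(q, 0) + mu * coll_tf.get(q, 0) / coll_len) / (doc_len + mu)
        score += math.log(p)
    return score

doc_tf = {"president": 15, "lincoln": 25}
coll_tf = {"president": 160_000, "lincoln": 2_400}
print(dirichlet_score(["president", "lincoln"], doc_tf, 1800,
                      coll_tf, 1_000_000_000))  # ~ -10.54
```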
Query Likelihood Example
- The score is a negative number because it is a sum of logs of small probabilities