Presentation transcript:

A Novel TF-IDF Weighting Scheme for Effective Ranking. Jiaul H. Paik, Indian Statistical Institute, Kolkata, India. SIGIR 2013. Presenter: SHIH, KAI-WUN

Outline: 1. Introduction 2. State of the Art 3. Proposed Work 4. Experimental Setup 5. Results 6. Conclusion

Introduction (1/4). Almost all retrieval models integrate three major variables to determine the degree of importance of a term for a document: (i) the within-document term frequency, (ii) the document length, and (iii) the specificity of the term in the collection. Based on their term weight estimation principles, retrieval models can be broadly classified into two major families: 1. the vector space model and 2. the probabilistic models.

Introduction (2/4). Most of the existing models (possibly all) employ a single term frequency normalization mechanism that does not take into account the various aspects of a term's saliency in a document. Another major limitation of the present models is that they do not balance the preference between short and long documents well.

Introduction (3/4). This article proposes a term weighting scheme that addresses these weaknesses in an effective way.
- It introduces a two-aspect term frequency normalization scheme that combines relative TF weighting with TF normalization based on document length.
- It uses the query length information to emphasize the appropriate component.
- It modifies the usual term discrimination measure (IDF) by integrating the mean term frequency of a term over the set of documents that contain it.
- Finally, it uses an asymptotically bounded function to transform the TF factors, which better handles the term coverage issue in documents and also makes the two TF factors easier to combine.

Introduction (4/4). The experimental results show that the proposed weighting function consistently and significantly outperforms five state-of-the-art retrieval models, measured in terms of standard metrics such as MAP and NDCG. The experimental outcomes also attest that the model achieves significantly better precision than all the other models when measured with a recently popularized metric, the expected reciprocal rank (ERR).

State of the Art (1/2). Three widely used families of models in IR are the vector space model, the probabilistic models, and the inference network based model. In the vector space model, queries and documents are represented as vectors of terms. To compute a score between a document and a query, the model measures the similarity between the query and document vectors using the cosine function. Three main factors come into play when computing the weight of a term: (i) the frequency of the term in the document, (ii) the document frequency of the term in the collection, and (iii) the length of the document that contains the term.

State of the Art (2/2). The key part of the probabilistic models is estimating the probability of relevance of the documents for a query. The probabilistic language modeling technique is another effective ranking approach that is widely used today. However, smoothing is crucial, and it has a very similar effect to the parameter that controls the length normalization component in pivoted normalization. More recently, another probabilistic model was proposed in [3]; it computes the weight of a term by measuring the informative content of the term, i.e., the divergence of its term frequency distribution from a distribution based on a random process.

Proposed Work - Preliminaries (1/3). Before we give the main motivation behind our work, let us first revisit the three key hypotheses. 1. Term Frequency Hypothesis (TFH): the weight of a term in a document should increase as its term frequency (TF) increases. However, it seems unlikely that the importance of a term grows linearly with TF. Therefore, researchers have used a dampened TF instead of the raw TF for ranking.

Proposed Work - Preliminaries (2/3). Advanced TF Hypothesis (AD-TFH): this modified term frequency hypothesis captures the intuition that the rate of change of a term's weight should decrease as TF grows larger. Formally, we hypothesize that a function Ft(TF), which maps the original TF to the resultant value used for the final weighting, should possess the following two properties: (a) Ft'(TF) > 0 and (b) Ft''(TF) < 0.
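For instance, the common dampening Ft(TF) = log2(1 + TF) satisfies both properties (positive first derivative, negative second derivative); it is given here only as an illustration of AD-TFH, not as the function adopted in the paper.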

Proposed Work - Preliminaries (3/3). 2. Document Length Hypothesis (DLH): this hypothesis captures the relationship between term frequency and document length. Long documents tend to use a term repeatedly, so term frequency can be higher in a long document. Thus, it is necessary to regulate the TF value in accordance with the document length. 3. Term Discrimination Hypothesis (TDH): if a query contains more than one term, then a good weighting scheme should prefer a document that contains the rarer term (in the collection).

Proposed Work - Two Aspects of TF (1/4). We consider the following two aspects. 1. Relative Intra-document TF (RITF): here the importance of a term is measured by considering its frequency relative to the average TF of the document. Thus, a natural formulation could be the ratio given below as Equation 2, where TF(t,D) and Avg.TF(D) denote the frequency of the term t in D and the average term frequency of D, respectively.
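Equation 2 is presumably the plain ratio RITF(t,D) = TF(t,D) / Avg.TF(D), with Avg.TF(D) taken as len(D) divided by the number of distinct terms in D; this is a reconstruction from the description above, not a verbatim copy of the slide.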

Proposed Work - Two Aspects of TF (2/4). However, Equation 2 may favor excessively long documents too strongly, since the denominator is close to 1 for a very long document. Hence, we use the function given below as Equation 3 instead.
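A plausible reconstruction of Equation 3, consistent with the later remark that the denominator of RITF(t,D) stays close to 1 for long documents, is RITF(t,D) = log2(1 + TF(t,D)) / log2(1 + Avg.TF(D)), which dampens both numerator and denominator so that the ratio grows much more slowly with TF.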

Proposed Work - Two Aspects of TF (3/4). 2. Length Regularized TF (LRTF): we assume that the appropriate length of a document is the average document length of the collection, and that the frequency of the terms of an average-length document should remain unchanged. Thus, a reasonable starting point could be the scaling given below, where ADL(C) is the average document length of the collection and len(D) is the length of the document D.
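The starting-point formula is presumably LRTF(t,D) = TF(t,D) * ADL(C) / len(D), which leaves the TF of an average-length document (len(D) = ADL(C)) unchanged and scales down the TF of longer documents linearly; again, this is reconstructed from the description, not copied from the slide.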

Proposed Work - Two Aspects of TF (4/4). But once again, it seems unlikely that the increase in term frequency follows a linear relationship with the document length, so the above formula over-penalizes long documents. To overcome this bias, we employ the function given below as Equation 4 to achieve the length-dependent normalization. Equation 4 still punishes long documents, but with a diminishing effect.
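A reconstruction of Equation 4 that is consistent with the later remark that LRTF(t,D) → 0 as len(D) → ∞ is LRTF(t,D) = TF(t,D) * log2(1 + ADL(C) / len(D)), which still decreases with document length, but much more slowly than the linear ratio above.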

Proposed Work - Motivation (1/3). In order to motivate the use of two TF factors, let us consider the following two somewhat toy examples. Example 1: let D1 and D2 be two documents of equal length, with the following statistics.
1. len(D1) = 20, number of distinct terms of D1 = 5, TF(t,D1) = 4.
2. len(D2) = 20, number of distinct terms of D2 = 15, TF(t,D2) = 4.
In both cases, LRTF considers t equally important. A little thought will convince us that this is not appropriate: the focus of document D1 seems to be divided equally among only 5 terms, so t should not be considered salient, while t does seem important for D2. Thus, in the latter case, RITF seems to be a better choice than LRTF.
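As a quick numeric check of Example 1, the following Python snippet uses the RITF and LRTF forms reconstructed above (so it is only a sketch under those assumptions); it shows that RITF separates D1 from D2 while LRTF cannot.

    from math import log2

    def ritf(tf, avg_tf):
        # relative intra-document TF: log-dampened TF relative to the document's average TF
        return log2(1 + tf) / log2(1 + avg_tf)

    def lrtf(tf, doc_len, adl):
        # length regularized TF: TF scaled by a dampened average-length / length ratio
        return tf * log2(1 + adl / doc_len)

    ADL = 20  # assume the collection's average document length is 20, purely for illustration
    # D1: 20 tokens over 5 distinct terms  -> Avg.TF = 4.0
    # D2: 20 tokens over 15 distinct terms -> Avg.TF ~ 1.33
    print(ritf(4, 20 / 5), ritf(4, 20 / 15))   # ~1.00 vs ~1.90: RITF distinguishes the two documents
    print(lrtf(4, 20, ADL), lrtf(4, 20, ADL))  # 4.0 vs 4.0: LRTF treats them identically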

Proposed Work - Motivation (2/3). Let us now turn to the other direction and consider the second example. Example 2: let D1 and D2 be two documents with the following statistics.
1. len(D1) = 20, number of distinct terms of D1 = 15, TF(t,D1) = 4.
2. len(D2) = 200, number of distinct terms of D2 = 150, TF(t,D2) = 4.
In this instance, however, RITF considers the term t equally important for both D1 and D2, which is not right, since D2 contains more distinct terms and thus seems to cover many other topics (and possibly also uses t repeatedly). Therefore, in this case, LRTF seems preferable to RITF.

Proposed Work - Motivation (3/3). Motivated by the above examples, our main goal now is to integrate these two factors into our weighting scheme. However, we do not use the TF factors exactly as defined in Equations 3 and 4. We transform these TF values for the final use in a way that, in some sense, exploits the hypothesis AD-TFH.

Proposed Work - Transforming TF Factors. We transform the TF factors using a function f(x) that possesses the following properties: (i) it vanishes at 0, (ii) it satisfies the two conditions of the AD-TFH hypothesis (f'(x) > 0 and f''(x) < 0), and (iii) it is asymptotically upper bounded by 1. One of the simplest functions that satisfies these properties is f(x) = x/(1+x). Using this function, we transform the two TF factors as shown below in Equations 5 and 6.
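With f(x) = x/(1+x), Equations 5 and 6 are presumably BRITF(t,D) = RITF(t,D) / (1 + RITF(t,D)) and BLRTF(t,D) = LRTF(t,D) / (1 + LRTF(t,D)), i.e., the two TF factors bounded into (0, 1).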

Proposed Work - Combining Two TF Factors (1/5). Now the key question we face is: how should we combine BRITF(t,D) and BLRTF(t,D)? A natural way is the weighted combination given below as Equation 7, with weight w, where 0 < w < 1. The next important issues that arise out of Equation 7 are the following. 1. Should we prefer BRITF(t,D) (w > 0.5)? 2. Should we prefer BLRTF(t,D) (w < 0.5)?
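From the constraint 0 < w < 1 and the two questions above, Equation 7 is presumably the linear interpolation TFF(t,D) = w * BRITF(t,D) + (1 - w) * BLRTF(t,D).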

Proposed Work - Combining Two TF Factors (2/5). In order to answer these questions, we now analyze the properties of the two TF components. From Equation 5, it is clear that BRITF(t,D) has a tendency to prefer long documents, since for long documents the denominator part of RITF(t,D) is close to 1 while the TF is usually larger. On the other hand, BLRTF(t,D) tends to prefer short documents, since LRTF(t,D) → 0 as len(D) → ∞. Therefore, when a query is long, BRITF(t,D) heavily prefers extremely long documents, since the number of matches is more or less proportional to the length of the document.

Proposed Work - Combining Two TF Factors (3/5). On the contrary, since BLRTF(t,D) prefers short documents, it can penalize extremely long documents when it faces longer queries, and thus it is preferable when longer queries are encountered. Another interesting property of BRITF(t,D) is that it emphasizes the number of matches, since its main component, RITF(t,D), heavily dampens the term frequency; this makes it important for short queries. Hence, the foregoing discussion suggests that for short queries BRITF(t,D) should be preferred, while for longer queries BLRTF(t,D) should be given more weight.

Proposed Work - Combining Two TF Factors (4/5). The value of w should decrease as the query length increases, while remaining within [0, 1]. Specifically, we characterize the query length factor QLF(Q) by the following conditions: (i) QLF(Q) = 1 for |Q| = 1, (ii) QLF'(Q) < 0, i.e., QLF decreases with query length, and (iii) 0 < QLF(Q) ≤ 1.

Proposed Work - Combining Two TF Factors (5/5). Numerous different functions can be constructed that satisfy the above conditions; we used three different functions (illustrations are given below). The first function descends more rapidly than the second, while the second descends more rapidly than the third.
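The three functions are not reproduced here. Purely as hypothetical illustrations, and not necessarily the paper's choices, functions satisfying the stated conditions include QLF(Q) = 1/|Q|, QLF(Q) = 1/sqrt(|Q|), and QLF(Q) = 2/(1 + log2(1 + |Q|)); in this order, each descends more slowly than the previous one as |Q| grows.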

Proposed Work - Term Discrimination Factor (1/2). Inverse document frequency (IDF) is a well-known measure that serves the purpose of preferring rare terms, and we use the standard IDF measure given below. We hypothesize that term discrimination is a combination of two factors: IDF and the average elite set term frequency. In particular, we hypothesize that if two terms have equal document frequencies, then term discrimination should increase with increasing average elite set term frequency. The average elite set term frequency (AEF) is defined below, where CTF(t,C) denotes the total number of occurrences of the term t in the entire collection.
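The two formulas on this slide are presumably a standard IDF such as IDF(t) = log((N + 1) / DF(t)), with N the number of documents and DF(t) the document frequency of t (the exact variant is an assumption), and AEF(t) = CTF(t,C) / DF(t), i.e., the average frequency of t over the documents that contain it (its elite set).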

Proposed Work - Term Discrimination Factor (2/2). However, combining the raw AEF with IDF may disturb the overall term discrimination value, since the IDF values are already dampened through a log function. Hence, we employ a slowly increasing function to transform the AEF values for this combination. Once again, we use the function f(x) = x/(1+x) to transform the AEF values for the final use. The final term discrimination value of a term t is then computed as shown below.
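Given the described combination, the final value is presumably TDF(t) = IDF(t) * AEF(t) / (1 + AEF(t)), i.e., the IDF scaled by the bounded AEF.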

Proposed Work - Final Formula. Integrating the above factors, we now obtain the final scoring formula, Equation 13. Again, since TFF(qi,D) < 1, we obtain an upper bound on the score, and we can therefore easily modify Equation 13 to get normalized similarity scores with 0 < Sim(Q,D) < 1, as reconstructed below.
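Equations 13 and 14 can plausibly be reconstructed from the components defined above as Sim(Q,D) = sum over qi in Q of TFF(qi,D) * TDF(qi); since each TFF(qi,D) < 1, this score is bounded above by the sum of the TDF(qi) values, and dividing by that bound yields a normalized score in (0, 1). The Python sketch below assembles the reconstructed pieces end to end; the query length factor and the exact IDF variant are assumptions, so treat it as an illustration of the scheme's structure rather than a faithful reimplementation of MATF.

    from math import log, log2

    def f(x):
        # bounding transform f(x) = x / (1 + x): vanishes at 0, asymptotically bounded by 1
        return x / (1.0 + x)

    def matf_like_score(query_terms, doc, collection):
        # doc: {'tf': {term: count}, 'len': int}
        # collection: {'N': num_docs, 'adl': avg_doc_len, 'df': {term: doc_freq}, 'ctf': {term: coll_freq}}
        avg_tf = doc['len'] / max(len(doc['tf']), 1)        # average term frequency of the document
        w = 2.0 / (1.0 + log2(1.0 + len(query_terms)))      # assumed query length factor (QLF)
        score, tdf_sum = 0.0, 0.0
        for t in query_terms:
            tf = doc['tf'].get(t, 0)
            df = collection['df'].get(t, 0)
            if tf == 0 or df == 0:
                continue
            ritf = log2(1 + tf) / log2(1 + avg_tf)                 # reconstructed Equation 3
            lrtf = tf * log2(1 + collection['adl'] / doc['len'])   # reconstructed Equation 4
            tff = w * f(ritf) + (1 - w) * f(lrtf)                  # two-aspect TF factor (Equation 7)
            idf = log((collection['N'] + 1.0) / df)                # an assumed standard IDF variant
            aef = collection['ctf'][t] / df                        # average elite set term frequency
            tdf = idf * f(aef)                                     # term discrimination factor
            score += tff * tdf
            tdf_sum += tdf
        return score / tdf_sum if tdf_sum > 0 else 0.0             # normalized Sim(Q,D) in (0, 1)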

Experimental Setup - Data. Table 1 summarizes the statistics of the test collections used in our experiments. The experiments are conducted on a large number of standard test collections that vary in type, document collection size, and number of queries.

Experimental Setup - Evaluation Measures and IR System (1/2). All our experiments are carried out using the TERRIER retrieval system (version 3.5). We use the title field of the topics (note that the two Million Query collections contain more than 1000 queries with more than 5 terms). Stopwords are removed from all collections during indexing, and documents and queries are stemmed using the Porter stemmer.

Experimental Setup - Evaluation Measures and IR System (2/2). We use the following metrics to evaluate the systems.
1. Mean Average Precision (MAP).
2. Normalized DCG at k (NDCG@k).
3. Expected Reciprocal Rank at k (ERR@k), computed by the formula given below, where R(g) maps a relevance grade g to a probability of relevance, hg is the highest grade, and g1, g2, ..., gk are the relevance grades associated with the top k documents.
The first two metrics reflect the overall performance of the systems, while the last one better reflects the precision of the search results.
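The standard ERR definition (Chapelle et al.), which is presumably what the slide showed, is ERR@k = sum over r = 1..k of (1/r) * R(g_r) * product over i < r of (1 - R(g_i)), with R(g) = (2^g - 1) / 2^hg.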

Experimental Setup - Baselines (1/3). The details of the baselines are given below.
1. Pivoted length normalized TF-IDF model: this model is one of the best performing TF-IDF formulas in the vector space model framework; the parameter s is set to a fixed value.
2. Lemur TF-IDF model: this model is another TF-IDF model that uses Robertson's TF and the standard IDF; its parameter is set to its default value.

Experimental Setup - Baselines (2/3).
3. Classical probabilistic model (BM25): the main differences between this model and the previous ones are that BM25 uses the query term frequency in a different way and its IDF also differs from the standard one. The parameters of this model are set to k1 = 1.2 and b = 0.6, with k3 set to a fixed value.
4. Dirichlet-smoothed language model (LM): the language model is another probabilistic model that performs very effectively; for this model we set the Dirichlet smoothing parameter μ to a fixed value.

Experimental Setup - Baselines (3/3).
5. Divergence from Randomness model (PL2): finally, PL2 represents a recently proposed non-parametric probabilistic model from the divergence from randomness (DFR) family. Similar to the previous models, its performance also depends on a parameter value (c in the formula); we conduct experiments for this model by setting c = 13.

Results - Comparison with TF-IDF Models (1/2). In this section we compare the performance of the proposed model (MATF) with the Lemur TF-IDF and Pivoted TF-IDF models.

Results - Comparison with TF-IDF Models (2/2). The performance of MATF and the two TF-IDF models on the two Million Query collections, measured by statistical average precision, is shown in Table 3.

Results - Comparison with Probabilistic Models (1/2). In the last section we compared the performance of our model with two TF-IDF models. In this section we compare the performance of MATF with three well-known state-of-the-art probabilistic retrieval models.

Results - Comparison with Probabilistic Models (2/2). Table 5 compares the performance of the four models on the Million Query collections, measured in terms of statistical average precision.

Results - Analysis. Table 6 presents the experimental results on the two Million Query collections. The results seem to confirm our earlier assumption: LRTF performs better than RITF for longer queries on both collections, while RITF does better for short queries.

Conclusion. In this paper, we presented a novel TF-IDF term weighting scheme. Experiments carried out on a set of news and web collections show that the proposed model outperforms two well-known state-of-the-art TF-IDF baselines by a significantly large margin when measured in terms of MAP and NDCG. Moreover, the proposed model is also significantly better than all five baselines at improving precision.