1 Learning to Estimate Query Difficulty, Including Applications to Missing Content Detection and Distributed Information Retrieval
Elad Yom-Tov, Shai Fine, David Carmel, Adam Darlow
IBM Haifa Research Labs, SIGIR 2005
2 Abstract
- Novel learning methods are used for estimating the quality of results returned by a search engine in response to a query.
- Estimation is based on the agreement between the top results of the full query and the top results of its sub-queries.
- Quality estimates are useful for several applications, including improving retrieval, detecting queries for which no relevant content exists in the document collection, and distributed information retrieval.
3 Introduction (1/2)
- Many IR systems suffer from radical variance in performance across queries.
- Estimating query difficulty is an attempt to quantify the quality of the results returned by a given system for a query.
- Reasons for estimating query difficulty:
  - Feedback to the user: the user can rephrase "difficult" queries.
  - Feedback to the search engine: invoke alternative retrieval strategies for different queries.
  - Feedback to the system administrator: identify queries related to a specific subject and expand the document collection accordingly.
  - Distributed information retrieval.
4 Introduction (2/2)
- Observation and motivation: queries that are answered well are those whose query terms agree on most of the returned documents.
- Agreement is measured by the overlap between the top results.
- Difficult queries are those where:
  a. the query terms cannot agree on the top results, or
  b. most of the terms agree except for a few outliers.
- A TREC query as an example: "What impact has the chunnel (the Channel Tunnel) had on the British economy and/or the life style of the British"
5 Related Work (1/2)
- In the Robust track of TREC 2004, systems were asked to rank the topics by predicted difficulty.
- The goal is eventually to use such predictions for topic-specific processing.
- Prediction methods suggested by the participants:
  - Measuring clarity based on the system's score of the top results
  - Analyzing the ambiguity of the query terms
  - Learning a predictor using old TREC topics as training data
- (Ounis, 2004) showed that an IDF-based predictor is positively related to query precision.
- (Diaz, 2004) used the temporal distribution together with the content of the documents to improve the prediction of AP for a query.
6 Related Work (2/2)
- The Reliable Information Access (RIA) workshop investigated the reasons for system performance variance across queries.
- Ten failure categories were identified, four of which are due to emphasizing only partial aspects of the query.
- One of the conclusions of this workshop: "…comparing a full topic ranking against ranking based on only one aspect of the topic will give a measure of the importance of that aspect to the retrieved set."
7 Estimating Query Difficulty
- Query terms are defined as the keywords and the lexical affinities.
- Features used for learning:
  - The overlap between each sub-query and the full query, measured by the κ-statistic (see the sketch below)
  - The rounded logarithm of the document frequency, log(DF), of each sub-query
- Two challenges for learning:
  - The number of sub-queries is not constant, so a canonical representation is needed.
  - The sub-queries are not ordered.
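The slides do not spell out how the κ-statistic is computed; the following is a minimal sketch under one plausible reading: treat "appears in the top N results" as a binary judgment over the whole collection and compute Cohen's kappa between the full query's and the sub-query's judgments. The function name and arguments are illustrative.

```python
def kappa_overlap(top_full, top_sub, collection_size):
    """Chance-corrected agreement (Cohen's kappa) between two top-N result
    lists, treating 'retrieved in the top N' as a binary label over the
    whole collection (an assumed formulation, not taken from the slides)."""
    a, b = set(top_full), set(top_sub)
    n = float(collection_size)
    both = len(a & b)                 # documents in both top lists
    neither = n - len(a | b)          # documents in neither top list
    observed = (both + neither) / n   # raw agreement
    p_a, p_b = len(a) / n, len(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    return (observed - expected) / (1 - expected) if expected < 1 else 0.0
```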
8 Query Estimator Using a Histogram (1/2)
- The basic procedure:
  1. Find the top N results for the full query and for each sub-query.
  2. Build a histogram of the overlaps, h(i, j), to form a feature vector:
     - Values of log(DF) are split into three discrete values: {0-1, 2-3, 4+}.
     - h(i, j) counts the sub-queries with log(DF) = i and overlap = j.
     - The rows of h(i, j) are concatenated into a feature vector.
  3. Compute the linear weight vector c for prediction.
- Example: suppose a query has four sub-queries with log(DF(n)) = [0 1 1 2] and overlap = [2 0 0 1]; then h(i) = [0 0 1 2 0 0 0 1 0].
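A minimal sketch of the histogram construction. Note that the slide's worked example indexes the rows directly by the rounded log(DF) values 0, 1, 2 rather than by the {0-1, 2-3, 4+} bins, so the discretization below (clipping into three rows) is an assumption chosen to reproduce that example; function and argument names are illustrative.

```python
import numpy as np

def overlap_histogram(log_df, overlap, n_df_bins=3, n_overlap_bins=3):
    """Flattened h(i, j): count sub-queries whose discretized log(DF) is i
    and whose overlap with the full query's top results is j, then
    concatenate the rows into one feature vector."""
    h = np.zeros((n_df_bins, n_overlap_bins), dtype=int)
    for d, o in zip(log_df, overlap):
        i = min(int(d), n_df_bins - 1)        # clip large log(DF) into the last row
        j = min(int(o), n_overlap_bins - 1)   # clip large overlaps into the last column
        h[i, j] += 1
    return h.flatten()

# Reproduces the slide's example: [0 0 1 2 0 0 0 1 0]
print(overlap_histogram([0, 1, 1, 2], [2, 0, 0, 1]))
```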
9 Query Estimator Using a Histogram (2/2)
- Two additional features:
  1. The score of the top-ranked document
  2. The number of words in the query
- Estimate the linear weight vector c with the Moore-Penrose pseudo-inverse:
  c = (H · H^T)^{-1} · H · t^T
  where H is the matrix whose columns are the feature vectors of the training queries, and t is the vector of the target measure (P@10 or MAP) of the training queries. (H and t can be modified according to the objective.)
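In NumPy, the estimation step above can be sketched as follows; the function names and the use of np.linalg.pinv (rather than a plain inverse) are implementation choices, not something the slides prescribe.

```python
import numpy as np

def fit_difficulty_weights(H, t):
    """c = (H H^T)^{-1} H t^T, with H of shape (d features, m training queries)
    and t the per-query target measure (P@10 or MAP), shape (m,)."""
    return np.linalg.pinv(H @ H.T) @ H @ t   # pinv guards against ill-conditioning

def predict_difficulty(c, features):
    """Predicted quality of the results for a new query's feature vector."""
    return float(c @ features)
```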
10 Query Estimator Using a Modified Decision Tree (1/2)
- Useful when the data are sparse, i.e. the queries are too short.
- A binary decision tree:
  - Pairs of overlap and log(DF) of the sub-queries form the features.
  - Each node consists of a weight vector, a threshold, and a score.
- An example tree appears in the slide's figure; a sketch of the node structure follows below.
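The slides describe the node contents but not the traversal rule, so the following is only a hedged reading: each node projects a sub-query's (overlap, log(DF)) pair onto its weight vector, compares the result to the threshold to choose a child, and the score of the last node reached is the prediction. All names and the traversal logic are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, List, Tuple

@dataclass
class TreeNode:
    weights: Tuple[float, float]     # applied to one (overlap, log DF) pair
    threshold: float
    score: float                     # predicted difficulty if traversal stops here
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def tree_predict(root: TreeNode, pairs: List[Tuple[float, float]]) -> float:
    """Feed the sub-query pairs to the tree one at a time; each node projects
    the pair onto its weight vector and compares it to the threshold to pick
    the next child (an assumed traversal, not taken verbatim from the paper)."""
    node = root
    for overlap, log_df in pairs:
        proj = node.weights[0] * overlap + node.weights[1] * log_df
        child = node.left if proj <= node.threshold else node.right
        if child is None:
            break
        node = child
    return node.score
```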
11 Query Estimator Using a Modified Decision Tree (2/2)
- The concept of a Random Forest: better decision trees can be obtained by training a multitude of trees, each in a slightly different manner or on different data.
- The AdaBoost algorithm is applied to resample the training data (a sketch follows below).
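As a stand-in for the paper's modified tree plus AdaBoost resampling, a standard boosted regressor over shallow decision trees conveys the same idea; scikit-learn is an assumed dependency and the hyper-parameters are illustrative.

```python
from sklearn.ensemble import AdaBoostRegressor

def train_boosted_difficulty_estimator(features, targets, n_trees=50):
    """Boosted ensemble of decision trees: each round resamples/reweights the
    training queries the previous trees predicted poorly.  features: one row
    of histogram features per query; targets: P@10 or MAP."""
    model = AdaBoostRegressor(n_estimators=n_trees, random_state=0)
    return model.fit(features, targets)   # default base learner is a shallow tree
```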
12 Experiment and Evaluation (1/2)
- The IR system is Juru.
- Two document collections:
  - TREC-8: 528,155 documents, 200 topics
  - WT10G: 1,692,096 documents, 100 topics
- Four-fold cross-validation
- Prediction quality is measured by Kendall's τ coefficient (see the usage sketch below).
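Kendall's τ compares the topic ranking induced by the predicted difficulty with the ranking induced by the actual precision; a usage sketch with SciPy (variable names are placeholders):

```python
from scipy.stats import kendalltau

def evaluate_predictor(predicted_difficulty, actual_precision):
    """Rank correlation between predicted and actual per-topic quality;
    tau closer to 1 means the predictor orders the topics more accurately."""
    tau, p_value = kendalltau(predicted_difficulty, actual_precision)
    return tau
```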
13 Experiment and Evaluation (2/2)
- Compared with several other algorithms:
  - Estimation based on the score of the top result
  - Estimation based on the average score of the top ten results
  - Estimation based on the standard deviation of the IDF values of the query terms
  - Estimation based on learning an SVM for regression
14 Application 1: Improving IR Using Query Estimation (1/2)
- Selective automatic query expansion (a classifier sketch follows below):
  1. Terms are added to the query based on terms that appear frequently in the top retrieved documents.
  2. Expansion only works for easy queries.
  3. The same features are used to train an SVM classifier that decides when to expand.
- Deciding which part of the topic should be used:
  1. TREC topics contain two parts: a short title and a longer description.
  2. Some topics that are not answered well by the description part are better answered by the title part.
  3. Difficult topics use the title part; easy topics use the description part.
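A minimal sketch of the expansion decision, assuming scikit-learn; the label construction (whether expansion helped a training query) and the RBF kernel are assumptions, not details given on the slide.

```python
from sklearn.svm import SVC

def train_expansion_selector(features, expansion_helped):
    """SVM classifier deciding, per query, whether automatic query expansion
    should be applied.  features: the same histogram features used for
    difficulty estimation; expansion_helped: 1 if expansion improved the
    target measure for that training query, else 0 (assumed labeling)."""
    return SVC(kernel="rbf").fit(features, expansion_helped)
```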
15 Application 1: Improving IR Using Query Estimation (2/2)
16 Application 2: Detecting Missing Content (1/2)
- Missing content queries (MCQs) are those that have no relevant document in the collection.
- Experimental method (a pipeline sketch follows below):
  - 166 MCQs were created artificially from 400 TREC queries.
  - The 400 queries come from 200 TREC topics, each consisting of a title and a description.
  - Ten-fold cross-validation
  - A tree-based classifier is trained to separate MCQs from non-MCQs.
  - A query difficulty estimator may or may not be used as a pre-filter that removes easy queries before the MCQ classifier.
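A sketch of the two-stage setup described above, assuming scikit-learn; the threshold on the difficulty estimate and the tree depth are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

def train_mcq_detector(features, is_mcq, max_depth=5):
    """Tree-based classifier separating MCQs from non-MCQs."""
    return DecisionTreeClassifier(max_depth=max_depth).fit(features, is_mcq)

def detect_mcq(features, estimated_quality, detector, easy_threshold=0.5):
    """Optional pre-filter: queries the difficulty estimator already deems
    easy are assumed to have relevant content and skip the MCQ classifier."""
    if estimated_quality >= easy_threshold:
        return False
    return bool(detector.predict([features])[0])
```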
17 Application 2: Detecting Missing Content (2/2)
18 Application 3: Merging the Results of Distributed Retrieval (1/2)
- It is difficult to rerank documents coming from different datasets because the scores are local to each specific dataset.
- CORI (W. Croft, 1995) is one of the state-of-the-art algorithms for distributed retrieval; it uses an inference network for collection ranking.
- Applying the estimator to this problem (a merging sketch follows below):
  - A query estimator is trained for each dataset.
  - The estimated difficulty is used to weight the scores.
  - The weighted scores are merged to build the final ranking.
- Ten-fold cross-validation
- Only minimal information is supplied by the search engine.
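A minimal sketch of difficulty-weighted merging; the data layout (a dict of per-dataset ranked lists) and the simple multiplicative weighting are assumptions about how the weighting described above could be realized.

```python
def merge_distributed_results(per_dataset_results, difficulty_estimates, k=10):
    """Scale each dataset's local scores by the query's estimated difficulty
    for that dataset, then merge into a single ranking.

    per_dataset_results: {dataset: [(doc_id, local_score), ...]}
    difficulty_estimates: {dataset: estimated quality of this query there}
    """
    merged = []
    for dataset, ranked_list in per_dataset_results.items():
        weight = difficulty_estimates[dataset]
        merged.extend((doc_id, weight * score) for doc_id, score in ranked_list)
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged[:k]
```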
19 Application 3: Merging the Results of Distributed Retrieval (2/2)
- Selective weighting (a clustering sketch follows below):
  - All queries are clustered (2-means) based on their difficulty estimates for each of the datasets.
  - In one cluster the variance of the estimates is small; unweighted scores are better for the queries in that cluster.
  - When there is little variance, the difficulty estimates become noise.
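A sketch of the selective-weighting decision, assuming scikit-learn's KMeans for the 2-means step; the rule for identifying the low-variance cluster is an assumption consistent with the description above.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_weighted_queries(estimates):
    """estimates: array of shape (n_queries, n_datasets) holding each query's
    per-dataset difficulty estimates.  Returns a boolean mask that is True for
    queries whose merged scores should be difficulty-weighted."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(estimates)
    per_query_var = estimates.var(axis=1)
    # the cluster whose estimates barely vary across datasets gains nothing
    # from weighting, so its queries fall back to unweighted merging
    cluster_var = [per_query_var[labels == c].mean() for c in (0, 1)]
    low_var_cluster = int(np.argmin(cluster_var))
    return labels != low_var_cluster
```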
20 Conclusions and Future Work
- Two methods for learning an estimator of query difficulty are described.
- The learned estimator predicts the expected precision of the query by analyzing the overlap between the results of the full query and the results of its sub-queries.
- We show that such an estimator can be used in several applications.
- Our results show that the quality of query prediction strongly depends on the query length.
- One direction for future work is to look for additional features that do not depend on the query length.
- Whether more training data can be accumulated in an automatic or semi-automatic manner is left for future research.