Modeling Score Distributions for Combining the Outputs of Search Engines Reading Notes Wang Ning (wangning@db.pku.edu.cn) Lab of Database and Information Systems Dec 3rd, 2003
Revision History Nov. 30th, 2003: Draft Dec. 1st, 2003: Add all pictures Dec. 2nd, 2003: Add references
Literature Information
  Title: Modeling Score Distributions for Combining the Outputs of Search Engines
  Authors: R. Manmatha (manmatha@cs.umass.edu), T. Rath (trath@cs.umass.edu), F. Feng (feng@cs.umass.edu)
  Institution: Center for Intelligent Information Retrieval, University of Massachusetts
  Conference: Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Basic Idea
  Meta search: combining results from several search engines
  Difficulties
    No architecture or algorithm information about the engines
    No score information
  Previous work
    Linear combination of document ranks
    CombMIN, CombMAX, CombSUM, CombMNZ
  The authors' idea
    Model the score distributions
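The Comb* rules listed under previous work can be sketched as follows. This is a minimal illustration of the standard definitions (Fox and Shaw), not code from the paper. The `Run` type, the `combine` function, and the assumption that scores are already normalized to a common range are all illustrative choices.

```python
from typing import Dict, List

# One engine's output: document id -> normalized score.
# A missing id means the engine did not retrieve that document.
Run = Dict[str, float]


def combine(runs: List[Run], rule: str) -> Dict[str, float]:
    """Fuse several runs with one of the Comb* rules."""
    fused: Dict[str, float] = {}
    all_docs = set().union(*(run.keys() for run in runs))
    for doc in all_docs:
        scores = [run[doc] for run in runs if doc in run]
        if rule == "CombMIN":
            fused[doc] = min(scores)
        elif rule == "CombMAX":
            fused[doc] = max(scores)
        elif rule == "CombSUM":
            fused[doc] = sum(scores)
        elif rule == "CombMNZ":
            # Sum of scores times the number of non-zero scores.
            non_zero = sum(1 for s in scores if s > 0)
            fused[doc] = sum(scores) * non_zero
        else:
            raise ValueError(f"unknown rule: {rule}")
    return fused


runs = [{"a": 0.5, "b": 0.25}, {"a": 0.25}]
fused_mnz = combine(runs, "CombMNZ")
```

CombMNZ rewards documents returned by several engines, which is why document "a" is boosted relative to "b" in this small example.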
Test Data
  TREC: Text REtrieval Conference
    TREC 3, TREC 4
    TREC 6 for Chinese documents
  Search Engines
    INQUERY (Probabilistic Model)
    CITY (Probabilistic Model)
    SMART (Vector Space Model)
    Bellcore (LSI Engine)
Model Assumptions
  The scores of non-relevant documents can be modeled with an exponential distribution
  The scores of relevant documents can be modeled with a Gaussian distribution
  Explanations and arguments come later
Non-relevant Documents: Exponential Distribution
Relevant Documents: Gaussian Distribution
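The formulas on these two slides did not survive in the notes. For reference, the two densities in their standard parameterization are given below. The symbols (score x, rate λ, mean μ, standard deviation σ) are assumed notation and may differ from the paper's.

```latex
% Non-relevant documents: exponential density over the score x
p(x \mid \text{non-relevant}) = \lambda e^{-\lambda x}, \qquad x \ge 0,\ \lambda > 0

% Relevant documents: Gaussian density over the score x
p(x \mid \text{relevant}) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```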
Likelihood Function
MLE: Maximum Likelihood Estimate
Basic Idea of MLE
  Assume that what was observed is what was most likely to happen.
  The MLE of Θ is the value of Θ under which the observed sample is most probable.
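For the exponential and Gaussian families used in these notes, the maximum likelihood estimates have well-known closed forms: the exponential rate is the reciprocal of the sample mean, and the Gaussian parameters are the sample mean and the (biased) sample standard deviation. The sketch below illustrates this; the function names are illustrative and not from the paper.

```python
import math
from typing import List, Tuple


def fit_exponential(scores: List[float]) -> float:
    """MLE of the exponential rate: lambda_hat = n / sum(x)."""
    if not scores:
        raise ValueError("need at least one score")
    return len(scores) / sum(scores)


def fit_gaussian(scores: List[float]) -> Tuple[float, float]:
    """MLE of (mu, sigma); the variance uses 1/n, not 1/(n-1)."""
    if not scores:
        raise ValueError("need at least one score")
    mu = sum(scores) / len(scores)
    var = sum((x - mu) ** 2 for x in scores) / len(scores)
    return mu, math.sqrt(var)


def exp_log_likelihood(scores: List[float], lam: float) -> float:
    """Log-likelihood of the sample under an exponential with rate lam."""
    return sum(math.log(lam) - lam * x for x in scores)


lam_hat = fit_exponential([0.5, 1.5])
mu_hat, sigma_hat = fit_gaussian([1.0, 3.0])
```

Evaluating `exp_log_likelihood` at values slightly above and below `lam_hat` gives a smaller result, which is what "the sample is most probable" means in practice.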
Limitations of Gaussian Fit
  Fits well: sufficient relevant documents (>= 60)
  Fits badly: fewer relevant documents (usually)
  Why?
    Model fault
    Lack of samples (the authors' point)
  Solutions
    Bayesian analysis may work here
Mixture Model Fit
Mixture Model Fit (cont.)
EM: Expectation Maximization
  An important parameter estimation method
EM Steps
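The slide's own formulas are not preserved, so the following is a minimal sketch of the textbook EM updates for a two-component mixture of one Gaussian (relevant) and one exponential (non-relevant). It is not necessarily the authors' exact procedure. The parameter tuple `(pi_rel, mu, sigma, lam)`, the caller-supplied initial values, and the numerical floors are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Params = Tuple[float, float, float, float]  # (pi_rel, mu, sigma, lam)


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def exponential_pdf(x: float, lam: float) -> float:
    return lam * math.exp(-lam * x) if x >= 0.0 else 0.0


def em_fit(scores: List[float], init: Params,
           n_iter: int = 100) -> Tuple[Params, List[float]]:
    """Run EM and return the parameters plus the log-likelihood history."""
    pi_rel, mu, sigma, lam = init
    history: List[float] = []
    for _ in range(n_iter):
        # E-step: responsibility of the Gaussian component for each score.
        resp = []
        log_lik = 0.0
        for x in scores:
            g = pi_rel * gaussian_pdf(x, mu, sigma)
            e = (1.0 - pi_rel) * exponential_pdf(x, lam)
            total = max(g + e, 1e-300)  # floor avoids log(0)
            resp.append(g / total)
            log_lik += math.log(total)
        history.append(log_lik)
        # M-step: weighted maximum likelihood estimates.
        n_rel = sum(resp)
        n_non = len(scores) - n_rel
        if n_rel < 1e-9 or n_non < 1e-9:
            break  # one component has vanished; stop early
        pi_rel = n_rel / len(scores)
        mu = sum(r * x for r, x in zip(resp, scores)) / n_rel
        var = sum(r * (x - mu) ** 2 for r, x in zip(resp, scores)) / n_rel
        sigma = max(math.sqrt(var), 1e-6)  # floor avoids a collapsed Gaussian
        weighted_sum = sum((1.0 - r) * x for r, x in zip(resp, scores))
        lam = n_non / max(weighted_sum, 1e-12)
    return (pi_rel, mu, sigma, lam), history
```

EM only finds a local maximum of the likelihood, so the result depends on `init`. A typical call on synthetic scores would be `em_fit(scores, init=(0.2, 0.7, 0.1, 10.0))`.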
Mixture Model Fit: INQUERY
Mixture Model Fit: SMART
Posterior Probabilities
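Given a fitted mixture, the posterior probability of relevance for a score follows from Bayes' rule: the weighted Gaussian density divided by the sum of the weighted Gaussian and exponential densities. The sketch below assumes the same four parameters as the mixture model (mixing weight `pi_rel`, Gaussian `mu` and `sigma`, exponential rate `lam`); the names are illustrative.

```python
import math


def posterior_relevance(score: float, pi_rel: float, mu: float,
                        sigma: float, lam: float) -> float:
    """P(relevant | score) under a Gaussian + exponential mixture."""
    z = (score - mu) / sigma
    gauss = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    expo = lam * math.exp(-lam * score) if score >= 0.0 else 0.0
    rel = pi_rel * gauss
    non = (1.0 - pi_rel) * expo
    total = rel + non
    # If both densities underflow to zero, fall back to the prior.
    return rel / total if total > 0.0 else pi_rel


# Illustrative parameters only; they are not fitted values from the paper.
p_high = posterior_relevance(0.8, pi_rel=0.1, mu=0.8, sigma=0.03, lam=20.0)
p_low = posterior_relevance(0.05, pi_rel=0.1, mu=0.8, sigma=0.03, lam=20.0)
```

Because the posterior is a probability rather than an engine-specific score, it puts the outputs of different engines on a common scale.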
Posterior Probabilities: SMART
Limitations of Posterior Probabilities
Problem I: Mixture Model
  Model selection: why exponential and Gaussian?
    They fit the data well
    They can be recovered with the EM algorithm
  EM algorithm: limitations and solutions
    Limitation: local maxima
    Solutions
      Arbitrary initial conditions
      Fit the exponential distribution first, then set aside the documents that do not fit it well and use them to fit the Gaussian
Problem II: Shapes of Distributions
Shapes of Poisson Distributions
Applications
  Combining outputs of search engines
    Using posterior probabilities
  Automatic engine selection
    Distinction: larger distance between the mean and the intersection point of the two distributions
    Relevance: higher maximum of the posterior probabilities
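One simple way to combine engines once each score has been mapped to a posterior probability is to average the probabilities per document and rank by the average. The sketch below assumes that averaging rule and treats a document missing from an engine's list as having probability 0 for that engine. Both choices are assumptions for illustration, and the function name is hypothetical.

```python
from typing import Dict, List


def combine_by_posterior(engine_posteriors: List[Dict[str, float]]) -> List[str]:
    """Rank documents by their mean posterior probability over all engines."""
    docs = set().union(*(p.keys() for p in engine_posteriors))
    n = len(engine_posteriors)
    # A document an engine did not return contributes probability 0.0.
    avg = {d: sum(p.get(d, 0.0) for p in engine_posteriors) / n for d in docs}
    # Sort by descending probability; ties are broken by document id.
    return sorted(avg, key=lambda d: (-avg[d], d))


ranking = combine_by_posterior([
    {"a": 0.9, "b": 0.2},
    {"a": 0.5, "b": 0.6, "c": 0.3},
])
```

Here the averages are 0.7 for "a", 0.4 for "b", and 0.15 for "c", so the ranking is a, b, c.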
Comparative Study: Combining
Comparative Study: Selecting
What Can I Learn from this Paper?
  Scientific methodology
    Clear and simple models
    Theoretical reasoning and experimental support
    Natural and simple mathematical methods
    Standard test data and comparative study
Alternative Method
  Bayes Optimal Metasearch: A Probabilistic Model for Combining the Results of Multiple Retrieval Systems
  J. A. Aslam & M. Montague, Dartmouth College
  SIGIR'00
Probabilistic Model
Comparisons
  manmatha01modeling
    Pros: clear and simple models
    Cons: strong model assumptions; some inherent limitations of the EM algorithm
  aslam01Bayes
    Training of prior probabilities
    Naive Bayes independence assumptions
My Thoughts
  Training prior probabilities could give more accurate output models.
  The small sample size limits the use of traditional statistics; Bayesian analysis may help avoid this.
References
  R. Manmatha, T. Rath, and Fangfang Feng. Modeling Score Distributions for Combining the Outputs of Search Engines. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267-275, 2001.
  J. A. Aslam and M. Montague. Bayes Optimal Metasearch: A Probabilistic Model for Combining the Results of Multiple Retrieval Systems. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 379-381, 2000.
  Jiangsheng Yu. Expectation Maximization: An Approach to Parameter Estimation. Lecture of Machine Learning Seminar, 2003.