Improvements to BM25 and Language Models Examined
Andrew Trotman, Antti Puurula, Blake Burgess
Australasian Document Computing Symposium 2014, Melbourne, Australia
Presented by Antti Puurula

Introduction
TREC evaluations of the 1990s established the current ranking functions for ad hoc document retrieval
The mid-1990s introduced BM25 [23], the most successful ranking function to date
Armstrong et al. [1, 2] showed in 2009 that there was no evidence of improvement over a decade, but multiple recent publications claim improvements

Introduction
Has there been any improvement in ranking function precision?
We examine this question by testing several recent BM25 and LM ranking functions
We test each function, then add relevance feedback, stemming, and stopping
Mean Average Precision (MAP) is compared on INEX Wikipedia 2010 and TREC Ad hoc 1-8, with functions optimized on INEX Wikipedia 2009

Ranking functions: ATIRE BM25
Trotman et al. (2012) [27]: ATIRE version of BM25
Retrieval Status Value for query q, with two components:
Robertson-Walker IDF
◦N = number of documents
◦df_t = number of documents in which term t occurs
BM25 term frequency normalization
◦tf_td = count of term t in document d
◦L_d = length (L1-norm) of document d
◦L_avg = average length of documents
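The two components above can be sketched in a few lines of Python. This is a minimal illustration, assuming the commonly stated ATIRE form rsv_q = Σ_t log(N / df_t) · ((k1 + 1) · tf_td) / (k1 · (1 − b + b · L_d / L_avg) + tf_td). The function name, the dictionary-based index, and the default k1 and b values are assumptions made for the example, not values taken from the slides.

```python
import math

def atire_bm25(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=0.9, b=0.4):
    """Score one document for a query (illustrative sketch, not the ATIRE source).

    doc_tf: term -> count in this document (tf_td)
    df:     term -> number of documents containing the term (df_t)
    k1, b:  tuning parameters; the defaults here are illustrative only
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or df.get(t, 0) == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log(n_docs / df[t])                   # Robertson-Walker IDF
        length_norm = k1 * (1.0 - b + b * doc_len / avg_len)
        score += idf * ((k1 + 1.0) * tf) / (length_norm + tf)
    return score
```

With an average-length document, tf_td = 1 and k1 = 1, the term frequency part equals 1 and the score reduces to the IDF alone, which is a convenient sanity check.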

Ranking functions: BM25L
Lv & Zhai (2011) [12]: BM25 corrected for very long documents
= BM25 with smoothed parameter estimates (with 1.0, 0.5, and δ added)
◦Smoothed Robertson-Walker IDF
◦Length-corrected BM25 term frequency normalization
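A per-term sketch shows where the 1.0, 0.5, and δ enter. It assumes the form log((N + 1) / (df_t + 0.5)) · ((k1 + 1) · (c_td + δ)) / (k1 + c_td + δ) with c_td = tf_td / (1 − b + b · L_d / L_avg). The function name and the default parameter values are illustrative assumptions.

```python
import math

def bm25l_term(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75, delta=0.5):
    """BM25L weight of one matched query term (illustrative sketch)."""
    if tf == 0:
        return 0.0  # the delta shift applies only to terms present in the document
    idf = math.log((n_docs + 1.0) / (df + 0.5))       # smoothed IDF: +1.0 and +0.5
    c = tf / (1.0 - b + b * doc_len / avg_len)        # length-normalized tf
    return idf * ((k1 + 1.0) * (c + delta)) / (k1 + c + delta)
```

Because δ is added after length normalization, the weight of a matched term in an extremely long document approaches idf · (k1 + 1) · δ / (k1 + δ) rather than zero.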

Ranking functions: BM25+
Lv & Zhai (2011) [11]: BM25 with lower-bounded term weights
◦Smoothed Robertson-Walker IDF
◦Lower-bounding parameter
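The lower bound can be seen in a per-term sketch. It assumes the form log((N + 1) / df_t) · (((k1 + 1) · tf_td) / (k1 · (1 − b + b · L_d / L_avg) + tf_td) + δ). The function name and the default parameter values are illustrative assumptions.

```python
import math

def bm25plus_term(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75, delta=1.0):
    """BM25+ weight of one matched query term (illustrative sketch)."""
    if tf == 0:
        return 0.0  # only matched terms receive the delta bonus
    idf = math.log((n_docs + 1.0) / df)               # smoothed IDF
    tf_part = ((k1 + 1.0) * tf) / (k1 * (1.0 - b + b * doc_len / avg_len) + tf)
    return idf * (tf_part + delta)                    # delta lower-bounds the weight
```

However long the document, a matched term contributes at least idf · δ, which is the lower-bounding behaviour the slide refers to.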

Ranking functions: BM25-adpt
Lv & Zhai (2011) [10]: BM25 with term-dependent k1, using Information Gain G_q^r
◦Smoothed Robertson-Walker IDF
◦Term-dependent component
k1' is solved offline for each term from the index, using a curve-fitting technique and the least squares method

Ranking functions: BM25T
Lv & Zhai (2012) [13]: BM25 with term-dependent k1, using the log-logistic method
k1' is solved offline for each term from the index, using the Newton-Raphson method
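The offline solve can be sketched as a one-dimensional root search. The sketch assumes the formulation in which k1' satisfies g(k1') = (Σ_d log(c_td + 1)) / df_t over the documents containing t, with g(k1) = k1 / (k1 − 1) · log(k1), g(1) = 1, and c_td = tf_td / (1 − b + b · L_d / L_avg). The function names, the starting point, and the clamping of k1 to stay positive are assumptions made for the example.

```python
import math

def g(k1):
    """Log-logistic target function; its limit at k1 = 1 is 1."""
    if abs(k1 - 1.0) < 1e-8:
        return 1.0
    return k1 / (k1 - 1.0) * math.log(k1)

def g_prime(k1):
    """Derivative of g; its limit at k1 = 1 is 0.5."""
    if abs(k1 - 1.0) < 1e-6:
        return 0.5
    return (k1 - 1.0 - math.log(k1)) / (k1 - 1.0) ** 2

def solve_k1(target, k1=2.0, tol=1e-10, max_iter=100):
    """Newton-Raphson iteration for g(k1) = target, with target > 0."""
    for _ in range(max_iter):
        step = (g(k1) - target) / g_prime(k1)
        k1 = max(k1 - step, 1e-6)  # keep k1 positive so log() stays defined
        if abs(step) < tol:
            break
    return k1

def bm25t_k1(postings, avg_len, b=0.75):
    """postings: (tf_td, L_d) pairs for every document containing the term."""
    logs = [math.log(tf / (1.0 - b + b * dl / avg_len) + 1.0) for tf, dl in postings]
    return solve_k1(sum(logs) / len(logs))
```

Since the statistic depends only on the index, the resulting k1' can be stored per term and reused at query time.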

Ranking functions: TF_l∘δ∘p × IDF
Rousseau & Vazirgiannis (2013) [25]: Composite non-linear TF normalizations
◦Smoothed Robertson-Walker IDF
◦BM25 soft length normalization
◦Log-concavity normalization
◦Lower-bounding parameter
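The composition order can be made explicit in code: length normalization (p) is applied first, then the δ shift, then the log-concave transform (l). The sketch assumes the form log((N + 1) / df_t) · (1 + log(1 + log(tf_td / (1 − b + b · L_d / L_avg) + δ))). The function name, the default b and δ, and the explicit domain check are assumptions made for the example.

```python
import math

def tf_ldp_idf_term(tf, doc_len, avg_len, df, n_docs, b=0.2, delta=0.5):
    """TF_l∘δ∘p × IDF weight of one matched query term (illustrative sketch)."""
    if tf == 0:
        return 0.0
    pivoted = tf / (1.0 - b + b * doc_len / avg_len)  # p: soft length normalization
    shifted = pivoted + delta                         # δ: lower-bounding shift
    inner = 1.0 + math.log(shifted)
    if inner <= 0.0:
        # The double logarithm is undefined here; this can only happen
        # when delta is at most 1/e and the document is very long.
        raise ValueError("tf normalization fell outside the log domain")
    concave = 1.0 + math.log(inner)                   # l: log-concavity
    return math.log((n_docs + 1.0) / df) * concave
```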

Ranking functions: LM-DS
Zhai & Lafferty (2001): Unigram Language Model with Dirichlet Prior Smoothing
◦Smoothing component
◦Matched term component
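The split into a smoothing component and a matched term component follows from rewriting the Dirichlet-smoothed query log-likelihood. The sketch below computes both the full log-likelihood and the rank-equivalent two-component form, so the relationship can be checked numerically. The function names, the dictionary-based collection model coll_prob, and the default μ are assumptions made for the example.

```python
import math

def lm_ds_full(query, doc_tf, doc_len, coll_prob, mu=2500.0):
    """Full query log-likelihood under Dirichlet prior smoothing."""
    return sum(
        math.log((doc_tf.get(t, 0) + mu * coll_prob[t]) / (doc_len + mu))
        for t in query
    )

def lm_ds_rank_equivalent(query, doc_tf, doc_len, coll_prob, mu=2500.0):
    """Rank-equivalent form: matched term component plus smoothing component."""
    matched = sum(
        math.log(1.0 + doc_tf[t] / (mu * coll_prob[t]))
        for t in query
        if doc_tf.get(t, 0) > 0
    )
    smoothing = len(query) * math.log(mu / (doc_len + mu))
    return matched + smoothing
```

The two scores differ by Σ_t log p(t|C), which is the same for every document, so they produce the same ranking while the second form only touches terms that occur in the document.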

Ranking functions: LM-PYP
Momtazi & Klakow (2010): Unigram LM with Pitman-Yor Process smoothing
◦Power-law discounting

Ranking functions: LM-PYP-TFIDF
Puurula (2012): LM-PYP with TF-IDF feature weighting
◦TF-IDF feature weighting

ATIRE KL-divergence feedback
Rank terms in the top k retrieved documents R_i using KL-divergence between:
◦Top-k document model
◦Collection model
Expand the query with the top n ranked terms using Rocchio feedback, combining:
◦Feedback query vector
◦Original query vector

Truncated model-based feedback
Reweight the original query terms using the top-k documents, with posterior probabilities of the documents as mixture weights
Interpolate with the original query weights, combining:
◦Original query vector
◦Feedback query vector
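One way to read the two steps above is sketched below. It is a rough illustration under stated assumptions, not the ATIRE implementation: document posteriors are taken as a softmax over log-scale retrieval scores, each document model is the unsmoothed relative frequency tf_td / L_d, and the interpolation weight lam is a hypothetical parameter name.

```python
import math

def truncated_feedback(query_weights, top_docs, lam=0.5):
    """Reweight the original query terms only; no new terms are added.

    query_weights: term -> original weight
    top_docs:      list of (log_score, doc_tf, doc_len) for the top-k documents
    """
    # Posterior probability of each top-k document, used as its mixture weight.
    max_score = max(score for score, _, _ in top_docs)
    post = [math.exp(score - max_score) for score, _, _ in top_docs]
    z = sum(post)
    post = [p / z for p in post]

    # Feedback weight of each original query term under the document mixture.
    fb = {
        t: sum(p * tf.get(t, 0) / dl for p, (_, tf, dl) in zip(post, top_docs))
        for t in query_weights
    }
    fb_total = sum(fb.values())
    orig_total = sum(query_weights.values())

    # Interpolate the normalized original and feedback query vectors.
    new_weights = {}
    for t, w in query_weights.items():
        fb_w = fb[t] / fb_total if fb_total > 0 else 0.0
        new_weights[t] = (1.0 - lam) * w / orig_total + lam * fb_w
    return new_weights
```

Restricting the reweighting to the original query terms is what makes the feedback "truncated": the query grows no longer, only its term weights change.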

Parameter optimization
Parameters for each ranking function were optimized on INEX Wikipedia 2009
◦Parameters constrained to reasonable ranges
◦Particle Swarm Optimization with 64 particles and 20 generations
◦50 generations used for models with feedback (with up to 8 parameters)
Functions were tested on the INEX Wikipedia 2010 and TREC 1-8 datasets
◦INEX Wikipedia 2010: same documents as INEX 2009, different queries
◦TREC 1-8: different documents, different queries
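The optimization loop can be sketched as a basic global-best particle swarm over box-constrained parameters. Only the 64 particles and 20 generations come from the slide. The inertia and acceleration coefficients, the seed, the function names, and the toy objective are assumptions. In the experiments the objective would be MAP on the training queries, which is replaced here by a simple function with a known maximum.

```python
import random

def pso_maximize(objective, bounds, n_particles=64, generations=20,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO over box constraints (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # best position seen by each particle
    best_val = [objective(p) for p in pos]
    g = max(range(n_particles), key=best_val.__getitem__)
    g_pos, g_val = best_pos[g][:], best_val[g]  # best position seen by the swarm

    for _ in range(generations):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (best_pos[i][d] - pos[i][d])
                             + c2 * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to the allowed range, mirroring the constrained parameters.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val > best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val > g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val

def toy_objective(params):
    """Stand-in for MAP: peaks at k1 = 1.2, b = 0.75 (made-up target)."""
    k1, b = params
    return -((k1 - 1.2) ** 2 + (b - 0.75) ** 2)

best_params, best_score = pso_maximize(toy_objective, [(0.0, 3.0), (0.0, 1.0)])
```

Each objective evaluation in the real setting is a full retrieval run over the training queries, which is why the particle and generation counts matter for cost.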

First observations
Same documents, different queries (INEX 2010):
◦Differences between ranking functions are very small
Different documents, different queries (TREC 1-8):
◦BM25-adpt is slightly better than the others on 5 out of 9 collections
◦Most likely due to the collection-adaptive k1 parameters
◦LMs are generally worse than the BM25 variants
◦But the ATIRE LM implementations are not extensively optimized, unlike BM25

More observations
Feedback is very effective for both BM25 and LM
◦ATIRE KL-feedback fails on LMs, truncated model-based feedback works
Stopping harms BM25 strongly, stemming can help
◦Porter stemming seems to harm
◦S-stemmer and Krovetz help

Final observations: feedback+stemming
Feedback+stemming improves BM25 and LM+DP
◦No ranking function is clearly better than the rest
◦Stemming is effective
◦Again, ATIRE KL-feedback fails on LMs, truncated model-based feedback works
Paired 1-tailed t-tests of the best-performing functions:
◦Feedback is better than no feedback (p=0.0267)
◦Stemming with feedback is better than just feedback (p=0.0292)
◦Stemming with feedback is better than neither (p<0.0001)

Conclusions
Differences between the suggested BM25 ranking functions become very small when parameters are optimized on a different but similar dataset
◦LM power-law discounting is particularly brittle, BM25 parameters are more stable
Feedback works for both BM25 and LM, but different feedback functions are needed
Stopping harms BM25, stemming can help
Results were exploratory, but in this scenario BM25 seems to outperform LM
◦Implementation differences can reduce ranking function performance
◦Optimization becomes increasingly difficult with many parameters

Rewriting BM25 (BM25L example)

Robertson & Sparck-Jones 1976
