Using The Past To Score The Present: Extending Term Weighting Models with Revision History Analysis CIKM'10 Advisor: Jia-Ling Koh Speaker: Sheng-Hong Chung

Outline
– Introduction
– Revision History Analysis
  – Global Revision History Analysis
  – Edit History Burst Detection
  – Revision History Burst Analysis
– Incorporating RHA in retrieval models
– System implementation
– Experiments
– Conclusion

Introduction
Much modern IR research relies on term weighting models.
– Term weighting is a central component of these models.
– The weights are typically frequency-based.
These models examine only one (the final) version of the document to be retrieved, ignoring the actual document generation process.

[Figure: an original document evolves through many revisions into the latest version; the IR model sees only the latest version's term frequency, not the "true" term frequency accumulated over the document's history.]

Introduction
A new term weighting model:
– Uses the revision history of the document.
– Redefines term frequency.
– Aims for a better characterization of a term's true importance in a document.

Revision History Analysis
Global revision history analysis:
– The simplest RHA model.
– Assumes the document grows steadily over time.
– A term is relatively important if it appears in the early revisions.

Revision History Analysis
[Formula slide: the global RHA term frequency, with each revision's contribution discounted by a decay factor.]
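A minimal sketch of the global model described above. The reciprocal-power decay 1/i**rho and the parameter rho are illustrative assumptions, since the slide's exact decay function is not reproduced here; the sketch only shows the shape of the idea (earlier revisions contribute with less discounting per unit of persistence, so terms present early accumulate more weight).

```python
def global_rha_tf(revision_tfs, rho=1.0):
    """Global RHA term frequency for one term.

    revision_tfs: the term's frequency in each revision, ordered
    v_1 (oldest) .. v_n (latest).  Each revision's contribution is
    divided by a decay factor that grows with the revision index.
    The reciprocal-power decay 1 / i**rho is an illustrative
    assumption, not necessarily the paper's exact function.
    """
    return sum(tf / (i ** rho) for i, tf in enumerate(revision_tfs, start=1))
```

For example, a term with per-revision frequencies [3, 2, 1] gets 3/1 + 2/2 + 1/3 ≈ 4.33 under the default rho = 1, so a term appearing only in the first revision outweighs the same count appearing only in the last.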

Burst
[Screenshots of a Wikipedia article at its 1st revision, its 500th revision, and the current revision.]

Burst
[Figure: for the "Avatar" article over Jul.–Dec. 2009, term frequency and document length for "Pandora" and "James Cameron", plotted against edit activity. Bursts of edit activity coincide with events (first photo & trailer released; movie released in Dec. 2009), and bursts in document length coincide with changes in term frequency.]
The global model might be insufficient.

Edit History Burst Detection
[Formula slide: burst detection based on the average revision counts and their deviation.]
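A minimal sketch of a threshold-style burst detector consistent with the "average revision counts" and "deviation" labels above. The mean-plus-k-deviations rule and the parameter k are assumptions for illustration, not necessarily the paper's exact test.

```python
from statistics import mean, stdev

def detect_bursts(counts, k=2.0):
    """Flag edit-activity bursts in per-window revision counts.

    counts: revisions per time window (e.g., per month).
    A window is marked as a burst when its count exceeds the
    average by more than k standard deviations -- an illustrative
    threshold rule, assumed here rather than taken from the paper.
    """
    mu, sigma = mean(counts), stdev(counts)
    return [c > mu + k * sigma for c in counts]
```

On a monthly count series like [5, 4, 6, 5, 50, 5], only the spike month is flagged, matching the intuition from the "Avatar" example: the burst aligns with an external event rather than routine editing.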

Revision History Burst Analysis
A burst resets the decay clock for a term: after a burst, the term weight decays anew.
Decay factor for the j-th burst:
B = {b_1, b_2, …, b_m}: the set of burst indicators for document d, where b_j is the revision index of the end of the j-th burst.

Revision History Burst Analysis
W: the decay matrix, where i indexes a potential burst position and j a document revision.

Revision History Burst Analysis
U = [u_1, u_2, …, u_n]: the burst indicator vector used to filter the decay matrix W so that it retains only the true bursts.

Revision History Burst Analysis
Example: the current document d contains terms {a, b, c} with tf(a)=3, tf(b)=2, tf(c)=1.
Its revisions V = {v_1, v_2, v_3, v_4} have burst indicators B = {b_1, b_2, b_3, b_4} = {1, 0, 1, 0} and term frequencies:
v_1: tf(a)=50, tf(b)=20, tf(c)=30, tf(d)=10
v_2: tf(a)=52, tf(b)=21, tf(c)=33, tf(d)=10
v_3: tf(a)=70, tf(b)=35, tf(c)=40, tf(d)=20
v_4: tf(a)=73, tf(b)=33, tf(c)=48, tf(d)=21
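The burst-aware weighting can be sketched as decay that restarts after each burst, so terms introduced or boosted by a burst keep a high weight. The reciprocal-power decay is again an illustrative assumption, not the paper's exact function.

```python
def burst_rha_tf(revision_tfs, burst_flags, rho=1.0):
    """Burst-aware RHA term frequency for one term.

    burst_flags[i] is 1 if revision i+1 ends a burst.  A burst
    resets the decay clock: the age used in the decay is counted
    from the most recent burst end, not from the first revision.
    The reciprocal-power decay is an illustrative assumption.
    """
    total, last_burst = 0.0, 0   # last_burst = index of most recent burst end
    for i, (tf, flag) in enumerate(zip(revision_tfs, burst_flags), start=1):
        age = i - last_burst     # revisions since the decay clock was reset
        total += tf / (age ** rho)
        if flag:
            last_burst = i
    return total
```

Applied to term a's counts from the example, burst_rha_tf([50, 52, 70, 73], [1, 0, 1, 0]) returns 210.0 with the default rho = 1: revisions right after a burst (v_2 and v_4) regain full weight instead of being discounted by their global position.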

Incorporating RHA in retrieval models
BM25 + RHA, using the RHA term frequency.
Statistical language models + RHA, using the RHA term probability.
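One way to picture the integration on the BM25 side: the RHA-adjusted term frequency simply takes the place of the raw term frequency inside the standard BM25 term weight. The sketch below uses the usual BM25 components; only the rha_tf substitution reflects the RHA idea.

```python
import math

def bm25_rha_score(rha_tf, doc_len, avg_doc_len, df, num_docs,
                   k1=1.2, b=0.75):
    """BM25 weight for one query term, with the raw term frequency
    replaced by the RHA-adjusted term frequency.  All other parts
    are the standard BM25 formulation (idf and length normalization);
    k1 and b take their conventional default values."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * rha_tf * (k1 + 1) / (rha_tf + norm)
```

Because the saturation curve is monotone in the frequency argument, a term whose RHA frequency is boosted by early appearance or by bursts receives a higher BM25 weight than the same raw count would.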

System implementation
[Diagram: the Revision History Analysis component extracts, for each revision, the date of creation/editing and the content change.]
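A minimal sketch of that extraction step, run here on a deliberately simplified page-history XML fragment; real MediaWiki dumps carry XML namespaces, contributor fields, and more, which this sketch omits.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical page-history fragment (real dumps differ).
SAMPLE = """\
<page>
  <title>Avatar (2009 film)</title>
  <revision><timestamp>2009-11-01T12:00:00Z</timestamp>
    <text>Avatar is a film.</text></revision>
  <revision><timestamp>2009-12-20T08:30:00Z</timestamp>
    <text>Avatar is a film by James Cameron set on Pandora.</text></revision>
</page>
"""

def load_revisions(xml_text):
    """Pull (timestamp, text) pairs, oldest first, from the simplified
    fragment above -- the per-revision dates and content that the
    analysis component consumes."""
    page = ET.fromstring(xml_text)
    return [(rev.findtext("timestamp"), rev.findtext("text"))
            for rev in page.iter("revision")]
```

From these pairs, per-revision term frequencies and per-window revision counts (for burst detection) follow directly.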

Evaluation metrics
Queries and labels:
– INEX: provided.
– TREC: a subset of the ad-hoc track.
Metrics:
– bpref (robust to missing judgments)
– MAP: mean average precision
– R-prec: precision at position R
– NDCG: normalized discounted cumulative gain
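For reference, minimal implementations of two of the metrics above: the per-query average precision that MAP averages over queries, and NDCG in one common formulation with the log2 discount.

```python
import math

def average_precision(ranked_relevance):
    """AP for one query: ranked_relevance is a 0/1 list in rank
    order.  MAP is the mean of this value over all queries."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(gains):
    """NDCG: DCG of the ranking divided by the DCG of the ideal
    (descending-gain) ordering, using the log2(rank+1) discount."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0
```

For instance, a ranking with relevant results at positions 1 and 3 has AP = (1/1 + 2/3) / 2 = 5/6, and a perfectly ordered gain list scores NDCG = 1.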

Dataset
INEX: a well-established forum for structured retrieval tasks (based on a Wikipedia collection).
TREC: a performance comparison on a different set of queries, testing general applicability.
Corpus construction (from a Wikipedia dump):
– INEX: 64 topics, top 1000 retrieved articles, 1000 revisions per article.
– TREC: 68 topics, top 1000 retrieved articles, 1000 revisions per article.

INEX Results

Model       bpref            MAP              R-precision
BM25        —                —                —
BM25+RHA    0.375 (+5.93%)   0.360 (+1.69%)   0.337 (+7.32%)
LM          —                —                —
LM+RHA      0.372 (+4.20%)   0.378 (+2.16%)   0.359 (+3.16%)

Parameters tuned on the INEX query set.

TREC Results

Model       bpref              MAP                NDCG
BM25        —                  —                  —
BM25+RHA    0.547** (+4.39%)   0.568** (+3.65%)   0.656** (+3.47%)
LM          —                  —                  —
LM+RHA      0.532 (+0.95%)     0.567 (+1.98%)     0.653 (+1.24%)

Parameters tuned on the INEX query set; ** indicates statistical significance at the 0.01 level under a two-tailed paired t-test.

Cross validation on INEX

5-fold cross validation on the INEX 2008 query set:
Model       bpref            MAP              R-precision
BM25        —                —                —
BM25+RHA    0.312 (+1.63%)   0.291 (+3.56%)   0.320 (-1.23%)
LM          —                —                —
LM+RHA      0.338 (+8.68%)   0.298 (+4.93%)   0.359 (+0.61%)

5-fold cross validation on the INEX 2009 query set:
Model       bpref            MAP              R-precision
BM25        —                —                —
BM25+RHA    0.363 (+2.54%)   0.348 (-1.70%)   0.333 (+6.05%)
LM          —                —                —
LM+RHA      0.366 (+2.52%)   0.375 (+1.35%)   0.352 (+1.15%)

Performance Analysis
[Figure slides.]

Conclusion
– RHA captures an importance signal from the document authoring process.
– Introduced the RHA term weighting approach.
– Integrates naturally with state-of-the-art retrieval models.
– Consistent improvements over the baseline retrieval models.