Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval. Ben Carterette and Praveen Chandar, Dept. of Computer and Information Science, University of Delaware, Newark, DE (CIKM '09). Date: 2010/05/03. Speaker: Lin, Yi-Jhen. Advisor: Dr. Koh, Jia-Ling.

Agenda: Introduction (motivation, goal); Faceted Topic Retrieval (task, evaluation); Faceted Topic Retrieval Models (four models); Experiments and Results; Conclusion.

Introduction - Motivation Modeling documents as independently relevant does not necessarily provide the optimal user experience.

Introduction - Motivation A traditional evaluation measure would reward System 1, since it has higher recall. Actually, we prefer System 2, since it conveys more information: System 2 is better!

Introduction Novelty and diversity become the new criteria for relevance and for evaluation measures. They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic. We call this faceted topic retrieval.

Introduction - Goal The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets: 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets.

Faceted Topic Retrieval - Task Define the task in terms of (1) the information need: a faceted topic retrieval information need is one that has a set of answers - facets - that are clearly delineated; and (2) how that need is best satisfied: each answer is fully contained within at least one document.

Faceted Topic Retrieval - Task Example information need, with its facets (a set of answers): invest in next-generation technologies; increase use of renewable energy sources; invest in renewable energy sources; double ethanol in gas supply; shift to biodiesel; shift to coal.

Faceted Topic Retrieval - Our System Input: a query, i.e., a short list of keywords. Output: a ranked list of documents D1 ... Dn that contains as many unique facets as possible.

Faceted Topic Retrieval - Evaluation Three measures are used: S-recall, S-precision, and redundancy.

Evaluation - an example for S-recall and S-precision Total: 10 facets (assume the facets in the documents do not overlap).

Evaluation – an example for Redundancy
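The two list-based measures above can be sketched in a few lines of code. This is a simplified illustration, not the paper's exact formulation: S-recall@k is taken as the fraction of all facets covered by the top-k documents, and redundancy as the fraction of facet occurrences in the top-k documents that repeat an already-seen facet; both definitions are assumptions for the sketch.

```python
def s_recall(ranking, total_facets, k):
    """Fraction of all facets covered by the top-k documents.
    `ranking` is a list of sets of facet ids, one set per document."""
    covered = set().union(*ranking[:k]) if k > 0 else set()
    return len(covered) / total_facets

def redundancy(ranking, k):
    """Assumed definition: fraction of facet occurrences in the top-k
    documents that repeat a facet seen earlier in the list."""
    seen, repeats, occurrences = set(), 0, 0
    for facets in ranking[:k]:
        for f in facets:
            occurrences += 1
            if f in seen:
                repeats += 1
            seen.add(f)
    return repeats / occurrences if occurrences else 0.0

# Toy example: 10 facets in total; three documents covering 4, 4, and 2.
ranking = [{1, 2, 3, 4}, {3, 4, 5, 6}, {7, 8}]
print(s_recall(ranking, 10, 3))   # 8 of 10 facets covered -> 0.8
print(redundancy(ranking, 3))     # 2 repeats in 10 occurrences -> 0.2
```

A perfectly novel ranking would score 0.0 redundancy; documents that only restate facets already retrieved drive the measure toward 1.0.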

Faceted topic retrieval models Four models are considered: MMR (Maximal Marginal Relevance); a probabilistic interpretation of MMR; greedy result set pruning; and a probabilistic set-based approach.

1. MMR and 2. Probabilistic Interpretation of MMR MMR iteratively selects the document that balances relevance to the query against novelty with respect to the already-selected set S: MMR = argmax_{Di} [ lambda * sim(Di, Q) - (1 - lambda) * max_{Dj in S} sim(Di, Dj) ]. The probabilistic interpretation recovers this form by setting c1 = 0 and c3 = c4.
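A minimal sketch of greedy MMR re-ranking, assuming precomputed similarity scores (the similarity functions and the lambda weight are placeholders, not the paper's exact estimates):

```python
def mmr_rank(query_sims, doc_sims, lam=0.7):
    """Greedy MMR re-ranking.  query_sims[i] = sim(D_i, Q);
    doc_sims[i][j] = sim(D_i, D_j).  At each step pick the document
    maximizing lam*sim(D,Q) - (1-lam)*max_{D_j selected} sim(D,D_j)."""
    remaining = set(range(len(query_sims)))
    selected = []
    while remaining:
        def score(i):
            redund = max((doc_sims[i][j] for j in selected), default=0.0)
            return lam * query_sims[i] - (1 - lam) * redund
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates (sim 0.95); doc 2 is dissimilar.
qs = [0.9, 0.8, 0.5]
ds = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
print(mmr_rank(qs, ds))  # [0, 2, 1]: the near-duplicate is demoted
```

With lam = 1.0 the ranking degenerates to pure relevance order, which is one way to see how the novelty term changes the result list.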

3. Greedy Result Set Pruning First, rank without considering novelty (in order of relevance). Second, step down the list of documents and prune documents with similarity to a higher-ranked document greater than some threshold, i.e., at rank i, remove any document Dj, j > i, with sim(Dj, Di) above the threshold.
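The pruning step above can be sketched as follows (the similarity function and threshold value are assumptions for illustration; the experiments in the talk use cosine similarity):

```python
def prune(ranked_docs, sim, theta):
    """Walk down a relevance-ordered list and drop any document whose
    similarity to an already-kept, higher-ranked document exceeds theta."""
    kept = []
    for d in ranked_docs:
        if all(sim(d, k) <= theta for k in kept):
            kept.append(d)
    return kept

# Illustration with Jaccard similarity over term sets.
jaccard = lambda a, b: len(a & b) / len(a | b)
docs = [{"solar", "wind"}, {"solar", "wind", "power"}, {"coal"}]
print(prune(docs, jaccard, 0.5))  # the near-duplicate 2nd doc is removed
```

Because kept documents are fixed once accepted, the pruned list preserves the original relevance order of the survivors.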

4. A Probabilistic Set-Based Approach P(Fj|Di) is the probability that document Di contains facet Fj. The probability that a facet Fj occurs in at least one document in a set D is P(Fj|D) = 1 - prod_{Di in D} (1 - P(Fj|Di)), and the probability that all of the facets in a set F are captured by the documents in D is P(F|D) = prod_{Fj in F} [ 1 - prod_{Di in D} (1 - P(Fj|Di)) ].
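Under the independence assumptions on the slide, the two coverage probabilities are direct products; a small sketch (the input probabilities are made up for illustration):

```python
from math import prod

def facet_covered(p_fd):
    """p_fd: list of P(F_j | D_i) over the documents D_i in the set.
    Probability the facet occurs in at least one of them."""
    return 1 - prod(1 - p for p in p_fd)

def all_facets_covered(p_matrix):
    """p_matrix[j][i] = P(F_j | D_i).  Probability that every facet
    is captured by at least one document in the set."""
    return prod(facet_covered(row) for row in p_matrix)

# Two facets over two documents.
p = [[0.5, 0.5],   # facet 1: covered with prob 1 - 0.5*0.5 = 0.75
     [0.9, 0.0]]   # facet 2: covered with prob 0.9
print(all_facets_covered(p))  # 0.75 * 0.9 = 0.675
```

Note how adding a document can only raise each facet's coverage probability, which is what makes a greedy document-selection strategy plausible here.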

4. A Probabilistic Set-Based Approach 4.1 Hypothesizing Facets 4.2 Estimating Document-Facet Probabilities 4.3 Maximizing Likelihood

4.1 Hypothesizing Facets Two unsupervised probabilistic methods: relevance modeling, and topic modeling with LDA. Instead of extracting facets directly from any particular word or phrase, we build a "facet model" P(w|F).

4.1 Hypothesizing Facets Since we do not know the facet terms or the set of documents relevant to each facet, we estimate them from the retrieved documents: obtain m models from the top m retrieved documents by taking each document, along with its k nearest neighbors, as the basis for a facet model.
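A rough sketch of the document-plus-neighbors construction, assuming cosine similarity over raw term counts and simple count pooling (the pooling scheme is an assumption; the talk estimates the models with RM2 or LDA rather than raw counts):

```python
from collections import Counter

def cosine(c1, c2):
    """Cosine similarity between two term-count Counters."""
    num = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    norm = lambda c: sum(v * v for v in c.values()) ** 0.5
    return num / (norm(c1) * norm(c2)) if num else 0.0

def facet_models(docs, k=1):
    """For each retrieved document, pool its term counts with those of
    its k nearest neighbors into a unigram facet model P(w|F)."""
    counts = [Counter(d.split()) for d in docs]
    models = []
    for i, c in enumerate(counts):
        nbrs = sorted((j for j in range(len(docs)) if j != i),
                      key=lambda j: cosine(c, counts[j]), reverse=True)[:k]
        pooled = c.copy()
        for j in nbrs:
            pooled.update(counts[j])
        total = sum(pooled.values())
        models.append({w: n / total for w, n in pooled.items()})
    return models
```

Each returned model is a normalized term distribution, so downstream code can treat it directly as P(w|F_j).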

Relevance modeling Estimate m "facet models" P(w|Fj) from the set of retrieved documents using the so-called RM2 approach, where D_Fj is the set of documents relevant to facet Fj and the f_k are the facet terms.

Topic modeling with LDA The probabilities P(w|Fj) and P(Fj) can be found through expectation maximization.

4.2 Estimating Document-Facet Probabilities Both the facet relevance model and the LDA model produce generation probabilities P(Di|Fj): the probability that sampling terms from the facet model Fj will produce document Di.

4.3 Maximizing Likelihood Define a likelihood function L(y) over indicators marking which documents are included, subject to the constraint that at most K documents are chosen, where K is the hypothesized minimum number required to cover the facets. Maximizing L(y) is an NP-hard problem, so an approximate solution is used: for each facet Fj, take the document Di with maximum P(Di|Fj).
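The per-facet argmax approximation is simple to state in code; a sketch (the probability matrix is illustrative, not from the paper):

```python
def greedy_facet_cover(p_df):
    """Approximate the NP-hard selection problem: for each facet F_j,
    keep the document with maximum P(D_i | F_j); the set of chosen
    documents is the union of those per-facet argmaxes."""
    chosen = set()
    for row in p_df:                    # one row of P(D_i | F_j) per facet
        chosen.add(max(range(len(row)), key=row.__getitem__))
    return sorted(chosen)

# 3 facets, 4 documents: facets 0 and 2 are both best covered by doc 1,
# so the cover needs only two documents.
p_df = [[0.1, 0.7, 0.2, 0.0],
        [0.6, 0.1, 0.1, 0.2],
        [0.0, 0.5, 0.3, 0.2]]
print(greedy_facet_cover(p_df))  # [0, 1]
```

The example shows why the result set can be smaller than the number of facets: one strong document may be the argmax for several facets at once.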

Experiment - Data TDT5 corpus (278,109 documents). Each query (a short list of keywords) is run through a query-likelihood language model to obtain the top 130 retrieved documents D1 ... D130.

Experiment - Data Two assessors judged the top 130 retrieved documents. Over the 60 queries: 44.7 relevant documents per query on average; each document contains 4.3 facets; 39.2 unique facets per query on average (roughly one unique facet per relevant document). Agreement: 72% of all relevant documents were judged relevant by both assessors.

Experiment - Data A sample TDT5 topic definition, showing the query and the facet judgments.

Experiment - Retrieval Engines (implemented with the Lemur toolkit) LM baseline: a query-likelihood language model. RM baseline: pseudo-relevance feedback with a relevance model. MMR: query similarity scores from the LM baseline and cosine similarity for novelty. AvgMix (probabilistic MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score. Pruning: removing documents from the LM baseline ranking based on cosine similarity. FM: the set-based facet model.

Experiment - Retrieval Engines FM, the set-based facet model, comes in two variants. FM-RM: each of the top m documents, together with its k nearest neighbors, becomes a "facet model" P(w|Fj), from which we compute the probability P(Di|Fj). FM-LDA: use LDA to discover subtopics zj and obtain P(zj|D); we extract 50 subtopics.

Experiments - Evaluation Use five-fold cross-validation to train and test the systems: 48 queries in four folds are used to train model parameters, which are then used to obtain ranked results on the remaining 12 queries. At the minimum optimal rank for S-recall, we report S-recall, redundancy, and MAP.

Results

Results

Conclusion We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need in a small set of documents. We presented two novel models: one that prunes a retrieval ranking, and one formally motivated probabilistic set-based model. Both models are competitive with MMR and outperform another probabilistic model.