1 Probabilistic Models of Novel Document Rankings for Faceted Topic Retrieval Ben Carterette and Praveen Chandar Dept. of Computer and Information Science University of Delaware Newark, DE (CIKM '09) Date: 2010/05/03 Speaker: Lin, Yi-Jhen Advisor: Dr. Koh, Jia-Ling

2 Agenda Introduction (Motivation, Goal); Faceted Topic Retrieval (Task, Evaluation); Faceted Topic Retrieval Models (four models); Experiment & Results; Conclusion

3 Introduction - Motivation Modeling documents as independently relevant does not necessarily provide the optimal user experience.

4 Introduction - Motivation A traditional evaluation measure would reward System1, since it has higher recall. In fact, we prefer System2, since it covers more distinct information: System2 is better!

5 Introduction Novelty and diversity become the new definition of relevance and the basis of new evaluation measures. They can be achieved by retrieving documents that are relevant to the query but cover different facets of the topic; we call this faceted topic retrieval.

6 Introduction - Goal The faceted topic retrieval system must be able to find a small set of documents that covers all of the facets: 3 documents that cover 10 facets are preferable to 5 documents that cover the same 10 facets.

7 Faceted Topic Retrieval - Task Define the task in terms of (1) the information need: a faceted topic retrieval information need is one that has a set of answers, called facets, that are clearly delineated; and (2) how that need is best satisfied: each answer is fully contained within at least one document.

8 Faceted Topic Retrieval - Task Example. Information need: invest in next generation technologies / increase the use of renewable energy sources. Facets (a set of answers): invest in renewable energy sources; double ethanol in the gas supply; shift to biodiesel; shift to coal.

9 Faceted Topic Retrieval - Our System Input: a query, i.e. a short list of keywords. Output: a ranked list of documents D1, D2, ..., Dn that contain as many unique facets as possible.

10 Faceted Topic Retrieval - Evaluation S-recall S-precision Redundancy

11 Evaluation – an example for S-recall and S-precision, worked over a topic with 10 facets (assuming all facets in the documents are non-overlapping)

12 Evaluation – an example for Redundancy
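
To make these measures concrete, here is a minimal Python sketch of S-recall and redundancy computed from facet judgments on a toy ranking. The function names and toy data are illustrative; the redundancy definition used here (the fraction of facet occurrences that repeat an already-covered facet) is an assumption rather than the paper's exact formula, and S-precision is omitted because it also requires the optimal ranking.

```python
# Hypothetical sketch: facet-based evaluation of a ranked list.
# `facets[d]` maps a document id to the set of facet ids it covers
# (in practice these come from assessor judgments).

def s_recall(ranking, facets, total_facets, k):
    """Fraction of all facets covered by the top-k documents."""
    covered = set()
    for doc in ranking[:k]:
        covered |= facets.get(doc, set())
    return len(covered) / total_facets

def redundancy(ranking, facets, k):
    """Fraction of facet occurrences in the top k that repeat an
    already-covered facet (one plausible reading of 'redundancy')."""
    seen, repeats, occurrences = set(), 0, 0
    for doc in ranking[:k]:
        for f in facets.get(doc, set()):
            occurrences += 1
            if f in seen:
                repeats += 1
            seen.add(f)
    return repeats / occurrences if occurrences else 0.0

if __name__ == "__main__":
    facets = {"d1": {1, 2, 3}, "d2": {2, 3}, "d3": {4, 5}}
    ranking = ["d1", "d2", "d3"]
    print(s_recall(ranking, facets, total_facets=10, k=3))  # 5 of 10 facets -> 0.5
    print(redundancy(ranking, facets, k=3))                 # 2 of 7 occurrences repeat
```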

13 Faceted Topic Retrieval Models Four models: - MMR (Maximal Marginal Relevance) - Probabilistic Interpretation of MMR - Greedy Result Set Pruning - A Probabilistic Set-Based Approach

14 1. MMR: rank by maximal marginal relevance, $\mathrm{MMR}(D_i) = \lambda\,\mathrm{sim}(D_i, Q) - (1-\lambda)\,\max_{D_j \in S}\mathrm{sim}(D_i, D_j)$, where $S$ is the set of documents already selected. 2. Probabilistic Interpretation of MMR: the MMR score is recast in terms of probabilities of relevance and novelty with constants $c_1, \dots, c_4$; let $c_1 = 0$ and $c_3 = c_4$.
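
A minimal sketch of greedy MMR re-ranking as written above; `sim` is assumed to be some similarity function (e.g., cosine similarity over term vectors), and the names and defaults are illustrative.

```python
def mmr_rerank(query, candidates, sim, lam=0.5, k=10):
    """Greedily build a ranking that trades off query similarity against
    the maximum similarity to any already-selected document."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(d, query) - (1.0 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```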

15 3. Greedy Result Set Pruning First, rank documents without considering novelty (i.e., in order of relevance). Second, step down the list and prune documents whose similarity to a higher-ranked document exceeds some threshold θ: at rank i, remove any document D_j, j > i, with sim(D_j, D_i) > θ.
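
A sketch of this pruning step; `sim` would typically be cosine similarity, and the threshold value here is illustrative (it would be tuned).

```python
def prune_ranking(ranked_docs, sim, threshold=0.7):
    """Step down a relevance-ordered list and drop any document that is
    more similar than `threshold` to a document already kept."""
    kept = []
    for doc in ranked_docs:
        if all(sim(doc, prev) <= threshold for prev in kept):
            kept.append(doc)
    return kept
```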

16 4. A Probabilistic Set-Based Approach $P(F_j \mid D)$: the probability that document $D$ contains facet $F_j$. The probability that a facet $F_j$ occurs in at least one document in a set $\mathcal{D}$ is $1 - \prod_{D_i \in \mathcal{D}} \bigl(1 - P(F_j \mid D_i)\bigr)$. The probability that all of the facets in a set $\mathcal{F}$ are captured by the documents $\mathcal{D}$ is $\prod_{F_j \in \mathcal{F}} \Bigl[1 - \prod_{D_i \in \mathcal{D}} \bigl(1 - P(F_j \mid D_i)\bigr)\Bigr]$.
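
A small sketch of the two coverage probabilities above, treating the per-document facet probabilities P(F_j|D_i) as independent, as the formulas on this slide do.

```python
def prob_facet_covered(p_facet_given_docs):
    """P(facet occurs in at least one document of the set):
    1 - prod_i (1 - P(F_j | D_i))."""
    p_missed = 1.0
    for p in p_facet_given_docs:
        p_missed *= (1.0 - p)
    return 1.0 - p_missed

def prob_all_facets_covered(p_matrix):
    """P(every facet is covered by the set), where p_matrix[j][i] = P(F_j | D_i):
    the product over facets of the per-facet coverage probability."""
    total = 1.0
    for facet_row in p_matrix:
        total *= prob_facet_covered(facet_row)
    return total
```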

17 4. A Probabilistic Set-Based Approach 4.1 Hypothesizing Facets 4.2 Estimating Document-Facet Probabilities 4.3 Maximizing Likelihood

18 4.1 Hypothesizing Facets Two unsupervised probabilistic methods: relevance modeling and topic modeling with LDA. Instead of extracting facets directly from any particular word or phrase, we build a "facet model" P(w|F).

19 4.1 Hypothesizing Facets Since we do not know the facet terms or the set of documents relevant to each facet, we estimate them from the retrieved documents: obtain m facet models from the top m retrieved documents by taking each document, along with its k nearest neighbors, as the basis for a facet model.
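
One way this grouping step might look, using TF-IDF vectors and cosine nearest neighbors from scikit-learn; the vectorizer and the values of m and k are illustrative choices, not the paper's exact setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

def facet_document_groups(retrieved_docs, m=10, k=5):
    """For each of the top-m retrieved documents, group it with its k nearest
    neighbors; each group is the document set from which one facet model
    P(w|F_j) would be estimated."""
    vectors = TfidfVectorizer().fit_transform(retrieved_docs)
    knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(vectors)
    _, neighbor_idx = knn.kneighbors(vectors[:m])
    # each row holds the document itself plus its k nearest neighbors
    return [set(row.tolist()) for row in neighbor_idx]
```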

20 Relevance modeling Estimate m "facet models" P(w|F_j) from the set of retrieved documents using the so-called RM2 approach, where D_{F_j} denotes the set of documents relevant to facet F_j and f_k the facet terms.
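
For reference, this is the standard RM2 estimator of Lavrenko and Croft, written with facet terms f_k in the role of query terms; the paper's adaptation may differ in its details.

```latex
% RM2-style facet model estimate (assumed form, following Lavrenko & Croft):
% D_{F_j} is the set of documents relevant to facet F_j, f_k the facet terms.
P(w \mid F_j) \;\propto\; P(w) \prod_{k} \sum_{D \in D_{F_j}} P(f_k \mid D)\, P(D \mid w),
\qquad
P(D \mid w) \;=\; \frac{P(w \mid D)\, P(D)}{\sum_{D' \in D_{F_j}} P(w \mid D')\, P(D')}
```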

21 Topic modeling with LDA The probabilities P(w|F_j) and P(F_j) can be found through expectation maximization.

22 4.2 Estimating Document-Facet Probabilities Both the facet relevance model and the LDA model produce generation probabilities P(D_i|F_j): the probability that sampling terms from the facet model F_j will produce document D_i.
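
One plausible way to compute this generation probability, assuming a multinomial facet language model and working in log space to avoid underflow; this is a sketch, not the paper's exact estimator.

```python
import math
from collections import Counter

def log_prob_doc_given_facet(doc_terms, facet_model, vocab_size):
    """log P(D|F): sum over the document's term occurrences of log P(w|F).
    `facet_model` maps terms to probabilities; unseen terms receive a small
    floor probability so the product never collapses to zero."""
    floor = 1.0 / (vocab_size * 1000.0)
    counts = Counter(doc_terms)
    return sum(c * math.log(facet_model.get(w, floor)) for w, c in counts.items())
```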

23 4.3 Maximizing Likelihood Define the likelihood function L(y) over document selections, subject to a constraint involving K, the hypothesized minimum number of documents required to cover the facets. Maximizing L(y) is an NP-hard problem, so an approximate solution is used: for each facet F_j, take the document D_i with maximum P(D_i|F_j).
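
A sketch of the greedy approximation just described; `p_doc_given_facet` is assumed to hold the P(D_i|F_j) values (log probabilities work just as well, since only the argmax per facet matters).

```python
def greedy_facet_cover(facet_ids, doc_ids, p_doc_given_facet):
    """Approximate the NP-hard covering problem: for each facet F_j pick the
    document D_i with the highest P(D_i|F_j); the union of the picks is the
    selected document set."""
    chosen = []
    for f in facet_ids:
        best = max(doc_ids, key=lambda d: p_doc_given_facet[(d, f)])
        if best not in chosen:
            chosen.append(best)
    return chosen
```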

24 Experiment - Data TDT5 corpus (278,109 documents). A query (a short list of keywords) is run through a query-likelihood language model to retrieve the top 130 documents D1, D2, ..., D130.

25 Experiment - Data Two assessors judged the top 130 retrieved documents. Over 60 queries: 44.7 relevant documents per query; each relevant document contains 4.3 facets; 39.2 unique facets per query on average (roughly one unique facet per relevant document). Agreement: 72% of all relevant documents were judged relevant by both assessors.

26 Experiment - Data A sample TDT5 topic definition, showing the query and its facet judgments.

27 Experiment – Retrieval Engines Using the Lemur toolkit. LM baseline: a query-likelihood language model. RM baseline: pseudo-relevance feedback with a relevance model. MMR: query-similarity scores from the LM baseline and cosine similarity for novelty. AvgMix (Prob MMR): the probabilistic MMR model using query-likelihood scores from the LM baseline and the AvgMix novelty score. Pruning: removing documents from the LM baseline ranking based on cosine similarity. FM: the set-based facet model.

28 Experiment – Retrieval Engines FM: the set-based facet model. FM-RM: each of the top m documents, together with its K nearest neighbors, becomes a "facet model" P(w|F_j); we then compute the probability P(D_i|F_j). FM-LDA: use LDA to discover subtopics z_j and obtain P(z_j|D); we extract 50 subtopics.
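
A hedged sketch of the FM-LDA step: discover 50 subtopics with LDA and read off per-document topic proportions as P(z_j|D). scikit-learn is used here purely for illustration; the authors' toolkit and hyperparameters may differ.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_subtopics(retrieved_docs, n_topics=50):
    """Fit LDA on the retrieved documents and return a (n_docs x n_topics)
    matrix of per-document topic proportions, used as P(z_j|D)."""
    counts = CountVectorizer().fit_transform(retrieved_docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    return lda.fit_transform(counts)
```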

29 Experiments - Evaluation Use five-fold cross-validation to train and test the systems: 48 queries in four folds are used to train model parameters, and those parameters are then used to obtain ranked results on the remaining 12 queries. At the minimum optimal rank we report S-recall, redundancy, and MAP.

30 Results

31 Results

32 Conclusion We defined a type of novelty retrieval task called faceted topic retrieval: retrieve the facets of an information need within a small set of documents. We presented two novel models: one that prunes a retrieval ranking and one that is a formally motivated probabilistic set-based model. Both models are competitive with MMR and outperform another probabilistic model.

