
1 Diversifying Search Results (WSDM 2009)
Rakesh Agrawal, Sreenivas Gollapudi, Alan Halverson, Samuel Ieong (Microsoft Research)
Presented by Sung Eun Park, 1/25/2011
Intelligent Database Systems Lab, School of Computer Science & Engineering, Seoul National University
Center for E-Business Technology, Seoul National University, Seoul, Korea

2 Copyright © 2010 by CEBT
Contents
- Introduction: Intuition, Preliminaries
- Model: Problem Formulation, Complexity, Greedy Algorithm
- Evaluation: Measure, Empirical Analysis

3 Copyright © 2010 by CEBT
Introduction
- Ambiguity and diversification: for ambiguous queries, diversification may help users find at least one relevant document.
- Example: the other day, we were trying to find the meaning of the word "왕건" as used in the sentence "우와 저거 진짜 왕건이다" ("Wow, that one is a really big one"), but the search results were all about Wang Geon, the founding king of Goryeo.
- Two intents for the same term: 왕건 as the king, and 왕건 as slang for a big thing.

4 Copyright © 2010 by CEBT
Preliminaries
[The slide's content (notation and definitions) was not captured in the transcript.]
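As a sketch of the notation the later slides rely on, following the WSDM 2009 paper (this reconstruction is an assumption, not the slide itself): a query q is mapped to a set of taxonomy categories (intents), and two quantities drive the model.

    % Notation assumed from the WSDM 2009 paper; the slide's own definitions were not captured.
    % P(c | q): probability that query q was issued with intent (category) c.
    % V(d | q, c): probability that document d satisfies a user who issued q with intent c.
    \[
    \sum_{c \in C(q)} P(c \mid q) = 1, \qquad V(d \mid q, c) \in [0, 1]
    \]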

5 Copyright © 2010 by CEBT
Problem Formulation
- 1 − V(d | q, c) is the probability that d fails to satisfy a user who issues query q with the intended category c.
- A query may carry multiple intents, each with probability P(c | q).
- The quantity of interest is the probability that some document in the result set will satisfy category c.
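The slide's equation image was not captured; as a hedged reconstruction from the WSDM 2009 paper, the objective maximized over result sets S with |S| ≤ k is:

    % Reconstructed objective (based on the paper, not recovered from the slide itself):
    \[
    P(S \mid q) \;=\; \sum_{c} P(c \mid q) \left( 1 - \prod_{d \in S} \bigl( 1 - V(d \mid q, c) \bigr) \right)
    \]
    % i.e., the probability that at least one document in S satisfies the user,
    % in expectation over the intent distribution P(c | q).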

6 Copyright © 2010 by CEBT
Complexity
[The slide's details were not captured in the transcript; as noted on slide 8, maximizing the diversification objective is NP-hard.]

7 Copyright © 2010 by CEBT
A Greedy Algorithm
- Let R(q) be the top k documents selected by some classical ranking algorithm for the target query.
- The algorithm reorders R(q) to maximize the objective P(S | q).
- Input: k, q, C, D, P(c | q), V(d | q, c). Output: a set of documents S.
- Properties: produces an ordered set of results; results are not proportional to the intent distribution; results are not ordered by raw quality alone.
[The slide's worked example, a table of V(d | q, c), g(d | q, c), U(R | q), and U(B | q) values, did not survive the transcript intact.]

8 Copyright © 2010 by CEBT
Greedy Algorithm (IA-SELECT)
Input: k, q, C, D, P(c | q), V(d | q, c). Output: set of documents S.

    S ← ∅
    ∀c ∈ C, U(c | q) ← P(c | q)
    while |S| < k do
        for d ∈ D do
            g(d | q, c) ← Σ_c U(c | q) V(d | q, c)
        end for
        d* ← argmax_d g(d | q, c)
        S ← S ∪ {d*}
        ∀c ∈ C, U(c | q) ← (1 − V(d* | q, c)) U(c | q)
        D ← D \ {d*}
    end while

- U(c | q): conditional probability of intent c given query q; updated after each selection, it acts as the marginal utility of still covering c.
- g(d | q, c): the current probability that d satisfies q and c under the remaining intent mass.
- When documents may belong to multiple categories, IA-SELECT is no longer guaranteed to be optimal (note that this problem is NP-hard).
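A minimal runnable sketch of IA-Select in Python, assuming the intent distribution P(c | q) is given as a dict and the qualities V(d | q, c) as a nested dict; all names, the data layout, and the tiny example at the bottom are illustrative assumptions, not taken from the slides:

    def ia_select(k, docs, p_intent, quality):
        """Greedy IA-Select sketch (hypothetical signature, not the paper's code).

        k         -- number of documents to return
        docs      -- candidate document ids (the top-k pool R(q))
        p_intent  -- dict: c -> P(c | q), intent distribution for the query
        quality   -- dict: d -> {c: V(d | q, c)}, prob. that d satisfies intent c
        """
        remaining = set(docs)
        u = dict(p_intent)                 # U(c | q): residual (marginal) intent mass
        selected = []
        while len(selected) < k and remaining:
            # g(d | q, c) summed over intents: marginal gain of adding d
            best = max(remaining,
                       key=lambda d: sum(u[c] * quality[d].get(c, 0.0) for c in u))
            selected.append(best)
            # Discount each intent by the chance that `best` already satisfied it
            for c in u:
                u[c] *= 1.0 - quality[best].get(c, 0.0)
            remaining.remove(best)
        return selected

    if __name__ == "__main__":
        # Tiny made-up example: two intents, three documents
        p = {"king": 0.8, "big_thing": 0.2}
        v = {"d1": {"king": 0.9}, "d2": {"king": 0.5}, "d3": {"big_thing": 0.9}}
        print(ia_select(2, ["d1", "d2", "d3"], p, v))   # -> ['d1', 'd3']

Note how d3 is chosen over the higher-quality d2: after d1 nearly exhausts the "king" intent, the residual mass on "big_thing" dominates, which is exactly the diversification behavior described on slide 7.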

9 Copyright © 2010 by CEBT
Classical IR Measures (1): CG, DCG, NDCG
Result doc set: Doc 1 rel=3, Doc 2 rel=3, Doc 3 rel=2, Doc 4 rel=0, Doc 5 rel=1, Doc 6 rel=2.
- Cumulative Gain: 3 + 3 + 2 + 0 + 1 + 2 = 11; the ranking order is not important.
- Discounted Cumulative Gain: 3 + 3/log 2 + 2/log 3 + 0/log 4 + 1/log 5 + 2/log 6 (logs base 2).
- Normalized Discounted Cumulative Gain: DCG divided by the Ideal DCG. Here the ideal ordering is (3, 3, 2, 2, 1, 0), so IDCG = 3 + 3/log 2 + 2/log 3 + 2/log 4 + 1/log 5.
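A small sketch of these measures in Python, using base-2 logs and the slide's rel_1 + Σ rel_i/log i convention (function names and the printout are illustrative, not from the slide):

    import math

    def dcg(rels):
        """DCG with the slide's convention: rel_1 + sum_{i>=2} rel_i / log2(i)."""
        return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

    def ndcg(rels):
        """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
        ideal = sorted(rels, reverse=True)
        return dcg(rels) / dcg(ideal)

    rels = [3, 3, 2, 0, 1, 2]        # the slide's result doc set
    print(sum(rels))                  # CG = 11
    print(round(dcg(rels), 3))        # DCG = 3 + 3/1 + 2/log2(3) + 0 + 1/log2(5) + 2/log2(6)
    print(round(ndcg(rels), 3))       # NDCG = DCG / IDCG, ideal order (3, 3, 2, 2, 1, 0)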

10 Copyright © 2010 by CEBT
Classical IR Measures (2): RR, MRR
- Navigational search / question answering: the user needs only a few high-ranked results.
- Reciprocal Rank: how far is the first answer document from rank 1? In the example result set (Doc N, Doc P, Doc N, Doc N, Doc N), the relevant document P appears at rank 2, so RR = 1/2 = 0.5.
- Mean Reciprocal Rank: the mean of the RR values over the query test set.
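A sketch of RR and MRR in Python, with relevance given as a boolean list per query (names and data layout are illustrative assumptions):

    def reciprocal_rank(relevant_flags):
        """RR: 1 / rank of the first relevant result, or 0 if none is relevant."""
        for rank, is_rel in enumerate(relevant_flags, start=1):
            if is_rel:
                return 1.0 / rank
        return 0.0

    def mean_reciprocal_rank(queries):
        """MRR: mean of RR over a set of queries (each a list of relevance flags)."""
        return sum(reciprocal_rank(q) for q in queries) / len(queries)

    # The slide's example: N, P, N, N, N -> first relevant at rank 2 -> RR = 0.5
    print(reciprocal_rank([False, True, False, False, False]))   # 0.5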

11 Copyright © 2010 by CEBT
Classical IR Measures (3): MAP
- Average Precision: the average of the precision values measured at each relevant document, e.g. (1.00 + 1.00 + 0.75 + 0.67 + 0.38) / 6 = 0.633.
- Mean Average Precision: the average of the average-precision values over a set of queries, MAP = (AP1 + AP2 + ... + APn) / (# of queries).
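A sketch of AP and MAP in Python, taking AP as the sum of precision-at-k at each relevant rank divided by the total number of relevant documents (this convention, the names, and the example ranking are assumptions; the slide's own ranking was not captured):

    def average_precision(relevant_flags, num_relevant):
        """AP: sum of precision@k at each relevant rank, divided by total relevant docs."""
        hits, precision_sum = 0, 0.0
        for rank, is_rel in enumerate(relevant_flags, start=1):
            if is_rel:
                hits += 1
                precision_sum += hits / rank     # precision at this rank
        return precision_sum / num_relevant if num_relevant else 0.0

    def mean_average_precision(per_query_aps):
        """MAP = (AP1 + AP2 + ... + APn) / (# of queries)."""
        return sum(per_query_aps) / len(per_query_aps)

    # Illustrative ranking only, not the slide's example
    print(average_precision([True, True, False, True, False], num_relevant=4))  # 0.6875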

12 Copyright © 2010 by CEBT
Evaluation Measure
[The slide's formulas were not captured in the transcript; slide 14 reports results under the intent-aware variants NDCG-IA, MAP-IA, and MRR-IA.]
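As a sketch following the WSDM 2009 paper (the slide itself was not captured, so this reconstruction is an assumption), the intent-aware version of a classical measure averages the per-intent measure under the intent distribution:

    % Intent-aware metric, reconstructed from the paper; illustrative of all three
    % variants (NDCG-IA, MAP-IA, MRR-IA) used on the results slide.
    \[
    \text{NDCG-IA}(S, k) \;=\; \sum_{c} P(c \mid q)\, \text{NDCG}(S, k \mid c)
    \]
    % Compute NDCG with relevance judged with respect to intent c, then take the
    % expectation over P(c | q); MAP-IA and MRR-IA are defined analogously.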

13 Copyright © 2010 by CEBT
Empirical Evaluation
- 10,000 queries randomly sampled from logs; queries classified according to ODP (level 2) by a query classifier; only queries with at least two intents are kept (~900).
- Top 50 results from Live, Google, and Yahoo!.
- Documents are rated on a 5-point scale against a proprietary repository of human judgments; >90% of documents have ratings, and documents without ratings are assigned a random grade according to the distribution of rated documents.

14 Copyright © 2010 by CEBT
Results: NDCG-IA, MAP-IA, and MRR-IA
[The result charts were not captured in the transcript.]

15 Copyright © 2010 by CEBT
Evaluation using Mechanical Turk
- Sample 200 queries from the dataset used in Experiment 1.
- For each query, workers are shown candidate categories (category 1, category 2, category 3, ...) and pick the category they most closely associate with the given query.
- Workers then judge the corresponding result documents (Doc 1 ... Doc 5, rel = ?) with respect to the chosen category, using the same 4-point scale.

16 Copyright © 2010 by CEBT
[Slide content not captured in the transcript.]

17 Evaluation using Mechanical Turk (continued)
[Slide content not captured in the transcript.]

18 Copyright © 2010 by CEBT
Conclusion
- Studied how best to diversify results in the presence of ambiguous queries.
- Provided a greedy algorithm for the objective with good approximation guarantees.

19 Q&A
Thank you

