A Unified Approach to Ranking in Probabilistic Databases
Jian Li, Barna Saha, Amol Deshpande
University of Maryland, College Park, USA
VLDB
Presented by JongHeum Yeon
Probabilistic Databases
Motivation
– Increasing amounts of uncertain data
– Information retrieval, data integration and cleaning, text analytics, social network analysis
Probabilistic Databases
– Annotate tuples with existence probabilities, and/or attribute values with probability distributions
– Interpretation according to the “possible worlds semantics”
Copyright© 2010 by CEBT, Center for E-Business Technology
Possible World Semantics
A probabilistic table (assume tuple-independence) with columns ID, Score, Prob and tuples t1, t2, t3
Three of its possible worlds, each occurring with some probability (w.p.):
– pw1: t1 (score 200), t2 (score 150), t3 (score 100)
– pw2: t1 (score 200), t2 (score 150)
– pw3: t2 (score 150), t3 (score 100)
Score values are used to rank the tuples in every pw
The top-1 answer for each possible world: t1 in pw1 and pw2, t2 in pw3
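The possible-worlds view can be sketched in a few lines of Python. This is only an illustration: the scores (200, 150, 100) come from the slide, but the existence probabilities below are hypothetical placeholders, and the function name `possible_worlds` is an assumption of this sketch. Under tuple-independence, each subset of tuples is one possible world, and its probability is the product of p for present tuples and (1 - p) for absent ones.

```python
from itertools import product

def possible_worlds(table):
    """Enumerate all worlds of a tuple-independent table.

    table: list of (id, score, prob). Yields (world, probability), where
    world is a list of (id, score) sorted by descending score.
    """
    for mask in product([True, False], repeat=len(table)):
        prob = 1.0
        world = []
        for present, (tid, score, p) in zip(mask, table):
            prob *= p if present else (1.0 - p)
            if present:
                world.append((tid, score))
        world.sort(key=lambda t: -t[1])  # rank tuples by score inside the world
        yield world, prob

# Scores from the slide; the probabilities are HYPOTHETICAL placeholders.
table = [("t1", 200, 0.2), ("t2", 150, 0.8), ("t3", 100, 0.4)]

# Probability that each tuple is the top-1 answer, summed over all worlds.
top1 = {}
total = 0.0
for world, prob in possible_worlds(table):
    total += prob
    if world:
        winner = world[0][0]
        top1[winner] = top1.get(winner, 0.0) + prob
```

Enumeration is exponential in the number of tuples, so it is useful only as a reference implementation on tiny tables.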
Probabilistic Databases (cont’d)
Top-k Queries
Multi-criteria optimization problem: [score=100, Pr(t1)=0.5] vs. [score=50, Pr(t2)=1.0]
Many Prior Proposals
– Return k tuples t with the highest score(t)·Pr(t) [exp. score]
– Return the most probable top-k answer [U-top-k] [Soliman et al. ’07]
– At rank i, return the tuple with the max. prob. of being at rank i [U-rank-k] [Soliman et al. ’07]
– Return k tuples t with the largest Pr(r(t)≤k) values [PT-k/GT-k] [Hua et al. ’08] [Zhang et al. ’08]
– Return k tuples t with the smallest expected rank: ∑_pw Pr(pw)·r_pw(t) [Cormode et al. ’09]
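To see why the choice of ranking function matters, the two-tuple example from this slide can be worked through by brute force. The sketch below assumes tuple-independence and distinct scores; the helper names `rank_distribution` and `pr_rank_at_most` are illustrative, not from the paper. It computes the expected score and Pr(r(t)≤k) for both tuples.

```python
from itertools import product

# The two tuples from the slide: (id, score, existence probability).
tuples = [("t1", 100, 0.5), ("t2", 50, 1.0)]

def rank_distribution(tuples):
    """Return {id: {rank: probability}} by enumerating all possible worlds."""
    dist = {tid: {} for tid, _, _ in tuples}
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        for present, (_, _, p) in zip(mask, tuples):
            prob *= p if present else 1.0 - p
        if prob == 0.0:
            continue  # impossible world
        world = sorted((t for present, t in zip(mask, tuples) if present),
                       key=lambda t: -t[1])
        for rank, (tid, _, _) in enumerate(world, start=1):
            dist[tid][rank] = dist[tid].get(rank, 0.0) + prob
    return dist

def pr_rank_at_most(dist, k):
    """PT-k style value Pr(r(t) <= k) for every tuple."""
    return {tid: sum(pr for r, pr in d.items() if r <= k)
            for tid, d in dist.items()}

dist = rank_distribution(tuples)
exp_score = {tid: s * p for tid, s, p in tuples}  # score(t) * Pr(t)
pt1 = pr_rank_at_most(dist, 1)
pt2 = pr_rank_at_most(dist, 2)
```

On this input the expected scores tie (both 50) and so do the Pr(r(t)≤1) values (both 0.5), while Pr(r(t)≤2) prefers t2 (1.0 vs. 0.5), so the ranking depends on which criterion is chosen.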
Top-k Queries (cont’d)
Which one should we use?
Comparing different ranking functions
Normalized Kendall Distance between two top-k answers:
– Penalizes #reversals and #mismatches
– Lies in [0,1], 0: Same answers; 1: Disjoint answers
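The slide does not give the formula for the distance. The sketch below assumes one common definition: the Kendall distance for top-k lists with an optimistic penalty of 0 for pairs whose order cannot be inferred, divided by k². The function name `normalized_kendall_topk` is illustrative. Under this assumption the value is 0 for identical lists and 1 for disjoint ones, matching the properties listed above.

```python
from itertools import combinations

def normalized_kendall_topk(a, b):
    """Normalized Kendall distance between two top-k lists of equal length k.

    ASSUMPTION: optimistic variant (penalty 0 when both items appear in one
    list and neither in the other), normalized by k*k.
    """
    k = len(a)
    if k == 0 or len(b) != k:
        raise ValueError("both lists must have the same non-zero length")
    pos_a = {t: i for i, t in enumerate(a)}
    pos_b = {t: i for i, t in enumerate(b)}
    penalty = 0
    for x, y in combinations(list(set(a) | set(b)), 2):
        in_a = (x in pos_a, y in pos_a)
        in_b = (x in pos_b, y in pos_b)
        if all(in_a) and all(in_b):
            # Reversal: both lists rank both items, in opposite order.
            if (pos_a[x] < pos_a[y]) != (pos_b[x] < pos_b[y]):
                penalty += 1
        elif all(in_a) and any(in_b):
            # b holds only one of the two, so b implicitly ranks it first.
            present, missing = (x, y) if in_b[0] else (y, x)
            if pos_a[missing] < pos_a[present]:
                penalty += 1
        elif all(in_b) and any(in_a):
            present, missing = (x, y) if in_a[0] else (y, x)
            if pos_b[missing] < pos_b[present]:
                penalty += 1
        elif any(in_a) and any(in_b):
            # Mismatch: one item only in a, the other only in b.
            penalty += 1
        # Otherwise both items are in one list and neither in the other: 0.
    return penalty / (k * k)

same = normalized_kendall_topk(["t1", "t2"], ["t1", "t2"])
swapped = normalized_kendall_topk(["t1", "t2"], ["t2", "t1"])
disjoint = normalized_kendall_topk(["t1", "t2"], ["t3", "t4"])
```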
Unified Approach
Define two parameterized ranking functions, PRF^ω and PRF^e, that can simulate or approximate a variety of ranking functions
PRF^e is much more efficient to evaluate than PRF^ω
Outline
– Parameterized Ranking Functions
– Computing PRF
– Independent Tuples
– Computing PRF^e(α)
– x-Tuples
– Summary of Other Results
– Approximating Ranking Functions
– Experiments
Parameterized Ranking Function
Weight Function: ω(t,i), the weight given to tuple t at rank i
Parameterized Ranking Function (PRF): Υ_ω(t) = ∑_{i>0} ω(t,i)·Pr(r(t)=i)
Return k tuples with the highest |Υ_ω(t)| values.
Parameterized Ranking Function (cont’d)
ω(t,i) = 1: Rank the tuples by probabilities
ω(t,i) = score(t): E-Score
PRF^e(α): ω(i) = α^i, where α can be a real or a complex number
PRF^ω(h): Generalizes PT/GT-k and U-Rank
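The special cases above can be checked numerically. The sketch below assumes the PRF value of a tuple is the weighted sum ∑_i ω(t,i)·Pr(r(t)=i), and uses the rank distribution of the earlier two-tuple example ([score=100, Pr=0.5] vs. [score=50, Pr=1.0]), worked out by hand. The step weight used for PRF^ω(h) (1 for ranks up to h, 0 beyond) is one illustrative choice that yields Pr(r(t)≤h); the names `prf`, `rank_probs`, and `scores` are assumptions of this sketch.

```python
def prf(tid, rank_probs, weight):
    """Weighted sum over ranks: sum_i weight(t, i) * Pr(r(t) = i)."""
    return sum(weight(tid, i) * p for i, p in rank_probs[tid].items())

# Rank distributions for t1 (score 100, Pr 0.5) and t2 (score 50, Pr 1.0):
# t1 is at rank 1 whenever it exists; t2 is at rank 2 exactly when t1 exists.
rank_probs = {"t1": {1: 0.5}, "t2": {1: 0.5, 2: 0.5}}
scores = {"t1": 100, "t2": 50}

alpha, h = 0.5, 1  # illustrative parameter values

by_prob = {t: prf(t, rank_probs, lambda t, i: 1) for t in rank_probs}
e_score = {t: prf(t, rank_probs, lambda t, i: scores[t]) for t in rank_probs}
prf_e = {t: prf(t, rank_probs, lambda t, i: alpha ** i) for t in rank_probs}
prf_step = {t: prf(t, rank_probs, lambda t, i: 1 if i <= h else 0)
            for t in rank_probs}
```

With ω = 1 the value collapses to the existence probability, with ω = score(t) to the expected score, and with the step weight to the PT-k style probability of being in the top h.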
Computing PRF: Independent Tuples
Computing PRF: Independent Tuples (cont’d)
Computing PRF: Independent Tuples (cont’d)
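The bodies of these slides did not survive as text. As a hedged illustration of how rank distributions can be computed for independent tuples without enumerating worlds, the sketch below uses a generating-function style recurrence: with tuples sorted by descending score, the coefficient of x^c in ∏_{j<i} (1 - p_j + p_j·x) is the probability that exactly c higher-scored tuples exist, so Pr(r(t_i) = c+1) is p_i times that coefficient. It assumes distinct scores; the function name and the example probabilities are hypothetical.

```python
def rank_distributions(tuples):
    """tuples: list of (id, score, prob), mutually independent.

    Returns {id: [Pr(rank = 1), Pr(rank = 2), ...]} using O(n^2) arithmetic.
    """
    ordered = sorted(tuples, key=lambda t: -t[1])
    # prefix[c] = Pr(exactly c of the higher-scored tuples exist)
    prefix = [1.0]
    result = {}
    for tid, _, p in ordered:
        # The tuple is at rank c+1 iff it exists and c better tuples exist.
        result[tid] = [p * coeff for coeff in prefix]
        # Multiply the prefix polynomial by (1 - p + p*x).
        nxt = [0.0] * (len(prefix) + 1)
        for c, coeff in enumerate(prefix):
            nxt[c] += coeff * (1.0 - p)
            nxt[c + 1] += coeff * p
        prefix = nxt
    return result

# Scores from the earlier example table; probabilities are HYPOTHETICAL.
table = [("t1", 200, 0.2), ("t2", 150, 0.8), ("t3", 100, 0.4)]
dists = rank_distributions(table)
```

Any PRF value can then be obtained by taking the weighted sum of a tuple's rank probabilities with the chosen weight function.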
Computing PRF^e(α): Independent Tuples
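With the weight ω(i) = α^i, the weighted sum over ranks is the generating function itself evaluated at x = α, so no polynomial needs to be expanded. The sketch below assumes independent tuples with distinct scores and evaluates p_i·α·∏_{j<i} (1 - p_j + p_j·α) with a single running product after sorting. The function names and the example probabilities are hypothetical, and this is a sketch of the idea rather than the authors' exact algorithm.

```python
def prf_e(tuples, alpha):
    """PRF^e(alpha) values for independent tuples, one pass after sorting.

    tuples: list of (id, score, prob); alpha may be real or complex.
    """
    ordered = sorted(tuples, key=lambda t: -t[1])
    running = 1.0  # product of (1 - p_j + p_j * alpha) over better tuples
    values = {}
    for tid, _, p in ordered:
        values[tid] = p * alpha * running
        running *= (1.0 - p + p * alpha)
    return values

def top_k(values, k):
    """Return the k tuple ids with the largest absolute PRF^e value."""
    return sorted(values, key=lambda t: -abs(values[t]))[:k]

# Scores from the earlier example table; probabilities are HYPOTHETICAL.
table = [("t1", 200, 0.2), ("t2", 150, 0.8), ("t3", 100, 0.4)]
values = prf_e(table, 0.5)
answer = top_k(values, 2)
```

Sorting dominates the cost here, which is consistent with the slide's point that PRF^e is much cheaper to evaluate than a general weight function.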
Summary of Other Results
PRF computation for probabilistic and/xor trees
– Generalizes x-tuples by allowing both mutual exclusivity and coexistence
– PRF computation: O(n^3)
– PRF^e computation: O(n log n + nd) (d = height of the tree)
PRF computation on graphical models
– A polynomial time algorithm when the junction tree has bounded treewidth
– A nontrivial dynamic program combined with the generating function method
Approximating Ranking Functions
Experiments: Approximation Ability
Experiments: Execution Time
Conclusion
Proposed a unifying framework for ranking over probabilistic databases through:
– Parameterized ranking functions
– Incorporation of user feedback
Designed highly efficient algorithms for computing PRF and PRF^e
Developed novel approximation techniques for approximating PRF^ω with PRF^e