Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. Presented by Qian Wan, HKUST. Based on [1][2].
Introduction: Uncertain Data Management. Modeling uncertain data: the possible worlds model. Uncertain data management: top-k, join, kNN, skyline, indexing, etc. Uncertain data mining: clustering, classification, frequent pattern mining, outlier detection.
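Introduction: Data Representation. A simple way to represent probabilistic data: each tuple has a confidence (membership probability). Pr(instance) = ∏ Pr(presence) × ∏ Pr(absence), i.e. for independent tuples, the product of p_t over the tuples present in the instance times the product of (1 − p_t) over the tuples absent from it. Mutual exclusion constraints for each tuple*. Scoring function*.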
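The product formula on this slide can be sketched in a few lines. This is a minimal illustration that assumes independent tuples (no mutual exclusion rules); the function names and the three confidences are made up for the example and do not come from the slides.

```python
from itertools import product


def world_probability(present, probs):
    """Probability of one possible world, assuming independent tuples.

    present[i] is True if tuple i appears in the world,
    probs[i] is the confidence of tuple i.
    """
    p = 1.0
    for is_present, pr in zip(present, probs):
        p *= pr if is_present else (1.0 - pr)
    return p


def enumerate_worlds(probs):
    """Yield (presence vector, probability) for all 2^n possible worlds."""
    for present in product([True, False], repeat=len(probs)):
        yield present, world_probability(present, probs)


# Hypothetical confidences for three tuples.
probs = [0.5, 0.4, 0.9]
worlds = list(enumerate_worlds(probs))
total = sum(p for _, p in worlds)  # the world probabilities sum to 1
```

The number of worlds doubles with every tuple, which is why later slides avoid enumerating them.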
Introduction: Other Works. U-Topk: returns k tuples that co-exist in a possible world. U-kRanks and PT-k: return tuples according to the marginal distribution of top-k results.
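To make the U-Topk semantics concrete, here is a brute-force sketch that returns the top-k vector with the highest total probability over all possible worlds. It assumes independent tuples with distinct scores; the tuple names, scores, and confidences are hypothetical, and exhaustive enumeration is used only for clarity, not because the original U-Topk work computes it this way.

```python
from collections import defaultdict
from itertools import product


def u_topk(tuples, k):
    """Return (top-k vector, probability) with the highest probability.

    tuples: list of (name, score, prob); tuples are assumed independent.
    """
    vec_prob = defaultdict(float)
    for present in product([True, False], repeat=len(tuples)):
        p = 1.0
        world = []
        for flag, (name, score, pr) in zip(present, tuples):
            p *= pr if flag else 1.0 - pr
            if flag:
                world.append((score, name))
        if len(world) < k:
            continue  # this world has no top-k answer
        top = tuple(name for _, name in sorted(world, reverse=True)[:k])
        vec_prob[top] += p
    return max(vec_prob.items(), key=lambda kv: kv[1])


# Hypothetical data: (name, score, confidence).
example = [("a", 100, 0.3), ("b", 90, 0.9), ("c", 80, 0.8), ("d", 70, 0.5)]
best_vector, best_prob = u_topk(example, 2)
```

In this toy input the most probable vector is ("b", "c"), even though a vector containing "a" would have a higher total score whenever "a" exists.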
Introduction: Other Works (Example)
Introduction: Other Works (Drawbacks). The top-k result may be atypical. The distribution of scores is not used.
Introduction: c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.
Algorithm: Distribution of the top-2 tuples' scores
Algorithm – Naïve approach. INPUT: tuples with membership probabilities. OUTPUT: the top-k score distribution. IDEA: recursively go through all possible worlds to calculate all probabilities, until reaching a threshold.
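A minimal sketch of the naive idea follows. It assumes independent tuples, takes the score of a top-k vector to be the sum of its tuples' scores, and interprets "threshold" as a cutoff below which a branch's probability is treated as negligible; that interpretation, the function name, and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def naive_topk_score_distribution(tuples, k, threshold=0.0):
    """Enumerate worlds recursively; return {total top-k score: probability}.

    tuples: (score, prob) pairs, assumed independent.
    """
    ordered = sorted(tuples, reverse=True)  # descending score
    dist = defaultdict(float)

    def visit(i, taken, total, p):
        if p <= 0.0 or p < threshold:
            return  # prune impossible or negligible branches
        if taken == k:
            dist[total] += p  # lower-scored tuples cannot change the top-k
            return
        if i == len(ordered):
            return  # fewer than k tuples exist in this world
        score, pr = ordered[i]
        visit(i + 1, taken + 1, total + score, p * pr)  # tuple i present
        visit(i + 1, taken, total, p * (1.0 - pr))      # tuple i absent

    visit(0, 0, 0, 1.0)
    return dict(dist)


# Hypothetical data: (score, confidence).
naive_dist = naive_topk_score_distribution(
    [(100, 0.3), (90, 0.9), (80, 0.8), (70, 0.5)], k=2
)
```

The probabilities sum to less than 1 because worlds with fewer than k tuples have no top-k score. The recursion still branches exponentially in the worst case, which motivates the dynamic-programming approach.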
Algorithm – a DP approach. D(i, j): the score distribution of the top-j tuples drawn from Ti, …, Tn, with tuples sorted by descending score. The target of the computation is D(1, k).
Algorithm – a DP approach. Transformation: D(i, j) = TF[D(i+1, j), D(i+1, j−1)]. From D(i+1, j): for each (v, p), add (v, p(1 − p_i)), the case where Ti is absent. From D(i+1, j−1): for each (v, p), add (v + s_i, p · p_i), the case where Ti is present. Merge duplicate items. Bottom-up DP. Approximation.
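The transformation on this slide can be sketched as a bottom-up table fill. The sketch assumes independent tuples sorted by descending score and uses the boundary cases D(i, 0) = {(0, 1)} and D(n+1, j) = {} for j > 0, which are not stated on the slide. It omits the approximation step, and the function name and sample data are illustrative.

```python
def topk_score_distribution(tuples, k):
    """Bottom-up DP for D(1, k); returns {total score: probability}.

    tuples: (score, prob) pairs, assumed independent.
    """
    ordered = sorted(tuples, reverse=True)  # T1 has the highest score
    # prev[j] holds D(i+1, j); start past the last tuple.
    prev = [{0: 1.0}] + [{} for _ in range(k)]
    for s_i, p_i in reversed(ordered):
        cur = [{0: 1.0}]  # D(i, 0): the empty top-0 has score 0
        for j in range(1, k + 1):
            d = {}
            # Ti absent: take D(i+1, j) scaled by (1 - p_i).
            for v, p in prev[j].items():
                d[v] = d.get(v, 0.0) + p * (1.0 - p_i)
            # Ti present: shift D(i+1, j-1) by s_i, scale by p_i.
            for v, p in prev[j - 1].items():
                d[v + s_i] = d.get(v + s_i, 0.0) + p * p_i
            cur.append(d)  # duplicates were merged via the dict key
        prev = cur
    return prev[k]


# Hypothetical data: (score, confidence).
dp_dist = topk_score_distribution(
    [(100, 0.3), (90, 0.9), (80, 0.8), (70, 0.5)], k=2
)
```

Only two rows of the table are kept at a time, so memory is proportional to k times the number of distinct scores. In this toy input, two different top-2 vectors share the total score 170 and are merged into a single entry.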
Handling More Realistic Scenarios. Handling mutual exclusion (ME) rules: compress the ME group; refine by lead tuple region. Handling ties: when two tuples have the same score, rank them according to probability.
Algorithm: 3-Typical-Top 2 scores
c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.
Computing c-Typical-Top k. Define F^a(j) as the optimal objective value over {s_j, …, s_n}, where a is the number of typical scores to choose. G^a(j) is defined over the same input.
Computing c-Typical-Top k. Solve the two-function optimization problem using DP. Boundary conditions.
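One way such a two-function DP could look is sketched below. It assumes the objective is the expected distance from a random top-k score to its nearest typical score, and that G differs from F by additionally requiring s_j itself to be a typical score. Both points are assumptions, since the slides do not spell out the objective or G, so this is not necessarily the recurrence of [2]. All names and the sample distribution are hypothetical.

```python
from functools import lru_cache


def expected_distance(dist, typical):
    """Sum over scores of probability times distance to the nearest typical score."""
    return sum(p * min(abs(s - t) for t in typical) for s, p in dist.items())


def typical_scores(dist, c):
    """dist: {score: probability}. Returns (objective value, chosen scores)."""
    scores = sorted(dist)
    probs = [dist[s] for s in scores]
    n = len(scores)
    c = min(c, n)
    inf = float("inf")

    def between(j, m):
        # Scores strictly between chosen s_j and s_m go to the nearer one.
        return sum(probs[t] * min(scores[t] - scores[j], scores[m] - scores[t])
                   for t in range(j + 1, m))

    def tail(j):
        # Scores above s_j when s_j is the largest chosen score.
        return sum(probs[t] * (scores[t] - scores[j]) for t in range(j + 1, n))

    def head(j, m):
        # Scores s_j..s_(m-1) when s_m is the smallest chosen score.
        return sum(probs[t] * (scores[m] - scores[t]) for t in range(j, m))

    @lru_cache(maxsize=None)
    def G(a, j):
        # Assumed meaning: a typical scores among s_j.., with s_j chosen.
        if a == 1:
            return tail(j), (scores[j],)
        best = (inf, ())
        for m in range(j + 1, n):
            sub_cost, sub_pick = G(a - 1, m)
            cand = between(j, m) + sub_cost
            if cand < best[0]:
                best = (cand, (scores[j],) + sub_pick)
        return best

    def F(a, j):
        # a typical scores among s_j..; the smallest chosen one is s_m.
        best = (inf, ())
        for m in range(j, n):
            sub_cost, sub_pick = G(a, m)
            cand = head(j, m) + sub_cost
            if cand < best[0]:
                best = (cand, sub_pick)
        return best

    return F(c, 0)


# Hypothetical top-2 score distribution.
toy = {150: 0.028, 160: 0.063, 170: 0.507, 180: 0.024, 190: 0.27}
cost, picks = typical_scores(toy, 2)
```

The helper expected_distance can be used to check any candidate set of typical scores against the value returned by the DP.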
Empirical Study: 3-Typical vs. U-Topk
Empirical Study
Q&A
References. [1] Charu C. Aggarwal, Philip S. Yu. "A Survey of Uncertain Data Algorithms and Applications." IEEE Transactions on Knowledge and Data Engineering, 2009. [2] Tingjian Ge, Stan Zdonik, Samuel Madden. "Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers." SIGMOD, 2009.