Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. Presented by Qian Wan, HKUST. Based on [1][2].

Introduction: Uncertain Data Management. Modeling uncertain data: the possible worlds model. Uncertain data management: top-k, join, kNN, skyline, indexing, etc. Uncertain data mining: clustering, classification, frequent patterns, outlier detection.

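Introduction: Data Representation. A simple way to represent probabilistic data: each tuple has a confidence. Pr(instance) = ∏ Pr(presence) × ∏ Pr(absence), where the first product runs over the tuples present in the instance and the second over the tuples absent from it. Mutual exclusion constraints for each tuple*. Scoring function*.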
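
The instance probability above can be sketched in a few lines. This is a minimal illustration that assumes independent tuples (no mutual exclusion constraints); the function name `world_probability` and the tuple ids are hypothetical and do not come from the slides.

```python
def world_probability(confidences, present):
    """Probability of one possible world under tuple independence.

    confidences: dict mapping a tuple id to its confidence.
    present: set of tuple ids that appear in this world.
    Assumption: no mutual exclusion rules between tuples.
    """
    prob = 1.0
    for tuple_id, confidence in confidences.items():
        # Present tuples contribute p, absent tuples contribute (1 - p).
        prob *= confidence if tuple_id in present else (1.0 - confidence)
    return prob


# Hypothetical two-tuple relation: only t1 is present in this world.
example = world_probability({"t1": 0.5, "t2": 0.25}, {"t1"})
```

Here `example` is 0.5 × (1 − 0.25) = 0.375.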

Introduction: Other Works. U-Topk: the k tuples that co-exist in a possible world. U-kRanks and PT-k: return tuples according to the marginal distribution of top-k results.

Introduction: Other Works (Example)

Introduction: Other Works (drawback). The top-k result may be atypical. The distribution of scores is not used.

Introduction: c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.
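
To make "expected distance" concrete, here is a brute-force sketch. It assumes the objective is the probability-weighted absolute distance from a top-k score to its nearest typical score, and that typical scores are picked from the scores in the distribution. Both points are assumptions here, and the names and the toy distribution are hypothetical, not the example from the slides.

```python
from itertools import combinations


def expected_distance(dist, typical):
    """Probability-weighted distance from each score to its nearest typical score.

    dist: dict mapping a top-k total score to its probability.
    typical: iterable of candidate typical scores.
    """
    return sum(p * min(abs(s - t) for t in typical) for s, p in dist.items())


def brute_force_typical(dist, c):
    """Try every size-c subset of the scores and keep the cheapest one."""
    return min(combinations(sorted(dist), c),
               key=lambda chosen: expected_distance(dist, chosen))


# Toy distribution (hypothetical numbers).
toy = {0: 0.2, 1: 0.3, 10: 0.5}
best = brute_force_typical(toy, 2)
```

For the toy distribution, `best` is `(1, 10)` with an expected distance of 0.2, because only the score 0 (probability 0.2) is at distance 1 from a typical score.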

Algorithm: Distribution of top-2 tuples’ scores

Algorithm – Naïve approach. INPUT: tuples with membership probabilities. OUTPUT: the top-k score distribution. IDEA: recursively go through all possible worlds to calculate all probabilities, until reaching a threshold.
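
A minimal sketch of the naïve idea follows. It assumes independent tuples given as hypothetical `(score, probability)` pairs, enumerates every possible world exhaustively, and leaves out the threshold-based early stop mentioned above. The cost is exponential in the number of tuples, so it is only useful as a reference for small inputs.

```python
from itertools import product


def naive_topk_score_distribution(tuples, k):
    """Enumerate all possible worlds and accumulate the top-k total score.

    tuples: list of (score, membership_probability) pairs, assumed independent.
    Returns a dict {total_score: probability}; worlds with fewer than k
    tuples contribute nothing.
    """
    dist = {}
    for world in product([False, True], repeat=len(tuples)):
        prob = 1.0
        present_scores = []
        for (score, p), is_present in zip(tuples, world):
            prob *= p if is_present else (1.0 - p)
            if is_present:
                present_scores.append(score)
        if prob == 0.0 or len(present_scores) < k:
            continue
        total = sum(sorted(present_scores, reverse=True)[:k])
        dist[total] = dist.get(total, 0.0) + prob
    return dist


# Hypothetical input: three tuples, the last one certain.
naive_result = naive_topk_score_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], 2)
```

For this input the top-2 score is 18, 15 or 13, each with probability 0.25; the remaining 0.25 belongs to the world that contains only one tuple.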

Algorithm – a DP approach. D(i,j): the score distribution of the top-j tuples starting at Ti. The main problem is to compute D(1,k).

Algorithm – a DP approach. Transformation: D(i,j) = TF[D(i+1,j), D(i+1,j-1)]. From D(i+1,j): for each (v,p), add (v, p(1-pi)), the case where Ti is absent. From D(i+1,j-1): for each (v,p), add (v+si, p*pi), the case where Ti is present. Merge duplicate items. Bottom-up DP. Approximation.
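
The transformation can be sketched as a bottom-up DP. The sketch assumes independent tuples given as hypothetical `(score, probability)` pairs, sorts them by descending score, and uses the boundary conditions D(i,0) = {(0, 1)} and D(n+1, j) = {} for j > 0, which are assumptions of this sketch. It covers neither the approximation step nor mutual exclusion rules.

```python
def topk_score_distribution(tuples, k):
    """Bottom-up DP for D(1, k), the distribution of the top-k total score.

    tuples: list of (score, membership_probability) pairs, assumed independent.
    Returns a dict {total_score: probability}.
    """
    ordered = sorted(tuples, key=lambda t: -t[0])
    # prev[j] holds D(i+1, j); start past the last tuple.
    prev = [{0: 1.0}] + [{} for _ in range(k)]
    for score, p in reversed(ordered):
        cur = [{0: 1.0}]
        for j in range(1, k + 1):
            merged = {}
            # Ti absent: keep v, scale the probability by (1 - p_i).
            for v, q in prev[j].items():
                if q * (1.0 - p) > 0.0:
                    merged[v] = merged.get(v, 0.0) + q * (1.0 - p)
            # Ti present: add s_i to v, scale the probability by p_i.
            for v, q in prev[j - 1].items():
                if q * p > 0.0:
                    merged[v + score] = merged.get(v + score, 0.0) + q * p
            cur.append(merged)  # duplicates were merged through the dict keys
        prev = cur
    return prev[k]


# Same hypothetical input as a brute-force check would use.
dp_result = topk_score_distribution([(10, 0.5), (8, 0.5), (5, 1.0)], 2)
```

For this input `dp_result` is `{13: 0.25, 15: 0.25, 18: 0.25}`. Only two rows of the table are kept at a time, since D(i, ·) depends only on D(i+1, ·).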

Handling More Realistic Scenarios. Handling mutually exclusive (ME) rules: compress the ME group; refine by lead tuple region. Handling ties: when two tuples have the same score, rank them according to probability.

Algorithm: 3-Typical-Top 2 scores

c-Typical-Top k. The 3-Typical-Top 2 scores of this example are {118, 183, 235}. The expected distance is 6.6. The vectors are {(T2, T6), (T7, T6), (T7, T3)}.

Computing c-Typical-Top k. Define F^a(j) to be the optimal objective over {sj, …, sn}, where a is the number of typical scores. G^a(j) is defined analogously.

Computing c-Typical-Top k. Solve the optimization problem over the two functions using DP, together with the boundary conditions.
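
One way the two-function DP could look is sketched below. Several points are assumptions rather than details taken from the slides: the objective is the probability-weighted absolute distance to the nearest typical score, typical scores are chosen among the scores in the distribution, and G^a(j) is read as F^a(j) with the extra constraint that sj itself is chosen. The boundary condition used is that G^1(j) assigns every score from sj onward to sj. All names are hypothetical.

```python
def typical_scores_dp(dist, c):
    """Pick c typical scores minimising the expected distance (sketch).

    dist: dict {score: probability}. Returns (objective, chosen_scores).
    F[a][j]: best objective over scores j.. with a typical scores.
    G[a][j]: same, but score j must be one of the typical scores (assumed).
    """
    pts = sorted(dist.items())
    n = len(pts)
    c = min(c, n)
    inf = float("inf")

    def cost(lo, hi, center):
        # Weighted distance of scores lo..hi-1 to the score at index center.
        target = pts[center][0]
        return sum(p * abs(s - target) for s, p in pts[lo:hi])

    F = [[inf] * (n + 1) for _ in range(c + 1)]
    G = [[inf] * (n + 1) for _ in range(c + 1)]
    F_arg = [[n] * (n + 1) for _ in range(c + 1)]
    G_arg = [[n] * (n + 1) for _ in range(c + 1)]

    for a in range(1, c + 1):
        for j in range(n - 1, -1, -1):
            if a == 1:
                G[a][j] = cost(j, n, j)  # boundary: everything goes to s_j
            else:
                for l in range(j + 1, n):  # l: first score not served by s_j
                    val = cost(j, l, j) + F[a - 1][l]
                    if val < G[a][j]:
                        G[a][j], G_arg[a][j] = val, l
        for j in range(n - 1, -1, -1):
            for m in range(j, n):  # m: first chosen typical score
                val = cost(j, m, m) + G[a][m]
                if val < F[a][j]:
                    F[a][j], F_arg[a][j] = val, m

    chosen, j = [], 0
    for a in range(c, 0, -1):  # walk the stored argmins to recover the choice
        m = F_arg[a][j]
        chosen.append(pts[m][0])
        j = G_arg[a][m]
    return F[c][0], chosen


objective, chosen = typical_scores_dp({0: 0.2, 1: 0.3, 10: 0.5}, 2)
```

On this toy distribution the sketch returns an objective of 0.2 with typical scores `[1, 10]`. The `cost` helper recomputes sums each time for readability; prefix sums would make each call constant time.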

Empirical Study: 3-Typical vs. U-Topk

Empirical Study

References
[1] Charu C. Aggarwal, Philip S. Yu. A Survey of Uncertain Data Algorithms and Applications. IEEE Transactions on Knowledge and Data Engineering, 2009.
[2] Tingjian Ge, Stan Zdonik, Samuel Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. SIGMOD, 2009.
