Advisor: Prof. 陳良弼
Presenter: 鄧雅文
Introduction
Related Work
Problem Formulation
Future Work
Top-k query on certain data
◦ Ranks results according to a user-defined score
◦ Important for exploring large databases
◦ E.g., top-2 = {T1, T2}

TID  PID  Score
T1   A    100
T2   B    90
T3   C    80
T4   D    70
Uncertain database
◦ How to define top-k on uncertain data?
◦ Mutually exclusive rules
  E.g., T1 ⊕ T4

TID  PID  Score  Pr.
T1   A
T2   B    90     0.9
T3   C    80     0.6
T4   A    70     0.8
…    …    …      …
C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE.
◦ Causes of uncertainty: sensor networks, privacy, trajectory prediction, …
◦ Main research areas on uncertain data:
  ▪ Modeling of uncertain data
  ▪ Uncertain data management: top-k query, range query, NN query, …
  ▪ Uncertain data mining: clustering, classification, frequent patterns, outliers, …
M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE.
◦ Possible worlds
◦ U-Topk query
  Returns the k tuples that co-exist as the top-k answer in possible worlds with the highest total probability
  E.g., {T1, T2} as U-Top2
◦ U-kRanks query
  Returns k tuples, one per rank 1..k, each of which is the most probable tuple at its rank over all possible worlds
  E.g., {T2, T6} as U-2Ranks
U-Topk example: find the U-Top2 query answer
◦ Storage layer: tuples are retrieved in the order t1, t2, t5 (Pr. 0.6)
◦ Buffer: priority queue of search states ordered by probability

State  Tuples     p
s0,0   {}         1
s1,1   {t1}       0.4
s0,1   {}         0.6
s2,2   {t1, t2}   0.28
s1,2   {t2}       0.42
s1,2   {t1}       0.12
s0,2   {}         0.18
s2,3   {t2, t5}
s1,3   {t2}

◦ Complete: return {t1, t2} as the top-2 answer
U-kRanks example: find the U-2Ranks query answer
◦ Tuples are retrieved from the storage layer one at a time; P(t,i) is the probability that tuple t appears at rank i

Tuple  Pr.  P(t,1)  P(t,2)
t1     0.4  0.4
t2     0.7  0.42    0.28
t5     0.6
t6     1

◦ Report: top1: t2 (0.42), top2: t6 (0.324)
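The two query definitions can be checked against the traces with a brute-force enumeration of possible worlds. The following is a minimal sketch, not the incremental state-search algorithm of the cited paper. It assumes the four tuples t1, t2, t5, t6 are independent (no exclusion rules among them), that they are listed in descending score order, and it uses the existence probabilities 0.4, 0.7, 0.6, and 1 that appear in the U-kRanks trace. The function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

# Assumed toy input: (tuple id, existence probability), listed in
# descending score order. Independence between tuples is an assumption.
TUPLES = [("t1", 0.4), ("t2", 0.7), ("t5", 0.6), ("t6", 1.0)]


def possible_worlds(tuples):
    """Yield (world, probability); a world lists present tuples in score order."""
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for (tid, p), present in zip(tuples, mask):
            prob *= p if present else 1.0 - p
            if present:
                world.append(tid)
        if prob > 0.0:
            yield world, prob


def u_topk(tuples, k):
    """Return the most probable top-k vector and its total probability."""
    vector_prob = defaultdict(float)
    for world, prob in possible_worlds(tuples):
        if len(world) >= k:  # worlds with fewer than k tuples have no top-k vector
            vector_prob[tuple(world[:k])] += prob
    best = max(vector_prob, key=vector_prob.get)
    return best, vector_prob[best]


def rank_probabilities(tuples, k):
    """table[i][tid] = probability that tid appears at rank i + 1."""
    table = [defaultdict(float) for _ in range(k)]
    for world, prob in possible_worlds(tuples):
        for i, tid in enumerate(world[:k]):
            table[i][tid] += prob
    return table


def u_kranks(tuples, k):
    """Pick the most probable tuple per rank; ties go to whichever max() sees first."""
    return [max(t.items(), key=lambda kv: kv[1]) for t in rank_probabilities(tuples, k)]


if __name__ == "__main__":
    print(u_topk(TUPLES, 2))
    for rank, probs in enumerate(rank_probabilities(TUPLES, 2), start=1):
        print(rank, dict(probs))
```

Under these assumptions the sketch reproduces the values visible in the traces: {t1, t2} with probability 0.28 for U-Top2, 0.42 for t2 at rank 1, 0.28 for t2 at rank 2, and 0.324 for t6 at rank 2. Enumeration is exponential in the number of tuples, so it is only useful for checking small examples.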
M. Hua, J. Pei, W. Zhang, X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD.
◦ PT-k query
  Returns all tuples whose top-k probability is at least a threshold p
  E.g., {T1, T2, T5} as PT-2 (with p = 0.4)
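As a rough illustration of the PT-k definition, the sketch below computes each tuple's top-k probability by enumerating possible worlds and then applies the threshold. It is not the algorithm from the cited paper. The input is an assumption: the four independent tuples t1, t2, t5, t6 with existence probabilities 0.4, 0.7, 0.6, and 1 taken from the U-kRanks trace, listed in descending score order. The function names and the `eps` tolerance are hypothetical.

```python
from itertools import product

# Assumed toy input: (tuple id, existence probability), listed in
# descending score order; tuples are assumed to be independent.
TUPLES = [("t1", 0.4), ("t2", 0.7), ("t5", 0.6), ("t6", 1.0)]


def topk_probabilities(tuples, k):
    """Probability that each tuple is among the k highest-scoring present tuples."""
    probs = {tid: 0.0 for tid, _ in tuples}
    for mask in product([True, False], repeat=len(tuples)):
        world_prob = 1.0
        world = []
        for (tid, p), present in zip(tuples, mask):
            world_prob *= p if present else 1.0 - p
            if present:
                world.append(tid)
        for tid in world[:k]:
            probs[tid] += world_prob
    return probs


def pt_k(tuples, k, threshold, eps=1e-12):
    """Tuples whose top-k probability is at least `threshold`.

    `eps` absorbs floating-point rounding when a probability sits
    exactly on the threshold (e.g. t1 with 0.4 below).
    """
    probs = topk_probabilities(tuples, k)
    return [tid for tid, _ in tuples if probs[tid] >= threshold - eps]


if __name__ == "__main__":
    print(topk_probabilities(TUPLES, 2))
    print(pt_k(TUPLES, 2, 0.4))
```

With these assumed inputs the top-2 probabilities are 0.4, 0.7, 0.432, and 0.396, so the threshold 0.4 keeps t1, t2, and t5, which agrees with the PT-2 example on the slide.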
C. Jin, K. Yi, L. Chen, J. Yu, X. Lin. Sliding-Window Top-k Queries on Uncertain Streams. In VLDB.
◦ Applicable to the top-k definitions above
◦ Maintains compact sets
  A compact set of a window guarantees that no tuple outside the compact set can be part of the top-k answer of that window
◦ Both time- and space-efficient
T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD.
◦ Addresses the tradeoff between reporting high-scoring tuples and reporting tuples with a high probability of being in the top-k
◦ Returns a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors
Example:
◦ In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Because of a limited budget, the coach can send only 3 players. We therefore want these 3 players to have a relatively high probability of performing well across all 3 types of events.
◦ U-Top3 = {T2, T5, T6}
◦ But U-Top2 = {T1, T2} and U-Top1 = {T1}
◦ How about also considering {T1, T2, T5} as top-3?

TID  Player  Pr.
T1   A
T2   D
T3   B
T4   C
T5   C
T6   B
T7   D
T8   A

Possible World      Pr.    Possible World       Pr.
PW1  T1, T2, T3, T         PW9   T2, T3, T4, T
PW2  T1, T2, T3, T         PW10  T2, T3, T5, T
PW3  T1, T2, T4, T         PW11  T2, T4, T6, T
PW4  T1, T2, T5, T         PW12  T2, T5, T6, T
PW5  T1, T3, T4, T         PW13  T3, T4, T7, T
PW6  T1, T3, T5, T         PW14  T3, T5, T7, T
PW7  T1, T4, T6, T         PW15  T4, T6, T7, T
PW8  T1, T5, T6, T         PW16  T5, T6, T7, T
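The 16 possible worlds in the table are consistent with a mutual-exclusion model in which each player contributes exactly one of their two tuples. The sketch below enumerates the worlds under that assumption; the exactly-one-per-player rule, the dictionary layout, and the function names are assumptions made for illustration, and the tuple probabilities are left out because they are not available here.

```python
from itertools import product

# Tuples grouped by player, taken from the TID/Player table.
# Assumption: the two tuples of one player are mutually exclusive and
# exactly one of them occurs in every possible world.
PLAYER_TUPLES = {
    "A": ["T1", "T8"],
    "D": ["T2", "T7"],
    "B": ["T3", "T6"],
    "C": ["T4", "T5"],
}


def tid_key(tid):
    """Numeric part of a tuple id, e.g. 'T7' -> 7."""
    return int(tid[1:])


def enumerate_worlds(player_tuples):
    """Return all possible worlds, each as a tuple of ids sorted by TID."""
    worlds = []
    for choice in product(*player_tuples.values()):
        worlds.append(tuple(sorted(choice, key=tid_key)))
    # Order the worlds lexicographically by TID for readability.
    return sorted(worlds, key=lambda w: [tid_key(t) for t in w])


if __name__ == "__main__":
    for i, world in enumerate(enumerate_worlds(PLAYER_TUPLES), start=1):
        print(f"PW{i}: {', '.join(world)}")
```

With 4 players and 2 alternatives each, the enumeration yields 2^4 = 16 worlds of 4 tuples, and the first three tuple ids of each world line up with the PW1 to PW16 rows of the table.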
We choose the answers of a top-k query based not only on their probability (P) but also on their confidence (C).
◦ Confidence: expresses the top-(k-1) probabilities of the sets formed by k-1 tuples of a possible top-k answer
  E.g., for k = 3, take {T1, T2, T3} as a possible top-3 answer with probability P
  C is composed in some way of
    Pr({T1, T2} is top-2) and its confidence,
    Pr({T1, T3} is top-2) and its confidence,
    Pr({T2, T3} is top-2) and its confidence
Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated answers as the result set.
◦ E.g.,
  {T1, T3, T5}: P = 0.8, C = 0.4
  {T1, T4, T7}: P = 0.5, C = 0.7
  {T2, T6, T7}: P = 0.3, C = 0.2 (dominated, so it will not be returned)
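The non-dominated filter over (P, C) pairs can be sketched as a small skyline computation. This is a minimal illustration that assumes the usual dominance definition (at least as good in both features and strictly better in one); the function names are hypothetical, and the P and C values are the ones from the example above.

```python
def dominates(a, b):
    """True if (P, C) pair `a` dominates pair `b`.

    Assumed definition: `a` is at least as good in both features
    and strictly better in at least one.
    """
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def non_dominated(candidates):
    """Keep the candidate answers that no other candidate dominates."""
    return {
        answer: pc
        for answer, pc in candidates.items()
        if not any(dominates(other, pc) for other in candidates.values())
    }


if __name__ == "__main__":
    # Candidate top-3 answers with their (P, C) values from the example.
    candidates = {
        ("T1", "T3", "T5"): (0.8, 0.4),
        ("T1", "T4", "T7"): (0.5, 0.7),
        ("T2", "T6", "T7"): (0.3, 0.2),
    }
    for answer, (p, c) in non_dominated(candidates).items():
        print(answer, p, c)
```

Here {T2, T6, T7} is dropped because both other candidates are better in P and in C, while {T1, T3, T5} and {T1, T4, T7} each win on one feature and are both kept. The pairwise comparison is quadratic in the number of candidates, which is fine for a sketch but may need a proper skyline algorithm for large candidate sets.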
◦ Formulate the confidence function
◦ Find an algorithm to generate the result set
◦ Try to calculate the confidence in an efficient way
◦ Carry out an empirical study on datasets