Published by Edwin Franklin. Modified over 9 years ago.
1
Probabilistic Threshold Range Aggregate Query Processing over Uncertain Data
Wenjie Zhang, University of New South Wales & NICTA, Australia
Joint work with Shuxiang Yang, Ying Zhang, Xuemin Lin (UNSW & NICTA)
2
Outline (DB@UNSW)
Background and Preliminaries
Probabilistic Threshold Range Aggregate Query
Exact query processing
Approximate query processing: Single Sampling & Double Sampling
Experiments
Conclusion
3
Applications
Many applications involve data that is imperfect due to data randomness and incompleteness, limitations of equipment, and delay or loss in data transfer, among other causes.
Example applications: sensor networks, environmental surveillance, moving objects, data cleaning and integration, and others.
4
Applications: Sensor Networks
Sensor readings are often imprecise due to equipment limitations and the periodic reporting mechanism. (Figures are borrowed from Jian et al., SIGMOD08.)
5
Applications: Mobile Equipment / Moving Objects
A mobile object reports its location only periodically, so its exact location is often uncertain.
6
Applications: Satellite data
7
Applications: Data Quality
Social data collection: errors and estimation are inherent in customer surveys and sampling.
8
Outline
Background and Preliminaries
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Query
Conclusion
9
Modeling Uncertainty (cont.)
Uncertain Objects Model 1. Continuous case: an uncertain object U is described using a probability density function (PDF) $f_U$ such that $\int_{x \in U} f_U(x)\,dx = 1$. E.g., uniform distribution, normal distribution.
10
Modeling Uncertainty (cont.)
Uncertain Objects Model 2. Discrete case: an uncertain object is described using a set of instances; each instance u has an occurrence probability $p_u$.
11
Possible World Semantics
Given a set of uncertain objects $\{U_1, U_2, \ldots, U_n\}$, a possible world $W = \{u_1, u_2, \ldots, u_n\}$ is a set of n instances, one instance per uncertain object.
The probability of a possible world is $P(W) = \prod_{i=1}^{n} p_{u_i}$, where $p_{u_i}$ is the occurrence probability of instance $u_i$.
Let $\Omega$ be the set of all possible worlds; clearly, $\sum_{W \in \Omega} P(W) = 1$.
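The possible-world definition above can be checked on a toy example. The following is a minimal Python sketch, not taken from the slides: the two objects, their instance names, and the helper `possible_worlds` are hypothetical, and it assumes the discrete model with occurrence probabilities summing to 1 per object.

```python
from itertools import product
from math import prod, isclose

# Hypothetical toy data: each uncertain object is a list of
# (instance, occurrence probability) pairs; probabilities sum to 1 per object.
objects = [
    [("u1a", 0.5), ("u1b", 0.5)],
    [("u2a", 0.2), ("u2b", 0.3), ("u2c", 0.5)],
]

def possible_worlds(objs):
    """Yield (world, probability), choosing exactly one instance per object."""
    for combo in product(*objs):
        world = tuple(inst for inst, _ in combo)
        # P(W) is the product of the chosen instances' occurrence probabilities.
        yield world, prod(p for _, p in combo)

worlds = list(possible_worlds(objects))
total = sum(p for _, p in worlds)
print(len(worlds), "worlds; probabilities sum to 1:", isclose(total, 1.0))
```

The number of worlds is the product of the instance counts, which is why enumerating them is only feasible for tiny inputs.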
12
Probabilistic Queries
Query Evaluation [CKP03, CXPSV04, DS04, DS05, DS07, SD07]
Aggregate Queries [BDJR05, MJ07, CG07]
Join Queries [CSP06, AW07]
Top-k Queries [SIC07, YLSK08, RDS07, HJZL08]
Nearest Neighbor Queries [KKR07, CCMC08]
Skyline Queries [PJLY07]
…
13
Range query
Uncertain objects, exact query.
A probability threshold is often assigned.
14
Related Work
Range Queries [TCXNKP05, BPS06, AY08]: given a rectangle r and a probabilistic threshold t, find all objects that appear in r with probability at least t.
Appearance probability: the probability that an uncertain object lies inside r.
15
U-tree
Probabilistically Constrained Region (PCR) [TCXNKP05]
Figure panels: PCR(0.2); multiple PCRs.
16
Outline
Introduction
Modeling Uncertainty & Related Work
Probabilistic Threshold Range Aggregate Query (PTRA)
Conclusion
17
Contribution
Formal definition of the PTRA query.
aU-Tree structure for exact PTRA queries.
singleSample and doubleSample techniques for approximate answers.
18
Problem Statement
Given a set of uncertain objects and a query q, return the number of uncertain objects whose appearance probability in q is no less than the threshold $p_q$.
19
Problem Definition
Example: assume the threshold is 0.5. If the appearance probability computed for b is greater than 0.5 and that for c is less than 0.5, then the aggregate returned is 2 (a and b).
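For the discrete model, the example above can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the coordinates, probabilities, and the names `appearance_probability` and `ptra_count` are invented so that a lies fully inside the query, b exceeds the threshold, and c falls below it.

```python
def appearance_probability(instances, rect):
    """Sum the occurrence probabilities of the instances inside rect.

    instances: list of ((x, y), p); rect: (xmin, ymin, xmax, ymax).
    """
    xmin, ymin, xmax, ymax = rect
    return sum(p for (x, y), p in instances
               if xmin <= x <= xmax and ymin <= y <= ymax)

def ptra_count(objects, rect, threshold):
    """Count objects whose appearance probability is at least threshold."""
    return sum(1 for instances in objects.values()
               if appearance_probability(instances, rect) >= threshold)

# Hypothetical data mirroring the slide's example.
objects = {
    "a": [((1, 1), 0.5), ((2, 2), 0.5)],  # both instances inside the query
    "b": [((3, 3), 0.6), ((9, 9), 0.4)],  # 0.6 of the mass inside
    "c": [((4, 4), 0.3), ((8, 8), 0.7)],  # 0.3 of the mass inside
}
query = (0, 0, 5, 5)
count = ptra_count(objects, query, threshold=0.5)
print(count)  # a and b qualify, c does not
```

Every object's instances are scanned here, which is the cost the index and sampling techniques on the later slides try to avoid.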
20
Exact Query Processing (aU-Tree)
Main idea: add aggregate information to the U-tree.
Advantage: processing can stop at an intermediate level if a node is pruned or fully covered by the query.
Disadvantage: otherwise, the search still needs to drill down to the leaf nodes. For a large portion of uncertain objects, the appearance probability needs to be computed, which is expensive when each object has a massive number of instances.
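The stop-early behaviour described above can be sketched with a plain aggregate tree. This Python sketch is a simplification and an assumption: it uses only bounding rectangles and per-node counts, whereas the actual U-tree also stores PCRs for tighter pruning. The `Node` class, helper names, and data are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    mbr: tuple          # (xmin, ymin, xmax, ymax) bounding every object below
    count: int          # aggregate: number of uncertain objects below
    children: list = field(default_factory=list)  # set on internal nodes
    objects: list = field(default_factory=list)   # set on leaves: instance lists

def covers(q, r):
    return q[0] <= r[0] and q[1] <= r[1] and r[2] <= q[2] and r[3] <= q[3]

def disjoint(q, r):
    return r[2] < q[0] or q[2] < r[0] or r[3] < q[1] or q[3] < r[1]

def app_prob(instances, q):
    return sum(p for (x, y), p in instances
               if q[0] <= x <= q[2] and q[1] <= y <= q[3])

def count_query(node, q, threshold, stats):
    """Count qualifying objects; assumes 0 < threshold <= 1."""
    if disjoint(q, node.mbr):
        return 0                 # pruned: every object below has probability 0
    if covers(q, node.mbr):
        return node.count        # fully covered: every object has probability 1
    if node.children:
        return sum(count_query(c, q, threshold, stats) for c in node.children)
    # Partially overlapping leaf: probabilities must be computed per object.
    stats["computed"] += len(node.objects)
    return sum(1 for inst in node.objects if app_prob(inst, q) >= threshold)

leaf1 = Node((0, 0, 2, 2), 2, objects=[[((1, 1), 1.0)],
                                        [((2, 2), 0.5), ((0, 0), 0.5)]])
leaf2 = Node((3, 3, 9, 9), 2, objects=[[((3, 3), 0.6), ((9, 9), 0.4)],
                                        [((4, 4), 0.3), ((8, 8), 0.7)]])
leaf3 = Node((10, 10, 12, 12), 1, objects=[[((11, 11), 1.0)]])
root = Node((0, 0, 12, 12), 5, children=[leaf1, leaf2, leaf3])

stats = {"computed": 0}
result = count_query(root, (0, 0, 5, 5), 0.5, stats)
print(result, stats)
```

In this toy run only the two objects in the partially overlapping leaf need their appearance probability computed; the covered leaf contributes its stored count and the disjoint leaf is skipped.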
21
Exact Query Processing (aU-Tree)
22
singleSample
Sample the instances of each uncertain object. If m' out of m sampled instances are inside the query region, then the approximate appearance probability is m'/m.
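The m'/m estimate can be sketched in Python as below. The slide does not say how instances are drawn; this sketch assumes sampling with replacement in proportion to the occurrence probabilities, and the function name and data are hypothetical.

```python
import random

def sample_appearance_probability(instances, rect, m, rng):
    """Estimate the appearance probability as m'/m from m sampled instances.

    instances: list of ((x, y), p); rect: (xmin, ymin, xmax, ymax).
    Assumption: instances are drawn with replacement, weighted by p.
    """
    points = [pt for pt, _ in instances]
    weights = [p for _, p in instances]
    sampled = rng.choices(points, weights=weights, k=m)
    xmin, ymin, xmax, ymax = rect
    hits = sum(1 for x, y in sampled
               if xmin <= x <= xmax and ymin <= y <= ymax)  # this is m'
    return hits / m

# Hypothetical object: 1000 equally likely instances on a line,
# 300 of which (x = 0..299) fall inside the query rectangle.
instances = [((i, 0), 1 / 1000) for i in range(1000)]
query = (0, -1, 299, 1)
rng = random.Random(7)
estimate = sample_appearance_probability(instances, query, m=2000, rng=rng)
print(estimate)  # close to the true value 0.3
```

The cost per object drops from the number of instances to m, at the price of an estimate that can misclassify objects whose true probability is near the threshold.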
23
singleSample (cont.)
An immediate application of the Chernoff-Hoeffding bound.
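The formula on this slide did not survive in the transcript. As a reference point only, the standard two-sided Hoeffding inequality for the m'/m estimate is written out below; it assumes the m instances are sampled independently, and the symbols $\hat{p}$, $p$, $\varepsilon$, and $\delta$ are notation introduced here, so the authors' statement may differ in form or constants.

```latex
% X_i = 1 if the i-th sampled instance lies in the query region, else 0,
% so E[X_i] = p (the true appearance probability) and \hat{p} = m'/m.
\[
  \Pr\bigl(\lvert \hat{p} - p \rvert \ge \varepsilon\bigr)
  \;\le\; 2\exp\!\bigl(-2 m \varepsilon^{2}\bigr),
  \qquad \hat{p} = \frac{m'}{m} = \frac{1}{m}\sum_{i=1}^{m} X_i .
\]
% Requiring the right-hand side to be at most \delta gives a sample size:
\[
  m \;\ge\; \frac{\ln(2/\delta)}{2\varepsilon^{2}} .
\]
% Example: \varepsilon = 0.05, \delta = 0.05 gives
% m \ge \ln(40)/0.005 \approx 737.8, i.e. m = 738 samples per object.
```

The required m depends only on the error and confidence targets, not on how many instances the object has.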
24
doubleSample
Single sampling is expensive when there is a massive number of objects, so sample the uncertain objects as well.
Naive approach: uniformly sample objects from all uncertain objects.
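The naive variant can be sketched in Python as follows. The slides do not give the estimator; scaling the sampled count by N/n is an assumption made here, as are the function name, the with-replacement instance sampling, and the toy data.

```python
import random

def naive_double_sample(objects, rect, threshold, n, m, rng):
    """Estimate the PTRA count from n sampled objects, m instances each.

    objects: list of instance lists [((x, y), p), ...].
    Assumption: the sampled count is scaled up by len(objects) / n.
    """
    xmin, ymin, xmax, ymax = rect
    qualifying = 0
    for instances in rng.sample(objects, n):   # uniform, without replacement
        points = [pt for pt, _ in instances]
        weights = [p for _, p in instances]
        drawn = rng.choices(points, weights=weights, k=m)
        hits = sum(1 for x, y in drawn
                   if xmin <= x <= xmax and ymin <= y <= ymax)
        if hits / m >= threshold:
            qualifying += 1
    return qualifying * len(objects) / n

# Hypothetical data: 1000 objects with one certain instance each;
# the 400 objects with x = 0..399 lie inside the query rectangle.
objects = [[((i, 0), 1.0)] for i in range(1000)]
query = (0, -1, 399, 1)
estimate = naive_double_sample(objects, query, 0.5, n=200, m=10,
                               rng=random.Random(3))
print(estimate)  # an estimate of the true count 400
```

When the qualifying objects are concentrated in a few regions, a uniform object sample can miss or over-represent them, which is the skew problem the following slides address.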
25
doubleSample: Accuracy
Note: saying that the "appearance probability" of each object follows a uniform distribution means that the spatial locations of the objects are uniformly distributed.
The accuracy guarantee uses the Chernoff-Hoeffding bound.
26
doubleSample: Our Approach
Problem: skew!
Aim: select K disjoint groups covering all objects with the minimum "skew", i.e., the objects in each group follow a "uniform" distribution. (Then do uniform sampling of objects in each group.) This optimization problem is NP-hard.
Observation: Min-skew is a good heuristic for constructing such groups. The aU-tree groups objects with a principle similar to min-skew.
27
doubleSample: Our Approach
Step 1: choose K subtrees that cover all objects with the minimum total skew. This is NP-hard, so instead find a level L such that the number of nodes at level L is smaller than K but the number of nodes at level L-1 is larger than K, and feed the min-skew algorithm with the subtrees at level L. (Note: if at some level L the number of nodes equals K, then these K subtrees are chosen.)
Step 2: sample objects in each subtree.
Step 3: sample instances in each sampled object.
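The level-selection rule in Step 1 can be sketched in Python as below. The level numbering (0 for leaves, increasing towards the root) and the function name `choose_level` are assumptions made for illustration; the min-skew refinement itself is not shown.

```python
def choose_level(nodes_per_level, k):
    """Pick the tree level whose subtrees seed the K groups.

    nodes_per_level[i] is the node count at level i; level 0 is assumed to be
    the leaf level, so counts shrink towards the root (the last entry).
    Returns (level, exact): exact is True when that level has exactly k nodes
    and its subtrees can be used directly. Returns None if no level fits.
    """
    for level, n in enumerate(nodes_per_level):
        if n == k:
            return level, True
    for level in range(1, len(nodes_per_level)):
        # Fewer than k nodes at level L, more than k at level L-1.
        if nodes_per_level[level] < k < nodes_per_level[level - 1]:
            return level, False
    return None

# Hypothetical tree: 1000 leaves, then 120, 15, and 1 (root) nodes.
counts = [1000, 120, 15, 1]
print(choose_level(counts, 50))   # level 2: 15 < 50 < 120
print(choose_level(counts, 15))   # level 2 has exactly 15 nodes
```

When the result is not exact, the chosen level has fewer than K subtrees, and the min-skew algorithm is what refines them into K groups.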
28
Experiments
Algorithms: exact, singleSample, doubleSample.
Data sets: LB (53k objects in Long Beach County), CA (62k objects in California), and a synthetic aircraft data set in 3D. Each point has 10k instances that follow a Uniform or constrained-Gaussian distribution.
Setting: C++, P4 2.8GHz, 2G memory, Debian Linux, page size 8K.
29
Efficiency
30
Accuracy
31
Accuracy (cont.)
32
Conclusion
Definition of PTRA.
aU-Tree technique.
Sampling techniques.
Future work: is there any approach with a theoretical guarantee?
33
Thanks
34
Min-Skew technique