1
Range Queries on Uncertain Data
Jian Li, Tsinghua University; Haitao Wang, Utah State University. ISAAC 2014.
2
One dimensional range queries
Input: a set of points on a line.
Query: given a query interval I, return the points that lie in I.
A trivial solution: a balanced binary search tree.
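A minimal sketch of the deterministic baseline, assuming a static point set. A sorted list with binary search stands in here for the balanced binary search tree named on the slide; the function name `range_report` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def range_report(sorted_points, lo, hi):
    """Return every point p with lo <= p <= hi; sorted_points must be ascending."""
    first = bisect_left(sorted_points, lo)   # first index whose point is >= lo
    last = bisect_right(sorted_points, hi)   # first index whose point is > hi
    return sorted_points[first:last]

points = sorted([7.0, 1.0, 4.0, 9.0, 3.0])
print(range_report(points, 3.0, 7.0))  # [3.0, 4.0, 7.0]
```

Each query costs O(log n) for the two searches plus the size of the output.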
3
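An uncertain point p
p can appear at several locations, each with a given probability.
Given a query interval I, Pr[p ∈ I] is the probability that p lies in I, called the I-probability of p.
[Figure: four possible locations of p with probabilities 0.1, 0.3, 0.2, 0.4; for the interval I shown, Pr[p ∈ I] = 0.5.]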
4
An uncertain point p: A general case
The location of p is specified by its PDF (probability density function) f_p(x), which is a step function, i.e., a histogram.
Given a query interval I, Pr[p ∈ I] is the I-probability of p.
[Figure: a histogram PDF f_p(x) with the values 0.25, 0.22, 0.2, 0.15 marked, and a query interval I on the x-axis.]
5
The cumulative distribution function (CDF)
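The CDF C_p(x) of p is a piecewise linear function (because the PDF is a step function) that rises from 0 to 1.
[Figure: the PDF f_p(x) and the CDF C_p(x), with the value C_p(x') marked at a point x'.]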
6
Computing the š¼-probability using CDF
C_p(x) is a piecewise linear function.
For a query interval I = [x_l, x_r]: Pr[p ∈ I] = C_p(x_r) - C_p(x_l).
[Figure: the CDF C_p(x) with C_p(x_l) and C_p(x_r) marked at x_l and x_r.]
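The CDF-based computation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the histogram is given as breakpoints and bar heights, and the names `make_cdf` and `interval_probability` as well as the example numbers are hypothetical.

```python
from bisect import bisect_right

def make_cdf(breaks, heights):
    """Build the piecewise linear CDF of a histogram PDF.

    breaks:  x_0 < x_1 < ... < x_m
    heights: heights[i] is the PDF value on [x_i, x_{i+1}]
    """
    cum = [0.0]  # cum[i] is the CDF value at breaks[i]
    for i, h in enumerate(heights):
        cum.append(cum[-1] + h * (breaks[i + 1] - breaks[i]))

    def cdf(x):
        if x <= breaks[0]:
            return 0.0
        if x >= breaks[-1]:
            return cum[-1]
        i = bisect_right(breaks, x) - 1          # bar that contains x
        return cum[i] + heights[i] * (x - breaks[i])

    return cdf

def interval_probability(cdf, x_l, x_r):
    """I-probability for I = [x_l, x_r] as a difference of two CDF values."""
    return cdf(x_r) - cdf(x_l)

# Hypothetical histogram whose bar areas are 0.25, 0.5 and 0.25.
cdf = make_cdf([0.0, 1.0, 2.0, 4.0], [0.25, 0.5, 0.125])
print(interval_probability(cdf, 0.5, 3.0))            # 0.75
print(interval_probability(cdf, float("-inf"), 3.0))  # 0.875
```

The second call uses an unbounded interval, for which the difference reduces to a single CDF value.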
7
Range query problems on uncertain points
Input: a set P of n uncertain points.
For any query interval I:
top-1 query: return the point in P with the largest I-probability
top-k query: return the k points in P with the largest I-probabilities
threshold query: given any threshold t, return the points in P with I-probabilities ≥ t
Goal: build data structures on P to answer these queries quickly.
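To make the three query types concrete, here is a brute-force baseline that scans all n points per query. It is only a reference for what the queries return, not one of the data structures from the talk. Uniform distributions are used for brevity, and all names (`uniform_cdf`, `top_k`, `threshold`) are hypothetical.

```python
import heapq

def uniform_cdf(a, b):
    """CDF of a point distributed uniformly on [a, b], with a < b."""
    return lambda x: min(1.0, max(0.0, (x - a) / (b - a)))

def i_probabilities(cdfs, x_l, x_r):
    """I-probability of every point for I = [x_l, x_r]."""
    return [cdf(x_r) - cdf(x_l) for cdf in cdfs]

def top_k(cdfs, x_l, x_r, k):
    """Indices of the k points with the largest I-probabilities (k = 1 gives top-1)."""
    probs = i_probabilities(cdfs, x_l, x_r)
    return heapq.nlargest(k, range(len(probs)), key=probs.__getitem__)

def threshold(cdfs, x_l, x_r, t):
    """Indices of the points whose I-probability is at least t."""
    probs = i_probabilities(cdfs, x_l, x_r)
    return [i for i, p in enumerate(probs) if p >= t]

P = [uniform_cdf(0, 4), uniform_cdf(1, 3), uniform_cdf(2, 10)]
print(top_k(P, 1, 3, 1))        # [1]
print(top_k(P, 1, 3, 2))        # [1, 0]
print(threshold(P, 1, 3, 0.5))  # [0, 1]
```

For I = [1, 3] the three I-probabilities are 0.5, 1.0 and 0.125, which explains the outputs.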
8
An application on deterministic data
9
An application on deterministic data (cont.)
A query interval I = [7, +∞):
top-1 query: find the movie with the largest total percentage of ratings ≥ 7
top-k query: find the k movies with the largest total percentages of ratings ≥ 7
threshold query: e.g., for t = 0.8, find the movies whose total percentage of ratings ≥ 7 is at least 80%
10
Previous work: only on threshold queries
A heuristic solution using R-trees (Cheng et al., VLDB '04): fast in practice, but O(n) time in the worst case.
Theoretical results (Agarwal et al., PODS '09):
preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
query: O(m + log^3 n) time, where m is the output size
A special case, where t is fixed for all queries:
preprocessing: O(n) space and O(n log n) time
query: O(m + log n) time, where m is the output size
Heuristic solutions in 2-D or higher dimensions (Tao et al., 2005): O(n) time in the worst case.
12
Variations
The query interval I = [x_l, x_r]:
unbounded query: either x_l = -∞ or x_r = +∞
bounded query: otherwise
The PDF f_p(x) of each uncertain point p:
uniform distribution: f_p(x) has only one interval
histogram distribution: otherwise
Combining the two choices gives four variations.
13
Our results: uniform unbounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: query time O(k + log n)
threshold: query time O(m + log n)
14
Our results: histogram unbounded
top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n log log n), and T = O(log n + k log k) otherwise.
15
Our results: uniform bounded
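top-1: preprocessing time O(n log n), space O(n), query time O(log n)
top-k: O(n log^2 n) (preprocessing time or space; the flattened table does not show which column), query time T
threshold: query time O(m + log n)
T = O(k) if k = Ω(log n log log n), and T = O(log n + k log k) otherwise.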
16
Future work: histogram bounded
No new results.
Previous work covers only threshold queries (P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009):
preprocessing: O(n log^2 n) space and O(n log^3 n) expected time
query: O(m + log^3 n) time, where m is the output size
17
The š¼-probability: unbounded
Given a query interval I = [x_l, x_r]: Pr[p ∈ I] = C_p(x_r) - C_p(x_l).
If x_l = -∞, then C_p(x_l) = 0 and Pr[p ∈ I] = C_p(x_r).
This is why the unbounded case is easier: the I-probability depends on a single CDF value.
[Figure: the CDF C_p(x) with C_p(x_l) and C_p(x_r) marked at x_l and x_r.]
18
The arrangement of CDFs
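Key: the intersections of all CDFs with the vertical line L(x_r) through x_r.
top-1: the highest intersection
top-k: the highest k intersections
threshold: the intersections at or above the threshold t
[Figure: the CDFs C_p(x) crossing the line L(x_r), with the threshold t marked on the vertical axis.]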
19
Top-1: unbounded
Preprocessing: compute the upper envelope of all CDFs.
Query: find the intersection of L(x_r) with the upper envelope.
[Figure: the upper envelope of the CDFs C_p(x) and the line L(x_r).]
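A minimal sketch of the envelope idea, under a simplifying assumption: every point is uniform on an interval [a, b], and its CDF is represented by the full line through the rising part, so the envelope is an upper envelope of lines. General histogram CDFs need an envelope of piecewise linear functions, which this sketch does not cover. The names `upper_envelope` and `top1_unbounded` are hypothetical.

```python
from bisect import bisect_right

def upper_envelope(lines):
    """Upper envelope of lines given as (slope, intercept, point_id).

    Returns (hull, xs): hull lists the envelope lines from left to right,
    and xs[i] is the x-coordinate where hull[i + 1] overtakes hull[i].
    """
    hull = []
    for m, c, pid in sorted(lines):
        if hull and hull[-1][0] == m:
            hull.pop()                       # same slope, smaller or equal intercept
        while len(hull) >= 2:
            (m1, c1, _), (m2, c2, _) = hull[-2], hull[-1]
            # hull[-1] is hidden if the new line overtakes hull[-2] no later than hull[-1] does
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()
            else:
                break
        hull.append((m, c, pid))
    xs = [(c1 - c2) / (m2 - m1) for (m1, c1, _), (m2, c2, _) in zip(hull, hull[1:])]
    return hull, xs

def top1_unbounded(hull, xs, x_r):
    """Top-1 for I = (-inf, x_r]: the envelope line at x_r, found by binary search."""
    m, c, pid = hull[bisect_right(xs, x_r)]
    return pid, min(1.0, max(0.0, m * x_r + c))  # clamp the line value to a probability

intervals = [(0.0, 4.0), (1.0, 3.0), (2.0, 10.0)]   # hypothetical uniform points
lines = [(1.0 / (b - a), -a / (b - a), i) for i, (a, b) in enumerate(intervals)]
hull, xs = upper_envelope(lines)
print(top1_unbounded(hull, xs, 2.5))  # (1, 0.75)
print(top1_unbounded(hull, xs, 1.5))  # (0, 0.375)
```

Building the envelope is dominated by the sort, and each query is one binary search.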
20
Difficulty for top-k queries
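Arrangements of segments: difficult!
Arrangements of lines: much easier!
Uniform case: replace each CDF by a line (the CDF of a uniform point has a single rising segment, which is extended to a full line).
[Figure: a uniform CDF C_p(x) rising to 1, the line through its rising segment, and the line L(x_r).]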
21
Uniform unbounded
Given an arrangement of n lines, for any vertical query line L(x_r):
top-k: return the top k intersections
threshold: return the intersections at or above t
[Figure: the lines C_p(x) crossing L(x_r), with the threshold t marked.]
22
A half-plane range reporting data structure
Problem: given a line arrangement, for any query point q, return the lines above q.
Data structure: partition the lines into layers, where each layer consists of the lines on the upper envelope after the previous layers are removed.
23
Threshold query: uniform unbounded
Given I = (-∞, x_r] and the threshold t:
determine the intersections of L(x_r) with the upper envelopes (one per layer) that lie at or above t
for each such intersection, walk along the envelope to the left and to the right to find the lines that intersect L(x_r) at or above t
query time: O(log n + m)
[Figure: the layers crossing L(x_r), with the threshold t marked.]
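The layer-and-walk procedure can be sketched as below. This is an illustration under assumptions, not the structure from the talk: layers are built by repeated peeling (quadratic in the worst case rather than O(n log n)), and each layer is searched with its own binary search, so the query bound here is weaker than the stated O(log n + m). Lines are (slope, intercept, id) triples with distinct ids, and all function names are hypothetical.

```python
from bisect import bisect_right

def upper_envelope(lines):
    """Upper envelope of (slope, intercept, id) lines, listed left to right."""
    hull = []
    for m, c, pid in sorted(lines):
        if hull and hull[-1][0] == m:
            hull.pop()                       # same slope, smaller or equal intercept
        while len(hull) >= 2:
            (m1, c1, _), (m2, c2, _) = hull[-2], hull[-1]
            if (c1 - c) * (m2 - m1) <= (c1 - c2) * (m - m1):
                hull.pop()                   # hull[-1] never appears on the envelope
            else:
                break
        hull.append((m, c, pid))
    xs = [(c1 - c2) / (m2 - m1) for (m1, c1, _), (m2, c2, _) in zip(hull, hull[1:])]
    return hull, xs

def build_layers(lines):
    """Peel envelopes repeatedly: layer i is the envelope of what layers < i left over."""
    layers, remaining = [], list(lines)
    while remaining:
        hull, xs = upper_envelope(remaining)
        layers.append((hull, xs))
        used = {pid for _, _, pid in hull}
        remaining = [ln for ln in remaining if ln[2] not in used]
    return layers

def value_at(line, x):
    return line[0] * x + line[1]

def threshold_query(layers, x_r, t):
    """Ids of all lines whose value at x_r is at least t."""
    out = []
    for hull, xs in layers:
        j = bisect_right(xs, x_r)            # envelope line of this layer at x_r
        if value_at(hull[j], x_r) < t:
            break                            # every deeper layer lies below this one
        out.append(hull[j][2])
        i = j - 1                            # walk left while still at or above t
        while i >= 0 and value_at(hull[i], x_r) >= t:
            out.append(hull[i][2])
            i -= 1
        i = j + 1                            # walk right while still at or above t
        while i < len(hull) and value_at(hull[i], x_r) >= t:
            out.append(hull[i][2])
            i += 1
    return out

lines = [(0.25, 0.0, 0), (0.5, -0.5, 1), (0.125, -0.25, 2), (0.5, 0.0, 3)]
layers = build_layers(lines)
print(sorted(threshold_query(layers, 1.5, 0.3)))  # [0, 3]
```

The walk is valid because, within one layer, the values at x_r decrease monotonically when moving away from the envelope line at x_r in either direction.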
24
Top-k query: uniform unbounded
Use a heap: O(log n + k log k) query time.
Observation: the task is to find the largest k elements in O(k) sorted arrays.
Using a selection algorithm on sorted matrices (Frederickson and Johnson, '82) gives O(log n + k) time.
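The heap-based variant can be illustrated on its own, assuming the sorted arrays are already available as descending lists (how they are obtained from the layers is not shown here). This is the simple heap merge, not the Frederickson and Johnson selection algorithm; the name `k_largest_from_sorted` is hypothetical.

```python
import heapq

def k_largest_from_sorted(arrays, k):
    """Return the k largest values from several lists, each sorted in descending order."""
    # Max-heap via negated values; each entry is (-value, array index, position).
    heap = [(-arr[0], i, 0) for i, arr in enumerate(arrays) if arr]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        if j + 1 < len(arrays[i]):           # expose the next element of that array
            heapq.heappush(heap, (-arrays[i][j + 1], i, j + 1))
    return out

print(k_largest_from_sorted([[9, 5, 1], [8, 7, 2], [6]], 4))  # [9, 8, 7, 6]
```

With O(k) arrays, building the heap takes O(k) time and each of the k extractions costs O(log k), which matches the k log k term on the slide.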
26
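Uniform bounded: transform the problem to the unbounded case
If x_l ≤ the left endpoint of the blue interval (the interval on which p is distributed), then Pr[p ∈ I] = Pr[p ∈ I'] for I' = (-∞, x_r].
It becomes the unbounded case!
[Figure: the blue interval of p, the query interval I = [x_l, x_r], and I'.]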
27
Uniform bounded (cont.)
Classify the blue intervals into three types:
L-type: left endpoint ≥ x_l
R-type: right endpoint ≤ x_r
M-type: the interval contains I
[Figure: blue intervals of each type relative to the query interval I = [x_l, x_r].]
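A small sketch of the classification and of what each type buys, assuming a uniform point on [a, b]. An interval contained in I satisfies both the L-type and the R-type condition; assigning it to L is a tie-break chosen here, not something stated on the slide. The function names are hypothetical.

```python
def uniform_cdf(a, b, x):
    """CDF at x of a point distributed uniformly on [a, b], with a < b."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def classify(a, b, x_l, x_r):
    """Type of the interval [a, b] with respect to the query I = [x_l, x_r]."""
    if a >= x_l:
        return "L"   # nothing of p lies left of x_l
    if b <= x_r:
        return "R"   # nothing of p lies right of x_r
    return "M"       # a < x_l and b > x_r, so [a, b] contains I

def i_probability(a, b, x_l, x_r):
    """I-probability computed per type; every branch needs at most one CDF value."""
    kind = classify(a, b, x_l, x_r)
    if kind == "L":
        return uniform_cdf(a, b, x_r)         # same as the query (-inf, x_r]
    if kind == "R":
        return 1.0 - uniform_cdf(a, b, x_l)   # same as the query [x_l, +inf)
    return (x_r - x_l) / (b - a)              # I lies inside [a, b]

for a, b in [(2.0, 6.0), (-1.0, 3.0), (0.0, 8.0)]:
    print(classify(a, b, 1.0, 4.0), i_probability(a, b, 1.0, 4.0))
# L 0.5
# R 0.5
# M 0.375
```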
28
Uniform bounded (cont.)
Top-1 queries:
L-type and R-type: use a persistent data structure to maintain the O(n) upper envelopes built during preprocessing
M-type: transform to segment dragging queries in 2D
Top-k queries:
L-type and R-type: use a binary tree T; on each node, build a data structure as in the unbounded case, then build a fractional cascading structure on T
M-type: transform to a range query in 3D
Threshold queries: similar to top-k queries
29
Histogram unbounded: a segment query problem
Given a set of n segments, for any query point q, return all segments vertically above q.
P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009:
preprocessing: O(n) space and O(n log n) time
query: O(log n + m) time
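To pin down what the segment query returns, here is a brute-force reference that scans all segments. It is not the cited data structure, and it assumes non-vertical segments given as ((x1, y1), (x2, y2)) with x1 < x2. One plausible reading, stated here as an assumption, is that the segments are the linear pieces of the CDFs and q = (x_r, t) encodes a threshold query. The name `segments_above` is hypothetical.

```python
def segments_above(segments, q):
    """Indices of the segments hit by the upward vertical ray from q (touching counts)."""
    qx, qy = q
    hits = []
    for idx, ((x1, y1), (x2, y2)) in enumerate(segments):
        if x1 <= qx <= x2:
            # height of the segment at x = qx by linear interpolation
            y = y1 + (y2 - y1) * (qx - x1) / (x2 - x1)
            if y >= qy:
                hits.append(idx)
    return hits

segments = [((0.0, 0.0), (2.0, 1.0)), ((1.0, 0.5), (3.0, 0.5)), ((2.0, 0.0), (4.0, 1.0))]
print(segments_above(segments, (1.5, 0.6)))  # [0]
```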
30
Thank you for your attention!