Range Queries on Uncertain Data. Jian Li, Tsinghua University; Haitao Wang, Utah State University. ISAAC 2014
One-dimensional range queries. Input: a set of points on a line. Given a query interval 𝐼, return the points in 𝐼. A trivial solution: a balanced binary search tree.
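The deterministic baseline on this slide can be sketched in a few lines. The sketch below swaps the balanced binary search tree for a sorted array with binary search, which gives the same O(log n + m) query time when the point set is static; the function names `build` and `range_query` are illustrative, not from the talk.

```python
from bisect import bisect_left, bisect_right

def build(points):
    """Preprocessing: sort the points once, O(n log n)."""
    return sorted(points)

def range_query(sorted_points, lo, hi):
    """Return all points p with lo <= p <= hi in O(log n + m) time."""
    i = bisect_left(sorted_points, lo)    # first index with value >= lo
    j = bisect_right(sorted_points, hi)   # first index with value > hi
    return sorted_points[i:j]

pts = build([5.0, 1.0, 9.0, 3.0, 7.0])
result = range_query(pts, 2.0, 7.0)
```

A sorted array is enough here because no insertions or deletions are needed; a balanced tree would matter only for a dynamic point set.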
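An uncertain point p: p can appear in different locations, each with some probability. Given a query interval 𝐼, Pr[𝑝∈𝐼] is the probability that p lies in 𝐼, called the 𝐼-probability of p. [Figure: four possible locations of p with probabilities 0.1, 0.3, 0.2, 0.4 and a query interval 𝐼 with Pr[𝑝∈𝐼] = 0.5.]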
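For a point with finitely many possible locations, the 𝐼-probability is just the sum of the probabilities of the locations inside 𝐼. A minimal sketch follows; the four probabilities are the ones on the slide, while the coordinates and the choice of which two locations fall inside 𝐼 are assumptions made for illustration.

```python
def interval_probability(locations, lo, hi):
    """locations: list of (x, probability) pairs whose probabilities sum to 1.
    Returns Pr[p in [lo, hi]]."""
    return sum(prob for x, prob in locations if lo <= x <= hi)

# Hypothetical coordinates 1..4 carrying the slide's probabilities.
p = [(1.0, 0.1), (2.0, 0.3), (3.0, 0.2), (4.0, 0.4)]
prob = interval_probability(p, 1.5, 3.5)   # covers the locations at 2.0 and 3.0
```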
An uncertain point p: the general case. The location of p is specified by its PDF (probability density function) f_p(x), which is a step function (histogram). Given a query interval 𝐼, Pr[𝑝∈𝐼] is the 𝐼-probability of p. [Figure: a histogram PDF f_p(x) with step heights 0.25, 0.22, 0.2, 0.15 and a query interval 𝐼 on the x-axis.]
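The cumulative distribution function (CDF) C_p(x): a piecewise linear function. [Figure: the CDF C_p(x) with maximum value 1 and the value C_p(x′) marked at a point x′, drawn together with the PDF f_p(x).]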
Computing the 𝐼-probability using the CDF C_p(x), a piecewise linear function. For a query interval 𝐼 = [x_l, x_r]: Pr[p∈𝐼] = C_p(x_r) − C_p(x_l). [Figure: C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r.]
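The formula Pr[p∈𝐼] = C_p(x_r) − C_p(x_l) can be sketched for a histogram PDF as follows. The class name `HistogramPoint`, its fields, and the example breakpoints and heights are assumptions for illustration; the CDF is stored by its values at the breakpoints and interpolated linearly in between.

```python
from bisect import bisect_right
from itertools import accumulate

class HistogramPoint:
    def __init__(self, breaks, heights):
        # breaks: x_0 < x_1 < ... < x_m; heights[i] is the density on [x_i, x_{i+1}).
        assert len(breaks) == len(heights) + 1
        self.breaks = breaks
        self.heights = heights
        masses = [h * (b - a) for a, b, h in zip(breaks, breaks[1:], heights)]
        self.cum = [0.0] + list(accumulate(masses))   # CDF values at the breakpoints
        assert abs(self.cum[-1] - 1.0) < 1e-9, "total probability must be 1"

    def cdf(self, x):
        """Piecewise linear CDF C_p(x), found by binary search in O(log m)."""
        if x <= self.breaks[0]:
            return 0.0
        if x >= self.breaks[-1]:
            return 1.0
        i = bisect_right(self.breaks, x) - 1
        return self.cum[i] + self.heights[i] * (x - self.breaks[i])

    def interval_probability(self, x_l, x_r):
        """Pr[p in [x_l, x_r]] = C_p(x_r) - C_p(x_l)."""
        return self.cdf(x_r) - self.cdf(x_l)

# Hypothetical histogram: masses 0.5, 0.3, 0.2 on [0,2), [2,4), [4,5).
p = HistogramPoint([0.0, 2.0, 4.0, 5.0], [0.25, 0.15, 0.2])
prob = p.interval_probability(1.0, 3.0)   # C_p(3) - C_p(1) = 0.65 - 0.25
```

Passing `float("-inf")` or `float("inf")` as an endpoint works as well, since the CDF is clamped to 0 and 1 outside the histogram's support.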
Range query problems on uncertain points. Input: a set P of n uncertain points. For any query interval 𝐼: top-1 query: return the point in P with the largest 𝐼-probability; top-k query: return the k points in P with the largest 𝐼-probabilities; threshold query: given a threshold t, return the points in P with 𝐼-probabilities ≥ t. Goal: build data structures on P to answer these queries quickly.
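To make the three query types concrete, here is a brute-force baseline that evaluates every point's 𝐼-probability at query time. It takes O(n) time per query (O(n log k) for top-k) and is only a reference point for the data structures discussed in the talk, not one of them. Uniformly distributed points are used for brevity, and all names are illustrative.

```python
import heapq

def uniform_cdf(a, b):
    """CDF of a point uniformly distributed on [a, b], with a < b."""
    return lambda x: min(1.0, max(0.0, (x - a) / (b - a)))

def i_prob(cdf, x_l, x_r):
    """I-probability for I = [x_l, x_r]."""
    return cdf(x_r) - cdf(x_l)

def top_k(cdfs, x_l, x_r, k):
    """Indices of the k points with the largest I-probabilities (k = 1 gives top-1)."""
    return heapq.nlargest(k, range(len(cdfs)),
                          key=lambda i: i_prob(cdfs[i], x_l, x_r))

def threshold(cdfs, x_l, x_r, t):
    """Indices of the points with I-probability >= t."""
    return [i for i, c in enumerate(cdfs) if i_prob(c, x_l, x_r) >= t]

# Hypothetical input: three uniform points on [0,4], [2,3], [1,9].
P = [uniform_cdf(0, 4), uniform_cdf(2, 3), uniform_cdf(1, 9)]
# For I = [2, 4] the I-probabilities are 0.5, 1.0 and 0.25.
best = top_k(P, 2, 4, 1)
best_two = top_k(P, 2, 4, 2)
above_half = threshold(P, 2, 4, 0.5)
```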
An application on deterministic data
An application on deterministic data (cont.). A query interval 𝐼 = [7, +∞). top-1 query: find the movie with the largest total percentage of ratings ≥ 7; top-k query: find the k movies with the largest total percentages of ratings ≥ 7; threshold query: e.g., for t = 0.8, find the movies whose total percentage of ratings ≥ 7 is at least 80%.
Previous work: only on threshold queries. A heuristic solution using R-trees (Cheng et al., VLDB '04): fast in practice, but O(n) time in the worst case. Theoretical results (Agarwal et al., PODS '09): preprocessing: O(n log² n) space and O(n log³ n) expected time; query: O(m + log³ n) time, where m is the output size. A special case in which t is fixed for all queries: preprocessing: O(n) space and O(n log n) time; query: O(m + log n) time, where m is the output size. Heuristic solutions in 2-D or higher dimensions (Tao et al., 2005): O(n) time in the worst case.
Variations. The query interval 𝐼 = [x_l, x_r]: unbounded query: either x_l = −∞ or x_r = +∞; bounded query: otherwise. The PDF f_p(x) of each uncertain point p: uniform distribution: f_p(x) has only one interval; histogram distribution: otherwise. Together these give four variations.
Our results: uniform unbounded (columns: preprocessing time, space, query time). top-1: O(n log n), O(n), O(log n); top-k: query time O(k + log n); threshold: query time O(m + log n).
Our results: histogram unbounded (columns: preprocessing time, space, query time). top-1: O(n log n), O(n), O(log n); top-k: query time T; threshold: query time O(m + log n). Here T = O(k) if k = Ω(log n log log n), and T = O(log n + k log k) otherwise.
Our results: uniform bounded (columns: preprocessing time, space, query time). top-1: O(n log n), O(n), O(log n); top-k: O(n log² n), query time T; threshold: query time O(m + log n). Here T = O(k) if k = Ω(log n log log n), and T = O(log n + k log k) otherwise.
Future work: histogram bounded. No new results. Previous work is only on threshold queries (P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009): preprocessing: O(n log² n) space and O(n log³ n) expected time; query: O(m + log³ n) time, where m is the output size.
The 𝐼-probability: unbounded. Given a query interval 𝐼 = [x_l, x_r]: Pr[p∈𝐼] = C_p(x_r) − C_p(x_l). If x_l = −∞, then C_p(x_l) = 0 and Pr[p∈𝐼] = C_p(x_r). This is why the unbounded case is easier. [Figure: C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r.]
The arrangement of CDFs. Key: the intersections of all CDFs with the vertical line L(x_r). top-1: the highest intersection; top-k: the k highest intersections; threshold: the intersections at or above the threshold t. [Figure: the CDFs C_p(x) crossed by the vertical line L(x_r) at x = x_r, with the threshold t marked.]
Top-1: unbounded. Preprocessing: compute the upper envelope of all CDFs. Query: find the intersection of L(x_r) with the upper envelope. [Figure: the upper envelope of the CDFs C_p(x) and the query position x_r.]
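The envelope idea can be sketched in a simplified setting where every CDF is a single line y = slope·x + intercept, which is an assumption of this example; for general histogram CDFs the envelope is over piecewise linear functions, which this sketch does not handle. The envelope is built once in O(n log n) and each top-1 query is a binary search over its breakpoints. All function names are illustrative.

```python
from bisect import bisect_left

def cross_x(l1, l2):
    """x-coordinate where two lines with different slopes intersect."""
    return (l1[1] - l2[1]) / (l2[0] - l1[0])

def upper_envelope(lines):
    """lines: (slope, intercept, label) tuples.
    Returns (hull, xs): envelope lines from left to right and their breakpoints."""
    best = {}
    for line in lines:                       # among equal slopes keep the highest line
        m, c = line[0], line[1]
        if m not in best or c > best[m][1]:
            best[m] = line
    hull = []
    for line in sorted(best.values(), key=lambda l: l[0]):
        # Pop the last line if the new line overtakes hull[-2] before it does.
        while len(hull) >= 2 and cross_x(hull[-2], line) <= cross_x(hull[-2], hull[-1]):
            hull.pop()
        hull.append(line)
    xs = [cross_x(a, b) for a, b in zip(hull, hull[1:])]
    return hull, xs

def top1(hull, xs, x_r):
    """Label of the highest line at x = x_r and its value clamped to [0, 1]."""
    m, c, label = hull[bisect_left(xs, x_r)]
    return label, min(1.0, max(0.0, m * x_r + c))

def line_of(a, b, label):
    """Line through the rising part of the CDF of a uniform point on [a, b]."""
    return (1.0 / (b - a), -a / (b - a), label)

# Hypothetical uniform points on [0,4], [2,3], [1,9].
hull, xs = upper_envelope([line_of(0, 4, "p0"), line_of(2, 3, "p1"), line_of(1, 9, "p2")])
winner_at_2 = top1(hull, xs, 2.0)    # p0 has probability 0.5 for I = (-inf, 2]
winner_at_35 = top1(hull, xs, 3.5)   # p1 lies entirely left of 3.5, probability 1
```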
Difficulty for top-k queries. Arrangements of segments: difficult! Arrangements of lines: much easier! Uniform case: change each CDF to a line. [Figure: a uniform CDF C_p(x) with maximum value 1 and the vertical line L(x_r).]
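One way to see why replacing a uniform CDF by a line can be safe: the CDF of a point uniform on [a, b] equals the line through its rising segment clamped to [0, 1], and clamping is monotone, so ranking points by line value at x_r agrees with ranking them by probability up to ties. The check below is a sketch under that reading of the slide, with illustrative names and made-up intervals.

```python
def uniform_cdf(a, b, x):
    """CDF of a point uniformly distributed on [a, b]."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

def extended_line(a, b, x):
    """The rising segment of the CDF extended to a full line."""
    return (x - a) / (b - a)

def clamp01(y):
    return min(1.0, max(0.0, y))

intervals = [(0.0, 4.0), (2.0, 3.0), (1.0, 9.0)]   # hypothetical uniform points
x_r = 2.5

by_line = sorted(intervals, key=lambda ab: extended_line(*ab, x_r), reverse=True)
probs_in_line_order = [uniform_cdf(a, b, x_r) for a, b in by_line]
```

Here `probs_in_line_order` comes out non-increasing, which is the property the top-k and threshold queries on lines rely on.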
Uniform unbounded. Given an arrangement of n lines, for any vertical query line L(x_r): top-k: return the k highest intersections; threshold: return the intersections at or above t. [Figure: lines crossed by L(x_r) at x = x_r, with the threshold t marked.]
A half-plane range reporting data structure. Problem: given a line arrangement, for any query point q, return the lines above q. Data structure: partition the lines into layers; each layer consists of the lines on the upper envelope after the previous layers are removed.
Threshold query: uniform unbounded. Given 𝐼 = (−∞, x_r] and the threshold t: determine the intersections of L(x_r) with the upper envelopes (one per layer) that lie above t; for each such intersection, walk along the envelope to the left and to the right to find the lines that intersect L(x_r) above t. Query time: O(log n + m). [Figure: layered envelopes crossed by L(x_r) at x_r, with the threshold t marked.]
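The layer-by-layer walk can be sketched as follows, assuming lines are given as (slope, intercept, label) tuples and 0 < t ≤ 1. This is a simplified illustration, not the structure from the talk: layers are peeled naively, and each visited layer is searched with its own binary search, so a query costs O((m + 1) log n) rather than the O(log n + m) stated on the slide. All names are illustrative.

```python
from bisect import bisect_left

def cross_x(l1, l2):
    """x-coordinate where two lines with different slopes intersect."""
    return (l1[1] - l2[1]) / (l2[0] - l1[0])

def upper_envelope(lines):
    """Envelope lines from left to right, plus the breakpoints between them."""
    best = {}
    for line in lines:                       # among equal slopes keep the highest line
        m, c = line[0], line[1]
        if m not in best or c > best[m][1]:
            best[m] = line
    hull = []
    for line in sorted(best.values(), key=lambda l: l[0]):
        while len(hull) >= 2 and cross_x(hull[-2], line) <= cross_x(hull[-2], hull[-1]):
            hull.pop()
        hull.append(line)
    return hull, [cross_x(a, b) for a, b in zip(hull, hull[1:])]

def build_layers(lines):
    """Repeatedly peel off the upper envelope of the remaining lines."""
    remaining, layers = list(lines), []
    while remaining:
        hull, xs = upper_envelope(remaining)
        layers.append((hull, xs))
        used = set(hull)
        remaining = [l for l in remaining if l not in used]
    return layers

def threshold_query(layers, x_r, t):
    """Labels of all lines whose value at x_r is >= t."""
    value = lambda l: l[0] * x_r + l[1]
    out = []
    for hull, xs in layers:
        top = bisect_left(xs, x_r)           # envelope line hit by L(x_r)
        if value(hull[top]) < t:
            break                            # deeper layers lie below this envelope
        i = top
        while i >= 0 and value(hull[i]) >= t:            # walk left
            out.append(hull[i][2]); i -= 1
        i = top + 1
        while i < len(hull) and value(hull[i]) >= t:     # walk right
            out.append(hull[i][2]); i += 1
    return out

# Hypothetical lines for uniform points on [0,4], [2,3], [1,9], [0,8].
lines = [(0.25, 0.0, "p0"), (1.0, -2.0, "p1"), (0.125, -0.125, "p2"), (0.125, 0.0, "p3")]
layers = build_layers(lines)
hits = threshold_query(layers, 2.0, 0.1)   # values at x_r = 2: 0.5, 0.0, 0.125, 0.25
```

The walk is correct because, along one envelope, the values of its lines at a fixed x_r increase up to the line hit by L(x_r) and decrease after it, so each direction can stop at the first line below t.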
Top-k query: uniform unbounded. Use a heap: O(log n + k log k) query time. Observation: the problem reduces to finding the largest k elements in O(k) sorted arrays; a selection algorithm on sorted matrices (Frederickson and Johnson, '82) then gives O(log n + k) time.
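The heap-based variant rests on a generic step: extracting the k largest values from several arrays that are each sorted in descending order. A minimal sketch of that step alone is given below, with the illustrative name `k_largest`; it is not the Frederickson and Johnson selection algorithm, which removes the logarithmic factor.

```python
import heapq

def k_largest(sorted_lists, k):
    """sorted_lists: lists sorted in descending order.
    Returns the k largest values overall, in descending order."""
    # heapq is a min-heap, so values are stored negated.
    heap = [(-lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        if j + 1 < len(sorted_lists[i]):     # advance within the same list
            heapq.heappush(heap, (-sorted_lists[i][j + 1], i, j + 1))
    return out

top4 = k_largest([[0.9, 0.4, 0.1], [0.8, 0.7], [0.3]], 4)
```

With L lists this takes O(L + k log L) time, since the heap never holds more than one entry per list.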
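Uniform bounded. Transform the problem to the unbounded case: if x_l ≤ the left endpoint of the blue interval (the interval on which p is uniformly distributed), then Pr[p∈𝐼] = Pr[p∈𝐼′] for 𝐼′ = (−∞, x_r]. It becomes the unbounded case! [Figure: the blue interval of p, the query interval 𝐼 = [x_l, x_r], and the extended interval 𝐼′.]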
Uniform bounded (cont.). Classify the blue intervals into three types with respect to 𝐼 = [x_l, x_r]: L-type: left endpoint ≥ x_l; R-type: right endpoint ≤ x_r; M-type: the interval contains 𝐼.
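The classification can be checked numerically: for an L-type interval the bounded query gives the same probability as the unbounded query (−∞, x_r], for an R-type interval the same as [x_l, +∞), and for an M-type interval the probability is |𝐼| divided by the interval length. The sketch below assumes a tie-breaking order (L before R) for intervals lying entirely inside 𝐼, which the slide does not specify; all names and sample intervals are illustrative.

```python
def cdf(a, b, x):
    """CDF of a point uniformly distributed on [a, b]."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def classify(a, b, x_l, x_r):
    """Type of the interval [a, b] relative to I = [x_l, x_r]."""
    if a >= x_l:
        return "L"          # C_p(x_l) = 0
    if b <= x_r:
        return "R"          # C_p(x_r) = 1
    return "M"              # a < x_l and b > x_r, so [a, b] contains I

def i_prob_by_type(a, b, x_l, x_r):
    kind = classify(a, b, x_l, x_r)
    if kind == "L":
        return cdf(a, b, x_r)               # unbounded query (-inf, x_r]
    if kind == "R":
        return 1.0 - cdf(a, b, x_l)         # unbounded query [x_l, +inf)
    return (x_r - x_l) / (b - a)            # I lies inside the support

x_l, x_r = 2.0, 5.0
samples = [(3.0, 7.0), (0.0, 4.0), (0.0, 10.0), (6.0, 8.0), (0.0, 1.0)]
kinds = [classify(a, b, x_l, x_r) for a, b in samples]
```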
Uniform bounded (cont.). Top-1 queries: L-type and R-type: use a persistent data structure to maintain O(n) upper envelopes in the preprocessing; M-type: transform to segment dragging queries in 2D. Top-k queries: L-type and R-type: use a binary tree T, build a data structure as in the unbounded case on each node, and build a fractional cascading structure on T; M-type: transform to a range query in 3D. Threshold queries: similar to top-k queries.
Histogram unbounded. A segment query problem: given a set of n segments, for any point q, return all segments vertically above q (P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009): preprocessing: O(n) space and O(n log n) time; query: O(log n + m) time.
Thank you for your attention!