Range Queries on Uncertain Data


Range Queries on Uncertain Data. Jian Li, Tsinghua University; Haitao Wang, Utah State University. ISAAC 2014.

One-dimensional range queries. Input: a set of points on a line. Given a query interval 𝐼, return the points in 𝐼. A trivial solution: a balanced binary search tree.
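As a rough illustration of the deterministic baseline, the sketch below answers a 1-D range reporting query in O(log n + m) time. It assumes a static point set and uses a sorted list with binary search as a stand-in for the balanced binary search tree named on the slide; the function name `range_report` is hypothetical.

```python
from bisect import bisect_left, bisect_right

def range_report(sorted_points, x_l, x_r):
    """Return every point p with x_l <= p <= x_r.

    Assumption: sorted_points is already sorted in ascending order.
    A sorted list replaces the balanced BST because the set is static here.
    """
    lo = bisect_left(sorted_points, x_l)    # first index with point >= x_l
    hi = bisect_right(sorted_points, x_r)   # first index with point > x_r
    return sorted_points[lo:hi]

points = sorted([7.5, 1.0, 4.2, 9.9, 3.3])
print(range_report(points, 3.0, 8.0))  # prints [3.3, 4.2, 7.5]
```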

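An uncertain point p. p can appear at different locations, each with some probability. Given a query interval 𝐼, Pr[p∈𝐼] is the probability that p lies in 𝐼, called the 𝐼-probability of p. [Figure: four possible locations of p with probabilities 0.1, 0.3, 0.2, 0.4, and a query interval 𝐼 with Pr[p∈𝐼] = 0.5.]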
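A minimal sketch of the 𝐼-probability for a discrete uncertain point. The four probabilities come from the slide's figure; the positions 1.0 to 4.0 and the interval [1.5, 3.5] are assumptions made up for this example, since the figure's coordinates are not in the transcript.

```python
def interval_probability(locations, x_l, x_r):
    """Sum the probabilities of all locations that fall inside [x_l, x_r].

    locations: list of (position, probability) pairs whose probabilities sum to 1.
    """
    return sum(prob for pos, prob in locations if x_l <= pos <= x_r)

# Hypothetical positions; only the probabilities appear on the slide.
p_locations = [(1.0, 0.1), (2.0, 0.3), (3.0, 0.2), (4.0, 0.4)]
print(interval_probability(p_locations, 1.5, 3.5))
```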

An uncertain point p: a general case. The location of p is specified by its PDF (probability density function) f_p(x), which is a step function (a histogram). Given a query interval 𝐼, Pr[p∈𝐼] is the 𝐼-probability of p. [Figure: histogram f_p(x) over x, with the values 0.25, 0.22, 0.2, 0.15 marked, and a query interval 𝐼.]

The cumulative distribution function (CDF). C_p(x) is a piecewise linear function. [Figure: f_p(x) and C_p(x) plotted over x; C_p(x) rises to 1, and the value C_p(x′) is marked at a point x′.]

Computing the 𝐼-probability using the CDF. C_p(x) is a piecewise linear function. For a query interval 𝐼 = [x_l, x_r]: Pr[p∈𝐼] = C_p(x_r) − C_p(x_l). [Figure: C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r.]
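The CDF-difference formula can be sketched for a histogram PDF as follows. The class name `HistogramPoint`, its constructor arguments, and the example histogram are assumptions for illustration; the code only relies on the fact that the CDF of a step function is piecewise linear.

```python
from bisect import bisect_right

class HistogramPoint:
    """Uncertain point whose PDF is a step function (hypothetical helper)."""

    def __init__(self, breakpoints, densities):
        # densities[i] is the PDF value on [breakpoints[i], breakpoints[i + 1]).
        self.breakpoints = breakpoints
        self.densities = densities
        self.cum = [0.0]  # CDF value at each breakpoint
        for i, d in enumerate(densities):
            width = breakpoints[i + 1] - breakpoints[i]
            self.cum.append(self.cum[-1] + d * width)

    def cdf(self, x):
        """Evaluate the piecewise linear CDF at x (also works for +/- infinity)."""
        if x <= self.breakpoints[0]:
            return 0.0
        if x >= self.breakpoints[-1]:
            return self.cum[-1]
        i = bisect_right(self.breakpoints, x) - 1  # histogram bar containing x
        return self.cum[i] + self.densities[i] * (x - self.breakpoints[i])

    def interval_probability(self, x_l, x_r):
        """Pr[p in [x_l, x_r]] as a difference of two CDF values."""
        return self.cdf(x_r) - self.cdf(x_l)

# Made-up histogram: total mass 0.1*2 + 0.3*2 + 0.1*2 = 1.
p = HistogramPoint([0, 2, 4, 6], [0.1, 0.3, 0.1])
print(p.interval_probability(3, 5))
```

Passing `float("-inf")` as `x_l` gives the unbounded case, where the probability reduces to a single CDF evaluation.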

Range query problems on uncertain points. Input: a set P of n uncertain points. For any query interval 𝐼: (1) top-1 query: return the point in P with the largest 𝐼-probability; (2) top-k query: return the k points in P with the largest 𝐼-probabilities; (3) threshold query: given any threshold t, return the points in P with 𝐼-probabilities ≥ t. Goal: build data structures on P to answer these queries quickly.
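To pin down what the three queries return, here is a brute-force baseline that scans all n points per query. It is not one of the data structures from the talk. It assumes uniform uncertain points given as hypothetical (a, b) interval pairs with a < b, and all function names are made up for the sketch.

```python
import heapq

def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b], evaluated at x."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def prob(point, x_l, x_r):
    """I-probability of a uniform point for I = [x_l, x_r]."""
    a, b = point
    return uniform_cdf(a, b, x_r) - uniform_cdf(a, b, x_l)

def top1(points, x_l, x_r):
    """Point with the largest I-probability (linear scan)."""
    return max(points, key=lambda p: prob(p, x_l, x_r))

def topk(points, k, x_l, x_r):
    """The k points with the largest I-probabilities, largest first."""
    return heapq.nlargest(k, points, key=lambda p: prob(p, x_l, x_r))

def threshold(points, t, x_l, x_r):
    """All points whose I-probability is at least t."""
    return [p for p in points if prob(p, x_l, x_r) >= t]

points = [(0, 4), (2, 3), (1, 9)]   # I-probabilities for I = [2, 4]: 0.5, 1.0, 0.25
print(top1(points, 2, 4))
print(topk(points, 2, 2, 4))
print(threshold(points, 0.5, 2, 4))
```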

An application on deterministic data

An application on deterministic data (cont.). A query interval 𝐼 = [7, +∞). top-1 query: find the movie whose total percentage of ratings ≥ 7 is the largest. top-k query: find the top-k movies whose total percentages of ratings ≥ 7 are the largest. threshold query: e.g., for t = 0.8, find the movies whose total percentages of ratings ≥ 7 are ≥ 80%.
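A small sketch of the movie-rating reading of these queries. The movie names and rating distributions are invented for illustration, and the threshold 0.75 is chosen only to keep the arithmetic exact; each movie is treated as a point whose "distribution" is its share of votes per score.

```python
def share_at_least(ratings, cutoff):
    """Fraction of votes with score >= cutoff.

    ratings: dict mapping score -> fraction of votes (fractions sum to 1).
    """
    return sum(frac for score, frac in ratings.items() if score >= cutoff)

# Hypothetical data, not taken from the talk.
movies = {
    "movie_a": {5: 0.25, 7: 0.5, 9: 0.25},
    "movie_b": {4: 0.5, 8: 0.5},
    "movie_c": {6: 0.125, 7: 0.125, 10: 0.75},
}

# Query interval I = [7, +inf): compare the shares of ratings >= 7.
shares = {name: share_at_least(r, 7) for name, r in movies.items()}
best = max(shares, key=shares.get)                         # top-1 query
above = sorted(n for n, s in shares.items() if s >= 0.75)  # threshold query
print(best, above)
```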

Previous work: only on threshold queries. A heuristic solution using R-trees (Cheng et al., VLDB '04): fast in practice, but O(n) time in the worst case. Theoretical results (Agarwal et al., PODS '09): preprocessing uses O(n log^2 n) space and O(n log^3 n) expected time; a query takes O(m + log^3 n) time, where m is the output size. A special case, where t is fixed for all queries: preprocessing uses O(n) space and O(n log n) time; a query takes O(m + log n) time, where m is the output size. Heuristic solutions in 2-D or higher dimensions (Tao et al., 2005): O(n) time in the worst case.

Variations. The query interval 𝐼 = [x_l, x_r]: an unbounded query if either x_l = −∞ or x_r = +∞; a bounded query otherwise. The PDF f_p(x) of each uncertain point p: a uniform distribution if f_p(x) has only one interval; a histogram distribution otherwise. Together these give four variations.

Our results: uniform unbounded. top-1: O(n log n) preprocessing time, O(n) space, O(log n) query time. top-k: O(k + log n) query time. threshold: O(m + log n) query time.

Our results: histogram unbounded. top-1: O(n log n) preprocessing time, O(n) space, O(log n) query time. top-k: query time T, where T = O(k) if k = Ω(log n log log n) and T = O(log n + k log k) otherwise. threshold: O(m + log n) query time.

Our results: uniform bounded. top-1: O(n log n) preprocessing time, O(n) space, O(log n) query time. top-k: O(n log^2 n) preprocessing, query time T, where T = O(k) if k = Ω(log n log log n) and T = O(log n + k log k) otherwise. threshold: O(m + log n) query time.

Future work: histogram bounded. No new results. Previous work, only on threshold queries (P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009): preprocessing uses O(n log^2 n) space and O(n log^3 n) expected time; a query takes O(m + log^3 n) time, where m is the output size.

The 𝐼-probability: unbounded. Given a query interval 𝐼 = [x_l, x_r]: Pr[p∈𝐼] = C_p(x_r) − C_p(x_l). If x_l = −∞, then C_p(x_l) = 0 and Pr[p∈𝐼] = C_p(x_r). This is why the unbounded case is easier. [Figure: C_p(x) with the values C_p(x_l) and C_p(x_r) marked at x_l and x_r.]

The arrangement of CDFs. Key: the intersections of all CDFs with the vertical line L(x_r). top-1: the highest intersection. top-k: the highest k intersections. threshold: the intersections above the threshold t. [Figure: the CDFs C_p(x) crossed by the line L(x_r) at x_r, with the threshold t marked.]

Top-1: unbounded. Preprocessing: compute the upper envelope of all CDFs. Query: find the intersection of L(x_r) with the upper envelope. [Figure: the CDFs C_p(x) and the vertical line at x_r.]

Difficulty for top-k queries. Arrangements of segments: difficult! Arrangements of lines: much easier! Uniform case: change each CDF to a line. [Figure: a uniform CDF C_p(x), rising to 1, extended to a full line and crossed by L(x_r).]
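The line trick can be sketched for top-1 queries with 𝐼 = (−∞, x_r] in the uniform case. Each uniform point on a hypothetical interval (a, b), with a < b, becomes the line through (a, 0) and (b, 1). Clamping a line's value to [0, 1] recovers the CDF, and clamping preserves order, so the highest line at x_r identifies a point of maximum probability. The upper envelope is built with a standard sort-by-slope stack scan; all names are assumptions, and this is an illustration rather than the talk's implementation.

```python
from bisect import bisect_right

def cdf_line(a, b):
    """(slope, intercept) of the line through (a, 0) and (b, 1)."""
    slope = 1.0 / (b - a)
    return (slope, -a * slope)

def _cross_x(l1, l2):
    """x-coordinate where two lines of different slope intersect."""
    return (l2[1] - l1[1]) / (l1[0] - l2[0])

def upper_envelope(lines):
    """Return (hull, xs): envelope lines left to right and their breakpoints."""
    best = {}
    for m, c in lines:                      # equal slopes: keep the highest line
        if m not in best or c > best[m]:
            best[m] = c
    hull = []
    for line in sorted(best.items()):       # increasing slope
        # Drop the last line if the new one overtakes hull[-2] no later than it.
        while len(hull) >= 2 and _cross_x(hull[-2], line) <= _cross_x(hull[-2], hull[-1]):
            hull.pop()
        hull.append(line)
    xs = [_cross_x(hull[i], hull[i + 1]) for i in range(len(hull) - 1)]
    return hull, xs

def build_top1(intervals):
    """Preprocess uniform points into an envelope plus a line -> interval map."""
    owner = {cdf_line(a, b): (a, b) for a, b in intervals}
    hull, xs = upper_envelope(list(owner))
    return hull, xs, owner

def query_top1(structure, x_r):
    """Interval whose line is highest at x_r, found by binary search."""
    hull, xs, owner = structure
    return owner[hull[bisect_right(xs, x_r)]]

s = build_top1([(0, 4), (2, 3), (1, 9)])
print(query_top1(s, 2), query_top1(s, 3))
```

When several lines exceed 1 at x_r, all of those points have probability 1 and the sketch returns one of them.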

Uniform unbounded. Given an arrangement of n lines, for any query vertical line L(x_r): top-k: return the top k intersections; threshold: return the intersections above t. [Figure: lines C_p(x) crossed by the vertical line at x_r, with the threshold t marked.]

A half-plane range reporting data structure. Problem: given a line arrangement, for any query point q, return the lines above q. Data structure: partition the lines into layers; each layer consists of the lines on the upper envelope after the previous layers are removed.

Threshold query: uniform unbounded. Given 𝐼 = (−∞, x_r] and the threshold t: determine the intersections of L(x_r) with the upper envelopes that lie above t; for each such intersection, walk along the envelope to the left and right to find the lines that intersect L(x_r) above t. Query time: O(log n + m). [Figure: the layers of envelopes crossed by L(x_r) at x_r, with the threshold t marked.]
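The layer decomposition and the walk can be sketched as below. Lines are hypothetical (slope, intercept) pairs. The layers are peeled naively, and each layer is searched with its own binary search, so a query costs O((m + 1) log n) here rather than the O(log n + m) quoted on the slide, which needs further machinery that this sketch omits. Within one layer the line values at x_r decrease when moving away from the topmost line, which is why each walk can stop at the first line below t.

```python
from bisect import bisect_right

def _cross_x(l1, l2):
    """x-coordinate where two lines of different slope intersect."""
    return (l2[1] - l1[1]) / (l1[0] - l2[0])

def upper_envelope(lines):
    """Return (hull, xs): envelope lines left to right and their breakpoints."""
    best = {}
    for m, c in lines:                      # equal slopes: keep the highest line
        if m not in best or c > best[m]:
            best[m] = c
    hull = []
    for line in sorted(best.items()):       # increasing slope
        while len(hull) >= 2 and _cross_x(hull[-2], line) <= _cross_x(hull[-2], hull[-1]):
            hull.pop()
        hull.append(line)
    xs = [_cross_x(hull[i], hull[i + 1]) for i in range(len(hull) - 1)]
    return hull, xs

def envelope_layers(lines):
    """Peel upper envelopes repeatedly (naive construction, for illustration)."""
    remaining = sorted(set(lines))
    layers = []
    while remaining:
        hull, xs = upper_envelope(remaining)
        layers.append((hull, xs))
        on_hull = set(hull)
        remaining = [l for l in remaining if l not in on_hull]
    return layers

def value(line, x):
    return line[0] * x + line[1]

def lines_above(layers, x, t):
    """Report every line whose value at x is at least t."""
    result = []
    for hull, xs in layers:
        top = bisect_right(xs, x)           # topmost line of this layer at x
        if value(hull[top], x) < t:
            break                           # deeper layers lie even lower at x
        result.append(hull[top])
        i = top - 1
        while i >= 0 and value(hull[i], x) >= t:          # walk left
            result.append(hull[i])
            i -= 1
        i = top + 1
        while i < len(hull) and value(hull[i], x) >= t:   # walk right
            result.append(hull[i])
            i += 1
    return result

lines = [(0.125, -0.125), (0.25, 0.0), (0.5, -0.5), (1.0, -2.0), (0.25, -0.5), (0.2, -0.4)]
layers = envelope_layers(lines)
print(sorted(lines_above(layers, 2.5, 0.4)))
```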

Top-k query: uniform unbounded. Using a heap: O(log n + k log k) query time. Observation: the task is to find the largest k elements in O(k) sorted arrays; a selection algorithm on sorted matrices (Frederickson and Johnson, '82) gives O(log n + k) time. [Figure: the layers of envelopes crossed by L(x_r) at x_r.]
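The heap-based variant can be sketched generically as picking the k largest values from several arrays that are each sorted in descending order. How the arrays arise (for example, from walking left and right along each envelope layer) is an interpretation on my part, and the function name is hypothetical. The Frederickson and Johnson selection algorithm is not implemented here.

```python
import heapq

def k_largest(sorted_desc_arrays, k):
    """Return the k largest values across arrays sorted in descending order.

    A max-heap (simulated with negated keys) holds the current head of each
    array; every pop is followed by at most one push from the same array.
    """
    heap = [(-arr[0], i, 0) for i, arr in enumerate(sorted_desc_arrays) if arr]
    heapq.heapify(heap)
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        if j + 1 < len(sorted_desc_arrays[i]):
            heapq.heappush(heap, (-sorted_desc_arrays[i][j + 1], i, j + 1))
    return out

print(k_largest([[9, 4, 1], [8, 7, 2], [5]], 4))  # prints [9, 8, 7, 5]
```

With O(k) arrays, building the heap costs O(k) and the k pops cost O(k log k), which matches the k log k term on the slide.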

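Uniform bounded. Transform the problem to the unbounded case. If x_l ≤ the left endpoint of the blue interval (the interval on which p is uniformly distributed), then Pr[p∈𝐼] = Pr[p∈𝐼′] for 𝐼′ = (−∞, x_r]. It becomes the unbounded case! [Figure: the blue interval with x_l, x_r, 𝐼, and 𝐼′ marked.]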

Uniform bounded (cont.). Classify the blue intervals into three types. L-type: left endpoints ≥ x_l. R-type: right endpoints ≤ x_r. M-type: each contains 𝐼. [Figure: the three types relative to 𝐼 = [x_l, x_r].]
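A sketch of the three-way classification and of the probability each type yields. Intervals are hypothetical (a, b) pairs with a < b. An interval lying entirely inside 𝐼 satisfies both the L and R conditions; checking L first is an assumption of this sketch, not something the slide specifies. The R-type rule (reduce to [x_l, +∞)) is the mirror image of the L-type reduction and is likewise my reading.

```python
def uniform_cdf(a, b, x):
    """CDF of the uniform distribution on [a, b], evaluated at x."""
    return min(1.0, max(0.0, (x - a) / (b - a)))

def classify(interval, x_l, x_r):
    """Label an interval as 'L', 'R', or 'M' relative to I = [x_l, x_r]."""
    a, b = interval
    if a >= x_l:
        return "L"          # left endpoint not left of x_l
    if b <= x_r:
        return "R"          # right endpoint not right of x_r
    return "M"              # a < x_l and b > x_r: the interval contains I

def probability_by_type(interval, x_l, x_r):
    """I-probability computed with the rule matching the interval's type."""
    a, b = interval
    kind = classify(interval, x_l, x_r)
    if kind == "L":
        return uniform_cdf(a, b, x_r)          # same as I' = (-inf, x_r]
    if kind == "R":
        return 1.0 - uniform_cdf(a, b, x_l)    # same as I' = [x_l, +inf)
    return (x_r - x_l) / (b - a)               # I lies inside [a, b]

for iv in [(3, 6), (0, 4), (1, 9)]:
    print(iv, classify(iv, 2, 5), probability_by_type(iv, 2, 5))
```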

Uniform bounded (cont.). Top-1 queries: for L-type and R-type, use a persistent data structure to maintain O(n) upper envelopes during preprocessing; for M-type, transform to segment dragging queries in 2D. Top-k queries: for L-type and R-type, use a binary tree T, build on each node a data structure as in the unbounded case, and build a fractional cascading structure on T; for M-type, transform to a range query in 3D. Threshold queries: similar to top-k queries.

Histogram unbounded. A segment query problem: given a set of n segments, for any query point q, return all segments vertically above q. P.K. Agarwal, S.-W. Cheng, Y. Tao, and K. Yi, PODS 2009: preprocessing uses O(n) space and O(n log n) time; a query takes O(log n + m) time.

Thank you for your attention!