DB group seminar: Density Estimation for Spatial Data Streams. Cecilia M. Procopiuc and Octavian Procopiuc, AT&T Shannon Labs, SSTD’05. Presented by: Huiping Cao.

Presentation transcript:

Slide 1 (DB group seminar): Density Estimation for Spatial Data Streams. Cecilia M. Procopiuc and Octavian Procopiuc, AT&T Shannon Labs, SSTD’05. Presented by: Huiping Cao

Slide 2: Outline
- Background
- Related work
- Problem definition
- Online algorithm
- Experiments
- References

Slide 3: Background
- Streaming data
  - Large volume
  - Continuous arrival
- Data stream algorithms
  - One pass
  - Small amount of space
  - Fast updating

Slide 4: Background: data stream model
- Operations on elements
  - Insertion (most common case)
  - Deletion
  - Updating (most difficult case)
- Validity time of elements
  - Whole history → landmark window [Geh01], cash register [this paper]
  - Partial recent history → sliding window [Geh01], turnstile model [this paper]

Slide 5: Related work
Classified according to operations:
- Aggregation
  - avg, max, min, sum [Dob02]
  - count (distinct values) [Gib01]
  - Quantiles in 1D data [Gre01]
  - Frequent items (heavy hitters) [Man02]
- Query estimation
  - Join size estimation [Dob02]
- K nearest neighbor searching
  - eKNN [Kou04], RNN aggregation [Kor02]
- Techniques
  - Histogram [Tha02], sample [Joh05], special synopsis, ...

Slide 6: Problem definition
- D: $\langle p_1, p_2, \ldots \rangle$, where $p_i \in \mathbb{R}^d$ (cash register model)
- $Q_i = \langle i, R_i \rangle$, where $R_i$ is a d-dimensional hyper-rectangle
- Selectivity of $Q_i$: $sel(Q_i) = |\{p_j \mid j \le i,\ p_j \in R_i\}|$
  - These points arrive no later than time step i
  - They lie in $R_i$
- Problem: estimate $sel(Q_i)$
- Measurement: relative error
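
The definition of sel(Q_i) can be checked against a brute-force count. The following is a minimal sketch, assuming the stream is held in a list and a hyper-rectangle is given as one (lo, hi) pair per dimension with closed bounds; the name `exact_selectivity` is hypothetical and does not come from the slides.

```python
def exact_selectivity(stream, i, rect):
    """Count the points p_j with j <= i that lie inside rect.

    stream: list of d-dimensional points; stream[0] is p_1 (assumption).
    rect:   list of (lo, hi) pairs, one per dimension, closed on both ends.
    """
    return sum(
        all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))
        for p in stream[:i]  # only points that arrived by time step i
    )
```

An online estimator cannot afford this count because it needs the whole stream in memory; it serves only as the ground truth against which an estimate is compared.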

Slide 7: Online algorithm: rough steps
- Draw a random sample S of D using the reservoir sampling method [vit85]
- Index the sample points with a kd-tree-like structure [kd75]
- Maintain the sample and the kd-tree-like structure online
- Compute the range selectivity estimated_sel(Q) using a kernel density estimator

Slide 8: Random sampling
Theorem 1: Let T be the data stream seen so far, of size |T|. Let $S \subseteq T$ be a random sample chosen via the reservoir sampling technique, such that $|S| = \Theta\left(\frac{d}{\varepsilon^2}\log\frac{1}{\varepsilon} + \log\frac{1}{\delta}\right)$, where $0 < \varepsilon, \delta < 1$ and |S| is the size of S. Then with probability $1 - \delta$, for any axis-parallel hyper-rectangle Q the following is true: [inequality not preserved in the transcript]
- $sel(Q) = |Q \cap T|$ is the selectivity of Q with respect to the data stream seen so far
- $sel(Q, S) = |Q \cap S|$ is the selectivity of Q with respect to the random sample

Slide 9: Sampling
- Problem with random sampling: the smaller sel(Q) is, the larger the relative error
- Better selectivity estimator: kernel density estimator

Slide 10: Kernel density estimator
- $S = \{s_1, \ldots, s_m\}$: random subset of D
- [estimator formula not preserved in the transcript], where $x = (x_1, \ldots, x_d)$ and $s_i = (s_{i1}, \ldots, s_{id})$ are d-dimensional points
- $B_j$: kernel bandwidth along dimension j [Sco92]
  - Global parameter

Slide 11: One-dimensional kernels
Figure: (a) kernel function, B = 1; (b) contribution of multiple kernels to the estimate of a range query
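
The estimator formula did not survive in this transcript, so the following is only a sketch of how a global product-kernel estimate of a range query can be computed. It assumes the Epanechnikov kernel k(u) = 3/4 (1 - u^2) on [-1, 1] (the slides do not name the kernel), one global bandwidth B_j per dimension, and scaling of the sample mass by |T| / |S|. The names `epanechnikov_cdf` and `kernel_selectivity` are hypothetical.

```python
def epanechnikov_cdf(u):
    """Integral of k(t) = 0.75 * (1 - t^2) from -1 to u, clamped to [0, 1]."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def kernel_selectivity(sample, bandwidths, rect, stream_size):
    """Estimate how many stream points fall in rect.

    sample:      list of d-dimensional kernel centers s_i
    bandwidths:  B_j > 0 for each dimension j (global parameters)
    rect:        list of (a_j, b_j) pairs describing the query box
    stream_size: |T|, the number of stream points seen so far
    """
    total = 0.0
    for s in sample:
        mass = 1.0  # fraction of this kernel's mass that lies inside rect
        for s_j, b_j, (lo, hi) in zip(s, bandwidths, rect):
            mass *= epanechnikov_cdf((hi - s_j) / b_j) - epanechnikov_cdf((lo - s_j) / b_j)
        total += mass
    return stream_size * total / len(sample)
```

Each sample point contributes a fractional count instead of 0 or 1, which is what makes the estimate smoother than plain sampling for low-selectivity queries.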

Slide 12: Local kernel density estimator
- kd-tree structure T(S): index of the sample data
  - Each leaf contains one point $s_i \in leaf(s_i)$
  - Any two leaves are disjoint
  - The union of all leaves is $\mathbb{R}^d$
- Each leaf maintains d+1 values: $\rho_i, \sigma_{i1}, \sigma_{i2}, \ldots, \sigma_{id}$
  - $\sigma_{ij}$ approximates the standard deviation of the points in the cell centered at $s_i$ along dimension j
- $R = [a_1, b_1] \times \ldots \times [a_d, b_d]$
- $T_i$: subset of points in tree leaf $leaf(s_i)$

Slide 13: Update T(S)
- Purpose: maintain $\rho_i$, $\sigma_{ij}$ ($1 \le j \le d$)
  - $\rho_i$ is the number of stream points contained in $leaf(s_i)$
- Assume p is the current point in the data stream
- If p is not selected into S by the sampling algorithm
  - Find the leaf that contains p, $leaf(s_i)$
  - Increment $\rho_i$
  - Add $(p_j - s_{ij})^2$ to $\sigma_{ij}$
- If p is selected into S
  - A point q will be deleted from S
  - Delete leaf(q)
  - Add a new leaf corresponding to p
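
The counter update for a point that is not sampled can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: leaves are kept in a flat list and located by a linear scan (a real kd-tree would descend from the root), boxes are half-open, and the per-dimension value is stored as a running sum of squared deviations, as the "add (p_j - s_ij)^2" step suggests. The names `Leaf`, `find_leaf` and `record_point` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    center: tuple        # sample point s_i
    box: list            # [(lo, hi)] per dimension, half-open [lo, hi)
    rho: int = 0         # number of stream points counted in this leaf
    sq_dev: list = None  # running sums of (p_j - s_ij)^2, one per dimension

    def __post_init__(self):
        if self.sq_dev is None:
            self.sq_dev = [0.0] * len(self.center)

    def contains(self, p):
        return all(lo <= x < hi for x, (lo, hi) in zip(p, self.box))


def find_leaf(leaves, p):
    """Linear-scan stand-in for the kd-tree descent (assumption)."""
    for leaf in leaves:
        if leaf.contains(p):
            return leaf
    raise ValueError("point lies outside every leaf box")


def record_point(leaves, p, sign=+1):
    """Apply a non-sampled stream point to its leaf; sign=-1 undoes it."""
    leaf = find_leaf(leaves, p)
    leaf.rho += sign
    for j, (p_j, s_j) in enumerate(zip(p, leaf.center)):
        leaf.sq_dev[j] += sign * (p_j - s_j) ** 2
    return leaf
```

Calling `record_point` with `sign=-1` corresponds to deleting a stream point that is not a kernel center, where both counters are decremented by the same amounts.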

Slide 14: Delete leaf(q)
- u: parent node of leaf(q)
- v: sibling of leaf(q)
- box(u): axis-parallel hyper-rectangle of node u
- h(u): hyperplane orthogonal to a coordinate axis that divides box(u) into two smaller boxes associated with the children of u
- N(q), neighbors of leaf(q): leaves in the subtree of v that have one boundary contained in h(u)

Slide 15: Delete leaf(q)
- Redistribute points in leaf(q) to N(q)
  - Extend the bounding box of each neighbor of leaf(q) past h(u), until it hits the left boundary of leaf(q)
  - Update the $\rho$, $\sigma$ values for all leaves in N(q)
- Notation
  - $leaf(r) \in N(q)$
  - $box_e(r)$: the expanded box of r

Slide 16: Update $\rho$ of leaf(r)
- Update the $\rho$ value for every leaf $leaf(r) \in N(q)$
  - Compute the selectivity $sel(box_e(r))$ of $box_e(r)$ w.r.t. leaf(q)
  - $\rho_r = \rho_r + sel(box_e(r))$

Slide 17: Update $\sigma$ of leaf(r)
- Let $[\alpha_j, \beta_j]$ be the intersection of $box_e(r)$ and the kernel function of q along dimension j
- Discretize it by $\kappa$ equidistant points ($\kappa$ is a large constant): $\alpha_j = v_1, v_2, \ldots, v_\kappa = \beta_j$
- Update $\sigma_{rj}$ as follows: [formula not preserved in the transcript]
- $wt_i$ is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval $[v_i, v_{i+1}]$

Slide 18: Update $\sigma$ of leaf(r)
Figure: updating $\sigma_{rj}$ by discretizing the intersection of $box_e(r)$ and the kernel of q along dimension j (the gray area represents $wt_2$). All points in an interval are approximated by its midpoint.
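
The update formula itself is missing from the transcript. The sketch below is one plausible reading of the surrounding text, not the paper's formula: each interval [v_i, v_{i+1}] contributes its weight wt_i times the squared distance from the interval midpoint to the center of leaf(r) along dimension j, mirroring the per-point rule "add (p_j - s_ij)^2". The function name and the `weight` callback are hypothetical; how wt_i is derived from the kernel of q is left to the caller.

```python
def discretized_sq_dev(alpha, beta, kappa, center_j, weight):
    """Approximate the squared-deviation mass moved into leaf(r) along one dimension.

    alpha, beta: ends of the intersection interval [alpha_j, beta_j]
    kappa:       number of equidistant points v_1..v_kappa (kappa >= 2)
    center_j:    j-th coordinate of the center of leaf(r) (assumed reference point)
    weight:      callable (lo, hi) -> wt_i, the approximate number of points of
                 leaf(q) whose j-th coordinate lies in [lo, hi]
    """
    step = (beta - alpha) / (kappa - 1)
    total = 0.0
    for i in range(kappa - 1):
        lo = alpha + i * step
        hi = lo + step
        mid = 0.5 * (lo + hi)  # every point in the interval is moved to its midpoint
        total += weight(lo, hi) * (mid - center_j) ** 2
    return total
```

The returned value would be added to the running sum kept for leaf(r) in dimension j; a larger kappa makes the midpoint approximation tighter at the cost of more work per deleted leaf.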

Slide 19: Insert a leaf
- p: newly inserted point
- q: existing sample point such that $p \in leaf(q)$
- Split leaf(q) by a hyperplane
  - The hyperplane passes through the midpoint (p+q)/2
  - Direction: alternating rule of the kd-tree. If i is the splitting dimension for the parent of q, then the splitting dimension for q is (i+1) mod d
- Update the $\rho$ and $\sigma$ values for p and q using a procedure similar to the update used when deleting a leaf
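
The split rule can be sketched as below, assuming a leaf box is a list of (lo, hi) pairs and dimensions are numbered from 0. The function name `split_leaf` and the error raised when p and q coincide along the chosen dimension are assumptions; the slides do not say how that case is handled.

```python
def split_leaf(box, q, p, parent_dim):
    """Split leaf(q) so that p and q end up in separate boxes.

    box:        [(lo, hi)] per dimension, the current box of leaf(q)
    q, p:       existing sample point and newly inserted point, both inside box
    parent_dim: splitting dimension i used by the parent of leaf(q)
    Returns (dim, cut, box_for_q, box_for_p).
    """
    d = len(q)
    dim = (parent_dim + 1) % d       # alternating rule of the kd-tree
    if p[dim] == q[dim]:
        raise ValueError("p and q coincide along the splitting dimension")
    cut = 0.5 * (p[dim] + q[dim])    # hyperplane through the midpoint (p+q)/2
    lo, hi = box[dim]
    low_box, high_box = list(box), list(box)
    low_box[dim] = (lo, cut)
    high_box[dim] = (cut, hi)
    if q[dim] < p[dim]:
        return dim, cut, low_box, high_box
    return dim, cut, high_box, low_box
```

After the split, the counters of the two new leaves still have to be initialized by redistributing the old leaf's mass; that step is not shown here.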

Slide 20: Extension
- Allow deletion of a point p from the data stream
- If p is not a kernel center
  - Compute $leaf(s_i)$ such that $p \in leaf(s_i)$
  - $\rho_i = \rho_i - 1$
  - $\sigma_{ij} = \sigma_{ij} - (p_j - s_{ij})^2$
- If p is a kernel center
  - Delete leaf(p)
  - Replace p with a newly arriving point p'
  - This does not follow the sampling procedure and may make the sample non-uniform w.r.t. the points in D

Slide 21: Experiments
- Different numbers of dimensions
- Different query loads
  - Range selectivity
- Measurement: accuracy
- Trade-off between accuracy and space usage

Slide 22: Data
- Synthetic data, produced by the generator for projected clustering [Agg99]
  - SD2 (2D), SD4 (4D)
  - 1 million points; 90% are contained in clusters, 10% are uniformly distributed
- Real data: NM2
  - 1 million 2D points with real-valued attributes
  - Each point is an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T’s backbone network

Slide 23: Query loads
- 2 query workloads for each dataset
- Queries are chosen randomly in the attribute space
- Each workload contains 200 queries
- Each query in a workload has the same selectivity
  - 0.5% for the first workload (low selectivity)
  - 10% for the second (high selectivity)

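Slide 24: Accuracy measure
- For $Q_i = \langle i, R_i \rangle$, its relative error is $Err_i$ [formula not preserved in the transcript]
- Let $\{Q_{i_1}, \ldots, Q_{i_k}\}$ be the query workload; the average relative error of this workload is avg_err [formula not preserved in the transcript]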
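
Since the two formulas are missing here, the sketch below uses the usual definition of relative error, |sel - estimate| / sel, and a plain arithmetic mean over the workload. Both are assumptions about what the slide showed; the function names are hypothetical.

```python
def relative_error(true_sel, est_sel):
    """Relative error of one query; assumes true_sel > 0."""
    return abs(true_sel - est_sel) / true_sel


def average_relative_error(pairs):
    """Mean relative error over a workload of (true_sel, est_sel) pairs."""
    errors = [relative_error(t, e) for t, e in pairs]
    return sum(errors) / len(errors)
```

With this definition the same absolute error weighs more heavily on a low-selectivity query than on a high-selectivity one, which matches the earlier remark that plain sampling degrades when sel(Q) is small.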

Slide 25: Validating local kernels in an off-line setting (1)
- MPLKernels (multi-pass local kernels)
  - Scan the data once to get random sample points
  - Compute the kd-tree on them
  - Scan the data a second time to compute $\rho$ and $\sigma$
  - Only useful in an off-line setting
- GKernels (global kernels) [Gun00]
  - Kernel bandwidth: function of the global standard deviation of the data along each dimension
  - One-pass approximation → two-pass accurate computation
- Sample: random sampling
- LKernels: one-pass local kernels

Slide 26: Validating local kernels in an off-line setting (2)

Slide 27: Validating local kernels in an off-line setting (3)

Slide 28: Comparison with histogram methods (1)
- Histogram method [Tha02]
  - Faster heuristic: EGreedy

Slide 29: General online setting (1)
- Queries arrive interleaved with points
- Compare
  - Sample
  - LKernels
  - MPLKernels

Slide 30: General online setting (2)

Slide 31: General online setting (3)

Slide 32: General online setting (4)

Slide 33: References
[kd75] J. L. Bentley. Multidimensional Binary Search Trees Used for Associative Searching. Communications of the ACM, 18(9), September 1975.
[vit85] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, 11(1): 37-57, 1985.
[Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
[Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast Algorithms for Projected Clustering. In SIGMOD99, pages 61-72.
[Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating Multidimensional Aggregate Range Queries over Real Attributes. In SIGMOD00, pages 463-474.
[Geh01] J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams. In SIGMOD01.
[Gib01] P. Gibbons. Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports. In VLDB01.
[Gre01] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. In SIGMOD01.

Slide 34: References
[Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing Complex Aggregate Queries over Data Streams. In SIGMOD02.
[Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse Nearest Neighbor Aggregates over Data Streams. In VLDB02.
[Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic Multidimensional Histograms. In SIGMOD02, pages 428-439.
[Man02] G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In VLDB02.
[Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN Queries on Streams with Guaranteed Error/Performance Bounds. In VLDB04.
[Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling Algorithms in a Stream Operator. In SIGMOD05.

Slide 35: Appendix: reservoir sampling
This algorithm (called Algorithm X in Vitter’s paper) obtains a random sample of size n during a single pass through the relation. The number of tuples in the relation does not need to be known beforehand. The algorithm proceeds by inserting the first n tuples into a “reservoir.” Then a random number of records are skipped, and the next tuple replaces a randomly selected tuple in the reservoir. Another random number of records are then skipped, and so forth, until the last record has been scanned.
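
For illustration, here is a minimal sketch of reservoir sampling. It implements the simpler per-record variant commonly called Algorithm R, not the skip-based Algorithm X described above: every record after the first n draws one random number instead of skipping ahead. Both variants are designed to produce a uniform random sample of size n in one pass. The function name and the optional `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, n, rng=None):
    """Return a uniform random sample of n items from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream, start=1):
        if t <= n:
            reservoir.append(item)   # the first n items fill the reservoir
        else:
            k = rng.randrange(t)     # uniform integer in [0, t)
            if k < n:                # happens with probability n / t
                reservoir[k] = item  # evict a uniformly chosen resident
    return reservoir
```

In the online algorithm, the evicted item plays the role of the point q whose leaf is deleted, and the incoming item is the point p for which a new leaf is inserted.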

Slide 36: Appendix: kd-tree
Start from the root cell and recursively bisect the cells along their longest axis, so that an equal number of particles lie in each sub-volume.
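
The construction described above can be sketched as follows, assuming points are tuples of equal length, the "longest axis" is the dimension with the largest extent of the points in the cell, and recursion stops at one point per leaf. The nested-dict node layout and the names `build_kdtree` and `count_leaves` are hypothetical. This is the static appendix variant, not the midpoint-split rule used when inserting a leaf online.

```python
def build_kdtree(points):
    """Recursively bisect along the longest axis at the median point."""
    if len(points) <= 1:
        return {"leaf": True, "points": list(points)}
    d = len(points[0])
    extents = [
        max(p[j] for p in points) - min(p[j] for p in points) for j in range(d)
    ]
    axis = extents.index(max(extents))  # longest axis of this cell
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2             # halves differ by at most one point
    return {
        "leaf": False,
        "axis": axis,
        "cut": ordered[mid][axis],
        "left": build_kdtree(ordered[:mid]),
        "right": build_kdtree(ordered[mid:]),
    }


def count_leaves(node):
    """Number of leaves below a node; equals the number of points here."""
    if node["leaf"]:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])
```

Splitting at the median keeps the tree balanced, so locating the leaf that contains a point takes a number of comparisons logarithmic in the sample size.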