
1 DB group seminar — Density Estimation for Spatial Data Streams. Cecilia M. Procopiuc and Octavian Procopiuc, AT&T Shannon Labs, SSTD'05. Presented by: Huiping Cao

2 Outline
- Background
- Related work
- Problem definition
- Online algorithm
- Experiments
- References

3 Background
Streaming data: large volume, continuous arrival.
Data stream algorithms: one pass, small amount of space, fast updating.

4 Background: data stream model
Operations on elements:
- Insertion: the most common case
- Deletion
- Update: the most difficult case
Validity of elements over time:
- Whole history -> landmark window [Geh01], cash-register model [this paper]
- Partial recent history -> sliding window [Geh01], turnstile model [this paper]

5 Related work
Classified by operation:
- Aggregation: avg, max, min, sum [Dob02]; count of distinct values [Gib01]
- Quantiles in 1-D data [Gre01]
- Frequent items (heavy hitters) [Man02]
- Query estimation: join size estimation [Dob02]
- K-nearest-neighbor searching: eKNN [Kou04], RNN aggregation [Kor02]
Techniques: histograms [Tha02], samples [Joh05], special synopses, ...

6 Problem definition
Data stream D = <p_1, p_2, ...>, where each p_i ∈ R^d (cash-register model).
Query Q_i = <i, R_i>, where R_i is a d-dimensional hyper-rectangle.
Selectivity of Q_i: sel(Q_i) = |{p_j | j ≤ i, p_j ∈ R_i}|, i.e., the points that arrived by time step i and lie in R_i.
Problem: estimate sel(Q_i). Measurement: relative error.
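As a point of reference, the exact quantity the paper's streaming estimator approximates can be computed by brute force over the prefix of the stream. The function below is an illustrative sketch, not taken from the paper.

```python
# Brute-force range selectivity over the stream prefix seen so far
# (cash-register model: points are only inserted).

def selectivity(points, rect):
    """points: list of d-dimensional tuples seen so far.
    rect: list of (lo, hi) pairs per dimension, the hyper-rectangle R_i."""
    def inside(p):
        return all(lo <= x <= hi for x, (lo, hi) in zip(p, rect))
    return sum(1 for p in points if inside(p))

stream = [(1, 1), (2, 3), (5, 5), (0, 4)]
print(selectivity(stream, [(0, 2), (0, 4)]))  # (1,1), (2,3), (0,4) -> 3
```

This is linear in the stream length and needs all points stored, which is exactly what a one-pass, small-space streaming algorithm must avoid.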

7 Online algorithm: rough steps
1. Obtain a random sample S of D using the reservoir sampling method [vit85].
2. Index the sample points with a kd-tree-like structure [kd75].
3. Maintain the sample and the kd-tree-like structure online.
4. Compute the range selectivity estimated_sel(Q) using a kernel density estimator.

8 Random sampling
Theorem 1: Let T be the data stream seen so far, with size |T|, and let S ⊆ T be a random sample chosen via the reservoir sampling technique such that |S| = Θ((d/ε²)·log(1/ε) + log(1/δ)), where 0 < ε, δ < 1 and |S| is the size of S. Then with probability 1 − δ, for any axis-parallel hyper-rectangle Q:
| sel(Q)/|T| − sel(Q, S)/|S| | ≤ ε,
where sel(Q) = |Q ∩ T| is the selectivity of Q with respect to the data stream seen so far, and sel(Q, S) = |Q ∩ S| is the selectivity of Q with respect to the random sample.

9 Sampling
Problem with plain random sampling: the smaller sel(Q) is, the larger the relative error.
A better selectivity estimator: the kernel density estimator.

10 Kernel density estimator
S = {s_1, ..., s_m}: random subset of D.
The estimator evaluates, at any point x = (x_1, ..., x_d), a sum of kernel contributions centered at the d-dimensional sample points s_i = (s_i1, ..., s_id).
B_j: kernel bandwidth along dimension j [Sco92], a global parameter.
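The estimator's formula is on a slide image and does not survive in the transcript; the sketch below assumes the standard multivariate product-kernel form with a 1-D Epanechnikov kernel (a common choice in this line of work, e.g. [Gun00]), which may differ in detail from the paper's exact kernel.

```python
import math

def epanechnikov(u):
    # 1-D Epanechnikov kernel (assumed choice; the paper's exact kernel
    # is not recoverable from the transcript).
    return 0.75 * (1.0 - u * u) if abs(u) < 1.0 else 0.0

def kde(x, samples, B):
    """Product-kernel density estimate at point x.
    samples: list of d-dimensional sample points s_i.
    B: per-dimension bandwidths B_j (global parameters)."""
    m, d = len(samples), len(x)
    total = 0.0
    for s in samples:
        total += math.prod(epanechnikov((x[j] - s[j]) / B[j]) for j in range(d))
    return total / (m * math.prod(B))
```

A range-selectivity estimate then integrates each kernel over the query rectangle instead of evaluating it at a single point; with a polynomial kernel like this one, each 1-D integral has a closed form.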

11 One-dimensional kernels
(Figure: (a) kernel function, B = 1; (b) contribution of multiple kernels to the estimate of a range query.)

12 Local kernel density estimator
kd-tree structure T(S): an index over the sample data.
- Each leaf contains one point s_i; denote it leaf(s_i).
- Any two leaves are disjoint; the union of all leaves is R^d.
Each leaf maintains d + 1 values: κ_i, σ_i1, σ_i2, ..., σ_id.
- σ_ij approximates the standard deviation of the points in the cell centered at s_i along dimension j.
R = [a_1, b_1] × ... × [a_d, b_d]; T_i: the subset of stream points in tree leaf leaf(s_i).

13 Update T(S)
Purpose: maintain κ_i and σ_ij (1 ≤ j ≤ d), where κ_i is the number of stream points contained in leaf(s_i).
Let p be the current point in the data stream.
If p is not selected for S by the sampling algorithm:
- Find the leaf that contains p, leaf(s_i).
- Increment κ_i.
- Add (p_j − s_ij)² to σ_ij.
If p is selected for S:
- A point q is deleted from S.
- Delete leaf(q).
- Add a new leaf corresponding to p.
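The common case (p not sampled) reduces to a counter bump and a per-dimension accumulation. The sketch below is illustrative; the leaf layout (a dict with `center`, `kappa`, `sigma`) is an assumption, not the paper's data structure.

```python
# Counter update when an arriving stream point p is NOT chosen for the
# sample: locate its leaf, increment kappa, accumulate squared deviations.

def update_leaf(leaf, p):
    s = leaf["center"]            # kernel center s_i of this leaf
    leaf["kappa"] += 1            # kappa_i: stream points seen in leaf(s_i)
    for j, pj in enumerate(p):    # sigma_ij += (p_j - s_ij)^2
        leaf["sigma"][j] += (pj - s[j]) ** 2

leaf = {"center": (1.0, 2.0), "kappa": 0, "sigma": [0.0, 0.0]}
update_leaf(leaf, (2.0, 4.0))
print(leaf["kappa"], leaf["sigma"])  # 1 [1.0, 4.0]
```

Both updates are O(d) per point, which is what makes the structure maintainable at stream rates.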

14 Delete leaf(q)
u: parent node of leaf(q); v: sibling of leaf(q).
box(u): the axis-parallel hyper-rectangle of node u.
h(u): the hyperplane, orthogonal to a coordinate axis, that divides box(u) into the two smaller boxes associated with the children of u.
N(q), the neighbors of leaf(q): the leaves in the subtree of v that have one boundary contained in h(u).

15 Delete leaf(q)
Redistribute the points of leaf(q) to N(q): extend the bounding box of each neighbor of leaf(q) past h(u) until it hits the left boundary of leaf(q).
Update the κ and σ values for all leaves in N(q).
Notation: for each leaf(r) ∈ N(q), box_e(r) is the expanded box of r.

16 Update κ of leaf(r)
Update the κ value for every leaf leaf(r) ∈ N(q):
- Compute the selectivity sel(box_e(r)) of box_e(r) with respect to leaf(q).
- κ_r = κ_r + sel(box_e(r)).

17 Update σ of leaf(r)
Let [α_j, β_j] be the intersection of box_e(r) and the kernel function of q along dimension j. Discretize it by λ equidistant points (λ a large constant): α_j = v_1, v_2, ..., v_λ = β_j.
Update σ_rj accordingly, where wt_i is the approximate number of points of leaf(q) whose j-th coordinate lies in the interval [v_i, v_{i+1}].

18 Update σ of leaf(r)
(Figure: updating σ_rj by discretizing the intersection of box_e(r) and the kernel of q along dimension j; the gray area represents wt_2.)
All points in an interval are approximated by its midpoint.
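The discretized σ update described on the last two slides can be sketched as follows. The exact formula is on a slide image, so this assumes the natural reading: each interval contributes its weight times the squared distance from its midpoint to the leaf center. `weight_in` is a hypothetical stand-in for evaluating q's kernel mass over an interval.

```python
# Discretized sigma update for a neighbor leaf r along dimension j.
# [alpha, beta] is the intersection of box_e(r) with q's kernel support,
# split into lam equal intervals; each interval's points are approximated
# by its midpoint, weighted by wt_i = weight_in(v_i, v_{i+1}).

def update_sigma_j(sigma_rj, r_center_j, alpha, beta, lam, weight_in):
    step = (beta - alpha) / lam
    for i in range(lam):
        lo = alpha + i * step
        mid = lo + step / 2.0            # midpoint stands in for all points
        wt = weight_in(lo, lo + step)    # ~ number of leaf(q) points there
        sigma_rj += wt * (mid - r_center_j) ** 2
    return sigma_rj
```

With a uniform weight of 1 per interval over [0, 2] with λ = 2 and center 0, the midpoints are 0.5 and 1.5, so σ grows by 0.25 + 2.25 = 2.5.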

19 Insert a leaf
p: the newly inserted point; q: the existing sample point such that p ∈ leaf(q).
Split leaf(q) by a hyperplane that:
- passes through the midpoint (p + q)/2;
- follows the alternating rule of the kd-tree: if i is the splitting dimension for the parent of q, then the splitting dimension for q is (i + 1) mod d.
Update the κ and σ values for p and q using a procedure similar to the updates above.
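The split rule above can be sketched in a few lines. The node layout is illustrative, not the paper's structure; only the cut position (midpoint along the chosen axis) and the alternating-dimension rule come from the slide.

```python
# Split leaf(q) when a new sample point p lands in it: cut along
# dimension (parent_dim + 1) mod d, through the midpoint (p+q)/2.

def split_leaf(q, p, parent_dim, d):
    dim = (parent_dim + 1) % d
    cut = (p[dim] + q[dim]) / 2.0
    low, high = (q, p) if q[dim] <= p[dim] else (p, q)
    return {"dim": dim, "cut": cut, "left": low, "right": high}

node = split_leaf((0.0, 0.0), (4.0, 2.0), parent_dim=1, d=2)
print(node["dim"], node["cut"])  # 0 2.0
```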

20 Extension
Allow deletion of a point p from the data stream.
If p is not a kernel center:
- Compute leaf(s_i) such that p ∈ leaf(s_i).
- κ_i = κ_i − 1.
- σ_ij = σ_ij − (p_j − s_ij)².
If p is a kernel center:
- Delete leaf(p).
- Replace p with a newly arriving point p'.
This does not follow the sampling procedure and may make the sample non-uniform with respect to the points in D.

21 Experiments
Varied: number of dimensions, query loads, range selectivity.
Measurements: accuracy, and the trade-off between accuracy and space usage.

22 Data
Synthetic data: SD2 (2-D) and SD4 (4-D), produced by the projected-clustering generator of [Agg99]; 1 million points, 90% contained in clusters, 10% uniformly distributed.
Real data: NM2, 1 million 2-D points with real-valued attributes. Each point is an aggregate of measurements taken in a 15-minute interval, reflecting minimum and maximum delay times between pairs of servers on AT&T's backbone network.

23 Query loads
Two query workloads for each dataset. Queries are chosen randomly in the attribute space; each workload contains 200 queries, and every query in a workload has the same selectivity: 0.5% for the first workload (low selectivity) and 10% for the second (high selectivity).

24 Accuracy measure
For a query Q_i = <i, R_i>, its relative error is Err_i. Let {Q_i1, ..., Q_ik} be the query workload; the average relative error of this workload is avg_err, the mean of the Err values over the k queries.
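The slide's formulas for Err_i and avg_err are images and do not survive in the transcript; the sketch below assumes the standard definitions (relative error as |estimate − truth| / truth, averaged over the workload), which may differ in detail from the paper's.

```python
def avg_rel_err(estimates, truths):
    # Assumed standard definitions: Err_i = |est_i - sel(Q_i)| / sel(Q_i),
    # avg_err = mean of Err_i over the k queries in the workload.
    errs = [abs(e, ) if False else abs(e - t) / t for e, t in zip(estimates, truths)]
    return sum(errs) / len(errs)

print(avg_rel_err([90.0, 110.0], [100.0, 100.0]))  # 0.1
```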

25 Validating local kernels in an off-line setting (1)
Methods compared:
- MPLKernels (multi-pass local kernels): scan the data once to get the random sample points; compute the kd-tree on them; scan the data a second time to compute κ and σ exactly. Only usable in an off-line setting.
- GKernels (global kernels) [Gun00]: kernel bandwidth is a function of the global standard deviation of the data along each dimension; one-pass approximation or two-pass exact computation.
- Sample: plain random sampling.
- LKernels: the one-pass local kernels of this paper.

26 Validating local kernels in an off-line setting (2)

27 Validating local kernels in an off-line setting (3)

28 Comparison with histogram methods (1)
Histogram method [Tha02] and its faster heuristic, EGreedy.

29 General online setting (1)
Queries arrive interleaved with points. Compared methods: Sample, LKernels, MPLKernels.

30 General online setting (2)

31 General online setting (3)

32 General online setting (4)

33 References
[kd75] J. L. Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9), September 1975.
[vit85] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.
[Sco92] D. W. Scott. Multivariate Density Estimation. Wiley-Interscience, 1992.
[Agg99] C. C. Aggarwal, C. M. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park. Fast algorithms for projected clustering. In SIGMOD 1999, pages 61-72.
[Gun00] D. Gunopulos, G. Kollios, V. J. Tsotras, and C. Domeniconi. Approximating multidimensional aggregate range queries over real attributes. In SIGMOD 2000, pages 463-474.
[Geh01] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In SIGMOD 2001.
[Gib01] P. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB 2001.
[Gre01] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD 2001.

34 References (continued)
[Dob02] A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In SIGMOD 2002.
[Kor02] F. Korn, S. Muthukrishnan, and D. Srivastava. Reverse nearest neighbor aggregates over data streams. In VLDB 2002.
[Tha02] N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In SIGMOD 2002, pages 428-439.
[Man02] G. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB 2002, pages 346-357.
[Kou04] N. Koudas, B. C. Ooi, K.-L. Tan, and R. Zhang. Approximate NN queries on streams with guaranteed error/performance bounds. In VLDB 2004, pages 804-815.
[Joh05] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling algorithms in a stream operator. In SIGMOD 2005.

35 Appendix: reservoir sampling
This algorithm (called Algorithm X in Vitter's paper) obtains a random sample of size n during a single pass through the relation; the number of tuples in the relation does not need to be known beforehand. The algorithm begins by inserting the first n tuples into a "reservoir." Then a random number of records is skipped, and the next tuple replaces a randomly selected tuple in the reservoir. Another random number of records is then skipped, and so forth, until the last record has been scanned.
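A minimal sketch of reservoir sampling. For simplicity this is the per-record variant (Vitter's Algorithm R), which yields the same sample distribution as the Algorithm X described above; Algorithm X gains speed by computing how many records to skip instead of drawing a random number for every record.

```python
import random

def reservoir_sample(stream, n, rng=random):
    """One-pass uniform sample of size n from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= n:
            reservoir.append(item)      # fill the reservoir with the first n items
        else:
            j = rng.randrange(i)        # item i survives with probability n/i
            if j < n:
                reservoir[j] = item
    return reservoir
```

After the whole stream is consumed, every item has probability n/N of being in the reservoir, which is what Theorem 1's sampling step requires.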

36 Appendix: kd-tree
Start from the root cell and recursively bisect cells through their longest axis, so that an equal number of points lies in each sub-volume.
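The construction in this appendix can be sketched as a short recursion. The node layout is illustrative; only the rule (cut the longest axis so the halves hold equally many points) comes from the slide.

```python
# kd-tree construction sketch: recursively bisect each cell through its
# longest axis at the median point, so the two halves hold (nearly)
# equal numbers of points.

def build_kdtree(points, bounds):
    """points: list of d-dim tuples; bounds: list of (lo, hi) per dimension."""
    if len(points) <= 1:
        return {"points": points}
    # choose the longest axis of the current cell
    axis = max(range(len(bounds)), key=lambda j: bounds[j][1] - bounds[j][0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    cut = points[mid][axis]
    lb, rb = list(bounds), list(bounds)
    lb[axis] = (bounds[axis][0], cut)
    rb[axis] = (cut, bounds[axis][1])
    return {"axis": axis, "cut": cut,
            "left": build_kdtree(points[:mid], lb),
            "right": build_kdtree(points[mid:], rb)}
```

Note that the paper's online structure differs: it splits at the midpoint between two sample points and alternates dimensions, rather than using the longest axis and the median.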

