Presentation transcript:


• RNN(q) returns the set of data points that have the query point q as their nearest neighbor.
• Advanced database applications: fixed wireless telephone access. In the "load" detection problem, count how many users are currently using a specific base station q; if q's load is too heavy, activate an inactive base station to lighten the load of the overloaded station.
• Asymmetric property: the nearest neighbor relation is not symmetric. The set of points closest to a query point (its nearest neighbors) differs from the set of points that have the query point as their nearest neighbor (its reverse nearest neighbors).

$NN(q) = p$, but $NN(p) \neq q$
If p is the nearest neighbor of q, then q need not be the nearest neighbor of p (in this case the nearest neighbor of p is r).
Efficient NN algorithms therefore cannot be applied directly to RNN problems; dedicated RNN algorithms are needed.
A straightforward solution: check for each point whether it has q as its nearest neighbor. This is not suitable for large data sets.
[Figure: points q, p and r]
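The straightforward solution above can be sketched in a few lines. This is only an illustration: the function names and the three sample coordinates are assumptions chosen to reproduce the q, p, r situation, not part of the original slides.

```python
from math import dist

def nearest_neighbor(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: dist(points[i], points[j]))

def rnn_brute_force(points, q):
    """Indices of all points whose nearest neighbor is index q.

    Every point is compared against every other point, so the cost is
    quadratic in the number of points.
    """
    return [i for i in range(len(points))
            if i != q and nearest_neighbor(points, i) == q]

# Hypothetical layout: p is the nearest neighbor of q, but the nearest
# neighbor of p is r, so q has no reverse nearest neighbors.
points = [(0.0, 0.0), (3.0, 0.0), (4.0, 0.0)]  # q, p, r
Q, P, R = 0, 1, 2
nn_of_q = nearest_neighbor(points, Q)
nn_of_p = nearest_neighbor(points, P)
rnn_of_q = rnn_brute_force(points, Q)
rnn_of_p = rnn_brute_force(points, P)
```

Here `rnn_of_q` is empty while `rnn_of_p` contains both q and r, which shows the asymmetry directly.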

• Bichromatic version: the data points fall into two categories, say red and blue. The RNN query point q belongs to one category, say blue, so RNN(q) must determine the red points that have q as their closest blue point. Example, fixed wireless telephone access: clients are red (e.g. call initiation or termination) and servers are blue (e.g. fixed wireless base stations).
• Monochromatic version: all points are of the same color.

• RNN queries have been studied for finite, stored data sets.
• RNN can identify the "influence" of a data point on the database.
[F. Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries]
[I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi. Discovery of Influence Sets in Frequently Updated Databases]
[C. Yang, K.-I. Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries]

• Finding the set of customers affected by the opening of a new store outlet location
• Notifying the subset of subscribers to a digital library who will find a newly added document most relevant
• Finding the set of users whose profiles are more similar to a new service offering than to any other service

• Fixed physical position
• Defined coverage area
• Calls arrive in streams
• Worst-case "signal strength": RNN MAXDIST
• "Load" on base station: RNN COUNT
• Optimization RNNA problems

• Fixed physical position
• Detect vehicles, estimate speed and length
• User queries arrive in streams
• Periodic updates of closest sensor
• "Load" on sensor: RNN COUNT
• "Accuracy" of information: RNN MAXDIST
• Optimization RNNA problems

• Max-RNNA: given K servers, return the maximum RNNA over all clients to any of the servers.
• List-RNNA: given K servers, return the RNNA over all clients to each of the servers.
• Opt-RNNA: find a set of at most K servers whose RNNAs are below a given threshold.

• Max-RNN-COUNT
  Insertion and deletion: 3-approximation
  Insertion only: $(1+\epsilon)$-approximation
• Max-RNN-MAXDIST: $(1+\epsilon)$-approximation
• List-RNN-COUNT and List-RNN-MAXDIST: lower and upper bounds as a function of the true counts
• Opt-RNN-COUNT: 8-approximation
• Opt-RNN-MAXDIST: $(1+\epsilon)$-approximation
Space: near-linear in the number of available servers

• No previous work on RNNA over data streams
• Algorithms over data streams
• Algorithms for computing RNN over a conventional DB

1. Space requirements of selection and sorting as a function of the number of passes over the data
[J. I. Munro and M. S. Paterson. Selection and Sorting with Limited Storage]
2. Formalization of the data stream model
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries]
[M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on Data Streams]

3. Computing the approximate median and other quantiles in a single pass over the data set
[R. Agrawal, A. Swami. A One-Pass Space-Efficient Algorithm for Finding Quantiles]
[G.S. Manku, S. Rajagopalan, B.G. Lindsay. Approximate Medians and other Quantiles in One Pass and with Limited Memory]
[G.S. Manku, S. Rajagopalan, B.G. Lindsay. Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets]
[M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries]

4. Computing approximate online quantiles with probabilistic guarantees over data streams
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles]
5. Histogram construction over data streams
[A.C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Fast, Small-Space Algorithms for Approximate Histogram Maintenance]

6. Maintaining summary structures for approximate aggregates over data streams
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries]
[M. R. Henzinger, P. Raghavan, S. Rajagopalan. Computing on Data Streams]
[J. Gehrke, F. Korn, and D. Srivastava. On Computing Correlated Aggregates over Continual Data Streams]

7. Construction of decision trees
[P. Domingos, G. Hulten. Mining High-Speed Data Streams]
[J. Gehrke, V. Ganti, R. Ramakrishnan, W.-Y. Loh. BOAT: Optimistic Decision Tree Construction]
8. Association rules
[C. Hidber. Online Association Rule Mining]
9. Similarity matching
[G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms]

10. Clustering algorithms (k-median clustering problem)
[M. Charikar, C. Chekuri, T. Feder, R. Motwani. Incremental Clustering and Dynamic Information Retrieval]
[S. Guha, N. Mishra, R. Motwani, L. O'Callaghan. Clustering Data Streams]

11. Lp norms
[P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation]
12. Hamming norms
[G. Cormode, M. Datar, P. Indyk, S. Muthukrishnan. Comparing Data Streams Using Hamming Norms]
13. Quantiles
[A.C. Gilbert, Y. Kotidis, S. Muthukrishnan, M.J. Strauss. How to Summarize the Universe: Dynamic Maintenance of Quantiles]
14. Sliding window
[M. Datar. Maintaining Stream Statistics over Sliding Windows]

15. Study of RNN in databases
[F. Korn and S. Muthukrishnan. Influence Sets Based on Reverse Nearest Neighbor Queries]
16. Efficient access methods for indexing RNN
[I. Stanoi, M. Riedewald, D. Agrawal, A. El Abbadi. Discovery of Influence Sets in Frequently Updated Databases]
[C. Yang, K.-I. Lin. An Index Structure for Efficient Reverse Nearest Neighbor Queries]

Collection of n available servers (not necessarily active)
$l_i$: location of server i
Clients arrive and depart
$L_j$: location of client j
The RNN of server i is the set of all clients that have i as their NN server

• RNN-COUNT(i): the number of clients currently in the system for which i is the NN; the "load" for active servers.
• RNN-MAXDIST(i): the largest distance to a client that has i as its NN; the "quality" for active servers.
• Streams of clients are large and cannot be stored in memory, so approximate RNNA values are computed.

• Max-RNNA: given K active servers, return the maximum RNNA over all clients to their closest active server ("worst-case load" or "quality").
• List-RNNA: given K active servers, return a list of the RNNA over all clients to each of the K active servers ("maximum load" or "worst-case quality").
• Opt-RNNA: find a set of at most K servers from the available ones to be active, for which their RNNAs are below a given threshold ("optimization").
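As a reference point for the definitions above, the following sketch computes the exact COUNT and MAXDIST aggregates for servers and clients on a line. It keeps every client in memory, which is exactly what the streaming setting rules out, so it is only a baseline for checking approximate answers. The function names and sample locations are assumptions for illustration.

```python
from bisect import bisect_left

def nearest_active(active, x):
    """Index into the sorted list `active` of the server nearest to x."""
    k = bisect_left(active, x)
    candidates = [c for c in (k - 1, k) if 0 <= c < len(active)]
    return min(candidates, key=lambda c: abs(active[c] - x))

def list_rnna(active, clients):
    """Exact List-RNNA: (COUNT, MAXDIST) for each active server."""
    count = [0] * len(active)
    maxdist = [0.0] * len(active)
    for x in clients:
        s = nearest_active(active, x)
        count[s] += 1
        maxdist[s] = max(maxdist[s], abs(active[s] - x))
    return count, maxdist

active = [0.0, 5.0, 10.0]             # sorted locations of the active servers
clients = [1.0, 2.0, 4.5, 9.0, 6.0]   # client locations
count, maxdist = list_rnna(active, clients)

# Max-RNNA is the maximum of the per-server aggregate.
max_count = max(count)
max_maxdist = max(maxdist)
```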

Assumption: servers are on a straight line.
Counters for servers i, j and client k:
$CL_{ij}$ counts clients with $L_k \in [l_i, (l_i + l_j)/2)$
$CR_{ij}$ counts clients with $L_k \in ((l_i + l_j)/2, l_j]$

The algorithm: let l be the closest active server to the left of i and r the closest active server to the right.
$\text{RNN-COUNT}(i) = CL_{il} + CR_{ir}$
This requires $O(n^2)$ space and $O(n^2)$ updates.
We want near-linear space and fewer updates, so approximation is needed.

Definitions: $s_1, \dots, s_K$ are the K servers designated to be active.
Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Counter of the number of clients for server i: C(i) counts clients with $L_k \in [l_i, l_{i+1})$, at the right side of server i; C(0) counts clients at the left side of server 1.
Requires O(n) space and O(log n) updates (to look up the wanted server).
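A minimal sketch of the per-interval counters C(0), ..., C(n) described above, assuming Python's `bisect` for the O(log n) lookup. The class and method names are hypothetical, and `count_between` is an added helper for illustration rather than something stated on the slides.

```python
from bisect import bisect_right

class IntervalCounters:
    """C(i) = number of current clients in [l_i, l_{i+1}); C(0) = left of l_1."""

    def __init__(self, servers):
        self.loc = sorted(servers)            # l_1 <= ... <= l_n
        self.c = [0] * (len(self.loc) + 1)    # C(0) .. C(n)

    def _slot(self, x):
        # Binary search for i with l_i <= x < l_{i+1}; returns 0 if x < l_1.
        return bisect_right(self.loc, x)

    def insert(self, x):
        self.c[self._slot(x)] += 1

    def delete(self, x):
        self.c[self._slot(x)] -= 1

    def count_between(self, i, j):
        """Clients in [l_i, l_j) for 1-based server indices i < j."""
        return sum(self.c[i:j])

ic = IntervalCounters([0.0, 5.0, 10.0, 15.0])
for x in (1.0, 2.0, 4.5, 9.0, 6.0, 12.0):
    ic.insert(x)
ic.delete(2.0)                        # a client departs
between_1_and_3 = ic.count_between(1, 3)
```

If servers 1, 2 and 3 are the active ones, every client served by server 2 lies between servers 1 and 3, so `between_1_and_3` is an upper bound on RNN-COUNT of server 2. This observation is given only as an illustration and is not necessarily the approximation used in the slides.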

$\text{Max-RNNA}(s_1, \dots, s_K) = \max_i \text{RNN-COUNT}(s_i)$

[Figure: counters C(0) to C(4) along the line of servers, with RNN-COUNT($s_0$) and RNN-COUNT($s_1$) marked]


$M_i$ for each $s_i$
The proof is similar to that of the previous theorem.

The greedy algorithm finds the minimal number K of active servers such that
$\max_i \text{RNN-COUNT}(s_i) \le C$


Given an upper bound K on the number of servers, minimize $\max_i \text{RNN-COUNT}(s_i)$.
Algorithm:
1. Choose different values of C.
2. Run the greedy algorithm of Opt-RNNA.
3. Repeat until a solution is found with number of servers $K^* \le K$.
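One way to organise the "choose different values of C" loop is a binary search over integer thresholds. The slides do not say how the values are chosen, so binary search is a swapped-in technique here, and it assumes that the number of servers the greedy step needs never increases as C grows. `min_servers_for` is a hypothetical callback standing in for the greedy algorithm, and `toy_oracle` is a stand-in used only to make the sketch runnable.

```python
from math import ceil

def smallest_feasible_threshold(min_servers_for, k, c_max):
    """Smallest integer C in [1, c_max] with min_servers_for(C) <= k, else None.

    Assumes min_servers_for is non-increasing in C.
    """
    if min_servers_for(c_max) > k:
        return None                    # even the loosest threshold needs too many servers
    lo, hi = 1, c_max
    while lo < hi:
        mid = (lo + hi) // 2
        if min_servers_for(mid) <= k:
            hi = mid                   # feasible: try a tighter threshold
        else:
            lo = mid + 1               # infeasible: loosen the threshold
    return lo

total_clients = 100

def toy_oracle(c):
    # Stand-in only, NOT the greedy algorithm from the slides:
    # pretends clients can be split evenly among servers.
    return ceil(total_clients / c)

best_c = smallest_feasible_threshold(toy_oracle, k=8, c_max=total_clients)
no_solution = smallest_feasible_threshold(toy_oracle, k=0, c_max=total_clients)
```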

Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Counter of the number of clients for server i: C(i) counts clients with $L_k \in [l_i, l_{i+1})$, at the right side of server i; C(0) counts clients at the left side of server 1.
Maintain l-quantiles (Greenwald & Khanna) $c_i^1, \dots, c_i^l$: the number of clients lying in $[l_i, L_{c_i^k}]$ is within $(1 \pm \epsilon)\,kC(i)/l$, where $1 \le k \le l$.
Requires $O(\log C(i) / \epsilon)$ space.

$\text{Max-RNNA}(s_1, \dots, s_K) = \max_i \text{RNN-COUNT}(s_i)$


Implementation: in the same way.
Open question: how is the data structure maintained under deletion?

The algorithm: a histogram based on space partitioning.
Assumption: servers are sorted, $l_1 \le \dots \le l_n$.
Exponentially sized buckets.
Domain size U, such that $U = [\min(L_j, l_i), \max(L_j, l_i)]$.
Dividers between servers i and i+1: $g_i^j$ at distance $(1+\epsilon)^j$ from $l_i$.
The number of dividers is $O(\log_{1+\epsilon}(l_{i+1} - l_i))$.

The counter for the number of clients between $g_i^k$ and $g_i^{k+1}$ is $\#g_i^k$.
For an update of client j:
• Find i such that $L_j \in [l_i, l_{i+1})$
• Find k such that $L_j \in [g_i^k, g_i^{k+1})$
• Update the value $\#g_i^k$
Requires $O(n \log_{1+\epsilon} U)$ space and $O(\log_{1+\epsilon} U)$ updates.
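The update steps above can be sketched as follows. Several details are assumptions made for illustration: the class and method names, 0-based server indices, a catch-all bucket `k = -1` for clients closer than distance 1 to the server on their left, and the helper `right_maxdist_bound`, which reads off the outer edge of the farthest non-empty bucket.

```python
from bisect import bisect_right
from collections import defaultdict
from math import floor, log

class ExpBuckets:
    """Counters #g_i^k over buckets [(1+eps)^k, (1+eps)^(k+1)) right of server i."""

    def __init__(self, servers, eps):
        self.loc = sorted(servers)
        self.base = 1.0 + eps
        self.cnt = defaultdict(int)    # (i, k) -> number of clients in that bucket

    def _key(self, x):
        i = bisect_right(self.loc, x) - 1      # 0-based server with loc[i] <= x
        if i < 0:
            raise ValueError("client lies left of the first server")
        d = x - self.loc[i]
        if d < 1.0:
            return i, -1                       # assumed catch-all for distances below 1
        k = floor(log(d) / log(self.base))
        # Guard against floating-point rounding at bucket boundaries.
        while self.base ** (k + 1) <= d:
            k += 1
        while self.base ** k > d:
            k -= 1
        return i, k

    def insert(self, x):
        self.cnt[self._key(x)] += 1

    def delete(self, x):
        key = self._key(x)
        self.cnt[key] -= 1
        if self.cnt[key] == 0:
            del self.cnt[key]

    def right_maxdist_bound(self, i):
        """Upper bound on the largest client distance to the right of server i."""
        ks = [k for (s, k) in self.cnt if s == i]
        return self.base ** (max(ks) + 1) if ks else 0.0

b = ExpBuckets([0.0, 100.0], eps=1.0)   # eps = 1 gives buckets [1,2), [2,4), [4,8), ...
for x in (3.0, 20.0, 0.5):
    b.insert(x)
bound_before = b.right_maxdist_bound(0)  # true maximum distance is 20
b.delete(20.0)
bound_after = b.right_maxdist_bound(0)   # true maximum distance is now 3
```

For clients at distance 1 or more, the reported bound exceeds the true maximum by a factor of at most $(1+\epsilon)$, because both lie in the same bucket. Only distances measured from the server on the left are covered; the symmetric structure for the server on the right is omitted.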

$\text{Max-RNNA}(s_1, \dots, s_K) = \max_i \text{RNN-MAXDIST}(s_i)$

Details of the proof will be given in a future paper.

$D_i = \max\{RD_i, LD_i\}$ for each $s_i$
The proof is similar to that of the previous theorem.

The greedy algorithm with limited backtracking finds the minimal number K of active servers such that
$\max_i \text{RNN-MAXDIST}(s_i) \le D$

The proof will be given in a future paper.

Given an upper bound K on the number of servers, minimize $\max_i \text{RNN-MAXDIST}(s_i)$.
Algorithm:
1. Choose different values of D.
2. Run the greedy algorithm of Opt-RNNA.
3. Repeat until a solution is found with number of servers $K^* \le K$.

Assumption: the clients are on the same axis as the servers.
[R. Benetis, C.S. Jensen, G. Karciauskas, S. Saltenis. Nearest Neighbor and Reverse Nearest Neighbor Queries for Moving Objects]
[SHOU Yu Tao. Reverse Nearest Neighbor Queries for Dynamic Databases]
Let the space around a query point q be divided into six equal regions $S_i$ ($1 \le i \le 6$) by straight lines intersecting at q, so that each $S_i$ is the space between two dividing lines. For a given 2-dimensional data set, RNN(q) returns at most six data points; if there are exactly six, they must lie on the same circle centered at q.
[Figure: regions S1 to S6 around q, separated by lines L1, L2 and L3]
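The six-region property suggests a filter-and-verify procedure for the monochromatic case in two dimensions: keep only the point nearest to q in each 60-degree region as a candidate, then check whether q really is each candidate's nearest neighbor. The sketch below is an illustration under assumptions: the function names and sample points are hypothetical, q is not itself a data point, ties count in favor of q, and points lying exactly on a dividing line are not treated specially.

```python
from math import atan2, dist, pi

def sector(q, p):
    """Index 0..5 of the 60-degree region around q that contains p."""
    angle = atan2(p[1] - q[1], p[0] - q[0]) % (2 * pi)
    return min(int(angle // (pi / 3)), 5)   # clamp in case of rounding up to 2*pi

def rnn_six_regions(points, q):
    """Candidate per region (filter), then nearest-neighbor check (verify)."""
    nearest = {}
    for p in points:
        s = sector(q, p)
        if s not in nearest or dist(q, p) < dist(q, nearest[s]):
            nearest[s] = p
    result = []
    for c in nearest.values():
        d_q = dist(c, q)
        # c is a reverse nearest neighbor if no other data point is closer to c than q.
        if all(dist(c, o) >= d_q for o in points if o != c):
            result.append(c)
    return result

q = (0.0, 0.0)
points = [(1.0, 0.0), (2.0, 0.1), (-3.0, 0.5), (0.0, 4.0), (0.5, 4.2)]
rnn = rnn_six_regions(points, q)
```

At most six candidates survive the filter step, so the verify step runs at most six nearest-neighbor checks regardless of the data set size. In this sketch each check is a linear scan; an index for NN queries would replace it in practice.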


The following aspects were tested:
Experimental data: CALIFORNIA (latitude of 63k buildings in California), uniform and binomial distributions


• RNNA supports computations based on geographical distances or vector-space similarity between servers and clients.
• Applications of RNNA:
  o Classical: facility location
  o Emerging: fixed wireless telephony access and sensor-based traffic monitoring
• Data for RNNA arrives in streams.
• RNNA performs online computations.

We study three problems:
• Max-RNNA
• List-RNNA
• Opt-RNNA
Two aggregates:
• COUNT
• MAXDIST
Approximate algorithms with near-linear space usage

Any Questions?