Slide 1: CS 591 A1 — Algorithms for Data Streams
Dhiman Barman
CS 591 A1: Algorithms for the New Age
2nd Dec, 2002
Slide 2: Motivation
Traditional DBMS – data stored in finite, persistent data sets
New applications – data arrives as continuous, ordered data streams:
– Network monitoring and traffic engineering
– Telecom call records
– Financial applications
– Sensor networks
– Web logs and clickstreams
Slide 3: Data Stream Model
– Data elements in the stream arrive online
– The system has no control over arrival order, either within a data stream or across streams
– Data streams are potentially unbounded in size
– Once an element from a data stream has been processed, it is discarded unless it is explicitly archived
(A minimal one-pass processing sketch follows.)
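The one-pass, bounded-state discipline of this model can be made concrete with a tiny sketch (not from the slides; the stream and the statistic are placeholder assumptions): each element updates constant-size state once and is then dropped, never to be revisited.

```python
# Minimal illustration of the data stream model: one pass, O(1) state,
# each element is discarded after it updates the synopsis.

def running_mean(stream):
    """Consume a stream of numbers online, keeping only O(1) state."""
    count, total = 0, 0.0
    for x in stream:          # elements arrive in an order we do not control
        count += 1
        total += x            # update the synopsis ...
        # ... and x is now dropped; past elements are never revisited
    return total / count if count else 0.0

print(running_mean(iter([3, 1, 4, 1, 5, 9, 2, 6])))
```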
Slide 4: Goals
– To identify the needs of data stream applications
– To study algorithms for data stream applications
Slide 5: Sample Applications
Network security (e.g., iPolicy, NetForensics/Cisco, Niksun)
– Network packet streams, user session information
– Queries: URL filtering, detecting intrusions, DoS attacks and viruses
Financial applications (e.g., Traderbot)
– Streams of trading data, stock tickers, news feeds
– Queries: arbitrage opportunities, analytics, patterns
Slide 6: Distributed Streams Evaluation
Logical stream = many physical streams
– e.g., maintain the top 100 Yahoo pages
Correlate streams at distributed servers
– e.g., network monitoring
Many streams controlled by few servers
– e.g., sensor networks
Issue
– Move processing to the streams, not streams to the processors
Slide 7: Synopses
Queries may access or aggregate past data
Need a bounded-memory approximation of the history
What is a synopsis?
– A succinct summary of old stream tuples
– Like an index, except that the base data is no longer available
Examples (a sampling sketch follows)
– Sliding windows
– Samples
– Sketches
– Histograms
– Wavelet representations
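Of the synopses listed above, a uniform random sample is the simplest to maintain in one pass. The sketch below (illustrative only, not from the slides) uses classic reservoir sampling: after N elements, every element is in the k-slot sample with probability k/N.

```python
import random

def reservoir_sample(stream, k):
    """Maintain a uniform random sample of size k over a stream in one pass."""
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)            # fill the reservoir first
        else:
            j = random.randrange(n)        # j is uniform in [0, n)
            if j < k:                      # happens with probability k/n
                reservoir[j] = x           # evict a random resident
    return reservoir

print(reservoir_sample(range(10**5), k=10))
```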
Slide 8: Model of Computation
[Figure: a stream of bits arriving over increasing time, feeding synopses/data structures held in memory]
Memory: poly(1/ε, log N)
Query/update time: poly(1/ε, log N)
N: number of tuples seen so far, or window size
ε: error parameter
Slide 9: Algorithmic Issues
Sketching techniques
– S = {x_1, …, x_N}, x_i ∈ {1, …, d}, m_i = |{ j : x_j = i }|
– k-th frequency moment of S: F_k = Σ_{i=1..d} m_i^k
Wavelets
– Coefficients are projections of the given signal onto an orthogonal set of basis vectors
– The largest coefficients retain most of the information
Sliding windows
– Prevent stale data from influencing analysis and statistics
– Statistics, including sketches, can be maintained over sliding windows
(A short sketch computing frequency moments follows.)
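To make the definition of F_k concrete, here is a direct, exact (non-streaming) computation of the frequency moments from the counts m_i; the example stream is made up. F_0 counts distinct values, F_1 is the stream length, and F_2 is the self-join size.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i m_i^k, where m_i is the
    number of occurrences of value i in the stream."""
    counts = Counter(stream)               # m_i for every distinct value i
    if k == 0:
        return len(counts)                 # F_0 = number of distinct values
    return sum(m ** k for m in counts.values())

stream = [1, 3, 2, 3, 3, 1, 4]             # example data, not from the slides
print(frequency_moment(stream, 0))         # 4 distinct values
print(frequency_moment(stream, 1))         # 7 = stream length
print(frequency_moment(stream, 2))         # 15 = 2^2 + 3^2 + 1^2 + 1^2
```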
Slide 10: Streaming Algorithms [Bar-Yossef, Kumar, Sivakumar]
Input: a string σ(x), an error parameter ε, a confidence parameter δ (0 < δ < 1), and one-pass access to σ(x)
Output: a streaming algorithm gives an ε-approximation of f(x) with probability 1 − δ, for any input x and any permutation σ
Frequency moments can be used to find the number of distinct elements in a stream
– F_0 can be computed using O((1/ε³) · log(1/δ) · log m) space and processing time per data item
Counting triangles in a graph presented as a stream
– Each edge is a data item (adjacency stream), or
– Each node together with its neighbors is a data item (incidence stream)
(A hashing-based distinct-count sketch is given below.)
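The slide cites the F_0 algorithm of Bar-Yossef et al.; the sketch below is not that algorithm but a related small-space idea, the k-minimum-values (KMV) estimator: hash each element to [0, 1), keep only the k smallest hash values, and estimate the number of distinct elements from the k-th smallest. The hash choice and the value of k are illustrative assumptions.

```python
import hashlib, heapq

def kmv_distinct(stream, k=64):
    """Estimate the number of distinct elements (F_0) in one pass,
    keeping only the k smallest hash values ever seen."""
    def h(x):
        # hash x to a pseudo-uniform value in [0, 1)
        digest = hashlib.sha1(str(x).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    smallest = []                      # max-heap (via negation) of the k smallest hashes
    seen = set()                       # hash values currently kept, to skip duplicates
    for x in stream:
        v = h(x)
        if v in seen:
            continue
        if len(smallest) < k:
            heapq.heappush(smallest, -v)
            seen.add(v)
        elif v < -smallest[0]:         # v beats the current k-th smallest
            evicted = -heapq.heappushpop(smallest, -v)
            seen.discard(evicted)
            seen.add(v)
    if len(smallest) < k:
        return len(smallest)           # fewer than k distinct values seen: exact
    kth = -smallest[0]                 # k-th smallest hash value
    return int((k - 1) / kth)          # standard KMV estimate of F_0

print(kmv_distinct(i % 5000 for i in range(10**5)))   # exact F_0 is 5000; estimate is close
```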
Slide 11: Streaming Algorithms [Ajtai et al.]
Measuring sortedness
– Estimates the number of inversions in a permutation to within a factor of (1 + ε)
Motivation
– Smarter engineering of sorting algorithms
– Evaluating a ranking function that defines the permutation
Complexity
– Requires O(log N · log log N) space and O(log N) time per data element
(An exact, offline inversion count for comparison is sketched below.)
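The streaming algorithm itself is involved; for reference, the exact quantity being approximated — the number of inversions — can be computed offline (with the whole permutation in memory) by a modified merge sort. This is only the target quantity, not the algorithm of Ajtai et al.

```python
def count_inversions(a):
    """Exact number of pairs (i, j) with i < j and a[i] > a[j], via merge sort."""
    if len(a) <= 1:
        return a[:], 0
    mid = len(a) // 2
    left, inv_l = count_inversions(a[:mid])
    right, inv_r = count_inversions(a[mid:])
    merged, inv = [], inv_l + inv_r
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
            inv += len(left) - i          # every remaining left element forms an inversion
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged, inv

print(count_inversions([2, 4, 1, 3, 5])[1])   # 3 inversions: (2,1), (4,1), (4,3)
```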
Slide 12: Clustering Data Streams
k-median problem for data streams [Guha, Mishra, Motwani and O'Callaghan]
– In the k-median problem, the objective is to minimize the average distance from data points to their closest cluster centers
– The k-median problem is related to the facility-location problem
– In the k-center problem, the objective is to minimize the maximum radius of a cluster
(The k-median objective is written out in the sketch below.)
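As a concrete reference for the objective, the sketch below evaluates the k-median cost of a given set of centers: the sum over all points of the distance to the closest center (minimizing the sum and minimizing the average are equivalent). The points, centers, and Euclidean metric are illustrative assumptions.

```python
import math

def kmedian_cost(points, centers):
    """Sum over all points of the distance to the closest center
    (the k-median objective; k-center would take the max instead)."""
    return sum(min(math.dist(p, c) for c in centers) for p in points)

points  = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
centers = [(0, 0), (10, 10)]
print(kmedian_cost(points, centers))      # 0 + 1 + 1 + 0 + 1 = 3
```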
Slide 13: Algorithm
– The algorithm is based on divide-and-conquer
– Running time is O(n^(1+ε)) and it uses O(n^ε) memory
– Makes a single pass over the data
– Randomization reduces the running time to O(nk) in one pass
– No deterministic algorithm can achieve a bounded approximation in deterministic o(nk) time
Slide 14: Divide-and-Conquer Algorithm
Small-space(S):
1. Divide S into L disjoint pieces X_1, …, X_L
2. For each i, find O(k) centers in X_i and assign each point in X_i to its closest center
3. Let X' be the O(Lk) centers obtained in step 2, where each center c is weighted by the number of points assigned to it
4. Cluster X' to find k centers, using a c-approximation algorithm
(A runnable sketch of this procedure follows.)
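A minimal end-to-end sketch of Small-space, with deliberately simple stand-ins: the per-piece and final clustering steps use a greedy farthest-first heuristic rather than the c-approximation algorithm the slide assumes, and the data, metric, and parameters are made up. The structure — divide, cluster each piece, weight the centers, recluster — is the point, not the approximation guarantee.

```python
import math, random

def dist(p, q):
    return math.dist(p, q)

def greedy_centers(points, weights, k):
    """Stand-in clustering subroutine: weighted farthest-first traversal.
    (A real implementation would use a c-approximation k-median algorithm.)"""
    centers = [points[max(range(len(points)), key=lambda i: weights[i])]]
    while len(centers) < k and len(centers) < len(points):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def small_space(S, k, L, centers_per_piece):
    # 1. Divide S into L disjoint pieces
    pieces = [S[i::L] for i in range(L)]
    weighted_centers, weights = [], []
    # 2.-3. Cluster each piece into O(k) centers, weighted by the points assigned to them
    for X in pieces:
        centers = greedy_centers(X, [1] * len(X), centers_per_piece)
        counts = [0] * len(centers)
        for p in X:
            counts[min(range(len(centers)), key=lambda j: dist(p, centers[j]))] += 1
        weighted_centers.extend(centers)
        weights.extend(counts)
    # 4. Cluster the weighted centers X' down to k final centers
    return greedy_centers(weighted_centers, weights, k)

random.seed(0)
S = [(random.gauss(cx, 1), random.gauss(cy, 1))
     for cx, cy in [(0, 0), (20, 0), (0, 20)] for _ in range(200)]
print(small_space(S, k=3, L=6, centers_per_piece=9))
```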
Slide 15: Theorems
Theorem 1: Given an instance of the k-median problem with a solution of cost C, where the medians need not belong to the set of input points, there exists a solution of cost at most 2C in which all the medians do belong to the set of input points.
Proof: Let j_1, …, j_q be the points assigned to median i in the solution of cost C, and let j_l be the one closest to i. Use j_l as the median instead of i. By the triangle inequality, c_{j_r j_l} ≤ c_{j_r i} + c_{i j_l} ≤ 2 c_{j_r i}, so the assignment distance of every point j_r at most doubles. Summing over all n points in the original set, the total cost is at most 2C.
Slide 16: Theorems (contd.)
Theorem 2: If the sum of the costs of the L optimum k-median solutions for X_1, …, X_L is C, and C* is the cost of the optimum k-median solution for the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X'.
Proof: For each original point i, let i' be the intermediate median it was assigned to in step 2 (so Σ_i c_{i i'} = C), and let σ*(i) be the median serving i in the optimum solution for S (so Σ_i c_{i σ*(i)} = C*). Assign each weighted center i' to the closest median of the optimum solution for S; the cost of this assignment is at most Σ_i c_{i' σ*(i)} ≤ Σ_i ( c_{i' i} + c_{i σ*(i)} ) = C + C*. These medians need not belong to X', so by Theorem 1 there is a solution using only points of X' of cost at most 2(C + C*).
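The chain of inequalities in the proof, written out explicitly (notation as reconstructed above: w_{i'} is the weight of intermediate median i', i'(i) the intermediate median of point i, and σ* the optimum solution for S):

```latex
\begin{aligned}
\operatorname{cost}(X')
  \;&\le\; \sum_{i'} w_{i'} \min_{c \in \sigma^*} c_{i' c}
  \;\le\; \sum_{i} c_{\,i'(i)\,\sigma^*(i)} \\
  \;&\le\; \sum_{i} \bigl( c_{\,i\,i'(i)} + c_{\,i\,\sigma^*(i)} \bigr)
  \;=\; C + C^* ,
\end{aligned}
```

after which Theorem 1 yields a solution restricted to points of X' of cost at most 2(C + C*).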
Slide 17: Data Stream Algorithm
– Input the first m points and reduce them to O(k) intermediate medians; the weight of an intermediate median is the number of points assigned to it
– Repeat until m²/(2k) of the original data points have been seen; there are now m intermediate medians
– Cluster these m first-level medians into 2k second-level medians
– In general, maintain m level-i medians; whenever m of them accumulate, generate 2k level-(i+1) medians, with weights defined as before
– After all the original data points have been seen, cluster all remaining intermediate medians into k final medians
– Number of levels = O(log(n/m) / log(m/k))
– If k << m and m = O(n^ε) for constant ε, this gives an O(1)-approximation; the running time is O(n^(1+ε))
(A sketch of the level-wise buffering is given below.)
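A structural sketch of the level-wise buffering (illustrative only: the clustering subroutine is the same greedy stand-in as in the Small-space sketch, and m, k are toy values), showing how m level-i medians are collapsed into 2k level-(i+1) medians whenever a level fills up.

```python
import math, random

def dist(p, q):
    return math.dist(p, q)

def cluster_weighted(points, weights, k):
    """Stand-in for the O(k)-median subroutine: weighted farthest-first,
    returning (center, weight) pairs with weights summed from assignments."""
    centers = [points[max(range(len(points)), key=lambda i: weights[i])]]
    while len(centers) < k and len(centers) < len(points):
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    out_w = [0] * len(centers)
    for p, w in zip(points, weights):
        out_w[min(range(len(centers)), key=lambda j: dist(p, centers[j]))] += w
    return list(zip(centers, out_w))

def stream_kmedian(stream, k, m):
    levels = [[]]                               # levels[i] holds (median, weight) pairs of level i
    for x in stream:
        levels[0].append((x, 1))
        i = 0
        while len(levels[i]) >= m:              # a level filled up: collapse it
            pts, ws = zip(*levels[i])
            reduced = cluster_weighted(list(pts), list(ws), 2 * k)
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(reduced)
            i += 1
    leftovers = [pw for level in levels for pw in level]
    pts, ws = zip(*leftovers)
    return [c for c, _ in cluster_weighted(list(pts), list(ws), k)]

random.seed(1)
data = [(random.gauss(cx, 1), random.gauss(cy, 1))
        for cx, cy in [(0, 0), (30, 0), (0, 30)] for _ in range(500)]
random.shuffle(data)
print(stream_kmedian(iter(data), k=3, m=60))
```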
Slide 18: Randomized Clustering
– Input O(M/k) points at a time and sample to cluster them into 2k intermediate medians (M = memory size)
– Use a local-search algorithm to cluster O(M) intermediate medians of level i into 2k medians of level i+1
– Use a primal-dual algorithm to cluster the final O(k) medians down to k medians
– Running time is O(nk log n) in one pass, using O(n^ε) memory for small k
Slide 19: Open Problems
– Are there any "killer apps" for data stream systems?
– Techniques that maintain correlated aggregates with provable bounds
– How to cluster and maintain summaries using sliding windows?
– How to deal with distributed streams and perform clustering on them?