Slide 1: Algorithms for Data Streams
Dhiman Barman
CS 591 A1: Algorithms for the New Age
2nd Dec, 2002

Slide 2: Motivation
- Traditional DBMS: data stored in finite, persistent data sets
- New applications: data arrives as continuous, ordered data streams
  - Network monitoring and traffic engineering
  - Telecom call records
  - Financial applications
  - Sensor networks
  - Web logs and clickstreams

Slide 3: Data Stream Model
- Data elements in the stream arrive online
- The system has no control over the arrival order, either within a data stream or across streams
- Data streams are potentially unbounded in size
- Once an element from a data stream has been processed, it is discarded unless it is explicitly archived
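
Not part of the original slides: a minimal Python illustration of this model, assuming nothing beyond the bullets above. Each element is inspected once, in arrival order, with only constant state kept, and is then discarded; the function name and the particular statistics are illustrative.

    def one_pass(stream):
        """Single pass over a stream: constant state, each element seen once, then gone."""
        count, total, maximum = 0, 0, None
        for x in stream:                  # arrival order is outside our control
            count += 1
            total += x
            maximum = x if maximum is None else max(maximum, x)
        return {"count": count, "sum": total, "max": maximum}

    print(one_pass(iter([3, 1, 4, 1, 5, 9, 2, 6])))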

Slide 4: Goals
- Identify the needs of data stream applications
- Study algorithms for data stream applications

Slide 5: Sample Applications
- Network security (e.g., iPolicy, NetForensics/Cisco, Niksun)
  - Network packet streams, user session information
  - Queries: URL filtering, detecting intrusions, DoS attacks, and viruses
- Financial applications (e.g., Traderbot)
  - Streams of trading data, stock tickers, news feeds
  - Queries: arbitrage opportunities, analytics, patterns

Slide 6: Distributed Streams Evaluation
- Logical stream = many physical streams
  - e.g., maintaining the top 100 Yahoo pages
- Correlate streams at distributed servers
  - e.g., network monitoring
- Many streams controlled by a few servers
  - e.g., sensor networks
- Issue: move the processing to the streams, not the streams to the processors

Slide 7: Synopses
- Queries may access or aggregate past data
- Need a bounded-memory approximation of the stream's history
- Synopsis?
  - A succinct summary of old stream tuples
  - Like an index, but the base data is unavailable
- Examples (a sampling example follows below):
  - Sliding windows
  - Samples
  - Sketches
  - Histograms
  - Wavelet representations
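
Not part of the original slides: one concrete bounded-memory synopsis is a uniform random sample maintained in a single pass with reservoir sampling. This is only an illustrative sketch; reservoir sampling is not prescribed by the slide, and the function name and parameter k are assumptions.

    import random

    def reservoir_sample(stream, k, rng=random):
        """Maintain a uniform random sample of k items over a stream in one pass."""
        reservoir = []
        for n, item in enumerate(stream):
            if n < k:
                reservoir.append(item)        # fill the reservoir with the first k items
            else:
                j = rng.randint(0, n)         # position chosen uniformly from 0..n
                if j < k:
                    reservoir[j] = item       # replaces an old sample with probability k/(n+1)
        return reservoir

    # Example: a 10-element uniform sample from a stream of one million integers
    print(reservoir_sample(range(1_000_000), 10))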

Slide 8: Model of Computation
(Diagram: a data stream, ordered by increasing time, feeds bounded in-memory synopses/data structures that answer queries.)
- Memory: poly(1/ε, log N)
- Query/update time: poly(1/ε, log N)
- N: number of tuples seen so far, or the window size
- ε: error parameter

Slide 9: Algorithmic Issues
- Sketching techniques (see the sketch below)
  - S = {x_1, ..., x_N}, x_i ∈ {1, ..., d}, m_i = |{ j : x_j = i }|
  - The k-th frequency moment of S is F_k = Σ_i m_i^k
- Wavelets
  - Coefficients are projections of the given signal onto an orthogonal set of basis vectors
  - Higher-valued coefficients retain most of the information
- Sliding windows
  - Prevent stale data from influencing analysis and statistics
  - Statistics, including sketches, can be maintained over sliding windows
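
Not part of the original slides: a small Python sketch that makes the F_k definition concrete, plus a toy AMS-style estimator for F_2. The function names and the parameters rows/cols are illustrative, and Python's built-in hash stands in for the 4-wise independent hash families the real AMS sketch assumes.

    import random
    from collections import Counter
    from statistics import mean, median

    def exact_frequency_moment(stream, k):
        """F_k = sum over distinct items i of m_i ** k, computed exactly from the counts."""
        counts = Counter(stream)
        return sum(m ** k for m in counts.values())

    def ams_f2_estimate(stream, rows=5, cols=16, seed=0):
        """Streaming estimate of F_2 in the spirit of the AMS sketch: each counter adds a
        pseudo-random +/-1 sign per item, so E[Z^2] = F_2; a median of means reduces variance."""
        rng = random.Random(seed)
        salts = [[rng.getrandbits(32) for _ in range(cols)] for _ in range(rows)]
        z = [[0] * cols for _ in range(rows)]
        for item in stream:
            for r in range(rows):
                for c in range(cols):
                    sign = 1 if hash((salts[r][c], item)) & 1 else -1
                    z[r][c] += sign
        return median(mean(v * v for v in row) for row in z)

    data = [random.randrange(100) for _ in range(10_000)]
    print(exact_frequency_moment(data, 2), ams_f2_estimate(data))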

Slide 10: Streaming Algorithms [Bar-Yossef, Kumar, Sivakumar]
- Input: a string σ(x), an error parameter ε, a confidence parameter 0 ≤ δ < 1, and one-pass access to σ(x)
- Output: a streaming algorithm gives an ε-approximation of f(x) with probability ≥ 1 − δ, for any input x and any permutation σ
- Frequency moments can be computed to find the number of distinct elements in a stream (see the sketch below)
  - F_0 can be computed using O((1/ε^3) log(1/δ) log m) space and processing time per data item
- Counting the number of triangles in a graph presented as a stream
  - Each edge is a data item (adjacency stream), or
  - Each node together with its neighbors is a data item (incidence stream)
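
Not part of the original slides: the F_0 bullet can be made concrete with a k-minimum-values estimator for the number of distinct elements. This is a different, simpler scheme than the one in the cited paper; the function name, the choice of k, and the ad-hoc hashing to [0, 1) are all assumptions of this sketch.

    import heapq
    import random

    def kmv_distinct_estimate(stream, k=256, seed=0):
        """Estimate F_0: hash each item to [0, 1), keep the k smallest distinct hash values,
        and return (k - 1) / (k-th smallest hash)."""
        rng = random.Random(seed)
        salt = rng.getrandbits(64)
        heap = []                 # max-heap (via negation) holding the k smallest hashes
        in_heap = set()
        for item in stream:
            h = (hash((salt, item)) & 0xFFFFFFFFFFFF) / float(1 << 48)   # pseudo-hash to [0, 1)
            if h in in_heap:
                continue
            if len(heap) < k:
                heapq.heappush(heap, -h)
                in_heap.add(h)
            elif h < -heap[0]:                      # smaller than the current k-th smallest
                evicted = -heapq.heappushpop(heap, -h)
                in_heap.discard(evicted)
                in_heap.add(h)
        if len(heap) < k:
            return len(heap)      # fewer than k distinct hashes seen: the count is exact
        return (k - 1) / (-heap[0])

    data = [random.randrange(50_000) for _ in range(200_000)]
    print(len(set(data)), round(kmv_distinct_estimate(data)))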

Slide 11: Streaming Algorithms [Ajtai et al.]
- Measures sortedness
  - Estimates the number of inversions in a permutation to within a factor of (1 + ε)
- Motivation
  - Smart engineering of sorting algorithms
  - Evaluating ranking functions that define permutations
- Complexity
  - Requires O(log(N) loglog(N)) space and O(log N) time per data element
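
Not part of the original slides: for reference, the quantity being approximated here, the inversion count, can be computed exactly offline in O(N log N) time with a merge-sort counter. The sketch below is only that offline baseline, not the streaming algorithm itself.

    def count_inversions(seq):
        """Exact number of pairs i < j with seq[i] > seq[j], via merge sort."""
        def sort_count(a):
            if len(a) <= 1:
                return a, 0
            mid = len(a) // 2
            left, inv_left = sort_count(a[:mid])
            right, inv_right = sort_count(a[mid:])
            merged, i, j, inv = [], 0, 0, inv_left + inv_right
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    merged.append(left[i]); i += 1
                else:
                    merged.append(right[j]); j += 1
                    inv += len(left) - i        # every remaining left element inverts with right[j]
            merged.extend(left[i:]); merged.extend(right[j:])
            return merged, inv
        return sort_count(list(seq))[1]

    print(count_inversions([3, 1, 4, 1, 5, 9, 2, 6]))   # prints 8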

Slide 12: Clustering Data Streams
- The k-median problem for data streams [Guha, Mishra, Motwani and O'Callaghan]
- In the k-median problem, the objective is to minimize the average distance from data points to their closest cluster centers (see the sketch below)
- The k-median problem is closely related to the facility-location problem
- In the k-center problem, the objective is to minimize the maximum radius of a cluster
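
Not part of the original slides: a minimal sketch that makes the two objectives concrete on one-dimensional points. The function names and the use of 1-D absolute distance are illustrative assumptions.

    def kmedian_cost(points, centers):
        """k-median objective: total (equivalently, average) distance to the closest center."""
        return sum(min(abs(p - c) for c in centers) for p in points)

    def kcenter_cost(points, centers):
        """k-center objective: the maximum distance (cluster radius) to the closest center."""
        return max(min(abs(p - c) for c in centers) for p in points)

    pts = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
    print(kmedian_cost(pts, [1.0, 11.0]), kcenter_cost(pts, [1.0, 11.0]))   # 4.0 and 1.0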

Slide 13: Algorithm
- Based on divide-and-conquer
- Runs in O(n^(1+ε)) time and uses O(n^ε) memory
- Makes a single pass over the data
- Randomization reduces the running time to O(nk) in one pass
- No deterministic algorithm can achieve a bounded approximation in deterministic o(nk) time

Slide 14: Divide-and-Conquer Algorithm
Small-space(S):
1. Divide S into L disjoint pieces X_1, ..., X_L
2. For each i, find O(k) centers in X_i and assign each point in X_i to its closest center
3. Let X' be the O(Lk) centers obtained in step 2, where each center c is weighted by the number of points assigned to it
4. Cluster X' to find k centers, using a c-approximation algorithm
(A code sketch of this procedure follows below.)
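
Not part of the original slides: a minimal Python sketch of Small-space(S) under strong simplifying assumptions: the points are one-dimensional, and a random-centers heuristic stands in for the c-approximation subroutine of steps 2 and 4. The function names and parameters are illustrative only.

    import random

    def toy_kmedian(points, k, rng):
        """Toy stand-in for a constant-factor k-median subroutine: pick k random points as
        centers and assign every point to its nearest center."""
        centers = rng.sample(range(len(points)), min(k, len(points)))
        assign = [min(centers, key=lambda c: abs(points[p] - points[c])) for p in range(len(points))]
        return centers, assign

    def small_space(S, k, L, seed=0):
        """Small-space(S): split S into L pieces, reduce each piece to O(k) weighted centers,
        then cluster the weighted centers X' into k final centers."""
        rng = random.Random(seed)
        pieces = [S[i::L] for i in range(L)]                       # step 1: L disjoint pieces
        reps, weights = [], []
        for X in pieces:                                           # step 2: O(k) centers per piece
            centers, assign = toy_kmedian(X, 2 * k, rng)
            for c in centers:
                reps.append(X[c])                                  # step 3: the weighted centers X'
                weights.append(sum(1 for a in assign if a == c))   # weight = points assigned to c
        # step 4: cluster X' into k centers (a real solver would use the weights;
        # the toy subroutine ignores them)
        final_centers, _ = toy_kmedian(reps, k, rng)
        return [reps[c] for c in final_centers]

    data = [random.gauss(mu, 1.0) for mu in (0, 10, 20) for _ in range(200)]
    print(sorted(small_space(data, k=3, L=4)))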

Slide 15: Theorems...
Theorem 1: Given an instance of the k-median problem with a solution of cost C, where the medians need not belong to the set of input points, there exists a solution of cost at most 2C in which all the medians do belong to the set of input points.
Proof: Let j_1, ..., j_q be the points assigned to median i in the solution of cost C. Use j_l, the assigned point closest to i, as the median in place of i. By the triangle inequality c_xy ≤ c_xi + c_iy, the assignment distance of every point j_r at most doubles. Summing over all n points in the original set, the total cost at most doubles, i.e., it is at most 2C.
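
Not part of the original slides: written out, the doubling step in the proof is the following chain, using the slide's notation with j_l the assigned point closest to the median i:

\[
c_{j_r j_l} \;\le\; c_{j_r i} + c_{i j_l} \;\le\; c_{j_r i} + c_{i j_r} \;=\; 2\, c_{j_r i},
\]

where the second inequality holds because j_l is the closest assigned point to i, so c_{i j_l} ≤ c_{i j_r}.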

Slide 16: Theorems...
Theorem 2: If the sum of the costs of the L optimum k-median solutions for X_1, ..., X_L is C, and C* is the cost of the optimum k-median solution for the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X'.
Proof: Let i'(i) denote the intermediate center that point i is assigned to, and let σ(i) denote the median serving i in the optimum solution for S. Serving each weighted center of X' by its nearest optimum median of S costs at most Σ_i c_{i'(i) σ(i)} ≤ Σ_i ( c_{i'(i) i} + c_{i σ(i)} ) = C + C*, by the triangle inequality. Restricting the medians to points of X' at most doubles the cost (Theorem 1), giving 2(C + C*).

Slide 17: Data Stream Algorithm
- Read the first m points and reduce them to O(k) intermediate medians; the weight of an intermediate median is the number of points assigned to it
- Repeat until m^2/(2k) of the original data points have been seen; there are now m intermediate medians
- Cluster these m first-level medians into 2k second-level medians
- In general, maintain at most m level-i medians; on seeing m of them, generate 2k level-(i+1) medians, with weights defined as before
- After seeing all the original data points, cluster all the intermediate medians into k final medians
- Number of levels = O(log(n/m) / log(m/k))
- If k << m and m = O(n^ε) for constant ε, this gives an O(1)-approximation with running time O(n^(1+ε))
(A code sketch of this multilevel scheme follows below.)
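
Not part of the original slides: a minimal Python sketch of the multilevel scheme, under simplifying assumptions: one-dimensional points, a random-centers heuristic in place of the real clustering subroutine, and a simplified cascade in which every level collapses at the same threshold m. Names and parameters are illustrative.

    import random

    def toy_cluster(weighted_points, k, rng):
        """Toy stand-in for the clustering subroutine: pick k of the (point, weight) pairs
        at random as centers and fold each weight onto its nearest center."""
        centers = rng.sample(weighted_points, min(k, len(weighted_points)))
        out = {c[0]: 0 for c in centers}
        for p, w in weighted_points:
            nearest = min(out, key=lambda c: abs(p - c))
            out[nearest] += w
        return list(out.items())

    def stream_kmedian(stream, k, m, seed=0):
        """Buffer up to m weighted level-i medians; when a level fills, reduce it to 2k
        level-(i+1) medians; at the end, cluster everything that remains into k medians."""
        rng = random.Random(seed)
        levels = [[]]                                     # levels[i] holds level-i medians
        for x in stream:
            levels[0].append((x, 1))                      # raw points are level 0 with weight 1
            i = 0
            while len(levels[i]) >= m:                    # a full level collapses upward
                reduced = toy_cluster(levels[i], 2 * k, rng)
                levels[i] = []
                if i + 1 == len(levels):
                    levels.append([])
                levels[i + 1].extend(reduced)
                i += 1
        leftovers = [wp for level in levels for wp in level]
        return [p for p, _ in toy_cluster(leftovers, k, rng)]

    data = [random.gauss(mu, 1.0) for mu in (0, 50, 100) for _ in range(300)]
    print(sorted(stream_kmedian(data, k=3, m=60)))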

Slide 18: Randomized Clustering
- Read O(M/k) points at a time and use sampling to cluster them into 2k intermediate medians (M = memory size)
- Use a local-search algorithm to cluster O(M) intermediate medians of level i into 2k medians of level i+1
- Use the primal-dual algorithm to cluster the final O(k) medians into k medians
- Running time is O(nk log n) in one pass, using n^ε memory for small k

Slide 19: Open Problems
- Are there any "killer apps" for data stream systems?
- Techniques that maintain correlated aggregates with provable bounds
- How can we cluster and maintain summaries over sliding windows?
- How do we deal with distributed streams and perform clustering on them?