Mirek Riedewald Department of Computer Science Cornell University Efficient Processing of Massive Data Streams for Mining and Monitoring.


Acknowledgements
• Al Demers
• Abhinandan Das
• Alin Dobra
• Sasha Evfimievski
• Johannes Gehrke
• KD-D initiative (Art Becker et al.)

Introduction
• Data streams versus databases
  • Infinite stream, continuous queries
  • Limited resources
• Network monitoring
  • High arrival rates, approximation [CGJSS02]
• Stock trading
  • Complex computation [ZS02]
• Retail, E-business, Intelligence, Medical Surveillance
  • Identify relevant information on-the-fly, archive for data mining
  • Exact results, error guarantees

Information Spheres
• Local Information Sphere
  • Within each organization
  • Continuous processing of distributed data streams
  • Online evaluation of thousands of triggers
  • Storage/archival of important data
• Global Information Sphere
  • Between organizations
  • Share data in a privacy-preserving way

Local Information Sphere
Distributed data stream event processing and online data mining
• Technical challenges
  • Blocking operators, unbounded state
  • Graceful degradation under increasing load
  • Integration with archive
  • Processing of physically distributed streams

Event Matching, Correlation
• Join of data streams

Brand | Mpix | Price
Canon |      |

Mpix  | Price
> 2.0 | < 250

Event Matching, Correlation
• Join of data streams

Brand | Mpix | Price
Canon |      |
Fuji  |      |

Mpix  | Price
> 2.0 | < 250
> 4.0 | < 400

Event Matching, Correlation
• Join of data streams
  • Equi-join, text similarity, geographical proximity, …
  • Problem: unbounded state, computation

Brand | Mpix | Price
Canon |      |
Fuji  |      |
Kodak |      |

Mpix  | Price
> 2.0 | < 250
> 4.0 | < 400
= 3.0 | < 200

Window Joins
• Restrict join to window of most recent records (tuples)
  • Landmark window
  • Sliding window based on time or number of records
• Problem definition
  • Window based on time: size w
  • Synchronous record arrival
  • Equi-join

Abstract Model
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A = s.A
• Sliding window of size w
[Figure: streams R and S]
Output: (r0,s2), (r1,s2), (r2,s2)

Abstract Model (cont.)
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A = s.A
• Sliding window of size w
[Figure: streams R and S]
Output: (r0,s2), (r1,s2), (r2,s2)
Output: (r3,s1), (r1,s3), (r2,s3)

Abstract Model (cont.)
• Data streams R(A,…), S(A,…)
• Compute equi-join on A
  • Match all r and s of streams R, S such that r.A = s.A
• Sliding window of size w
[Figure: streams R and S]
Output: (r0,s2), (r1,s2), (r2,s2)
Output: (r3,s1), (r1,s3), (r2,s3)
No new output
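
The abstract model can be sketched in a few lines of Python. This is an illustration, not the speaker's implementation. It assumes the stream values R = 1,1,1,3 and S = 2,3,1,1 with w = 3 that appear on the flow-model slides, and it assumes a tuple arriving at time t stays in the window for time steps t through t+w-1. The name `window_join` and the `(time, i, j)` output format are hypothetical.

```python
from collections import deque

def window_join(R, S, w):
    """Exact sliding-window equi-join with synchronous arrival.

    Returns a list of (t, i, j): at time t the pair (r_i, s_j) is output.
    Assumption: a tuple arriving at time a is alive at times a .. a+w-1.
    """
    win_r, win_s = deque(), deque()   # entries are (arrival_time, value)
    out = []
    for t, (r, s) in enumerate(zip(R, S)):
        # Expire tuples that have fallen out of the window.
        while win_r and win_r[0][0] <= t - w:
            win_r.popleft()
        while win_s and win_s[0][0] <= t - w:
            win_s.popleft()
        # The new R-tuple probes the S-tuples already in the window.
        for j, value in win_s:
            if value == r:
                out.append((t, t, j))
        win_r.append((t, r))
        # The new S-tuple probes all R-tuples, including the one just added.
        for i, value in win_r:
            if value == s:
                out.append((t, i, t))
        win_s.append((t, s))
    return out

for t, i, j in window_join([1, 1, 1, 3], [2, 3, 1, 1], w=3):
    print(f"time {t}: (r{i},s{j})")
```

Under these assumptions the pairs produced at times 2 and 3 are the ones listed on the slide. This version keeps both full windows in memory, which is the unbounded-resource case the following slides relax.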

Limited Resources
• Focus on limited memory M < 2w
• State of the art: random load shedding [KNV03]
  • Random sample of streams
• Desired approach: semantic load shedding
  • Goal: graceful degradation
  • Approximation
  • Set-valued result: Error measure?

Set-Approximation Error
• What is a good error measure?
• Information Retrieval, Statistics, Data Mining
  • Matching coefficient
  • Dice coefficient
  • Jaccard coefficient
  • Cosine coefficient
  • Overlap coefficient
  • Earth Mover’s Distance (EMD) [RTG98]
  • Match And Compare (MAC) [IP99]
• Join: subset of output result
  • EMD, Overlap coefficient trivially 0 or 1
  • Others (except MAC) reduce to MAX-subset error measure
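
A small numeric check of the last two bullets, using the usual information-retrieval definitions of the coefficients. The slide does not spell the definitions out, so the formulas below are an assumption, and `set_coefficients` is a hypothetical helper name. When the approximate result is a subset of the exact result, every measure except the overlap coefficient grows with the size of the subset, which is why maximizing the subset size (MAX-subset) is enough.

```python
import math

def set_coefficients(approx, exact):
    """Similarity between an approximate and an exact result set.

    Both sets must be non-empty, otherwise the ratios are undefined.
    """
    x, y = set(approx), set(exact)
    common = len(x & y)
    return {
        "matching": common,
        "dice": 2 * common / (len(x) + len(y)),
        "jaccard": common / len(x | y),
        "cosine": common / math.sqrt(len(x) * len(y)),
        "overlap": common / min(len(x), len(y)),
    }

exact = set(range(6))                      # six exact join results
for k in range(1, 7):
    approx = set(range(k))                 # a subset holding k of them
    c = set_coefficients(approx, exact)
    print(k, {name: round(v, 3) for name, v in c.items()})
```

For a non-empty subset the overlap coefficient is always 1, so it cannot distinguish a good approximation from a poor one.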

Optimization Problem
Select records to be kept in memory such that the result size is maximized subject to memory constraints
• Lightweight online technique
• Adaptivity in presence of memory fluctuations

Optimal Offline Algorithm
• What is the best possible that can be achieved?
  • Optimal sampling strategy for MAX-subset
  • Bottom line for evaluation of any online algorithm
• Same optimization problem, but knows future
  • Finite subsets of input streams
  • Formulate as linear flow problem
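
The talk computes the offline optimum with a flow formulation. The sketch below does not reproduce that construction. It swaps in exhaustive search, which is only feasible for tiny inputs, to show what "same optimization problem, but knows future" means. Assumptions: memory is one shared pool of M tuples across both streams, the two tuples arriving at the same time step can always be joined with each other on arrival, and a tuple arriving at time a is alive at times a through a+w-1. The name `offline_optimum` is hypothetical.

```python
from itertools import combinations

def offline_optimum(R, S, w, M):
    """Largest join output achievable when at most M tuples are stored.

    Brute force over all choices of which tuples to keep after each step.
    """
    N = min(len(R), len(S))

    def best(t, mem):
        if t == N:
            return 0
        # Stored tuples are (stream, arrival_time, value); drop expired ones.
        mem = tuple(e for e in mem if e[1] > t - w)
        r, s = R[t], S[t]
        gain = sum(
            1 for stream, _, v in mem
            if (stream == "S" and v == r) or (stream == "R" and v == s)
        ) + (r == s)
        candidates = mem + (("R", t, r), ("S", t, s))
        # Keeping as many tuples as allowed never hurts.
        k = min(M, len(candidates))
        return gain + max(best(t + 1, keep)
                          for keep in combinations(candidates, k))

    return best(0, ())

print(offline_optimum([1, 1, 1, 3], [2, 3, 1, 1], w=3, M=2))
print(offline_optimum([1, 1, 1, 3], [2, 3, 1, 1], w=3, M=6))
```

With M = 2w the search returns the full join size, and with less memory it gives the ceiling against which an online heuristic can be compared.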

Generation of Flow Model
R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
Fixed memory allocation
[Figure: flow network. Labels: 3; -3 cost; Capacity: 0..1, linear cost; Keep in memory; Replace]

Correspondence to Windows
R = 1,1,1,3; S = 2,3,1,1

Complexity
• Integer solution exists
• Optimal solution found in O(n² m log n)
  • N: input size of single stream
  • #nodes: n < 2wN + N + 2
  • #arcs: m < 2n + M + 1
• Reasonable costs for benchmarking
  • Approx. 1 GB memory (w=800, M=800)
  • Approx. 1 h computation time
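
To get a feel for the size bounds, the two inequalities can be evaluated directly. The helper name `flow_model_bounds` is hypothetical, and the arc bound is obtained by plugging the node bound into m < 2n + M + 1, so it is a loose upper limit rather than the exact graph size.

```python
def flow_model_bounds(N, w, M):
    """Strict upper bounds on nodes and arcs of the flow model.

    N: tuples per stream, w: window size, M: memory size in tuples.
    """
    n_bound = 2 * w * N + N + 2      # n < 2wN + N + 2
    m_bound = 2 * n_bound + M + 1    # m < 2n + M + 1, using the node bound
    return n_bound, m_bound

# Toy instance from the flow-model slides: four tuples per stream.
print(flow_model_bounds(N=4, w=3, M=2))   # (30, 63)
```

The node bound grows with the product wN, which is why the benchmark instance with w = 800 already needs memory on the order of a gigabyte.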

Optimal Flow
R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
Fixed memory allocation
[Figure: flow network. Labels: 3; -3 cost; Capacity: 0..1, linear cost; Keep in memory; Replace]

Easy to Extend
R = 1,1,1,3; S = 2,3,1,1; M = 2, w = 3
Variable memory allocation
[Figure: flow network. Labels: 3; -3 cost; Capacity: 0..1, linear cost; Keep in memory; Replace]

Online Heuristics
• Maximize expected output
  • PROB: sort tuples by join partner arrival probability
  • LIFE: sort tuples by product of partner arrival probability and remaining lifetime
• Maintain stream statistics
  • Histograms (DGIM02, TGIK02), wavelets (GKMS01), quantiles (GKMS02, GK01)
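
A minimal sketch of the two ranking rules, under several assumptions that the slide does not state. Memory is one shared pool of M tuples. The value distributions of both streams are known in advance and passed in as dictionaries; in practice they would be estimated from the maintained statistics. Shedding means evicting the lowest-ranked tuples after each time step. The names `shed_join` and `empirical` are hypothetical.

```python
from collections import Counter

def empirical(stream):
    """Relative frequency of each value, standing in for stream statistics."""
    counts = Counter(stream)
    return {v: c / len(stream) for v, c in counts.items()}

def shed_join(R, S, w, M, p_r, p_s, policy="PROB"):
    """Count join results when only M tuples fit in memory.

    p_r[v] / p_s[v]: probability that value v arrives on R / S.
    policy "PROB" ranks by partner arrival probability,
    policy "LIFE" ranks by that probability times remaining lifetime.
    """
    mem = []          # entries are (stream, arrival_time, value)
    produced = 0
    for t, (r, s) in enumerate(zip(R, S)):
        mem = [e for e in mem if e[1] > t - w]           # expire old tuples
        produced += sum(1 for st, _, v in mem if st == "S" and v == r)
        produced += sum(1 for st, _, v in mem if st == "R" and v == s)
        produced += (r == s)       # the two new tuples meet on arrival
        mem += [("R", t, r), ("S", t, s)]

        def priority(entry):
            stream, arrived, value = entry
            # A stored R-tuple waits for partners from S, and vice versa.
            p = (p_s if stream == "R" else p_r).get(value, 0.0)
            if policy == "LIFE":
                p *= w - (t - arrived) - 1   # time steps left in the window
            return p

        mem.sort(key=priority, reverse=True)
        del mem[M:]                # shed everything beyond the memory limit
    return produced

R, S = [1, 1, 1, 3], [2, 3, 1, 1]
p_r, p_s = empirical(R), empirical(S)
for policy in ("PROB", "LIFE"):
    print(policy, shed_join(R, S, w=3, M=2, p_r=p_r, p_s=p_s, policy=policy))
```

With M = 2w nothing is ever shed and the count equals the exact join size, so the same function also serves as a sanity check for the exact case.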

Approximation Quality

Effect of Skew

Summary
• Information sphere architecture
• Optimal algorithm and fast, efficient heuristic for sliding window joins
• Open problems
  • Other set error measures, resource models
  • Other joins: compress records
  • Complex queries
  • Distributed processing
• Integration with other techniques into local information sphere

Related Work
• Aurora (Brown, MIT), STREAM (Stanford), Telegraph (Berkeley), NiagaraCQ (Wisconsin, OGI)
• Memory requirements [ABBMW02, TM02]
• Aggregation
  • Alon, Bar-Yossef, Datar, Dobra, Garofalakis, Gehrke, Gibbons, Gilbert, Indyk, Korn, Kotidis, Koudas, Matias, Motwani, Muthukrishnan, Rastogi, Srivastava, Strauss, Szegedy

Other Results [DGR03]
• Integration with archive
  • Load smoothing, not shedding
  • Novel “error” measure: archive access cost
• Static join for sensor networks
  • Maximize result size subject to constraints on energy consumption
  • Polynomial dynamic programming solution
  • Fast 2-approximation algorithms
  • NP-hardness proof for join of 3 or more streams

Other Results (cont.) [DGGR02]
• Computation of aggregates over streams for multiple joins
• Small pseudo-random sketch synopses (randomized linear projections)
• Explicit, tunable error guarantees
• Sketch partitioning to boost accuracy (intelligently partition join attribute space)
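
To illustrate what a randomized linear projection looks like, here is a basic sketch for estimating the size of a single equi-join. It is the textbook construction with random +1/-1 signs, not necessarily the exact method of [DGGR02], and it omits sketch partitioning. As a simplification the signs are stored in a dictionary; a real implementation would generate them from a small seed with a four-wise independent hash family. All function names are hypothetical.

```python
import random
from collections import Counter

def sketch(stream, sign):
    """Linear projection of the frequency vector onto a +1/-1 sign vector."""
    return sum(sign[v] for v in stream)

def estimate_join_size(R, S, copies=200, seed=0):
    """Average of sketch products over independent sign vectors.

    Each product is an unbiased estimate of sum_v f_R(v) * f_S(v);
    averaging more copies reduces the variance.
    """
    rng = random.Random(seed)
    domain = sorted(set(R) | set(S))
    total = 0
    for _ in range(copies):
        sign = {v: rng.choice((-1, 1)) for v in domain}
        total += sketch(R, sign) * sketch(S, sign)
    return total / copies

def exact_join_size(R, S):
    """Join size computed from full frequency tables, for comparison."""
    f_r, f_s = Counter(R), Counter(S)
    return sum(f_r[v] * f_s[v] for v in f_r)

R, S = [1, 1, 1, 3], [2, 3, 1, 1]
print("exact:", exact_join_size(R, S))
print("estimate:", estimate_join_size(R, S))
```

Because the sketch is linear, it can be updated one tuple at a time as the stream arrives, which is what makes it usable as a small synopsis.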

Thanks! Questions?