Stream Data Introduction or “Stream Data in 30 minutes or less…” Magdiel Galán CSE591: DataMining Dr. Huan Liu Spring 2004.



Outline: Streaming Data description; Uses/Applications; Problems/Challenges; Main Concepts (variance & k-medians, aging & sliding windows, algorithms); References

Warning… Disclaimer: This is just a quick tour of data stream concepts. Although some algorithms and proof equations are included in the presentation, they are for reference and will only be briefly discussed during the class presentation.

Sizing the challenge: WalMart records 20 million transactions; Google handles 100 million searches; AT&T produces 275 million call records; an Earth-sensing satellite produces GBs of data. All of this in just one day!

Characteristics/Description: Stream data sets are continuous, massive, unbounded, possibly infinite, and fast changing, and they require fast, real-time response.

Example: Network Management Application Network Management involves monitoring and configuring network hardware and software to ensure smooth operation Monitor link bandwidth usage, estimate traffic demands Quickly detect faults, congestion and isolate root cause Load balancing, improve utilization of network resources AT&T collects 100 GBs of NetFlow data each day! from [GGR02]

Network Management Application (cont.) [Figure labels: Network Operations Center; Network Measurements; Alarms] from [GGR02]

Uses/Applications Banking/Stocks/Financials credit card fraud detection stock trends monitoring Sensors power grid balancing engine controls collision avoidance driver sleep monitor

Problems/Challenges: “Zillions” of data items; continuous/unbounded. Examples arrive faster than they can be mined. Applications may require fast, real-time response. Examples: life-threatening (collision avoidance); lost revenue/transactions (hung-up networks).

Problems/Challenges: Time/space constrained. Not enough memory: we can’t afford storing or revisiting the data, so computation must be single pass. External-memory algorithms for handling data sets larger than main memory cannot be used: they do not support continuous queries and are too slow for real-time response.

Problems/Challenges In summary… Can’t stop to smell the roses… Only one chance/single pass/look at the data

Problems/Challenges: Other Considerations. Classical algorithms (e.g., CART, C4.5) do not scale up to data streams [DH00]: most need the entire data set for analysis, with random access (or multiple passes) to the data. It is difficult to compute answers accurately with limited memory, so with probability at least 1 - δ, algorithms compute an approximate answer within a factor ε of the actual answer. Noise (bad sensors, outliers). Aging/old/stale data.

Computation Model [Figure labels: Data Streams; Stream Processing Engine; Synopsis in Memory; (Approximate) Answer; Decision Making] from [GGR02]

Model Components Synopsis Summary of the data Samples, Histograms Processing Engine Implementation/Management System STREAM (Stanford): general-purpose Aurora (Brown/MIT): sensor monitoring, dataflow Telegraph (Berkeley): adaptive engine for sensors Decision Making Apply Data Mining techniques Decision Trees, Clusters, Association Rules

Model Components: The remainder of the slides will focus on Decision Making and Synopsis Calculation.

Synopsis: Dealing with Time/Space Constraints. Since the data can’t be stored or revisited, the best alternative is to summarize what has been seen. Basic stream synopsis computation: Random Sampling: generate statistics using a representative sample of the data. Histograms: distribution/grouping representation of the data. Wavelets: mathematical tool for hierarchical decomposition of functions/signals. This discussion will focus on Histograms.
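To make the random-sampling synopsis concrete, here is a minimal sketch using reservoir sampling. The slide only says "Random Sampling" and does not name a technique, so reservoir sampling is a swapped-in choice, and the function name `reservoir_sample` and its parameters are illustrative assumptions.

```python
import random


def reservoir_sample(stream, sample_size, rng=None):
    """Keep a uniform random sample of `sample_size` items in one pass.

    The stream length does not need to be known in advance, which is
    what makes this usable as a stream synopsis.
    """
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < sample_size:
            # Fill the reservoir with the first `sample_size` items.
            reservoir.append(item)
        else:
            # Item number index+1 is kept with probability
            # sample_size / (index + 1), replacing a random slot.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = item
    return reservoir


# Usage: sample 5 values from a stream of one million integers.
sample = reservoir_sample(range(1_000_000), 5, random.Random(0))
print(sample)
```

Memory use depends only on `sample_size`, not on the length of the stream, which matches the single-pass, limited-memory setting described above.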

Types of Histograms: Equi-Depth: element counts per bucket are kept constant. V-Optimal: minimizes frequency variance within buckets. Exponential Histograms (EH): bucket sizes are non-decreasing powers of 2. Size: total number of 1’s in the bucket. For every bucket size other than the size of the last bucket, there are at least k/2 and at most k/2+1 buckets of that size. Example: k=4: (1,1,2,2,2,4,4,4,8,8,..). EH is an essential component of the “sliding windows” technique for addressing “aging” data.

[Figures: Equi-Depth histogram [GGR02]; V-Optimal histogram [GGR02]; Exponential Histogram (EH)] Notice the total count of elements. from [GGR02]

Sliding Windows Technique: Background: Some applications rely on ALL historical data. But for most applications, OLD data is considered less relevant and could skew results away from NEW trends or conditions: new processes/procedures, new hardware/sensors, new fashion trends.

Sliding Windows (cont.): Common approaches to addressing old data: Aging Model [BDMO03]: elements are associated with “weights” that decrease over time; may use exponential decay formulas. Sliding Windows Model: only the last “N” elements are considered; examples are incorporated as they arrive; a record arriving at time t “expires” at time t+N (N is the window length); count only the “1’s” in bit-stream data.

Sliding Windows Description: Sliding Windows Approach (pseudo-pseudo code)
1. Consider only the last N elements.
2. Define k = 1/ε, and round k/2 to the nearest integer.
3. Timestamp each “1” that arrives in the stream and insert it as a new first bucket, shifting the existing ones. The first bucket’s value is “1” since it holds only one “1”.
4. If the number of buckets with the same value reaches k/2 + 2, merge the two oldest of them, which leaves at least k/2 buckets of that value. Merging creates a new bucket with size equal to the sum; repeat the check for the merged size.
5. Eliminate the last (oldest) bucket once the timestamp of its most recent “1” falls outside the window of the last N elements.
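The steps above can be sketched as a small Python class. This is a minimal illustration under stated assumptions, not the implementation from [DGIM02]: the class and method names are hypothetical, a merge is triggered when a size reaches k/2 + 2 buckets, and the estimate counts the oldest bucket at the midpoint of its possible contribution (between 1 and its size), a detail the slide does not spell out.

```python
import math


class ExponentialHistogram:
    """Approximate count of 1's among the last `window` bits of a stream."""

    def __init__(self, window, eps):
        self.window = window
        k = math.ceil(1 / eps)
        self.half_k = math.ceil(k / 2)
        self.time = 0
        # Each bucket is (timestamp of its most recent 1, size), oldest first.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 leaves the window.
        while self.buckets and self.buckets[0][0] <= self.time - self.window:
            self.buckets.pop(0)
        if bit != 1:
            return
        self.buckets.append((self.time, 1))
        size = 1
        while True:
            same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(same) < self.half_k + 2:
                break
            # Merge the two oldest buckets of this size; the merged bucket
            # keeps the timestamp of the more recent of the two.
            oldest, second = same[0], same[1]
            merged = (self.buckets[second][0], 2 * size)
            self.buckets[oldest:second + 1] = [merged]
            size *= 2  # the merge may cascade to the next size

    def estimate(self):
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        oldest_size = self.buckets[0][1]
        # Only part of the oldest bucket may still be inside the window.
        return total - (oldest_size - 1) / 2


# Usage: window of 8 bits, target relative error 0.5.
eh = ExponentialHistogram(window=8, eps=0.5)
for bit in [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]:
    eh.add(bit)
print(eh.buckets, eh.estimate())
```

Only bucket sizes and one timestamp per bucket are stored, so memory grows with the number of buckets rather than with N.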

Sliding Window (SW) Model [Figure: a stream with time increasing toward the current time and a window of size N = 7] from [BDMO03]

Example Run: Assume k/2 = 1, so a third bucket of any size triggers a merge. Bucket sizes are listed oldest first.
Start: 32,16,8,8,4,4,2,1,1
A new 1 arrives (the data stream segment could be “00010”): 32,16,8,8,4,4,2,1,1,1. Merge! The two oldest 1’s combine: 32,16,8,8,4,4,2,2,1. Merged!
Another 1 arrives: 32,16,8,8,4,4,2,2,1,1 (no merge needed)
Another 1 arrives and the merges cascade: 32,16,16,8,4,2,1
* Example from [DGIM02]/[GGR02]
Keep in mind, the values represent TOTAL 1’s.
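The run above can be replayed with a few lines of Python. This is a sketch that tracks bucket sizes only (no timestamps or expiry); the function name `add_one` and the parameter `max_per_size` are illustrative, with `max_per_size = 2` chosen because a third bucket of a size is what triggers each merge shown in the run.

```python
def add_one(sizes, max_per_size=2):
    """Insert a new 1 and cascade merges; `sizes` lists buckets oldest first."""
    sizes = sizes + [1]
    size = 1
    while sizes.count(size) > max_per_size:
        first = sizes.index(size)  # oldest bucket of this size
        # Replace the two oldest buckets of this size by one of double size.
        sizes[first:first + 2] = [2 * size]
        size *= 2
    return sizes


run = [32, 16, 8, 8, 4, 4, 2, 1, 1]
history = []
for _ in range(3):
    run = add_one(run)
    history.append(run)
    print(run)
```

The third arrival shows the cascade: merging the 1's creates a third 2, merging the 2's creates a third 4, and so on up to the 16's.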

Statistics Over Sliding Windows: Bitstream: count the number of ones [DGIM02]. Exact solution: $\Theta(N)$ bits. Algorithm BasicCounting: $(1 + \varepsilon)$ approximation (relative error!). Space: $O(\frac{1}{\varepsilon} \log^2 N)$ bits. Time: $O(\log N)$ worst case, $O(1)$ amortized per record. Lower bound: space $\Omega(\frac{1}{\varepsilon} \log^2 N)$ bits.

Complexity: Number of buckets $m$: $m \le [\text{number of buckets of size } j] \cdot [\text{number of different bucket sizes}] \le (k/2 + 1)(\log(2N/k) + 1) = O(k \log N)$. Each bucket requires $O(\log N)$ bits. Total memory: $O(k \log^2 N) = O(\frac{1}{\varepsilon} \log^2 N)$ bits. from [BDMO03]

Benefits of Sliding Windows: Incorporates new elements as they appear. Makes it easy to calculate statistics over data streams with respect to the last N elements, based on the histogram. Can estimate the number of 1’s within a factor of $(1 + \varepsilon)$ using only $\Theta(\frac{1}{\varepsilon} \log^2 N)$ bits of memory. from [BDMO03]

Expansion of Sliding Windows: The original Sliding Window method was not fully applicable to two important statistics, k-median and variance, because of the “merging” of the buckets. A solution was devised by Babcock, Datar, Motwani and O’Callaghan [BDMO03]. Their work derived a methodology for variance that was also applied to k-medians.

Variance and k-Medians: Variance: $\sum_i (x_i - \mu)^2$, with $\mu = \frac{1}{N} \sum_i x_i$. k-median clustering: given $N$ points $x_1, \ldots, x_N$ in a metric space, find $k$ points $C = \{c_1, c_2, \ldots, c_k\}$ that minimize $\sum_i d(x_i, C)$ (the assignment distance). Clustering will be covered in detail in a future presentation. from [BDMO03]

Notation: $V_i$ = variance of the $i$-th bucket; $n_i$ = number of elements in the $i$-th bucket; $\mu_i$ = mean of the $i$-th bucket. [Figure: the current window of size $N$, divided into buckets $B_1, B_2, \ldots, B_{m-1}, B_m$] from [BDMO03]

Variance – composition: $B_{i,j}$ = concatenation of buckets $i$ and $j$. from [BDMO03]
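The formula on this slide did not survive in the transcript. As a hedged reconstruction, the standard identity for combining two buckets, with variance taken as the unnormalized sum of squared deviations defined earlier, is $n_{i,j} = n_i + n_j$, $\mu_{i,j} = (n_i \mu_i + n_j \mu_j) / n_{i,j}$, and $V_{i,j} = V_i + V_j + \frac{n_i n_j}{n_{i,j}} (\mu_i - \mu_j)^2$. The sketch below checks that identity numerically; the helper names `bucket_stats` and `combine` are illustrative.

```python
def bucket_stats(values):
    """Return (n, mean, V) where V is the sum of squared deviations."""
    n = len(values)
    mean = sum(values) / n
    v = sum((x - mean) ** 2 for x in values)
    return n, mean, v


def combine(stats_i, stats_j):
    """Statistics of the concatenated bucket, from per-bucket statistics only."""
    n_i, mu_i, v_i = stats_i
    n_j, mu_j, v_j = stats_j
    n = n_i + n_j
    mu = (n_i * mu_i + n_j * mu_j) / n
    v = v_i + v_j + (n_i * n_j / n) * (mu_i - mu_j) ** 2
    return n, mu, v


# Usage: merging two buckets agrees with recomputing from the raw values.
left = [2.0, 4.0, 4.0]
right = [5.0, 7.0]
merged = combine(bucket_stats(left), bucket_stats(right))
direct = bucket_stats(left + right)
print(merged, direct)
```

The point of the identity is that buckets can be merged using only their three summary numbers, without revisiting the elements they cover.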

Decision Making: The problem of handling time-changing data also had a significant influence on decision algorithms. Pedro Domingos, who had originally developed a successful decision tree algorithm (VFDT), also recognized the need to work with recent data, resulting in a new algorithm known as CVFDT. VFDT: Very Fast Decision Tree. CVFDT: Concept-adapting Very Fast Decision Tree, which implements a window approach.

Decision Making: Both VFDT and CVFDT make use of a statistical result known as the Hoeffding* bound. It is used to estimate the minimum number of examples needed to make a decision for a node in a decision tree. This is the key concept that makes these algorithms work. * W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables”, Journal of the American Statistical Association, 1963

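Hoeffding Bound: Consider a random variable $a$ whose range is $R$, and $n$ independent observations of $a$ with mean $\bar{a}$. The Hoeffding bound states: with probability $1 - \delta$, the true mean of $a$ is at least $\bar{a} - \varepsilon$, where $\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$. from [DH00]/[HSD01]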
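As a worked illustration, the bound in its usual form, $\varepsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, can be evaluated directly, and inverted to get the number of observations needed for a target $\varepsilon$. The function names below are illustrative, and the sample values of $R$, $\delta$ and $\varepsilon$ are arbitrary.

```python
import math


def hoeffding_epsilon(value_range, delta, n):
    """Margin eps such that, with probability 1 - delta, the true mean
    is at least (observed mean - eps) after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))


def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the Hoeffding margin is at most `epsilon`."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))


# Usage: the margin shrinks as more examples are seen.
for n in (100, 1_000, 10_000):
    print(n, hoeffding_epsilon(1.0, 1e-7, n))
print(examples_needed(1.0, 1e-7, 0.05))
```

Note that the bound does not depend on the distribution of the observations, only on their range, which is why it can be applied to streaming examples without modeling assumptions.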

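Hoeffding Bound Significance… This estimate/bound is incorporated into an ID3-type decision tree, hence VFDT/CVFDT. The information gain is evaluated against ε: a node is split once the gain of the best attribute exceeds that of the second-best attribute by more than ε.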
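A minimal sketch of how the bound can drive a split decision, assuming the test compares the gap between the two best attributes' information gains with ε as described in [DH00]. The function name, the gain values, and the choice of δ are illustrative; for information gain with c classes the range R is taken as log2(c).

```python
import math


def should_split(best_gain, second_gain, value_range, delta, n):
    """Return True when n examples are enough to trust the best attribute.

    The node is split if the observed gap in gain between the two best
    attributes exceeds the Hoeffding margin eps for n examples.
    """
    eps = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))
    return (best_gain - second_gain) > eps


# Usage: two classes, so the range of information gain is log2(2) = 1.
value_range = math.log2(2)
print(should_split(0.30, 0.10, value_range, 1e-6, 50))
print(should_split(0.30, 0.10, value_range, 1e-6, 1000))
```

With few examples the margin is wide and the node keeps collecting data; once enough examples arrive, the same gap in gain becomes sufficient to split.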

VFDT Algorithm from [DH00]

VFDT Algorithm Results from [DH00]

CVFDT vs. VFDT: CVFDT is an extension of VFDT that incorporates “windowing”. CVFDT concept: grow the tree as usual, but using a window of “w” elements. Monitor changes in gain for the attributes. If the best attribute changes, grow an alternate subtree with the new “best” attribute, but keep it in the background. Replace the old subtree if the new subtree becomes more accurate.

The END – “El Final” Concepts Covered: Data Streams Constraints (time/space) Data Streams Model Synopsis Decision Maker Histograms Exponential Histogram Sliding Windows Variance Hoeffding Bounds Decision Tree Classifier

References
[BDMO03] B. Babcock, M. Datar, R. Motwani, and L. O’Callaghan. “Maintaining Variance and k-Medians over Data Stream Windows”. ACM PODS.
[DH00] P. Domingos and G. Hulten. “Mining High-Speed Data Streams”. ACM KDD.
[HSD01] G. Hulten, L. Spencer, and P. Domingos. “Mining Time-Changing Data Streams”. ACM KDD.
[DGIM02] M. Datar, A. Gionis, P. Indyk, and R. Motwani. “Maintaining Stream Statistics over Sliding Windows”. ACM-SIAM SODA.
[GGR02] M. Garofalakis, J. Gehrke, and R. Rastogi. “Querying and Mining Data Streams: You Only Get One Look”. SIGMOD 2002 (tutorial).