Data Mining: Concepts and Techniques Mining data streams

Slides:



Advertisements
Similar presentations
Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science.
Advertisements

Sampling From a Moving Window Over Streaming Data Brian Babcock * Mayur Datar Rajeev Motwani * Speaker Stanford University.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
Fast Algorithms For Hierarchical Range Histogram Constructions
Maintaining Variance over Data Stream Windows Brian Babcock, Mayur Datar, Rajeev Motwani, Liadan O ’ Callaghan, Stanford University ACM Symp. on Principles.
Mining Data Streams.
1 CS 361 Lecture 5 Approximate Quantiles and Histograms 9 Oct 2002 Gurmeet Singh Manku
1 Continuous Queries over Data Streams Vitaly Kroivets, Lyan Marina Presentation for The Seminar on Database and Internet The Hebrew University of Jerusalem,
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
Data Mining: Concepts and Techniques Mining time-series data.
Processing Data-Stream Joins Using Skimmed Sketches Minos Garofalakis Internet Management Research Department Bell Labs, Lucent Technologies Joint work.
1 Stream-based Data Management IS698 Min Song 2 Characteristics of Data Streams  Data Streams Data streams — continuous, ordered, changing, fast, huge.
1 Overview of Storage and Indexing Yanlei Diao UMass Amherst Feb 13, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
CS490D: Introduction to Data Mining Prof. Chris Clifton
Evaluating Hypotheses
Chapter 3 Forecasting McGraw-Hill/Irwin
A survey on stream data mining
1 PODS 2002 Motivation. 2 PODS 2002 Data Streams data sets Traditional DBMS – data stored in finite, persistent data sets data streams New Applications.
CS591A1 Fall Sketch based Summarization of Data Streams Manish R. Sharma and Weichao Ma.
Part II – TIME SERIES ANALYSIS C2 Simple Time Series Methods & Moving Averages © Angel A. Juan & Carles Serrat - UPC 2007/2008.
Forecasting McGraw-Hill/Irwin Copyright © 2012 by The McGraw-Hill Companies, Inc. All rights reserved.
CS 591 A11 Algorithms for Data Streams Dhiman Barman CS 591 A1 Algorithms for the New Age 2 nd Dec, 2002.
One-Pass Wavelet Decompositions of Data Streams TKDE May 2002 Anna C. Gilbert,Yannis Kotidis, S. Muthukrishanan, Martin J. Strauss Presented by James Chan.
Slides 13b: Time-Series Models; Measuring Forecast Error
Time series analysis and Sequence Segmentation
Stream Clustering CSE 902. Big Data Stream analysis Stream: Continuous flow of data Challenges ◦Volume: Not possible to store all the data ◦One-time.
Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman.
Data Mining Techniques
Time Series “The Art of Forecasting”. What Is Forecasting? Process of predicting a future event Underlying basis of all business decisions –Production.
Estimating Entropy for Data Streams Khanh Do Ba, Dartmouth College Advisor: S. Muthu Muthukrishnan.
Business Forecasting Used to try to predict the future Uses two main methods: Qualitative – seeking opinions on which to base decision making – Consumer.
Bin Yao Spring 2014 (Slides were made available by Feifei Li) Advanced Topics in Data Management.
Sullivan – Fundamentals of Statistics – 2 nd Edition – Chapter 11 Section 1 – Slide 1 of 34 Chapter 11 Section 1 Random Variables.
Data Stream Systems Reynold Cheng 12 th July, 2002 Based on slides by B. Babcock et.al, “Models and Issues in Data Stream Systems”, PODS’02.
Streaming Algorithms Piotr Indyk MIT. Data Streams A data stream is a sequence of data that is too large to be stored in available memory Examples: –Network.
DAVIS AQUILANO CHASE PowerPoint Presentation by Charlie Cook F O U R T H E D I T I O N Forecasting © The McGraw-Hill Companies, Inc., 2003 chapter 9.
1 Approximating Quantiles over Sliding Windows Srimathi Harinarayanan CMPS 565.
Time-Series Forecasting Overview Moving Averages Exponential Smoothing Seasonality.
Chapter 7 Probability and Samples: The Distribution of Sample Means.
Data Stream Algorithms Ke Yi Hong Kong University of Science and Technology.
Data Stream Management Systems
Data Warehousing Mining & BI Data Streams Mining DWMBI1.
1 Online Computation and Continuous Maintaining of Quantile Summaries Tian Xia Database CCIS Northeastern University April 16, 2004.
Data Streams Topics in Data Mining Fall 2015 Bruno Ribeiro © 2015 Bruno Ribeiro.
Learning Objectives Describe what forecasting is Explain time series & its components Smooth a data series –Moving average –Exponential smoothing Forecast.
Forecasting is the art and science of predicting future events.
Mining of Massive Datasets Ch4. Mining Data Streams.
Streaming Semantic Data COMP6215 Semantic Web Technologies Dr Nicholas Gibbins –
Chapter 11 – With Woodruff Modications Demand Management and Forecasting Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved.McGraw-Hill/Irwin.
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Discrete Methods in Mathematical Informatics Kunihiko Sadakane The University of Tokyo
Data Streams COMP3017 Advanced Databases Dr Nicholas Gibbins –
Confidence Intervals Cont.
Mining Data Streams (Part 1)
Data Transformation: Normalization
The Stream Model Sliding Windows Counting 1’s
Data Mining: Concepts and Techniques
“The Art of Forecasting”
Streaming & sampling.
Mining Dynamics of Data Streams in Multi-Dimensional Space
Counting How Many Elements Computing “Moments”
Mining Unusual Patterns in Data Streams in Multi-Dimensional Space
Advanced Topics in Data Management
Data Mining: Concepts and Techniques — Chapter 8 — 8
Data Mining: Concepts and Techniques — Chapter 8 — 8
Range-Efficient Computation of F0 over Massive Data Streams
Introduction to Stream Computing and Reservoir Sampling
Online Analytical Processing Stream Data: Is It Feasible?
Data Mining: Concepts and Techniques — Chapter 8 — 8
Mining Data Streams Many slides are borrowed from Stanford Data Mining Class and Prof. Jiawei Han’s lecture slides.
Presentation transcript:

Data Mining: Concepts and Techniques Mining data streams . April 26, 2017 Data Mining: Concepts and Techniques

Chapter 8. Mining Stream, Time-Series, and Sequence Data Mining data streams Mining time-series data Mining sequence patterns in transactional databases Mining sequence patterns in biological data April 26, 2017 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Mining Data Streams What is stream data? Why Stream Data Systems? Stream data management systems: Issues and solutions Stream data cube and multidimensional OLAP analysis Stream frequent pattern analysis Stream classification Stream cluster analysis Research issues April 26, 2017 Data Mining: Concepts and Techniques

Characteristics of Data Streams Data streams—continuous, ordered, changing, fast, huge amount Traditional DBMS—data stored in finite, persistent data sets Characteristics Huge volumes of continuous data, possibly infinite Fast changing and requires fast, real-time response Data stream captures nicely our data processing needs of today Random access is expensive—single scan algorithm (can only have one look) Store only the summary of the data seen thus far Most stream data are at pretty low-level or multi-dimensional in nature, needs multi-level and multi-dimensional processing April 26, 2017 Data Mining: Concepts and Techniques

Stream Data Applications Telecommunication calling records Business: credit card transaction flows Network monitoring and traffic engineering Financial market: stock exchange Engineering & industrial processes: power supply & manufacturing Sensor, monitoring & surveillance: video streams, RFIDs Security monitoring Web logs and Web page click streams Massive data sets (even saved but random access is too expensive) April 26, 2017 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques DBMS versus DSMS Persistent relations One-time queries Random access “Unbounded” disk store Only current state matters No real-time services Relatively low update rate Data at any granularity Assume precise data Access plan determined by query processor, physical DB design Transient streams Continuous queries Sequential access Bounded main memory Historical data is important Real-time requirements Possibly multi-GB arrival rate Data at fine granularity Data stale/imprecise Unpredictable/variable data arrival and characteristics Ack. From Motwani’s PODS tutorial slides April 26, 2017 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Mining Data Streams What is stream data? Why Stream Data Systems? Stream data management systems: Issues and solutions Stream data cube and multidimensional OLAP analysis Stream frequent pattern analysis Stream classification Stream cluster analysis Research issues April 26, 2017 Data Mining: Concepts and Techniques

Architecture: Stream Query Processing SDMS (Stream Data Management System) User/Application Continuous Query Results Multiple streams Stream Query Processor Scratch Space (Main memory and/or Disk) April 26, 2017 Data Mining: Concepts and Techniques

Challenges of Stream Data Processing Multiple, continuous, rapid, time-varying, ordered streams Main memory computations Queries are often continuous Evaluated continuously as stream data arrives Answer updated over time Queries are often complex Beyond element-at-a-time processing Beyond stream-at-a-time processing Beyond relational queries (scientific, data mining, OLAP) Multi-level/multi-dimensional processing and data mining Most stream data are at low-level or multi-dimensional in nature April 26, 2017 Data Mining: Concepts and Techniques

Processing Stream Queries Query types One-time query vs. continuous query (being evaluated continuously as stream continues to arrive) Predefined query vs. ad-hoc query (issued on-line) Unbounded memory requirements For real-time response, main memory algorithm should be used Approximate query answering With bounded memory, it is not always possible to produce exact answers High-quality approximate answers are desired Data reduction and synopsis construction methods Sketches, random sampling, histograms, wavelets, etc. April 26, 2017 Data Mining: Concepts and Techniques

Methodologies for Stream Data Processing Major challenges Keep track of a large universe, e.g., pairs of IP address, not ages Methodology Synopses (trade-off between accuracy and storage) Use synopsis data structure, much smaller (O(logk N) space) than their base data set (O(N) space) Compute an approximate answer within a small error range (factor ε of the actual answer) Major methods Random sampling Histograms Sliding windows Multi-resolution model Sketches Radomized algorithms April 26, 2017 Data Mining: Concepts and Techniques

Stream Data Processing Methods (1) Random sampling (but without knowing the total length in advance) Reservoir sampling: maintain a set of s candidates in the reservoir, which form a true random sample of the element seen so far in the stream. As the data stream flow, every new element has a certain probability (s/N) of replacing an old element in the reservoir. Sliding windows Make decisions based only on recent data of sliding window size w An element arriving at time t expires at time t + w Histograms Approximate the frequency distribution of element values in a stream Partition data into a set of contiguous buckets Equal-width (equal value range for buckets) vs. V-optimal (minimizing frequency variance within each bucket) Multi-resolution models Popular models: balanced binary trees, micro-clusters, and wavelets April 26, 2017 Data Mining: Concepts and Techniques

Stream Data Processing Methods (2) Sketches Histograms and wavelets require multi-passes over the data but sketches can operate in a single pass Frequency moments of a stream A = {a1, …, aN}, Fk: where v: the universe or domain size, mi: the frequency of i in the sequence Given N elts and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space Randomized algorithms Monte Carlo algorithm: bound on running time but may not return correct result Chebyshev’s inequality: Let X be a random variable with mean μ and standard deviation σ Chernoff bound: Let X be the sum of independent Poisson trials X1, …, Xn, δ in (0, 1] The probability decreases expoentially as we move from the mean April 26, 2017 Data Mining: Concepts and Techniques

Mining Time-Series Data Time-series database Consists of sequences of values or events changing with time Data is recorded at regular intervals Characteristic time-series components Trend, cycle, seasonal, irregular Applications Financial: stock price, inflation Industry: power consumption Scientific: experiment results Meteorological: precipitation April 26, 2017 Data Mining: Concepts and Techniques

Categories of Time-Series Movements Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time Cyclic movements or cycle variations: long term oscillations about a trend line or curve e.g., business cycles, may or may not be periodic Seasonal movements or seasonal variations i.e, almost identical patterns that a time series appears to follow during corresponding months of successive years. Irregular or random movements Time series analysis: decomposition of a time series into these four basic movements Additive Modal: TS = T + C + S + I Multiplicative Modal: TS = T  C  S  I April 26, 2017 Data Mining: Concepts and Techniques

Estimation of Trend Curve The freehand method Fit the curve by looking at the graph Costly and barely reliable for large-scaled data mining The least-square method Find the curve minimizing the sum of the squares of the deviation of points on the curve from the corresponding data points The moving-average method April 26, 2017 Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques Moving Average Moving average of order n Smoothes the data Eliminates cyclic, seasonal and irregular movements Loses the data at the beginning or end of a series Sensitive to outliers (can be reduced by weighted moving average) April 26, 2017 Data Mining: Concepts and Techniques