
1 Data Mining: Concepts and Techniques — Chapter 8 — 8.2 Mining Time-Series Data
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
©2010 Jiawei Han and Micheline Kamber. All rights reserved.


3 Mining Time-Series Data
A time series is a sequence of data points, typically measured at successive times spaced at (often uniform) intervals.
Time series analysis: a subfield of statistics comprising methods that attempt to understand such series, often either to understand the underlying context of the data points or to make forecasts (predictions).
Methods for time series analysis:
- Frequency-domain methods: model-free analyses, well suited to exploratory investigation (spectral analysis vs. wavelet analysis)
- Time-domain methods: auto-correlation and cross-correlation analysis
- Motif-based time-series analysis
Applications:
- Financial: stock prices, inflation
- Industry: power consumption
- Scientific: experiment results
- Meteorological: precipitation

4 Mining Time-Series Data
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data
- Summary

5 Time-Series Data Analysis: Prediction & Regression Analysis
(Numerical) prediction is similar to classification:
- Construct a model
- Use the model to predict a continuous or ordered value for a given input
Prediction is different from classification:
- Classification predicts a categorical class label
- Prediction models continuous-valued functions
Major method for prediction: regression
- Models the relationship between one or more independent (predictor) variables and a dependent (response) variable
Regression analysis:
- Linear and multiple regression
- Non-linear regression
- Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees

6 What is Regression?
- Modeling the relationship between one response variable and one or more predictor variables
- Analyzing the confidence of the model
- E.g., height vs. weight

7 Regression Yields Analytical Model
Discrete data points → analytical model:
- General relationship
- Easy calculation
- Further analysis
- Application: prediction

8 Application - Detrending
- Obtain the trend for an irregular data series
- Subtract the trend
- Reveal the oscillations

9 Linear Regression - Single Predictor
- y: response variable; x: predictor variable
- Model is linear: y = w0 + w1 x, where w0 (y-intercept) and w1 (slope) are the regression coefficients
- Method of least squares estimates the coefficients from the training pairs (xi, yi) (see the code sketch below):
  w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)²,  w0 = ȳ − w1 x̄
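A minimal numpy sketch of this least-squares fit, keeping the slide's w0/w1 naming; the example data points are made up for illustration:

```python
import numpy as np

def fit_simple_linear(x, y):
    """Least-squares fit of y = w0 + w1 * x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    x_bar, y_bar = x.mean(), y.mean()
    # Slope: covariance of (x, y) divided by the variance of x
    w1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # Intercept: the fitted line passes through (x_bar, y_bar)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Made-up points scattered around y = 2 + 3x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])
w0, w1 = fit_simple_linear(x, y)
print(f"y = {w0:.2f} + {w1:.2f} x")
```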

10 Linear Regression - Multiple Predictors
- Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
- E.g., for 2-D data: y = w0 + w1 x1 + w2 x2
- Solvable by an extension of the least-squares method via the normal equations (see the sketch below):
  (XᵀX) W = XᵀY  →  W = (XᵀX)⁻¹ XᵀY
- Or by commercial software (SAS, S-Plus)
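A sketch of the multiple-predictor fit. Rather than forming (XᵀX)⁻¹ explicitly, it uses numpy's least-squares solver, which solves the same normal equations but is numerically safer; the toy data are invented:

```python
import numpy as np

def fit_multiple_linear(X, y):
    """Solve (X^T X) W = X^T Y for W, with a column of ones for the intercept."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    # lstsq solves the least-squares problem without an explicit matrix inverse
    W, *_ = np.linalg.lstsq(X, y, rcond=None)
    return W  # [w0, w1, w2, ...]

# Toy 2-predictor data roughly following y = 1 + 2*x1 + 3*x2
X = np.array([[0, 1], [1, 0], [1, 1], [2, 1], [2, 2]], float)
y = np.array([4.0, 3.1, 5.9, 8.1, 11.0])
print(fit_multiple_linear(X, y))
```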

11 Nonlinear Regression with Linear Method
Polynomial regression model:
- E.g., y = w0 + w1 x + w2 x² + w3 x³
- Let x2 = x², x3 = x³; then y = w0 + w1 x + w2 x2 + w3 x3 is linear in the new variables (sketched in code below)
Log-linear regression model:
- E.g., y = exp(w0 + w1 x + w2 x² + w3 x³)
- Let y′ = log(y); then y′ = w0 + w1 x + w2 x² + w3 x³
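The substitution trick in code: expand x into its powers so the model becomes linear in the coefficients, then reuse the least-squares solver. The data and coefficients are synthetic, chosen only to illustrate the recovery:

```python
import numpy as np

def fit_cubic(x, y):
    """Fit y = w0 + w1*x + w2*x^2 + w3*x^3 via the substitution x2 = x^2, x3 = x^3."""
    # The expanded design matrix makes the model linear in the weights w
    X = np.column_stack([np.ones_like(x), x, x**2, x**3])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

x = np.linspace(-2, 2, 50)
y = 1 - 2 * x + 0.5 * x**3 + np.random.default_rng(0).normal(0, 0.1, x.size)
print(fit_cubic(x, y))  # approximately [1, -2, 0, 0.5]
# The log-linear model fits the same way after the substitution y' = log(y)
```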

12 Generalized Linear Regression
- Response y has a distribution in the exponential family
- Variance of y depends on E(y); it is not a constant
- E(y) = g⁻¹(w0 + w1 x1 + w2 x2 + w3 x3), where g is the link function
- Examples (a sketch of the logistic case follows below):
  - Logistic regression (binomial regression): probability of some event occurring
  - Poisson regression: counts, e.g., number of customers
- References: Nelder and Wedderburn, 1972; McCullagh and Nelder, 1989
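A hedged sketch of the binomial case (logistic regression), fitted by plain gradient ascent on the log-likelihood rather than any GLM library; the learning rate, iteration count, and toy data are arbitrary choices for illustration:

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression: E(y) = g^-1(Xw), where g^-1 is the sigmoid."""
    X = np.column_stack([np.ones(len(X)), X])  # intercept column
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted event probabilities
        w += lr * X.T @ (y - p) / len(y)    # gradient of the log-likelihood
    return w

# Toy data: the event becomes likelier as x grows
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
print(fit_logistic(X, y))
```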

13 Regression Tree (Breiman et al., 1984)
- Partitions the domain space
- Leaf: a continuous-valued prediction, namely the average value of the training tuples that reach it

14 Model Tree (Quinlan, 1992)
- Leaf: a linear equation
- More general than a regression tree

15 Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data cannot be represented well by a simple linear model

16 Predictive Modeling in Multidimensional Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data
- One can only predict value ranges or category distributions
- Method outline:
  - Minimal generalization
  - Attribute relevance analysis
  - Generalized linear model construction
  - Prediction
- Determine the major factors which influence the prediction
  - Data relevance analysis: uncertainty measurement, entropy analysis, expert judgment, etc.
- Multi-level prediction: drill-down and roll-up analysis


18 Prediction: Numerical Data

19 References
- Nelder, J. A. and Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135.
- Chatfield, C. The Analysis of Time Series: An Introduction, 3rd ed. Chapman & Hall, 1984.
- McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models, 2nd ed. Chapman and Hall, London.
- Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Chapman & Hall (Wadsworth, Inc.): New York.
- Quinlan, J. R. (1992). Learning with continuous classes. In Adams and Sterling (Eds.), Proceedings of AI'92, World Scientific, Singapore.
Acknowledgment: this presentation integrates Xiaopeng Li's slides from his CS 512 class presentation.

20 Mining Time-Series Data
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data
- Summary

21 A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time

22 Categories of Time-Series Movements
- Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time
- Cyclic movements or cyclic variations: long-term oscillations about a trend line or curve, e.g., business cycles; may or may not be periodic
- Seasonal movements or seasonal variations: almost identical patterns that a time series appears to follow during corresponding months of successive years
- Irregular or random movements
Time series analysis: decomposition of a time series into these four basic movements (a rough code sketch follows)
- Additive model: TS = T + C + S + I
- Multiplicative model: TS = T × C × S × I
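A rough sketch of the additive decomposition for data with a known period: the trend comes from a moving average, the seasonal component from per-position means of the detrended series, and the cyclic and irregular parts are left together in the remainder. The smoothing choices here are my assumptions, not the chapter's prescription:

```python
import numpy as np

def decompose_additive(ts, period=12):
    """Rough additive decomposition: TS = trend + seasonal + remainder."""
    ts = np.asarray(ts, float)
    # Trend: moving average over one full period (edges are distorted; a sketch)
    trend = np.convolve(ts, np.ones(period) / period, mode="same")
    detrended = ts - trend
    # Seasonal: mean detrended value at each position within the period
    one_cycle = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(one_cycle, len(ts) // period + 1)[:len(ts)]
    # Remainder holds the cyclic and irregular movements in this simple sketch
    remainder = ts - trend - seasonal
    return trend, seasonal, remainder
```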

23 Estimation of Trend Curve
- The freehand method
  - Fit the curve by looking at the graph
  - Costly and barely reliable for large-scale data mining
- The least-squares method
  - Find the curve minimizing the sum of squared deviations between points on the curve and the corresponding data points
- The moving-average method

24 Moving Average
- Moving average of order n: the mean of n successive values, recomputed as the window slides along the series (see the sketch below)
- Smooths the data
- Eliminates cyclic, seasonal, and irregular movements
- Loses the data at the beginning and end of a series
- Sensitive to outliers (the effect can be reduced by a weighted moving average)
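Plain and weighted moving averages in numpy; note how the "valid" output is shorter than the input, losing points at the ends as noted above. The example series and weights are invented:

```python
import numpy as np

def moving_average(ts, n):
    """Moving average of order n ('valid' mode drops points at both ends)."""
    return np.convolve(ts, np.ones(n) / n, mode="valid")

def weighted_moving_average(ts, weights):
    """Weighted moving average; the analyst picks weights, e.g., to damp suspect points."""
    w = np.asarray(weights, float)
    return np.convolve(ts, w / w.sum(), mode="valid")

ts = np.array([3.0, 7.0, 2.0, 9.0, 30.0, 4.0, 6.0])  # 30.0 is an outlier
print(moving_average(ts, 3))
print(weighted_moving_average(ts, [1, 2, 1]))
```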

25 Trend Discovery in Time-Series (1): Estimation of Seasonal Variations
Seasonal index
- A set of numbers showing the relative values of a variable during the months of the year
- E.g., if the sales during October, November, and December are 80%, 120%, and 140% of the average monthly sales for the whole year, respectively, then 80, 120, and 140 are the seasonal index numbers for these months
Deseasonalized data
- Data adjusted for seasonal variations, for better trend and cyclic analysis
- Divide the original monthly data by the seasonal index numbers for the corresponding months (see the sketch below)
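Deseasonalizing in code, reusing the slide's own Oct/Nov/Dec example. Index numbers are percentages of the average monthly value, so we divide by index/100:

```python
import numpy as np

def deseasonalize(values, seasonal_index):
    """Divide each value by its seasonal index (index given in percent)."""
    return np.asarray(values, float) / (np.asarray(seasonal_index, float) / 100.0)

# The slide's example: Oct/Nov/Dec sales at 80%, 120%, 140% of the monthly average
sales = [80.0, 120.0, 140.0]
index = [80.0, 120.0, 140.0]
print(deseasonalize(sales, index))  # -> [100. 100. 100.], seasonality removed
```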

26 Seasonal Index
(Figure: table of raw monthly data and the seasonal index computed from it)

27 Trend Discovery in Time-Series (2)
- Estimation of cyclic variations: if (approximate) periodicity of cycles occurs, a cyclic index can be constructed in much the same manner as seasonal indexes
- Estimation of irregular variations: by adjusting the data for trend, seasonal, and cyclic variations
- With systematic analysis of the trend, cyclic, seasonal, and irregular components, it is possible to make long- or short-term predictions with reasonable quality

28 Mining Time-Series Data
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data
- Summary

29 Similarity Search in Time-Series Analysis
- A normal database query finds exact matches
- A similarity search finds data sequences that differ only slightly from the given query sequence
- Two categories of similarity queries:
  - Whole matching: find a sequence that is similar to the query sequence
  - Subsequence matching: find all subsequences of the data sequences that are similar to the query sequence
- Typical applications:
  - Financial markets
  - Market basket data analysis
  - Scientific databases
  - Medical diagnosis

30 Data Transformation
- Many techniques for signal analysis require the data to be in the frequency domain
- Usually data-independent transformations are used; the transformation matrix is determined a priori
  - Discrete Fourier transform (DFT)
  - Discrete wavelet transform (DWT)
- The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

31 Discrete Fourier Transform
- DFT does a good job of concentrating energy in the first few coefficients
- If we keep only the first few coefficients of the DFT, we can compute a lower bound of the actual distance
- Feature extraction: keep the first few coefficients (the F-index) as representative of the sequence

32 DFT (continued)
Parseval's Theorem: the Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
- Keeping only the first few (say, 3) coefficients therefore underestimates the actual distance, so there will be no false dismissals (see the sketch below)
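A sketch of DFT feature extraction with numpy's FFT. With orthonormal scaling, Parseval's theorem makes the truncated frequency-domain distance a lower bound on the true Euclidean distance, which is what guarantees no false dismissals; the signals and k are arbitrary examples:

```python
import numpy as np

def dft_features(ts, k=3):
    """Keep the first k DFT coefficients as the F-index features."""
    # Orthonormal scaling ("ortho") makes the transform distance-preserving
    return np.fft.fft(ts, norm="ortho")[:k]

def lower_bound_dist(f1, f2):
    """Distance over the kept coefficients: never exceeds the true distance."""
    return np.sqrt(np.sum(np.abs(f1 - f2) ** 2))

q = np.sin(np.linspace(0, 4 * np.pi, 64))
c = q + np.random.default_rng(1).normal(0, 0.1, 64)
true_d = np.sqrt(np.sum((q - c) ** 2))
print(lower_bound_dist(dft_features(q), dft_features(c)), "<=", true_d)
```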

33 Multidimensional Indexing in Time-Series
Multidimensional index construction
- Constructed for efficient access using the first few Fourier coefficients
Similarity search
- Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence
- Perform post-processing by computing the actual distance between sequences in the time domain, and discard any false matches

34 Subsequence Matching
- Break each sequence into pieces using a window of length w
- Extract the features of the subsequence inside each window
- Map each sequence to a "trail" in the feature space
- Divide the trail of each sequence into "subtrails" and represent each of them by its minimum bounding rectangle
- Use a multi-piece assembly algorithm to search for longer sequence matches

35 Analysis of Similar Time Series

36 Enhanced Similarity Search Methods
- Allow for gaps within a sequence or differences in offsets or amplitudes
- Normalize sequences with amplitude scaling and offset translation
- Two subsequences are considered similar if one lies within an envelope of width ε around the other, ignoring outliers
- Two sequences are said to be similar if they have enough non-overlapping time-ordered pairs of similar subsequences
- Parameters specified by a user or expert: sliding-window size, width of the envelope for similarity, maximum gap, and matching fraction

37 Steps for Performing a Similarity Search
- Atomic matching: find all pairs of gap-free windows of a small length that are similar
- Window stitching: stitch similar windows to form pairs of large similar subsequences, allowing gaps between atomic matches
- Subsequence ordering: linearly order the subsequence matches to determine whether enough similar pieces exist

38 Similar Time Series Analysis
(Figure: two similar mutual funds from different fund groups, the VanEck International Fund and the Fidelity Selective Precious Metal and Mineral Fund)

39 Query Languages for Time Sequences
Time-sequence query language
- Should be able to specify sophisticated queries like "Find all of the sequences that are similar to some sequence in class A, but not similar to any sequence in class B"
- Should support various kinds of queries: range queries, all-pair queries, and nearest-neighbor queries
Shape definition language
- Allows users to define and query the overall shape of time sequences
- Uses a human-readable series of sequence transitions or macros
- Ignores the specific details
- E.g., the pattern "up, Up, UP" can be used to describe increasing degrees of rising slopes
- Macros: spike, valley, etc.

40 Mining Time-Series Data
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data
- Summary

41 Sequence Distance
- A distance function measures the dissimilarity of two sequences (of possibly unequal length)
- Example: the Euclidean distance between equal-length time series Q and C (a two-line helper below):
  D(Q, C) = sqrt( Σi (qi − ci)² )
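The equal-length Euclidean distance as a small helper, reused by the later sketches:

```python
import numpy as np

def euclidean(q, c):
    """D(Q, C) = sqrt(sum_i (q_i - c_i)^2); Q and C must have equal length."""
    q, c = np.asarray(q, float), np.asarray(c, float)
    return np.sqrt(np.sum((q - c) ** 2))

print(euclidean([1, 2, 3], [2, 2, 5]))  # sqrt(1 + 0 + 4) = sqrt(5)
```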

42 Motif: Basic Concepts
- What is a motif? A previously unknown, frequently occurring sequential pattern
- Match: given subsequences Q, C ⊆ T, C is a match for Q iff D(Q, C) ≤ R for some range R
- Non-trivial match: for C = T[p..*] and Q = T[q..*] with C matching Q, the match is trivial if p = q or if no non-matching subsequence N = T[s..*] exists with s between p and q; i.e., for a non-trivial match, C and Q must be separated by a non-match
- 1-Motif: the subsequence with the most non-trivial matches (least variance decides ties)
- k-Motif: Ck such that D(Ck, Ci) > 2R ∀ i ∈ [1, k)

43 SAX: Symbolic Aggregate approXimation
- Dimensionality reduction/compression
- SAX maps a real-valued time series to a word over an alphabet Σ, e.g., a raw series ↦ ccbaabbbabcbcb
- Essentially an alphabet over the Piecewise Aggregate Approximation (PAA) ranks
- Faster, simpler, and more compressing, yet on par with DFT, DWT, and other dimensionality reductions

44 SAX Illustration

45 SAX Algorithm
1. Parameters: alphabet size and word (segment) length (or output rate)
2. Select a probability distribution for the TS
3. z-Normalize the TS
4. PAA: within each time interval, compute the aggregated value (mean) of the segment
5. Partition the TS range by equal-area partitioning of the PDF into n partitions (equal-frequency binning)
6. Label each segment with the symbol a_rank ∈ Σ for the partition containing its aggregate (see the sketch below)
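A compact sketch of these steps, assuming a standard normal distribution for the z-normalized series; scipy's norm.ppf supplies the equal-area Gaussian breakpoints, and the segment and alphabet sizes are arbitrary example parameters:

```python
import numpy as np
from scipy.stats import norm

def sax(ts, n_segments, alphabet_size):
    """SAX word for one series: z-normalize -> PAA -> symbolize."""
    ts = np.asarray(ts, float)
    std = ts.std()
    z = (ts - ts.mean()) / std if std > 0 else ts - ts.mean()  # z-normalization
    # PAA: mean of each of the n_segments roughly equal-length segments
    paa = np.array([seg.mean() for seg in np.array_split(z, n_segments)])
    # Breakpoints cut the N(0,1) density into equal-area (equal-probability) bins
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    ranks = np.searchsorted(breakpoints, paa)  # partition rank per segment
    return "".join(chr(ord("a") + r) for r in ranks)

ts = np.sin(np.linspace(0, 2 * np.pi, 128))
print(sax(ts, n_segments=8, alphabet_size=4))  # a word like 'cddcbaab'
```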

46 Finding Motifs in a Time Series
EMMA algorithm: finds the 1-motif (k-motif) of fixed length n
- SAX compression (dimensionality reduction) makes it feasible to store D(i, j) ∀ (i, j) ∈ Σ × Σ
- Allows use of various distance measures (Minkowski, Dynamic Time Warping)
Multiple tiers:
- Tier 1: uses a sliding window to hash length-w SAX subsequences (a^w addresses, total size O(m)); the bucket B with the most collisions, together with the buckets with MINDIST(B) < R, forms the neighborhood of B (tier 1 is sketched below)
- Tier 2: the neighborhood is pruned using the more precise ADM algorithm; the Ni with the maximum matches is the 1-motif; early stop if |ADM matches| > max_{k>i}(|neighborhood_k|)
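A hedged sketch of tier 1 only: hash every length-w subsequence by its SAX word and return the fullest bucket as the seed of the motif neighborhood. The MINDIST neighborhood expansion and the tier-2 ADM pruning are omitted, and sax() is the helper sketched above:

```python
from collections import defaultdict

def emma_tier1(ts, w, n_segments=4, alphabet_size=4):
    """Hash each length-w subsequence by its SAX word; return the fullest bucket."""
    buckets = defaultdict(list)
    for i in range(len(ts) - w + 1):
        word = sax(ts[i:i + w], n_segments, alphabet_size)  # hash address = SAX word
        buckets[word].append(i)  # remember where the subsequence starts
    # The bucket with the most collisions seeds the 1-motif neighborhood
    best = max(buckets, key=lambda k: len(buckets[k]))
    return best, buckets[best]
```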

47 Hashing
(Figure: a length-w sliding window over a series of length n; each window's SAX word, e.g., "c e a b c b", is used as a hash address)

48 Classification in Time Series
- Applications: finance, medicine
- 1-Nearest Neighbor
  - Pros: accurate, robust, simple
  - Cons: time and space complexity (lazy learning); results are not interpretable
Classification is an important problem in the data mining domain, and classification of time series has attracted great interest over the past decade, with applications across a wide variety of domains such as medicine and finance. Recent research suggests that nearest-neighbor classification is very hard to beat for most problems: it is the most accurate in extensive empirical tests, it is robust to noise, and it is extremely simple and generic.

49 Shapelets: Stinging Nettles vs. False Nettles
(Figure: leaf decision tree and shapelet dictionary)
Given two species of leaves, the first is called stinging nettle and the second is colloquially called "false nettle", since it is very similar to the first group and easily confused with it. We want to build a classifier for these two species based on shape. First we convert the leaf outline to a one-dimensional time-series representation. Because the global differences between the two types of shapes are subtle, we instead use a particularly distinguishing subsection. By "brushing" the subsection back onto the shape, we gain insight into the difference: the stinging nettle's stem connects to the leaf at an angle close to 90 degrees, while the false nettle's stem connects to the leaf at a much wider angle. This distinguishing subsection is called a "shapelet". Now we build the classifier: an outline time series with a subsection similar to this shapelet is classified as false nettle; otherwise it is stinging nettle. So how can we extract these distinguishing shapelets?

50 Testing the Utility of a Candidate Shapelet
- Information gain as the utility measure
- Arrange the time series objects by their distance from the candidate
- Find the optimal split point (maximal information gain)
- Pick the candidate achieving the best utility as the shapelet (see the sketch below)
Once all the candidates are in the pool, we test and compare the utility of each candidate shapelet, using information gain as the utility function (the higher, the better). First, we arrange the objects by the distance from each object to the candidate; for visual illustration, the leaves can be arranged on the real number line according to this distance. Then we test every possible split point between each two objects and find the optimal one that best divides the leaves into their two original classes. Because the brute-force algorithm exhaustively checks every single candidate, it is guaranteed to find the best candidate as the shapelet. However, the brute-force method suffers from a serious problem: to get the distance from each object to the candidate, we must find the best matching location for the candidate in the time series, which introduces a subsequence search for every single distance calculation.
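A sketch of the brute-force utility test for a single candidate: compute each series' best-match distance to the candidate, then scan the split points for maximal information gain. euclidean() is the helper defined earlier; the weighting follows the standard information-gain formula:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def candidate_utility(candidate, dataset, labels):
    """Best information gain achievable by splitting on distance to candidate."""
    m = len(candidate)
    # Distance from each series: best matching location of the candidate within it
    dists = np.array([min(euclidean(ts[i:i + m], candidate)
                          for i in range(len(ts) - m + 1))
                      for ts in dataset])
    order = np.argsort(dists)
    d, y = dists[order], np.asarray(labels)[order]
    base, n = entropy(y), len(y)
    best_gain, best_split = 0.0, None
    for k in range(1, n):  # test a split between each adjacent pair
        gain = base - (k * entropy(y[:k]) + (n - k) * entropy(y[k:])) / n
        if gain > best_gain:
            best_gain, best_split = gain, (d[k - 1] + d[k]) / 2
    return best_gain, best_split
```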

51 Shapelet Classification
(Figure: leaf decision tree and shapelet dictionary)
After we find the shapelet, we can frame it as a decision tree: at each step of decision-tree induction, we determine a shapelet. At classification time, we compare the time series against the shapelet at each node. Although finding shapelets takes long during training, classification is very fast, since the number of shapelets a time series must be compared against equals the height of the decision tree.

52 Mining Time-Series Data
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data
- Summary

53 Summary
Time series analysis is an important research field in data mining:
- Regression Analysis
- Trend Analysis
- Similarity Search in Time Series Data
- Motif-Based Search and Mining in Time Series Data

54 References on Time-Series Similarity Search
- R. Agrawal, C. Faloutsos, and A. Swami. Efficient similarity search in sequence databases. FODO'93 (Foundations of Data Organization and Algorithms).
- R. Agrawal, K.-I. Lin, H. S. Sawhney, and K. Shim. Fast similarity search in the presence of noise, scaling, and translation in time-series databases. VLDB'95.
- R. Agrawal, G. Psaila, E. L. Wimmers, and M. Zait. Querying shapes of histories. VLDB'95.
- C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD'94.
- J. Lin, E. Keogh, S. Lonardi, and B. Chiu. A symbolic representation of time series, with implications for streaming algorithms. Data Mining and Knowledge Discovery, 2003.
- P. Patel, E. Keogh, J. Lin, and S. Lonardi. Mining motifs in massive time series databases. ICDM'02.
- D. Rafiei and A. Mendelzon. Similarity-based queries for time series data. SIGMOD'97.
- Y. Moon, K. Whang, and W. Loh. Duality-based subsequence matching in time-series databases. ICDE'02.
- B.-K. Yi, H. V. Jagadish, and C. Faloutsos. Efficient retrieval of similar time sequences under time warping. ICDE'98.
- B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. ICDE'00.
- D. Shasha and Y. Zhu. High Performance Discovery in Time Series: Techniques and Case Studies. Springer, 2004.
- L. Ye and E. Keogh. Time series shapelets: a new primitive for data mining. KDD'09.


