Time Series Shapelets: A New Primitive for Data Mining

Presentation transcript:

Time Series Shapelets: A New Primitive for Data Mining. KDD 2009. Lexiang Ye and Eamonn Keogh, University of California, Riverside. Presented by: Zhenhui Li. Hello, everyone. I am Lexiang Ye, and this is joint work with my advisor, Dr. Eamonn Keogh.

Classification in Time Series. Applications: finance, medicine. 1-Nearest Neighbor. Pros: accurate, robust, simple. Cons: time and space complexity (lazy learning); results are not interpretable. As we all know, classification is an important problem in data mining. Classification of time series has attracted great interest over the past decade, and its applications span a wide variety of domains, such as medicine and finance. Recent research suggests that nearest neighbor classification is very hard to beat for most problems: extensive empirical tests have shown it to be highly accurate, it is robust to noise, and it is extremely simple and generic.

Solution: Shapelets. A shapelet is a time series subsequence that is representative of a class and discriminative from other classes. In this work, we introduce a new time series primitive, the time series shapelet, which addresses these limitations. Informally, shapelets are time series subsequences that are in some sense maximally representative of a class. There are similar problems in other domains, such as distinguishing substring selection or probe design. However, time series shapelets cover a much wider class of problems than those.

MOTIVATING EXAMPLE. First, let's begin with a toy example on shapes, for visual clarity.

[Figure: stinging nettles and false nettles; Leaf Decision Tree with a yes/no split on shapelet I; Shapelet Dictionary (I, 5.1)] We are given two species of leaves. The first is the stinging nettle and the second is colloquially called the "false nettle", since it is very similar to the first and is easily confused with it. We want to build a classifier for these two species based on shape. First we convert the leaf outline to a one-dimensional time series representation. Because the global differences between the two types of shapes are subtle, we instead use a particularly distinguishing subsection. By "brushing" the subsection back onto the shape, we gain insight into the difference: the stinging nettle has a stem that connects to the leaf at an angle close to 90 degrees, while for the false nettle the stem connects to the leaf at a much wider angle. This distinguishing subsection is called a "shapelet". Now we build the classifier: an outline time series that has a subsection similar to this shapelet is classified as false nettle; otherwise it is classified as stinging nettle. So how can we extract these distinguishing shapelets?

BRUTE-FORCE ALGORITHM. We start with the simple brute-force algorithm for finding the shapelet.

Extract subsequences of all possible lengths. [Figure: candidates pool] The first step is to construct the pool of shapelet candidates. To exhaustively search all the possibilities, we extract subsequences of all possible lengths. Using a sliding window of a given size, we extract the subsequences of that length: we slide the window across each time series and extract all the subsequences. Then we proceed to a sliding window of a different size, and so on and so forth.
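The sliding-window extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `generate_candidates` and the parameters `min_len` and `max_len` are hypothetical, and time series are assumed to be plain Python lists of numbers.

```python
def generate_candidates(dataset, min_len, max_len):
    """Collect every subsequence whose length lies in [min_len, max_len].

    `dataset` is a list of time series (lists of numbers). The names and
    the length limits are assumptions made for this sketch.
    """
    candidates = []
    for series in dataset:
        longest = min(max_len, len(series))
        for length in range(min_len, longest + 1):        # one window size at a time
            for start in range(len(series) - length + 1): # slide the window
                candidates.append(series[start:start + length])
    return candidates


toy = [[1, 2, 3, 4], [5, 6, 7, 8]]
pool = generate_candidates(toy, min_len=2, max_len=4)
print(len(pool))  # 12: each series of length 4 yields 3 + 2 + 1 windows
```

The pool grows quadratically with the series length, which is what makes the exhaustive search expensive.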

Testing the utility of a candidate shapelet. Utility: information gain. Arrange the time series objects by their distance from the candidate; find the optimal split point (maximal information gain); pick the candidate achieving the best utility as the shapelet. [Figure: objects arranged on a line by distance to the candidate, with the split point marked] After we have all the candidates in the pool, we test and compare the utility of each candidate shapelet. Here we use information gain as the utility function: the higher, the better. First, we arrange the objects by their distance to the candidate. For visual illustration, the figure arranges the leaves on the real number line according to their distance to the candidate. Then we test every possible split point between two adjacent objects and find the optimal one, the one that best divides the leaves into the two original classes. Here, this is the optimal split point for the candidate; it is not perfect, but it is pretty good. After testing all the candidates, we pick the one achieving the best utility as the shapelet. Since the brute-force algorithm exhaustively checks every single candidate, it is guaranteed to find the best candidate as the shapelet. However, the brute-force method suffers from a serious problem: to get the distance from each object to the candidate, we need to find the best matching location for the candidate in the time series, which introduces a subsequence search for every distance calculation.
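As a rough sketch of this utility test, the snippet below computes the information gain of every split point on the distance line and returns the best one. The function names (`entropy`, `best_split`) are hypothetical, and placing the threshold at the midpoint between two adjacent objects is an assumption of this sketch, not something the slides specify.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_split(distances, labels):
    """Return (information_gain, threshold) of the best split point.

    `distances[i]` is the distance from object i to the candidate and
    `labels[i]` is its class. Thresholds are midpoints between adjacent
    distinct distances (an assumption of this sketch).
    """
    order = sorted(zip(distances, labels))
    sorted_labels = [lab for _, lab in order]
    n = len(order)
    base = entropy(sorted_labels)
    best_gain, best_threshold = 0.0, None
    for i in range(1, n):
        if order[i - 1][0] == order[i][0]:
            continue  # equal distances cannot be separated by a threshold
        left, right = sorted_labels[:i], sorted_labels[i:]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best_gain:
            best_gain = gain
            best_threshold = (order[i - 1][0] + order[i][0]) / 2
    return best_gain, best_threshold


gain, threshold = best_split([0.1, 0.2, 0.9, 1.0], ["false", "false", "stinging", "stinging"])
print(gain, threshold)
```

In the brute-force search, this function would be called once per candidate, after all distances to that candidate have been computed.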

Problem: the total number of candidates. For each candidate, we compute the distance between the candidate and each training sample. Trace dataset: 200 instances, each of length 275; 7,480,200 shapelet candidates; approximately three days. There is a huge number of candidates in the pool, even for a very small dataset. Take the well-known Trace dataset, for example: it contains 200 instances, each of length 275, yet checking every candidate takes the brute-force algorithm about three days. To extend the shapelet method to larger datasets, we apply some speedups to the brute-force algorithm.
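The candidate count can be checked with a little arithmetic. The slide does not state the shortest subsequence length considered; the sketch below assumes a minimum length of 3 and a maximum length of 275, under which the count matches the 7,480,200 figure quoted for the Trace dataset. The function name `count_candidates` is hypothetical.

```python
def count_candidates(num_series, series_len, min_len, max_len):
    """Number of sliding-window subsequences over all lengths and all series."""
    per_series = sum(series_len - length + 1 for length in range(min_len, max_len + 1))
    return num_series * per_series


# Trace dataset: 200 instances of length 275.
# min_len=3 is an assumption; it is the value that reproduces the slide's number.
print(count_candidates(200, 275, min_len=3, max_len=275))  # 7480200
```

Each of these candidates then needs a distance to every one of the 200 training series, and each distance is itself a subsequence search, which is consistent with the multi-day running time reported above.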

Speedup. Distance calculations from time series objects to shapelet candidates are the most expensive part. We reduce the time in two ways. Distance Early Abandon: reduces the distance computation time between two time series. Admissible Entropy Pruning: reduces the number of distance calculations. We realized that the most expensive calculation in the brute-force method is the distance calculation. The two speedups reduce the distance calculation time in two respects. One is distance early abandon; it is a known idea that has been applied successfully in previous research, and it works particularly efficiently in our case. The other is our novel idea, called admissible entropy pruning.

DISTANCE EARLY ABANDON

[Figure: time series T and candidate subsequence S]

[Figure: best matching location of S in T, Dist = 0.4]

[Figure: at another location of S in T, the calculation is abandoned at the point where Dist > 0.4]

Distance Early Abandon. We only need the minimum Dist. Method: keep the best-so-far distance, and abandon the calculation as soon as the current distance is larger than the best so far. Consider the subsequences of T in random order, to reduce the best-so-far as much as possible in the early stage.
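A minimal sketch of early abandoning follows. It assumes squared Euclidean distance between the candidate and each window of the time series (the slides only say "Dist"), and the function name `subsequence_distance` and the optional `rng` argument are hypothetical.

```python
import random


def subsequence_distance(series, shapelet, rng=None):
    """Minimum squared Euclidean distance from `shapelet` to any window of `series`.

    The running sum for a window is abandoned as soon as it exceeds the
    best-so-far distance. Passing a `random.Random` instance as `rng`
    visits the windows in random order, as suggested on the slide.
    """
    m = len(shapelet)
    starts = list(range(len(series) - m + 1))
    if rng is not None:
        rng.shuffle(starts)
    best = float("inf")
    for start in starts:
        total = 0.0
        for i in range(m):
            diff = series[start + i] - shapelet[i]
            total += diff * diff
            if total > best:
                break          # early abandon: this window cannot be the minimum
        else:
            best = total       # loop finished without abandoning
    return best


T = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
S = [1.0, 2.0, 1.0]
print(subsequence_distance(T, S, rng=random.Random(0)))  # 0.0, S occurs exactly in T
```

Early abandoning never changes the returned minimum; it only skips additions that could not have lowered it.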

ADMISSIBLE ENTROPY PRUNING. We now explain how this idea works.

Admissible Entropy Pruning. We only need the best shapelet for each class. For a candidate shapelet, we do not need to calculate the distance to every training sample: after calculating the distances to some training samples, if the upper bound of the candidate's information gain is lower than the information gain of the best candidate shapelet so far, we stop the calculation and try the next candidate. As we mentioned before, we use information gain as the utility function because it is the traditional evaluation measure in decision trees and generalizes easily to the multi-class problem. What's more, we can exploit it to greatly reduce the number of distance calculations from objects to candidates.

[Figure: stinging nettles and false nettles arranged on a line by distance to the candidate] Suppose now we are considering a candidate. The distances of the first five time series objects to the candidate have been calculated, and their corresponding positions in a one-dimensional representation are shown in the figure.

[Figure: best-so-far candidate, I = 0.42; upper bound for the current candidate, I = 0.29] Before continuing to calculate the remaining distances, we can first compute an upper bound on the information gain. The best case is that the remaining leaves of the two species end up far away from each other, so we place them at the two ends of the line. In this case, the upper bound of the information gain is 0.29. Suppose we have a previous best-so-far candidate with an information gain of 0.42. The upper bound for the current candidate is lower than the best-so-far. Therefore, at this point, we can stop the distance calculations for the remaining objects and prune this candidate from consideration. So our speedup method still considers every single candidate, as the brute-force method does, but it avoids as many unnecessary distance calculations as possible.
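The optimistic bound described above can be sketched for the two-class case. This is an illustration under stated assumptions, not the authors' implementation: the names `best_gain` and `optimistic_gain` are hypothetical, the not-yet-measured objects of one class are placed at a virtual position left of all known distances and those of the other class at a position right of them, and both arrangements are tried.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_gain(distances, labels):
    """Best information gain over all split points on the distance line."""
    order = sorted(zip(distances, labels))
    sorted_labels = [lab for _, lab in order]
    n = len(order)
    base = entropy(sorted_labels)
    best = 0.0
    for i in range(1, n):
        if order[i - 1][0] == order[i][0]:
            continue
        left, right = sorted_labels[:i], sorted_labels[i:]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        best = max(best, gain)
    return best


def optimistic_gain(known_distances, known_labels, remaining):
    """Optimistic information gain for a partially evaluated candidate.

    `remaining` maps each of the two class labels to the number of objects
    whose distance has not been computed yet (two classes only).
    """
    class_a, class_b = sorted(remaining)
    left_end = -1.0                          # virtual position left of every real distance
    right_end = max(known_distances) + 1.0   # virtual position right of every real distance
    bound = 0.0
    for near, far in ((class_a, class_b), (class_b, class_a)):
        distances = list(known_distances) + [left_end] * remaining[near] + [right_end] * remaining[far]
        labels = list(known_labels) + [near] * remaining[near] + [far] * remaining[far]
        bound = max(bound, best_gain(distances, labels))
    return bound


best_so_far = 0.42  # information gain of the best candidate so far, as in the slide
bound = optimistic_gain([1.0, 2.0, 3.0, 4.0], ["a", "b", "a", "b"], {"a": 0, "b": 0})
if bound < best_so_far:
    print("prune this candidate")
```

A search loop would call `optimistic_gain` after each new distance and stop evaluating the candidate as soon as the bound falls below the best-so-far gain.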

Classification. [Figure: stinging nettles and false nettles; Leaf Decision Tree with a yes/no split on shapelet I; Shapelet Dictionary (I, 5.1)] After we find the shapelet, we can frame it as a decision tree. At each step of the decision tree induction, we determine a shapelet. In the classification stage, we compare the time series with the shapelet at each step. Although finding the shapelets takes a long time in training, classification is very fast, since the number of shapelets a time series is compared to is at most the height of the decision tree.
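To make the classification step concrete, here is a small sketch of walking a shapelet decision tree. The tree layout (a dictionary with `shapelet`, `threshold`, `near` and `far` keys, with class names as leaves) and the shapelet values are hypothetical; the 5.1 threshold is borrowed from the leaf example purely for illustration, and squared Euclidean distance is an assumption.

```python
def min_distance(series, shapelet):
    """Minimum squared Euclidean distance from `shapelet` to any window of `series`."""
    m = len(shapelet)
    return min(
        sum((series[start + i] - shapelet[i]) ** 2 for i in range(m))
        for start in range(len(series) - m + 1)
    )


def classify(series, node):
    """Follow the tree from the root; each internal node costs one shapelet comparison."""
    while isinstance(node, dict):
        distance = min_distance(series, node["shapelet"])
        node = node["near"] if distance <= node["threshold"] else node["far"]
    return node  # a leaf is simply the class label


# Hypothetical one-node tree mirroring the leaf example.
leaf_tree = {
    "shapelet": [0.0, 3.0, 0.0],   # made-up values, not the shapelet from the paper
    "threshold": 5.1,
    "near": "false nettle",
    "far": "stinging nettle",
}

print(classify([0.0, 0.0, 3.0, 0.0, 0.0], leaf_tree))  # false nettle
print(classify([9.0, 9.0, 9.0, 9.0], leaf_tree))       # stinging nettle
```

A multi-class tree such as the wheat example would simply nest further dictionaries under `near` or `far`.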

EXPERIMENTAL EVALUATION. Now I will show you several interesting examples and case studies.

Performance Comparison. Original Lightning Dataset: length 2000; training 2000; testing 18000. Let's start with a large dataset, the Lightning dataset. In the original dataset, the length of each time series object is 2000, and there are separate training and testing sets. The size of the training set is 2000, and the size of the testing set is 18000. One sample from each class is plotted in the lower-right figure. In our experiment, we use different fractions of the training set and calculate the accuracy on the original testing set. The left figure shows the performance comparison: the y-axis indicates the computing time and the x-axis the size of the training set. As you can see, our pruning method works far more efficiently than the brute-force one. When the training set has 160 objects, the brute-force method takes 5 days to complete the learning procedure, while our method takes only 2 hours. The right figure shows the testing accuracy: when only 10 objects (out of the original 2,000) are examined, the decision tree is slightly worse than the best known result on this dataset, but after examining just 1% of the training data, it is significantly more accurate.

Projectile Points. Projectile point classification is an important and interesting topic in anthropology. The diversity of projectile points, and the fact that many of them are broken, make it difficult to classify them correctly. In this example, we use shapelets to classify three types of projectile points by their shapes.

[Figure: Arrowhead Decision Tree; Shapelet Dictionary: I (Clovis), 11.24; II (Avonlea), 85.47] We get a shapelet at each node of the decision tree. By looking at the corresponding portion of the shape, we found some interesting explanations of the classifier. Clovis is distinguished from the others by an unnotched hafting area near the bottom, connected by a deep concave bottom end. After Clovis is distinguished, Avonlea is differentiated from the mixed class by a small notched hafting area connected by a shallow concave bottom end. Our method also gives better accuracy in much less time than the rotation invariant nearest neighbor method. Results (method: accuracy, time): Shapelet: 0.80, 0.33; Rotation Invariant Nearest Neighbor: 0.68, 1013.

Wheat Spectrography. [Figure: one sample from each class] Wheat Dataset: length 1050; training 49; testing 276. Now let's see a multi-class example, the wheat spectrography problem. The wheat samples were grown in Canada between 1998 and 2005, and the data is labeled by the year in which the wheat was grown. In the figure, we show one sample from each class. The objects are separated along the y-axis for visual clarity. As we can see, the global difference is subtle. Thus, we need to rely on the local differences.

[Figure: Wheat Decision Tree with nodes 1 to 6; Shapelet Dictionary with shapelets I to VI] Using the shapelet method, we built the decision tree, found shapelets that represent the local differences, and obtained better accuracy than the nearest neighbor method. So our method works well on the multi-class problem. Results (method: accuracy, time): Shapelet: 0.720, 0.86; Nearest Neighbor: 0.543, 0.65.

Coffee. [Figure: Coffee Decision Tree; Shapelet Dictionary: I (Robusta), 11.14; spectra annotated with chlorogenic acid and caffeine] Results (method: accuracy): Shapelet (28/28): 1.0; Nearest Neighbor (leave-one-out): 0.98.

The Gun/NoGun Problem. [Figure: Gun and No Gun samples; Gun Decision Tree; Shapelet Dictionary: I (No Gun), 38.94] Let's consider the well-studied Gun/NoGun problem. In our experiment, we use the same training/testing split as the previous study. The shapelet indicates a phenomenon known as "overshoot". We can see that the NoGun class has a "dip" where the actor puts her hand down by her side. That is because inertia carries her hand a little too far and she is forced to correct for it. We also achieve better accuracy with less classification time in this well-studied case. Results (method: accuracy, time): Shapelet: 0.933, 0.016; Rotation Invariant Nearest Neighbor: 0.913, 0.064.

Gait Analysis. Gait Dataset: length varies; training 18; testing 143. In all the experiments above, we only considered one-dimensional time series. In the gait analysis example, we extend the use of shapelets to multivariate time series. Here, we want to distinguish the normal walk, in the left video, from the abnormal walk on the right. We record the time series for both feet. The videos are taken from different actors, with different walking styles and different numbers of steps.

Reduces the sensitivity to alignment. [Figure: Walk Decision Tree with shapelet I (Normal Walk), 144.075, over the left toe and right toe series; accuracy comparison with values 0.909, 0.902, 0.860 and 0.535] The resulting shapelet on the left represents exactly one walking cycle. The right figure shows the accuracy comparison between two methods: shapelet and rotation invariant nearest neighbor. We have two types of data in the experiment: one is the raw data extracted from the video, and the other is the data after careful alignment and segmentation. Although the nearest neighbor method does not lose much on the well-segmented data, its accuracy suffers greatly on the raw data. So the shapelet method reduces the sensitivity to alignment and thus alleviates the human labor needed for preprocessing the data. Time for classification on 143 testing objects: Shapelet: 0.060 s; Rotation Invariant Nearest Neighbor: 2.684 s.

Conclusions. Interpretable results; more accurate and robust; significantly faster at classification. As we have seen, shapelets provide interpretable results, which may help domain practitioners better understand their data. Shapelets generate more accurate results, and they are more robust to noise and misalignment. What's more, shapelets are significantly faster at classification than existing state-of-the-art approaches.

Discussions - Comparison. Hong Cheng, Xifeng Yan, Jiawei Han, and Chih-Wei Hsu, "Discriminative Frequent Pattern Analysis for Effective Classification" (ICDE'07). Hong Cheng, Xifeng Yan, Jiawei Han, and Philip S. Yu, "Direct Discriminative Pattern Mining for Effective Classification" (ICDE'08). Similarities: motivation: discriminative frequent pattern = shapelet; technique: use an upper bound of information gain to speed up. Differences: application: general feature selection vs. time series (no explicit features); split node: binary (contains / does not contain a pattern) vs. numeric value (smaller / larger than a value).

Discussions – other topics. Similar ideas could be applied to other research topics: graph, image, spatio-temporal, social network, ...

Discussions – other topics. Graph classification: Xifeng Yan, Hong Cheng, Jiawei Han, and Philip S. Yu, "Mining Significant Graph Patterns by Scalable Leap Search", Proc. 2008 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'08), Vancouver, BC, Canada, June 2008.

Discussions – other topics. Moving object classification: discriminative sub-movement.

Discussions – other topics. Social network: classify normal/spamming users.

Discussions – other topics. Social network: classify normal/spamming users. How do we find discriminative features on a social network? Social network structure; user behaviour.

Discussions – other topics. For different applications, this idea could be adapted to improve performance, but the adaptation is not straightforward.

Thank You. Questions? That concludes my talk, and I am happy to answer any questions.