Time Series Filtering


Time Series Filtering
Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any of the candidates in C.
(Figure: a long time series, the candidate set, and the matching subsequences.)

Filtering vs. Querying
(Figure: querying matches a single query (template) against a database and returns the best match; filtering matches a database against a set of queries and returns all matches.)

Euclidean Distance Metric
Given two time series Q = q_1, …, q_n and C = c_1, …, c_n, their Euclidean distance is defined as:
D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}
(Figure: Q and C plotted together, aligned point by point.)
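
A minimal Python sketch of this definition (the function name and example values are mine, not from the slides):

```python
import math

def euclidean_distance(q, c):
    """Euclidean distance between two equal-length sequences Q and C."""
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))

# Example with two short series
print(euclidean_distance([1.0, 2.0, 3.0], [1.5, 2.5, 2.0]))  # ~1.2247
```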

Early Abandon
During the computation, if the current sum of squared differences between corresponding data points exceeds r², we can safely stop the calculation.
(Figure: the calculation on Q and C is abandoned partway through the series.)
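
A sketch of the idea, using the same conventions as above; returning None is my own way of signalling an abandoned computation:

```python
import math

def early_abandon_distance(q, c, r):
    """Euclidean distance between q and c, abandoning as soon as the running
    sum of squared differences exceeds r**2 (the result cannot be <= r)."""
    threshold = r * r
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > threshold:
            return None  # distance is certainly greater than r; abandon
    return math.sqrt(total)
```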

How is DTW Calculated?
Every possible warping between two time series is a path through the matrix; we want the best one. This recursive function gives us the minimum-cost warping path w:
\gamma(i, j) = d(q_i, c_j) + \min\{\gamma(i-1, j-1),\; \gamma(i-1, j),\; \gamma(i, j-1)\}
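
A dynamic-programming sketch of this recursion. It uses squared differences as the local cost d and takes a square root at the end, which is one common convention; other formulations differ in these details:

```python
import math

def dtw_distance(q, c):
    """DTW distance via the recursion
    gamma(i,j) = d(q_i, c_j) + min(gamma(i-1,j-1), gamma(i-1,j), gamma(i,j-1))."""
    n, m = len(q), len(c)
    INF = float("inf")
    gamma = [[INF] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2          # local cost d(q_i, c_j)
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return math.sqrt(gamma[n][m])
```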

Classic Approach
Individually compare each candidate sequence to the query using the early abandoning algorithm.
(Figure: the candidates are compared to the time series one at a time.)
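
A sketch of this baseline, assuming the subsequences have already been extracted (for example by a sliding window) and reusing an early-abandoning distance such as the one sketched above; all names are mine:

```python
def classic_filter(subsequences, candidates, r, distance_fn):
    """For every subsequence and every candidate, run an early-abandoning
    distance computation; keep the pairs that are within r."""
    matches = []
    for s_idx, s in enumerate(subsequences):
        for c_idx, c in enumerate(candidates):
            d = distance_fn(s, c, r)   # e.g. early_abandon_distance above
            if d is not None:
                matches.append((s_idx, c_idx, d))
    return matches
```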

Euclidean Distance Lower Bound
Having candidate sequences C_1, …, C_k, we can form two new sequences U and L:
U_i = max(C_{1,i}, …, C_{k,i})
L_i = min(C_{1,i}, …, C_{k,i})
They form the smallest possible bounding envelope that encloses sequences C_1, …, C_k. We call the combination of U and L a wedge, and denote it as W = {U, L}.
A lower bounding measure for the Euclidean distance between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W is:
LB(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}
(Figure: candidates C1 and C2, the envelope U and L forming the wedge W, and a query Q.)
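
A sketch of building a wedge and computing this lower bound in Python (function names are mine):

```python
import math

def make_wedge(candidates):
    """Wedge W = {U, L}: pointwise maximum and minimum over all candidates."""
    U = [max(vals) for vals in zip(*candidates)]
    L = [min(vals) for vals in zip(*candidates)]
    return U, L

def lb_wedge_euclidean(q, U, L):
    """Lower bound on the Euclidean distance from q to every candidate in the
    wedge: only the parts of q that fall outside the envelope contribute."""
    total = 0.0
    for qi, ui, li in zip(q, U, L):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
    return math.sqrt(total)
```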

DTW Distance Lower Bound
Based on the wedge W and the allowed warping range R, we define two new sequences, DTW_U and DTW_L:
DTW_U_i = max(U_{i-R} : U_{i+R})
DTW_L_i = min(L_{i-R} : L_{i+R})
They form an additional envelope above and below the wedge, as illustrated in the figure. We can now define a lower bounding measure for the DTW distance between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:
LB(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - DTW\_U_i)^2 & \text{if } q_i > DTW\_U_i \\ (q_i - DTW\_L_i)^2 & \text{if } q_i < DTW\_L_i \\ 0 & \text{otherwise} \end{cases}}
(Figure: the wedge W = {U, L} and the wider DTW_U and DTW_L envelope around a query Q.)
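
A sketch of widening the wedge by the warping range (the function name is mine); the lower bound itself is then computed exactly as in the Euclidean case, but against DTW_U and DTW_L:

```python
def dtw_envelope(U, L, R):
    """Widen a wedge (U, L) by the warping range R:
    DTW_U[i] = max(U[i-R : i+R]), DTW_L[i] = min(L[i-R : i+R])."""
    n = len(U)
    dtw_U = [max(U[max(0, i - R): min(n, i + R + 1)]) for i in range(n)]
    dtw_L = [min(L[max(0, i - R): min(n, i + R + 1)]) for i in range(n)]
    return dtw_U, dtw_L
```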

Generalized Wedge
Use W(1,2) to denote that a wedge is built from sequences C_1 and C_2. Wedges can be hierarchically nested; for example, W((1,2),3) consists of W(1,2) and C_3.
(Figure: C1, C2, C3 and the nested wedges W(1,2) and W((1,2),3).)
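
A sketch of the merge operation, treating a single candidate as a degenerate wedge (names are mine):

```python
def merge_wedges(wedge_a, wedge_b):
    """Merge two wedges into their parent wedge, e.g. W(1,2) + C3 -> W((1,2),3).
    A single candidate C is treated as the degenerate wedge (C, C)."""
    (ua, la), (ub, lb) = wedge_a, wedge_b
    U = [max(x, y) for x, y in zip(ua, ub)]
    L = [min(x, y) for x, y in zip(la, lb)]
    return U, L

# W_12  = merge_wedges((C1, C1), (C2, C2))
# W_123 = merge_wedges(W_12, (C3, C3))
```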

H-Merge
Compare the query to the wedge using LB_Keogh.
If the LB_Keogh function early abandons, we are done.
Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
(Figure: the candidate set and its wedge.)
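
A sketch of one such comparison; lb_fn and distance_fn stand in for the wedge lower bound and the early-abandoning distance sketched earlier (parameter names are mine):

```python
def h_merge_step(q, wedge, candidates, r, lb_fn, distance_fn):
    """One H-Merge comparison: if the wedge lower bound already exceeds r,
    every candidate in the wedge is pruned at once; otherwise fall back to
    early-abandoning comparisons with the individual candidates."""
    U, L = wedge
    if lb_fn(q, U, L) > r:
        return []                      # whole wedge pruned
    hits = []
    for idx, c in enumerate(candidates):
        d = distance_fn(q, c, r)       # e.g. early_abandon_distance
        if d is not None:              # candidate is within r of q
            hits.append((idx, d))
    return hits
```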

Hierarchical Clustering
(Figure: a dendrogram over the candidates C1, …, C5; successive merges produce W(2,5), W(1,4), W((2,5),3), and W(((2,5),3),(1,4)), giving one wedge set for each level K = 5, 4, 3, 2, 1.)
Which wedge set to choose?

Which Wedge Set to Choose?
Test all K wedge sets on a representative sample of data, and choose the wedge set that performs best.
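
A minimal sketch of that selection step; count_steps is an assumed callback that measures how much work one wedge set needs for one sample query:

```python
def choose_wedge_set(wedge_sets, sample_queries, count_steps):
    """Evaluate every candidate wedge set (one per level K of the hierarchical
    clustering) on a representative sample and keep the cheapest one."""
    best_set, best_cost = None, float("inf")
    for wedge_set in wedge_sets:
        cost = sum(count_steps(q, wedge_set) for q in sample_queries)
        if cost < best_cost:
            best_set, best_cost = wedge_set, cost
    return best_set
```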

Upper Bound on H-Merge: Worst Case
The wedge-based approach is efficient when comparing a set of time series to a large batch dataset. But what about streaming time series? Streaming algorithms are limited by their worst case; being efficient on average does not help.
(Figure: the worst case for a subsequence compared against C1, C2, C3, W(1,2), and W((1,2),3).)

Triangle Inequality
If dist(W((2,5),3), W(1,4)) >= 2r, then a subsequence that is within r of one of the two wedges cannot also be within r of the other, so the second comparison can be skipped.
(Figure: the wedge hierarchy for K = 5, …, 1; a subsequence cannot fail on both wedges.)
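
A rough sketch of this pruning rule as I read it from the slide; wedge_dist (a precomputed matrix of pairwise wedge distances) and dist_to_wedge are assumed helpers, and the rule relies on the distance behaving like a metric:

```python
def filter_with_triangle_inequality(q, wedges, wedge_dist, dist_to_wedge, r):
    """If dist(Wa, Wb) >= 2r and dist(q, Wa) < r, the triangle inequality
    gives dist(q, Wb) > r, so wedge Wb can be skipped for this subsequence."""
    matched, skipped = [], set()
    for i, w in enumerate(wedges):
        if i in skipped:
            continue
        if dist_to_wedge(q, w) < r:
            matched.append(i)
            for j in range(len(wedges)):
                if j != i and wedge_dist[i][j] >= 2 * r:
                    skipped.add(j)     # q cannot also be within r of wedge j
    return matched
```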

Euclidean Distance: ECG Dataset
Batch time series: 650,000 data points (half an hour's ECG signals)
Candidate set: 200 time series of length 40
r = 0.5

Algorithm     Number of Steps
brute force   5,199,688,000
classic       210,190,006
H-Merge       8,853,008
H-Merge-R     29,480,264

Euclidean Distance: Stock Dataset
Batch time series: 2,119,415 data points
Candidate set: 337 time series of length 128
r = 4.3

Algorithm     Number of Steps
brute force   91,417,607,168
classic       13,028,000,000
H-Merge       3,204,100,000
H-Merge-R     10,064,000,000

Euclidean Distance: Audio Dataset
Batch time series: 46,143,488 data points (one hour of sound)
Candidate set: 68 time series of length 101
r = 4.14
Sliding window: 11,025 (1 second); step: 5,512 (0.5 second)

Algorithm     Number of Steps
brute force   57,485,160
classic       1,844,997
H-Merge       1,144,778
H-Merge-R     2,655,816

DTW Distance: Gun Dataset
Batch time series: 18,750 data points
Candidate set: 80 time series of length 150
r = 1.23, offset = 15, warping window size = 3%

Class          Interesting Segments   Hits (Euclidean)   Hits (DTW)
Female-Gun     37                     35                 37
Female-Point   30                     22                 30
Male-Gun       30                     13                 19
Male-Point     28                     14                 17
Total          125                    84                 103
Accuracy                              67.2%              82.4%

DTW Distance: ECG Dataset
Batch time series: 200,000 data points
Candidate set: 200 time series of length 40
r = 0.5, offset = 20, warping window size = 3%

Class   Interesting Segments   Hits (Euclidean)   Hits (DTW)
A       107                    104                107
E       105                    67                 101
L       74                     62                 72
R       86                     79                 84
Total   373                    312                364
Accuracy                       83.87%             97.85%

Speedup by Sorting Wedge
Random walk time series of length 1,000

Sorted     95,025      151,723     345,226     778,367
Unsorted   1,906,244   2,174,994   2,699,885   3,286,213

Semi-Supervised Time Series Classification
(Figure: precision-recall breakeven point versus the number of examples in P.)

ECG Dataset           Training Set   Testing Set   Total
Positive (Abnormal)   208            312           520
Negative (Normal)     602            904           1,506
Total                 810            1,216         2,026