Time Series Filtering
Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any candidate in C.
[Figure: a time series, a candidate set (candidates 1 to 12), and a match labeled Q11.]
Filtering vs. Querying
[Figure: two panels. Querying: one query (template) is compared against a database and the best match is returned. Filtering: a set of queries is compared against a database and every match is reported, e.g. Q11.]
Euclidean Distance Metric
Given two time series Q = q1…qn and C = c1…cn, their Euclidean distance is defined as:
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
[Figure: Q and C plotted over 100 data points.]
Early Abandon
During the computation, if the running sum of the squared differences between corresponding data points exceeds r^2, we can safely stop the calculation.
[Figure: Q and C plotted over 100 data points, with the point where the calculation is abandoned marked.]
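The early-abandon rule can be sketched in a few lines of Python. This is an illustrative sketch, not code from the original slides: the function name `early_abandon_sq_dist` and the convention of returning `None` on abandonment are assumptions made for the example.

```python
def early_abandon_sq_dist(q, c, r):
    """Squared Euclidean distance between q and c, or None if it exceeds r**2.

    The running sum is checked after every point, so the loop stops as soon
    as the pair can no longer be within distance r.
    """
    if len(q) != len(c):
        raise ValueError("sequences must have the same length")
    limit = r * r
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > limit:
            return None  # abandoned: distance is already greater than r
    return total


print(early_abandon_sq_dist([0, 0, 0], [1, 1, 1], 2.0))  # completes the sum
print(early_abandon_sq_dist([0, 0, 0], [3, 0, 0], 2.0))  # abandons at the first point
```

Comparing squared values avoids taking a square root for every candidate.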
How is DTW Calculated?
Every possible warping between two time series is a path through the matrix. We want the best one. This recursive function gives us the minimum-cost path:
$$\gamma(i, j) = d(q_i, c_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$
[Figure: warping matrix for Q and C with the warping path w.]
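The recursion can be filled in with dynamic programming. The sketch below is illustrative only: it assumes d(qi, cj) is the squared difference and takes a square root at the end, and it applies no warping-window constraint. The name `dtw_distance` is not from the original.

```python
import math


def dtw_distance(q, c):
    """Unconstrained DTW distance via the cumulative-cost recursion."""
    n, m = len(q), len(c)
    # gamma[i][j] holds the minimum cumulative cost of aligning q[:i] with c[:j].
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2  # assumed local cost
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return math.sqrt(gamma[n][m])


# A repeated point in c is absorbed by warping, so the distance is zero.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```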
Classic Approach
Individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: a time series and a candidate set (candidates 1 to 12).]
Euclidean Distance Lower Bound
Given candidate sequences C1, .., Ck, we can form two new sequences U and L:
Ui = max(C1i, .., Cki)
Li = min(C1i, .., Cki)
They form the smallest possible bounding envelope that encloses sequences C1, .., Ck. We call the combination of U and L a wedge, denoted W = {U, L}.
A lower bounding measure for the Euclidean distance between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:
$$\mathrm{LB\_Keogh}(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$
[Figure: candidates C1 and C2, the envelope U and L forming the wedge W, and a query Q compared against W.]
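A minimal Python sketch of the wedge and its lower bound follows. It assumes the usual LB_Keogh form, in which only the parts of Q that fall outside the envelope contribute. The names `build_wedge` and `lb_keogh_sq` are illustrative, and the function returns the squared bound so that it can be compared directly with r^2.

```python
def build_wedge(candidates):
    """Pointwise max (U) and min (L) over equal-length candidate sequences."""
    upper = [max(vals) for vals in zip(*candidates)]
    lower = [min(vals) for vals in zip(*candidates)]
    return upper, lower


def lb_keogh_sq(q, upper, lower, r=None):
    """Squared lower bound between q and every sequence enclosed by (upper, lower).

    If r is given, the sum is abandoned early and None is returned once it
    exceeds r**2, meaning no enclosed candidate can be within r of q.
    """
    limit = None if r is None else r * r
    total = 0.0
    for qi, ui, li in zip(q, upper, lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if limit is not None and total > limit:
            return None
    return total


U, L = build_wedge([[1, 2, 3], [2, 1, 4]])
print(U, L)
print(lb_keogh_sq([0, 1.5, 6], U, L))
```

In this toy example the bound does not exceed the squared distance from the query to either candidate, which is the property that makes pruning safe.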
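DTW Distance Lower Bound
Based on the wedge W and the allowed warping range R, we define two new sequences, DTW_U and DTW_L:
DTW_Ui = max(Ui-R : Ui+R)
DTW_Li = min(Li-R : Li+R)
They form an additional envelope above and below the wedge, as illustrated in the figure. We can now define a lower bounding measure for the DTW distance between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:
$$\mathrm{LB\_Keogh}_{DTW}(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - DTW\_U_i)^2 & \text{if } q_i > DTW\_U_i \\ (q_i - DTW\_L_i)^2 & \text{if } q_i < DTW\_L_i \\ 0 & \text{otherwise} \end{cases}}$$
[Figure: wedge W with U and L, the outer envelope DTW_U and DTW_L, and a query Q.]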
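The widened envelope can be sketched as below. This is an illustration under stated assumptions: R is taken to be a number of data points (not a percentage), windows are clipped at the ends of the sequence, and the bound has the same form as the Euclidean wedge bound but applied to DTW_U and DTW_L. The names `dtw_envelope` and `lb_keogh_dtw_sq` are hypothetical.

```python
def dtw_envelope(upper, lower, R):
    """Widen a wedge (upper, lower) by a warping range of R points."""
    n = len(upper)
    dtw_u = [max(upper[max(0, i - R): i + R + 1]) for i in range(n)]
    dtw_l = [min(lower[max(0, i - R): i + R + 1]) for i in range(n)]
    return dtw_u, dtw_l


def lb_keogh_dtw_sq(q, upper, lower, R):
    """Squared lower bound on the DTW distance from q to the wedge contents."""
    dtw_u, dtw_l = dtw_envelope(upper, lower, R)
    total = 0.0
    for qi, ui, li in zip(q, dtw_u, dtw_l):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
    return total


print(dtw_envelope([1, 3, 2, 5], [0, 1, 1, 2], 1))
print(lb_keogh_dtw_sq([4, 3, 0, 5], [1, 3, 2, 5], [0, 1, 1, 2], 1))
```

With R = 0 the envelope is the wedge itself, so the bound reduces to the Euclidean case. A larger R gives a looser (smaller) bound.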
Generalized Wedge
Use W(1,2) to denote a wedge built from sequences C1 and C2. Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3.
[Figure: C1 (or W1), C2 (or W2), and C3 (or W3), merged into W(1,2) and then W((1,2),3).]
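Nesting can be illustrated by treating a single sequence as a degenerate wedge whose upper and lower lines coincide, so that merging two wedges is a pointwise max and min. The helper names `as_wedge` and `merge_wedges` are assumptions made for this sketch.

```python
def as_wedge(seq):
    """A single sequence Ci viewed as the wedge Wi, with U = L = Ci."""
    return list(seq), list(seq)


def merge_wedges(wa, wb):
    """Smallest wedge enclosing both wa and wb, each given as (upper, lower)."""
    (ua, la), (ub, lb) = wa, wb
    upper = [max(x, y) for x, y in zip(ua, ub)]
    lower = [min(x, y) for x, y in zip(la, lb)]
    return upper, lower


c1, c2, c3 = [1, 2, 3], [2, 1, 4], [0, 5, 3]
w12 = merge_wedges(as_wedge(c1), as_wedge(c2))   # W(1,2)
w12_3 = merge_wedges(w12, as_wedge(c3))          # W((1,2),3)
print(w12)
print(w12_3)
```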
H-Merge
Compare the query to the wedge using LB_Keogh.
If the LB_Keogh function early abandons, we are done.
Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: a time series and a candidate set (candidates 1 to 12).]
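The steps above can be sketched as a search over a tree of nested wedges. This is a hedged illustration, not the authors' implementation: the recursive descent into child wedges extends the two-level description on the slide, and `WedgeNode`, `leaf`, `merge`, and `h_merge` are hypothetical names. At a leaf the wedge collapses to a single candidate, so the bound equals the Euclidean distance.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WedgeNode:
    upper: List[float]
    lower: List[float]
    children: List["WedgeNode"] = field(default_factory=list)
    label: Optional[int] = None  # candidate id, set only on leaves


def leaf(label, seq):
    return WedgeNode(list(seq), list(seq), [], label)


def merge(a, b):
    upper = [max(x, y) for x, y in zip(a.upper, b.upper)]
    lower = [min(x, y) for x, y in zip(a.lower, b.lower)]
    return WedgeNode(upper, lower, [a, b])


def survives(q, node, r):
    """False as soon as the running lower bound exceeds r**2 (early abandon)."""
    limit, total = r * r, 0.0
    for qi, ui, li in zip(q, node.upper, node.lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > limit:
            return False
    return True


def h_merge(q, root, r):
    """Labels of all candidates within distance r of q."""
    matches, stack = [], [root]
    while stack:
        node = stack.pop()
        if not survives(q, node, r):
            continue  # the whole wedge is pruned at once
        if node.label is not None:
            matches.append(node.label)
        else:
            stack.extend(node.children)
    return sorted(matches)


root = merge(merge(leaf(1, [1, 2, 3]), leaf(2, [1.1, 2.1, 3.1])),
             leaf(3, [10, 10, 10]))
print(h_merge([1, 2, 3], root, 0.5))
print(h_merge([20, 20, 20], root, 0.5))
```

The second query is rejected after a single comparison against the root wedge, which is where the savings over the classic approach come from.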
Hierarchical Clustering
[Figure: dendrogram over C1 (or W1) to C5 (or W5), giving one wedge set per level.]
K = 5: W1, W2, W3, W4, W5
K = 4: W(2,5), W1, W3, W4
K = 3: W(2,5), W(1,4), W3
K = 2: W((2,5),3), W(1,4)
K = 1: W(((2,5),3), (1,4))
Which wedge set to choose?
Which Wedge Set to Choose?
Test all k wedge sets on a representative sample of data.
Choose the wedge set that performs best.
Upper Bound on H-Merge
The wedge-based approach seems to be efficient when comparing a set of time series to a large batch dataset. But what about streaming time series? Streaming algorithms are limited by their worst case. Being efficient on average does not help.
[Figure: worst case, in which a subsequence is compared against W((1,2),3), then W(1,2), then C1 (or W1), C2 (or W2), and C3 (or W3).]
Triangle Inequality
If dist(W((2,5),3), W(1,4)) >= 2r, then a subsequence that is less than r from W((2,5),3) must be at least r from W(1,4), so it cannot fail on both wedges.
[Figure: the wedge sets for K = 5 down to K = 1, with a subsequence that fails on W((2,5),3) (distance < r) while dist(W((2,5),3), W(1,4)) >= 2r.]
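The argument can be checked numerically. The sketch below uses plain sequences A and B in place of wedges, since the slides do not show how the distance between two wedges is computed. That substitution, and the names `euclid` and `excluded_by_triangle`, are assumptions made for illustration.

```python
import math


def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def excluded_by_triangle(dist_s_a, dist_a_b, r):
    """True if the triangle inequality proves dist(S, B) > r.

    dist(A, B) <= dist(A, S) + dist(S, B), so dist(S, B) >= dist(A, B) - dist(S, A).
    """
    return dist_a_b - dist_s_a > r


a, b, s, r = [0, 0], [6, 8], [0, 3], 4.0
# dist(A, B) = 10 >= 2r = 8, and S is within r of A ...
print(euclid(s, a), euclid(a, b))
# ... so S cannot also be within r of B.
print(excluded_by_triangle(euclid(s, a), euclid(a, b), r), euclid(s, b) > r)
```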
Euclidean Distance: ECG Dataset
Batch time series: 650,000 data points (half an hour’s ECG signals)
Candidate set: 200 time series of length 40
r = 0.5

Algorithm      Number of Steps
brute force    5,199,688,000
classic        210,190,006
H-Merge        8,853,008
H-Merge-R      29,480,264

[Figure: bar chart of number of steps (x 10^9) for brute force, classic, H-Merge, and H-Merge-R.]
Euclidean Distance: Stock Dataset
Batch time series: 2,119,415 data points
Candidate set: 337 time series of length 128
r = 4.3

Algorithm      Number of Steps
brute force    91,417,607,168
classic        13,028,000,000
H-Merge        3,204,100,000
H-Merge-R      10,064,000,000

[Figure: bar chart of number of steps (x 10^10) for brute force, classic, H-Merge, and H-Merge-R.]
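The brute-force figures for the ECG and stock datasets are consistent with counting one step per point comparison, for every subsequence position and every candidate. The short check below is a sketch of that reading. The function name `brute_force_steps` is hypothetical, and the slides do not state the counting rule explicitly.

```python
def brute_force_steps(series_len, n_candidates, cand_len):
    """Point comparisons when every subsequence is fully compared to every candidate."""
    n_subsequences = series_len - cand_len + 1
    return n_subsequences * n_candidates * cand_len


print(brute_force_steps(650_000, 200, 40))       # ECG dataset
print(brute_force_steps(2_119_415, 337, 128))    # stock dataset
```

Both values match the brute force rows reported for these two datasets.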
Euclidean Distance: Audio Dataset
Batch time series: 46,143,488 data points (one hour’s sound)
Candidate set: 68 time series of length 101
r = 4.14
Sliding window: 11,025 (1 second)
Step: 5,512 (0.5 second)

Algorithm      Number of Steps
brute force    57,485,160
classic        1,844,997
H-Merge        1,144,778
H-Merge-R      2,655,816

[Figure: bar chart of number of steps (x 10^7) for brute force, classic, H-Merge, and H-Merge-R.]
DTW Distance: Gun Dataset
Batch time series: 18,750 data points
Candidate set: 80 time series of length 150
r = 1.23
offset = 15
warping window size = 3%

Class          Interesting Segments   Hits (Euclidean)   Hits (DTW)
Female-Gun     37                     35                 37
Female-Point   30                     22                 30
Male-Gun       30                     13                 19
Male-Point     28                     14                 17
Total          125                    84                 103
Accuracy                              67.2%              82.4%
DTW Distance: ECG Dataset
Batch time series: 200,000 data points
Candidate set: 200 time series of length 40
r = 0.5
offset = 20
warping window size = 3%

Class      Interesting Segments   Hits (Euclidean)   Hits (DTW)
A          107                    104                107
E          105                    67                 101
L          74                     62                 72
R          86                     79                 84
Total      372                    312                364
Accuracy                          83.87%             97.85%
Speedup by Sorting Wedge
Random walk time series with length 1,000

Sorted      95,025      151,723     345,226     778,367
Unsorted    1,906,244   2,174,994   2,699,885   3,286,213
Semi Supervised Time Series Classification
[Figure: precision-recall breakeven point (0.65 to 1) versus number of examples in P (10 to 210).]

                      Training Set   Testing Set   ECG Dataset
Positive (Abnormal)   208            312           520
Negative (Normal)     602            904           1,506
Total                 810            1,216         2,026