Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
Authors: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
Published in: SIGKDD 2012
Presenters: 0016062 胡喬峰, 0016037 黃培綸
Outline
Introduction: Motivation, Explicit Statement of the Assumptions
Related Work
Background & Notations
Algorithms: Known Optimizations, New Optimizations
Experimental Results
Conclusions & Discussions
Motivation
Time series data: a time series is a collection of observations made sequentially in time.
Most time series data mining algorithms require similarity comparisons as a subroutine, and this similarity computation is the time bottleneck.
Applications of DTW: robotics, medicine, biometrics, climatology, gesture recognition, and more.
In spite of the ubiquity of DTW, it is still considered too computationally expensive.
Motivation
Combine four novel ideas (the UCR Suite) to remove all of these objections:
Answering queries of length 1,000 under DTW in a random walk dataset of one million objects: 5.65 sec for a recent approximate method with 95% accuracy → 3.8 sec exact with the UCR Suite.
A word-spotting task in speech with the DTW-KNN method: approximately 2 minutes → less than a second.
A gesture task where DTW took 128.26 minutes to run the 14,400 tests for a given subject's 160 gestures → under 3 seconds.
Explicit Statement of the Assumptions
Trillion: more than all of the time series data considered in all papers ever published in all of data mining combined.
Assumptions of this work:
Time Series Subsequences must be Normalized.
Dynamic Time Warping is the Best Measure.
Arbitrary Query Lengths cannot be Indexed.
There Exist Data Mining Problems that we are Willing to Wait Some Hours to Answer.
Explicit Statement of the Assumptions
Time Series Subsequences must be Normalized.
Gun/NoGun classification problem: the figure in the center was Z-normalized.
Using DTW 1-NN, the error rate is 0.087.
If the data are slightly un-normalized (a small change of scaling or offset is added), DTW 1-NN over 1,000 runs gives an error rate of 0.326.
Explicit Statement of the Assumptions
Dynamic Time Warping is the Best Measure.
Is DTW the right measure to spend effort speeding up?
The optimized DTW search presented here is much faster than all current Euclidean distance searches.
After an exhaustive literature search of more than 800 papers, no alternative measure has been shown to outperform DTW.
Explicit Statement of the Assumptions
Arbitrary Query Lengths cannot be Indexed.
If we know the length of queries in advance, we can search by indexing the data.
Suppose we have a query Q of length 65, but the index only supports queries of length 64. We search the index for Q[1:64] and find that its best match has distance 5.17.
What does this tell us about the best match for the full Q? Surprisingly little, because each subsequence must be re-normalized before comparison, so the prefix distance does not bound the full-length distance.
Explicit Statement of the Assumptions
There Exist Data Mining Problems that we are Willing to Wait Some Hours to Answer.
Entomologists spent 3 years gathering 0.2 trillion datapoints.
Astronomers spent billions of dollars to launch satellites.
A hospital charges $34,000 for a day-long EEG session to collect 0.3 trillion datapoints.
Given such collection costs, spending hours of CPU time to answer a query is acceptable.
Related Work
Jegou et al. have demonstrated very fast approximate main-memory search of 10 million images, but this paper considers only exact search, not approximate search.
H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. 2008. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1(2), 1542-1552.
P. Papapetrou, V. Athitsos, M. Potamias, G. Kollios, and D. Gunopulos. 2011. Embedding-based subsequence matching in time-series databases. ACM TODS 36(3), Article 17.
Background & Notations
Definition 1: A Time Series T is an ordered list: T = t_1, t_2, ..., t_m.
Definition 2: A subsequence T_{i,k} of a time series T is a shorter time series of length k starting from position i. Formally, T_{i,k} = t_i, t_{i+1}, ..., t_{i+k-1}, with 1 ≤ i ≤ m-k+1.
Definition 3: The Euclidean distance (ED) between Q and C, where |Q| = |C| = n, is defined as:
$ED(Q,C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
Background & Notations
DTW (Dynamic Time Warping):
Construct an n-by-n matrix whose (i, j) element is the distance ED(q_i, c_j) between points q_i and c_j.
Warping path P: a contiguous set of matrix elements that defines a mapping between Q and C.
$P = p_1, p_2, ..., p_k, ..., p_K$, where $p_k = (i, j)_k$ and $n \le K \le 2n-1$ (written K here to avoid clashing with the series T).
Background & Notations
Several constraints on the warping path:
Boundary: the path must start and finish in diagonally opposite corner cells of the matrix.
Continuity: the steps in the warping path are restricted to adjacent cells.
Monotonicity: points in the warping path must be monotonically spaced in time.
In addition, the warping path is constrained by limiting how far it may stray from the diagonal (the Sakoe-Chiba band).
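The slide lists the constraints but not the recurrence, so here is a minimal sketch of band-constrained DTW in Python; the function name dtw_distance and the use of squared point distances (consistent with the squared-distance optimization later) are our choices, not the paper's reference code:

```python
import numpy as np

def dtw_distance(q, c, r):
    """Squared DTW distance between equal-length series q and c,
    constrained to a Sakoe-Chiba band of half-width r."""
    n = len(q)
    INF = float("inf")
    # cost[i][j] = cheapest cost of a warping path reaching cell (i, j)
    cost = np.full((n + 1, n + 1), INF)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        lo, hi = max(1, i - r), min(n, i + r)  # stay inside the band
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            # adjacent-cell steps enforce continuity and monotonicity;
            # starting at (0,0) and ending at (n,n) enforces the boundary
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, n]
```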
Algorithms
Proposed method: the UCR Suite, an algorithm for finding the nearest neighbor, combining known and new optimizations.
Known optimizations:
1. Using the Squared Distance
2. Lower Bounding
3. Early Abandoning of ED and LB_Keogh
4. Early Abandoning of DTW
5. Exploiting Multicores
New optimizations:
1. Early Abandoning Z-Normalization
2. Reordering Early Abandoning
3. Reversing the Query/Data Role in LB_Keogh
4. Cascading Lower Bounds
Known Optimizations
1. Using the Squared Distance
Both DTW and ED involve a square root calculation. We use the squared distance instead, since dropping the square root does not change the relative rankings of nearest neighbors.
This is a simple internal change in the code, and it makes the later optimizations easier to explain.
Known Optimizations
2. Lower Bounding
Use cheap-to-compute lower bounds to prune unpromising candidates before computing the full DTW:
LB_Kim: O(1); uses the distances between the first (and last) pair of points of C and Q as a lower bound.
LB_Keogh: O(n); uses the ED between C and the closer of the envelope {U, L} built around Q as a lower bound.
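A sketch of LB_Keogh under the same band assumption as above (the envelope is built on the fly here for clarity; in practice it is precomputed once per query):

```python
def lb_keogh(q, c, r):
    """LB_Keogh lower bound (squared) on DTW(q, c) for band half-width r.
    U_i / L_i are the max / min of q over [i-r, i+r]; any point of c that
    falls outside [L_i, U_i] must pay at least the squared excess."""
    n = len(q)
    lb = 0.0
    for i in range(n):
        lo, hi = max(0, i - r), min(n - 1, i + r)
        u, l = max(q[lo:hi + 1]), min(q[lo:hi + 1])  # envelope around q
        if c[i] > u:
            lb += (c[i] - u) ** 2
        elif c[i] < l:
            lb += (c[i] - l) ** 2
    return lb
```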
Known Optimizations
3. Early Abandoning of ED and LB_Keogh
Keep a best-so-far value b (the distance of the best match found so far).
If the incremental sum of the squared distances exceeds b, it is pointless to continue the calculation, so the candidate is abandoned.
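A minimal sketch of the early-abandoning distance computation:

```python
def squared_ed_early_abandon(q, c, bsf):
    """Squared ED that abandons as soon as the running sum can no longer
    beat the best-so-far value bsf."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf:
            return float("inf")  # pruned
    return total
```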
Known Optimizations
4. Early Abandoning of DTW
Even after a full LB_Keogh has been computed, we can abandon during the DTW calculation itself:
DTW(Q_{1:K}, C_{1:K}) + LB_Keogh(Q_{K+1:n}, C_{K+1:n}) is a lower bound on DTW(Q_{1:n}, C_{1:n}).
If at any time this lower bound exceeds the best-so-far value, we stop and prune this C.
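A sketch of the idea: precompute the per-position LB_Keogh contributions summed from the back (an array we call cb, a name borrowed in spirit from the released UCR code), then, while filling DTW row i, abandon when the cheapest cell in the row plus the remaining cb mass already exceeds the best-so-far. The band bookkeeping (the cb index i+r) is our simplification, not the paper's exact code:

```python
import numpy as np

def keogh_cb(q, c, r):
    """cb[k] = sum of LB_Keogh contributions of c[k:] against q's envelope."""
    n = len(q)
    cb = np.zeros(n + 1)
    for i in range(n - 1, -1, -1):
        lo, hi = max(0, i - r), min(n - 1, i + r)
        u, l = max(q[lo:hi + 1]), min(q[lo:hi + 1])
        d = (c[i] - u) ** 2 if c[i] > u else ((c[i] - l) ** 2 if c[i] < l else 0.0)
        cb[i] = cb[i + 1] + d
    return cb

def dtw_early_abandon(q, c, r, cb, bsf):
    """Band-constrained DTW that prunes with partial-DTW + remaining LB_Keogh."""
    n = len(q)
    INF = float("inf")
    prev = np.full(n + 1, INF)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = np.full(n + 1, INF)
        lo, hi = max(1, i - r), min(n, i + r)
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = d + min(prev[j - 1], prev[j], curr[j - 1])
        # candidate points beyond the band still owe their LB_Keogh share
        if i + r < n and curr[lo:hi + 1].min() + cb[i + r] >= bsf:
            return INF  # abandoned
        prev = curr
    return prev[n]
```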
New Optimizations
1. Early Abandoning Z-Normalization
Interleave the early-abandoning calculation of ED (or LB_Keogh) with online Z-normalization.
This prunes not just distance calculations but also normalization steps (in the paper's pseudocode, the test on line 11 performs this abandoning).
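A minimal end-to-end sketch of this idea: running sums ex and ex2 give the mean and standard deviation of each sliding window in O(1), and each candidate point is normalized only when the distance loop actually reaches it. Names and structure are our simplification, not the released UCR code:

```python
import math

def search_ea_znorm(T, q_norm, bsf=float("inf")):
    """Slide a window of length m over the raw series T; z-normalization of
    the candidate is interleaved with the early-abandoning distance sum."""
    m = len(q_norm)
    ex = ex2 = 0.0
    best_loc = -1
    for i, x in enumerate(T):
        ex += x
        ex2 += x * x
        if i >= m - 1:
            start = i - m + 1
            mu = ex / m
            sigma = math.sqrt(max(ex2 / m - mu * mu, 1e-12))
            dist = 0.0
            for j in range(m):                  # normalize lazily, point by point
                cj = (T[start + j] - mu) / sigma
                dist += (cj - q_norm[j]) ** 2
                if dist >= bsf:                 # abandon: the remaining points
                    break                       # are never even normalized
            if dist < bsf:
                bsf, best_loc = dist, start
            ex -= T[start]                      # slide the window
            ex2 -= T[start] * T[start]
    return best_loc, bsf
```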
New Optimizations
2. Reordering Early Abandoning
Compute ED or LB in some order other than left to right. There are n! possible orderings; the paper conjectures that the optimal order sorts the query points by the absolute values of their Z-normalized heights.
Intuition: each q_i will be compared to many c_i's, and after Z-normalization the candidate values follow an approximately standard normal distribution centered at zero, so the query sections farthest from zero contribute the largest terms and trigger abandoning earliest.
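A sketch of the reordering (the order is computed once per query with numpy's argsort):

```python
import numpy as np

def abandon_order(q_norm):
    """Indices of the z-normalized query sorted by |value|, largest first."""
    return np.argsort(-np.abs(q_norm))

def squared_ed_reordered(q_norm, c_norm, order, bsf):
    """Early-abandoning squared ED, visiting points in the given order."""
    total = 0.0
    for k in order:
        total += (q_norm[k] - c_norm[k]) ** 2
        if total >= bsf:
            return float("inf")
    return total
```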
New Optimizations
3. Reversing the Query/Data Role in LB_Keogh
LB_KeoghEQ: the envelope is built around the query; this needs to be done only once, so it has trivial time and space overhead.
LB_KeoghEC: the envelope is built around the candidate instead. It must be computed per candidate, so it is used selectively, only when the cheaper bounds fail to prune.
New Optimizations
4. Cascading Lower Bounds
Use only the lower bounds that sit on the skyline of the tightness-versus-cost tradeoff.
Start with LB_KimFL; if a candidate is not pruned at this stage, compute LB_KeoghEQ.
Because LB_KeoghEQ is computed incrementally, we may abandon it anywhere between O(1) and O(n) work, as soon as the partial lower bound exceeds the best-so-far value.
New Optimizations
4. Cascading Lower Bounds (continued)
If LB_KeoghEQ does not exceed the best-so-far value, we reverse the roles of query and data and compute LB_KeoghEC.
If LB_KeoghEC also cannot prune, we begin the early-abandoning calculation of DTW itself.
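Putting the stages together, a sketch of the cascade (lb_keogh, keogh_cb, and dtw_early_abandon are the sketches from the earlier slides; lb_kim_fl is the simplified O(1) first/last-points bound usable once data are z-normalized):

```python
def lb_kim_fl(q, c):
    """O(1) LB_Kim variant: only the first and last point pairs are used."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def cascade(q_norm, c_norm, r, bsf):
    """Cheapest bound first; each stage runs only if the previous failed."""
    if lb_kim_fl(q_norm, c_norm) >= bsf:        # O(1)
        return float("inf")
    if lb_keogh(q_norm, c_norm, r) >= bsf:      # O(n), envelope on query (EQ)
        return float("inf")
    if lb_keogh(c_norm, q_norm, r) >= bsf:      # O(n), roles reversed (EC)
        return float("inf")
    cb = keogh_cb(q_norm, c_norm, r)            # per-point contributions, reused
    return dtw_early_abandon(q_norm, c_norm, r, cb, bsf)
```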
Experimental Results
Across several applications, the paper compares the UCR Suite against three methods:
Naïve: each subsequence is Z-normalized from scratch and the full DTW or Euclidean distance is computed.
State-of-the-art (SOTA): Z-normalized from scratch, early abandoning is used, and the LB_Keogh lower bound is used for DTW.
God's Algorithm (GOAL): an oracle algorithm, which can be seen as a perfect (idealized) lower bound on any algorithm's running time.
The paper evaluates three examples: random walk data, EEG (the standard human brainwave recording), and DNA sequences.
Baseline Tests on Random Walk
Queries of length 128, without the need to ship large hard drives to interested parties.
As noted earlier, a trillion datapoints is a huge data size, and storing such time series on disk would be far too costly. Instead, the experiments publish the random number generator and the seed: a high-quality generator whose period is longer than the longest dataset replaces stored random walk data.
The figure shows the same queries against different data sizes, averaged over 1,000, 100, and 10 queries respectively; the gap between SOTA and the UCR Suite is substantial.
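The reproducibility trick in a few lines (a sketch assuming numpy; the slide does not specify the paper's exact generator):

```python
import numpy as np

def random_walk(n, seed):
    """Anyone holding the generator and the seed can recreate the exact
    dataset, so no hard drives need to be shipped."""
    rng = np.random.default_rng(seed)
    return np.cumsum(rng.standard_normal(n))
```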
Random Walk: Varying the Size of the Query
Same data size, but varying query lengths. Two observations are worth noting in this figure:
Around query length 128, the UCR-DTW curve is quite flat and close to the GOAL assumption (i.e., nearly independent of query length).
The ratio between UCR-ED and UCR-DTW is only about 1.18, i.e., the time for DTW is not significantly different from that for ED; this strongly refutes the assumption in earlier papers that DTW search must be far slower than ED search.
EEG
EEG (electroencephalogram, the standard recording of human brainwaves) was the first application of the UCR Suite.
Data size: 0.3 trillion datapoints; query length: 7,000.
The task is to find a prototypical spike shape (resembling the Batman symbol) in 0.3 trillion brainwave datapoints; the figure shows an enormous gap between UCR-ED and SOTA-ED.
DNA
The left figure shows the algorithm for converting a DNA sequence into a time series. DNA data are enormous; if our algorithm can find a query quickly and exactly, DNA sequence analysis work becomes much easier.
The right figure compares Human chromosome 2 (H2) against its five closest primate relatives, using a query of length 72,500 over a chromosome of nearly 300 million basepairs.
Naïve took 38.7 days, SOTA 34.6 days, and the UCR Suite only 14.6 hours, demonstrating the success of the algorithm.
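A sketch of the conversion in the left figure, assuming the per-base increments shown there (A: +2, G: +1, C: -1, T: -2, accumulated into a running sum):

```python
STEP = {"A": 2, "G": 1, "C": -1, "T": -2}

def dna_to_time_series(seq):
    """Accumulate per-base steps into a running sum."""
    t, out = 0, []
    for base in seq:
        t += STEP[base]
        out.append(t)
    return out

# dna_to_time_series("ATGC") -> [2, 0, 1, 0]
```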
Can we do better than the UCR Suite?
The results above show the UCR Suite performing extremely well; with the known and new optimizations exhausted, the remaining opportunity lies in how the data are sampled.
1. Some datasets are richly oversampled. Consider Q = [2.34, 2.01, 1.99, ...] and its padded version QP = [2.34, 2.34, 2.34, 2.01, 2.01, 2.01, 1.99, 1.99, 1.99, ...], where every element is repeated three times. On QP, GOAL takes exactly three times longer and Naïve takes exactly nine times longer; the UCR Suite takes less than nine times longer, but still more than three. Conversely, if we know the data follow this pattern, we can avoid the wasted work with a one-in-three downsample, as in the sketch below.
2. ECGs were classically sampled at 256 Hz, yet current machines typically sample at 2,048 Hz even though 256 Hz gives essentially the same analysis results, suggesting a potential speedup of about (2,048/256)^2 = 64.
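If the oversampling factor k is known, the fix is a one-liner, and the expected benefit on the O(n^2) DTW core is roughly quadratic in k:

```python
def downsample(x, k=3):
    """One-in-k downsampling, e.g. one-in-three for the QP example above."""
    return x[::k]

# ECG example: (2048 / 256) ** 2 == 64-fold potential speedup
```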
Realtime Medical and Gesture Data
With the UCR Suite, tasks whose time costs were previously prohibitive become feasible: DTW can be used to spot gestures, brainwaves, musical patterns, or anomalous heartbeats in real time, even on low-powered devices, even with multiple channels of data, and even with multiple simultaneous queries.
Speeding up Existing Mining Algorithms
The paper also uses the UCR Suite as a drop-in replacement for the distance subroutine inside existing state-of-the-art algorithms. These algorithms already optimize their distance computations heavily, so the improvements are not dramatic; but since only a subroutine is being swapped, why not?
1. Time Series Shapelets (a time series classification method): 18.9 min reduced to 12.5 min.
2. Online Time Series Motifs (finding repeated subsequences; prior work handled this offline, in batch, whereas this setting is real time): on an EEG dataset, the fastest prior time of 436 seconds is reduced to 156 seconds with UCR.
3. Classification of Historical Musical Scores (of interest to readers with a music-theory background): 4,027 musical symbols converted to time series; 142.4 hours reduced to 725.6 minutes with UCR, an 11.8× speedup.
4. Classification of Ancient Coins.
5. Clustering of Star Light Curves: a light curve plots a star's brightness against time and can be used to estimate its rotation period; SOTA takes 16.57 days, UCR 1.47 days.
Conclusion
An ultra-fast algorithm for finding exact nearest neighbors.
Uses a combination of various optimizations.
Can be used as a subroutine to speed up other algorithms.
Discussions
Strong parts of this paper:
UCR-DTW is faster than all current Euclidean distance searches.
UCR-DTW uses many novel and efficient optimizations.
Weak parts of this paper:
A lot of data must be stored in memory: reading from disk would incur too many I/Os and too much time, so the data are kept in memory and accessed sequentially, but at this scale the memory cost is high.
The Naïve DTW baseline is the recursive version, the slowest possible implementation; the recursive and loop versions differ by at least three levels of performance.
Discussions
Possible improvements:
For query lengths above 128, the performance no longer reaches GOAL.
Fragment the dataset into chunks sized to fit the available memory budget.
Possible extensions & applications:
Voice recognition: for example, optimizing the kind of app where a user sings a melody fragment and the app retrieves likely matching songs.