Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping
Authors: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh
Published In: SIGKDD 2012
Presenters: 0016062 胡喬峰, 0016037 黃培綸

Outline
- Introduction
  - Motivation
  - Explicit Statement of the Assumptions
- Related Work
- Background & Notations
- Algorithms
  - Known Optimizations
  - New Optimizations
- Experiment Results
- Conclusions & Discussions

Motivation
Time series data: a time series is a collection of observations made sequentially in time.
Most time series data mining algorithms require similarity comparisons as a subroutine, and that comparison time is the bottleneck.
Applications of DTW: robotics, medicine, biometrics, climatology, gesture recognition, and more.
In spite of the ubiquity of DTW, it is still too computationally expensive.

Motivation
The UCR suite combines four novel ideas to remove all the objections to DTW:
- Answers queries of length 1,000 under DTW with 95% accuracy in a random walk dataset of one million objects (5.65 sec → 3.8 sec).
- A word spotting task on speech, sped up by the new DTW-kNN method (approximately 2 minutes → less than a second).
- Gesture recognition: DTW previously took 128.26 minutes to run the 14,400 tests for a given subject's 160 gestures; it now runs in under 3 seconds.

Explicit Statement of the Assumptions
A trillion: more than all of the time series data considered in all papers ever published in all of data mining combined.
Assumptions of this work:
- Time series subsequences must be normalized.
- Dynamic Time Warping is the best measure.
- Arbitrary query lengths cannot be indexed.
- There exist data mining problems we are willing to wait some hours to answer.

Explicit Statement of the Assumptions
Time series subsequences must be normalized.
Gun/NoGun classification problem: when the figure in the center is Z-normalized, DTW 1-NN gives an error rate of 0.087. When the data is left un-normalized (scaling or offset added), DTW 1-NN over 1,000 runs gives an error rate of 0.326.

Explicit Statement of the Assumptions
Dynamic Time Warping is the best measure. Is DTW the right measure to speed up? The optimized DTW search is already much faster than all current Euclidean distance searches, and after an exhaustive literature search of more than 800 papers, no measure has been shown to outperform DTW.

Explicit Statement of the Assumptions
Arbitrary query lengths cannot be indexed. If we know the length of queries in advance, we can search by indexing the data. Suppose we have a query Q of length 65 and the index supports queries of length 64. We search the index for Q[1:64] and find that its best match has distance 5.17. What does this tell us about the best match for the full Q? Surprisingly little, because the subsequence must be renormalized at the new length.
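The renormalization point can be seen with a tiny toy example (the query values below are made up for illustration, standing in for the length-65 Q): the prefix of the Z-normalized full query is not the Z-normalization of the prefix, so the index's answer for Q[1:64] carries little information about the full Q.

```python
import math

def znorm(ts):
    """Z-normalize: zero mean, unit standard deviation."""
    m = sum(ts) / len(ts)
    sd = math.sqrt(sum((x - m) ** 2 for x in ts) / len(ts))
    return [(x - m) / sd for x in ts]

# Hypothetical 5-point query standing in for the length-65 Q.
Q = [1.0, 2.0, 4.0, 8.0, 16.0]

normed_prefix = znorm(Q[:4])      # what a length-4 index matched against
prefix_of_normed = znorm(Q)[:4]   # the first 4 points of the real normalized Q
# The two disagree, so the prefix match tells us surprisingly little.
```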

Explicit Statement of the Assumptions
There exist data mining problems we are willing to wait some hours to answer. Entomologists spent 3 years gathering 0.2 trillion datapoints. Astronomers spent billions of dollars to launch satellites. A hospital charges $34,000 for a day-long EEG session that collects 0.3 trillion datapoints. Compared to those costs, spending hours of CPU time is acceptable.

Related work
Jegou et al. have demonstrated very fast approximate main-memory search of 10 million images; this paper, in contrast, considers only exact search, not approximate search.
H. Ding, G. Trajcevski, P. Scheuermann, X. Wang, and E. J. Keogh. 2008. Querying and mining of time series data: experimental comparison of representations and distance measures. PVLDB 1, 2, 1542-52.
P. Papapetrou, V. Athitsos, M. Potamias, G. Kollios, and D. Gunopulos. 2011. Embedding-based subsequence matching in time-series databases. ACM TODS 36, 3, 17*.

Background & Notations
Definition 1: A time series T is an ordered list: T = t1, t2, ..., tm.
Definition 2: A subsequence Ti,k of a time series T is a shorter time series of length k which starts at position i. Formally, Ti,k = ti, ti+1, ..., ti+k-1, with 1 ≤ i ≤ m−k+1.
Definition 3: The Euclidean distance (ED) between Q and C, where |Q| = |C| = n, is defined as ED(Q, C) = sqrt( Σ_{i=1..n} (qi − ci)² ).
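Definition 3 translates directly into code; a minimal sketch:

```python
import math

def euclidean_distance(q, c):
    """ED between two equal-length sequences (Definition 3):
    ED(Q, C) = sqrt(sum_i (q_i - c_i)^2)."""
    assert len(q) == len(c), "ED is defined only for |Q| = |C|"
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
```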

Background & Notations

Background & Notations
DTW (Dynamic Time Warping): construct an n-by-n matrix whose (i,j) element is the distance between qi and cj.
A warping path P is a contiguous set of matrix elements that defines a mapping between Q and C: P = p1, p2, ..., pt, ..., pT, where pt = (i,j)t and n ≤ T ≤ 2n−1.

Background & Notations
Several constraints on the warping path:
- It must start and finish in diagonally opposite corner cells of the matrix.
- Its steps are restricted to adjacent cells.
- Its points must be monotonically spaced in time.
- It is constrained to stay within a limited distance of the diagonal (the Sakoe-Chiba band).
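The matrix recurrence plus the Sakoe-Chiba band can be sketched as follows (a didactic O(n·r) implementation, not the paper's optimized code; per-cell costs use the squared distance, matching the optimization discussed later):

```python
import math

def dtw(q, c, r):
    """Banded DTW: fill a matrix of cumulative squared distances, visiting
    only cells within r steps of the diagonal (Sakoe-Chiba band)."""
    n, m = len(q), len(c)
    INF = float('inf')
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0  # path must start in the corner cell
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            # adjacency + monotonicity: only the three neighboring cells
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return math.sqrt(D[n][m])  # path must finish in the opposite corner
```

With r = 0 the band collapses to the diagonal and DTW degenerates to ED; with r > 0 it can warp around small misalignments.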

Algorithms
Proposed method: the UCR suite, an algorithm to find the nearest neighbor, combining known optimizations with new ones.
Known optimizations:
1. Use the squared distance
2. Lower bounding
3. Early abandoning of ED and LB_Keogh
4. Early abandoning of DTW
5. Exploiting multicores
New optimizations:
1. Early abandoning Z-normalization
2. Reordering early abandoning
3. Reversing the query/data role in LB_Keogh
4. Cascading lower bounds

Known optimizations
1. Use the squared distance
Both DTW and ED involve a square root calculation. Instead of taking the square root, use the squared distance: it does not change the relative rankings of nearest neighbors, it makes later optimizations easier to explain, and it is a simple internal change in the code.

Known optimizations
2. Lower bounding
Use cheap-to-compute lower bounds to prune unpromising candidates.
LB_Kim: O(1); uses the distances between the first (and last) pair of points from C and Q as a lower bound.
LB_Keogh: O(n); uses the ED between C and the closer of the envelope {U, L} built around Q as a lower bound.
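A sketch of LB_Keogh (returning the squared bound, consistent with the squared-distance optimization; U and L are the running max/min of Q within the warping window r):

```python
def lb_keogh(q, c, r):
    """LB_Keogh: build the envelope {U, L} around q within warping window r,
    then sum the squared distance from c to the envelope. Points of c that
    fall inside the envelope contribute nothing."""
    total = 0.0
    n = len(q)
    for i in range(n):
        window = q[max(0, i - r):min(n, i + r + 1)]
        U, L = max(window), min(window)
        if c[i] > U:
            total += (c[i] - U) ** 2
        elif c[i] < L:
            total += (c[i] - L) ** 2
    return total
```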

Known optimizations
3. Early abandoning of ED and LB_Keogh
We keep a best-so-far value b. If the incremental sum of the distances exceeds b, it is pointless to continue the calculation.
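The idea fits in a few lines; a sketch using the squared distance and a best-so-far value b:

```python
def ed_squared_early_abandon(q, c, b):
    """Incrementally sum squared differences; abandon as soon as the running
    sum exceeds the best-so-far value b, since the candidate cannot win."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > b:
            return float('inf')  # pruned: no need to finish the sum
    return total
```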

Known optimizations
4. Early abandoning of DTW
After computing a full LB_Keogh, begin calculating DTW from the left. For any prefix length K, DTW(Q1:K, C1:K) + LB_Keogh(QK+1:n, CK+1:n) is a lower bound on DTW(Q1:n, C1:n). If at any time this lower bound exceeds the best-so-far, stop and prune this C.

New optimizations
1. Early abandoning Z-normalization
Interleave the early-abandoning calculation of ED (or LB_Keogh) with online Z-normalization, so that pruning skips not only distance calculations but also normalization steps. (In the paper's pseudocode, line 11 enables the abandoning.)
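A sketch of the interleaving. Here the mean and std come from two running sums computed up front; in the full UCR suite those sums are maintained incrementally across the sliding window, so the per-point normalization is the only work that abandoning can skip:

```python
import math

def ea_znorm_dist(subseq, q_norm, b):
    """Squared ED between a raw subsequence and an already-normalized query,
    Z-normalizing the subsequence point-by-point and abandoning early."""
    n = len(subseq)
    s = sum(subseq)
    s2 = sum(x * x for x in subseq)
    mean = s / n
    std = math.sqrt(s2 / n - mean * mean)
    total = 0.0
    for i in range(n):
        x = (subseq[i] - mean) / std       # normalize just-in-time
        total += (x - q_norm[i]) ** 2
        if total > b:
            return float('inf')            # remaining points never normalized
    return total
```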

New optimizations
2. Reordering early abandoning
Compute the ED or LB terms in some order other than left to right; there are n! possible orderings to consider. The conjecture is that the optimal order sorts the query points by absolute (Z-normalized) height: each qi is compared against many ci values, and after Z-normalization the ci values are approximately normally distributed around zero, so the sections of the query farthest from zero contribute the most distance and trigger abandoning earliest.
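A sketch of the reordering heuristic: precompute the order once per query, then run the same early-abandoning sum in that order.

```python
def reorder_indices(q_norm):
    """Process query positions in decreasing |value|: after Z-normalization
    the data is roughly N(0, 1), so positions far from zero are the likeliest
    to accumulate distance quickly and trigger abandoning."""
    return sorted(range(len(q_norm)), key=lambda i: abs(q_norm[i]), reverse=True)

def ed_squared_reordered(q_norm, c_norm, order, b):
    """Early-abandoning squared ED evaluated in the given index order."""
    total = 0.0
    for i in order:
        total += (q_norm[i] - c_norm[i]) ** 2
        if total > b:
            return float('inf')
    return total
```

The final sum is the same as the left-to-right version when nothing is abandoned; only the abandoning point changes.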

New optimizations
3. Reversing the query/data role in LB_Keogh
LB_KeoghEQ builds the envelope around the query; LB_KeoghEC builds it around the candidate instead. Building the envelope around the query needs to be done only once, which saves time and space overhead.

New optimizations
4. Cascading lower bounds
Use only the lower bounds on the skyline. Start with LB_KimFL; if a candidate is not pruned at the current stage, move on to LB_KeoghEQ. Because LB_KeoghEQ is computed incrementally, it may itself abandon anywhere between O(1) and O(n) as soon as the running bound exceeds the best-so-far value.

New optimizations
4. Cascading lower bounds (continued)
If LB_KeoghEQ does not exceed the best-so-far value, reverse the roles of query and data and compute LB_KeoghEC. If LB_KeoghEC also cannot prune, begin the early-abandoning calculation of DTW.
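The cascade above can be sketched end-to-end. This is a simplified illustration (no incremental abandoning inside the bounds, equal-length sequences, squared distances throughout), not the paper's optimized implementation:

```python
def lb_kim_fl(q, c):
    """O(1) bound: the first and last points must align diagonally,
    so their squared distances lower-bound the squared DTW."""
    return (q[0] - c[0]) ** 2 + (q[-1] - c[-1]) ** 2

def lb_keogh(q, c, r):
    """O(n) bound: squared distance from c to the {U, L} envelope of q."""
    total, n = 0.0, len(q)
    for i in range(n):
        window = q[max(0, i - r):min(n, i + r + 1)]
        U, L = max(window), min(window)
        if c[i] > U:
            total += (c[i] - U) ** 2
        elif c[i] < L:
            total += (c[i] - L) ** 2
    return total

def dtw_sq(q, c, r):
    """Full banded DTW, squared; the expensive last resort."""
    n = len(q)
    INF = float('inf')
    D = [[INF] * (n + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(n, i + r) + 1):
            D[i][j] = (q[i - 1] - c[j - 1]) ** 2 + \
                      min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][n]

def cascade(q, c, r, bsf):
    """Cheapest bound first; fall through to DTW only if nothing prunes."""
    if lb_kim_fl(q, c) > bsf:
        return float('inf')
    if lb_keogh(q, c, r) > bsf:      # envelope around the query (EQ)
        return float('inf')
    if lb_keogh(c, q, r) > bsf:      # roles reversed (EC)
        return float('inf')
    return dtw_sq(q, c, r)
```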

Experimental Results
The paper compares the UCR suite against three methods across different applications:
- Naïve: plain Z-normalization followed by DTW or Euclidean distance.
- State-of-the-art (SOTA): each subsequence is Z-normalized from scratch, early abandoning is used, and the LB_Keogh lower bound is used for DTW.
- God's Algorithm (GOAL): an oracle algorithm, which can be viewed as a perfect (idealized) method.
The paper presents three examples: random walk data, EEG (human electroencephalogram) data, and DNA sequences.

Baseline Tests on Random Walk
With queries of length 128. A trillion datapoints, as mentioned earlier, is a very large data size, and storing the time series on disk would cost too much; instead, the data is specified by a random number generator and its seed (a high-quality generator with a period longer than the longest dataset), so the experiments can be reproduced without the need to ship large hard drives to interested parties.
The figure below runs the same queries over different data sizes, averaged over 1,000, 100, and 10 queries respectively; SOTA and UCR differ markedly.

Random Walk: Varying the Size of the Query
Same data size, but different query sizes. Two things in this figure are worth noting. First, at query length 128 the UCR-DTW curve is quite flat, close to GOAL's behavior (nearly independent of query length). Second, the ratio between UCR-ED and UCR-DTW is only 1.18: the time for DTW is not significantly different from that for ED, which strongly refutes the assumptions of earlier papers about ED versus DTW.

EEG
EEG, the human electroencephalogram, was the first application of the UCR suite: a dataset of 0.3 trillion datapoints with a query of length 7,000. As the figure shows, searching 0.3 trillion brainwave points for a shape resembling the Batman logo, UCR-ED outperforms SOTA-ED by a very wide margin.

DNA
The left figure shows the algorithm for converting a DNA sequence to a time series. DNA data is enormous; if the algorithm can find a query quickly and exactly, DNA sequence alignment work becomes much easier. The right figure compares human chromosome 2 (H2) against its five closest primate relatives, analyzed with a query of length 72,500 over genomes of nearly 300 million basepairs. Naïve took 38.7 days, SOTA 34.6 days, and UCR only 14.6 hours, demonstrating the algorithm's success.

Can we do better than the UCR suite?
Given how well the UCR suite performs with both known and novel optimizations, the remaining opportunity lies in how the data is sampled.
1. Some datasets are richly oversampled.
Q = [2.34, 2.01, 1.99, ...]
QP = [2.34, 2.34, 2.34, 2.01, 2.01, 2.01, 1.99, 1.99, 1.99, ...]
QP pads each element of Q three times. Under this pattern, GOAL takes exactly three times longer and Naïve exactly nine times longer; UCR does not slow down by the full factor of nine, but still by more than three. Conversely, if we know the data follows this pattern, we can save the wasted time by taking a one-in-three downsample of QP.
2. ECGs were historically sampled at 256 Hz, while current machines typically sample at 2,048 Hz; in practice 256 Hz gives the same results, suggesting a potential speedup of ((2,048/256)²).
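The padding/downsampling relationship above can be sketched in two lines (the 3× factor matches the QP toy example):

```python
def pad(ts, k):
    """Oversample by repeating each value k times (how QP is built from Q)."""
    return [x for x in ts for _ in range(k)]

def downsample(ts, k):
    """Keep one point in every k; undoes the padding above
    (e.g. 2,048 Hz -> 256 Hz uses k = 8)."""
    return ts[::k]
```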

Realtime Medical and Gesture Data
With the UCR suite, tasks that were previously denied as too time-consuming become possible: DTW can be used to spot gestures, brainwaves, musical patterns, or anomalous heartbeats in real time, even on low-powered devices, even with multiple channels of data, and even with multiple simultaneous queries.

Speeding up Existing Mining Algorithms
Next, the paper applies its method to improve existing state-of-the-art algorithms from other work. These algorithms already optimize their distance computations well; here, only the distance-computation subroutine is replaced with the UCR suite. The improvements are modest, but since only a subroutine is swapped, why not?
1. Time Series Shapelets: a time series classification algorithm, improved from 18.9 minutes to 12.5 minutes.
2. Online Time Series Motifs: finding repeated subsequences. Earlier work found motifs offline (in batch); this focuses on real time. On an EEG dataset, the fastest prior method took 436 seconds; UCR takes 156 seconds.
3. Classification of Historical Musical Scores: 4,027 musical symbols converted to time series; runtime drops from 142.4 hours to 725.6 minutes, a factor of 11.8.
4. Classification of Ancient Coins (not elaborated by the presenter).
5. Clustering of Star Light Curves: a light curve plots a star's brightness over time and can be used to compute its rotation period. SOTA takes 16.57 days; UCR takes 1.47 days.

Conclusion
- An ultra-fast algorithm for finding the nearest neighbor.
- Uses a combination of various optimizations.
- Can be used as a subroutine to speed up other algorithms.

Discussions
Strong parts of this paper:
- UCR-DTW is faster than all current Euclidean distance searches.
- UCR-DTW uses many novel and efficient optimizations.
Weak parts of this paper:
- All data is stored in memory with sequential access (reading from disk would incur too many I/Os and too much time), but at large data sizes memory is expensive.
- The Naïve baseline uses the recursive implementation of DTW, the slowest version; the recursive and iterative versions differ by orders of magnitude.

Discussions
Possible improvements:
- Queries of length above 128: at such lengths UCR cannot reach GOAL's performance.
- Fragmentation of the dataset: split the data into chunks sized to fit in memory, reducing memory cost.
Possible extensions and applications:
- Voice recognition: the method could speed up apps that identify a song from a sung or hummed fragment.