CS 260 Winter 2014: Eamonn Keogh's Presentation of Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh, Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping (SIGKDD 2012)


1 CS 260, Winter 2014. Eamonn Keogh's presentation of: Thanawin Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Brandon Westover, Qiang Zhu, Jesin Zakaria, Eamonn Keogh (2012). Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping. SIGKDD 2012. (Slides I created for this 260 class have this green background.)

What is a Time Series? Hand at rest. Hand moving above holster. Hand moving down to grasp gun. Hand moving to shoulder level. Shooting. Lance Armstrong?

Where is the closest match to Q in T? What is Similarity Search I? Q T

Where is the closest match to Q in T? What is Similarity Search II? Q T

Note that we must normalize the data What is Similarity Search II? Q T

6 Indexing refers to any technique for searching a collection of items without having to examine every object. Obvious example: search by last name. Let's look for Poe… What is Indexing I? A-B-C-D-E-F | G-H-I-J-K-L-M | N-O-P-Q-R-S | T-U-V-W-X-Y-Z

It is possible to index almost anything, using Spatial Access Methods (SAMs) What is Indexing II? T Q

What is Dynamic Time Warping? Mountain Gorilla Gorilla gorilla beringei Lowland Gorilla Gorilla gorilla graueri DTW Alignment

Searching and Mining Trillions of Time Series Subsequences under Dynamic Time Warping Thanawin (Art) Rakthanmanon, Bilson Campana, Abdullah Mueen, Gustavo Batista, Qiang Zhu, Brandon Westover, Jesin Zakaria, Eamonn Keogh

What is a Trillion? A trillion is simply one million million. Up to 2011 there had been 1,709 papers in this conference. If every such paper had been on time series, and each had looked at five hundred million objects, this would still not add up to the size of the data we consider here. In fact, the largest time series dataset considered in a SIGKDD paper was a "mere" one hundred million objects. 11

Dynamic Time Warping 12 Two sequences Q and C with similar but out-of-phase peaks; DTW aligns them within a warping window of width R.
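The alignment sketched above can be computed with dynamic programming restricted to a warping window (Sakoe-Chiba band) of half-width R. A minimal sketch; the function name and the squared local cost are my choices, not the paper's code:

```python
import numpy as np

def dtw_distance(q, c, r):
    """Squared DTW distance between q and c with a warping window
    of half-width r (cell (i, j) is reachable only if |i - j| <= r)."""
    n, m = len(q), len(c)
    INF = float("inf")
    # cost[i][j] = best cumulative cost of aligning q[:i] with c[:j]
    cost = np.full((n + 1, m + 1), INF)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Note that with r = 0 only the diagonal is reachable, so DTW degenerates to the squared Euclidean distance.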

Motivation Similarity search is the bottleneck for most time series data mining algorithms. The difficulty of scaling search to large datasets explains why most academic work has considered at most a few million time series objects. 13

Objective Search and mine really big time series. Allow us to solve higher-level time series data mining problems, such as motif discovery and clustering, at scales that would otherwise be untenable. 14

Assumptions (1) Time Series Subsequences must be Z-Normalized – In order to make meaningful comparisons between two time series, both must be normalized. – Offset invariance. – Scale/amplitude invariance. Dynamic Time Warping is the Best Measure (for almost everything) – Recent empirical evidence strongly suggests that none of the published alternatives routinely beats DTW. 15
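The z-normalization assumed above subtracts the mean and divides by the standard deviation, which gives exactly the offset and scale/amplitude invariance the slide lists. A minimal sketch (the zero-variance guard is my own assumption):

```python
import numpy as np

def z_normalize(ts):
    """Z-normalize a time series: zero mean, unit standard deviation."""
    ts = np.asarray(ts, dtype=float)
    mu, sigma = ts.mean(), ts.std()
    if sigma == 0.0:          # constant series: no shape to compare
        return ts - mu
    return (ts - mu) / sigma
```

After normalization, a series and any shifted/rescaled copy of it (e.g. 2*ts + 5) become identical, which is why comparisons without it are largely meaningless.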

Assumptions (2) Arbitrary Query Lengths cannot be Indexed – If we are interested in tackling a trillion data objects, we clearly cannot fit even a small-footprint index in main memory, much less the much larger index suggested for arbitrary-length queries. There Exist Data Mining Problems that we are Willing to Wait Some Hours to Answer – A team of entomologists has spent three years gathering 0.2 trillion datapoints. – Astronomers have spent billions of dollars to launch a satellite that collects one trillion datapoints of star-light-curve data per day. – A hospital charges $34,000 for a day-long EEG session to collect 0.3 trillion datapoints. 16

Proposed Method: UCR Suite 17 An algorithm for nearest neighbor search. Supports both ED and DTW search. A combination of various optimizations: – Known Optimizations – New Optimizations

Known Optimizations (1) Using the Squared Distance. Exploiting Multicores – more cores, more speed. Lower Bounding – LB_Yi – LB_Kim – LB_Keogh

Known Optimizations (2) Early Abandoning of ED. Early Abandoning of LB_Keogh. 19 (U and L form the envelope of Q.)
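The early abandoning of ED mentioned above can be sketched as follows; `bsf` is the best-so-far (squared) nearest-neighbor distance and the helper name is illustrative:

```python
def early_abandon_sq_ed(q, c, bsf):
    """Squared Euclidean distance between q and c, abandoned as soon
    as the running partial sum reaches the best-so-far value bsf."""
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total >= bsf:
            return float("inf")  # cannot beat the current nearest neighbor
    return total
```

Because each squared term is non-negative, the partial sum can only grow, so stopping at `bsf` never discards a true nearest neighbor.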

Known Optimizations (3) Early Abandoning of DTW. Earlier Early Abandoning of DTW using LB_Keogh. 20 C Q R (Warping Window) Stop if dtw_dist ≥ bsf.
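Early abandoning of DTW itself can be sketched as a row-by-row banded DTW that stops once no cell in the current row can possibly beat the best-so-far. This is a hypothetical sketch of the idea, not the UCR Suite's actual code:

```python
import numpy as np

def dtw_early_abandon(q, c, r, bsf):
    """Banded squared DTW that abandons when every partial path in the
    current row already costs at least bsf."""
    n, m = len(q), len(c)
    INF = float("inf")
    prev = np.full(m + 1, INF)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = np.full(m + 1, INF)
        lo, hi = max(1, i - r), min(m, i + r)
        for j in range(lo, hi + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        # Cumulative costs never decrease, so if the cheapest cell in this
        # row already meets bsf, no completion of any path can win.
        if curr[lo:hi + 1].min() >= bsf:
            return INF
        prev = curr
    return prev[m]
```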

Known Optimizations (3) Early Abandoning of DTW. Earlier Early Abandoning of DTW using LB_Keogh. 21 C Q R (Warping Window) (partial) dtw_dist, (partial) lb_keogh. Stop if dtw_dist + lb_keogh ≥ bsf.

UCR Suite New Optimizations 22 Known Optimizations – Early Abandoning of ED – Early Abandoning of LB_Keogh – Early Abandoning of DTW – Multicores

UCR Suite: New Optimizations (1) Early Abandoning Z-Normalization – Do the normalization only when needed (just in time). – A small but non-trivial gain. – This step can break the O(n) time complexity for ED (and, as we shall see, DTW). – Online mean and std calculation is needed. 23
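One way to realize the online mean and std calculation is to maintain running sums of the values and of their squares over the sliding window, so each window's statistics cost O(1). A sketch under that assumption (the UCR Suite's own bookkeeping may differ in detail):

```python
import math

def sliding_stats(ts, m):
    """Yield (start_index, mean, std) for every length-m window of ts,
    using running sums of x and x*x for O(1) work per window."""
    s = s2 = 0.0
    for i, x in enumerate(ts):
        s += x
        s2 += x * x
        if i >= m:                      # drop the value leaving the window
            old = ts[i - m]
            s -= old
            s2 -= old * old
        if i >= m - 1:
            mean = s / m
            var = max(s2 / m - mean * mean, 0.0)  # guard against rounding
            yield i - m + 1, mean, math.sqrt(var)
```

With the mean and std of a window in hand, each point of the candidate can be z-normalized lazily, exactly when the distance computation needs it, so an early abandon also skips the normalization of the untouched points.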

UCR Suite: New Optimizations (2) Reordering Early Abandoning – We don't have to compute the ED or LB from left to right. – Order points by expected contribution: order by the absolute height of the z-normalized query point. – This step alone can save about 30%–50% of the calculations.
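The reordering idea above might be sketched like this: sort the query indices once, by decreasing absolute z-normalized value, then accumulate the early-abandoning distance in that order so the largest expected contributions come first (function names are illustrative):

```python
import numpy as np

def abandon_order(q_norm):
    """Indices of the z-normalized query, largest |value| first."""
    return np.argsort(-np.abs(q_norm))

def reordered_sq_ed(q_norm, c_norm, order, bsf):
    """Early-abandoning squared ED evaluated in the given index order."""
    total = 0.0
    for i in order:
        total += (q_norm[i] - c_norm[i]) ** 2
        if total >= bsf:
            return float("inf")
    return total
```

The intuition: candidates are z-normalized to roughly zero mean, so query points far from zero tend to contribute the most to the distance; summing them first makes the abandon trigger sooner.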

UCR Suite: New Optimizations (3) Reversing the Query/Data Role in LB_Keogh – Makes LB_Keogh tighter. – Much cheaper than DTW. – Naively this would triple the data (U and L for every candidate), so the envelope is computed online. 25 Envelope on Q vs. envelope on C.

UCR Suite: New Optimizations (4) Cascading Lower Bounds – At least 18 lower bounds for DTW have been proposed. – Use only the lower bounds that lie on the skyline of tightness (LB/DTW) vs. computation time. 26
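A cascade of lower bounds can be sketched generically: apply the cheapest bound first, fall through to tighter (and costlier) ones, and pay for the exact distance only if nothing prunes. The function names here are illustrative, not the paper's:

```python
def cascade(q, c, bsf, bounds, true_dist):
    """Try lower bounds in order, cheapest first. Each bound is a
    function lb(q, c) <= true_dist(q, c); if any bound reaches the
    best-so-far value bsf, the candidate is pruned without ever
    computing the exact (expensive) distance."""
    for lb in bounds:
        if lb(q, c) >= bsf:
            return float("inf")   # pruned by a cheap bound
    return true_dist(q, c)
```

In the UCR Suite the cascade runs from LB_Kim-style O(1) bounds, through LB_Keogh and its reversed variant, up to early-abandoning DTW itself; each stage only sees the candidates its cheaper predecessors failed to prune.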

UCR Suite New Optimizations – Just-in-time Z-normalizations – Reordering Early Abandoning – Reversing LB_Keogh – Cascading Lower Bounds 27 Known Optimizations – Early Abandoning of ED – Early Abandoning of LB_Keogh – Early Abandoning of DTW – Multicores

UCR Suite New Optimizations – Just-in-time Z-normalizations – Reordering Early Abandoning – Reversing LB_Keogh – Cascading Lower Bounds 28 Known Optimizations – Early Abandoning of ED – Early Abandoning of LB_Keogh – Early Abandoning of DTW – Multicores State-of-the-art * *We implemented the State-of-the-art (SOTA) as well as we could. SOTA is simply the UCR Suite without new optimizations.

Experimental Result: Random Walk Random Walk: varying the size of the data. (The slide's table reports times for UCR-ED, SOTA-ED, UCR-DTW, and SOTA-DTW at million-object (seconds), billion-object (minutes), and trillion-object (hours) scales.) Code and data are available at:

Random Walk: Varying size of the query Experimental Result: Random Walk 30

Query: Human Chromosome 2 of length 72,500 bps Data: Chimp Genome 2.9 billion bps Time: UCR Suite 14.6 hours, SOTA 34.6 days (830 hours) Experimental Result: DNA 31

Data: 0.3 trillion points of brain wave Query: Prototypical Epileptic Spike of 7,000 points (2.3 seconds) Time: UCR-ED 3.4 hours, SOTA-ED 20.6 days (~500 hours) Experimental Result: EEG 32

Data: One year of electrocardiograms, 8.5 billion data points. Query: Idealized Premature Ventricular Contraction (PVC, a.k.a. skipped beat) of length 421 (R = 21 = 5%). Results: UCR-ED 4.1 minutes, SOTA-ED 66.6 minutes, UCR-DTW 18.0 minutes, SOTA-DTW 49.2 hours. Experimental Result: ECG 33. ~30,000× faster than real time!

Speeding Up Existing Algorithms Time Series Shapelets: – SOTA 18.9 minutes, UCR Suite 12.5 minutes Online Time Series Motifs: – SOTA 436 seconds, UCR Suite 156 seconds Classification of Historical Musical Scores: – SOTA hours, UCR Suite 720 minutes Classification of Ancient Coins: – SOTA 12.8 seconds, UCR Suite 0.8 seconds Clustering of Star Light Curves: – SOTA 24.8 hours, UCR Suite 2.2 hours 34

Conclusion UCR Suite… is an ultra-fast algorithm for finding nearest neighbors. is the first algorithm that exactly mines a trillion real-valued objects in a day or two on an "off-the-shelf" machine. uses a combination of various optimizations. can be used as a subroutine to speed up other algorithms. Probably close to optimal ;-) 35

Authors’ Photo Bilson Campana Abdullah Mueen Gustavo Batista Qiang Zhu Brandon Westover Jesin Zakaria Eamonn Keogh Thanawin Rakthanmanon

Acknowledgements NSF grants and FAPESP award 2009/ Royal Thai Government Scholarship

38 Paper's Impact It was the best paper winner at SIGKDD 2012. It has 37 citations according to Google Scholar. Given that it has been in print only 18 months, this would make it among the most cited papers of that conference, that year. The work was expanded into a journal paper, which adds a section on uniform scaling.

39 Discussion The paper made use of videos

40 Questions About the paper? About the presentation of it?

41

LB_Keogh 42 C Q R (Warping Window) U_i = max(q_{i-r}, …, q_{i+r}); L_i = min(q_{i-r}, …, q_{i+r})
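The envelope definitions above translate directly into code. A sketch of LB_Keogh on z-normalized series, using squared distances to match the squared-distance optimization (function names are my own):

```python
import numpy as np

def envelope(q, r):
    """Upper and lower envelope of q for warping-window half-width r:
    U_i = max(q_{i-r..i+r}), L_i = min(q_{i-r..i+r})."""
    n = len(q)
    U = np.array([max(q[max(0, i - r):i + r + 1]) for i in range(n)])
    L = np.array([min(q[max(0, i - r):i + r + 1]) for i in range(n)])
    return U, L

def lb_keogh(U, L, c):
    """LB_Keogh: squared distance from candidate c to the envelope [L, U];
    a cheap lower bound on the squared DTW distance."""
    c = np.asarray(c, dtype=float)
    above = np.clip(c - U, 0.0, None)   # only points sticking out above U
    below = np.clip(L - c, 0.0, None)   # only points sticking out below L
    return float(np.sum(above ** 2 + below ** 2))
```

Any candidate lying entirely inside the envelope gets a bound of zero; only the parts that poke outside contribute, which is what makes the bound both cheap and admissible.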

Known Optimizations Lower Bounding – LB_Yi – LB_Kim – LB_Keogh 43

Ordering 44 This step alone can save about 50% of the calculations.

UCR Suite New Optimizations – Just-in-time Z-normalization – Reordering Early Abandoning – Reversing LB_Keogh – Cascading Lower Bounds Known Optimizations – Early Abandoning of ED/LB_Keogh/DTW – Use Squared Distance – Multicores 45