Efficient Query Filtering for Streaming Time Series


Efficient Query Filtering for Streaming Time Series
Li Wei, Eamonn Keogh, Helga Van Herle, Agenor Mafra-Neto
Computer Science & Engineering Dept., University of California – Riverside, Riverside, CA 92521 ({wli, eamonn}@cs.ucr.edu)
David Geffen School of Medicine, University of California – Los Angeles, Los Angeles, CA 90095 (hvanherle@mednet.ucla.edu)
ISCA Technologies, Riverside, CA 92517 (isca@iscatech.com)
ICDM '05

Outline of Talk
- Introduction to time series
- Time series filtering
- Wedge-based approach
- Experimental results
- Conclusions

What are Time Series?
Time series are collections of observations made sequentially in time, e.g., 4.7275, 4.7083, 4.6700, 4.6600, 4.6617, 4.6517, 4.6500, 4.6500, 4.6917, 4.7533, 4.8233, 4.8700, ...

Time Series are Everywhere
ECG, heartbeat, image, stock, and video data can all be represented as time series.

Time Series Data Mining Tasks
- Clustering
- Classification
- Rule discovery
- Motif discovery
- Query by content
- Anomaly detection
- Visualization

Time Series Filtering
Given a long time series T, a set of candidates C, and a distance threshold r, find all subsequences of T that are within distance r of any of the candidates in C. The candidates and the threshold r are given in advance; the long time series is not, and may arrive as a stream. Applications include ECG monitoring and audio sensor monitoring.

Filtering vs. Querying
In classic querying, a single query (template) is compared against a database, and the best match is returned. In filtering, a whole set of candidate queries is compared against the incoming data, and every subsequence that matches any of the candidates is reported.

Euclidean Distance Metric
Given two time series Q = q1, ..., qn and C = c1, ..., cn, the Euclidean distance between them is defined as:

  D(Q, C) = sqrt( sum_{i=1..n} (qi - ci)^2 )
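A minimal sketch of this distance in Python (the function name is ours, not from the talk):

```python
import math

def euclidean_distance(q, c):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
```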

Early Abandon
During the computation, if the running sum of squared differences between corresponding data points exceeds r^2, we can safely abandon the calculation: the final distance is guaranteed to exceed r.
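A sketch of early abandoning against a threshold r (a hypothetical helper, consistent with the distance function above):

```python
import math

def early_abandon_distance(q, c, r):
    """Euclidean distance between q and c, or None as soon as the running
    sum of squared differences proves the distance must exceed r."""
    threshold = r * r
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > threshold:
            return None  # distance > r; safe to abandon
    return math.sqrt(total)
```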

Classic Approach
Individually compare each of the candidate sequences (there may be hundreds or thousands of them) to the query using the early abandoning algorithm.
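A sketch of the classic loop over a sliding window, reusing early_abandon_distance from above (the sliding-window structure is our assumption; the slide only names the per-candidate comparison):

```python
def classic_filter(T, candidates, r):
    """Slide a window over T and compare each window to every candidate
    with early abandoning. Candidates are assumed to share one length."""
    m = len(candidates[0])
    matches = []
    for start in range(len(T) - m + 1):
        window = T[start:start + m]
        for i, c in enumerate(candidates):
            if early_abandon_distance(window, c, r) is not None:
                matches.append((start, i))
    return matches
```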

Wedge
Having candidate sequences C1, ..., Ck, we can form two new sequences U and L:

  Ui = max(C1i, ..., Cki)
  Li = min(C1i, ..., Cki)

They form the smallest possible bounding envelope that encloses sequences C1, ..., Ck. We call the combination of U and L a wedge, and denote it W = {U, L}.

LB_Keogh gives a lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:

  LB_Keogh(Q, W) = sqrt( sum_{i=1..n} of (qi - Ui)^2 if qi > Ui,
                                         (qi - Li)^2 if qi < Li,
                                         0           otherwise )

The important thing to notice: LB_Keogh is a true lower bound, so if LB_Keogh(Q, W) > r, no candidate enclosed by W can be within r of Q.
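A sketch of wedge construction and the LB_Keogh bound, under the same assumptions and naming as the earlier snippets:

```python
import math

def make_wedge(candidates):
    """Build the wedge W = {U, L} enclosing equal-length candidates."""
    U = [max(vals) for vals in zip(*candidates)]
    L = [min(vals) for vals in zip(*candidates)]
    return U, L

def lb_keogh(q, wedge):
    """Lower bound on the distance from q to every sequence inside the wedge."""
    U, L = wedge
    total = 0.0
    for qi, ui, li in zip(q, U, L):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
    return math.sqrt(total)
```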

Generalized Wedge
We use W(1,2) to denote a wedge built from sequences C1 and C2. Wedges can be hierarchically nested: for example, W((1,2),3) consists of W(1,2) and C3. An individual sequence Ci can itself be viewed as the degenerate wedge Wi.
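Merging two wedges is elementwise; a minimal sketch (since a raw candidate C is the degenerate wedge (C, C), the same function also merges a wedge with a candidate):

```python
def merge_wedges(w1, w2):
    """Merge two wedges into the smallest wedge enclosing both."""
    (U1, L1), (U2, L2) = w1, w2
    return ([max(a, b) for a, b in zip(U1, U2)],
            [min(a, b) for a, b in zip(L1, L2)])
```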

Wedge Based Approach
1. Compare the query to the wedge using LB_Keogh.
2. If the LB_Keogh computation early abandons (the bound exceeds r), we are done: nothing in the wedge can match.
3. Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
A sketch of this loop follows.
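The following sketch puts the pieces together, reusing the hypothetical helpers above. In the actual algorithm the LB_Keogh computation itself can early abandon against r; here we simply compare the finished bound:

```python
def wedge_filter(query, wedge, candidates, r):
    """Prune the whole candidate set with one LB_Keogh test; only on
    failure fall back to per-candidate early-abandoning comparisons."""
    if lb_keogh(query, wedge) > r:
        return []  # the entire wedge is pruned with a single comparison
    return [i for i, c in enumerate(candidates)
            if early_abandon_distance(query, c, r) is not None]
```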

Examples of Wedge Merging
(Figure: the query Q compared against the individual candidates C1, C2, C3, against the merged wedge W(1,2), and against the fully merged wedge W((1,2),3).)

Hierarchical Clustering
The wedge sets are produced by hierarchical clustering of the candidates. Starting from K = 5 individual wedges W1, ..., W5, similar wedges are merged step by step, e.g., into W(2,5) and W(1,4), then W((2,5),3), and finally the single wedge W(((2,5),3),(1,4)) at K = 1. Which wedge set should we choose?
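A sketch of bottom-up wedge merging, reusing merge_wedges from above. The merge criterion here (smallest area of the merged wedge) is our assumption; the slides do not specify one:

```python
def wedge_area(wedge):
    """Total gap between U and L; a proxy for how loose a wedge is."""
    U, L = wedge
    return sum(u - l for u, l in zip(U, L))

def hierarchical_wedge_sets(candidates):
    """Greedily merge the pair of wedges whose union is tightest.
    Returns the wedge set at every level K = n, n-1, ..., 1."""
    wedges = [(list(c), list(c)) for c in candidates]  # degenerate wedges
    levels = [list(wedges)]
    while len(wedges) > 1:
        i, j = min(((a, b) for a in range(len(wedges))
                    for b in range(a + 1, len(wedges))),
                   key=lambda p: wedge_area(merge_wedges(wedges[p[0]],
                                                         wedges[p[1]])))
        merged = merge_wedges(wedges[i], wedges[j])
        wedges = [w for k, w in enumerate(wedges) if k not in (i, j)]
        wedges.append(merged)
        levels.append(list(wedges))
    return levels
```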

Which Wedge Set to Choose?
Test all K wedge sets on a representative sample of data, and choose the wedge set that performs best in these empirical tests.

Upper Bound on Wedge Based Approach
The wedge based approach is efficient when comparing a set of time series to a large batch dataset. But what about streaming time series? Streaming algorithms are limited by their worst case; being efficient on average does not help. The worst case occurs when a subsequence defeats the LB_Keogh pruning at every level of the wedge hierarchy, so that every candidate must be examined individually.

Triangular Inequality
Suppose dist(W((2,5),3), W(1,4)) >= 2r. Then, by the triangular inequality, a subsequence that is within distance r of one of the two wedges must be at least distance r from the other. In other words, the pruning test cannot fail on both wedges, which lets us bound the worst-case work per subsequence.
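A short derivation of this guarantee (our notation; we treat the wedge distances as obeying the triangle inequality, as the slide does):

```latex
% For a subsequence S and wedges W_a, W_b with d(W_a, W_b) >= 2r:
% if d(S, W_a) < r, then d(W_a, W_b) <= d(W_a, S) + d(S, W_b) gives
\[
  d(S, W_b) \;\ge\; d(W_a, W_b) - d(S, W_a) \;>\; 2r - r \;=\; r,
\]
% so S cannot be within distance r of both wedges.
```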

Experimental Setup
Datasets: an ECG dataset, a stock dataset, and an audio dataset. We measure the number of computational steps used by the following methods:
- brute force
- brute force with early abandoning (classic)
- our approach (Atomic Wedgie)
- our approach with a random wedge set (AWR)
How to choose r? A logical value for r is the average distance from a pattern to its nearest neighbor; a sketch of this heuristic follows.
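A minimal sketch of that threshold heuristic, reusing euclidean_distance from above:

```python
def choose_r(patterns):
    """Average distance from each pattern to its nearest neighbor."""
    nearest = [min(euclidean_distance(p, q)
                   for j, q in enumerate(patterns) if j != i)
               for i, p in enumerate(patterns)]
    return sum(nearest) / len(nearest)
```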

ECG Dataset
Batch time series: 650,000 data points (half an hour of ECG signals).
Candidate set: 200 time series of length 40, covering 4 types of patterns:
- left bundle branch block beat
- right bundle branch block beat
- atrial premature beat
- ventricular escape beat
r = 0.5. Upper bound: 2,120 (8,000 for brute force).

Algorithm       Number of steps
brute force     5,199,688,000
classic         210,190,006
Atomic Wedgie   8,853,008
AWR             29,480,264

Stock Dataset
Batch time series: 2,119,415 data points.
Candidate set: 337 time series of length 128, covering 3 types of patterns:
- head and shoulders
- reverse head and shoulders
- cup and handle
r = 4.3. Upper bound: 18,048 (43,136 for brute force).

Algorithm       Number of steps
brute force     91,417,607,168
classic         13,028,000,000
Atomic Wedgie   3,204,100,000
AWR             10,064,000,000

Audio Dataset
Batch time series: 37,583,512 data points (one hour of sound).
Candidate set: 68 time series of length 51, covering 3 species of harmful mosquitoes:
- Culex quinquefasciatus
- Aedes aegypti
- Culiseta spp.
Sliding window: 11,025 points (1 second); step: 5,512 points (0.5 second).
r = 2. Upper bound: 2,929 (6,868 for brute force).

Algorithm       Number of steps
brute force     57,485,160
classic         1,844,997
Atomic Wedgie   1,144,778
AWR             2,655,816

Conclusions
- We introduce the problem of time series filtering.
- Combining similar sequences into a wedge is a very promising idea.
- We provide an upper bound on the cost of the algorithm, from which we can compute the fastest arrival rate we can guarantee to handle.

Future Work
- Dynamic wedge set selection for data with concept drift
- Extension to other distance measures, for example DTW (Dynamic Time Warping) and uniform scaling

Questions?
All datasets used in this talk can be found at http://www.cs.ucr.edu/~wli/ICDM05/

Z-Normalization
Normalize a data sequence C to have mean 0 and standard deviation 1:

  C' = (C - mean(C)) / std(C)
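A one-function sketch of z-normalization (plain Python, no libraries assumed):

```python
def z_normalize(c):
    """Rescale a sequence to zero mean and unit standard deviation."""
    n = len(c)
    mean = sum(c) / n
    std = (sum((x - mean) ** 2 for x in c) / n) ** 0.5
    return [(x - mean) / std for x in c]  # assumes std > 0
```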