Time Series Filtering

Presentation transcript:

Time Series Filtering
Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any of the candidates in C.
[Figure: a time series, 12 candidates, and a match labeled Q11.]

Filtering vs. Querying
[Figure: two panels. Querying: a single query (template) is compared against a database to find the best match. Filtering: a set of queries is compared against a database to find matches (Q11 in the figure).]

Euclidean Distance Metric
Given two time series Q = q_1, ..., q_n and C = c_1, ..., c_n, their Euclidean distance is defined as:
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$
[Figure: Q and C plotted over 100 data points.]

Early Abandon
During the computation, if the running sum of the squared differences between each pair of corresponding data points exceeds r^2, we can safely stop the calculation.
[Figure: Q and C plotted over 100 data points, with the point where the calculation is abandoned marked.]
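The early abandoning idea can be sketched as follows. This is a minimal illustration, not the presenters' code; the function name `early_abandon_distance` and the convention of returning `None` on abandonment are assumptions made for the example.

```python
import math


def early_abandon_distance(q, c, r):
    """Return the Euclidean distance between q and c, or None once it is
    known to exceed r (hypothetical helper, not from the slides)."""
    threshold = r * r          # compare squared sums, so no sqrt in the loop
    total = 0.0
    for qi, ci in zip(q, c):
        total += (qi - ci) ** 2
        if total > threshold:  # distance already exceeds r: stop early
            return None
    return math.sqrt(total)


print(early_abandon_distance([0, 0], [3, 4], 5))  # 5.0
print(early_abandon_distance([0, 0], [3, 4], 4))  # None
```

Comparing the running sum against r^2 rather than taking a square root at every step keeps the inner loop cheap.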

Classic Approach
Individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: a time series and 12 candidates.]
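A rough sketch of the classic approach over a batch time series is shown below. It assumes that all candidates have the same length and that the window slides one point at a time; the names `within_r` and `classic_filter` are hypothetical.

```python
def within_r(window, candidate, r):
    """Early-abandoning test: is the Euclidean distance at most r?"""
    threshold = r * r
    total = 0.0
    for wi, ci in zip(window, candidate):
        total += (wi - ci) ** 2
        if total > threshold:
            return False       # abandoned early
    return True


def classic_filter(t, candidates, r):
    """Compare every subsequence of t with every candidate, one by one.
    Assumes all candidates share the same length (an assumption of this sketch)."""
    n = len(candidates[0])
    matches = []
    for start in range(len(t) - n + 1):
        window = t[start:start + n]
        for idx, cand in enumerate(candidates):
            if within_r(window, cand, r):
                matches.append((start, idx))
    return matches


t = [0, 0, 1, 2, 3, 0]
candidates = [[1, 2, 3], [0, 0, 0]]
print(classic_filter(t, candidates, 0.5))  # [(2, 0)]
```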

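Wedge
Given candidate sequences C_1, ..., C_k, we can form two new sequences U and L:
$$U_i = \max(C_{1i}, \dots, C_{ki}), \qquad L_i = \min(C_{1i}, \dots, C_{ki})$$
They form the smallest possible bounding envelope that encloses the sequences C_1, ..., C_k. We call the combination of U and L a wedge, and denote a wedge as W = {U, L}.
A lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W:
$$LB\_Keogh(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$
[Figure: candidates C1 and C2, the wedge W formed by U and L, and a query Q plotted against the wedge.]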
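As an illustration, the envelope construction and a lower bound of the usual LB_Keogh form can be sketched as below. The function names are hypothetical, and the lower bound is written in its standard form (squared distance to U where the query lies above the envelope, to L where it lies below, zero inside), which is an assumption about the exact formula on the slide.

```python
import math


def build_wedge(candidates):
    """Return (U, L), the pointwise max and min over equal-length candidates."""
    upper = [max(vals) for vals in zip(*candidates)]
    lower = [min(vals) for vals in zip(*candidates)]
    return upper, lower


def lb_keogh(q, wedge):
    """Lower bound on the Euclidean distance from q to any candidate in the wedge."""
    upper, lower = wedge
    total = 0.0
    for qi, ui, li in zip(q, upper, lower):
        if qi > ui:
            total += (qi - ui) ** 2   # query is above the envelope
        elif qi < li:
            total += (qi - li) ** 2   # query is below the envelope
        # points inside the envelope contribute nothing
    return math.sqrt(total)


wedge = build_wedge([[1, 2, 3], [2, 1, 4]])
print(wedge)                         # ([2, 2, 4], [1, 1, 3])
print(lb_keogh([0, 1.5, 6], wedge))  # sqrt(5), about 2.236
```

In this example the bound (about 2.236) is below the true distances to both candidates (about 3.20 and 2.87), as a lower bound should be.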

Generalized Wedge
Use W(1,2) to denote that a wedge is built from sequences C1 and C2. Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3.
[Figure: C1 (or W1), C2 (or W2), and C3 (or W3) merged into W(1,2) and then W((1,2),3).]

H-Merge
Compare the query to the wedge using LB_Keogh. If the LB_Keogh function early abandons, we are done. Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
[Figure: a time series and 12 candidates.]
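The procedure can be sketched roughly as follows for hierarchically nested wedges. This is only an illustrative reading of the slide: the dictionary-based wedge representation and the names `leaf`, `merge`, `lb_keogh_early`, and `h_merge` are assumptions, and the sketch descends into child wedges one level at a time rather than jumping straight to the individual candidates.

```python
import math


def leaf(seq, label):
    """A single candidate viewed as a degenerate wedge (U == L)."""
    return {"upper": list(seq), "lower": list(seq), "children": [], "label": label}


def merge(a, b):
    """Build the wedge that encloses wedges a and b."""
    return {
        "upper": [max(x, y) for x, y in zip(a["upper"], b["upper"])],
        "lower": [min(x, y) for x, y in zip(a["lower"], b["lower"])],
        "children": [a, b],
        "label": (a["label"], b["label"]),
    }


def lb_keogh_early(q, w, r):
    """LB_Keogh with early abandoning; returns None once the bound exceeds r."""
    limit, total = r * r, 0.0
    for qi, ui, li in zip(q, w["upper"], w["lower"]):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > limit:
            return None
    return math.sqrt(total)


def h_merge(q, wedge_set, r):
    """Return (label, distance) for every candidate within r of q."""
    matches, stack = [], list(wedge_set)
    while stack:
        w = stack.pop()
        d = lb_keogh_early(q, w, r)
        if d is None:
            continue                          # every candidate in w is pruned
        if not w["children"]:
            matches.append((w["label"], d))   # for a leaf, the bound is the true distance
        else:
            stack.extend(w["children"])       # cannot prune: look inside the wedge
    return matches


c1, c2, c3 = leaf([0, 0, 0], 1), leaf([0.1, 0, 0], 2), leaf([5, 5, 5], 3)
top = merge(merge(c1, c2), c3)                # W((1,2),3)
print(sorted(label for label, _ in h_merge([0, 0, 0.1], [top], 0.5)))  # [1, 2]
```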

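Hierarchical Clustering
[Figure: dendrogram over C1 (or W1) through C5 (or W5). W2 and W5 merge into W(2,5), W1 and W4 into W(1,4), W(2,5) and W3 into W((2,5),3), and finally everything into W(((2,5),3),(1,4)). Each level gives a wedge set, from K = 5 down to K = 1.]
Which wedge set to choose?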

Which Wedge Set to Choose?
Test all k wedge sets on a representative sample of data. Choose the wedge set which performs the best.

Upper Bound on H-Merge
The wedge-based approach seems to be efficient when comparing a set of time series to a large batch dataset. But what about streaming time series? Streaming algorithms are limited by their worst case. Being efficient on average does not help.
[Figure: worst case, showing a subsequence against W((1, 2), 3), W(1, 2), and C1 (or W1), C2 (or W2), C3 (or W3).]

Triangle Inequality
If dist(W((2,5),3), W(1,4)) >= 2r, a subsequence whose distance to W((2,5),3) is < r cannot also be within r of W(1,4), so W(1,4) fails. The subsequence cannot fail on both wedges.
[Figure: the wedge sets for K = 5 down to K = 1, with a subsequence compared against W((2,5),3) and W(1,4).]

Experimental Setup
Datasets: ECG dataset, stock dataset, audio dataset.
We measure the number of computational steps used by the following methods:
  Brute force
  Brute force with early abandoning (classic)
  Our approach (H-Merge)
  Our approach with a random wedge set (H-Merge-R)

Experimental Results: ECG Dataset
Batch time series: 650,000 data points (half an hour's ECG signals)
Candidate set: 200 time series of length 40
r = 0.5

Algorithm      Number of Steps
brute force    5,199,688,000
classic        210,190,006
H-Merge        8,853,008
H-Merge-R      29,480,264

[Figure: bar chart of number of steps (x 10^9) per algorithm.]

Experimental Results: Stock Dataset
Batch time series: 2,119,415 data points
Candidate set: 337 time series of length 128
r = 4.3

Algorithm      Number of Steps
brute force    91,417,607,168
classic        13,028,000,000
H-Merge        3,204,100,000
H-Merge-R      10,064,000,000

[Figure: bar chart of number of steps (x 10^10) per algorithm.]

Experimental Results: Audio Dataset
Batch time series: 46,143,488 data points (one hour's sound)
Candidate set: 68 time series of length 101
r = 4.14
Sliding window: 11,025 (1 second)
Step: 5,512 (0.5 second)

Algorithm      Number of Steps
brute force    57,485,160
classic        1,844,997
H-Merge        1,144,778
H-Merge-R      2,655,816

[Figure: bar chart of number of steps (x 10^7) per algorithm.]
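The slides do not say how the brute-force step counts were obtained. One reading that reproduces the reported figures for all three datasets is (number of sliding windows) x (number of candidates) x (candidate length). The sketch below checks that arithmetic; the function names are hypothetical, and for the audio dataset it assumes each 11,025-point window is compared at the candidate length of 101.

```python
def sliding_windows(n_points, window, step=1):
    """Number of windows of the given length that fit in n_points."""
    return (n_points - window) // step + 1


def brute_force_steps(num_windows, num_candidates, length):
    """One step per data point, per candidate, per window."""
    return num_windows * num_candidates * length


ecg = brute_force_steps(sliding_windows(650_000, 40), 200, 40)
stock = brute_force_steps(sliding_windows(2_119_415, 128), 337, 128)
audio = brute_force_steps(sliding_windows(46_143_488, 11_025, 5_512), 68, 101)

print(ecg)    # 5199688000
print(stock)  # 91417607168
print(audio)  # 57485160

# Ratio of brute-force steps to the reported H-Merge steps on the ECG data.
print(round(ecg / 8_853_008))  # 587
```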

Experimental Results: Sorting
Wedge of length 1,000
Random walk time series of length 65,536

            r = 0.5      r = 1        r = 2        r = 3
Sorted      95,025       151,723      345,226      778,367
Unsorted    1,906,244    2,174,994    2,699,885    3,286,213