Finding Time Series Motifs on Disk-Resident Data

Abdullah Mueen, Dr. Eamonn Keogh (UC Riverside)
Nima Bigdely-Shamlo (Swartz Center for Computational Neuroscience, UCSD)
Outline
- Motivation: Time Series Motifs
- DAME: Disk-Aware Motif Enumeration
- Performance Evaluation: Speedup and Efficiency
- Case Studies: Motifs in Brain-Computer Interfaces; Motifs in an Image Database
- Conclusion

I will first present the problem of time series motif discovery on disk-resident data. I will then briefly describe the intuition behind our algorithm using a toy example. I will also discuss the speedup we obtain and our efficiency as a disk-based algorithm. Finally, I will show two case studies where we applied our algorithm to real-world datasets and found interesting results.
Sequence Motif
- A repeated pattern in a sequence.
- A pattern can be approximately similar: mismatches are allowed.
- Occurrences of a pattern can overlap.

GACATAATAACCAGCTATCTGCTCGCATCGCCGCGACATAGCT

The term "motif" originated in bioinformatics and molecular biology, where it denotes a repeated sequence or pattern in the bases of DNA or the amino acids of a protein. A motif must occur significantly more frequently than expected at random; its occurrences may have a limited number of mismatches and may overlap slightly. The idea of a motif has been adopted in other domains to describe the repeated appearance of a particular pattern in a collection of objects: motion motifs in motion-capture data in computer graphics, structural motifs in biochemistry, and time series motifs in data mining. This presentation focuses on discovering time series motifs in a database of time series.

[Figures: a motion motif, a structural motif, and a time series motif]
Time Series Motif
- A repeated pattern in a time series.
- Exact motif: the most similar pair under Euclidean distance.
- Occurrences must be non-overlapping.

Euclidean distance (between normalized segments):
- Beats most similarity measures on large datasets.
- Supports early abandoning.
- Obeys the triangle inequality: d(P,Q) ≥ |d(P,R) - d(Q,R)|.

When we are given a time series and know nothing about its local or global structure, one of the first questions we should ask is whether there is any repetition, or motif, in the series. In general, defining a motif requires specifying two key features: the similarity and the support of the motif. For example, we could ask for a motif that occurs 100 times, where every pair of occurrences is significantly similar. For simplicity, we choose to search for the most similar pattern that occurs at least twice, so our definition of a time series motif is the most similar pair of time series in a database. We also forbid overlaps to avoid trivial matches, since successive values in a time series are not independent. To measure similarity we use Euclidean distance, the sum of squared errors when the two series are aligned. Euclidean distance outperforms almost all other similarity measures, especially on very large datasets of several million time series. It also lets us abandon a distance computation as soon as the cumulative sum becomes undesirably large. The final and most important property of Euclidean distance, the fundamental tool in our algorithm, is the triangle inequality: it gives a coarse lower bound on the distance between P and Q if we know the distances of P and Q from a reference point R (see the sketch below).

[Figure: a motif pair highlighted in a time series]
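To make the two pruning tools above concrete, here is a minimal Python sketch. The function names and structure are ours, not the paper's; it assumes the segments are already z-normalized numeric sequences.

```python
import numpy as np

def znorm(x):
    """Z-normalize a segment so comparisons are offset- and scale-invariant."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def early_abandon_dist(p, q, best_so_far):
    """Euclidean distance between two normalized segments, abandoning as
    soon as the partial sum of squared errors exceeds best_so_far**2."""
    bsf_sq = best_so_far ** 2
    acc = 0.0
    for a, b in zip(p, q):
        acc += (a - b) ** 2
        if acc > bsf_sq:
            return float('inf')          # this pair cannot beat the best
    return acc ** 0.5

def reference_lower_bound(d_pr, d_qr):
    """Triangle-inequality bound: d(P,Q) >= |d(P,R) - d(Q,R)| for any R."""
    return abs(d_pr - d_qr)
```

The early-abandoning test is cheap because it compares the running sum against the squared best-so-far, avoiding a square root inside the loop.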
Motif Discovery in Disk-Resident Datasets
- Large datasets: light curves of stars; performance counters of data centers.
- Pseudo time series datasets: "80 Million Tiny Images".
- Databases of normalized subsequences: an hour-long EEG trace generates over one million normalized subsequences.

Huge datasets with over a million time series, each hundreds of points long, are now very common. For example, astronomers have light curves for millions of stars, and data centers record millions of signals from performance counters on their servers. We can also build databases of pseudo time series by converting databases of images, videos, and so on. Even when we want to find motifs in a long time series that easily fits in main memory, we may need the disk to store all the normalized subsequences required for scale- and shift-invariant motif discovery. Therefore, motif discovery on disk-resident data can no longer be ignored.
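As a rough illustration of the last bullet, the sketch below (our own; the sampling rate in the comment is an assumption) shows how a single long trace expands into a large database of z-normalized windows.

```python
import numpy as np

def normalized_subsequences(series, m):
    """Slide a window of length m over a long series and z-normalize
    every window. This is what turns one long trace into a database
    of ~len(series) short time series."""
    n = len(series) - m + 1
    out = np.empty((n, m))
    for i in range(n):
        w = series[i:i + m]
        sd = w.std()
        out[i] = (w - w.mean()) / sd if sd > 0 else w - w.mean()
    return out

# A 1-hour trace sampled at ~280 Hz has ~1,000,000 samples, so it
# yields ~1,000,000 windows; stored as float64 that is roughly
# 8 * m * 1e6 bytes, which is why the windows live on disk.
```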
DAME
- A toy dataset: 24 time series of length 2, viewed as a set of 2D points.
- Geometric view and disk view: 8 disk blocks, each holding 3 time series.

Let's assume we have a toy dataset of 24 time series of length only 2, stored in 8 disk blocks of 3 time series each. In the geometric view, the most similar pair of time series is the closest pair of points. We also assume we have enough memory for only two disk blocks, that is, only two disk blocks can be in memory at a time.

[Figure: geometric view of the 24 points, and disk view of the 8 blocks]
Linear Representation
- Geometric view, projected view, and disk view.

We choose a reference point (the yellow one) and compute the distance from it to every data point. The data points are then sorted by this reference distance. The projection can be viewed as rotating all the points about the reference point until they fall on one line, which we call the order line. The advantage the order line gives is that the distance between two points on the order line is a lower bound on their true distance. Therefore, if two points are far apart on the order line, they are guaranteed to be far apart in the original space. The converse is not true: two distant points can appear very close on the order line. Note that, after sorting, the disk blocks contain different data points than before (see the sketch below).

[Figure: the linear representation in sorted order; 0 is the reference point]
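A minimal sketch of building the order line (our naming, not the paper's): compute every point's distance to the reference, sort, and repack the sorted points into fixed-size disk blocks.

```python
import numpy as np

def build_order_line(points, ref, block_size=3):
    """Sort points by distance to a reference point and repack them
    into blocks of block_size, mimicking the disk view on the slide."""
    dist = np.linalg.norm(points - ref, axis=1)   # reference distances
    order = np.argsort(dist)
    blocks = [order[i:i + block_size]             # point indices per block
              for i in range(0, len(order), block_size)]
    return order, dist[order], blocks

# For any two points, |dist_i - dist_j| on the order line never exceeds
# their true distance, so far-apart-on-the-line implies far-apart-in-space
# (but not the other way around).
```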
DAME: Divide the point set into two partitions and solve the subproblems.

We use a divide-and-conquer-style method to search over all possible pairs. Assume we divide the dataset into two equal parts that are disjoint on the order line. This is equivalent to drawing a hypersphere around the reference point and treating the points inside it as one partition and the points outside as the other, for example the green and red points in this figure. Assume also that a recursive routine has found the motif within each partition. Now we want to check whether any pair of points from different partitions is closer than the best motif found so far. Since each partition spans 4 blocks, we would have to search 4x4 = 16 block pairs, that is, 16*9 point pairs. Suppose the closest pairs found in the partitions are (3,17) for the green partition and (7,16) for the red partition. Since (3,17) is the smaller of the two, it is our current minimum, the best motif found so far.

[Figure: geometric, projected, and disk views of the two partitions, with the best pair of each partition marked]
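The recursion can be sketched as follows. This is a simplified in-memory version of the idea, not the paper's exact disk-aware pseudocode; the cross-partition step already uses the order-line lower bound to prune.

```python
import numpy as np

def find_motif(points, dline, lo, hi, bsf=np.inf, pair=None):
    """Divide-and-conquer skeleton over points sorted by dline, their
    order-line (reference) distance. Returns best distance and pair."""
    if hi - lo <= 3:                              # tiny partition: brute force
        for i in range(lo, hi):
            for j in range(i + 1, hi):
                d = np.linalg.norm(points[i] - points[j])
                if d < bsf:
                    bsf, pair = d, (i, j)
        return bsf, pair
    mid = (lo + hi) // 2
    bsf, pair = find_motif(points, dline, lo, mid, bsf, pair)
    bsf, pair = find_motif(points, dline, mid, hi, bsf, pair)
    # Cross-partition step: a left point can only pair with a right point
    # whose order-line gap is below the best-so-far distance.
    for i in range(lo, mid):
        for j in range(mid, hi):
            if dline[j] - dline[i] >= bsf:
                break                             # all later j are pruned too
            d = np.linalg.norm(points[i] - points[j])
            if d < bsf:
                bsf, pair = d, (i, j)
    return bsf, pair

# Usage on toy data: sort 24 random 2D points by distance to a reference.
pts = np.random.rand(24, 2)
d = np.linalg.norm(pts, axis=1)                   # reference at the origin
order = np.argsort(d)
best, pair = find_motif(pts[order], d[order], 0, 24)
```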
DAME: Blocks of Interest

Since we now have a best-so-far, we can restrict the search to the two rings shown: any pair with at least one point outside these rings has no chance of being closer than the best-so-far to a point from the other partition. The two rings correspond to the four blocks shown in the disk view; this works because the database is sorted by distance from the reference point. We therefore need to consider only 4 block pairs instead of 16. The next slide shows the comparisons made for these 4 block pairs.

[Figure: geometric, projected, and disk views; the inner ring is the region for blocks 5 and 6, the outer ring is the region for blocks 3 and 4]
DAME
- Block-pair (3,5): 1 comparison.
- Block-pair (3,6): blocks 3 and 6 do not overlap, so no comparison.
- Block-pair (4,5): 9 comparisons.
- Block-pair (4,6): 1 comparison.
- In total, 11 comparisons are made instead of 9*16 = 144.

Let's consider each block pair one by one, starting with (3,5). Our task is to compare pairs of points that come from different blocks. To do this, we take a window (bracket) of length equal to the best-so-far distance (bsf) and attach it, one at a time, to each green point already loaded in memory on the order line. If any red point falls inside the bracket attached to a green point, we must compare them (see the green brackets). The same argument applies to whole blocks; look at block-pair (3,6): if we attach the window to the end of a green block and find a red block overlapping the window, we must bring the two blocks into memory together and compare the points within them. All red points or blocks outside the window can safely be ignored; since blocks 3 and 6 do not overlap, block 6 is never loaded. For block-pair (4,5), every point is within bsf of every other, requiring 9 comparisons, and for (4,6) we again need only one comparison. After all this pruning, DAME compares only 11 pairs out of the 144 possible. Note that this example assumes no comparison improves the motif; finding a better motif would shrink the bracket and likely lead to even fewer comparisons (see the sketch below).

[Figure: the bsf bracket applied to each of the four block pairs on the order line]
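The block-level version of the bracket test can be sketched as follows. This is a hypothetical helper, not the paper's exact pseudocode: each block is summarized by its (min, max) span on the order line, and a pair of blocks is skipped when their spans are more than bsf apart.

```python
def block_pairs_to_load(green_blocks, red_blocks, bsf):
    """Return the cross-partition block pairs that must share memory,
    given each block's (min, max) order-line span."""
    pairs = []
    for g, (gmin, gmax) in enumerate(green_blocks):
        for r, (rmin, rmax) in enumerate(red_blocks):
            if rmin <= gmax + bsf and rmax >= gmin - bsf:
                pairs.append((g, r))      # spans overlap within bsf
    return pairs

# Toy spans loosely mirroring the slide: with a small bsf, one of the
# four pairs (the (3,6)-like one) is never loaded at all.
blocks_34 = [(0.0, 1.0), (1.0, 2.0)]      # blocks 3 and 4 on the order line
blocks_56 = [(1.2, 2.2), (2.2, 3.2)]      # blocks 5 and 6
print(block_pairs_to_load(blocks_34, blocks_56, bsf=0.5))
# -> [(0, 0), (1, 0), (1, 1)]
```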
Speedup

Algorithm                                          | Largest dataset tested (thousands) | Time for the largest dataset | Estimated time for 4.0 million
CompletelyInMemory                                 | 100   | 35 minutes | 37.8 days
CompletelyInDisk                                   | 200   | 1.50 days  | 1.65 years
NoAdditionalStorage (normalization done in memory) | 200   | 4.82 days  | 5.28 years
DAME                                               | 4,000 | 1.35 days  | 1.35 days (actual run)

Here we compare our algorithm to three trivial alternatives. The first assumes the data fits in main memory, so we can do a brute-force search without worrying about repeated disk accesses. The largest dataset we could try on our machine was 100 thousand random-walk time series of length 1,024; it finished in 35 minutes, which projects to 37.8 days of computation if we had enough memory to hold 4 million similar time series. The second assumes the available memory holds only two blocks; in this scenario, running a brute-force search without our pruning techniques would take over 1.6 years of I/O time. In comparison to these two extremes, DAME reduces the I/O and CPU time enough to find the motif in a dataset of 4 million time series in 1.35 days. We also ran a subsequence-motif experiment in which a long time series of 200 thousand points is kept in memory and the algorithm performs comparisons directly on it, without explicitly storing the normalized subsequences, and therefore repeatedly re-normalizes subsequences during comparisons. The largest dataset we tried this way, 200 thousand subsequences, took 4.82 days, which projects to over 5 years for 4 million. Note that we could have compared against existing algorithms for similarity joins, nearest-neighbor search on indexed data, or other pivot/reference-point-based methods; since the purpose of our algorithm is unique, there is little point in showing it beats algorithms designed for other tasks. For a detailed comparison, please contact Abdullah.
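The projections in the table are consistent with the quadratic scaling of brute-force search. As a sanity check (our arithmetic, not the authors' stated method):

$$ T_{4\mathrm{M}} \approx T_{100\mathrm{K}} \cdot \left(\frac{4{,}000{,}000}{100{,}000}\right)^{2} = 35\ \text{min} \times 1600 \approx 38.9\ \text{days}, $$

which is close to the 37.8 days reported for CompletelyInMemory; the small gap presumably reflects constant factors in the authors' own projection.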
Performance Evaluation

[Figure: total, CPU, and I/O time (seconds x10^3) in DAME_Motif vs. number of time series, for block counts of 1,000, 500, 34, 25, and 20]
[Figure: time (seconds x10^3) in DAME_Motif vs. motif length, from 200 to 1,200]

Here we show the CPU and I/O fractions of the total execution time for one million time series. We vary the size of the blocks, which changes the number of blocks. DAME performs better when there are fewer (larger) blocks; with a large number of small blocks, the I/O and CPU times are almost equal.
Case Study 1: Brain-Computer Interfaces

[Figure: target and non-target stimuli; Biosemi, Inc.]
Case Study 1: Brain-Computer Interfaces

[Figure: normalized IC activity over time (ms), after an ICA spatial filter, with the two segments of Motif 1 marked]
[Figure: IC 17, Motif 1 occurrences by epoch and latency (ms), before and after target presentation]
[Figure: distance to Motif 1 for target trials vs. non-target trials]
Case Study 2: Image Motifs
- "80 Million Tiny Images": 80 million images at 32x32 resolution.
- The concatenated color histogram of an image is treated as a pseudo time series.
- Each time series has length 256*3 = 768.

In this case study we built a dataset of pseudo time series from the "80 Million Tiny Images" collection of Torralba et al. To convert an image into a time series, we build a color histogram for each of the three color channels and concatenate them after normalization; the resulting time series have length 768 (see the sketch below).

80 Million Tiny Images: collected by Antonio Torralba, Rob Fergus, and William T. Freeman at MIT.

[Figure: a tiny image and its concatenated-histogram pseudo time series]
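A minimal sketch of the conversion, based on our reading of the slide; the exact normalization used in the paper may differ.

```python
import numpy as np

def image_to_pseudo_time_series(img):
    """Turn a 32x32 RGB image (uint8 array of shape (32, 32, 3)) into a
    pseudo time series: concatenate the three 256-bin channel histograms,
    then z-normalize the result."""
    hists = [np.bincount(img[:, :, c].ravel(), minlength=256)
             for c in range(3)]
    ts = np.concatenate(hists).astype(float)      # length 256*3 = 768
    return (ts - ts.mean()) / ts.std()
```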
Case Study 2: Image Motifs
- DAME processed the first 40 million time series in ~6.5 days.
- DAME found 3,836,902 images that have at least one duplicate (1,719,443 unique images).
- 542,603 images have near-duplicates at distance less than 0.1.

[Figure: example pairs of duplicate and near-duplicate images, labeled by their indices in the collection]

To find all the duplicates and near-duplicates, DAME took about 6.5 days, far less than the 8 months it took to collect this massive dataset. The differences between near-duplicates are meaningful in most cases: for example, different text over the same background, different curves over the same background, the same map with different countries marked in red, or the same dog picture with and without a spot. A natural application of identifying such near-duplicates is enforcing image copyrights.
Conclusion
- DAME: the first exact motif discovery algorithm that finds motifs in disk-resident data.
- DAME is scalable to massive datasets on the order of millions of time series.
- DAME successfully finds motifs in EEG traces and image databases.

In this presentation I have described the first disk-aware motif discovery algorithm, which we name DAME. DAME is scalable to massive datasets on the order of millions of time series, and I have shown cases where it finds meaningful and interesting motifs.
BACKUP
Example of a Multidimensional Motif: Motion Motif

Here we show an application of motif discovery to multidimensional time series, such as motion-capture data, in which sensors attached to a subject's body record the 3D positions of body parts; a typical setup has about 30 sensors, yielding a 30-dimensional time series. We took two Indian dance motions from the CMU Motion Capture Database and ran DAME to find a motif whose two occurrences come from the two different motions. In this figure, the dance floor is shown from above along with the trajectories of the dancers; at one point the dancers perform almost identical moves, shown in the two frames in the middle.

[Figure: top view of the dance floor and the trajectories of the dancers. Dance motions are taken from the CMU Motion Capture Database.]
Example of the Worst-Case Scenario

This shows the worst case for our algorithm: all the data points are almost equidistant from the reference point, collapsing the distribution of points on the order line into a single spike. In this situation no pruning can be done, since all the lower bounds are very close to zero.

[Figure: points nearly equidistant from the reference, and the resulting spike on the order line]
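A tiny demonstration of this failure mode (our own construction, not from the paper): points on a circle centered at the reference all project to the same spot on the order line, so every pairwise lower bound is ~0.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 18)
pts = np.c_[np.cos(angles), np.sin(angles)]   # all on the unit circle
ref = np.zeros(2)
d = np.linalg.norm(pts - ref, axis=1)         # every distance is 1.0
print(np.ptp(d))                              # spread on the order line ~ 0,
                                              # so |d_i - d_j| prunes nothing
```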
Multiple References for Ordering

Instead of rotating the points about a reference point, we could rotate all the points about an axis passing through two reference points. We would then have a plane instead of an order line. The lower bound for a given pair can be found by simple geometry (all the lengths of the solid lines are known; find the length of the dashed line), but this requires many floating-point additions and multiplications, compared to the single subtraction of the one-reference case. Moreover, the lower bounds found this way are not significantly tighter, and in some cases are less tight than those from a single reference point. Therefore, we choose simplicity.

[Figure: rotation about the axis through references r1 and r2, and the planar lower bound for points x and y]
[Figure: planar bounds vs. linear bounds vs. actual distances, showing the larger and smaller gaps between bounds and true distances]
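A sketch of the "simple geometry" (our naming and structure, not the paper's): with two references r1, r2 at distance L apart, a point's distances d1, d2 to them pin down its position on the rotation plane via the law of cosines, and distances on that plane lower-bound true distances.

```python
import numpy as np

def planar_coords(d1, d2, L):
    """Recover a point's coordinates on the rotation plane from its
    distances d1, d2 to the two references (L = |r1 - r2|)."""
    along = (d1**2 + L**2 - d2**2) / (2.0 * L)   # position along the axis
    perp = np.sqrt(max(d1**2 - along**2, 0.0))   # distance from the axis
    return along, perp

def planar_lower_bound(d1x, d2x, d1y, d2y, L):
    """Lower bound on d(x, y) from each point's two reference distances.
    Note the several multiplications here, versus the single subtraction
    of the one-reference bound |d1x - d1y|."""
    ax, px = planar_coords(d1x, d2x, L)
    ay, py = planar_coords(d1y, d2y, L)
    return np.hypot(ax - ay, px - py)
```

The bound is valid because rotating a point about the axis preserves its along/perpendicular coordinates, and the true squared distance exceeds the planar one by a nonnegative term depending on the angle between the two points around the axis.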