Finding Time Series Motifs on Disk-Resident Data

Abdullah Mueen, Dr. Eamonn Keogh (UC Riverside)
Nima Bigdely-Shamlo (Swartz Center for Computational Neuroscience, UCSD)
Outline
- Motivation: Time Series Motifs
- DAME: Disk-Aware Motif Enumeration
- Performance Evaluation: Speedup and Efficiency
- Case Studies: Motifs in Brain-Computer Interfaces; Motifs in an Image Database
- Conclusion

I will first present the problem of time series motif discovery on disk-resident data. I will then briefly describe the intuition behind our algorithm using a toy example. I will also discuss the speedup we obtain and our efficiency as a disk-based algorithm. Finally, I will show two case studies where we applied our algorithm to real-world datasets and found interesting results.
Sequence Motif
- A repeated pattern in a sequence.
- A pattern can be approximately similar: mismatches are allowed.
- Occurrences of a pattern can overlap.

GACATAATAACCAGCTATCTGCTCGCATCGCCGCGACATAGCT

The term "motif" originated in bioinformatics and molecular biology, where it denotes a repeated sequence or pattern in the bases of DNA or the amino acids of a protein. A motif must occur significantly more frequently than expected at random; its occurrences may have a limited number of mismatches and may overlap slightly. The idea of a motif has been adopted in other domains to describe the repeated appearance of a particular pattern in a collection of objects: motion motifs in motion-capture data in computer graphics, structural motifs in biochemistry, and time series motifs in data mining. This presentation focuses on discovering time series motifs in a database of time series.

[Figures: a motion motif, a structural motif, and a time series motif]
Time Series Motif
- A repeated pattern in a time series.
- Exact motif: the most similar pair under Euclidean distance.
- Occurrences must be non-overlapping.

Euclidean distance (between normalized segments):
- Beats most similarity measures on large datasets.
- Supports early abandoning.
- Obeys the triangle inequality: d(P,Q) ≥ |d(P,R) - d(Q,R)|.

When we are given a time series and know nothing about its local or global structure, one of the first questions we should ask is whether there is any repetition, or motif, in the series. In general, defining a motif requires specifying two key features: the similarity and the support of the motif. For example, we could ask for a motif that occurs 100 times, where every pair of occurrences is significantly similar. For simplicity, we choose to search for the most similar pattern that occurs at least twice, so our definition of a time series motif is the most similar pair of time series in a database. We also forbid overlaps to avoid trivial matches, since successive values in a time series are not independent. To measure similarity we use Euclidean distance, the sum of squared errors when the two series are aligned. Euclidean distance outperforms almost all other similarity measures, especially on very large datasets of several million time series. It also lets us abandon a distance computation as soon as the cumulative sum becomes undesirably large. The final and most important property of Euclidean distance, the fundamental tool in our algorithm, is the triangle inequality: it gives a coarse lower bound on the distance between P and Q if we know the distances of P and Q from a reference point R (see the sketch below).

[Figure: a motif pair highlighted in a time series]
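To make the two pruning tools above concrete, here is a minimal Python sketch. The function names and structure are ours, not the paper's; it assumes the segments are already z-normalized numeric sequences.

```python
import numpy as np

def znorm(x):
    """Z-normalize a segment so comparisons are offset- and scale-invariant."""
    x = np.asarray(x, dtype=float)
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def early_abandon_dist(p, q, best_so_far):
    """Euclidean distance between two normalized segments, abandoning as
    soon as the partial sum of squared errors exceeds best_so_far**2."""
    bsf_sq = best_so_far ** 2
    acc = 0.0
    for a, b in zip(p, q):
        acc += (a - b) ** 2
        if acc > bsf_sq:
            return float('inf')          # this pair cannot beat the best
    return acc ** 0.5

def reference_lower_bound(d_pr, d_qr):
    """Triangle-inequality bound: d(P,Q) >= |d(P,R) - d(Q,R)| for any R."""
    return abs(d_pr - d_qr)
```

The early-abandoning test is cheap because it compares the running sum against the squared best-so-far, avoiding a square root inside the loop.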
Motif Discovery in Disk-Resident Datasets
- Large datasets: light curves of stars; performance counters of data centers.
- Pseudo time series datasets: "80 Million Tiny Images".
- Databases of normalized subsequences: an hour-long EEG trace generates over one million normalized subsequences.

Huge datasets with over a million time series, each hundreds of points long, are now very common. For example, astronomers have light curves for millions of stars, and data centers record millions of signals from performance counters on their servers. We can also build databases of pseudo time series by converting databases of images, videos, and so on. Even when we want to find motifs in a long time series that easily fits in main memory, we may need the disk to store all the normalized subsequences required for scale- and shift-invariant motif discovery. Therefore, motif discovery on disk-resident data can no longer be ignored.
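As a rough illustration of the last bullet, the sketch below (our own; the sampling rate in the comment is an assumption) shows how a single long trace expands into a large database of z-normalized windows.

```python
import numpy as np

def normalized_subsequences(series, m):
    """Slide a window of length m over a long series and z-normalize
    every window. This is what turns one long trace into a database
    of ~len(series) short time series."""
    n = len(series) - m + 1
    out = np.empty((n, m))
    for i in range(n):
        w = series[i:i + m]
        sd = w.std()
        out[i] = (w - w.mean()) / sd if sd > 0 else w - w.mean()
    return out

# A 1-hour trace sampled at ~280 Hz has ~1,000,000 samples, so it
# yields ~1,000,000 windows; stored as float64 that is roughly
# 8 * m * 1e6 bytes, which is why the windows live on disk.
```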
DAME
- A toy dataset: 24 time series of length 2, viewed as a set of 2D points.
- Geometric view and disk view: 8 disk blocks, each holding 3 time series.

Let's assume we have a toy dataset of 24 time series of length only 2, stored in 8 disk blocks of 3 time series each. In the geometric view, the most similar pair of time series is the closest pair of points. We also assume we have enough memory for only two disk blocks, that is, only two disk blocks can be in memory at a time.

[Figure: geometric view of the 24 points, and disk view of the 8 blocks]
Linear Representation
- Geometric view, projected view, and disk view.

We choose a reference point (the yellow one) and compute the distance from it to every data point. The data points are then sorted by this reference distance. The projection can be viewed as rotating all the points about the reference point until they fall on one line, which we call the order line. The advantage the order line gives is that the distance between two points on the order line is a lower bound on their true distance. Therefore, if two points are far apart on the order line, they are guaranteed to be far apart in the original space. The converse is not true: two distant points can appear very close on the order line. Note that, after sorting, the disk blocks contain different data points than before (see the sketch below).

[Figure: the linear representation in sorted order; 0 is the reference point]
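A minimal sketch of building the order line (our naming, not the paper's): compute every point's distance to the reference, sort, and repack the sorted points into fixed-size disk blocks.

```python
import numpy as np

def build_order_line(points, ref, block_size=3):
    """Sort points by distance to a reference point and repack them
    into blocks of block_size, mimicking the disk view on the slide."""
    dist = np.linalg.norm(points - ref, axis=1)   # reference distances
    order = np.argsort(dist)
    blocks = [order[i:i + block_size]             # point indices per block
              for i in range(0, len(order), block_size)]
    return order, dist[order], blocks

# For any two points, |dist_i - dist_j| on the order line never exceeds
# their true distance, so far-apart-on-the-line implies far-apart-in-space
# (but not the other way around).
```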
DAME: Divide the point set into two partitions and solve the subproblems.

We use a divide-and-conquer-style method to search over all possible pairs. Assume we divide the dataset into two equal parts that are disjoint on the order line. This is equivalent to drawing a hypersphere around the reference point and treating the points inside it as one partition and the points outside as the other, for example the green and red points in this figure. Assume also that a recursive routine has found the motif within each partition. Now we want to check whether any pair of points from different partitions is closer than the best motif found so far. Since each partition spans 4 blocks, we would have to search 4x4 = 16 block pairs, that is, 16*9 point pairs. Suppose the closest pairs found in the partitions are (3,17) for the green partition and (7,16) for the red partition. Since (3,17) is the smaller of the two, it is our current minimum, the best motif found so far.

[Figure: geometric, projected, and disk views of the two partitions, with the best pair of each partition marked]
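The recursion can be sketched as follows. This is a simplified in-memory version of the idea, not the paper's exact disk-aware pseudocode; the cross-partition step already uses the order-line lower bound to prune.

```python
import numpy as np

def find_motif(points, dline, lo, hi, bsf=np.inf, pair=None):
    """Divide-and-conquer skeleton over points sorted by dline, their
    order-line (reference) distance. Returns best distance and pair."""
    if hi - lo <= 3:                              # tiny partition: brute force
        for i in range(lo, hi):
            for j in range(i + 1, hi):
                d = np.linalg.norm(points[i] - points[j])
                if d < bsf:
                    bsf, pair = d, (i, j)
        return bsf, pair
    mid = (lo + hi) // 2
    bsf, pair = find_motif(points, dline, lo, mid, bsf, pair)
    bsf, pair = find_motif(points, dline, mid, hi, bsf, pair)
    # Cross-partition step: a left point can only pair with a right point
    # whose order-line gap is below the best-so-far distance.
    for i in range(lo, mid):
        for j in range(mid, hi):
            if dline[j] - dline[i] >= bsf:
                break                             # all later j are pruned too
            d = np.linalg.norm(points[i] - points[j])
            if d < bsf:
                bsf, pair = d, (i, j)
    return bsf, pair

# Usage on toy data: sort 24 random 2D points by distance to a reference.
pts = np.random.rand(24, 2)
d = np.linalg.norm(pts, axis=1)                   # reference at the origin
order = np.argsort(d)
best, pair = find_motif(pts[order], d[order], 0, 24)
```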
DAME: Blocks of Interest

Since we now have a best-so-far, we can restrict the search to the two rings shown: any pair with at least one point outside these rings has no chance of being closer than the best-so-far to a point from the other partition. The two rings correspond to the four blocks shown in the disk view; this works because the database is sorted by distance from the reference point. We therefore need to consider only 4 block pairs instead of 16. The next slide shows the comparisons made for these 4 block pairs.

[Figure: geometric, projected, and disk views; the inner ring is the region for blocks 5 and 6, the outer ring is the region for blocks 3 and 4]
DAME
- Block-pair (3,5): 1 comparison.
- Block-pair (3,6): blocks 3 and 6 do not overlap, so no comparison.
- Block-pair (4,5): 9 comparisons.
- Block-pair (4,6): 1 comparison.
- In total, 11 comparisons are made instead of 9*16 = 144.

Let's consider each block pair one by one, starting with (3,5). Our task is to compare pairs of points that come from different blocks. To do this, we take a window (bracket) of length equal to the best-so-far distance (bsf) and attach it, one at a time, to each green point already loaded in memory on the order line. If any red point falls inside the bracket attached to a green point, we must compare them (see the green brackets). The same argument applies to whole blocks; look at block-pair (3,6): if we attach the window to the end of a green block and find a red block overlapping the window, we must bring the two blocks into memory together and compare the points within them. All red points or blocks outside the window can safely be ignored; since blocks 3 and 6 do not overlap, block 6 is never loaded. For block-pair (4,5), every point is within bsf of every other, requiring 9 comparisons, and for (4,6) we again need only one comparison. After all this pruning, DAME compares only 11 pairs out of the 144 possible. Note that this example assumes no comparison improves the motif; finding a better motif would shrink the bracket and likely lead to even fewer comparisons (see the sketch below).

[Figure: the bsf bracket applied to each of the four block pairs on the order line]
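The block-level version of the bracket test can be sketched as follows. This is a hypothetical helper, not the paper's exact pseudocode: each block is summarized by its (min, max) span on the order line, and a pair of blocks is skipped when their spans are more than bsf apart.

```python
def block_pairs_to_load(green_blocks, red_blocks, bsf):
    """Return the cross-partition block pairs that must share memory,
    given each block's (min, max) order-line span."""
    pairs = []
    for g, (gmin, gmax) in enumerate(green_blocks):
        for r, (rmin, rmax) in enumerate(red_blocks):
            if rmin <= gmax + bsf and rmax >= gmin - bsf:
                pairs.append((g, r))      # spans overlap within bsf
    return pairs

# Toy spans loosely mirroring the slide: with a small bsf, one of the
# four pairs (the (3,6)-like one) is never loaded at all.
blocks_34 = [(0.0, 1.0), (1.0, 2.0)]      # blocks 3 and 4 on the order line
blocks_56 = [(1.2, 2.2), (2.2, 3.2)]      # blocks 5 and 6
print(block_pairs_to_load(blocks_34, blocks_56, bsf=0.5))
# -> [(0, 0), (1, 0), (1, 1)]
```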
Speedup

Algorithm                                          | Largest dataset tested (thousands) | Time for the largest dataset | Estimated time for 4.0 million
CompletelyInMemory                                 | 100   | 35 minutes | 37.8 days
CompletelyInDisk                                   | 200   | 1.50 days  | 1.65 years
NoAdditionalStorage (normalization done in memory) | 200   | 4.82 days  | 5.28 years
DAME                                               | 4,000 | 1.35 days  | 1.35 days (actual run)

Here we compare our algorithm to three trivial alternatives. The first assumes the data fits in main memory, so we can do a brute-force search without worrying about repeated disk accesses. The largest dataset we could try on our machine was 100 thousand random-walk time series of length 1,024; it finished in 35 minutes, which projects to 37.8 days of computation if we had enough memory to hold 4 million similar time series. The second assumes the available memory holds only two blocks; in this scenario, running a brute-force search without our pruning techniques would take over 1.6 years of I/O time. In comparison to these two extremes, DAME reduces the I/O and CPU time enough to find the motif in a dataset of 4 million time series in 1.35 days. We also ran a subsequence-motif experiment in which a long time series of 200 thousand points is kept in memory and the algorithm performs comparisons directly on it, without explicitly storing the normalized subsequences, and therefore repeatedly re-normalizes subsequences during comparisons. The largest dataset we tried this way, 200 thousand subsequences, took 4.82 days, which projects to over 5 years for 4 million. Note that we could have compared against existing algorithms for similarity joins, nearest-neighbor search on indexed data, or other pivot/reference-point-based methods; since the purpose of our algorithm is unique, there is little point in showing it beats algorithms designed for other tasks. For a detailed comparison, please contact Abdullah.
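The projections in the table are consistent with the quadratic scaling of brute-force search. As a sanity check (our arithmetic, not the authors' stated method):

$$ T_{4\mathrm{M}} \approx T_{100\mathrm{K}} \cdot \left(\frac{4{,}000{,}000}{100{,}000}\right)^{2} = 35\ \text{min} \times 1600 \approx 38.9\ \text{days}, $$

which is close to the 37.8 days reported for CompletelyInMemory; the small gap presumably reflects constant factors in the authors' own projection.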
Performance Evaluation

[Figure: total, CPU, and I/O time (seconds x10^3) in DAME_Motif vs. number of time series, for block counts of 1,000, 500, 34, 25, and 20]
[Figure: time (seconds x10^3) in DAME_Motif vs. motif length, from 200 to 1,200]

Here we show the CPU and I/O fractions of the total execution time for one million time series. We vary the size of the blocks, which changes the number of blocks. DAME performs better when there are fewer (larger) blocks; with a large number of small blocks, the I/O and CPU times are almost equal.
Case Study 1: Brain-Computer Interfaces

[Figure: target and non-target stimuli; Biosemi, Inc.]
Case Study 1: Brain-Computer Interfaces

[Figure: normalized IC activity over time (ms), after an ICA spatial filter, with the two segments of Motif 1 marked]
[Figure: IC 17, Motif 1 occurrences by epoch and latency (ms), before and after target presentation]
[Figure: distance to Motif 1 for target trials vs. non-target trials]
Case Study 2: Image Motifs
- "80 Million Tiny Images": 80 million images at 32x32 resolution.
- The concatenated color histogram of an image is treated as a pseudo time series.
- Each time series has length 256*3 = 768.

In this case study we built a dataset of pseudo time series from the "80 Million Tiny Images" collection of Torralba et al. To convert an image into a time series, we build a color histogram for each of the three color channels and concatenate them after normalization; the resulting time series have length 768 (see the sketch below).

80 Million Tiny Images: collected by Antonio Torralba, Rob Fergus, and William T. Freeman at MIT.

[Figure: a tiny image and its concatenated-histogram pseudo time series]
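A minimal sketch of the conversion, based on our reading of the slide; the exact normalization used in the paper may differ.

```python
import numpy as np

def image_to_pseudo_time_series(img):
    """Turn a 32x32 RGB image (uint8 array of shape (32, 32, 3)) into a
    pseudo time series: concatenate the three 256-bin channel histograms,
    then z-normalize the result."""
    hists = [np.bincount(img[:, :, c].ravel(), minlength=256)
             for c in range(3)]
    ts = np.concatenate(hists).astype(float)      # length 256*3 = 768
    return (ts - ts.mean()) / ts.std()
```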
Case Study 2: Image Motifs
- DAME processed the first 40 million time series in ~6.5 days.
- DAME found 3,836,902 images that have at least one duplicate (1,719,443 unique images).
- 542,603 images have near-duplicates at distance less than 0.1.

[Figure: example pairs of duplicate and near-duplicate images, labeled by their indices in the collection]

To find all the duplicates and near-duplicates, DAME took about 6.5 days, far less than the 8 months it took to collect this massive dataset. The differences between near-duplicates are meaningful in most cases: for example, different text over the same background, different curves over the same background, the same map with different countries marked in red, or the same dog picture with and without a spot. A natural application of identifying such near-duplicates is enforcing image copyrights.
Conclusion
- DAME: the first exact motif discovery algorithm that finds motifs in disk-resident data.
- DAME is scalable to massive datasets on the order of millions of time series.
- DAME successfully finds motifs in EEG traces and image databases.

In this presentation I have described the first disk-aware motif discovery algorithm, which we name DAME. DAME is scalable to massive datasets on the order of millions of time series, and I have shown cases where it finds meaningful and interesting motifs.
BACKUP
Example of a Multidimensional Motif: Motion Motif

Here we show an application of motif discovery to multidimensional time series, such as motion-capture data, in which sensors attached to a subject's body record the 3D positions of body parts; a typical setup has about 30 sensors, yielding a 30-dimensional time series. We took two Indian dance motions from the CMU Motion Capture Database and ran DAME to find a motif whose two occurrences come from the two different motions. In this figure, the dance floor is shown from above along with the trajectories of the dancers; at one point the dancers perform almost identical moves, shown in the two frames in the middle.

[Figure: top view of the dance floor and the trajectories of the dancers. Dance motions are taken from the CMU Motion Capture Database.]
Example of the Worst-Case Scenario

This shows the worst case for our algorithm: all the data points are almost equidistant from the reference point, collapsing the distribution of points on the order line into a single spike. In this situation no pruning can be done, since all the lower bounds are very close to zero.

[Figure: points nearly equidistant from the reference, and the resulting spike on the order line]
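A tiny demonstration of this failure mode (our own construction, not from the paper): points on a circle centered at the reference all project to the same spot on the order line, so every pairwise lower bound is ~0.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 18)
pts = np.c_[np.cos(angles), np.sin(angles)]   # all on the unit circle
ref = np.zeros(2)
d = np.linalg.norm(pts - ref, axis=1)         # every distance is 1.0
print(np.ptp(d))                              # spread on the order line ~ 0,
                                              # so |d_i - d_j| prunes nothing
```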
Multiple References for Ordering

Instead of rotating the points about a reference point, we could rotate all the points about an axis passing through two reference points. We would then have a plane instead of an order line. The lower bound for a given pair can be found by simple geometry (all the lengths of the solid lines are known; find the length of the dashed line), but this requires many floating-point additions and multiplications, compared to the single subtraction of the one-reference case. Moreover, the lower bounds found this way are not significantly tighter, and in some cases are less tight than those from a single reference point. Therefore, we choose simplicity.

[Figure: rotation about the axis through references r1 and r2, and the planar lower bound for points x and y]
[Figure: planar bounds vs. linear bounds vs. actual distances, showing the larger and smaller gaps between bounds and true distances]
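A sketch of the "simple geometry" (our naming and structure, not the paper's): with two references r1, r2 at distance L apart, a point's distances d1, d2 to them pin down its position on the rotation plane via the law of cosines, and distances on that plane lower-bound true distances.

```python
import numpy as np

def planar_coords(d1, d2, L):
    """Recover a point's coordinates on the rotation plane from its
    distances d1, d2 to the two references (L = |r1 - r2|)."""
    along = (d1**2 + L**2 - d2**2) / (2.0 * L)   # position along the axis
    perp = np.sqrt(max(d1**2 - along**2, 0.0))   # distance from the axis
    return along, perp

def planar_lower_bound(d1x, d2x, d1y, d2y, L):
    """Lower bound on d(x, y) from each point's two reference distances.
    Note the several multiplications here, versus the single subtraction
    of the one-reference bound |d1x - d1y|."""
    ax, px = planar_coords(d1x, d2x, L)
    ay, py = planar_coords(d1y, d2y, L)
    return np.hypot(ax - ay, px - py)
```

The bound is valid because rotating a point about the axis preserves its along/perpendicular coordinates, and the true squared distance exceeds the planar one by a nonnegative term depending on the angle between the two points around the axis.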