Simple Case Studies Using MASS


Dear Reader: This document offers some short, self-contained examples of how we can use MASS: Mueen's Algorithm for Similarity Search.

If you are new to MASS, we strongly recommend you spend ten minutes running these examples; it will help you build your skills and intuitions for what MASS can do for you.

Corrections and suggestions to eamonn@cs.ucr.edu

See also www.cs.ucr.edu/~eamonn/MatrixProfile.html and www.cs.unm.edu/~mueen/FastestSimilaritySearch.html

Have we ever seen a pattern that looks just like this?

The dataset comes from an accelerometer inside a Sony AIBO robot dog. The query comes from a period when the dog was walking on carpet; the test data we will search comes from a time the robot walked on cement (for 5000 data points), then carpet (for 3000 data points), then back onto cement.

[Figure: the query, and the robot dog data to be searched]

This task is trivial with MASS …

    >> load robot_dog.txt, load carpet_query.txt    % load the data
    >> dist = findNN(robot_dog, carpet_query);       % compute a distance profile
    >> [val, loc] = min(dist);                       % find location of match
    >> disp(['The best matching subsequence starts at ', num2str(loc)])
    The best matching subsequence starts at 7479

Below we plot the 16 best matches. Note that they all occur during the carpet walking period. This entire process takes about 1/1000th of a second.

    load robot_dog.txt, load carpet_query.txt
    demo_search(robot_dog, carpet_query);
    ---
    function demo_search(robot_dog, carpet_query)
    dist = findNN(robot_dog, carpet_query);    % compute a distance profile
    figure; hold on; plot(robot_dog)
    for i = 1:16                               % plot the top matches
        [val, loc] = min(dist);                % find the best remaining match
        idx = loc:loc+length(carpet_query)-1;  % samples covered by this match
        plot(idx, robot_dog(idx), 'color', [1 rand rand])
        % remove trivial matches before going to the top of the loop
        dist(max(1,loc-80):min(length(dist),loc+80)) = inf;
    end

[Figure: The best match, shown in context]
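To make the idea of a distance profile concrete, here is a minimal sketch in Python (standard library only, not the MATLAB used on the slides). It is a brute-force stand-in, not MASS itself: MASS computes the same kind of z-normalized Euclidean distance profile far faster using the FFT. The function names `distance_profile` and `top_matches` and the toy data are hypothetical, chosen only for illustration.

```python
import math

def znorm(x):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [0.0] * n if sd == 0 else [(v - mu) / sd for v in x]

def distance_profile(data, query):
    """Brute-force z-normalized Euclidean distance from query to every window of data."""
    m = len(query)
    q = znorm(query)
    profile = []
    for i in range(len(data) - m + 1):
        w = znorm(data[i:i + m])
        profile.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(q, w))))
    return profile

def top_matches(profile, k, exclusion):
    """Start indices of the k best matches, masking +/- exclusion around each hit
    so that trivially shifted copies of the same match are not reported again."""
    dist = list(profile)
    hits = []
    for _ in range(k):
        loc = min(range(len(dist)), key=dist.__getitem__)
        if math.isinf(dist[loc]):
            break  # everything left has been masked out
        hits.append(loc)
        for j in range(max(0, loc - exclusion), min(len(dist), loc + exclusion + 1)):
            dist[j] = math.inf
    return hits

# Toy data: the window starting at index 4 is a scaled, shifted copy of the query.
data = [5, 1, 4, 2, 8, 10, 12, 10, 8, 3, 7, 1]
query = [0, 1, 2, 1, 0]
profile = distance_profile(data, query)
hits = top_matches(profile, k=2, exclusion=2)
print("best match starts at index", hits[0])
```

Because both the query and each window are z-normalized, the match at index 4 has distance zero even though its offset and amplitude differ from the query. Note that Python indices are 0-based, whereas the MATLAB locations on the slides are 1-based.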

How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?

As shown elsewhere in this document, exact search is surprisingly fast under Euclidean Distance. However, let us suppose that you want to do even faster search, and you are willing to do an approximate search (but want a high quality answer).

A simple trick is to downsample both the data and the query by the same amount (below, keeping 1 point in 64) and search the downsampled data. If the data has low intrinsic dimensionality, this will typically give you very good results.

Let us build a long dataset, with 67,108,864 datapoints, and a long query, with 8,192 datapoints. A full exact search takes 12.4 seconds; an approximate search takes 0.24 seconds and produces (at least in this example) almost exactly the same answer. The answer is just slightly shifted in time. How well this will work for you depends on the intrinsic dimensionality of your data.

    rng('default')                    % set seed for reproducibility
    data  = cumsum(randn(1,2^26));    % make data
    query = cumsum(randn(1,2^13));    % make a query
    tic
    dist = MASS_V2(data, query);
    [val,loc] = min(dist);            % find best match location
    hold on
    plot(zscore(data(loc:loc+length(query))))
    plot(zscore(query),'r')
    title(['Exact best matching sequence is at ', num2str(loc)])
    disp(['Exact best matching sequence is at ', num2str(loc)])
    toc
    figure;
    tic
    downsampled_data  = data(1:64:end);     % create a downsampled version of the data
    downsampled_query = query(1:64:end);    % create a downsampled version of the query
    dist = MASS_V2(downsampled_data, downsampled_query);
    [val,loc] = min(dist);
    plot(zscore(data((loc*64):(loc*64)+length(query))))    % multiply the 'loc' by 64 to index correctly
    title(['Approx best matching sequence is at ', num2str(loc*64)])
    disp(['Approx best matching sequence is at ', num2str(loc*64)])
    toc

[Figure: the exact and the approximate best matches]

    Exact best matching sequence is at 61726727, Elapsed time is 12.40 seconds.
    Approx best matching sequence is at 61726784, Elapsed time is 0.240 seconds.
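The downsampling trick can be sketched at toy scale in Python (standard library only). This is an illustration under stated assumptions, not the experiment on the slide: the data is a small synthetic series (a smooth bump on a low-amplitude wiggle) rather than a 2^26-point random walk, the brute-force `distance_profile` stands in for MASS_V2, and all names are hypothetical.

```python
import math

def znorm(x):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [0.0] * n if sd == 0 else [(v - mu) / sd for v in x]

def distance_profile(data, query):
    """Brute-force z-normalized Euclidean distance profile (stand-in for MASS)."""
    m = len(query)
    q = znorm(query)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(data[i:i + m]))))
            for i in range(len(data) - m + 1)]

def argmin(values):
    return min(range(len(values)), key=values.__getitem__)

n, m, factor = 1200, 128, 4
# Smooth synthetic data: one bump centered at sample 603 on top of a small wiggle.
data = [0.1 * math.sin(i / 5) + math.exp(-((i - 603) / 20) ** 2) for i in range(n)]
# The query is a clean bump centered in a window of length m.
query = [math.exp(-((j - 64) / 20) ** 2) for j in range(m)]

# Exact search on the full-resolution series.
exact_loc = argmin(distance_profile(data, query))

# Approximate search: keep 1 point in `factor` from both data and query.
coarse_loc = argmin(distance_profile(data[::factor], query[::factor]))
approx_loc = coarse_loc * factor  # map back to full resolution (0-based indexing)

print("exact:", exact_loc, "approx:", approx_loc)
```

The approximate search touches roughly `factor` squared fewer point pairs in this brute-force version, and its answer can only land on a multiple of `factor`, which is why it may be slightly shifted relative to the exact one. With 0-based indices the mapping back is `coarse_loc * factor`; in 1-based MATLAB the exactly corresponding sample would be `(loc-1)*64+1`, which differs from `loc*64` by less than one downsampling step.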

How can I optimize similarity search in a long time series?

See also “How do I quickly search this long dataset for this pattern, if an approximate search is acceptable?”

Suppose you want to find a query inside a long time series, say of length 67,000,000.

First trick: MASS (and several other FFT and DWT ideas) have their best case when the data length is a power of two, so pad the data to make it a power of two (padding with zeros works fine).

Second trick: MASS V3 is a piecewise version of MASS that performs better when the size of the pieces is well aligned with the hardware. You need to tune a single parameter, but the parameter can only be a power of two, so you can search over, say, 2^10 to 2^20. Once you find a good value, you can hardcode it for your machine.

    rng('default')                        % set seed for reproducibility
    data  = cumsum(randn(1,67000000));    % make data
    query = cumsum(randn(1,2^13));        % make a long query
    tic
    dist = MASS_V2(data, query);
    [val,loc] = min(dist);                % find best match location
    hold on
    plot(zscore(data(loc:loc+length(query))))
    plot(zscore(query),'r')
    disp(['Best matching sequence is at ', num2str(loc)])
    toc
    figure
    data = [data zeros(1,2^nextpow2(67000000)-67000000)];    % pad data to next pow of 2
    tic
    dist = MASS_V2(data, query);          % rerun the search on the padded data
    [val,loc] = min(dist);                % find best match location
    disp(['After padding: Best matching sequence is at ', num2str(loc)])
    toc
    tic
    dist = MASS_V3(data, query, 2^16);
    [val,loc] = min(dist);                % find best match location
    disp(['MASS V3 & padding: Best matching sequence is at ', num2str(loc)])
    toc

If you run this code, it will output…

    Best matching sequence is at 32463217                       (naive)
    Elapsed time is 14.30 seconds.
    After padding: Best matching sequence is at 32463217        (trick 1)
    Elapsed time is 12.31 seconds.
    MASS V3 & padding: Best matching sequence is at 32463217    (tricks 1 & 2)
    Elapsed time is 5.82 seconds.

Note that it outputs the exact same answer, regardless of the optimizations, but it is fast, then faster, then super fast.
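The arithmetic behind the two tricks can be sketched in Python (standard library only). The helpers `next_pow2`, `pad_to_pow2` and `pick_fastest` are hypothetical names introduced for illustration; `pick_fastest` simply times a caller-supplied function for each candidate piece size and makes no assumption about how MASS V3 is implemented.

```python
import time

def next_pow2(n):
    """Smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()

def pad_to_pow2(series, fill=0.0):
    """Return a copy of series padded with `fill` up to the next power-of-two length."""
    return list(series) + [fill] * (next_pow2(len(series)) - len(series))

def pick_fastest(candidates, run):
    """Time run(c) once for each candidate c; return (fastest candidate, all timings)."""
    timings = {}
    for c in candidates:
        start = time.perf_counter()
        run(c)
        timings[c] = time.perf_counter() - start
    return min(timings, key=timings.get), timings

# Trick 1: a series of length 67,000,000 is padded up to 2^26.
n = 67_000_000
padded_length = next_pow2(n)
pad = padded_length - n
print(padded_length, pad)

# Trick 2: the piece size can only be a power of two, so the search space is tiny.
candidates = [2 ** k for k in range(10, 21)]
# `run` would call the piecewise search with piece size c; a dummy workload is used here.
best, timings = pick_fastest(candidates, run=lambda c: sum(range(c // 64)))
```

Since there are only eleven candidate piece sizes between 2^10 and 2^20, an exhaustive timing run like this is cheap, and the winner can then be hardcoded for a given machine.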

Have we ever seen a multidimensional pattern that looks just like this?

[Figure: the MagX and Pressure time series]

I have 262,144 data points that record a penguin’s orientation (MagX) and the water/air pressure as he hunts for fish.

Question: Does he ever change his bearing leftwards as he reaches the apex of his dive?

This is easy to describe as a multidimensional search. The apex of a dive is just an approximately parabolic shape. I can create this with

    query_pressure = zscore([[-500:500].^2]*-1)';

[Figure: the pressure query]

I can create bearing leftwards with a straight rising line, like this

    query_MagX = zscore([[-500:500]])';

[Figure: the MagX query]

We have seen elsewhere in this document how to search for a 1D pattern. For this 2D case, all we have to do is add the two distance profiles together before we find the minimum value. Note that the best match location in 2D is different to either of the 1D queries.

We can do this for 3D or 4D… However, there are some caveats. In brief, it almost never makes sense to do multidimensional time series search in more than 3 or 4D. See Matrix Profile VI: Michael Yeh ICDM 2017 and “Weighting” Bing Hu, ICDM 2013. In addition, in some cases we may want to weight the dimensions differently, even though both are compared under z-normalized Euclidean Distance.

    load penguintest.mat
    figure
    subplot(411); hold on;
    plot(penguintest)
    title(['Penguin: Pressure and MagX'])
    subplot(412); hold on;
    query_pressure = zscore([[-500:500].^2]*-1)';
    dist_p = MASS_V2(penguintest(:,1),query_pressure);
    [val,loc] = min(dist_p);          % find best match location
    plot(zscore(penguintest(loc:loc+length(query_pressure),1)))
    plot(zscore(query_pressure),'g')
    title(['Best matching sequence, pressure only, is at ', num2str(loc)])
    subplot(413); hold on;
    query_MagX = zscore([[-500:500]])';
    dist_m = MASS_V2(penguintest(:,2),query_MagX);
    [val,loc] = min(dist_m);          % find best match location
    plot(zscore(penguintest(loc:loc+length(query_MagX),2)),'color',[0.8500 0.3250 0.0980])
    plot(zscore(query_MagX),'m')
    title(['Best matching sequence, MagX only, is at ', num2str(loc)])
    subplot(414); hold on;
    [val,loc] = min([dist_m + dist_p]);    % find best match location
    title(['Best matching sequence, pressure/MagX, is at ', num2str(loc)])

The 2D search on its own:

    load penguintest.mat
    figure; hold on;
    query_pressure = zscore([[-500:500].^2]*-1)';
    dist_p = MASS_V2(penguintest(:,1),query_pressure);
    query_MagX = zscore([[-500:500]])';
    dist_m = MASS_V2(penguintest(:,2),query_MagX);
    [val,loc] = min([dist_m + dist_p]);    % find best match location in 2D
    plot(zscore(penguintest(loc:loc+length(query_MagX),2)),'color',[0.85 0.32 0.09])
    plot(zscore(query_MagX),'m')
    plot(zscore(penguintest(loc:loc+length(query_pressure),1)),'b')
    plot(zscore(query_pressure),'g')
    title(['Best matching sequence, pressure/MagX, is at ', num2str(loc)])

[Figure: Best match in 2D space, showing Orientation (MagX) and Pressure]
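Adding the per-dimension distance profiles can be illustrated with a small Python sketch (standard library only). Everything here is an assumption made for illustration: the two series are synthetic stand-ins for the penguin data, the brute-force `distance_profile` stands in for MASS_V2, and the optional `weights` argument is one hypothetical way to weight the dimensions differently.

```python
import math

def znorm(x):
    """Z-normalize a sequence; a constant sequence maps to all zeros."""
    n = len(x)
    mu = sum(x) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return [0.0] * n if sd == 0 else [(v - mu) / sd for v in x]

def distance_profile(data, query):
    """Brute-force z-normalized Euclidean distance profile (stand-in for MASS)."""
    m = len(query)
    q = znorm(query)
    return [math.sqrt(sum((a - b) ** 2 for a, b in zip(q, znorm(data[i:i + m]))))
            for i in range(len(data) - m + 1)]

def argmin(values):
    return min(range(len(values)), key=values.__getitem__)

def combine(profiles, weights=None):
    """Weighted element-wise sum of equal-length distance profiles."""
    weights = weights or [1.0] * len(profiles)
    return [sum(w * p[i] for w, p in zip(weights, profiles))
            for i in range(len(profiles[0]))]

m, n = 21, 200
query_pressure = [-(j - 10) ** 2 for j in range(m)]  # inverted parabola: apex of a dive
query_magx = [float(j) for j in range(m)]            # straight rising line

# Synthetic two-channel data on a zigzag background.
pressure = [3.0 * (-1) ** i for i in range(n)]
magx = list(pressure)
for j in range(m):
    wobble = 0.5 * (-1) ** j
    pressure[40 + j] = query_pressure[j] / 10            # clean apex, no turn at the same time
    magx[160 + j] = query_magx[j]                        # clean turn, no dive apex
    pressure[120 + j] = query_pressure[j] / 10 + wobble  # slightly noisy apex ...
    magx[120 + j] = query_magx[j] + wobble               # ... together with a slightly noisy turn

dist_p = distance_profile(pressure, query_pressure)
dist_m = distance_profile(magx, query_magx)
loc_pressure = argmin(dist_p)            # best match using pressure only
loc_magx = argmin(dist_m)                # best match using MagX only
loc_2d = argmin(combine([dist_p, dist_m]))
print(loc_pressure, loc_magx, loc_2d)
```

In this toy series each 1D search locks onto the cleanest instance of its own shape, while the summed profile finds the only place where both shapes occur together, which mirrors the observation that the 2D best match differs from either 1D best match. Both queries must have the same length so that the two profiles line up index by index.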