Distance Functions for Sequence Data and Time Series

Slides:



Advertisements
Similar presentations
Indexing Time Series Based on original slides by Prof. Dimitrios Gunopulos and Prof. Christos Faloutsos with some slides from tutorials by Prof. Eamonn.
Advertisements

Longest Common Subsequence
Embedding the Ulam metric into ℓ 1 (Ενκρεβάτωση του μετρικού χώρου Ulam στον ℓ 1 ) Για το μάθημα “Advanced Data Structures” Αντώνης Αχιλλέως.
Dynamic Programming Nithya Tarek. Dynamic Programming Dynamic programming solves problems by combining the solutions to sub problems. Paradigms: Divide.
Overview What is Dynamic Programming? A Sequence of 4 Steps
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
Mining Time Series.
Indexing Time Series. Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time.
Indexing Time Series Based on Slides by C. Faloutsos (CMU) and D. Gunopulos (UCR)
Dynamic Programming Solving Optimization Problems.
Inexact Matching General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic programming.
Distance Functions for Sequence Data and Time Series
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Based on Slides by D. Gunopulos (UCR)
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
1 Dynamic Programming Jose Rolim University of Geneva.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Longest Common Subsequence
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
Indexing Time Series.
Time Series I.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Comp. Genomics Recitation 2 12/3/09 Slides by Igor Ulitsky.
Dynamic Time Warping Algorithm for Gene Expression Time Series
Mining Time Series.
ACM-HK Local Contest 2009 Special trainings: 6:30 - HW311 May 27, 2010 (Thur) June 3, 2010 (Thur) June 10, 2010 (Thur) Competition date: June 12,
1 Chapter 6 Dynamic Programming. 2 Algorithmic Paradigms Greedy. Build up a solution incrementally, optimizing some local criterion. Divide-and-conquer.
ICDE, San Jose, CA, 2002 Discovering Similar Multidimensional Trajectories Michail VlachosGeorge KolliosDimitrios Gunopulos UC RiversideBoston UniversityUC.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Data Mining Multimedia Databases Text databases Image and.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Dynamic Programming Min Edit Distance Longest Increasing Subsequence Climbing Stairs Minimum Path Sum.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Indexing Time Series. Outline Spatial Databases Temporal Databases Spatio-temporal Databases Multimedia Databases Time Series databases Text databases.
Dynamic Programming & Memoization. When to use? Problem has a recursive formulation Solutions are “ordered” –Earlier vs. later recursions.
Part 2 # 68 Longest Common Subsequence T.H. Cormen et al., Introduction to Algorithms, MIT press, 3/e, 2009, pp Example: X=abadcda, Y=acbacadb.
Dynamic Programming (Edit Distance). Edit Distance Input: – Two input strings S1 (of size n) and S2 (of size m) E.g., S1 = ATTTCTAGTGGGTAAA S2 = ATCTAGTTTAGGGATA.
Minimum Edit Distance Definition of Minimum Edit Distance.
Example 2 You are traveling by a canoe down a river and there are n trading posts along the way. Before starting your journey, you are given for each 1
Dynamic Programming for the Edit Distance Problem.
Fast Subsequence Matching in Time-Series Databases.
Definition of Minimum Edit Distance
CS 3343: Analysis of Algorithms
Advanced Algorithms Analysis and Design
Least common subsequence:
Definition of Minimum Edit Distance
Distance Functions for Sequence Data and Time Series
Finding Similar Time Series
Time Series Data and Moving Object Trajectory
Sequence Alignment 11/24/2018.
Computational Biology Lecture #6: Matching and Alignment
CS Algorithms Dynamic programming 0-1 Knapsack problem 12/5/2018.
Dynamic Programming Computation of Edit Distance
CS 3343: Analysis of Algorithms
CSE 589 Applied Algorithms Spring 1999
3. Brute Force Selection sort Brute-Force string matching
Longest Common Subsequence
Lecture 8. Paradigm #6 Dynamic Programming
Dynamic Programming.
Dynamic Programming-- Longest Common Subsequence
3. Brute Force Selection sort Brute-Force string matching
Introduction to Algorithms: Dynamic Programming
Bioinformatics Algorithms and Data Structures
Longest Common Subsequence
Chap 3 String Matching 3 -.
Time Series Filtering Time Series
Longest Common Subsequence
15-826: Multimedia Databases and Data Mining
Dynamic Programming.
3. Brute Force Selection sort Brute-Force string matching
Presentation transcript:

Distance Functions for Sequence Data and Time Series

Time Series Databases A time series is a sequence of real numbers, representing the measurements of a real variable at equal time intervals Stock prices Volume of sales over time Daily temperature readings ECG data A time series database is a large collection of time series

Time Series Problems (from a database perspective) The Similarity Problem X = x1, x2, …, xn and Y = y1, y2, …, yn Define and compute Sim(X, Y) E.g. do stocks X and Y have similar movements? Retrieve efficiently similar time series (Indexing for Similarity Queries)

Metric Distances What properties should a similarity distance have? D(A,B) = D(B,A) Symmetry D(A,A) = 0 Constancy of Self-Similarity D(A,B) >= 0 Positivity D(A,B)  D(A,C) + D(B,C) Triangular Inequality

Euclidean Similarity Measure View each sequence as a point in n-dimensional Euclidean space (n = length of each sequence) Define (dis-)similarity between sequences X and Y as p=1 Manhattan distance p=2 Euclidean distance

Dynamic Time Warping [Berndt, Clifford, 1994] Allows acceleration-deceleration of signals along the time dimension Basic idea Consider X = x1, x2, …, xn , and Y = y1, y2, …, ym We are allowed to extend each sequence by repeating elements Euclidean distance now calculated between the extended sequences X’ and Y’ Matrix M, where mij = d(xi, yj)

Example Euclidean distance vs DTW

Formulation Let D(i, j) refer to the dynamic time warping distance between the subsequences (L_1 based distance) x1, x2, …, xi y1, y2, …, yj D(i, j) = | xi – yj | + min { D(i – 1, j), D(i – 1, j – 1), D(i, j – 1) } Dynamic programming! - Optimal substructure - Overlapping sub problems

DTW variations If we use Euclidean distance to compute the difference between two elements xi and yj then we can use: D(i, j) = ( xi – yj)2 + min { D(i – 1, j), D(i – 1, j – 1), D(i, j – 1) } and take the square root of D(n,m) at the end.

Dynamic Time Warping j = i + w warping path j = i – w Y X y3 y2 y1 x1

Restrictions on Warping Paths Monotonicity Path should not go down or to the left Continuity No elements may be skipped in a sequence Warping Window | i – j | <= w

Solution by Dynamic Programming Basic implementation = O(nm) where n, m are the lengths of the sequences will have to solve the problem for each (i, j) pair If warping window is specified, then O(nw) Only solve for the (i, j) pairs where | i – j | <= w

Longest Common Subsequence Measures (Allowing for Gaps in Sequences) Gap skipped

Basic LCS Idea Sim(X,Y) = |LCS| or Sim(X,Y) = |LCS| /n

LCSS distance for Time Series We can extend the idea of LCS to Time Series using a threshold e for matching. The distance for two subsequences: x1, x2, …, xi and y1, y2, …, yj is: LCSS[i,j] = 0 if i=0 or j=0 1+ LCSS[i-1, j-1] if |xi – yj | < e max(LCSS[i-1, j], LCSS[i, j-1]) otherwise

Edit Distance for Strings Given two strings sequences we define the distance between the sequences as the minimum number of edit operation that need to be performed to transform one sequence to the other. Edit operation are insertion, deletion and substitution of single characters. This is also called Levenshtein distance

Computing Edit distance We can compute edit distance using the following function: ed(i,j) = ed(i-1, j-1) if xi = yj min (ed(i-1,j) +1, if xi ≠ yj ed(i, j-1) +1, ed(i-1, j-1)+1) for two strings x1x2x3…xi and y1y2y3…yj

Edit distance variation In some edit distance definitions, substitution costs 2 operations. In that case, you need to use: ed(i,j) = ed(i-1, j-1) if xi = yj min (ed(i-1,j) +1, if xi ≠ yj ed(i, j-1) +1, ed(i-1, j-1)+2)

DP for Edit distance Given the function, we can design a Dynamic Programming algorithm to find the distance in O(mn) where m and n are the sizes of the two strings. More here: http://en.wikipedia.org/wiki/Levenshtein_distance

int LevenshteinDistance(char s[1..m], char t[1..n]) { // for all i and j, d[i,j] will hold the Levenshtein distance between // the first i characters of s and the first j characters of t; // note that d has (m+1)x(n+1) values declare int d[0..m, 0..n] for i from 0 to m d[i, 0] := i // the distance of any first string to an empty second string for j from 0 to n d[0, j] := j // the distance of any second string to an empty first string for j from 1 to n for i from 1 to m if s[i] = t[j] then d[i, j] := d[i-1, j-1] // no operation required else d[i, j] := minimum ( d[i-1, j] + 1, // a deletion d[i, j-1] + 1, // an insertion d[i-1, j-1] + 1 // a substitution ) } return d[m,n] From wikipedia page

Metric properties and Indexing Edit distance is a metric! So, we can use metric indexes DTW and LCSS are not metrics… we have to use specialized indexes