Published by Rosamund Barrett. Modified over 8 years ago.
1
Fast Subsequence Matching in Time-Series Databases
Authors: Christos Faloutsos et al.
Speaker: Weijun He
2
What is the problem?
Time series: 1-dimensional data, e.g., daily stock market prices, daily temperatures, etc.
Our goal: design fast search methods that locate subsequences matching a query subsequence, exactly or approximately.
3
Motivation/Application
Financial, marketing, production. Typical query: "find companies whose stock prices move similarly"
Scientific databases. Typical query: "find past days in which the solar magnetic wind showed patterns similar to today's"
4
Some notational conventions
If S and Q are two sequences, then:
Len(S): length of S
S[i:j]: subsequence of S from position i to position j, inclusive
S[i]: i-th entry of S
D(S,Q): distance between two equal-length sequences S and Q
5
Queries
Two categories of queries:
Whole Matching: len(data) = len(query)
Subsequence Matching: len(data) > len(query)
Remark: a distance function D(S,Q) is assumed to be defined; e.g., D() can be the Euclidean distance.
Matching means D(S,Q) <= ε for a given tolerance ε (written e on later slides), i.e., the match may be approximate.
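As a baseline for the two query types, the sketch below implements the Euclidean distance and a sequential scan for subsequence matching. The function names and the parameter `eps` (the tolerance) are chosen here for illustration and do not come from the slides.

```python
import math

def euclidean(s, q):
    """D(S, Q): Euclidean distance between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def sequential_scan(data, query, eps):
    """Return every offset i such that D(data[i:i+len(query)], query) <= eps.

    Note: Python slices are half-open, unlike the inclusive S[i:j] notation
    used on the slides.
    """
    m = len(query)
    return [i for i in range(len(data) - m + 1)
            if euclidean(data[i:i + m], query) <= eps]

# Exact match (eps = 0) and approximate match on a toy series
series = [1, 2, 3, 4, 3, 2, 1]
exact = sequential_scan(series, [3, 4, 3], 0)
approx = sequential_scan(series, [3, 4, 3], 1.8)
```

Whole matching is the special case where the data and query have the same length, so only offset 0 is examined.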
6
Whole Matching
1. Use any distance-preserving transform (e.g., the Discrete Fourier Transform, DFT) to extract f features from each sequence (e.g., the first f DFT coefficients), mapping it to a point in f-dimensional feature space.
2. Any spatial access method (e.g., R*-tree) can then be used for range/approximate queries.
7
Mathematical Background
Lemma 1: To guarantee no false dismissals for range queries, the feature extraction function F() should satisfy:
Dfeature(F(O1), F(O2)) <= Dobject(O1, O2)
False dismissal: a qualifying sequence is discarded. BAD.
False alarm: a non-qualifying sequence is not discarded. Not so bad, since it can be removed by checking the actual distance afterwards.
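The role of Lemma 1 can be seen in a small filter-and-refine sketch. It is an illustration only: a linear scan stands in for the spatial index, and the feature function `first_two` (keeping the first two values of a sequence) is a stand-in chosen because dropping coordinates can never increase Euclidean distance, so it satisfies the lemma. All names are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples or lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def first_two(seq):
    # Stand-in feature extraction F(): truncation lower-bounds the true distance
    return seq[:2]

def range_query(objects, query, eps, extract=first_two):
    """Filter in feature space, then refine with the true distance."""
    fq = extract(query)
    # Filter step: may keep false alarms, but never drops a qualifying object
    candidates = [o for o in objects if dist(extract(o), fq) <= eps]
    # Refine step: discard the false alarms using the object-space distance
    answers = [o for o in candidates if dist(o, query) <= eps]
    return candidates, answers

objects = [[0, 0, 0, 0], [0, 0, 3, 0], [5, 5, 5, 5]]
candidates, answers = range_query(objects, [0, 0, 0, 0], eps=1)
```

Here `[0, 0, 3, 0]` survives the filter step as a false alarm and is removed in the refine step; no qualifying object is lost.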
8
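Discrete Fourier Transform
Theorem (Parseval): $\sum_{i=0}^{n-1} |x_i|^2 = \sum_{f=0}^{n-1} |X_f|^2$, where $X_f$ are the DFT coefficients of the sequence $x_i$ (with the DFT normalized by $1/\sqrt{n}$), so the transform is distance preserving.
The DFT is a linear transform, so the same identity applies to the difference of two sequences; keeping only some of the coefficients can only decrease the distance, hence the DFT features satisfy Lemma 1.
We keep the first few (2-3) coefficients as features.
Properties:
1. Only false alarms, no false dismissals
2. In practice, false alarms are few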
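Both claims can be checked numerically. The sketch below uses a direct O(n^2) DFT with the 1/sqrt(n) normalization (an assumption needed for the energy identity to hold without a scale factor) and compares the distance on the first f coefficients against the distance in the time domain. Function names are illustrative.

```python
import cmath
import math

def dft(x):
    """DFT of x with 1/sqrt(n) normalization (direct O(n^2) evaluation)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            / math.sqrt(n)
            for k in range(n)]

def energy(v):
    """Sum of squared magnitudes of a real or complex sequence."""
    return sum(abs(c) ** 2 for c in v)

def time_dist(a, b):
    """Euclidean distance in the time domain."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def feature_dist(a, b, f):
    """Euclidean distance using only the first f DFT coefficients."""
    fa, fb = dft(a)[:f], dft(b)[:f]
    return math.sqrt(sum(abs(u - v) ** 2 for u, v in zip(fa, fb)))

x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, 2.5, 1.5]
y = [0.5, 2.5, 1.0, -0.5, 2.0, 1.0, 2.0, 1.0]
parseval_gap = abs(energy(x) - energy(dft(x)))             # ~0 up to rounding
lower_bound_holds = feature_dist(x, y, 3) <= time_dist(x, y) + 1e-9
```

In practice a fast Fourier transform routine would replace the direct sum; the direct form is used here only to keep the sketch dependency-free.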
9
From Whole to Subsequence matching
Question: how can the method be generalized to approximate-match queries on subsequences of arbitrary length?
10
Subsequence Matching: Criteria
Fast: sequential scanning with a distance calculation at each and every possible offset is too slow for large databases
Correct: no "false dismissals"; "false alarms" are acceptable
Small space overhead
Dynamic
Support for data and query sequences of varying lengths
11
Proposed Method
Use a sliding window of size w, the minimum query length.
A data sequence S of length Len(S) is mapped to a trail in feature space, consisting of Len(S)-w+1 points, one for each window offset.
The resulting index is called the "Sub-Trail index" (ST-index).
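A minimal sketch of the sliding-window mapping follows. To stay short it uses a one-dimensional feature, the first (DC) coefficient of the 1/sqrt(w)-normalized DFT, which equals sum(window)/sqrt(w); a fuller version would keep the first few coefficients. The names `trail` and `dc_feature` are illustrative.

```python
import math

def dc_feature(window):
    """First coefficient of the 1/sqrt(w)-normalized DFT of the window."""
    return (sum(window) / math.sqrt(len(window)),)

def trail(seq, w, extract=dc_feature):
    """Map a sequence to its trail: one feature point per window offset."""
    return [extract(seq[i:i + w]) for i in range(len(seq) - w + 1)]

points = trail(list(range(10)), w=4)   # Len(S) - w + 1 = 7 points
```

Because consecutive windows share w-1 values, consecutive points of the trail tend to be close to each other in feature space.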
12
I-naive method
The straightforward way is to keep track of the individual points of each trail, storing them in a spatial access method.
Disadvantage: inefficient, since almost every point in a data sequence will correspond to a point in the f-dimensional feature space.
13
I-naive method (contd.)
How to improve:
Observation: the contents of the sliding window at nearby offsets will be similar, so successive points of the trail lie close together.
Solution: divide the trail into sub-trails and represent each of them by its Minimum Bounding Rectangle (MBR). Only a few MBRs need to be stored, and "no false dismissals" is still guaranteed.
14
Illustration
15
MBR Property
Each MBR corresponds to a whole sub-trail, i.e., points in feature space that correspond to successive positions of the sliding window.
Each leaf MBR has tstart and tend, the offsets of the first and last such positions, and a unique identifier for the data sequence (sequence_id).
The extent of the MBR in each dimension is denoted as (F1low, F1high, F2low, F2high, ...).
MBRs are stored in an R*-tree.
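The leaf entry described above can be sketched as a small record plus a helper that bounds a sub-trail. The class and field names (`LeafMBR`, `low`, `high`) are assumptions made for this example; the R*-tree itself is not shown.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

@dataclass
class LeafMBR:
    sequence_id: int           # identifier of the data sequence
    t_start: int               # offset of the first window in the sub-trail
    t_end: int                 # offset of the last window in the sub-trail
    low: Tuple[float, ...]     # (F1low, F2low, ...)
    high: Tuple[float, ...]    # (F1high, F2high, ...)

def make_mbr(sequence_id: int,
             points: Sequence[Tuple[float, ...]],
             t_start: int) -> LeafMBR:
    """Bound a sub-trail of consecutive feature points by its MBR."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return LeafMBR(sequence_id, t_start, t_start + len(points) - 1, low, high)

mbr = make_mbr(7, [(1.0, 5.0), (2.0, 3.0), (0.5, 4.0)], t_start=10)
```

Since every point of the sub-trail lies inside its MBR, any query region that contains a qualifying point also intersects the MBR, which is why replacing points by MBRs cannot cause false dismissals.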
16
Figure 2: Structure of a leaf node and a non-leaf node (index node layout for the last two levels).
Non-leaf entry: F1_min, F1_max, ..., Fn_min, Fn_max
Leaf entry: Sequence_id, T_start, T_end, F1_min, F1_max, ..., Fn_min, Fn_max
17
ST-index
There are two questions for the ST-index:
Insertion (dynamic requirement): when a new data sequence is inserted, what is a good way to divide its trail into sub-trails?
Queries longer than w: how should queries be handled, especially those longer than w?
18
ST-index: Insertion
19
Illustration
20
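I-adaptive heuristic
Cost function: $DA(L) = \prod_{i=1}^{n} (L_i + 0.5)$, where $L = (L_1, L_2, \ldots, L_n)$ are the sides of an MBR; DA(L) estimates the number of disk accesses the MBR contributes.
Marginal cost of a point: for a sub-trail of k points with an MBR of sides $L_1, \ldots, L_n$, each point in the sub-trail has marginal cost $mc = DA(L) / k$.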
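The two formulas translate directly into code. This sketch takes the MBR side lengths as given (any normalization of the feature space is left to the caller); the function names are illustrative.

```python
import math

def disk_accesses(sides):
    """DA(L): product over all dimensions of (L_i + 0.5)."""
    return math.prod(side + 0.5 for side in sides)

def marginal_cost(sides, k):
    """mc = DA(L) / k for a sub-trail of k points bounded by an MBR with the given sides."""
    if k < 1:
        raise ValueError("a sub-trail holds at least one point")
    return disk_accesses(sides) / k

single_point = marginal_cost([0.0, 0.0], 1)   # degenerate MBR: 0.5 * 0.5
four_points = marginal_cost([0.5, 1.5], 4)    # (1.0 * 2.0) / 4
```

A single point has a nonzero cost of 0.5^n, so grouping nearby points into one MBR lowers the cost per point as long as the MBR does not grow too much.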
21
I-adaptive heuristic: algorithm
/* Algorithm Divide-to-Subtrails */
Assign the first point of the trail to a (trivial) sub-trail
FOR each successive point
    IF it increases the marginal cost of the current sub-trail
    THEN start another sub-trail
    ELSE include it in the current sub-trail
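A runnable sketch of the greedy division follows, assuming the marginal cost is DA(L)/k with DA(L) the product of (L_i + 0.5) over the MBR sides. It recomputes the MBR from scratch at every step for clarity; an incremental update of the bounds would be the obvious optimization. Names are illustrative.

```python
import math

def marginal_cost(points):
    """DA(L) / k for the MBR that bounds the given k points."""
    dims = range(len(points[0]))
    sides = [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]
    return math.prod(side + 0.5 for side in sides) / len(points)

def divide_to_subtrails(trail):
    """Greedily split a trail (list of feature points) into sub-trails."""
    if not trail:
        return []
    subtrails = [[trail[0]]]                  # first point: trivial sub-trail
    for point in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [point]) > marginal_cost(current):
            subtrails.append([point])         # cost went up: start a new sub-trail
        else:
            current.append(point)             # otherwise extend the current one
    return subtrails

parts = divide_to_subtrails([(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)])
# parts == [[(0.0,), (0.1,), (0.2,)], [(5.0,), (5.1,)]]
```

In the toy trail, the jump from 0.2 to 5.0 would stretch the MBR and raise the marginal cost from about 0.23 to 1.375, so a new sub-trail is started there.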
22
Searching: queries longer than w
Two methods:
PrefixSearch
1. Select the prefix of Q of length w and match the prefix within tolerance e.
MultiPiece Search
1. Suppose the query sequence has length p*w.
2. Break Q into p sub-queries, which correspond to p spheres in feature space with radius e/sqrt(p).
3. Use the ST-index to retrieve the sub-trails whose MBRs intersect at least one of the sub-query regions.
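The MultiPiece steps can be sketched with three helpers: splitting the query, computing the per-piece radius, and testing whether a sphere intersects an MBR. The radius e/sqrt(p) is safe because if the squared distances of the p pieces sum to at most e^2, at least one piece has squared distance at most e^2/p. The function names are illustrative, and the index lookup is reduced to a geometric test on a single MBR.

```python
import math

def split_query(query, w):
    """Break a query of length p*w into p consecutive pieces of length w."""
    p = len(query) // w
    return [query[i * w:(i + 1) * w] for i in range(p)]

def piece_radius(eps, p):
    """Search radius for each of the p sub-queries."""
    return eps / math.sqrt(p)

def sphere_intersects_mbr(center, radius, low, high):
    """True if the sphere around `center` touches the box [low, high]."""
    d2 = 0.0
    for c, lo, hi in zip(center, low, high):
        nearest = min(max(c, lo), hi)     # closest coordinate inside the box
        d2 += (c - nearest) ** 2
    return d2 <= radius ** 2

pieces = split_query(list(range(6)), w=3)
hit = sphere_intersects_mbr((0.0, 0.0), 1.0, (0.5, 0.5), (2.0, 2.0))
miss = sphere_intersects_mbr((0.0, 0.0), 1.0, (1.0, 1.0), (2.0, 2.0))
```

Sub-trails whose MBRs pass this test are only candidates; the actual subsequences still have to be checked against the full query to discard false alarms.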
23
Prefix vs. MultiPiece search
Volume required in feature space (K is a constant):
PrefixSearch: $K e^f$
MultiPiece: $K p (e/\sqrt{p})^f = K e^f p^{1 - f/2}$
For f > 2 the MultiPiece volume is smaller, so MultiPiece is likely to produce fewer false alarms.
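A quick numeric comparison of the two volume formulas, with the constant K set to 1 by default and hypothetical function names:

```python
import math

def prefix_volume(e, f, K=1.0):
    """Volume of one sphere of radius e in f dimensions, up to the constant K."""
    return K * e ** f

def multipiece_volume(e, f, p, K=1.0):
    """Total volume of p spheres of radius e / sqrt(p) in f dimensions."""
    return K * p * (e / math.sqrt(p)) ** f

# Ratio multipiece / prefix equals p ** (1 - f / 2)
ratio_f4 = multipiece_volume(1.0, 4, 4) / prefix_volume(1.0, 4)   # 0.25
ratio_f2 = multipiece_volume(1.0, 2, 4) / prefix_volume(1.0, 2)   # 1.0
```

With f = 2 the two volumes coincide, and the advantage of MultiPiece grows with both f and p.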
24
Conclusions
The main contribution is the "I-adaptive" method:
It achieves orders-of-magnitude savings over sequential scanning
Small space overhead
It is dynamic
No false dismissals
Future work: extend the method to 2-dimensional gray-scale images, and in general to n-dimensional vector fields (e.g., 3-d MRI brain scans)
25
The End Thank you for your attention!