Published by Theresa Bates. Modified over 9 years ago.
1
DUST: A Generalized Notion of Similarity between Uncertain Time Series
Smruti R. Sarangi and Karin Murthy
IBM Research Labs, Bangalore, India
© 2009 IBM Corporation
2
Uncertainty in Data
- Sensor data: the massive amount of sensor data introduces uncertainty. (Diagram: millions of sensors feed a server, which feeds analytics and business decisions.)
- Privacy-preserving techniques: a certain degree of uncertainty is sometimes introduced intentionally.
3
Outline
- Motivation
- Generalized Distance Measure
  - Properties of a Distance Measure
  - Algebraic Derivation
- DUST Distance
  - Computation
  - Properties
  - Examples
- Results
  - Setup
  - Classification, Motif Detection, 1-NN Search
- Conclusion
4
What Does Uncertain Data Look Like?
$x = r(x) + \varepsilon(x)$, where $x$ is the observed value, $r(x)$ is the real value, and $\varepsilon(x)$ is the error, drawn from an error distribution.
(Figure: an uncertain time series, shown as observed = original + error.)
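The additive model above can be illustrated with a minimal sketch. It assumes a normal error distribution with a single standard deviation `sigma`; the function name `make_uncertain` and its parameters are hypothetical and not taken from the slides.

```python
import random


def make_uncertain(real_values, sigma, seed=0):
    """Return observed values x = r(x) + eps(x).

    Assumption (not from the slides): eps(x) is drawn independently
    from a normal distribution N(0, sigma) for every element.
    """
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [r + rng.gauss(0.0, sigma) for r in real_values]


real = [1.0, 2.0, 3.0, 4.0]
observed = make_uncertain(real, sigma=0.5)
errors = [x - r for x, r in zip(observed, real)]  # the eps(x) terms
```

With `sigma=0.0` the observed series equals the real series, which is the error-free case.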
5
Data Mining on Uncertain Time Series
- Tasks: clustering, classification, pattern discovery, …
- These tasks require at least a partial order on the distances between time series elements: are $x$ and $x'$ closer than $y$ and $y'$?
- A total order on the distances is better: it ensures that all pairs are comparable.
- We need a distance function to measure the distance between uncertain time series elements; a distance value is easy to store and manage later.
6
Distance between Uncertain Time Series
Is $T_2$ closer to $T_1$, or is $T_3$ closer to $T_1$?
(Figure: three plots of value against time, each showing $T_1$, $T_2$, and $T_3$, captioned "Doesn't matter", "Clearly $T_3$", and "$T_2$ or $T_3$???".)
7
How to Measure the Distance between Two Time Series Elements?
Consider two values: $x = r(x) + \varepsilon(x)$ and $x' = r(x') + \varepsilon(x')$.
Axiom: the distance between $x$ and $x'$ should say something about the ordinary Euclidean distance between $r(x)$ and $r(x')$.
Prior approaches:
1. Compute the a priori probability distribution of the random variable $X = r(x) - r(x')$.
2. Work with only the mean and standard deviation of $X$.
Drawbacks: $X$ is not a distance measure, and it is hard to work with probabilities.
8
Resolving the Question
$T_2$ should be closer to $T_1$ than $T_3$ is:
- It is possible that $T_2$ and $T_1$ are the same time series; $T_2$ just has some additional error.
- $T_3$ and $T_1$ can never be the same time series, because the last value has a very large divergence.
(Figure: value against time for $T_1$, $T_2$, and $T_3$, captioned "$T_2$ or $T_3$???". Euclidean distance (EUCL) and Dynamic Time Warping (DTW) pick $T_3$; DUST picks $T_2$.)
10
Arriving at a Distance Measure
Properties of a distance measure:
1. Non-negativity: $d(A,B) \ge 0$
2. Identity of indiscernibles: $d(A,B) = 0$ iff $A = B$
3. Symmetry: $d(A,B) = d(B,A)$
4. Triangle inequality: $d(A,B) + d(A,C) \ge d(B,C)$
5. Extra condition for an uncertain distance measure: the distance should be similar to EUCL or DTW if the magnitude of the error is small.
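Properties 1 to 4 can be spot-checked numerically on a finite sample of points. The sketch below is only an illustration: the helper name `check_metric_axioms` is hypothetical, and passing the check on a sample does not prove that a function is a metric.

```python
import itertools


def check_metric_axioms(points, d, tol=1e-12):
    """Test the four metric axioms for d on every pair/triple of points."""
    ok = {"non_negativity": True, "identity": True,
          "symmetry": True, "triangle": True}
    for a, b in itertools.product(points, repeat=2):
        if d(a, b) < -tol:
            ok["non_negativity"] = False
        # d(a, b) must be zero exactly when a and b are the same point
        if (abs(d(a, b)) <= tol) != (a == b):
            ok["identity"] = False
        if abs(d(a, b) - d(b, a)) > tol:
            ok["symmetry"] = False
    for a, b, c in itertools.product(points, repeat=3):
        # form used on the slide: d(A,B) + d(A,C) >= d(B,C)
        if d(a, b) + d(a, c) < d(b, c) - tol:
            ok["triangle"] = False
    return ok


sample = [0.0, 1.0, 2.5, -3.0]
absolute = check_metric_axioms(sample, lambda x, y: abs(x - y))
squared = check_metric_axioms(sample, lambda x, y: (x - y) ** 2)
```

On this sample the absolute difference passes all four checks, while the squared difference fails the triangle inequality.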
11
Extending Prior Work
Prior work: two time series are considered similar if
$P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) \ge \tau$
where
$\mathrm{DIST}(T_1, T_2) = \sqrt{\sum_i \mathrm{dist}(T_1[i], T_2[i])^2}$ and $\mathrm{dist}(x, y) = |x - y|$.
Assumption: $P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) = p(\mathrm{DIST}(T_1, T_2) = 0)\,\varepsilon$, irrespective of the size of $\varepsilon$.
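The probabilistic similarity test from prior work can be approximated by sampling. The sketch below is not the method of any cited system; it assumes independent normal errors with a known standard deviation `sigma` and recovers candidate real values as observed value minus sampled error. The names `prob_within` and `is_similar` are hypothetical.

```python
import math
import random


def prob_within(obs1, obs2, sigma, eps, n_samples=10000, seed=0):
    """Monte Carlo estimate of P(DIST(T1, T2) <= eps).

    Assumption: each real value equals the observed value minus an
    independent N(0, sigma) error.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sq_sum = 0.0
        for a, b in zip(obs1, obs2):
            ra = a - rng.gauss(0.0, sigma)  # sampled real value for T1[i]
            rb = b - rng.gauss(0.0, sigma)  # sampled real value for T2[i]
            sq_sum += (ra - rb) ** 2
        if math.sqrt(sq_sum) <= eps:
            hits += 1
    return hits / n_samples


def is_similar(obs1, obs2, sigma, eps, tau):
    """Similarity rule from the slide: P(DIST <= eps) >= tau."""
    return prob_within(obs1, obs2, sigma, eps) >= tau
```

With `sigma=0.0` the estimate degenerates to the ordinary Euclidean test, which gives a quick sanity check of the sketch.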
12
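Some Algebra
$T_2$ is more similar to $T_1$ than $T_3$ is when
$P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) > P(\mathrm{DIST}(T_1, T_3) \le \varepsilon)$
$p(\mathrm{DIST}(T_1, T_2) = 0) > p(\mathrm{DIST}(T_1, T_3) = 0)$
$\prod_i p(\mathrm{dist}(T_1[i], T_2[i]) = 0) > \prod_i p(\mathrm{dist}(T_1[i], T_3[i]) = 0)$
$\sum_i -\log p(\mathrm{dist}(T_1[i], T_2[i]) = 0) \le \sum_i -\log p(\mathrm{dist}(T_1[i], T_3[i]) = 0)$
Let $\varphi(x) = p(\mathrm{dist}(0, x) = 0)$. Since $\mathrm{dist}(x, y)$ depends only on $|x - y|$ (proved in the paper), each term satisfies $-\log p(\mathrm{dist}(T_1[i], T_2[i]) = 0) \approx -\log \varphi(|T_1[i] - T_2[i]|)$.
Definition: $\mathrm{dust}(x, y) = \sqrt{-\log \varphi(|x - y|) + \log \varphi(0)}$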
13
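Some Algebra - II
$P(\mathrm{DIST}(T_1, T_2) \le \varepsilon) > P(\mathrm{DIST}(T_1, T_3) \le \varepsilon)$
$\sum_i -\log p(\mathrm{dist}(T_1[i], T_2[i]) = 0) \le \sum_i -\log p(\mathrm{dist}(T_1[i], T_3[i]) = 0)$
Definition: $\mathrm{dust}(x, y) = \sqrt{-\log \varphi(|x - y|) + \log \varphi(0)}$
With this definition, the inequality is approximately
$\sum_i \mathrm{dust}(T_1[i], T_2[i])^2 \le \sum_i \mathrm{dust}(T_1[i], T_3[i])^2$
Definition: $\mathrm{DUST}(T_1, T_2) = \sum_i \mathrm{dust}(T_1[i], T_2[i])^2$
Hence $\mathrm{DUST}(T_1, T_2) \le \mathrm{DUST}(T_1, T_3)$: DUST behaves like a standard distance measure.
(Figure: value against time for $T_1$, $T_2$, and $T_3$.)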
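The two definitions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: `phi` is supplied by the caller, the square root in `dust` is an assumption chosen so that summing squared dust values reproduces the sum of negative log-probabilities, and `gaussian_phi` is a hypothetical example of $\varphi$ with a Gaussian shape.

```python
import math


def make_dust(phi):
    """Build dust(x, y) = sqrt(-log(phi(|x - y|)) + log(phi(0)))."""
    log_phi0 = math.log(phi(0.0))

    def dust(x, y):
        value = -math.log(phi(abs(x - y))) + log_phi0
        return math.sqrt(max(value, 0.0))  # clamp tiny negative rounding error

    return dust


def dust_series(t1, t2, dust):
    """DUST(T1, T2) as on the slide: the sum of squared dust values."""
    if len(t1) != len(t2):
        raise ValueError("time series must have the same length")
    return sum(dust(a, b) ** 2 for a, b in zip(t1, t2))


def gaussian_phi(sigma):
    """Hypothetical phi for illustration: exp(-delta^2 / (4 sigma^2))."""
    return lambda delta: math.exp(-delta ** 2 / (4.0 * sigma ** 2))


dust = make_dust(gaussian_phi(1.0))
total = dust_series([0.0, 0.0], [2.0, 4.0], dust)
```

Note that `math.log` raises an error if `phi` underflows to zero for a very large difference, so a practical version would need to handle the tail separately.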
15
Computing the DUST Distance
Compute $\mathrm{dust}(0, \Delta x)$:
1. Assume values are independent.
2. Use Bayes' theorem.
3. Arrive at the final solution through numerical integration.
Offline computation: from $\Delta x$, the original distribution of the data, and the error distribution, compute $\mathrm{dust}(0, \Delta x)$. Save the values in a lookup table and compress it using a piecewise linear representation.
Online computation: given $\Delta x = |x - y|$, check the last segment in the lookup table (a Yes/No branch in the flowchart), perform a binary search to find the right segment, and calculate the value $\mathrm{dust}(0, \Delta x)$.
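The offline/online split can be sketched as a piecewise linear lookup table searched with bisection. Several details here are assumptions: the table uses evenly spaced knots instead of a compressed representation, the offline function `dust0` stands in for the result of the numerical integration, and values beyond the table are extrapolated from the last segment. The names `build_table` and `lookup` are hypothetical.

```python
import bisect


def build_table(dust0, max_delta, n_knots):
    """Offline step: sample dust(0, delta) at evenly spaced knots."""
    step = max_delta / (n_knots - 1)
    xs = [i * step for i in range(n_knots)]
    ys = [dust0(x) for x in xs]
    return xs, ys


def lookup(table, delta):
    """Online step: linear interpolation within the matching segment."""
    xs, ys = table
    delta = abs(delta)
    if delta >= xs[-2]:
        # Last segment; also used to extrapolate beyond the table.
        i = len(xs) - 2
    else:
        # Binary search for the segment [xs[i], xs[i + 1]) containing delta.
        i = bisect.bisect_right(xs, delta) - 1
    x0, x1 = xs[i], xs[i + 1]
    y0, y1 = ys[i], ys[i + 1]
    return y0 + (y1 - y0) * (delta - x0) / (x1 - x0)


# Stand-in for the offline result: dust(0, delta) = delta / 2.
table = build_table(lambda d: d / 2.0, max_delta=10.0, n_knots=11)
inside = lookup(table, 3.5)
beyond = lookup(table, 12.0)
```

Each online query then costs one binary search and one interpolation, independent of how expensive the offline integration was.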
16
The dust Distance
- Normal distribution: the dust distance is exactly the same as Euclidean distance.
- Other distributions: dust ultimately converges to Euclidean distance.
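A short worked derivation shows why the normal case reduces to Euclidean distance. It is a sketch under assumptions that the slide does not state: both errors are independent $N(0, \sigma^2)$, the prior over real values is flat, $\varphi(\Delta)$ is read as the density of $r(x) - r(y)$ at zero given $\Delta = |x - y|$, and dust includes a square root.

```latex
\begin{align*}
r(x) - r(y) \mid x, y &\sim N\!\left(x - y,\; 2\sigma^2\right) \\
\varphi(\Delta) &= \frac{1}{\sqrt{4\pi\sigma^2}}
  \exp\!\left(-\frac{\Delta^2}{4\sigma^2}\right),
  \qquad \Delta = |x - y| \\
-\log\varphi(\Delta) + \log\varphi(0) &= \frac{\Delta^2}{4\sigma^2} \\
\mathrm{dust}(x, y) &= \sqrt{\frac{\Delta^2}{4\sigma^2}} = \frac{|x - y|}{2\sigma}
\end{align*}
```

Under these assumptions dust is the Euclidean distance $|x - y|$ scaled by the constant $1/(2\sigma)$, so it orders pairs of values the same way.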
17
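Combining Multiple Distributions
Let the values in a time series have different error distributions $f_1, \dots, f_n$ with standard deviations $\sigma_1, \dots, \sigma_n$. Choose $\sigma_e = \min(\sigma_1, \dots, \sigma_n)/5$.
Adjusted distribution:
$f'(x) = f(x)$ for $\eta_1 \le x \le \eta_2$, and $f'(x) = N(0, \sigma_e)$ for $x < \eta_1$ or $x > \eta_2$.
(Figure: $f'(x)$ with the cut-offs $\eta_1$ and $\eta_2$ and regions labeled "Interested" and "Not interested"; time series $T_1$ and $T_2$ with normal, uniform, and exponential error distributions.)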
18
Combining Multiple Normal Distributions
Combinations of multiple normal distributions with different standard deviations converge to the same distance function.
19
Results
20
Classification Accuracy
No error: 77%; DUST: 72%; Euclidean distance: 62%
21
Classification Accuracy: Dynamic Time Warping
No error: 78%; DUST: 74%; Euclidean distance: 67%
22
Top-k Motifs: EEG Dataset
(Figure annotations: "Anomalous behavior" and "Superior performance of DUST".)
23
Number of Matches vs. Standard Deviation for k-NN Classification: Wafer Dataset
(Figure: number of matches against standard deviation, with curves for DUST and Euclidean distance.)
24
Conclusions
- Uncertainty in data is increasingly prevalent in
  - sensor data
  - privacy-preserving techniques
- Conventional approaches don't produce good results when mining uncertain data
- We propose a novel metric, DUST, which
  - incorporates theoretical measures of similarity
  - is easy to compute
- DUST makes up for half the accuracy lost due to uncertainty