© 2009 IBM Corporation DUST: A Generalized Notion of Similarity between Uncertain Time Series Smruti R. Sarangi and Karin Murthy IBM Research Labs, Bangalore, India

Uncertainty in Data
– Sensor data: uncertainty is introduced by the massive amount of sensor data
  [Figure: millions of sensors, server, analytics, business decisions]
– Privacy-preserving techniques: a certain degree of uncertainty is sometimes introduced intentionally

Outline
Motivation
Generalized Distance Measure
– Properties of a Distance Measure
– Algebraic Derivation
DUST Distance
– Computation
– Properties
– Examples
Results
– Setup
– Classification, Motif Detection, 1-NN search
Conclusion

What Does Uncertain Data Look Like?
x = r(x) + ε(x)
– x: observed value
– r(x): real value
– ε(x): error, which follows an error distribution
[Figure: an uncertain time series, with the observed series shown as the original series plus error]
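
The decomposition x = r(x) + ε(x) can be illustrated with a minimal sketch. It assumes, purely for illustration, that the error is Normal with a known standard deviation; the function name `make_uncertain` and its parameters are hypothetical and do not come from the slides.

```python
import random

def make_uncertain(real_values, sigma, seed=None):
    """Return observed values x = r(x) + e(x), with e(x) ~ Normal(0, sigma).

    The Normal error model is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [r + rng.gauss(0.0, sigma) for r in real_values]

# Real (unobserved) series and one noisy observation of it.
real = [0.0, 1.0, 2.0, 3.0]
observed = make_uncertain(real, sigma=0.5, seed=42)

# The error at each point is the gap between observed and real values.
errors = [x - r for x, r in zip(observed, real)]
```

In practice only `observed` and the error distribution are known; `real` and `errors` are shown here only to make the three terms of the equation concrete.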

Data Mining on Uncertain Time Series
Tasks: clustering, classification, pattern discovery, …
– These tasks require at least a partial order on the distances between time series elements: are x and x' closer than y and y'?
– A total order on the distances is better
  – Ensures that all pairs are comparable
  – Easy to store the distance and manage it later
– We need a distance function to measure the distance between uncertain time series elements

Distance between Uncertain Time Series
Is T_2 closer to T_1, or is T_3 closer to T_1?
[Figure: three value-versus-time plots of T_1, T_2, and T_3, labeled "Doesn't matter", "Clearly T_3", and "T_2 or T_3 ???"]

How to Measure the Distance between Two Time Series Elements?
Consider two values: x = r(x) + ε(x) and x' = r(x') + ε(x')
Axiom: the distance between x and x' should say something about the normal Euclidean distance between r(x) and r(x')
Prior approaches
1. Compute the a priori probability distribution of the random variable X = r(x) - r(x')
2. Work with only the mean and standard deviation of X
Drawbacks: X is not a distance measure, and it is hard to work with probabilities

Resolving the Question
T_2 should be closer to T_1 than T_3 is
– T_2 and T_1 may be the same time series; T_2 just has some additional error
– T_3 and T_1 can never be the same time series, because the last value has a very large divergence
[Figure: value-versus-time plot of T_1, T_2, and T_3, labeled "T_2 or T_3 ???"]
Answers: Euclidean distance (EUCL) and Dynamic Time Warping (DTW) choose T_3; DUST chooses T_2

Arriving at a Distance Measure
Properties of a distance measure
1. Non-negativity: d(A,B) ≥ 0
2. Identity of indiscernibles: d(A,B) = 0 iff A = B
3. Symmetry: d(A,B) = d(B,A)
4. Triangle inequality: d(A,B) + d(A,C) ≥ d(B,C)
5. The distance should be similar to EUCL or DTW if the magnitude of the error is small (extra condition for an uncertain distance measure)
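
Properties 1 to 4 can be checked numerically on a finite sample of points. The sketch below is only a sanity check, not a proof, and the helper name `check_metric` is hypothetical. It uses the triangle inequality in the form given above, d(A,B) + d(A,C) ≥ d(B,C).

```python
import itertools
import math

def check_metric(d, points, tol=1e-12):
    """Return True if d satisfies properties 1-4 on every combination of points."""
    for a, b in itertools.product(points, repeat=2):
        if d(a, b) < 0:                      # 1. non-negativity
            return False
        if (d(a, b) == 0) != (a == b):       # 2. identity of indiscernibles
            return False
        if abs(d(a, b) - d(b, a)) > tol:     # 3. symmetry
            return False
    for a, b, c in itertools.product(points, repeat=3):
        if d(a, b) + d(a, c) < d(b, c) - tol:  # 4. triangle inequality
            return False
    return True

def euclid(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared(a, b):
    """Squared Euclidean distance; it violates the triangle inequality."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

points = [(0.0,), (1.0,), (2.0,), (3.5,)]
euclid_ok = check_metric(euclid, points)
squared_ok = check_metric(squared, points)
```

The squared distance fails only on property 4, which is one reason a square root appears in DIST-style definitions.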

Extending Prior Work
Prior work: two time series are considered similar if
P(DIST(T_1, T_2) ≤ ε) ≥ τ
where DIST(T_1, T_2) = sqrt(Σ_i dist(T_1[i], T_2[i])^2) and dist(x, y) = |x - y|
Assumption: P(DIST(T_1, T_2) ≤ ε) = p(DIST(T_1, T_2) = 0) · ε (irrespective of the size of ε)

Some Algebra
The following conditions are approximately equivalent (≈):
P(DIST(T_1, T_2) ≤ ε) > P(DIST(T_1, T_3) ≤ ε)
p(DIST(T_1, T_2) = 0) > p(DIST(T_1, T_3) = 0)
Π_i p(dist(T_1[i], T_2[i]) = 0) > Π_i p(dist(T_1[i], T_3[i]) = 0)
Σ_i -log(p(dist(T_1[i], T_2[i]) = 0)) < Σ_i -log(p(dist(T_1[i], T_3[i]) = 0))
Let φ(x) = p(dist(0, x) = 0)
p(dist(x, y) = 0) depends only on |x - y| (proved in the paper), so each term on the left is -log(φ(|T_1[i] - T_2[i]|))
Definition: dust(x, y) = sqrt(-log(φ(|x - y|)) + log(φ(0)))

Some Algebra - II
P(DIST(T_1, T_2) ≤ ε) > P(DIST(T_1, T_3) ≤ ε)
≈ Σ_i -log(p(dist(T_1[i], T_2[i]) = 0)) < Σ_i -log(p(dist(T_1[i], T_3[i]) = 0))
Definition: dust(x, y) = sqrt(-log(φ(|x - y|)) + log(φ(0)))
Hence Σ_i dust(T_1[i], T_2[i])^2 < Σ_i dust(T_1[i], T_3[i])^2
Definition: DUST(T_1, T_2) = Σ_i dust(T_1[i], T_2[i])^2
Therefore DUST(T_1, T_2) < DUST(T_1, T_3)
DUST behaves like a standard distance measure
[Figure: value-versus-time plot of T_1, T_2, and T_3]
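
The definitions can be made concrete with a small sketch. Several things here are assumptions and not taken from the slides: dust is taken as the square root of -log(φ(|x - y|)) + log(φ(0)), so that the squared terms in DUST add up the -log φ values; the two φ functions assume a flat prior on the real values, giving φ(Δ) proportional to exp(-Δ²/(4σ²)) for Normal errors and to max(0, 2w - Δ) for errors uniform on [-w, w]; and all function names are hypothetical.

```python
import math

def dust(x, y, phi):
    """dust(x, y) = sqrt(-log(phi(|x - y|)) + log(phi(0))); inf if phi is 0."""
    p = phi(abs(x - y))
    if p <= 0.0:
        return math.inf  # the two values cannot share the same real value
    return math.sqrt(max(0.0, -math.log(p) + math.log(phi(0.0))))

def dust_series(t1, t2, phi):
    """DUST(T1, T2) as the sum of squared dust terms."""
    if len(t1) != len(t2):
        raise ValueError("series must have equal length")
    return sum(dust(a, b, phi) ** 2 for a, b in zip(t1, t2))

def euclidean(t1, t2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t1, t2)))

def normal_phi(sigma):
    # Assumed form for Normal errors (unnormalized; constants cancel in dust).
    return lambda delta: math.exp(-delta * delta / (4.0 * sigma * sigma))

def uniform_phi(w):
    # Assumed form for errors uniform on [-w, w] (unnormalized).
    return lambda delta: max(0.0, 2.0 * w - delta)

t1 = [0.0] * 10
t2 = [1.0] * 10           # small deviation everywhere
t3 = [0.0] * 9 + [3.0]    # one very large deviation at the end

eucl_prefers_t3 = euclidean(t1, t3) < euclidean(t1, t2)
dust_prefers_t2 = (dust_series(t1, t2, uniform_phi(1.0))
                   < dust_series(t1, t3, uniform_phi(1.0)))
```

Under these assumptions the Euclidean distance ranks T_3 closer to T_1, while DUST with bounded uniform errors ranks T_2 closer, because a deviation of 3 cannot be explained by the error alone. With `normal_phi`, dust reduces to |x - y| / (2σ), a rescaled Euclidean distance.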

Computing the DUST Distance
Compute dust(0, Δx)
1. Assume values are independent
2. Use Bayes' theorem
3. Arrive at the final solution through numerical integration
Offline computation: from the original distribution of the data and the error distribution, compute dust(0, Δx)
– Save the values in a lookup table
– Compress it using a piecewise-linear representation
Online computation: given Δx
– Check whether Δx falls in the last segment of the lookup table
– If not, perform a binary search to find the right segment
– Calculate the value dust(0, Δx)
[Figure: plot of dust(0, Δx) against |x - y|]
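
The offline and online steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the paper: it assumes a flat prior on the real values, so that φ(Δ) is the integral of f(e) · f(e - Δ) over e for an error density f; it uses a Normal error density; and the table knots themselves serve as the piecewise-linear representation, with no further compression. All names are hypothetical.

```python
import bisect
import math

def normal_pdf(e, sigma=1.0):
    return math.exp(-e * e / (2 * sigma * sigma)) / (sigma * math.sqrt(2 * math.pi))

def phi(delta, pdf, lo=-10.0, hi=10.0, steps=2000):
    """Trapezoidal estimate of the integral of pdf(e) * pdf(e - delta) de."""
    h = (hi - lo) / steps
    total = 0.5 * (pdf(lo) * pdf(lo - delta) + pdf(hi) * pdf(hi - delta))
    for k in range(1, steps):
        e = lo + k * h
        total += pdf(e) * pdf(e - delta)
    return total * h

def build_table(pdf, max_delta=6.0, knots=25):
    """Offline step: tabulate dust(0, delta) on an evenly spaced grid."""
    log_phi0 = math.log(phi(0.0, pdf))
    xs = [max_delta * k / (knots - 1) for k in range(knots)]
    ys = [math.sqrt(max(0.0, -math.log(phi(x, pdf)) + log_phi0)) for x in xs]
    return xs, ys

def lookup(table, delta):
    """Online step: find the segment and interpolate linearly."""
    xs, ys = table
    delta = abs(delta)
    if delta >= xs[-2]:
        i = len(xs) - 2  # last segment; also used to extrapolate beyond the table
    else:
        i = bisect.bisect_right(xs, delta) - 1  # binary search for the segment
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (delta - xs[i])

table = build_table(normal_pdf)
inside = lookup(table, 3.3)    # within the tabulated range
beyond = lookup(table, 8.0)    # extrapolated from the last segment
```

For a Normal error with σ = 1 the tabulated curve is the straight line Δx / 2 under these assumptions, so interpolation and extrapolation are both essentially exact; for other error densities the grid spacing controls the interpolation error.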

The dust Distance
– Normal distribution: the dust distance is exactly the same as the Euclidean distance
– Other distributions: dust ultimately converges to the Euclidean distance

Combining Multiple Distributions
Let the values in a time series have different error distributions f_1 … f_n, with standard deviations σ_1 … σ_n
Choose σ_e = min(σ_1, …, σ_n) / 5
Adjusted distribution
– f'(x) = f(x) for η_1 ≤ x ≤ η_2
– f'(x) = N(0, σ_e) for x < η_1 or x > η_2
[Figure: a distribution with η_1 and η_2 marking the "interested" and "not interested" regions; series T_1 and T_2 with Normal, Uniform, and Exponential error distributions]

Combining Multiple Normal Distributions
Combinations of normal distributions with different standard deviations converge to the same distance function

Results

Classification Accuracy
No error: 77%, DUST: 72%, Euclidean distance: 62%

Classification Accuracy: Dynamic Time Warping
No error: 78%, DUST: 74%, Euclidean distance: 67%

Top-k Motifs: EEG Dataset
[Figure: top-k motifs, annotated "Anomalous behavior" and "Superior performance of DUST"]

Number of Matches vs. Standard Deviation for k-NN Classification: Wafer Dataset
[Figure: number of matches against standard deviation, with curves for DUST and Euclidean distance]

Conclusions
Uncertainty in data is increasingly prevalent in
– Sensor data
– Privacy-preserving techniques
Conventional approaches
– Don't produce good results when mining uncertain data
We propose a novel metric, DUST
– Incorporates theoretical measures of similarity
– Easy to compute
DUST makes up for half the accuracy lost due to uncertainty