Efficient Query Filtering for Streaming Time Series


1 Efficient Query Filtering for Streaming Time Series
Li Wei, Eamonn Keogh, Helga Van Herle, Agenor Mafra-Neto
Computer Science & Engineering Dept., University of California – Riverside, Riverside, CA {wli,
David Geffen School of Medicine, University of California – Los Angeles, Los Angeles, CA
ISCA Technologies, Riverside, CA
ICDM '05

2 Outline of Talk
Introduction to time series
Time series filtering
Wedge-based approach
Experimental results
Conclusions

3 What are Time Series?
Time series are collections of observations made sequentially in time.

4 Time Series are Everywhere
ECG Heartbeat, Image, Stock, Video

5 Time Series Data Mining Tasks
Clustering, Classification, Rule Discovery, Motif Discovery, Query by Content, Anomaly Detection, Visualization

6 Time Series Filtering
Given a time series T, a set of candidates C, and a distance threshold r, find all subsequences in T that are within distance r of any of the candidates in C.
The candidates and r are given in advance; the long time series T is not known in advance.
Applications: ECG monitoring, audio sensor monitoring (the problem was motivated by aerospace).
[Figure: a time series is compared against candidates 1 to 12; one subsequence matches Q11.]

7 Filtering vs. Querying
[Figure: two panels. Querying: a single query (template) is compared against a database to find the best match. Filtering: a set of queries is compared against a database and the matches are reported, for example Q11.]

8 Euclidean Distance Metric
Given two time series Q = q1…qn and C = c1…cn, the Euclidean distance between them is defined as:
$$D(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$$

9 Early Abandon
During the computation, if the running sum of the squared differences between corresponding data points exceeds r², we can safely stop the calculation.
[Figure: Q and C plotted over 100 points, with the point where the calculation is abandoned marked.]
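
The early-abandoning test can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and the returned step count are assumptions made here so that the cost can be inspected.

```python
import math


def early_abandon_distance(q, c, r):
    """Euclidean distance between q and c with early abandoning.

    Returns (distance, steps). The distance is None when the running sum
    of squared differences exceeds r**2 before the end of the sequences.
    The name and the return shape are illustrative assumptions.
    """
    limit = r * r
    total = 0.0
    steps = 0
    for qi, ci in zip(q, c):
        steps += 1
        diff = qi - ci
        total += diff * diff
        if total > limit:
            # The true distance can only grow from here, so stop.
            return None, steps
    return math.sqrt(total), steps
```

When the first points already differ by more than r, the comparison costs a single step instead of n.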

10 Classic Approach
Individually compare each candidate sequence to the query using the early abandoning algorithm. There may be hundreds or thousands of candidates.
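
A rough sketch of the classic approach, assuming the stream is available as a list and every candidate has the same length. The names `within_r` and `classic_filter` are hypothetical, and a sliding step of 1 is an assumption.

```python
def within_r(window, candidate, r):
    """True if the Euclidean distance is at most r (with early abandoning)."""
    limit = r * r
    total = 0.0
    for wi, ci in zip(window, candidate):
        total += (wi - ci) ** 2
        if total > limit:
            return False
    return True


def classic_filter(t, candidates, r):
    """Compare every subsequence of t against every candidate.

    Returns a list of (start_position, candidate_index) matches.
    """
    n = len(candidates[0])  # assumption: all candidates share one length
    matches = []
    for start in range(len(t) - n + 1):
        window = t[start:start + n]
        for idx, candidate in enumerate(candidates):
            if within_r(window, candidate, r):
                matches.append((start, idx))
    return matches
```

For example, `classic_filter([0, 0, 1, 2, 1, 0, 0], [[1, 2, 1], [0, 0, 0]], 0.5)` reports one match: the first candidate at position 2.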

11 Wedge
Given candidate sequences C1, .. , Ck, we can form two new sequences U and L:
Ui = max(C1i , .. , Cki)
Li = min(C1i , .. , Cki)
They form the smallest possible bounding envelope that encloses sequences C1, .. , Ck. We call the combination of U and L a wedge, and denote a wedge as W:
W = {U, L}
A lower bounding measure between an arbitrary query Q and the entire set of candidate sequences contained in a wedge W is:
$$LB\_Keogh(Q, W) = \sqrt{\sum_{i=1}^{n} \begin{cases} (q_i - U_i)^2 & \text{if } q_i > U_i \\ (q_i - L_i)^2 & \text{if } q_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$
Important thing to notice: LB_Keogh lower bounds the Euclidean distance from Q to every candidate sequence enclosed in W.
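
The wedge construction and the lower bound can be sketched as follows. This is an illustrative sketch only; the function names `build_wedge` and `lb_keogh` are assumptions, and candidates are taken to be equal-length lists of numbers.

```python
import math


def build_wedge(candidates):
    """Return (U, L): the pointwise max and min over all candidates."""
    upper = [max(column) for column in zip(*candidates)]
    lower = [min(column) for column in zip(*candidates)]
    return upper, lower


def lb_keogh(q, upper, lower):
    """Lower bound on the distance from q to any sequence inside the wedge."""
    total = 0.0
    for qi, ui, li in zip(q, upper, lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        # Points inside the envelope contribute nothing.
    return math.sqrt(total)
```

With candidates `[1, 2, 3]` and `[2, 1, 4]`, the wedge is U = `[2, 2, 4]`, L = `[1, 1, 3]`. For the query `[0, 2, 5]` the bound is sqrt(2), which is no larger than the true distances sqrt(5) and sqrt(6).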

12 Generalized Wedge
Use W(1,2) to denote that a wedge is built from sequences C1 and C2. Wedges can be hierarchically nested. For example, W((1,2),3) consists of W(1,2) and C3. A single sequence Ci can itself be treated as a wedge Wi.

13 Wedge Based Approach
Compare the query to the wedge using LB_Keogh.
If the LB_Keogh function early abandons, we are done.
Otherwise, individually compare each candidate sequence to the query using the early abandoning algorithm.
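
One way to sketch this search over nested wedges is a small recursive tree walk. The `Wedge` class and the helper names are hypothetical. A single candidate is stored as a wedge whose upper and lower sequences coincide, so the lower bound on a leaf equals the squared Euclidean distance check.

```python
from dataclasses import dataclass, field


@dataclass
class Wedge:
    upper: list
    lower: list
    children: list = field(default_factory=list)  # empty for one candidate
    label: str = ""


def leaf(label, seq):
    return Wedge(list(seq), list(seq), [], label)


def merge(a, b):
    upper = [max(x, y) for x, y in zip(a.upper, b.upper)]
    lower = [min(x, y) for x, y in zip(a.lower, b.lower)]
    return Wedge(upper, lower, [a, b])


def abandons(q, w, r):
    """True if the LB_Keogh sum exceeds r**2 (early abandoning)."""
    limit = r * r
    total = 0.0
    for qi, ui, li in zip(q, w.upper, w.lower):
        if qi > ui:
            total += (qi - ui) ** 2
        elif qi < li:
            total += (qi - li) ** 2
        if total > limit:
            return True
    return False


def search(q, w, r):
    """Return labels of all candidates under w that are within r of q."""
    if abandons(q, w, r):
        return []  # every candidate in this wedge is pruned at once
    if not w.children:
        return [w.label]
    return [m for child in w.children for m in search(q, child, r)]
```

A query far from the whole wedge is rejected after one comparison against the root; only queries that fall near the envelope cause the search to descend.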

14 Examples of Wedge Merging
[Figure: C1 (or W1) and C2 (or W2) merge into W(1,2); W(1,2) and C3 (or W3) merge into W((1,2),3); a query Q is compared against W((1,2),3).]

15 Hierarchical Clustering
[Figure: dendrogram over C1 to C5 (or W1 to W5), giving one wedge set per level.]
K = 5: W1, W2, W3, W4, W5
K = 4: W(2,5), W1, W3, W4
K = 3: W(2,5), W(1,4), W3
K = 2: W((2,5),3), W(1,4)
K = 1: W(((2,5),3), (1,4))
Which wedge set to choose?
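
The nested wedge sets for K = k down to K = 1 can be generated by repeated pairwise merging. The sketch below is only an illustration: the slides do not state the merge criterion, so merging the pair whose combined wedge has the smallest area is an assumption, as are the function names.

```python
def merge_envelopes(a, b):
    (ua, la), (ub, lb) = a, b
    upper = [max(x, y) for x, y in zip(ua, ub)]
    lower = [min(x, y) for x, y in zip(la, lb)]
    return upper, lower


def area(envelope):
    upper, lower = envelope
    return sum(u - l for u, l in zip(upper, lower))


def wedge_sets(candidates):
    """Return the wedge names at each level, from K = k down to K = 1."""
    current = [(str(i + 1), (list(c), list(c))) for i, c in enumerate(candidates)]
    levels = [[name for name, _ in current]]
    while len(current) > 1:
        best = None
        for i in range(len(current)):
            for j in range(i + 1, len(current)):
                merged = merge_envelopes(current[i][1], current[j][1])
                cost = area(merged)  # assumed criterion: smallest merged area
                if best is None or cost < best[0]:
                    best = (cost, i, j, merged)
        _, i, j, merged = best
        name = f"({current[i][0]},{current[j][0]})"
        rest = [w for k, w in enumerate(current) if k not in (i, j)]
        current = rest + [(name, merged)]
        levels.append([n for n, _ in current])
    return levels
```

Each returned level is one candidate wedge set; the empirical comparison of those sets on sample data is a separate step.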

16 Which Wedge Set to Choose?
Test all k wedge sets on a representative sample of data.
Choose the wedge set that performs best in these empirical tests.

17 Upper Bound on Wedge Based Approach
The wedge based approach seems to be efficient when comparing a set of time series to a large batch dataset. But what about streaming time series?
Streaming algorithms are limited by their worst case. Being efficient on average does not help.
[Figure: worst case, a subsequence compared against C1 (or W1), C2 (or W2), C3 (or W3), W(1,2) and W((1,2),3).]

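18 Triangular Inequality
If dist(W((2,5),3), W(1,4)) >= 2r, then by the triangular inequality a subsequence cannot be within r of both wedges. If the subsequence fails on W((2,5),3), that is, its distance to that wedge is < r, it cannot also fail on W(1,4): a subsequence cannot fail on both wedges.
[Figure: the wedge sets for K = 5 to K = 1, with a subsequence compared against W((2,5),3) and W(1,4).]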

19 Experimental Setup
Datasets:
ECG Dataset
Stock Dataset
Audio Dataset
We measure the number of computational steps used by the following methods:
Brute force
Brute force with early abandoning (classic)
Our approach (Atomic Wedgie)
Our approach with random wedge set (AWR)
How to choose r? A logical value for r would be the average distance from a pattern to its nearest neighbor.
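
The suggested choice of r can be computed directly from the candidate patterns. A minimal sketch, assuming Euclidean distance, at least two patterns, and a hypothetical function name:

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_nn_distance(patterns):
    """Average distance from each pattern to its nearest other pattern."""
    total = 0.0
    for i, p in enumerate(patterns):
        total += min(euclidean(p, other)
                     for j, other in enumerate(patterns) if j != i)
    return total / len(patterns)
```

For the patterns `[0, 0]`, `[3, 4]` and `[3, 5]`, the nearest-neighbor distances are 5, 1 and 1, so the suggested r is 7/3.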

20 ECG Dataset
Batch time series: 650,000 data points (half an hour’s ECG signals)
Candidate set: 200 time series of length 40, covering 4 types of patterns:
left bundle branch block beat
right bundle branch block beat
atrial premature beat
ventricular escape beat
r = 0.5
Upper Bound: 2,120 (8,000 for brute force)

Algorithm        Number of Steps
brute force      5,199,688,000
classic          210,190,006
Atomic Wedgie    8,853,008
AWR              29,480,264

21 Stock Dataset
Batch time series: 2,119,415 data points
Candidate set: 337 time series of length 128, covering 3 types of patterns:
head and shoulders
reverse head and shoulders
cup and handle
r = 4.3
Upper Bound: 18,048 (43,136 for brute force)

Algorithm        Number of Steps
brute force      91,417,607,168
classic          13,028,000,000
Atomic Wedgie    3,204,100,000
AWR              10,064,000,000

22 Audio Dataset
Batch time series: 37,583,512 data points (one hour’s sound)
Candidate set: 68 time series with length 51, covering 3 species of harmful mosquitoes:
Culex quinquefasciatus
Aedes aegypti
Culiseta spp
Sliding window: 11,025 (1 second)
Step: 5,512 (0.5 second)
r = 2
Upper Bound: 2,929 (6,868 for brute force)

Algorithm        Number of Steps
brute force      57,485,160
classic          1,844,997
Atomic Wedgie    1,144,778
AWR              2,655,816

23 Conclusions
We introduce the problem of time series filtering.
Combining similar sequences into a wedge is a promising idea.
We have provided an upper bound on the cost of the algorithm, which lets us compute the fastest arrival rate we can guarantee to handle.

24 Future Work
Dynamic wedge set choosing for data with concept shifting
Extension to other distance measures, for example, DTW (Dynamic Time Warping) and uniform scaling

25 Questions? All datasets used in this talk can be found at

26 Z-Normalization
Normalize a data sequence C to have mean = 0 and standard deviation = 1:
C = (C - mean(C)) / std(C)
[Figure: two plots of a sequence over 1,000 points.]
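
The normalization formula translates directly into code. This sketch uses the population standard deviation and maps a constant sequence to zeros; both choices are assumptions, since the slide does not specify them.

```python
import statistics


def z_normalize(c):
    """Rescale c to mean 0 and standard deviation 1."""
    mean = statistics.fmean(c)
    std = statistics.pstdev(c)  # assumption: population standard deviation
    if std == 0:
        # A constant sequence has no spread; return zeros to avoid dividing by 0.
        return [0.0] * len(c)
    return [(x - mean) / std for x in c]
```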

