
Clustering over Multiple Evolving Streams by Events and Correlations Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen Electrical Engineering, National Taiwan University.


1 Clustering over Multiple Evolving Streams by Events and Correlations Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen Electrical Engineering, National Taiwan University IEEE Transactions on Knowledge and Data Engineering (TKDE) 2007

2 Outline Introduction Data Summarization Similarity Measurement COMET-CORE Framework Empirical Studies Conclusion

3 Introduction (1) Good clustering puts similar objects together and separates dissimilar ones into different clusters. Useful information from clusters:  Data collection in sensor networks  Stock market trades

4 Introduction (2) Existing approaches:  Online data summarization with offline clustering  Periodic online clustering Drawbacks noted on the slide: waste and loss of information.

5 Introduction (3) COMET-CORE  Use online piecewise linear line segments to approximate the original data  Update correlations when a stream encounters a new end point  Update clusters by the updated correlations [Figure: data points, end points, and the update of stream correlations]

6 Data Summarization (1) Problem Model  Γ = {S_1, S_2, …, S_n}  S_i = S_i[1, …, t, …]: the i-th stream  S_i[t]: data of S_i arriving at time t  S_i^app[t]: approximated data of S_i at time t  The end-point summary of stream S_i  Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.

7 Data Summarization (2) Approximation Line Formulation  For a sub-stream S_i[t_s, …, t_e]  Parameters: the two end points (t_s, S_i[t_s]) and (t_e, S_i[t_e])
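The slide's line formula is not preserved in the transcript. A minimal sketch, assuming the approximation is the straight line joining the two stored end points (ordinary linear interpolation); the function name `approx_value` is hypothetical:

```python
def approx_value(t, t_s, v_s, t_e, v_e):
    """Value at time t on the straight line through (t_s, v_s) and (t_e, v_e)."""
    if t_e == t_s:
        # Degenerate segment of a single point.
        return v_s
    slope = (v_e - v_s) / (t_e - t_s)
    return v_s + slope * (t - t_s)


# Only the two end points of the sub-stream are stored:
# (t_s, S_i[t_s]) = (3, 10.0) and (t_e, S_i[t_e]) = (7, 18.0).
print(approx_value(5, 3, 10.0, 7, 18.0))  # prints 14.0
```

Storing two end points per segment is what lets the summary stay much smaller than the raw stream.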

8 Data Summarization (3) Error Function Error Threshold  It may not be easy to give a proper absolute error threshold  Relative error threshold (e.g., 2% of the square sum of the original data stream)

9 Data Summarization (4) Online Linear Line Segment Approximation  Error < threshold δ_l: keep extending the current segment  Error > threshold δ_l: generate a new end point [Figure: value versus time, with end points t_v1, …, t_vk]
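The segmentation rule can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: the error is taken to be the sum of squared deviations from the line joining the segment's first and latest points, and the threshold is relative (a fraction of the square sum of the original values in the segment). The names `segment_stream`, `_line_error`, and `_square_sum` are hypothetical, and the sketch recomputes the error over the open segment instead of maintaining it incrementally.

```python
def _line_error(values, s, e):
    """Sum of squared deviations from the line joining points s and e."""
    slope = (values[e] - values[s]) / (e - s)
    return sum((values[t] - (values[s] + slope * (t - s))) ** 2
               for t in range(s, e + 1))


def _square_sum(values, s, e):
    """Square sum of the original values in positions s..e."""
    return sum(values[t] ** 2 for t in range(s, e + 1))


def segment_stream(values, rel_threshold=0.02):
    """Return the indices of the end points of a piecewise linear summary."""
    if not values:
        return []
    end_points = [0]
    start = 0
    for t in range(1, len(values)):
        if t - start < 2:
            continue  # a line through two points has zero error
        error = _line_error(values, start, t)
        if error > rel_threshold * _square_sum(values, start, t):
            # The previous point closes the segment and opens the next one.
            end_points.append(t - 1)
            start = t - 1
    if end_points[-1] != len(values) - 1:
        end_points.append(len(values) - 1)  # close the last open segment
    return end_points


# A triangular stream is summarized by its start, peak, and end.
print(segment_stream([0, 1, 2, 3, 2, 1, 0]))  # prints [0, 3, 6]
```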

10 Similarity Measurement (1)  Use Pearson correlation as the similarity measure  Regard two streams as two different random variables

11 Similarity Measurement (2) Definition 4.2. Given two streams S_i and S_j, and a weight function w(t), the weighted correlation coefficient between these two streams is defined as:
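The formula of Definition 4.2 is not preserved in the transcript. As a hedged illustration only, the sketch below computes the standard weighted Pearson correlation coefficient; whether the paper's definition matches it term by term is an assumption. The function name `weighted_corr` and the exponential-decay weights in the usage lines are hypothetical.

```python
import math


def weighted_corr(x, y, w):
    """Standard weighted Pearson correlation of sequences x and y with weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov_xy = sum(wi * (xi - mean_x) * (yi - mean_y)
                 for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant stream has no defined correlation
    return cov_xy / math.sqrt(var_x * var_y)


# Hypothetical weight function: recent time steps count more.
x = [1.0, 2.0, 3.0, 4.0]
y = [8.0, 6.0, 4.0, 2.0]
w = [0.5 ** (len(x) - 1 - t) for t in range(len(x))]
r = weighted_corr(x, y, w)  # close to -1: the streams move in opposite directions
```

With uniform weights the function reduces to the ordinary Pearson coefficient.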

12 Similarity Measurement (3) Definition 4.3. Given two streams S_i and S_j, and a weight function w(t), the WC vector of S_i and S_j is defined as:

13 Similarity Measurement (4) Similarity Update  Update the WC vector when a new end point is generated  A linear scan of the data streams is replaced by an incremental update
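The contents of the WC vector are not preserved in the transcript. The sketch below shows one common way to make a correlation incrementally updatable: keep weighted running sums and derive the coefficient from them on demand. It is an assumption that the WC vector plays a similar role; the class name `RunningCorrelation` and its fields are hypothetical, and the sketch updates per value pair, not per end point.

```python
import math


class RunningCorrelation:
    """Weighted running sums from which a correlation can be read at any time."""

    def __init__(self):
        self.sw = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y, w=1.0):
        """Fold one new pair of values into the sums in O(1) time."""
        self.sw += w
        self.sx += w * x
        self.sy += w * y
        self.sxx += w * x * x
        self.syy += w * y * y
        self.sxy += w * x * y

    def corr(self):
        """Correlation derived from the sums, without rescanning the streams."""
        if self.sw == 0:
            return 0.0
        mx, my = self.sx / self.sw, self.sy / self.sw
        cov = self.sxy / self.sw - mx * my
        vx = self.sxx / self.sw - mx * mx
        vy = self.syy / self.sw - my * my
        if vx <= 0 or vy <= 0:
            return 0.0
        return cov / math.sqrt(vx * vy)


rc = RunningCorrelation()
for x, y in zip([1, 2, 3, 4], [1, 3, 2, 4]):
    rc.update(x, y)
print(round(rc.corr(), 6))  # prints 0.8
```

Each update touches six numbers, so the cost per new value does not grow with the length of the streams.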

14 Similarity Measurement (5)...

15 COMET-CORE Framework (1) Definition 5.1. Assume that the centers of two clusters C_i and C_j are represented by the end-point sequences of streams S_i and S_j, respectively. Then the WC vector of the two clusters is equal to the WC vector of S_i and S_j, and the weighted correlation between C_i and C_j, denoted by wcorr(C_i, C_j), is equal to wcorr(S_i, S_j). COMET-CORE flow: a stream encounters a new end point, then split cluster, then merge cluster.

16 COMET-CORE Framework (2) Split cluster C_k  Update the weighted correlations  Compare the correlations with δ_a to form new groups of trigger streams  Place the non-trigger streams in C_tmp  Compare the correlation between each non-trigger stream and the representative stream with δ_a  Example result: three new groups C_new1, C_new2, C_new3

17 COMET-CORE Framework (3) Assign WC vectors to newly generated clusters  Type 1: C_i and C_j originally belong to the same cluster.  Type 2: C_i and C_j originally belong to different clusters.  Type 3: C_i is a newly generated cluster, C_oo is an originally existing one. [Figure: panels (a) Type 1, (b) Type 2, (c) Type 3, illustrated with streams S_1 to S_7 and S_11 to S_14]

18 COMET-CORE Framework (4) Merge Cluster  Performed after splitting and updating the inter-cluster correlations  Two clusters are merged if their correlation ≥ δ_e; merging repeats until no such cluster pair exists.  Example: wcorr(C_1, C_2) ≥ δ_e, so C_1 and C_2 are merged into C_new, and wcorr(C_new, C_k) = min(wcorr(C_1, C_k), wcorr(C_2, C_k))
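The merge step can be sketched as a loop over a table of pairwise cluster correlations. Assumptions in this sketch: correlations are stored in a dictionary keyed by unordered pairs of cluster names, the most correlated qualifying pair is merged first (the slide does not state an order), and the merged cluster inherits the minimum of its two parents' correlations with every other cluster. The function name `merge_clusters` and the naming of merged clusters are hypothetical.

```python
def merge_clusters(clusters, wcorr, delta_e):
    """Merge cluster pairs whose correlation is >= delta_e until none remains.

    clusters: dict mapping cluster name -> set of stream ids
    wcorr:    dict mapping frozenset({name_a, name_b}) -> weighted correlation
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    wcorr = dict(wcorr)
    while True:
        candidates = [(c, pair) for pair, c in wcorr.items() if c >= delta_e]
        if not candidates:
            break
        _, pair = max(candidates, key=lambda item: item[0])
        a, b = sorted(pair)
        new = f"{a}+{b}"
        clusters[new] = clusters.pop(a) | clusters.pop(b)
        for other in list(clusters):
            if other == new:
                continue
            # The merged cluster keeps the weaker of its parents' correlations.
            wcorr[frozenset((new, other))] = min(
                wcorr.pop(frozenset((a, other))),
                wcorr.pop(frozenset((b, other))),
            )
        del wcorr[pair]
    return clusters, wcorr


clusters = {"C1": {"S1"}, "C2": {"S2"}, "Ck": {"S3"}}
wcorr = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "Ck")): 0.6,
    frozenset(("C2", "Ck")): 0.4,
}
merged, corr_after = merge_clusters(clusters, wcorr, delta_e=0.8)
print(sorted(merged))  # prints ['C1+C2', 'Ck']
```

Taking the minimum is conservative: the merged cluster is treated as only as close to C_k as its less correlated parent was.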

19 Empirical Studies (1) Clustering algorithms  Basic: periodic agglomerative clustering  ODAC: periodic hierarchical clustering  COMET-CORE [Figure labels: all streams; Dissimilarity > Threshold; 2Dis(P) - (Dis(C_1) + Dis(C_2)) < Threshold; clustering result]

20 Empirical Studies (2) Clustering quality measurement  Silhouette validation: a(S_i) is the average dissimilarity of stream S_i to all other streams in the same cluster; b(S_i) is the average dissimilarity of stream S_i to all streams in the closest other cluster  Cluster Silhouette  Global Silhouette
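The silhouette formulas are not preserved in the transcript. The sketch below uses the standard silhouette definition, s(S_i) = (b(S_i) - a(S_i)) / max(a(S_i), b(S_i)), and assumes that the cluster silhouette is the mean over a cluster's streams and the global silhouette is the mean over clusters. The dissimilarity function `dist` is a hypothetical stand-in (absolute difference of one-dimensional positions), and at least two clusters are required.

```python
from statistics import mean


def silhouette(i, clusters, dist):
    """Silhouette of item i; clusters is a list of lists, with at least two clusters."""
    own = next(c for c in clusters if i in c)
    same = [j for j in own if j != i]
    if not same:
        return 0.0  # convention for singleton clusters
    a = mean(dist(i, j) for j in same)
    b = min(mean(dist(i, j) for j in c) for c in clusters if c is not own)
    if max(a, b) == 0:
        return 0.0
    return (b - a) / max(a, b)


def cluster_silhouette(cluster, clusters, dist):
    """Mean silhouette of the items in one cluster."""
    return mean(silhouette(i, clusters, dist) for i in cluster)


def global_silhouette(clusters, dist):
    """Mean of the cluster silhouettes."""
    return mean(cluster_silhouette(c, clusters, dist) for c in clusters)


# Hypothetical toy data: four items on a line, forming two tight groups.
position = {0: 0.0, 1: 1.0, 2: 10.0, 3: 11.0}
clusters = [[0, 1], [2, 3]]
dist = lambda i, j: abs(position[i] - position[j])
score = global_silhouette(clusters, dist)  # close to 1 for well-separated clusters
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or misassigned items.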

21 Empirical Studies (3) Evaluation on Real Data  δ_a = δ_e = 0.5 Data Sets

22 Empirical Studies (4) Evaluation on Cylinder-Bell-Funnel Data Set  δ_a = δ_e = 0.8  6 types, 100 streams for each type (600 streams in total), each 128 points long  Normally distributed random numbers ranging from 0 to 1 are added to each stream

23 Empirical Studies (5) Evaluation on Random Walk Data Set  δ_a = δ_e = 0.7  Period = 200 data points (Basic & ODAC)  1. Number of streams: 20000 points in each stream  2. Number of clusters: fixed 500 streams; almost independent of the number of clusters

24 Conclusion The paper proposes COMET-CORE, a novel and efficient online framework for clustering over multiple evolving streams. COMET-CORE uses efficient split and merge algorithms to update clusters while maintaining good clustering quality.

