1
Clustering over Multiple Evolving Streams by Events and Correlations. Mi-Yen Yeh, Bi-Ru Dai, Ming-Syan Chen. Electrical Engineering, National Taiwan University. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2007.
2
Outline: Introduction; Data Summarization; Similarity Measurement; COMET-CORE Framework; Empirical Studies; Conclusion.
3
Introduction (1) Good clustering puts similar objects together and separates dissimilar ones into different clusters. Clusters provide useful information in applications such as data collection in sensor networks and stock market trades. [Figure: objects A through G grouped into clusters]
4
Introduction (2) Two existing approaches: online data summarization with offline clustering, and periodical online clustering. [Figure: clusters of objects A through G delivered to the user, annotated "Waste!!" and "Lose information!!"]
5
Introduction (3) COMET-CORE: Use online piecewise linear line segments to approximate the original data. Update correlations when a stream encounters a new end point. Update clusters using the updated correlations. [Figure: data points, end points, and the update of stream correlations]
6
Data Summarization (1) Problem Model. Γ = {S_1, S_2, …, S_n} is the set of streams. S_i = S_i[1, …, t, …] is the i-th stream. S_i[t] is the data value of S_i arriving at time t. S_i^app[t] is the approximated value of S_i at time t. Each stream S_i also has an end-point summary. Objective: given a set of data streams Γ and the threshold parameters, monitor the stream clusters online.
7
Data Summarization (2) Approximation Line Formulation. For a sub-stream S_i[t_s, …, t_e], the parameters of the approximation line are its two end points: (t_s, S_i[t_s]) and (t_e, S_i[t_e]).
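The slide does not preserve the line equation itself. As a minimal sketch, assuming the approximation is plain linear interpolation between the two end points, the approximated value at a time t inside the sub-stream could be computed as follows. The function name `approx_value` and its parameter names are hypothetical.

```python
def approx_value(t, t_s, v_s, t_e, v_e):
    """Value at time t on the straight line through (t_s, v_s) and (t_e, v_e).

    Assumption: the approximation line is ordinary linear interpolation
    between the two end points of the sub-stream.
    """
    if t_e == t_s:
        # Degenerate segment with a single point.
        return v_s
    slope = (v_e - v_s) / (t_e - t_s)
    return v_s + slope * (t - t_s)


# Halfway between (0, 0.0) and (4, 8.0).
print(approx_value(2, 0, 0.0, 4, 8.0))
```

Only the two end points need to be stored per segment; every intermediate value can be recomputed from them on demand.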
8
Data Summarization (3) Error Function and Error Threshold. It may not be easy to give a proper absolute error threshold, so a relative error threshold is used (for example, 2% of the sum of squares of the original data stream).
9
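Data Summarization (4) Online Linear Line Segment Approximation. While the error stays below the threshold δ_l, the current segment continues; when the error exceeds δ_l, a new end point is generated. [Figure: value versus time, with end points t_v1 through t_vk]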
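The procedure on this slide can be sketched roughly as below. This is an illustration under stated assumptions, not the paper's exact algorithm: the error of a segment is assumed to be the sum of squared gaps between the data and the end-to-end line, and the threshold δ_l is assumed to be a fixed fraction (`rel_threshold`, a hypothetical parameter) of the sum of squares of the data in the segment. The names `segment_error` and `online_segments` are also hypothetical.

```python
def segment_error(points):
    """Sum of squared gaps between interior points and the end-to-end line."""
    (t_s, v_s), (t_e, v_e) = points[0], points[-1]
    slope = (v_e - v_s) / (t_e - t_s)
    return sum((v - (v_s + slope * (t - t_s))) ** 2 for t, v in points[1:-1])


def online_segments(stream, rel_threshold=0.02):
    """Return the end points (t, value) of a piecewise linear approximation.

    Assumption: a new end point is generated as soon as extending the
    current segment would push its error above the relative threshold.
    """
    end_points = []
    segment = []  # points since the last end point, inclusive
    for t, v in enumerate(stream):
        if not segment:
            # The first arriving point is always an end point.
            segment = [(t, v)]
            end_points.append((t, v))
            continue
        candidate = segment + [(t, v)]
        limit = rel_threshold * sum(x * x for _, x in candidate)
        if segment_error(candidate) > limit:
            # The previous point closes the segment and starts the next one.
            end_points.append(segment[-1])
            segment = [segment[-1], (t, v)]
        else:
            segment = candidate
    if len(segment) > 1:
        # Close the last open segment.
        end_points.append(segment[-1])
    return end_points


print(online_segments([0, 1, 2, 3, 2, 1, 0]))
```

This sketch buffers the points of the open segment for clarity. A streaming implementation would more likely keep running sums so that each arrival is handled in constant time.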
10
Similarity Measurement (1) Use Pearson correlation as the similarity measure, regarding two streams as two different random variables.
11
Similarity Measurement (2) Definition 4.2. Given two streams S_i and S_j, and a weight function w(t), the weighted correlation coefficient between these two streams is defined as: [equation not preserved in the slide text]
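Since the formula of Definition 4.2 is not preserved here, the following is only a sketch of the standard weighted Pearson correlation, which may differ in detail from the paper's definition. The function name `weighted_corr` and the convention of returning 0.0 for a constant stream are assumptions made for this example.

```python
import math


def weighted_corr(x, y, w):
    """Standard weighted Pearson correlation of two equal-length sequences.

    Assumption: w[t] plays the role of the weight function w(t); this is
    the textbook formula, not necessarily the paper's Definition 4.2.
    """
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y)) / total
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x)) / total
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y)) / total
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant stream; 0.0 is a convention.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Two perfectly linearly related streams with uniform weights.
print(weighted_corr([1, 2, 3], [2, 4, 6], [1, 1, 1]))
```

With uniform weights this reduces to the ordinary Pearson correlation; non-uniform weights let some time points count more than others.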
12
Similarity Measurement (3) Definition 4.3. Given two streams S_i and S_j, and a weight function w(t), the WC vector of S_i and S_j is defined as: [equation not preserved in the slide text]
13
Similarity Measurement (4) Similarity Update. The WC vector is updated when a new end point is generated. Instead of a linear scan of the data streams, the update is incremental.
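The idea of replacing a linear scan with an incremental update can be illustrated with running sums. The sketch below assumes a small set of weighted sufficient statistics; whether these match the components of the paper's WC vector is not confirmed by the slide text. The class name `RunningCorrelation` is hypothetical, and the sketch updates once per sample, whereas the slide describes updating once per new end point.

```python
import math


class RunningCorrelation:
    """Weighted correlation maintained from running sums (illustrative only)."""

    def __init__(self):
        # Weighted sums: weights, x, y, x*x, y*y, x*y.
        self.sw = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y, w=1.0):
        """Fold one pair of values into the sums in constant time."""
        self.sw += w
        self.sx += w * x
        self.sy += w * y
        self.sxx += w * x * x
        self.syy += w * y * y
        self.sxy += w * x * y

    def corr(self):
        """Correlation computed from the sums, without rescanning any data."""
        if self.sw == 0:
            return 0.0
        mean_x = self.sx / self.sw
        mean_y = self.sy / self.sw
        cov = self.sxy / self.sw - mean_x * mean_y
        var_x = self.sxx / self.sw - mean_x * mean_x
        var_y = self.syy / self.sw - mean_y * mean_y
        if var_x <= 0 or var_y <= 0:
            # Constant stream (or rounding noise): report 0.0 by convention.
            return 0.0
        return cov / math.sqrt(var_x * var_y)


rc = RunningCorrelation()
for x, y in [(1, 2), (2, 4), (3, 6)]:
    rc.update(x, y)
print(rc.corr())
```

The cost of each update is independent of the stream length, which is the property the slide contrasts with a linear scan.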
14
Similarity Measurement (5)...
15
COMET-CORE Framework (1) Definition 5.1. Assume that the centers of two clusters C_i and C_j are represented by the end point sequences of streams S_i and S_j, respectively. Then the WC vector of the two clusters is equal to the WC vector of S_i and S_j, and the weighted correlation between C_i and C_j, denoted by wcorr(C_i, C_j), is equal to wcorr(S_i, S_j). COMET-CORE: when a stream encounters a new end point, clusters are split and then merged.
16
COMET-CORE Framework (2) Split cluster C_k. Update the weighted correlation. Compare the correlation with δ_a: trigger streams form new trigger groups; non-trigger streams go to C_tmp. Compare the correlation between each non-trigger stream and the representative stream with δ_a. [Figure: three new groups C_new1, C_new2, C_new3]
17
COMET-CORE Framework (3) Assign WC vectors to newly generated clusters. Type 1: C_i and C_j originally belonged to the same cluster. Type 2: C_i and C_j originally belonged to different clusters. Type 3: C_i is a newly generated cluster and C_oo is an originally existing one. [Figure, panels (a) Type 1, (b) Type 2, (c) Type 3: streams S_1 to S_7 and S_11 to S_14 split into groups {S_1, S_2, S_3}, {S_4, S_5}, {S_6, S_7}, {S_11, S_12}, {S_13, S_14}, with cluster labels C_1, C_4, C_6, C_11, C_14, C_x, C_y]
18
COMET-CORE Framework (4) Merge Cluster. After splitting and updating the inter-cluster correlations, two clusters are merged if their correlation is ≥ δ_e; this repeats until no such cluster pair exists. Example: if wcorr(C_1, C_2) ≥ δ_e, C_1 and C_2 are merged into C_new, and wcorr(C_new, C_k) = min(wcorr(C_1, C_k), wcorr(C_2, C_k)).
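The merge step can be sketched as a loop over a table of inter-cluster correlations. This is an illustration, not the paper's implementation: the data layout (a dict keyed by `frozenset` pairs), the choice to merge the most correlated pair first, and the naming of merged clusters are all assumptions. The new cluster's correlation to every other cluster is taken as the minimum of the two merged clusters' correlations to it.

```python
def merge_clusters(clusters, wcorr, delta_e):
    """Merge cluster pairs whose correlation is at least delta_e.

    clusters: dict mapping a cluster name to a set of stream ids.
    wcorr:    dict mapping frozenset({a, b}) to the correlation of a and b;
              assumed to contain an entry for every pair of clusters.
    Returns new (clusters, wcorr) dicts; the inputs are not modified.
    """
    clusters = {name: set(members) for name, members in clusters.items()}
    wcorr = dict(wcorr)
    while True:
        candidates = [(c, pair) for pair, c in wcorr.items() if c >= delta_e]
        if not candidates:
            return clusters, wcorr
        # Assumption: merge the most correlated qualifying pair first.
        _, pair = max(candidates, key=lambda item: item[0])
        a, b = sorted(pair)
        new_name = f"{a}+{b}"
        members = clusters.pop(a) | clusters.pop(b)
        del wcorr[pair]
        for other in clusters:
            corr_a = wcorr.pop(frozenset((a, other)))
            corr_b = wcorr.pop(frozenset((b, other)))
            # Conservative rule: keep the weaker of the two correlations.
            wcorr[frozenset((new_name, other))] = min(corr_a, corr_b)
        clusters[new_name] = members


clusters = {"C1": {1, 2}, "C2": {3}, "Ck": {4}}
wcorr = {
    frozenset(("C1", "C2")): 0.9,
    frozenset(("C1", "Ck")): 0.3,
    frozenset(("C2", "Ck")): 0.6,
}
print(merge_clusters(clusters, wcorr, delta_e=0.8))
```

Because the merged correlation is a minimum, it never exceeds either input, so a merge cannot by itself make a previously unqualified pair qualify through the new cluster.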
19
Empirical Studies (1) Clustering algorithms compared. Basic: periodic agglomerative clustering. ODAC: periodic hierarchical clustering. COMET-CORE. [Figure: from all streams to the clustering result, with the conditions "Dissimilarity > Threshold" and "2Dis(P) - (Dis(C_1) + Dis(C_2)) < Threshold"]
20
Empirical Studies (2) Clustering quality measurement: Silhouette validation. a(S_i) is the average dissimilarity of stream S_i to all other streams in the same cluster. b(S_i) is the average dissimilarity of stream S_i to all streams in the closest other cluster. Cluster Silhouette. Global Silhouette.
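The silhouette formulas are not preserved on this slide. The sketch below uses the standard definitions as an assumption: the silhouette of a stream is (b - a) / max(a, b), the cluster silhouette is the mean over the streams of a cluster, and the global silhouette is the mean of the cluster silhouettes. The function name `silhouettes`, the dissimilarity matrix layout, and the value 0.0 for singleton clusters are choices made for this example; at least two clusters are required.

```python
from statistics import mean


def silhouettes(dissim, clusters):
    """Return (cluster silhouettes, global silhouette).

    dissim:   square matrix, dissim[i][j] is the dissimilarity of streams i, j.
    clusters: list of lists of stream indices (at least two clusters).
    """
    per_cluster = []
    for ci, members in enumerate(clusters):
        scores = []
        for i in members:
            others = [j for j in members if j != i]
            if not others:
                # Singleton cluster: silhouette set to 0.0 by convention.
                scores.append(0.0)
                continue
            # a: average dissimilarity to the other streams in the same cluster.
            a = mean(dissim[i][j] for j in others)
            # b: average dissimilarity to the closest other cluster.
            b = min(
                mean(dissim[i][j] for j in other)
                for cj, other in enumerate(clusters)
                if cj != ci
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
        per_cluster.append(mean(scores))
    return per_cluster, mean(per_cluster)


# Two tight clusters: dissimilarity 1 inside a cluster, 3 across clusters.
d = [
    [0, 1, 3, 3],
    [1, 0, 3, 3],
    [3, 3, 0, 1],
    [3, 3, 1, 0],
]
print(silhouettes(d, [[0, 1], [2, 3]]))
```

Values close to 1 indicate streams that sit well inside their cluster; values near 0 or below indicate streams near a cluster boundary or in the wrong cluster.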
21
Empirical Studies (3) Evaluation on Real Data. δ_a = δ_e = 0.5. Data Sets
22
Empirical Studies (4) Evaluation on Cylinder-Bell-Funnel Data Set. δ_a = δ_e = 0.8. 6 types with 100 streams for each type (600 streams in total), each stream 128 points long. Normally distributed random numbers ranging from 0 to 1 are added to each stream.
23
Empirical Studies (5) Evaluation on Random Walk Data Set. δ_a = δ_e = 0.7. Period = 200 data points (Basic and ODAC). 20000 points in each stream. [Figure panels: 1. Number of streams; 2. Number of clusters, with the number of streams fixed at 500, annotated "almost independent of the number of clusters"]
24
Conclusion. The paper proposes COMET-CORE, a novel and efficient online clustering framework for clustering over streams. COMET-CORE uses efficient split and merge algorithms to modify clusters while maintaining good clustering quality.