1 Mixed-Attribute Clustering and Weighted Clustering Presented by: Yiu Man Lung 24 January, 2003

2 Outline
- Mixed-Attribute Clustering
  - ROCK, CACTUS
  - Links between mixed attributes
- Weighted Clustering
  - PROCLUS
  - Weights, distance and similarity measures
  - Methods of computing the weights
- Conclusion

3 Mixed-Attribute Clustering
- Most real datasets have mixed attributes
  - numeric (continuous): total ordering
  - categorical (discrete): no total ordering
- Few clustering algorithms for mixed attributes
- Combined information from mixed attributes may be useful for clustering
- Use the context to compute the distance/similarity measure instead of using a fixed measure
- Apply the concept of links (from ROCK) and the concept of strongly connected attribute values (from CACTUS)

4 ROCK
- Hierarchical, agglomerative clustering algorithm
- We only focus on the concept of links
- Given records T_i, T_j:
  - Their similarity is sim(T_i, T_j)
  - They are neighbors if sim(T_i, T_j) ≥ θ
  - link(T_i, T_j) is the number of their common neighbors
- Relationship between links and clusters:
  - High intra-cluster similarity within clusters: many links within clusters
  - High inter-cluster dissimilarity among clusters: few cross links among clusters

5 Example of links
- Jaccard coefficient: sim(T_i, T_j) = |T_i ∩ T_j| / |T_i ∪ T_j|
- sim({1,2,3},{1,2,7}) = 0.5 (different clusters)
- sim({1,2,3},{3,4,5}) = 0.2 (same cluster)
- For links, let θ = 0.5
  - link({1,2,3},{1,2,7}) = 3
  - link({1,2,3},{3,4,5}) = 4
- Figure 1: Basket data example (adapted from [1])
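The link computation can be sketched in Python as follows (illustrative, not from the original slides); it assumes each record is a set of items and follows the Jaccard and link definitions above, with function names chosen for illustration.

```python
# A minimal sketch of ROCK-style links over set-valued records.
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two item sets."""
    return len(a & b) / len(a | b)

def links(records, theta):
    """link(i, j) = number of common neighbors; two records are
    neighbors if their Jaccard similarity is at least theta."""
    n = len(records)
    neighbors = [set() for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if jaccard(records[i], records[j]) >= theta:
            neighbors[i].add(j)
            neighbors[j].add(i)
    # count common neighbors for every pair of records
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i, j in combinations(range(n), 2)}
```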

6 CACTUS
- Clusters are generated and validated in later phases
- We only focus on the summarization (first) phase
  - inter-attribute summary Σ_IJ (links among attribute values)
  - intra-attribute summary Σ_II (similarity of attribute values)
- Some notation:
  - Dataset: D
  - Tuple: t
  - Categorical attributes: A_1, A_2, …, A_n
  - Domains: D_1, D_2, …, D_n
  - Values in domains: V_{1,*}, V_{2,*}, …, V_{n,*}

7 Support and similarity
- Let i ≠ j; the support σ_D(V_{i,x}, V_{j,y}) is defined as
  σ_D(V_{i,x}, V_{j,y}) = |{t ∈ D : t.A_i = V_{i,x} ∧ t.A_j = V_{j,y}}|
- Let α > 1; V_{i,x} and V_{j,y} are strongly connected if
  σ_D(V_{i,x}, V_{j,y}) > α * |D| / (|D_i| * |D_j|)
  (the right-hand side is α times the expected support under the attribute independence assumption)
- σ*_D(V_{i,x}, V_{j,y}) = σ_D(V_{i,x}, V_{j,y}) if they are strongly connected, 0 otherwise
- The similarity γ_j(V_{i,x}, V_{i,z}) with respect to D_j (i ≠ j) is defined as
  γ_j(V_{i,x}, V_{i,z}) = |{u ∈ D_j : σ*_D(V_{i,x}, u) > 0 ∧ σ*_D(V_{i,z}, u) > 0}|
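A minimal Python sketch of these definitions, assuming each tuple is a dict from attribute name to categorical value and domains maps each attribute to the collection of its values; the function names and the default α are illustrative.

```python
def support(D, Ai, vx, Aj, vy):
    """σ_D(V_{i,x}, V_{j,y}): tuples with t[Ai] == vx and t[Aj] == vy."""
    return sum(1 for t in D if t[Ai] == vx and t[Aj] == vy)

def strong_support(D, Ai, vx, Aj, vy, domains, alpha=2.0):
    """σ*_D: keep the support only if it exceeds alpha times the expected
    support under the attribute-independence assumption."""
    s = support(D, Ai, vx, Aj, vy)
    expected = len(D) / (len(domains[Ai]) * len(domains[Aj]))
    return s if s > alpha * expected else 0

def gamma_j(D, Ai, vx, vz, Aj, domains, alpha=2.0):
    """γ_j(V_{i,x}, V_{i,z}): values u of A_j strongly connected to both."""
    return sum(1 for u in domains[Aj]
               if strong_support(D, Ai, vx, Aj, u, domains, alpha) > 0
               and strong_support(D, Ai, vz, Aj, u, domains, alpha) > 0)
```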

8 Example of similarities
- Figure 2: Inter-attribute summary Σ_IJ (adapted from [2]), showing links among attribute values of different attributes
- Figure 3: Intra-attribute summary Σ_II (adapted from [2]), showing the number of common neighbors in another attribute

9 Summary of the two previous concepts
- Links and strongly connected attribute values are defined for categorical data
- The former is for tuples and the latter is for attribute values
- The latter can be viewed as "links" between attribute values
- Both need to be extended to mixed attributes

10 Links between mixed attributes
- Suppose A_i is categorical and A_j is numeric: how do we compute the similarity γ_j(V_{i,x}, V_{i,z}) with respect to D_j (i ≠ j)?
- Multiset: a set with multiplicity
  - {3,7} and {7,3} are equivalent
  - {3,7} and {3,3,7} are different
- The multiset of values of A_j for tuples with A_i = V_{i,x} is defined by
  MSet(V_{i,x}, i, j) = {t.A_j : t ∈ D and t.A_i = V_{i,x}}
- Example: D = {(a,5), (a,6), (b,7), (b,8)} gives MSet(a,1,2) = {5,6} and MSet(b,1,2) = {7,8}
- The similarity γ_j(V_{i,x}, V_{i,z}) can be computed from MSet(V_{i,x}, i, j) and MSet(V_{i,z}, i, j)
- Figure 4: Inter-attribute summary of mixed attributes
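A possible MSet sketch in Python; it assumes tuples are plain Python tuples with 0-based attribute positions (the slide uses 1-based indices) and represents the multiset as a Counter.

```python
from collections import Counter

def mset(D, v, i, j):
    """Multiset of A_j values over tuples whose A_i value equals v."""
    return Counter(t[j] for t in D if t[i] == v)

D = [("a", 5), ("a", 6), ("b", 7), ("b", 8)]
print(mset(D, "a", 0, 1))  # Counter({5: 1, 6: 1})
print(mset(D, "b", 0, 1))  # Counter({7: 1, 8: 1})
```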

11 Links between mixed attributes (1): Histogram
- Represent MSet(V_{i,x}, i, j) by a histogram Hist(V_{i,x}, i, j)
- Compute the similarity γ_j(V_{i,x}, V_{i,z}) as sim(Hist(V_{i,x}, i, j), Hist(V_{i,z}, i, j))
- Histogram intersection: Sim = 2 + 3 + 3 + 4 + 5 + 4 + 3 = 24 (needs to be normalized)
- A more robust method also considers adjacent regions
- Figure 5: Histogram Hist(V_{i,x}, i, j); Figure 6: Histogram Hist(V_{i,z}, i, j)
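One way the histogram intersection could be implemented; histograms are assumed to be dicts mapping a bin to its count, and normalizing by the smaller total mass is only one of several reasonable choices, not fixed by the slides.

```python
def histogram_intersection(h1, h2):
    """Sum of bin-wise minima, normalized to [0, 1] by the smaller mass."""
    bins = set(h1) | set(h2)
    overlap = sum(min(h1.get(b, 0), h2.get(b, 0)) for b in bins)
    return overlap / min(sum(h1.values()), sum(h2.values()))
```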

12 Links between mixed attributes (2): Normal approximation
- Assume MSet(V_{i,x}, i, j) follows a normal distribution
- The mean and variance of MSet(V_{i,x}, i, j) then describe it approximately
- Compute the similarity γ_j(V_{i,x}, V_{i,z}) from the means and variances of MSet(V_{i,x}, i, j) and MSet(V_{i,z}, i, j)
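The slides do not fix a particular similarity for the two fitted normal distributions; as one hedged possibility, the sketch below uses the Bhattacharyya coefficient of two univariate Gaussians (1 means identical distributions, values near 0 mean little overlap).

```python
import math
import statistics

def fit_gaussian(values):
    """Mean and population variance of a multiset of numeric values."""
    mu = statistics.fmean(values)
    return mu, statistics.pvariance(values, mu)

def gaussian_similarity(values_x, values_z):
    """Bhattacharyya coefficient of the two fitted normal distributions."""
    mu1, v1 = fit_gaussian(values_x)
    mu2, v2 = fit_gaussian(values_z)
    v1, v2 = max(v1, 1e-12), max(v2, 1e-12)  # guard against zero variance
    d = 0.25 * (mu1 - mu2) ** 2 / (v1 + v2) \
        + 0.5 * math.log((v1 + v2) / (2 * math.sqrt(v1 * v2)))
    return math.exp(-d)
```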

13 Weighted clustering
- Clustering is not meaningful in high-dimensional spaces because of irrelevant attributes
- Clusters may form in subspaces
  - Projected clustering algorithms find the subspaces of the clusters during cluster formation
- Different attributes may have different relevances to different clusters
  - Weighted clustering algorithms determine the weights of the attributes in each cluster
- Users can interpret the weights in a meaningful way

14 Example on weighted clustering
- Assume there are 3 attributes X, Y and Z.
- Figure 7: Projection on the X-Y plane (adapted from [3]); Figure 8: Projection on the X-Z plane (adapted from [3])

              Projected clustering   Weighted clustering
  Cluster 1   {X, Z}                 w_x = 0.45, w_y = 0.10, w_z = 0.45
  Cluster 2   {X, Y}                 w_x = 0.45, w_y = 0.45, w_z = 0.10

15 PROCLUS
- A projected clustering algorithm
- Medoid-based and efficient
- Some disadvantages:
  - Clusters with fewer than (|D|/k) * minDev points are considered bad
  - The quality of the clusters depends on the medoids
- Example: number of clusters (k) = 2
  - Unlucky case: 2 medoids are drawn from the same cluster
  - That cluster will be split into two small clusters
  - Points in the other cluster become misses or outliers
- Figure 9: An example of clusters

16 Definition of weights
- Assume there are k clusters and m attributes. The weights of the clusters must satisfy:
  - ∀ i ∈ [1,k], j ∈ [1,m]: w_{i,j} is a real number with w_{i,j} ∈ [0,1]
  - ∀ i ∈ [1,k]: ∑_{j ∈ [1,m]} w_{i,j} = 1
- Note: i and j are integers
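A small helper (an assumption, not part of the slides) that turns non-negative raw per-attribute scores into weights satisfying the two conditions above; the weight-computation methods later can reuse it for their normalization step.

```python
def normalize(scores):
    """Scale non-negative scores so they lie in [0, 1] and sum to 1."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)  # fall back to uniform weights
    return [s / total for s in scores]
```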

17 Weighted measures
- Weighted distance: dist_p(x,y) = ∑_{i ∈ [1,m]} w_{p,i} * dist_{p,i}(x.A_i, y.A_i)
- Weighted similarity: sim_p(x,y) = ∑_{i ∈ [1,m]} w_{p,i} * sim_{p,i}(x.A_i, y.A_i)
- For the weights to be meaningful, dist_{p,i} and sim_{p,i} must return real values in [0,1]
- A simple categorical distance measure: dist_{p,i}(x.A_i, y.A_i) = 0 if x.A_i = y.A_i, 1 otherwise
- A more complex sim_{p,i} will be introduced later
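A minimal sketch of the weighted distance for a cluster p, using the simple 0/1 per-attribute distance from this slide; records are assumed to be equal-length sequences of categorical values, and the weight vector is assumed to be given.

```python
def weighted_distance(x, y, weights):
    """dist_p(x, y) = Σ_i w_{p,i} * [x.A_i != y.A_i]."""
    return sum(w * (0.0 if xi == yi else 1.0)
               for w, xi, yi in zip(weights, x, y))

# Example: the mismatch on the middle attribute dominates because its weight is largest.
print(weighted_distance(("a", "d", "f"), ("a", "e", "f"), [0.2, 0.6, 0.2]))  # 0.6
```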

18 Adapted algorithms from PROCLUS
- Adapt it into a weighted clustering algorithm:
  - Change the FindDimension procedure (for finding relevant attributes) to a FindWeight procedure (for computing the weights of the attributes)
  - Replace the distance functions
- 3 methods for computing the weights follow

19 1st method
- w_{p,i} = Var({ |{t : t ∈ C_p ∧ t.A_i = V_{i,j}}| : V_{i,j} ∈ D_i }), then normalize w_{p,i}
- An attribute with a high variance of counts among its attribute values has high relevance to the cluster
- The variance of the attribute-value counts of an attribute is used as the weight of that attribute in that cluster
- Example cluster (tuples over attributes A, B, C): adf, adg, adh, aei, bej
- Counts of each attribute value in the cluster:
    A: a:4, b:1
    B: d:3, e:2
    C: f:1, g:1, h:1, i:1, j:1
- Var: A = 4.5, B = 0.5, C = 0
- W (after normalization): A = 0.9, B = 0.1, C = 0
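A sketch of the 1st method, assuming a cluster is a list of tuples and domains[i] lists the values of attribute i; it uses the sample variance (n - 1 denominator), which reproduces the worked example above.

```python
from collections import Counter
from statistics import variance

def method1_weights(cluster, domains):
    """Variance of per-value counts for each attribute, then normalized."""
    raw = []
    for i, dom in enumerate(domains):
        counts = Counter(t[i] for t in cluster)
        per_value = [counts.get(v, 0) for v in dom]
        raw.append(variance(per_value) if len(per_value) > 1 else 0.0)
    total = sum(raw)
    return [r / total for r in raw] if total else raw

cluster = [("a","d","f"), ("a","d","g"), ("a","d","h"), ("a","e","i"), ("b","e","j")]
domains = [("a","b"), ("d","e"), ("f","g","h","i","j")]
print(method1_weights(cluster, domains))  # [0.9, 0.1, 0.0]
```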

20 2nd method
- w_{p,i} = |{t : t ∈ C_p ∧ t.A_i = Medoid(C_p).A_i}|, then normalize w_{p,i}
- For each attribute, count the records in the cluster that share the medoid's value for that attribute
- Attributes with high counts get higher weights
- Example cluster (tuples over attributes A, B, C): adf (medoid), adg, adh, aei, bej
- Counts: A = 4, B = 3, C = 1
- w: A = 0.5, B = 0.375, C = 0.125
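A sketch of the 2nd method, assuming the medoid is given as one of the cluster's tuples.

```python
def method2_weights(cluster, medoid):
    """Per attribute, count records sharing the medoid's value; normalize."""
    raw = [sum(1 for t in cluster if t[i] == medoid[i]) for i in range(len(medoid))]
    total = sum(raw)
    return [r / total for r in raw] if total else raw

cluster = [("a","d","f"), ("a","d","g"), ("a","d","h"), ("a","e","i"), ("b","e","j")]
print(method2_weights(cluster, medoid=("a","d","f")))  # [0.5, 0.375, 0.125]
```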

21 3rd method
- The original similarity γ_j(V_{i,x}, V_{i,z}) is too strict because only strongly connected attribute values are considered
- We change the definition from
  γ_j(V_{i,x}, V_{i,z}) = |{u ∈ D_j : σ*_D(V_{i,x}, u) > 0 ∧ σ*_D(V_{i,z}, u) > 0}|
  to
  γ^p_j(V_{i,x}, V_{i,z}) = ∑_{u ∈ D_j} σ_{C_p}(V_{i,x}, u) * σ_{C_p}(V_{i,z}, u)
  γ^p(V_{i,x}, V_{i,z}) = ∑_{j ∈ [1,m], j ≠ i} γ^p_j(V_{i,x}, V_{i,z})

22 3rd method (cont')
- One similarity matrix per attribute per cluster (k*m matrices in total)
- SimMax_{p,i} = max({γ^p(V_{i,x}, V_{i,z}) : V_{i,x}, V_{i,z} ∈ D_i}) is the maximum entry of similarity matrix [p,i]
- sim_{p,i}(V_{i,x}, V_{i,x}) = 1
- sim_{p,i}(V_{i,x}, V_{i,z}) = γ^p(V_{i,x}, V_{i,z}) / SimMax_{p,i}
- w_{p,i} = SimMax_{p,i}, then normalize w_{p,i}
- Example similarity matrix (symmetric entries filled in):
    sim    f     g     s     y
    f      1     0.01  0.68  0.97
    g      0.01  1     0.01  0.01
    s      0.68  0.01  1     1
    y      0.97  0.01  1     1
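A sketch of the 3rd method, assuming σ_{C_p} is the plain co-occurrence count within the cluster (no strong-connection filter) and that SimMax is taken over distinct value pairs of each attribute; all names are illustrative.

```python
from itertools import combinations

def cooccurrence(cluster, i, vx, j, u):
    """σ_{C_p}(V_{i,x}, u): tuples in the cluster with A_i = vx and A_j = u."""
    return sum(1 for t in cluster if t[i] == vx and t[j] == u)

def gamma_p(cluster, domains, i, vx, vz):
    """γ^p(V_{i,x}, V_{i,z}): co-occurrence products summed over the other attributes."""
    total = 0
    for j, dom in enumerate(domains):
        if j != i:
            total += sum(cooccurrence(cluster, i, vx, j, u) *
                         cooccurrence(cluster, i, vz, j, u) for u in dom)
    return total

def method3_weights(cluster, domains):
    """w_{p,i} = SimMax_{p,i} (max γ^p over distinct value pairs), normalized."""
    raw = [max((gamma_p(cluster, domains, i, vx, vz)
                for vx, vz in combinations(dom, 2)), default=0)
           for i, dom in enumerate(domains)]
    total = sum(raw)
    return [r / total for r in raw] if total else raw
```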

23 Extension: pruning of insignificant weights
- Although the weights of irrelevant attributes are low, they can still affect the distance measure if there are too many of them
- Let α > 1; a weight is said to be insignificant if it is lower than 1/(α * m) (m: number of dimensions)
- Insignificant weights are set to zero and all the weights are normalized again
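A sketch of the pruning step, with α as a parameter as in the slide (the example values are illustrative).

```python
def prune_weights(weights, alpha=2.0):
    """Zero out weights below 1/(alpha * m), then renormalize the rest."""
    m = len(weights)
    threshold = 1.0 / (alpha * m)
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    return [w / total for w in kept] if total else kept

print(prune_weights([0.45, 0.45, 0.04, 0.03, 0.03]))  # [0.5, 0.5, 0.0, 0.0, 0.0]
```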

24 Future work
- Redefine the "badness" of the medoids
  - Currently medoids are bad if their clusters are smaller than a predefined size, but real datasets may have clusters of various sizes
- Detect whether different medoids are chosen from the same cluster
- Use other distance or similarity measures
- Study mixed-attribute clustering further
- Adapt other algorithms into weighted clustering algorithms

25 Conclusion
- Mixed-attribute clustering can exploit information from both types of attributes
- Weighted clustering can reduce the effect of noise on the clusters
- The weights are meaningful to end users
- Other algorithms can be adapted into weighted clustering algorithms

26 References
[1] Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366, 2000.
[2] Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS – clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83, 1999.
[3] Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. In ACM SIGMOD International Conference on Management of Data, pages 61–72, 1999.

