Mixed-Attribute Clustering and Weighted Clustering
Presented by: Yiu Man Lung, 24 January 2003


Outline
- Mixed-Attribute Clustering
  - ROCK, CACTUS
  - Links between mixed attributes
- Weighted Clustering
  - PROCLUS
  - Weights, distance and similarity measures
  - Methods of computing the weights
- Conclusion

Mixed-Attribute Clustering
- Most real datasets have mixed attributes:
  - numeric (continuous) attributes → total ordering
  - categorical (discrete) attributes → no total ordering
- Few clustering algorithms handle mixed attributes
- Combined information from mixed attributes may be useful for clustering
- Use the context to compute the distance / similarity measure instead of using a fixed measure
- Apply the concept of links (from ROCK) and the concept of strongly-connected attribute values (from CACTUS)

ROCK
- Hierarchical, agglomerative clustering algorithm; we only focus on the concept of links
- Given records T_i, T_j:
  - Their similarity is sim(T_i, T_j)
  - They are neighbors if sim(T_i, T_j) ≥ θ
  - link(T_i, T_j) is the number of their common neighbors
- Relationship between links and clusters:
  - High intra-cluster similarity within clusters → many links within clusters
  - High inter-cluster dissimilarity among clusters → few cross links among clusters

Example of links
- Jaccard coefficient: sim(T_i, T_j) = |T_i ∩ T_j| / |T_i ∪ T_j|
- sim({1,2,3}, {1,2,7}) = 0.5 (in different clusters)
- sim({1,2,3}, {3,4,5}) = 0.2 (in the same cluster)
- For links, let θ = 0.5:
  - link({1,2,3}, {1,2,7}) = 3
  - link({1,2,3}, {3,4,5}) = 4
Figure 1: Basket data example (adapted from [1])
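
As a concrete illustration, here is a minimal Python sketch of the link computation: neighbor sets are built from the Jaccard coefficient with threshold θ, and link(T_i, T_j) counts common neighbors. The basket list is hypothetical (the Figure 1 data from [1] is not reproduced here), so the printed link counts need not equal the 3 and 4 quoted above, but the two Jaccard values do come out as 0.5 and 0.2.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two transactions (sets)."""
    return len(a & b) / len(a | b)

def neighbor_sets(data, theta):
    """Map each transaction index to the indices of its neighbors (sim >= theta)."""
    nbrs = {i: set() for i in range(len(data))}
    for i, j in combinations(range(len(data)), 2):
        if jaccard(data[i], data[j]) >= theta:
            nbrs[i].add(j)
            nbrs[j].add(i)
    return nbrs

def link(nbrs, i, j):
    """link(T_i, T_j): number of common neighbors of transactions i and j."""
    return len(nbrs[i] & nbrs[j])

# Hypothetical basket data (the real Figure 1 data is not shown in the slides).
baskets = [frozenset(b) for b in
           [{1, 2, 3}, {1, 2, 7}, {3, 4, 5}, {1, 2, 6}, {2, 3, 7}, {3, 4, 6}, {4, 5, 7}]]
nbrs = neighbor_sets(baskets, theta=0.5)
print(jaccard(baskets[0], baskets[1]))      # 0.5
print(jaccard(baskets[0], baskets[2]))      # 0.2
print(link(nbrs, 0, 1), link(nbrs, 0, 2))   # link counts under this toy data
```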

CACTUS
- Clusters are generated and validated in later phases; we only focus on the summarization (first) phase:
  - inter-attribute summary Σ_IJ (links among attribute values)
  - intra-attribute summary Σ_II (similarity of attribute values)
- Some notation:
  - Dataset: D; tuple: t
  - Categorical attributes: A_1, A_2, ..., A_n
  - Domains: D_1, D_2, ..., D_n
  - Values in domains: V_{1,*}, V_{2,*}, ..., V_{n,*}

Support and similarity
- Let i ≠ j. The support σ_D(V_{i,x}, V_{j,y}) is defined as
  σ_D(V_{i,x}, V_{j,y}) = |{t ∈ D : t.A_i = V_{i,x} ∧ t.A_j = V_{j,y}}|
- Let α > 1. V_{i,x} and V_{j,y} are strongly connected if
  σ_D(V_{i,x}, V_{j,y}) > α · |D| / (|D_i| · |D_j|)
  (|D| / (|D_i| · |D_j|) is the expected support under the attribute-independence assumption)
- σ*_D(V_{i,x}, V_{j,y}) = σ_D(V_{i,x}, V_{j,y}) if they are strongly connected, and 0 otherwise
- The similarity γ_j(V_{i,x}, V_{i,z}) with respect to D_j (i ≠ j) is defined as
  γ_j(V_{i,x}, V_{i,z}) = |{u ∈ D_j : σ*_D(V_{i,x}, u) > 0 ∧ σ*_D(V_{i,z}, u) > 0}|
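
A minimal Python sketch of these definitions, using 0-based attribute indices; the α value and the small dataset are illustrative assumptions, not taken from the slides.

```python
def support(D, i, j, vx, vy):
    """σ_D(V_i,x, V_j,y): number of tuples t with t.A_i == vx and t.A_j == vy."""
    return sum(1 for t in D if t[i] == vx and t[j] == vy)

def strongly_connected(D, domains, i, j, vx, vy, alpha=2.0):
    """True if the observed support exceeds alpha times the expected support
    under the attribute-independence assumption, |D| / (|D_i| * |D_j|)."""
    expected = len(D) / (len(domains[i]) * len(domains[j]))
    return support(D, i, j, vx, vy) > alpha * expected

def gamma_j(D, domains, i, j, vx, vz, alpha=2.0):
    """γ_j(V_i,x, V_i,z): number of values u in D_j strongly connected to both."""
    return sum(1 for u in domains[j]
               if strongly_connected(D, domains, i, j, vx, u, alpha)
               and strongly_connected(D, domains, i, j, vz, u, alpha))

# Hypothetical dataset over two categorical attributes A_0 and A_1.
D = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "x"), ("b", "x"), ("c", "y")]
domains = [sorted({t[0] for t in D}), sorted({t[1] for t in D})]
# Both "a" and "b" are strongly connected to "x" here, so this prints 1.
print(gamma_j(D, domains, 0, 1, "a", "b", alpha=1.5))
```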

Example of similarities
Figure 2: Inter-attribute summary Σ_IJ, i.e., the links among attribute values of different attributes (adapted from [2])
Figure 3: Intra-attribute summary Σ_II, i.e., the number of common neighbors in another attribute (adapted from [2])

Summary of the two previous concepts
- Links and strongly-connected attribute values are both defined for categorical data
- The former is defined on tuples, the latter on attribute values
- The latter can be viewed as "links" between attribute values
- Both need to be extended to mixed attributes

Links between mixed attributes
- Suppose A_i is categorical and A_j is numeric. How can we compute the similarity γ_j(V_{i,x}, V_{i,z}) with respect to D_j (i ≠ j)?
- MultiSet: a set with multiplicity
  - {3,7} and {7,3} are equivalent
  - {3,7} and {3,3,7} are different
- The multiset of values of A_j over the tuples whose A_i equals V_{i,x} is defined by
  MSet(V_{i,x}, i, j) = {t.A_j : t ∈ D ∧ t.A_i = V_{i,x}}
- Example: D = {(a,5), (a,6), (b,7), (b,8)} gives MSet(a,1,2) = {5,6} and MSet(b,1,2) = {7,8}
- The similarity γ_j(V_{i,x}, V_{i,z}) can then be computed from MSet(V_{i,x}, i, j) and MSet(V_{i,z}, i, j)
Figure 4: Inter-attribute summary of mixed attributes
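
A small sketch of MSet using Python's Counter as a multiset; note it uses 0-based attribute indices, whereas the slide's example MSet(a,1,2) uses 1-based ones.

```python
from collections import Counter

def mset(D, i, j, v):
    """MSet(v, i, j): multiset of A_j values over the tuples whose A_i equals v."""
    return Counter(t[j] for t in D if t[i] == v)

D = [("a", 5), ("a", 6), ("b", 7), ("b", 8)]
print(mset(D, 0, 1, "a"))   # Counter({5: 1, 6: 1}) -- i.e. {5, 6}
print(mset(D, 0, 1, "b"))   # Counter({7: 1, 8: 1}) -- i.e. {7, 8}
```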

Links between mixed attributes (1): histograms
- Represent MSet(V_{i,x}, i, j) by a histogram Hist(V_{i,x}, i, j)
- Compute the similarity γ_j(V_{i,x}, V_{i,z}) as sim(Hist(V_{i,x}, i, j), Hist(V_{i,z}, i, j))
- Histogram intersection: Sim = Σ_b min(Hist(V_{i,x}, i, j)[b], Hist(V_{i,z}, i, j)[b]) = 24 in this example (needs to be normalized)
- A more robust method also considers adjacent regions (bins)
Figure 5: Histogram Hist(V_{i,x}, i, j)
Figure 6: Histogram Hist(V_{i,z}, i, j)
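
A sketch of the histogram-intersection similarity, assuming NumPy for binning; the bin count and the normalization (dividing by the smaller multiset size) are choices made for this sketch, not fixed by the slides.

```python
import numpy as np

def hist_intersection(xs, ys, bins=10, value_range=None):
    """Histogram intersection: sum over bins of the bin-wise minimum count,
    normalized to [0, 1] by the smaller multiset size."""
    if value_range is None:
        value_range = (min(min(xs), min(ys)), max(max(xs), max(ys)))
    hx, _ = np.histogram(xs, bins=bins, range=value_range)
    hy, _ = np.histogram(ys, bins=bins, range=value_range)
    raw = int(np.minimum(hx, hy).sum())
    return raw / min(len(xs), len(ys))

print(hist_intersection([5, 6, 6, 7], [6, 7, 7, 9], bins=5))   # 0.5 for this toy data
```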

Links between mixed attributes (2): normal approximation
- Assume MSet(V_{i,x}, i, j) follows a normal distribution
- Its mean and variance then describe MSet(V_{i,x}, i, j) approximately
- Compute the similarity γ_j(V_{i,x}, V_{i,z}) from the means and variances of MSet(V_{i,x}, i, j) and MSet(V_{i,z}, i, j)
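
The slides do not pin down how the two (mean, variance) pairs are turned into a similarity; one reasonable choice, sketched here purely as an assumption, is the Bhattacharyya coefficient between the two fitted normal distributions, which lies in (0, 1] and equals 1 only when the fits coincide.

```python
import math
from statistics import mean, pvariance

def gaussian_similarity(xs, ys, eps=1e-9):
    """Fit a normal distribution to each multiset and return the Bhattacharyya
    coefficient between the two fits (an assumed choice of similarity)."""
    m1, v1 = mean(xs), pvariance(xs) + eps
    m2, v2 = mean(ys), pvariance(ys) + eps
    d = (0.25 * (m1 - m2) ** 2 / (v1 + v2)
         + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))
    return math.exp(-d)

print(gaussian_similarity([5, 6, 6, 7], [6, 7, 7, 9]))
```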

Weighted clustering
- Clustering is often not meaningful in high-dimensional spaces because of irrelevant attributes
- Clusters may form in subspaces → projected clustering algorithms find the subspaces of the clusters during cluster formation
- Different attributes may have different relevances to different clusters → weighted clustering algorithms determine the weights of the attributes in each cluster
- Users can interpret the weights in a meaningful way

Example of weighted clustering
Assume there are 3 attributes X, Y and Z.
Figure 7: Projection on the X-Y plane (adapted from [3])
Figure 8: Projection on the X-Z plane (adapted from [3])

            Projected clustering   Weighted clustering
Cluster 1   {X, Z}                 w_X = 0.45, w_Y = 0.10, w_Z = 0.45
Cluster 2   {X, Y}                 w_X = 0.45, w_Y = 0.45, w_Z = 0.10

PROCLUS
- A projected clustering algorithm; medoid-based and efficient
- Some disadvantages:
  - Clusters with fewer than (|D|/k) · minDev points are considered bad
  - The quality of the clusters depends on the medoids
- Example: number of clusters k = 2. In an unlucky case, the 2 medoids are drawn from the same cluster → that cluster is split into two small clusters, and points in the other cluster become misses or outliers
Figure 9: An example of clusters

Definition of weights
Assume there are k clusters and m attributes. The weights of the clusters must satisfy:
- ∀ i ∈ [1,k], j ∈ [1,m]: w_{i,j} is a real number with w_{i,j} ∈ [0,1]
- ∀ i ∈ [1,k]: Σ_{j ∈ [1,m]} w_{i,j} = 1
(Note: i and j are integers)
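
The constraints above amount to each cluster's weight vector being a normalized vector of non-negative scores; a minimal sketch of that normalization:

```python
def normalize_weights(raw):
    """Scale non-negative raw scores so they lie in [0, 1] and sum to 1,
    satisfying the two constraints above; falls back to uniform weights if
    all scores are zero."""
    total = sum(raw)
    if total == 0:
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]

print(normalize_weights([2.25, 0.25, 0.0]))   # [0.9, 0.1, 0.0]
```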

Weighted measures
- Weighted distance: dist_p(x, y) = Σ_{i ∈ [1,m]} w_{p,i} · dist_{p,i}(x.A_i, y.A_i)
- Weighted similarity: sim_p(x, y) = Σ_{i ∈ [1,m]} w_{p,i} · sim_{p,i}(x.A_i, y.A_i)
- For the weights to be meaningful, dist_{p,i} and sim_{p,i} must return real values in [0,1]
- A simple categorical distance measure: dist_{p,i}(x.A_i, y.A_i) = 0 if x.A_i = y.A_i, and 1 otherwise
- A more complex sim_{p,i} will be introduced later
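
A sketch of the weighted distance using the simple categorical per-attribute distance from the slide; the weighted similarity is the same sum with sim_{p,i} in place of dist_{p,i}, and the example weights are illustrative.

```python
def categorical_dist(a, b):
    """Simple categorical distance: 0 if the values match, 1 otherwise."""
    return 0.0 if a == b else 1.0

def weighted_distance(x, y, weights, per_attr_dists):
    """dist_p(x, y) = sum_i w_{p,i} * dist_{p,i}(x.A_i, y.A_i); each
    per-attribute distance is expected to return a value in [0, 1]."""
    return sum(w * d(a, b) for w, d, a, b in zip(weights, per_attr_dists, x, y))

x, y = ("a", "d", "f"), ("a", "e", "j")
print(weighted_distance(x, y, [0.5, 0.375, 0.125], [categorical_dist] * 3))   # 0.5
```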

Algorithms adapted from PROCLUS
- Adapt PROCLUS to a weighted clustering algorithm:
  - Change the FindDimension procedure (for finding relevant attributes) to a FindWeight procedure (for computing the weights of the attributes)
  - Replace the distance functions
- Three methods for computing the weights follow

1st method
- w_{p,i} = Var({ |{t ∈ C_p : t.A_i = V_{i,j}}| : V_{i,j} ∈ D_i }), then normalize w_{p,i}
- An attribute with high variance among its value counts → high relevance to the cluster
- The variance of an attribute's value counts in a cluster is used as the (unnormalized) weight of that attribute in that cluster
Example cluster over attributes A, B, C: tuples adf, adg, adh, aei, bej.
Counts of attribute values in the cluster:
  A: a:4, b:1
  B: d:3, e:2
  C: f:1, g:1, h:1, i:1, j:1
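
A sketch of the 1st method on the cluster shown above; taking the domain D_i to be only the values observed in the cluster is a simplification made for this sketch (any other domain value would contribute a zero count).

```python
from collections import Counter
from statistics import pvariance

def variance_weights(cluster, domains):
    """1st method: unnormalized weight of attribute i = variance of the counts
    of the domain values of A_i within the cluster; then normalize."""
    raw = []
    for i, domain in enumerate(domains):
        counts = Counter(t[i] for t in cluster)
        raw.append(pvariance([counts.get(v, 0) for v in domain]))
    total = sum(raw)
    return [w / total for w in raw] if total else raw

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"), ("a", "e", "i"), ("b", "e", "j")]
domains = [sorted({t[i] for t in cluster}) for i in range(3)]
print(variance_weights(cluster, domains))   # [0.9, 0.1, 0.0]: A most relevant, C least
```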

2nd method
- w_{p,i} = |{t ∈ C_p : t.A_i = Medoid(C_p).A_i}|, then normalize w_{p,i}
- For each attribute, count the records in the cluster whose value equals the medoid's value on that attribute
- Attributes with high counts → higher weights
Example cluster over attributes A, B, C: tuples adf, adg, adh, aei, bej, with adf as the medoid.
Counts of matches with the medoid: A: 4, B: 3, C: 1
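
A sketch of the 2nd method on the same example cluster, with adf as the medoid as in the slide.

```python
def medoid_match_weights(cluster, medoid):
    """2nd method: unnormalized weight of attribute i = number of cluster
    records whose A_i value equals the medoid's; then normalize."""
    raw = [sum(1 for t in cluster if t[i] == medoid[i]) for i in range(len(medoid))]
    total = sum(raw)
    return [w / total for w in raw] if total else raw

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"), ("a", "e", "i"), ("b", "e", "j")]
print(medoid_match_weights(cluster, medoid=("a", "d", "f")))
# counts 4, 3, 1 -> [0.5, 0.375, 0.125]
```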

3rd method
- The original similarity γ_j(V_{i,x}, V_{i,z}) is too strong (restrictive) because only strongly-connected attribute values are considered
- We change the definition from
  γ_j(V_{i,x}, V_{i,z}) = |{u ∈ D_j : σ*_D(V_{i,x}, u) > 0 ∧ σ*_D(V_{i,z}, u) > 0}|
  to
  γ^p_j(V_{i,x}, V_{i,z}) = Σ_{u ∈ D_j} σ_{C_p}(V_{i,x}, u) · σ_{C_p}(V_{i,z}, u)
  γ^p(V_{i,x}, V_{i,z}) = Σ_{j ∈ [1,m], j ≠ i} γ^p_j(V_{i,x}, V_{i,z})

3rd method (cont'd)
- One similarity matrix per attribute per cluster (k · m matrices in total)
- SimMax_{p,i} = max({γ^p(V_{i,x}, V_{i,z}) : V_{i,x}, V_{i,z} ∈ D_i})
- sim_{p,i}(V_{i,x}, V_{i,x}) = 1
- sim_{p,i}(V_{i,x}, V_{i,z}) = γ^p(V_{i,x}, V_{i,z}) / SimMax_{p,i}
- w_{p,i} = SimMax_{p,i}, then normalize w_{p,i}
(Example: a similarity matrix [p,i] over the attribute values f, g, s, y; the maximum entry of the matrix gives SimMax_{p,i}.)
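
A sketch of the 3rd method: γ^p sums, over every other attribute, the products of co-occurrence counts within the cluster C_p, and SimMax_{p,i} serves as the unnormalized weight. The brute-force loops and the example cluster are for clarity only.

```python
def gamma_p(cluster, domains, i, vx, vz):
    """γ^p(V_i,x, V_i,z) = Σ_{j != i} Σ_{u ∈ D_j} σ_{C_p}(V_i,x, u) * σ_{C_p}(V_i,z, u)."""
    total = 0
    for j, domain in enumerate(domains):
        if j == i:
            continue
        for u in domain:
            sx = sum(1 for t in cluster if t[i] == vx and t[j] == u)
            sz = sum(1 for t in cluster if t[i] == vz and t[j] == u)
            total += sx * sz
    return total

def third_method_weights(cluster, domains):
    """3rd method: w_{p,i} = SimMax_{p,i} = max γ^p over distinct value pairs of
    A_i, then normalize; sim_{p,i}(vx, vz) would be γ^p(vx, vz) / SimMax_{p,i}."""
    raw = []
    for i, domain in enumerate(domains):
        raw.append(max((gamma_p(cluster, domains, i, vx, vz)
                        for vx in domain for vz in domain if vx != vz), default=0))
    total = sum(raw)
    return [w / total for w in raw] if total else raw

cluster = [("a", "d", "f"), ("a", "d", "g"), ("a", "d", "h"), ("a", "e", "i"), ("b", "e", "j")]
domains = [sorted({t[i] for t in cluster}) for i in range(3)]
print(third_method_weights(cluster, domains))
```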

Extension: pruning of insignificant weights
- Although the weights of irrelevant attributes are low, they can still affect the distance measure if there are too many of them
- Let α > 1. A weight is said to be insignificant if it is lower than 1/(α · m), where m is the number of dimensions
- Insignificant weights are set to zero and all the weights are normalized again
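
A sketch of the pruning step; α and the example weight vector are illustrative.

```python
def prune_insignificant(weights, alpha=2.0):
    """Zero out weights below 1 / (alpha * m), then re-normalize the remainder."""
    m = len(weights)
    threshold = 1.0 / (alpha * m)
    pruned = [w if w >= threshold else 0.0 for w in weights]
    total = sum(pruned)
    return [w / total for w in pruned] if total else weights

print(prune_insignificant([0.45, 0.45, 0.06, 0.04], alpha=2.0))
# threshold = 1/8 = 0.125, so the two small weights are dropped -> [0.5, 0.5, 0.0, 0.0]
```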

Future work
- Redefine the "badness" of the medoids: medoids are currently bad if their clusters are smaller than a predefined size, but real datasets may have clusters of various sizes
- Detect whether different medoids are chosen from the same cluster
- Use other distance or similarity measures
- Study mixed-attribute clustering further
- Adapt other algorithms into weighted clustering algorithms

Conclusion
- Mixed-attribute clustering can exploit information from both types of attributes
- Weighted clustering can reduce the effect of noise on the clusters
- The weights are meaningful to end-users
- Other algorithms can be adapted into weighted clustering algorithms

References  Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim. ROCK: A robust clustering algorithm for categorical attributes. Information Systems, 25(5):345–366,  Venkatesh Ganti, Johannes Gehrke, and Raghu Ramakrishnan. CACTUS – clustering categorical data using summaries. In Knowledge Discovery and Data Mining, pages 73–83,  Charu C. Aggarwal, Joel L. Wolf, Philip S. Yu, Cecilia Procopiuc, and Jong Soo Park. Fast algorithms for projected clustering. pages 61–72, 1999.