CSE4334/5334 Data Mining Clustering. What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related)

Slides:



Advertisements
Similar presentations
Hierarchical Clustering
Advertisements

Unsupervised Learning
Cluster Analysis: Basic Concepts and Algorithms
1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.
Data Mining Cluster Analysis Basics
Hierarchical Clustering, DBSCAN The EM Algorithm
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
unsupervised learning - clustering
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Cluster Analysis.  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods.
Cluster Analysis: Basic Concepts and Algorithms
What is Cluster Analysis?
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Cluster Validation.
What is Cluster Analysis?
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
What is Cluster Analysis?
DATA MINING LECTURE 8 Clustering The k-means algorithm
Clustering Basic Concepts and Algorithms 2
Partitional and Hierarchical Based clustering Lecture 22 Based on Slides of Dr. Ikle & chapter 8 of Tan, Steinbach, Kumar.
Lecture 20: Cluster Validation
Data Mining Cluster Analysis: Basic Concepts and Algorithms Adapted from Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar.
Critical Issues with Respect to Clustering Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
CSE5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar 10/30/2007.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Minqi Zhou Minqi Zhou Introduction.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar.
Data Mining Cluster Analysis: Basic Concepts and Algorithms.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.
Cluster Analysis: Basic Concepts and Algorithms. What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
DATA MINING: CLUSTER ANALYSIS (3) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Data Mining Cluster Analysis: Basic Concepts and Algorithms.
ΠΑΝΕΠΙΣΤΗΜΙΟ ΙΩΑΝΝΙΝΩΝ ΑΝΟΙΚΤΑ ΑΚΑΔΗΜΑΪΚΑ ΜΑΘΗΜΑΤΑ Εξόρυξη Δεδομένων Ομαδοποίηση (clustering) Διδάσκων: Επίκ. Καθ. Παναγιώτης Τσαπάρας.
Data Mining Classification and Clustering Techniques Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining: Basic Cluster Analysis
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering 28/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
What Is the Problem of the K-Means Method?
Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Techniques: Basic
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Critical Issues with Respect to Clustering
Clustering 23/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering Analysis.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Cluster Validity For supervised classification we have a variety of measures to evaluate how good our model is Accuracy, precision, recall For cluster.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Presentation transcript:

CSE4334/5334 Data Mining Clustering

What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized

Applications of Cluster Analysis Clustering precipitation in Australia

What is not Cluster Analysis?

Notion of a Cluster can be Ambiguous How many clusters? Four ClustersTwo Clusters Six Clusters

Types of Clusterings

Partitional Clustering Original Points A Partitional Clustering

Hierarchical Clustering Traditional Hierarchical Clustering Non-traditional Hierarchical ClusteringNon-traditional Dendrogram Traditional Dendrogram

Other Distinctions Between Sets of Clusters

Types of Clusters Well-separated clusters Center-based clusters Contiguous clusters Density-based clusters Property or Conceptual Described by an Objective Function

Types of Clusters: Well-Separated 3 well-separated clusters

Types of Clusters: Center-Based 4 center-based clusters

Types of Clusters: Contiguity-Based 8 contiguous clusters

Types of Clusters: Density-Based 6 density-based clusters

Types of Clusters: Conceptual Clusters 2 Overlapping Circles

Types of Clusters: Objective Function

Types of Clusters: Objective Function …

Characteristics of the Input Data Are Important

Clustering Algorithms K-means and its variants Hierarchical clustering

K-means Clustering o Partitional clustering approach o Each cluster is associated with a centroid (center point) o Each point is assigned to the cluster with the closest centroid o Number of clusters, K, must be specified o The basic algorithm is very simple

K-means Clustering – Details

Two different K-means Clusterings Sub-optimal ClusteringOptimal Clustering Original Points

Importance of Choosing Initial Centroids

Evaluating K-means Clusters

Importance of Choosing Initial Centroids …

Problems with Selecting Initial Points

10 Clusters Example Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example Starting with two initial centroids in one cluster of each pair of clusters

10 Clusters Example Starting with some pairs of clusters having three initial centroids, while other have only one.

10 Clusters Example Starting with some pairs of clusters having three initial centroids, while other have only one.

Solutions to Initial Centroids Problem

Handling Empty Clusters

Updating Centers Incrementally

Pre-processing and Post-processing

Bisecting K-means

Bisecting K-means Example

Limitations of K-means

Limitations of K-means: Differing Sizes Original Points K-means (3 Clusters)

Limitations of K-means: Differing Density Original Points K-means (3 Clusters)

Limitations of K-means: Non-globular Shapes Original Points K-means (2 Clusters)

Overcoming K-means Limitations Original PointsK-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together.

Overcoming K-means Limitations Original PointsK-means Clusters

Overcoming K-means Limitations Original PointsK-means Clusters

Hierarchical Clustering

Strengths of Hierarchical Clustering

Hierarchical Clustering

Agglomerative Clustering Algorithm

Starting Situation Start with clusters of individual points and a proximity matrix p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix

Intermediate Situation After some merging steps, we have some clusters C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

Intermediate Situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

After Merging The question is “How do we update the proximity matrix?” C1 C4 C2 U C5 C3 ? ? ? ? ? C2 U C5 C1 C3 C4 C2 U C5 C3C4 Proximity Matrix

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Similarity? l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function –Ward’s Method uses squared error Proximity Matrix

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function –Ward’s Method uses squared error

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function –Ward’s Method uses squared error

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function –Ward’s Method uses squared error

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix l MIN l MAX l Group Average l Distance Between Centroids l Other methods driven by an objective function –Ward’s Method uses squared error 

Cluster Similarity: MIN or Single Link 12345

Hierarchical Clustering: MIN Nested ClustersDendrogram

Strength of MIN Original Points Two Clusters Can handle non-elliptical shapes

Limitations of MIN Original Points Two Clusters Sensitive to noise and outliers

Cluster Similarity: MAX or Complete Linkage 12345

Hierarchical Clustering: MAX Nested ClustersDendrogram

Strength of MAX Original Points Two Clusters Less susceptible to noise and outliers

Limitations of MAX Original Points Two Clusters Tends to break large clusters Biased towards globular clusters

Cluster Similarity: Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters. Need to use average connectivity for scalability since total proximity favors large clusters 12345

Hierarchical Clustering: Group Average Nested ClustersDendrogram

Hierarchical Clustering: Group Average

Cluster Similarity: Ward’s Method

Hierarchical Clustering: Comparison Group Average Ward’s Method MINMAX

Hierarchical Clustering: Time and Space requirements

Hierarchical Clustering: Problems and Limitations

MST: Divisive Hierarchical Clustering

Use MST for constructing hierarchy of clusters

Cluster Validity

Clusters found in Random Data Random Points K-means DBSCAN Complete Link

1.Determining the clustering tendency of a set of data, i.e., distinguishing whether non-random structure actually exists in the data. 2.Comparing the results of a cluster analysis to externally known results, e.g., to externally given class labels. 3.Evaluating how well the results of a cluster analysis fit the data without reference to external information. - Use only the data 4.Comparing the results of two different sets of cluster analyses to determine which is better. 5.Determining the ‘correct’ number of clusters. For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire clustering or just individual clusters. Different Aspects of Cluster Validation

Measures of Cluster Validity

Measuring Cluster Validity Via Correlation

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. Corr = Corr =

Order the similarity matrix with respect to cluster labels and inspect visually. Using Similarity Matrix for Cluster Validation

Clusters in random data are not so crisp DBSCAN

Using Similarity Matrix for Cluster Validation Clusters in random data are not so crisp K-means

Using Similarity Matrix for Cluster Validation Clusters in random data are not so crisp Complete Link

Using Similarity Matrix for Cluster Validation DBSCAN

Internal Measures: SSE

SSE curve for a more complicated data set SSE of clusters found using K-means

Framework for Cluster Validity

Statistical Framework for SSE

Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets. Statistical Framework for Correlation Corr = Corr =

Internal Measures: Cohesion and Separation

12345  m1m1 m2m2 m K=2 clusters: K=1 cluster:

Internal Measures: Cohesion and Separation cohesionseparation

Internal Measures: Silhouette Coefficient

External Measures of Cluster Validity: Entropy and Purity

“The validation of clustering structures is the most difficult and frustrating part of cluster analysis. Without a strong effort in this direction, cluster analysis will remain a black art accessible only to those true believers who have experience and great courage.” Algorithms for Clustering Data, Jain and Dubes Final Comment on Cluster Validity