Clustering (2). Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like.

Slides:



Advertisements
Similar presentations
Clustering II.
Advertisements

SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Hierarchical Clustering
Unsupervised Learning
Cluster Analysis: Basic Concepts and Algorithms
1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Data Mining Cluster Analysis Basics
Hierarchical Clustering, DBSCAN The EM Algorithm
Clustering Paolo Ferragina Dipartimento di Informatica Università di Pisa This is a mix of slides taken from several presentations, plus my touch !
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
DBSCAN – Density-Based Spatial Clustering of Applications with Noise M.Ester, H.P.Kriegel, J.Sander and Xu. A density-based algorithm for discovering clusters.
More on Clustering Hierarchical Clustering to be discussed in Clustering Part2 DBSCAN will be used in programming project.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
unsupervised learning - clustering
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like diagram that.
Clustering II.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Cluster Analysis.
Cluster Analysis: Basic Concepts and Algorithms
Cluster Analysis CS240B Lecture notes based on those by © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
DATA MINING LECTURE 8 Clustering The k-means algorithm
Clustering Basic Concepts and Algorithms 2
Partitional and Hierarchical Based clustering Lecture 22 Based on Slides of Dr. Ikle & chapter 8 of Tan, Steinbach, Kumar.
1 CSE 980: Data Mining Lecture 17: Density-based and Other Clustering Algorithms.
CSE5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides.
Ch. Eick: Introduction to Hierarchical Clustering and DBSCAN 1 Remaining Lectures in Advanced Clustering and Outlier Detection 2.Advanced Classification.
Data Mining Cluster Analysis: Basic Concepts and Algorithms.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
DATA MINING: CLUSTER ANALYSIS (3) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
CSE4334/5334 Data Mining Clustering. What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related)
ΠΑΝΕΠΙΣΤΗΜΙΟ ΙΩΑΝΝΙΝΩΝ ΑΝΟΙΚΤΑ ΑΚΑΔΗΜΑΪΚΑ ΜΑΘΗΜΑΤΑ Εξόρυξη Δεδομένων Ομαδοποίηση (clustering) Διδάσκων: Επίκ. Καθ. Παναγιώτης Τσαπάρας.
Data Mining Classification and Clustering Techniques Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining.
Data Mining: Basic Cluster Analysis
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Hierarchical Clustering
Data Mining Cluster Analysis: Basic Concepts and Algorithms
More on Clustering in COSC 4335
CSE 4705 Artificial Intelligence
Hierarchical Clustering: Time and Space requirements
Clustering CSC 600: Data Mining Class 21.
Clustering 28/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
What Is the Problem of the K-Means Method?
CSE 5243 Intro. to Data Mining
Hierarchical Clustering
Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
CSE 5243 Intro. to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Techniques: Basic
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering 23/03/2016 A diák alatti jegyzetszöveget írta: Balogh Tamás Péter.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Clustering Analysis.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms
SEEM4630 Tutorial 3 – Clustering.
Hierarchical Clustering
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Presentation transcript:

Clustering (2)

Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram –A tree like diagram that records the sequences of merges or splits

Strengths of Hierarchical Clustering Do not have to assume any particular number of clusters –‘cut’ the dendogram at the proper level They may correspond to meaningful taxonomies –Example in biological sciences e.g., animal kingdom, phylogeny reconstruction, …

Hierarchical Clustering Start with the points as individual clusters At each step, merge the closest pair of clusters until only one cluster left. Algorithm Let each data point be a cluster Compute the proximity matrix Repeat Merge the two closest clusters Update the proximity matrix Until only a single cluster remains Key operation is the computation of the proximity of two clusters.

Starting Situation Start with clusters of individual points and a proximity matrix p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix

Intermediate Situation After some merging steps, we have some clusters C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

Intermediate Situation We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. C1 C4 C2 C5 C3 C2C1 C3 C5 C4 C2 C3C4C5 Proximity Matrix

After Merging The question is “How do we update the proximity matrix?” C1 C4 C2 U C5 C3 ? ? ? ? ? C2 U C5 C1 C3 C4 C2 U C5 C3C4 Proximity Matrix

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Distance? MIN MAX Group Average Proximity Matrix

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix MIN MAX Group Average

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix MIN MAX Group Average

How to Define Inter-Cluster Similarity p1 p3 p5 p4 p2 p1p2p3p4p Proximity Matrix MIN MAX Group Average

Cluster Similarity: MIN Similarity of two clusters is based on the two most similar (closest) points in the different clusters –Determined by one pair of points

Hierarchical Clustering: MIN Nested ClustersDendrogram

Strength of MIN Original Points Two Clusters Can handle non-globular shapes

Limitations of MIN Sensitive to noise and outliers Original Points Four clusters Three clusters: The yellow points got wrongly merged with the red ones, as opposed to the green one.

Cluster Similarity: MAX Similarity of two clusters is based on the two least similar (most distant) points in the different clusters –Determined by all pairs of points in the two clusters

Hierarchical Clustering: MAX Nested ClustersDendrogram

Strengths of MAX Less susceptible respect to noise and outliers Original Points Four clusters Three clusters: The yellow points get now merged with the green one.

Limitations of MAX Original Points Two Clusters Tends to break large clusters

Cluster Similarity: Group Average Proximity of two clusters is the average of pairwise proximity between points in the two clusters.

Hierarchical Clustering: Group Average Nested ClustersDendrogram

Hierarchical Clustering: Time and Space O(N 2 ) space since it uses the proximity matrix. –N is the number of points. O(N 3 ) time in many cases –There are N steps and at each step the size, N 2, proximity matrix must be updated and searched –Complexity can be reduced to O(N 2 log(N) ) time for some approaches

Hierarchical Clustering Example

From “Indo-European languages tree by Levenshtein distance” by M. Serva1 and F. Petroni

DBSCAN DBSCAN is a density-based algorithm. Locates regions of high density that are separated from one another by regions of low density. Density = number of points within a specified radius (Eps) A point is a core point if it has more than a specified number of points (MinPts) within Eps –These are points that are at the interior of a cluster A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point A noise point is any point that is neither a core point nor a border point.

DBSCAN: Core, Border, and Noise Points

DBSCAN Algorithm Any two core points that are close enough---within a distance Eps of one another---are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as the core point. Ties may need to be resolved if a border point is close to core points from different clusters. Noise points are discarded.

DBSCAN: Core, Border and Noise Points Original Points Point types: core, border and noise Eps = 10, MinPts = 4

When DBSCAN Works Well Original Points Clusters Resistant to Noise Can handle clusters of different shapes and sizes

When DBSCAN Does NOT Work Well Why DBSCAN doesn’t work well here?

DBSCAN: Determining EPS and MinPts Look at the behavior of the distance from a point to its k-th nearest neighbor, called the k­dist. For points that belong to some cluster, the value of k­dist will be small [if k is not larger than the cluster size]. However, for points that are not in a cluster, such as noise points, the k­dist will be relatively large. So, if we compute the k­dist for all the data points for some k, sort them in increasing order, and then plot the sorted values, we expect to see a sharp change at the value of k­dist that corresponds to a suitable value of Eps. If we select this distance as the Eps parameter and take the value of k as the MinPts parameter, then points for which k­dist is less than Eps will be labeled as core points, while other points will be labeled as noise or border points.

DBSCAN: Determining EPS and MinPts Eps determined in this way depends on k, but does not change dramatically as k changes. If k is too small ? then even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters. If k is too large ? then small clusters (of size less than k) are likely to be labeled as noise. Original DBSCAN used k = 4, which appears to be a reasonable value for most data sets.