© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.

© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance Computing Research Center Department of Computer Science University of Minnesota

Hierarchical Clustering
• Two main types of hierarchical clustering algorithms:
  – Agglomerative
    ▪ Start with the points as individual clusters.
    ▪ Merge clusters until only one is left.
  – Divisive
    ▪ Start with all the points as one cluster.
    ▪ Split clusters until only singleton clusters remain.
  – Agglomerative is more popular.
• Traditional hierarchical algorithms use a similarity or distance matrix.
  – Merge or split one cluster at a time.

Hierarchical Clustering
• Produces a set of nested clusters organized as a hierarchical tree.
• Can be visualized as a dendrogram.
  – Tree-like diagram
  – Records the sequence of merges or splits
• Can "cut" the dendrogram to get a partitional clustering.
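As a minimal sketch of "cutting" a dendrogram, the code below replays only the merges whose distance is at or below a chosen threshold and reports the resulting flat clusters. The merge list, the function name `cut`, and the threshold values are hypothetical, chosen for illustration; the sketch also assumes merge distances never decrease along the list, which holds for MIN and MAX but not for every linkage.

```python
def cut(merges, n_points, threshold):
    """Return the flat clusters obtained by cutting the dendrogram at `threshold`."""
    parent = list(range(n_points))  # union-find forest over point indices

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, distance in merges:
        if distance <= threshold:
            # Any member can represent its cluster, so use the first one.
            parent[find(a[0])] = find(b[0])

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical dendrogram over five points: (cluster a, cluster b, merge distance).
merges = [
    ((0,), (1,), 1.0),
    ((2,), (3,), 1.5),
    ((0, 1), (2, 3), 5.0),
    ((4,), (0, 1, 2, 3), 9.86),
]

three_clusters = cut(merges, 5, threshold=2.0)
two_clusters = cut(merges, 5, threshold=6.0)
```

Raising the threshold moves the cut higher in the tree and yields fewer, larger clusters.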

Basic Agglomerative Clustering Algorithm
• The algorithm is straightforward:
  1. Compute the proximity matrix, if necessary.
  2. Let each data point be a cluster.
  3. Repeat
  4.   Merge the two closest clusters.
  5.   Update the proximity matrix.
  6. Until only a single cluster remains.
• The key operation is the computation of the proximity of two clusters.
• Different approaches to defining the distance between clusters distinguish the different algorithms.
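The steps above can be sketched in a few lines of Python. This is a deliberately naive illustration, not the lecture's implementation: it assumes Euclidean distance and MIN (single link) proximity, and it recomputes cluster proximities on demand instead of storing and updating a matrix. The function names and sample points are made up for the example.

```python
import math
from itertools import combinations


def single_link(a, b, points):
    """MIN proximity: distance between the closest pair of points, one from each cluster."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def agglomerative(points, cluster_distance=single_link):
    """Naive agglomerative clustering; returns the merges in the order they happen."""
    # Step 2: every point starts as its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Steps 3 and 6: repeat until a single cluster remains.
    while len(clusters) > 1:
        # Step 4: find the two closest clusters.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1], points),
        )
        merges.append((a, b, cluster_distance(a, b, points)))
        # Step 5 (simplified): replace a and b by their union; proximities
        # are recomputed from the points in the next iteration.
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (10.0, 10.0)]
merges = agglomerative(points)
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d:.2f}")
```

Passing a different `cluster_distance` function changes the linkage without touching the main loop, which mirrors the point that the proximity definition is what distinguishes the algorithms.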

Agglomerative Hierarchical Clustering: Starting Situation
• For agglomerative hierarchical clustering, we start with clusters of individual points and a proximity matrix.
[Figure: individual points p1, p2, p3, p4, p5, ... and their proximity matrix]

Agglomerative Hierarchical Clustering: Intermediate Situation
• After some merging steps, we have some clusters.
[Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

Agglomerative Hierarchical Clustering: Intermediate Situation
• We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1, C2, C3, C4, C5 and their proximity matrix]

Agglomerative Hierarchical Clustering: After Merging
• The question is "How do we update the proximity matrix?"
[Figure: clusters C1, C3, C4, and C2 U C5; the proximity-matrix entries for C2 U C5 are marked "?"]

How to Define Inter-Cluster Similarity
• MIN
• MAX
• Group Average
• Distance Between Centroids
• Other methods driven by an objective function
  – Ward's Method uses squared error
[Figure: points p1, p2, p3, p4, p5, ... with their proximity matrix; the similarity between two groups of points is marked "Similarity?"]
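The four proximity definitions listed above can be written down directly. The sketch below assumes Euclidean distance and represents a cluster as a list of coordinate tuples; the function names and the two sample clusters are illustrative only.

```python
import math


def pairwise(a, b):
    """All distances between a point of cluster a and a point of cluster b."""
    return [math.dist(p, q) for p in a for q in b]


def min_link(a, b):
    """MIN (single link): distance of the closest pair."""
    return min(pairwise(a, b))


def max_link(a, b):
    """MAX (complete link): distance of the most distant pair."""
    return max(pairwise(a, b))


def group_average(a, b):
    """Group average: mean of all pairwise distances."""
    distances = pairwise(a, b)
    return sum(distances) / len(distances)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 2.0)]
for measure in (min_link, max_link, group_average, centroid_distance):
    print(f"{measure.__name__}: {measure(a, b):.3f}")
```

For any two clusters the group average lies between the MIN and MAX values, which is one way to see it as a compromise between the two.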

Cluster Similarity: MIN or Single Link
• Similarity of two clusters is based on the two closest points in the different clusters.
  – Determined by one pair of points, i.e., by one link in the proximity graph.
• Can handle non-elliptical shapes.
• Sensitive to noise and outliers.
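Because MIN depends only on the closest pair, the proximity matrix can be updated after a merge without going back to the points: the distance from the merged cluster to any other cluster is the smaller of the two old distances. The sketch below illustrates this on a dict-of-dicts matrix; swapping `min` for `max` gives the corresponding update for MAX (complete link). The cluster labels echo the C1 to C5 example, but the distance values and function names are invented for illustration.

```python
def build_matrix(pairs):
    """Build a symmetric dict-of-dicts proximity matrix from pairwise distances."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def merge_update(dist, i, j, combine=min):
    """Merge clusters i and j in place and return the label of the new cluster."""
    new = f"{i} U {j}"
    others = [k for k in dist if k not in (i, j)]
    # New row: combine the two old distances to every remaining cluster.
    row = {k: combine(dist[i][k], dist[j][k]) for k in others}
    del dist[i], dist[j]
    for k in others:
        del dist[k][i], dist[k][j]
        dist[k][new] = row[k]
    dist[new] = row
    return new


# Hypothetical distances between five clusters; C2 and C5 are the closest pair.
pairs = {
    ("C1", "C2"): 4.0, ("C1", "C3"): 6.0, ("C1", "C4"): 7.0, ("C1", "C5"): 5.0,
    ("C2", "C3"): 3.0, ("C2", "C4"): 8.0, ("C2", "C5"): 1.0,
    ("C3", "C4"): 2.5, ("C3", "C5"): 3.5, ("C4", "C5"): 9.0,
}

single = build_matrix(pairs)
label = merge_update(single, "C2", "C5")                # MIN update

complete = build_matrix(pairs)
merge_update(complete, "C2", "C5", combine=max)         # MAX update
```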

Hierarchical Clustering: MIN
[Figure panels: Nested Clusters; Dendrogram]

Strength of MIN
[Figure panels: Original Points; Two Clusters]

Limitations of MIN
[Figure panels: Original Points; Two Clusters]

Cluster Similarity: MAX or Complete Linkage
• Similarity of two clusters is based on the two most distant points in the different clusters.
  – Determined by all pairs of points in the two clusters.
• Tends to break large clusters.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.

Hierarchical Clustering: MAX
[Figure panels: Nested Clusters; Dendrogram]

Strength of MAX
[Figure panels: Original Points; Two Clusters]

Limitations of MAX
[Figure panels: Original Points; Two Clusters]

Cluster Similarity: Group Average
• The distance between two clusters is the average of the pairwise distances between points in the two clusters.
• Compromise between Single and Complete Link.
• Need to use average connectivity for scalability since total connectivity favors large clusters.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.

Hierarchical Clustering: Group Average
[Figure panels: Nested Clusters; Dendrogram]

Cluster Similarity: Ward's Method
• Similarity of two clusters is based on the increase in squared error when two clusters are merged.
  – Similar to group average if distance between points is distance squared.
• Less susceptible to noise and outliers.
• Biased towards globular clusters.
• Hierarchical analogue of K-means
  – But Ward's method does not correspond to a local minimum.
  – Can be used to initialize K-means.
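To make "increase in squared error" concrete, the sketch below computes it two ways for a pair of small clusters: directly, as SSE(merged) minus the two separate SSEs, and through the standard closed form n_a * n_b / (n_a + n_b) times the squared distance between the centroids. The sample points and function names are assumptions for illustration; squared Euclidean error is assumed throughout.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def sse(cluster):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    center = centroid(cluster)
    return sum(sum((x - c) ** 2 for x, c in zip(p, center)) for p in cluster)


def ward_increase(a, b):
    """Increase in total squared error caused by merging clusters a and b."""
    return sse(a + b) - sse(a) - sse(b)


def ward_closed_form(a, b):
    """Same quantity from cluster sizes and centroids only."""
    ca, cb = centroid(a), centroid(b)
    squared_gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * squared_gap


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(4.0, 1.0), (6.0, 1.0), (5.0, 4.0)]
print(ward_increase(a, b), ward_closed_form(a, b))
```

The closed form shows why the method favors merging small clusters with nearby centroids: the cost grows with both the centroid gap and the cluster sizes.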

Hierarchical Clustering: Group Average
[Figure panels: Nested Clusters; Dendrogram]

Hierarchical Clustering: Comparison
[Figure panels: Group Average; Ward's Method; MIN; MAX]

Hierarchical Clustering: Time and Space Requirements
• O(N^2) space since it uses the proximity matrix.
  – N is the number of points.
• O(N^3) time in many cases.
  – There are N steps, and at each step the proximity matrix, of size N^2, must be updated and searched.
  – By being careful, the complexity can be reduced to O(N^2 log N) time for some approaches.

Hierarchical Clustering: Problems and Limitations
• Once a decision is made to combine two clusters, it cannot be undone.
• No objective function is directly minimized.
• Different schemes have problems with one or more of the following:
  – Sensitivity to noise and outliers.
  – Difficulty handling clusters of different sizes and non-convex shapes.
  – Breaking large clusters.

DBSCAN
• DBSCAN is a density-based algorithm.
  – Density = number of points within a specified radius (Eps).
  – A point is a core point if it has more than a specified number of points (MinPts) within Eps.
    ▪ These are points that are in the interior of a cluster.
  – A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
  – A noise point is any point that is not a core point or a border point.
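The three point types can be computed with a direct, quadratic-time pass over the data. The sketch below makes two assumptions that the definitions above leave open: a point counts as its own neighbor, and a point is core when its neighborhood holds at least `min_pts` points (a common convention; change the comparison to `>` to follow the "more than" wording literally). Function name, sample points, and parameter values are illustrative.

```python
import math


def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    # Indices of all points within eps of each point (the point itself included).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Assumed convention: core means at least min_pts points in the neighborhood.
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.2, 1), (8, 8)]
labels = classify_points(points, eps=1.5, min_pts=4)
print(list(zip(points, labels)))
```

In this example the four corners of the unit square are core points, the point at (2.2, 1) is a border point because only (1, 1) lies within Eps of it, and (8, 8) is noise.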

DBSCAN: Core, Border, and Noise Points
[Figure]

When DBSCAN Works Well
[Figure panels: Original Points; Clusters]

DBSCAN: Core, Border and Noise Points
[Figure panels: Original Points; Point types: core, border and noise]
Eps = 10, MinPts = 4

DBSCAN: Determining Eps and MinPts
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor. (k=4 used for 2D points.)
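The values behind such a plot are easy to compute by brute force. The sketch below returns the sorted distances from each point to its kth nearest neighbor, assuming Euclidean distance and excluding the point itself; the function name, the toy data, and the use of k=2 (because the toy set is tiny) are assumptions for illustration. A sharp jump in the sorted values suggests a candidate Eps just below the jump.

```python
import math


def sorted_k_dist(points, k=4):
    """Sorted distance from every point to its k-th nearest neighbor."""
    k_distances = []
    for i, p in enumerate(points):
        # Distances to all other points, nearest first.
        others = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        k_distances.append(others[k - 1])
    return sorted(k_distances)


# Four points in a tight square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
k_dist = sorted_k_dist(points, k=2)
print([round(d, 2) for d in k_dist])
```

Here the four clustered points share the same small k-distance, while the isolated point sits far above them, which is the pattern the plot is meant to expose.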

When DBSCAN Does NOT Work Well
[Figure panels: Original Points; (MinPts=4, Eps=9.75); (MinPts=4, Eps=9.92)]