K-means and Hierarchical Clustering

K-means and Hierarchical Clustering
Team C: Usman Cheema | Muhammad Abdullah | Yin Helin

Table of Contents
- Supervised and Unsupervised Learning
- Clustering
- K-means Clustering
- Hierarchical Clustering
- Comparison of K-means and Hierarchical Clustering

Learning
- Supervised learning is the machine learning task of learning a function from labeled training data.
- Unsupervised learning is the machine learning task of inferring a function to describe hidden structure from unlabeled data.

Clustering
- Data clustering is an unsupervised learning problem.
- Given: N unlabeled examples {x1, ..., xN} and the number of partitions K.
- Goal: group the examples into K partitions.
- The only information clustering uses is the similarity between examples: examples are grouped based on their mutual similarities.
- A good clustering is one that achieves high within-cluster similarity and low inter-cluster similarity.

Clustering - Similarity between Data Points
- The choice of similarity measure is very important for clustering; similarity is inversely related to distance.
- Different ways exist to measure distances. Some examples:
  - Euclidean distance: d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
  - Manhattan distance: d(x, y) = \sum_i |x_i - y_i|
  - Kernelized (non-linear) distance: d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y), for a kernel function k
- [Figures omitted: for the dataset on the left, Euclidean distance may be reasonable; for the one on the right, a kernelized distance seems more reasonable]
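A minimal NumPy sketch of these three distances (the RBF kernel and its gamma value below are illustrative choices, not from the slides):

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

def kernel_distance(x, y, gamma=1.0):
    # Distance induced by an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2):
    # d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y) = 2 - 2 k(x, y)
    k_xy = np.exp(-gamma * np.sum((x - y) ** 2))
    return np.sqrt(2.0 - 2.0 * k_xy)

x, y = np.array([2.0, 1.0]), np.array([3.0, 3.0])
print(euclidean(x, y), manhattan(x, y), kernel_distance(x, y))
```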

Clustering - Examples
- Clustering images based on their perceptual similarities
- Image segmentation (clustering pixels)
- Clustering web pages based on their content
- Clustering web-search results
- Clustering people in social networks based on user properties/preferences
- Many more...

Types of Clustering
- Flat or partitional clustering: partitions are independent of each other (K-means, Gaussian mixture models, etc.)
- Hierarchical clustering: partitions can be visualized using a tree structure (a dendrogram); possible to view partitions at different levels of granularity (i.e., refine/coarsen clusters) using different K
  - Agglomerative clustering
  - Divisive clustering
In the following slides we discuss K-means clustering along with hierarchical clustering.

K-means Clustering
- K-means clustering is a type of unsupervised learning, used when you have unlabeled data (i.e., data without defined categories or groups).
- The goal of the algorithm is to find groups in the data, with the number of groups represented by the variable K (user-defined).
- The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
- Data points are clustered based on feature similarity (distance).

K-means Clustering
- The inputs of the k-means clustering algorithm are the number of clusters k and the data set; the data set is a collection of features for each data point.
- The algorithm starts with initial estimates for the k centroids, which can either be chosen randomly or selected from the data set.
- The process of k-means clustering (a sketch of the full loop is shown below):
  1. Set the number of clusters k (e.g., k = 2).
  2. Iterate between the following two steps until the centroids no longer change:
     - Data assignment step
     - Centroid update step
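A minimal NumPy sketch of the whole loop (function and variable names are our own; empty clusters are not handled, as a real implementation would re-seed them):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means on an (n, d) data matrix X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initial estimates: pick k distinct data points as centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Data assignment step: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update step: mean of the points assigned to each cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # no more change: converged
            break
        centroids = new_centroids
    return centroids, labels
```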

K-means Clustering - Data assignment step
Each data point is assigned to its nearest centroid, based on squared Euclidean distance:
c^{(i)} = \arg\min_k \lVert x_i - \mu_k \rVert^2
where \mu_k is the centroid of cluster k and c^{(i)} is the cluster assigned to point x_i.
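In code, the assignment step is one vectorized distance computation plus an argmin. The slide's actual point coordinates are not preserved in the transcript, so the points below are hypothetical; the centroids (2, 1) and (3, 3) are the slide's:

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [4.0, 5.0]])  # hypothetical toy points
centroids = np.array([[2.0, 1.0], [3.0, 3.0]])       # the two centroids
# dists[i, j] = Euclidean distance from point i to centroid j, shape (n, k)
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # nearest-centroid index per point -> [0, 0, 1]
```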

K-means Clustering - Data assignment step (Iteration 1)
First, randomly set initial centroid points. For example, with k = 2:
- Centroid of cluster 1: (2, 1)
- Centroid of cluster 2: (3, 3)
[Figure: scatter plot of the data points with the two initial centroids]

K-means Clustering - Data assignment step
Calculate the distance between each centroid and each point, then compare: each point is assigned to the cluster of the nearer centroid.
[Table: distances from each of the 10 data points to centroid (2, 1) and to centroid (3, 3); points 1-4 end up closer to (2, 1), points 5-10 closer to (3, 3)]

K-means Clustering - Data assignment step (Iteration 1)
With k = 2 and the initial centroids (2, 1) and (3, 3), the first assignment splits the data into two clusters.
[Figure: data points colored by assigned cluster, with both centroids marked]

K-means Clustering - Centroid update step
In this step, the centroids are recomputed by taking the mean of all data points assigned to that centroid's cluster:
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i

Cluster | New Centroid | Data Index
1       | (1.5, 1.5)   | 1, 2, 3, 4
2       | (5.1, 5.6)   | 5, 6, 7, 8, 9, 10

[Table of the 10 points' coordinates omitted in the transcript]
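The update step in code, continuing the hypothetical toy arrays from the assignment-step snippet above:

```python
k = 2
# New centroid of each cluster = mean of the points assigned to it
centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
# With the toy points above: cluster 0 -> (1.5, 1.5), cluster 1 -> (4.0, 5.0)
```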

K-means Clustering - Data assignment step (Iteration 2)
Now repeat, calculating the distance between each new centroid and each point.
[Figure: data points with the new centroids, (1.5, 1.5) for cluster 1 and (5.1, 5.6) for cluster 2; one point's assignment changes]

K-means Clustering - Data assignment step
Calculate the distance between each centroid and each point, then compare.
[Table: distances from each of the 10 data points to centroids (1.5, 1.5) and (5.1, 5.6); point 5 changes from cluster 2 to cluster 1]

K-means Clustering - Centroid update step
The centroids are again recomputed as the mean of all data points assigned to each cluster:

Cluster | New Centroid | Data Index
1       | (1.8, 1.6)   | 1, 2, 3, 4, 5
2       | (5.6, 6.2)   | 6, 7, 8, 9, 10

K-means Clustering - Data assignment step (Iteration 3)
Now repeat, calculating the distance between each new centroid and each point.
[Figure: data points with the new centroids, (1.8, 1.6) for cluster 1 and (5.6, 6.2) for cluster 2]

K-means Clustering Example
[Animation: k-means clustering converging over several iterations]

K-means Clustering
Repeat the data assignment step and the centroid update step until the centroids no longer change.

Advantages of K-means clustering:
- The algorithm is simple to use
- The algorithm is guaranteed to converge (though possibly to a local optimum)

Disadvantages of K-means clustering:
- Need to specify k, the number of clusters
- Often terminates at a local optimum
- Not suitable for discovering clusters with non-convex shapes
- Not suitable for discovering clusters with different densities
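In practice one would typically reach for a library implementation; a sketch with scikit-learn (assuming it is installed), where multiple random restarts via n_init mitigate the local-optimum problem:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((100, 2))          # illustrative data
km = KMeans(n_clusters=2, n_init=10, random_state=0)   # 10 restarts, keep best
labels = km.fit_predict(X)
print(km.cluster_centers_)                             # final centroids
```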

K-means Clustering - Failures
- Non-convex / non-round-shaped clusters: standard k-means fails!
- Clusters with different densities
[Figures: example datasets illustrating both failure modes]
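The non-convex failure is easy to reproduce; a sketch using scikit-learn's two-moons generator (illustrative parameters):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-convex, non-round clusters
X, true_labels = make_moons(n_samples=200, noise=0.05, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# pred cuts straight across both moons instead of recovering them,
# because k-means prefers round, convex, similarly sized clusters
```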

Dissimilarity between Clusters
We know how to compute the dissimilarity between two examples. How do we compute the dissimilarity between two clusters R and S?
- Min-link or single-link: results in chaining (clusters can get very large)
- Max-link or complete-link: results in small, round-shaped clusters
- Average-link: a compromise between single and complete linkage
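A sketch of the three linkage rules using SciPy's pairwise distances (the function name is our own):

```python
import numpy as np
from scipy.spatial.distance import cdist

def cluster_dissimilarity(R, S, link="single"):
    """Dissimilarity between clusters R (m, d) and S (n, d)."""
    D = cdist(R, S)          # all pairwise point-to-point distances
    if link == "single":     # min-link: closest pair of points
        return D.min()
    if link == "complete":   # max-link: farthest pair of points
        return D.max()
    return D.mean()          # average-link: mean over all pairs
```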

Hierarchical Clustering
- Clustering is performed based on dissimilarities between clusters
- Produces a sequence or tree of clusterings
- Does not need the number of clusters as input
- Partitions can be visualized using a tree structure (a dendrogram); possible to view partitions at different levels of granularity using different K
- Applications: social sciences, biological taxonomy, medicine, archaeology, computer science, engineering, etc.
- Types of hierarchical clustering:
  - Agglomerative (bottom-up) clustering
  - Divisive (top-down) clustering
- Agglomerative is more popular and simpler than divisive (but less accurate)

Hierarchical Clustering - Agglomerative (bottom-up) Clustering
Say "every point is its own cluster"

Hierarchical Clustering - Agglomerative (bottom-up) Clustering
Find the "most similar" pair of clusters by:
- Minimum distance between points in clusters (in which case we're simply doing Euclidean minimum spanning trees)
- Maximum distance between points in clusters
- Average distance between points in clusters

Hierarchical Clustering - Agglomerative (bottom-up) Clustering
1. Say "every point is its own cluster"
2. Find the "most similar" pair of clusters
3. Merge them into a parent cluster

Hierarchical Clustering - Agglomerative (bottom-up) Clustering
1. Say "every point is its own cluster"
2. Find the "most similar" pair of clusters
3. Merge them into a parent cluster
4. Repeat

Hierarchical Clustering - Agglomerative (bottom-up) Clustering
1. Say "every point is its own cluster"
2. Find the "most similar" pair of clusters
3. Merge them into a parent cluster
4. Repeat until you've merged the whole dataset into one cluster
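Rather than coding the loop by hand, the whole agglomerative procedure is available in SciPy; a sketch on illustrative data (plotting the dendrogram assumes matplotlib is installed):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((20, 2))       # illustrative data
Z = linkage(X, method="average")                   # also "single", "complete"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
dendrogram(Z)                                      # visualize the nested merges
```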

Hierarchical Clustering - Divisive (top-down) Clustering
Divisive clustering is a top-down method of hierarchical clustering:
1. Start with all examples in the same cluster
2. At each time-step, remove the "outsiders" from the least cohesive cluster
3. Stop when each example is in its own singleton cluster; otherwise go to step 2
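Divisive clustering as described above is rarely implemented directly; a common stand-in is recursive bisection with k-means. A rough sketch (names are our own; note it splits every cluster rather than only the least cohesive one):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, min_size=2):
    """Top-down sketch: repeatedly split clusters in two until small."""
    work, done = [np.arange(len(X))], []
    while work:
        idx = work.pop()
        if len(idx) <= min_size:          # small enough: stop splitting
            done.append(idx)
            continue
        lab = KMeans(n_clusters=2, n_init=5, random_state=0).fit_predict(X[idx])
        work += [idx[lab == 0], idx[lab == 1]]
    return done  # list of index arrays, one per leaf cluster
```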

K-means Clustering vs Hierarchical Clustering
- K-means clustering produces a single partitioning; hierarchical clustering can give different partitionings depending on the level of resolution we are looking at
- K-means clustering needs the number of clusters to be specified; hierarchical clustering does not
- K-means clustering is usually more efficient run-time wise; hierarchical clustering can be slow (it has to make several merge/split decisions)
- There is no clear consensus on which of the two produces better clusterings
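The "different partitionings at different resolutions" point can be seen directly with SciPy: build the linkage tree once, then cut it at any number of clusters (illustrative data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(1).random((50, 2))
Z = linkage(X, method="complete")    # build the tree once
for k in (2, 3, 5):                  # then read off any number of clusters
    print(k, fcluster(Z, t=k, criterion="maxclust")[:10])
```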

Thank you