Clustering AMCS/CS 340: Data Mining Xiangliang Zhang

Slides:



Advertisements
Similar presentations
Clustering.
Advertisements

K-Means Clustering Algorithm Mining Lab
Copyright Jiawei Han, modified by Charles Ling for CS411a
AMCS/CS229: Machine Learning
K-means Clustering Ke Chen.
Clustering.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
What is Cluster Analysis?
CS 478 – Tools for Machine Learning and Data Mining Clustering: Distance-based Approaches.
0 x x2 0 0 x1 0 0 x3 0 1 x7 7 2 x0 0 9 x0 0.
SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Unsupervised Learning
Cluster Analysis: Basic Concepts and Algorithms
Data Mining Cluster Analysis Basics
Hierarchical Clustering, DBSCAN The EM Algorithm
Clustering Basic Concepts and Algorithms
7x7=.
PARTITIONAL CLUSTERING
CS690L: Clustering References:
Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or.
Clustering II.
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Cluster Analysis: Basic Concepts and Algorithms
Cluster Analysis (1).
What is Cluster Analysis?
Cluster Analysis CS240B Lecture notes based on those by © Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004.
What is Cluster Analysis?
Clustering Ram Akella Lecture 6 February 23, & 280I University of California Berkeley Silicon Valley Center/SC.
DATA MINING LECTURE 8 Clustering The k-means algorithm
CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:
Unsupervised Learning. CS583, Bing Liu, UIC 2 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate.
1 Lecture 10 Clustering. 2 Preview Introduction Partitioning methods Hierarchical methods Model-based methods Density-based methods.
Jeff Howbert Introduction to Machine Learning Winter Clustering Basic Concepts and Algorithms 1.
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Clustering COMP Research Seminar BCB 713 Module Spring 2011 Wei Wang.
Clustering.
CLUSTERING AND SEGMENTATION MIS2502 Data Analytics Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining.
Clustering.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Definition Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to)
Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
Cluster Analysis Dr. Bernard Chen Assistant Professor Department of Computer Science University of Central Arkansas.
Mr. Idrissa Y. H. Assistant Lecturer, Geography & Environment Department of Social Sciences School of Natural & Social Sciences State University of Zanzibar.
Cluster Analysis Dr. Bernard Chen Ph.D. Assistant Professor Department of Computer Science University of Central Arkansas Fall 2010.
Clustering Wei Wang. Outline What is clustering Partitioning methods Hierarchical methods Density-based methods Grid-based methods Model-based clustering.
1 Similarity and Dissimilarity Between Objects Distances are normally used to measure the similarity or dissimilarity between two data objects Some popular.
Introduction to Data Mining Clustering & Classification Reference: Tan et al: Introduction to data mining. Some slides are adopted from Tan et al.
Cluster Analysis What is Cluster Analysis? Types of Data in Cluster Analysis A Categorization of Major Clustering Methods Partitioning Methods.
DATA MINING: CLUSTER ANALYSIS Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
ΠΑΝΕΠΙΣΤΗΜΙΟ ΙΩΑΝΝΙΝΩΝ ΑΝΟΙΚΤΑ ΑΚΑΔΗΜΑΪΚΑ ΜΑΘΗΜΑΤΑ Εξόρυξη Δεδομένων Ομαδοποίηση (clustering) Διδάσκων: Επίκ. Καθ. Παναγιώτης Τσαπάρας.
Data Mining: Basic Cluster Analysis
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 10 —
Clustering CSC 600: Data Mining Class 21.
Data Mining K-means Algorithm
Topic 3: Cluster Analysis
CSE 5243 Intro. to Data Mining
Clustering Basic Concepts and Algorithms 1
Critical Issues with Respect to Clustering
CSE572, CBS598: Data Mining by H. Liu
Data Mining: Clustering
Clustering Wei Wang.
Topic 5: Cluster Analysis
SEEM4630 Tutorial 3 – Clustering.
Data Mining Cluster Analysis: Basic Concepts and Algorithms
What is Cluster Analysis?
Presentation transcript:

Clustering AMCS/CS 340: Data Mining Xiangliang Zhang King Abdullah University of Science and Technology

Grouping fruits Grouping apple with apple, orange with orange and banana with banana 2

Give pictures to a computer Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 3

Change pictures to data Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 4

Change pictures to data x3 x1 x2 x4 x5 x6 x7 x8 x9 ……xn Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 5

Use clustering methods 1 2 3 ... … x1 x2 x3 x4 x5 x6 x7 x8 x9 ... … Output: clustering indicator Clustering Method Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 6

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Correct? 1 2 3 ... … x1 x2 x3 x4 x5 x6 x7 x8 x9 ... … Output: clustering indicator Clustering Method Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 7

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Cluster Analysis What is Cluster Analysis? Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Methods Clustering High-Dimensional Data How to decide the number of clusters? Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 8

What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized Unsupervised learning: no predefined classes Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 9

Applications of Cluster Analysis Understanding As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms E.g., group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations Summarization Reduce the size of large data sets Image segmentation/compression Preserve Privacy (e.g., in medical data) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 10

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Clustering A clustering is a set of clusters Notion of a Cluster can be Ambiguous A set of data points A clustering with Two Clusters A clustering with Six Clusters A clustering with Four Clusters Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 11

Quality: What is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity The quality of a clustering result depends on both the implementation of a method and the similarity measure used by this method The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 12

Quality: What is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity The quality of a clustering result depends on both the implementation of a method and the similarity measure used by this method The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns What kinds of method ? What kinds of similarity measure ? Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 13

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Similarity measure The similarity measure depends on the characteristics of the input data Attribute type: binary, categorical, continuous Sparseness Dimensionality Type of proximity Center-based Density-based Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 14

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Data Structures Data matrix n instances, p attributes (features) Distance matrix (dissimilarity matrix) Minkowski distance: If q = 1, d is Manhattan distance If q = 2, d is Euclidean distance Cosine measure Correlation coefficient Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 15

Types of Clustering methods Partitioning Clustering A division data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset Typical methods: k-means, k-medoids, CLARANS A Partitioning Clustering Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 16

Types of Clustering methods Hierarchical clustering A set of nested clusters organized as a hierarchical tree Typical methods: Diana, Agnes, BIRCH, ROCK, CAMELEON Hierarchical Clustering Dendrogram Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 17

Types of Clustering methods Density-based Clustering Based on connectivity and density functions A cluster is a dense region of points, which is separated by low-density regions, from other regions of high density. Typical methods: DBSACN, OPTICS, DenClue 6 density-based clusters Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 18

Types of Clustering methods Grid-based Clustering Based on a multiple-level granularity structure Typical methods: STING, WaveCluster, CLIQUE Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 19

Types of Clustering methods Model-based clustering: A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other Typical methods: EM, SOM, COBWEB Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 20

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining Cluster Analysis What is Cluster Analysis? Partitioning Methods Hierarchical Methods Density-Based Methods Grid-Based Methods Model-Based Methods Clustering High-Dimensional Data How to decide the number of clusters? Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 21

Partitioning Algorithms: Basic Concept Partitioning clustering method: Construct a partition of a dataset D of n objects into a set of k clusters, s.t., min sum of squared distance Averaged center of the cluster NP-hard when k is a part of input (even for 2-dim)* Given a k, finding a partition of k clusters that optimizes SSD takes Heuristic methods: k-means (also called Lloyd’s method [Llo82]) C3 C2 C1 x μi a simple heuristic: k-means! A Partitioning Clustering k=3 * Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. (2009). "The Planar k-Means Problem is NP-Hard". Lecture Notes in Computer Science 5431: 274–285. # Inaba; Katoh; Imai (1994). Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering". Proceedings of 10th ACM Symposium on Computational Geometry. 22

Partitioning Algorithms: Basic Concept Partitioning clustering method: Construct a partition of a dataset D of n objects into a set of k clusters, s.t., min sum of squared distance Actual center of the cluster Global optimal: exhaustively enumerate all partitions Heuristic methods: k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87) A Partitioning Clustering k=3 C3 C2 C1 O μi Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 23

Partitioning Algorithms k-means Algorithm Issue of initial centroids , clustering evaluation Limitations of k-means k-medoids Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 24

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining k-means clustering Number of clusters, K, must be specified Each cluster is associated with an averaged point (centroid) Each point is assigned to the cluster with the closest centroid The basic algorithm is very simple Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 25

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining k-means clustering Example: 1 2 3 4 5 6 7 8 9 10 10 9 8 7 6 5 Update the cluster means 4 Assign each objects to most similar center 3 2 1 1 2 3 4 5 6 7 8 9 10 reassign reassign K=2 Arbitrarily choose K object as initial cluster center Update the cluster means Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 26

K-means Clustering – Details Initial centroids are often chosen randomly. Clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for common similarity measures mentioned above. Most of the convergence happens in the first few iterations. Often the stopping condition is changed to ‘Until relatively few points change clusters’ Complexity is O( n * K * t * d ) n = number of points, K = number of clusters, t = number of iterations, d = number of attributes Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 27

Partitioning Algorithms k-means Algorithm Issue of initial centroids , clustering evaluation Limitations of k-means k-medoids Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 28

Importance of Choosing Initial Centroid Clusters produced vary from one run to another. Run 2 Run 1 Original Points 29

Evaluating K-means Clustering Most common measure is Sum of Squared Error (SSE) For each point, the error is the distance to the nearest centroid (the error of representing each point by its nearest centroid) Given two clustering results, we can choose the one with smaller error SSE of optimal clustering result reduces when increasing K, the number of clusters A good clustering with smaller K can have a lower SSE than a poor clustering with higher K Note: C1 is better than C2 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 30

Importance of Choosing Initial Centroid Clusters produced vary from one run to another. Run 2 Run 1 Original Points SSE(Run1) < SSR(Run2) 31

Solutions to Initial Centroids Problem Multiple runs select the one with smallest SSE Sample and use hierarchical clustering to determine initial centroids Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 32

Comments on the K-Means Method Strength: Relatively efficient: O(nktd), where n is # objects, k is # clusters, t is # iterations , and d is # dimensions. Normally, k, t ,d<< n. Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k)) Comment: Often terminates at a local optimum. Weakness Applicable only when mean is defined, then what about categorical data? Need to specify k, the number of clusters, in advance Unable to handle noisy data and outliers Not suitable to discover clusters with differing sizes, differing density, non-convex shapes Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 33

Partitioning Algorithms k-means Algorithm Issue of initial centroids , clustering evaluation Limitations of k-means k-medoids Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 34

Limitations of K-means: Differing Sizes Original Points K-means (3 Clusters) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 35

Limitations of K-means: Differing Density Original Points K-means (3 Clusters) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 36

Limitations of K-means: Non-convex Shapes Original Points K-means (2 Clusters) Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 37

Overcoming K-means Limitations Original Points K-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 38

Overcoming K-means Limitations Original Points K-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 39

Overcoming K-means Limitations Original Points K-means Clusters One solution is to use many clusters. Find parts of clusters, but need to put together. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 40

K-medoids, instead of K-means The k-means algorithm is sensitive to outliers ! Since an object with an extremely large value may substantially distort the distribution of the data Means are not able to compute in some cases. Only similarities among objects are available K-Medoids: Instead of taking the mean value of the object in a cluster as a reference point, medoids can be used, which is the most centrally located object in a cluster. 1 2 3 4 5 6 7 8 9 10 Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 41

Partitioning Algorithms k-means Algorithm Issue of initial centroids , clustering evaluation Limitations of k-means k-medoids PAM CLARA CLARANS Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 42

The K-Medoids Clustering Method Find representative objects, called medoids, in clusters k-medoids, use the same strategy of k-means PAM (Partitioning Around Medoids, 1987) starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering PAM works effectively for small data sets, but does not scale well for large data sets CLARA (Kaufmann & Rousseeuw, 1990) CLARANS (Ng & Han, 1994): Randomized sampling Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 43

A Typical K-Medoids Algorithm (PAM) Total Cost ({mi}) = 20 10 9 8 7 Arbitrary choose k object as initial medoids (mi,i=1..k) Assign each remaining object to nearest medoids 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 10 K=2 select a nonmedoid object,Oj Total Cost = 18 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 Compute total cost of all possible new set of medoids (Oj,{mi},i≠t) Swapping mt and Oj If cost({Oj,{mi,i≠t})} is the smallest, and cost ({Oj,{mi,i≠t}}) < cost({mi}). Do loop Until no change 44

PAM (Partitioning Around Medoids) (1987) PAM (Partitioning Around Medoids, Kaufman and Rousseeuw, 1987) Use real object to represent the cluster Select k representative objects (medoids) arbitrarily For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih TCih= total_cost(replace i by h) - total_cost(no replace) If min(TCih )< 0, i is replaced by h Then assign each non-selected object to the most similar representative object repeat steps 2-3 until there is no change O(k(n-k)2 ) for each iteration where n is # of data, k is # of clusters Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 45

CLARA (Clustering Large Applications) (1990) CLARA (Clustering LARge Applications, Kaufmann and Rousseeuw) Sampling based method: It draws multiple samples of the data set, applies PAM on each sample, gives the best clustering as the output (minimizing cost/SSE) Strength: deals with larger data sets than PAM Weakness: Efficiency depends on the sample size A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 46

CLARANS (“Randomized” CLARA) (1994) CLARANS (A Clustering Algorithm based on Randomized Search, Ng and Han’94) The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids PAM: checks every neighbor CLARA: examines fewer neighbors, searches in subgraphs built from samples CLARANS: searches the whole graph but draws sample of neighbors dynamically Each node: k medoids, which correspond to a clustering ……. Two nodes are connected as neighbors if their sets differ by only one item each node has k(n-k) neighbors Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 47

CLARANS (“Randomized” CLARA) (1994) CLARANS: searches the whole graph but draws sample of neighbors dynamically It is more efficient and scalable than both PAM and CLARA Focusing techniques and spatial access structures may further improve its performance (Ester et al.’95) ……. Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 48

Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining What you should know What is clustering? What is partitioning clustering method? How does k-means work? The limitation of k-means How does k-mediods work? How to solve the scalability problem of k-mediods? Xiangliang Zhang, KAUST AMCS/CS 340: Data Mining 49