Clustering Patrice Koehl Department of Biological Sciences National University of Singapore

Slides:



Advertisements
Similar presentations
Clustering II.
Advertisements

SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Clustering.
Hierarchical Clustering
1 CSE 980: Data Mining Lecture 16: Hierarchical Clustering.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
Hierarchical Clustering, DBSCAN The EM Algorithm
Data Mining Cluster Analysis: Basic Concepts and Algorithms
Assessment. Schedule graph may be of help for selecting the best solution Best solution corresponds to a plateau before a high jump Solutions with very.
Introduction to Bioinformatics
UNSUPERVISED ANALYSIS GOAL A: FIND GROUPS OF GENES THAT HAVE CORRELATED EXPRESSION PROFILES. THESE GENES ARE BELIEVED TO BELONG TO THE SAME BIOLOGICAL.
Clustering approaches for high- throughput data Sushmita Roy BMI/CS 576 Nov 12 th, 2013.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
6-1 ©2006 Raj Jain Clustering Techniques  Goal: Partition into groups so the members of a group are as similar as possible and different.
Clustering II.
1 Machine Learning: Symbol-based 10d More clustering examples10.5Knowledge and Learning 10.6Unsupervised Learning 10.7Reinforcement Learning 10.8Epilogue.
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
Cluster Analysis: Basic Concepts and Algorithms
Semi-Supervised Clustering Jieping Ye Department of Computer Science and Engineering Arizona State University
Introduction to Bioinformatics - Tutorial no. 12
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Ulf Schmitz, Pattern recognition - Clustering1 Bioinformatics Pattern recognition - Clustering Ulf Schmitz
Clustering. What is clustering? Grouping similar objects together and keeping dissimilar objects apart. In Information Retrieval, the cluster hypothesis.
Birch: An efficient data clustering method for very large databases
Chapter 3: Cluster Analysis  3.1 Basic Concepts of Clustering  3.2 Partitioning Methods  3.3 Hierarchical Methods The Principle Agglomerative.
Clustering Unsupervised learning Generating “classes”
Elizabeth Garrett-Mayer November 5, 2003 Oncology Biostatistics
Math 5364 Notes Chapter 8: Cluster Analysis Jesse Crawford Department of Mathematics Tarleton State University.
COMP53311 Clustering Prepared by Raymond Wong Some parts of this notes are borrowed from LW Chan ’ s notes Presented by Raymond Wong
The BIRCH Algorithm Davitkov Miroslav, 2011/3116
1 Motivation Web query is usually two or three words long. –Prone to ambiguity –Example “keyboard” –Input device of computer –Musical instruments How can.
CLUSTERING. Overview Definition of Clustering Existing clustering methods Clustering examples.
CSE5334 DATA MINING CSE4334/5334 Data Mining, Fall 2014 Department of Computer Science and Engineering, University of Texas at Arlington Chengkai Li (Slides.
Quantitative analysis of 2D gels Generalities. Applications Mutant / wild type Physiological conditions Tissue specific expression Disease / normal state.
K-Means Algorithm Each cluster is represented by the mean value of the objects in the cluster Input: set of objects (n), no of clusters (k) Output:
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Hierarchical Clustering Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree like diagram that.
Machine Learning Queens College Lecture 7: Clustering.
Slide 1 EE3J2 Data Mining Lecture 18 K-means and Agglomerative Algorithms.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Validity index for clusters of different sizes and densities Presenter: Jun-Yi Wu Authors: Krista Rizman.
Community Discovery in Social Network Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
Clustering Algorithm CS 157B JIA HUANG. Definition Data clustering is a method in which we make cluster of objects that are somehow similar in characteristics.
Clustering/Cluster Analysis. What is Cluster Analysis? l Finding groups of objects such that the objects in a group will be similar (or related) to one.
Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
Clustering Algorithms Sunida Ratanothayanon. What is Clustering?
CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:
Given a set of data points as input Randomly assign each point to one of the k clusters Repeat until convergence – Calculate model of each of the k clusters.
1 Cluster Analysis Prepared by : Prof Neha Yadav.
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Clustering Machine Learning Unsupervised Learning K-means Optimization objective Random initialization Determining Number of Clusters Hierarchical Clustering.
Machine Learning Lecture 4: Unsupervised Learning (clustering) 1.
CSE4334/5334 Data Mining Clustering. What is Cluster Analysis? Finding groups of objects such that the objects in a group will be similar (or related)
Clustering (2) Center-based algorithms Fuzzy k-means Density-based algorithms ( DBSCAN as an example ) Evaluation of clustering results Figures and equations.
Unsupervised Learning
Data Mining: Basic Cluster Analysis
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
Semi-Supervised Clustering
Clustering Patrice Koehl Department of Biological Sciences
Chapter 15 – Cluster Analysis
Machine Learning Clustering: K-means Supervised Learning
Hierarchical Clustering
Data Mining – Chapter 4 Cluster Analysis Part 2
SEEM4630 Tutorial 3 – Clustering.
Hierarchical Clustering
Unsupervised Learning
Presentation transcript:

Clustering Patrice Koehl Department of Biological Sciences National University of Singapore

Clustering is a hard problem Many possibilities; What is best clustering ?

Clustering is a hard problem 2 clusters: easy

Clustering is a hard problem 4 clusters: difficult ?? Many possibilities; What is best clustering ?

Clustering  Hierarchical clustering  K-means clustering  How many clusters?

Clustering  Hierarchical clustering  K-means clustering  How many clusters?

Hierarchical Clustering To cluster a set of data D={P 1, P 2, …,P N }, hierarchical clustering proceeds through a series of partitions that runs from a single cluster containing all data points, to N clusters, each containing 1 data points. Two forms of hierarchical clustering: a, b, c, d, e c, d, e a, bd, e ab cd e Agglomerative Divisive

Agglomerative hierarchical clustering techniques  Starts with N independent clusters: {P 1 }, {P 2 }, …,{P N }  Find the two closest (most similar) clusters, and join them  Repeat step 2 until all points belong to the same cluster Methods differ in their definition of inter-cluster distance (or similarity)

1) Single linkage clustering Distance between closest pairs of points: Distance between farthest pairs of points: 2) Complete linkage clustering Cluster A Cluster B Cluster A Cluster B Agglomerative hierarchical clustering techniques

3) Average linkage clustering Mean distance of all mixed pairs of points: Mean distance of all pairs of points 4) Average group linkage clustering Cluster A: N A elements Cluster B: N B elements Cluster A Cluster B Cluster T Agglomerative hierarchical clustering techniques

Clustering  Hierarchical clustering  K-means clustering  How many clusters?

K-means clustering (

K-means clustering (

K-means clustering (

K-means clustering (

K-means clustering (

Clustering  Hierarchical clustering  K-means clustering  How many clusters?

Cluster validation Clustering is hard: it is an unsupervised learning technique. Once a Clustering has been obtained, it is important to assess its validity! The questions to answer:  Did we choose the right number of clusters?  Are the clusters compact?  Are the clusters well separated? To answer these questions, we need a quantitative measure of the cluster sizes:  intra-cluster size  Inter-cluster distances

Inter cluster size Several options:  Single linkage  Complete linkage  Average linkage  Average group linkage

Intra cluster size Several options: -Complete diameter: -Average diameter: -Centroid diameter: For a cluster S, with N members and center C:

Cluster Quality 1) Dunn’s index For a clustering with K clusters: Large values of D correspond to good clusters 2) Davies-Bouldin’s index Low values of DB correspond to good clusters

Cluster Quality: Silhouette index Define a quality index for each point in the original dataset:  For the ith object, calculate its average distance to all other objects in its cluster. Call this value ai.  For the ith object and any cluster not containing the object, calculate the object’s average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value bi.  For the ith object, the silhouette coefficient is

Cluster Quality: Silhouette index Note that:  s(i) = 1, i is likely to be well classified  s(i) = -1, i is likely to be incorrectly classified  s(i) = 0, indifferent

Cluster Quality: Silhouette index Cluster silhouette index: Global silhouette index: Large values of GS correspond to good clusters