Clustering Patrice Koehl Department of Biological Sciences


Clustering Patrice Koehl Department of Biological Sciences National University of Singapore http://www.cs.ucdavis.edu/~koehl/Teaching/BL5229 koehl@cs.ucdavis.edu

Clustering is a hard problem
2 clusters: easy.
4 clusters: difficult. Many partitions are possible; which clustering is best?

Clustering
Hierarchical clustering
K-means clustering
How many clusters?

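Hierarchical Clustering
To cluster a set of data D = {P1, P2, …, PN}, hierarchical clustering proceeds through a series of partitions that runs from a single cluster containing all data points to N clusters, each containing one data point.
Two forms of hierarchical clustering:
Agglomerative (bottom-up): start from the single points a, b, c, d, e and merge step by step, e.g. {a, b}, then {d, e}, then {c, d, e}, then {a, b, c, d, e}
Divisive (top-down): start from {a, b, c, d, e} and split step by step down to the single points a, b, c, d, e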

Agglomerative hierarchical clustering techniques
1) Start with N independent clusters: {P1}, {P2}, …, {PN}
2) Find the two closest (most similar) clusters, and join them
3) Repeat step 2 until all points belong to the same cluster
Methods differ in their definition of inter-cluster distance (or similarity).
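The merge loop can be sketched in a few lines of Python. This is a minimal, unoptimized illustration and not the author's implementation: the function names `agglomerative` and `single_linkage`, the Euclidean metric, and the sample points are all assumptions made for the example. Single linkage is used as the default inter-cluster distance only because some choice has to be made.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, linkage=single_linkage):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[p] for p in points]          # N singleton clusters
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest inter-cluster distance.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        d = linkage(clusters[i], clusters[j])
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Replace the two clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical sample data: five points in the plane.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (11.0, 0.0)]
history = agglomerative(points)
for left, right, d in history:
    print(f"merge {left} + {right} at distance {d:.2f}")
```

The recorded merge distances are what a dendrogram plots on its vertical axis. The loop recomputes every pairwise cluster distance at each step, which is fine for a handful of points but too slow for large data sets.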

Agglomerative hierarchical clustering techniques
For two clusters A (NA elements) and B (NB elements):
1) Single linkage clustering: distance between the closest pair of points, one in A and one in B
2) Complete linkage clustering: distance between the farthest pair of points, one in A and one in B
3) Average linkage clustering: mean distance over all mixed pairs of points (one in A, one in B)
4) Average group linkage clustering: mean distance over all pairs of points in the merged cluster T = A ∪ B
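The four inter-cluster distances can be written down directly from their verbal definitions. The formulas on the slides did not survive in the transcript, so the sketch below is a reading of the text rather than a copy of the original equations; in particular, treating average group linkage as the mean pairwise distance inside T = A ∪ B is an assumption. Function names, the Euclidean metric, and the sample clusters are hypothetical.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Smallest distance between a point of A and a point of B."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Largest distance between a point of A and a point of B."""
    return max(math.dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean distance over all NA * NB mixed pairs."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def average_group_linkage(a, b):
    """Mean distance over all pairs of points in T = A u B (assumed reading)."""
    pairs = list(combinations(a + b, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

# Hypothetical clusters on a line, chosen so the values are easy to check by hand.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(A, B))         # 3.0
print(complete_linkage(A, B))       # 6.0
print(average_linkage(A, B))        # 4.5
print(average_group_linkage(A, B))  # 3.5
```

Single linkage tends to chain elongated clusters together, while complete linkage favors compact ones; the two average variants sit in between.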

K-means clustering (http://www.weizmann.ac.il/midrasha/courses/)
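The K-means slides are figures only, so the transcript carries no description of the procedure. As a hedged illustration, here is a minimal sketch of the standard iteration (assign each point to its nearest center, then move each center to the mean of its points). It is assumed, not confirmed, that the figures show this variant; the function name `kmeans`, the fixed initial centers, and the sample data are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(c) for c in centers]
    groups = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda k: math.dist(p, centers[k]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its points.
        # An empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Hypothetical data: two well-separated groups and two starting centers.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
centers, groups = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
print(groups)
```

The initial centers are passed in explicitly to keep the example deterministic. In practice they are usually chosen at random, and the result can depend on that choice.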

Cluster validation
Clustering is hard: it is an unsupervised learning technique. Once a clustering has been obtained, it is important to assess its validity!
The questions to answer:
Did we choose the right number of clusters?
Are the clusters compact?
Are the clusters well separated?
To answer these questions, we need quantitative measures of intra-cluster size and inter-cluster distance.

Inter-cluster distance
Several options:
Single linkage
Complete linkage
Average linkage
Average group linkage

Intra-cluster size
For a cluster S, with N members and center C, several options:
Complete diameter
Average diameter
Centroid diameter
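The formulas for the three diameters are not preserved in the transcript. The sketch below uses the definitions commonly given in the cluster-validation literature, which is an assumption about what the slide showed: the complete diameter is the largest pairwise distance in S, the average diameter is the mean pairwise distance in S, and the centroid diameter is twice the mean distance from the members to the center C. Function names and sample data are hypothetical.

```python
import math
from itertools import combinations

def centroid(s):
    """Center C of cluster S: coordinate-wise mean of its members."""
    return tuple(sum(x) / len(s) for x in zip(*s))

def complete_diameter(s):
    """Largest distance between two members of S (needs at least two points)."""
    return max(math.dist(p, q) for p, q in combinations(s, 2))

def average_diameter(s):
    """Mean distance over all pairs of members of S (needs at least two points)."""
    pairs = list(combinations(s, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def centroid_diameter(s):
    """Twice the mean distance from the members of S to its center C."""
    c = centroid(s)
    return 2 * sum(math.dist(p, c) for p in s) / len(s)

# Hypothetical cluster of three collinear points.
S = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
print(complete_diameter(S))   # 4.0
print(average_diameter(S))    # 8/3
print(centroid_diameter(S))   # 8/3
```

The complete diameter is sensitive to a single outlier, whereas the two averaged versions are more robust.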

Cluster Quality
For a clustering with K clusters:
1) Dunn's index: large values of D correspond to good clusters
2) Davies-Bouldin's index: low values of DB correspond to good clusters
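Both indices combine an inter-cluster distance with an intra-cluster size. Since the slide formulas are missing from the transcript, the sketch below assumes the usual definitions: Dunn's index is the smallest inter-cluster distance divided by the largest intra-cluster size, and the Davies-Bouldin index averages, over clusters, the worst ratio of summed sizes to separation. Using single linkage and the complete diameter as defaults is an arbitrary choice for illustration; all names and data are hypothetical.

```python
import math
from itertools import combinations

def single_linkage(a, b):
    """Inter-cluster distance: closest pair of points across the two clusters."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_diameter(s):
    """Intra-cluster size: largest distance between two members."""
    return max(math.dist(p, q) for p, q in combinations(s, 2))

def dunn_index(clusters, inter=single_linkage, intra=complete_diameter):
    """Smallest separation over largest size; larger is better."""
    min_inter = min(inter(a, b) for a, b in combinations(clusters, 2))
    max_intra = max(intra(c) for c in clusters)
    return min_inter / max_intra

def davies_bouldin_index(clusters, inter=single_linkage, intra=complete_diameter):
    """Mean over clusters of the worst (size_i + size_j) / separation_ij; smaller is better."""
    total = 0.0
    for i, a in enumerate(clusters):
        total += max((intra(a) + intra(b)) / inter(a, b)
                     for j, b in enumerate(clusters) if j != i)
    return total / len(clusters)

# Hypothetical clustering with K = 3 clusters on a line.
clusters = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(5.0, 0.0), (6.0, 0.0)],
    [(20.0, 0.0), (22.0, 0.0)],
]
print(dunn_index(clusters))            # 2.0
print(davies_bouldin_index(clusters))  # 17/42, about 0.405
```

Any of the inter-cluster distances and intra-cluster sizes can be passed in through the `inter` and `intra` parameters, so the same two functions cover every combination.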

Cluster Quality: Silhouette index
Define a quality index for each point in the original dataset:
For the ith object, calculate its average distance to all other objects in its cluster. Call this value ai.
For the ith object and any cluster not containing the object, calculate the object's average distance to all the objects in the given cluster. Find the minimum such value with respect to all clusters; call this value bi.
For the ith object, the silhouette coefficient is s(i) = (bi - ai) / max(ai, bi).

Cluster Quality: Silhouette index
Note that s(i) lies between -1 and 1:
s(i) close to 1: i is likely to be well classified
s(i) close to -1: i is likely to be incorrectly classified
s(i) close to 0: indifferent (i lies about equally close to two clusters)

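Cluster Quality: Silhouette index
Cluster silhouette index: the mean of s(i) over the objects of one cluster
Global silhouette index (GS): the mean of the cluster silhouette indices over all clusters
Large values of GS correspond to good clusters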
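A minimal sketch of the silhouette computation follows. It assumes the standard coefficient s(i) = (b_i - a_i) / max(a_i, b_i), takes the cluster index as the mean of s(i) within a cluster, and takes the global index as the mean of the cluster indices; these formulas are not preserved in the transcript, so they are assumptions. The function name `silhouette`, the zero score for singleton clusters, and the sample data are also hypothetical.

```python
import math

def silhouette(clusters):
    """Return, for each cluster, the list of s(i) values of its points.

    Requires at least two clusters.
    """
    scores = []
    for k, cluster in enumerate(clusters):
        cluster_scores = []
        for idx, p in enumerate(cluster):
            others = [q for j, q in enumerate(cluster) if j != idx]
            if not others:
                # Convention assumed here: a singleton cluster gets s(i) = 0.
                cluster_scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, q) for q in others) / len(others)
            # b: smallest mean distance to the members of any other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for m, other in enumerate(clusters) if m != k)
            cluster_scores.append((b - a) / max(a, b))
        scores.append(cluster_scores)
    return scores

# Hypothetical clustering with two clusters on a line.
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 0.0), (6.0, 0.0)]]
scores = silhouette(clusters)
cluster_sil = [sum(s) / len(s) for s in scores]   # one index per cluster
global_sil = sum(cluster_sil) / len(cluster_sil)  # GS
print(cluster_sil, global_sil)
```

One common way to address the "how many clusters?" question is to run the clustering for several values of K and keep the K with the largest global index.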