Important clustering methods used in microarray data analysis Steve Horvath Human Genetics and Biostatistics UCLA.



Contents: multidimensional scaling plots (related to principal component analysis); k-means clustering; hierarchical clustering

Introduction to clustering

MDS plot of clusters

2 references for clustering: T. Hastie, R. Tibshirani, J. Friedman (2002) The Elements of Statistical Learning. Springer Series; L. Kaufman, P. Rousseeuw (1990) Finding Groups in Data. Wiley Series in Probability

Introduction to clustering Cluster analysis aims to group or segment a collection of objects into subsets or "clusters", such that those within each cluster are more closely related to one another than objects assigned to different clusters. An object can be described by a set of measurements (e.g. covariates, features, attributes) or by its relation to other objects. Sometimes the goal is to arrange the clusters into a natural hierarchy, which involves successively grouping or merging the clusters themselves so that at each level of the hierarchy clusters within the same group are more similar to each other than those in different groups.

Proximity matrices are the input to most clustering algorithms Proximity between pairs of objects: similarity or dissimilarity. If the original data were collected as similarities, a monotone-decreasing function can be used to convert them to dissimilarities. Most algorithms use (symmetric) dissimilarities (e.g. distances), but the triangle inequality does *not* have to hold. Triangle inequality: $d_{ii'} \le d_{ik} + d_{i'k}$ for all $i, i', k$.
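As a minimal sketch of the two points on this slide, the Python below converts a similarity matrix to dissimilarities and looks for a triangle-inequality violation. The conversion d = 1 - s is only one possible monotone-decreasing function, and the function names and the toy matrix are assumptions made for illustration; they are not from the slides.

```python
from itertools import product

def similarity_to_dissimilarity(sim):
    """Convert similarities in [0, 1] to dissimilarities with d = 1 - s.

    Any monotone-decreasing function would do; 1 - s is an assumed choice.
    """
    return [[1.0 - s for s in row] for row in sim]

def find_triangle_violation(d, tol=1e-12):
    """Return the first (i, j, k) with d[i][j] > d[i][k] + d[j][k], else None."""
    n = len(d)
    for i, j, k in product(range(n), repeat=3):
        if d[i][j] > d[i][k] + d[j][k] + tol:
            return (i, j, k)
    return None

# Toy symmetric similarity matrix: object 1 is similar to both 0 and 2,
# while 0 and 2 are dissimilar to each other.
sim = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.9],
    [0.1, 0.9, 1.0],
]
dissim = similarity_to_dissimilarity(sim)
violation = find_triangle_violation(dissim)
print(violation)
```

The resulting matrix is a perfectly usable symmetric dissimilarity even though it is not a metric: the direct dissimilarity between objects 0 and 2 exceeds the detour through object 1.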

Different intergroup dissimilarities Let G and H represent 2 groups.
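The three intergroup dissimilarities compared later in the deck (single, complete and group average linkage) are usually defined as below. This is a sketch in the notation of the Hastie et al. reference, where $d_{ii'}$ is the dissimilarity between observations $i$ and $i'$; the symbols $N_G$ and $N_H$ for the number of observations in each group are assumed here rather than taken from the slide.

```latex
\begin{align*}
d_{SL}(G,H) &= \min_{i \in G,\; i' \in H} d_{ii'} && \text{(single linkage)} \\
d_{CL}(G,H) &= \max_{i \in G,\; i' \in H} d_{ii'} && \text{(complete linkage)} \\
d_{GA}(G,H) &= \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'} && \text{(group average)}
\end{align*}
```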

Agglomerative clustering, hierarchical clustering and dendrograms

Hierarchical clustering plot

Agglomerative clustering Agglomerative clustering algorithms begin with every observation representing a singleton cluster. At each of the N-1 steps, the closest 2 (least dissimilar) clusters are merged into a single cluster. Therefore a measure of dissimilarity between 2 clusters must be defined.
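The merging loop can be sketched in a few lines of plain Python. This is a naive illustration under assumed names (`agglomerative`, `linkage`), not an efficient implementation: it recomputes every intergroup dissimilarity at each step.

```python
def agglomerative(d, linkage=min):
    """Naive agglomerative clustering on a symmetric dissimilarity matrix d.

    `linkage` reduces the pairwise dissimilarities between two clusters to
    one number: min gives single linkage, max gives complete linkage.
    Returns the N-1 merges as (members_a, members_b, height) tuples.
    """
    clusters = [[i] for i in range(len(d))]  # every observation is a singleton
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest intergroup dissimilarity.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                h = linkage([d[i][j] for i in clusters[a] for j in clusters[b]])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        merges.append((clusters[a], clusters[b], h))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

# Four observations on a line; dissimilarity is the absolute difference.
x = [0.0, 1.0, 5.0, 6.0]
d = [[abs(p - q) for q in x] for p in x]

single_merges = agglomerative(d, linkage=min)
complete_merges = agglomerative(d, linkage=max)
for left, right, height in single_merges:
    print(left, right, height)
```

Both linkages make the same first two merges here; only the height of the final merge differs, because that is the only step where the clusters contain more than one observation each.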

Comparing different linkage methods If there is a strong clustering tendency, all 3 methods produce similar results. Single linkage has a tendency to combine observations linked by a series of close intermediate observations ("chaining"). Good for elongated clusters. Complete linkage: use for very compact clusters (like pearls on a string). Bad: it may lead to clusters where observations assigned to a cluster can be much closer to members of other clusters than they are to some members of their own cluster. Group average clustering represents a compromise between the extremes of single and complete linkage. Use for ball-shaped clusters.
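The chaining behaviour of single linkage can be seen on a small made-up example: six points on a line whose gaps grow slowly. The sketch below is illustrative only; the data and the helper names (`cluster`, `single`, `complete`, `average`) are assumptions, not part of the slides.

```python
def single(g, h):
    """Single linkage: smallest pairwise dissimilarity between groups."""
    return min(abs(x - y) for x in g for y in h)

def complete(g, h):
    """Complete linkage: largest pairwise dissimilarity between groups."""
    return max(abs(x - y) for x in g for y in h)

def average(g, h):
    """Group average: mean pairwise dissimilarity between groups."""
    return sum(abs(x - y) for x in g for y in h) / (len(g) * len(h))

def cluster(points, k, linkage):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Points on a line with slowly growing gaps: 10, 11, 12, 13, 14.
points = [0, 10, 21, 33, 46, 60]

for name, linkage in [("single", single), ("complete", complete), ("average", average)]:
    print(name, cluster(points, 2, linkage))
```

On this data single linkage keeps absorbing the next point along the chain, so its two-cluster solution isolates only the last point, whereas complete and average linkage both split the points into a group of four and a group of two.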

Dendrogram Recursive binary splitting/agglomeration can be represented by a rooted binary tree. The root node represents the entire data set. The N terminal nodes of the tree represent individual observations. Each nonterminal node ("parent") has two daughter nodes. Thus the binary tree can be plotted so that the height of each node is proportional to the value of the intergroup dissimilarity between its 2 daughters. A dendrogram provides a complete description of the hierarchical clustering in graphical format.
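A rough sketch of the tree described above: each node stores its height (the intergroup dissimilarity between its two daughters) and cutting the tree at a given height recovers a flat clustering. The `Node` class, `build_dendrogram` and `cut` are hypothetical names introduced for illustration, and terminal nodes are given height 0 by assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    height: float                    # intergroup dissimilarity of the two daughters
    members: list                    # indices of the observations below this node
    left: Optional["Node"] = None    # None for terminal nodes
    right: Optional["Node"] = None

def build_dendrogram(d, linkage=max):
    """Build a rooted binary tree by agglomeration (complete linkage by default)."""
    nodes = [Node(0.0, [i]) for i in range(len(d))]  # N terminal nodes
    while len(nodes) > 1:
        pairs = [(a, b) for a in range(len(nodes)) for b in range(a + 1, len(nodes))]

        def between(ab):
            # Intergroup dissimilarity between the two candidate daughters.
            return linkage(d[i][j]
                           for i in nodes[ab[0]].members
                           for j in nodes[ab[1]].members)

        a, b = min(pairs, key=between)
        parent = Node(between((a, b)),
                      nodes[a].members + nodes[b].members,
                      nodes[a], nodes[b])
        nodes = [n for i, n in enumerate(nodes) if i not in (a, b)] + [parent]
    return nodes[0]  # the root represents the entire data set

def cut(node, h):
    """Clusters obtained by cutting the tree at height h (h >= 0)."""
    if node.height <= h:
        return [sorted(node.members)]
    return cut(node.left, h) + cut(node.right, h)

x = [0, 1, 5, 6]
d = [[abs(p - q) for q in x] for p in x]
root = build_dendrogram(d)
print(root.height, cut(root, 2))
```

Cutting at a height between two merge levels gives the clustering that existed at that stage of the agglomeration, which is why the tree is a complete description of the procedure.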

Comments on dendrograms Caution: different hierarchical methods as well as small changes in the data can lead to different dendrograms. Hierarchical methods impose hierarchical structure whether or not such structure actually exists in the data. In general, dendrograms are a description of the results of the algorithm and not a graphical summary of the data. A dendrogram is a valid summary only to the extent that the pairwise *observation* dissimilarities obey the ultrametric inequality $d_{ii'} \le \max(d_{ik}, d_{i'k})$ for all $i, i', k$.
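Whether a given dissimilarity matrix satisfies the ultrametric inequality (each dissimilarity is at most the larger of the two dissimilarities to any third observation) can be checked by brute force. The sketch below uses an assumed function name and two toy matrices chosen for illustration.

```python
from itertools import product

def is_ultrametric(d, tol=1e-12):
    """True if d[i][j] <= max(d[i][k], d[j][k]) for every triple i, j, k."""
    n = len(d)
    return all(d[i][j] <= max(d[i][k], d[j][k]) + tol
               for i, j, k in product(range(n), repeat=3))

# Ordinary distances between points on a line are generally not ultrametric.
x = [0, 1, 5]
line = [[abs(p - q) for q in x] for p in x]

# A matrix of this form is what a dendrogram can represent exactly:
# observations 0 and 1 join at height 1, and both join 2 at height 4.
tree_like = [
    [0, 1, 4],
    [1, 0, 4],
    [4, 4, 0],
]

print(is_ultrametric(line), is_ultrametric(tree_like))
```

Most real data behave like the first matrix, which is the reason for the caution on this slide: the dendrogram replaces the original dissimilarities with tree-like ones.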

Figure 1 (panels: average linkage, complete linkage, single linkage)