Clustering approaches for high-throughput data
Sushmita Roy, BMI/CS 576, Nov 12th, 2013



Key concepts
– Hierarchical clustering
– Determining the number of clusters
– Ways to assess cluster quality

Hierarchical clustering
– In K-means and GMMs we need to specify the number of clusters
– Hierarchical clustering instead requires us to specify how much dissimilarity we will tolerate
– We maintain a matrix of distance (or similarity) scores for all pairs of
  – expression vectors
  – clusters (formed so far)
  – expression vectors and clusters

Hierarchical clustering
[Figure: dendrogram. Leaves represent the objects to be clustered (e.g. genes); the height of each bar indicates the degree of distance within the cluster; the axis is a distance scale starting at 0.]

Distance between two clusters
The distance between two clusters can be determined in several ways:
– single link: distance between the two most similar profiles, one from each cluster
– complete link: distance between the two least similar profiles, one from each cluster
– average link: average distance over all pairs of profiles, one from each cluster
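The three linkage definitions can be sketched directly as functions over two clusters of profiles. This is a minimal illustration, assuming Euclidean distance between profiles; the function names and the toy clusters are hypothetical, not taken from the slides.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def single_link(a, b, d=dist):
    """Distance between the two most similar profiles, one from each cluster."""
    return min(d(x, y) for x, y in product(a, b))


def complete_link(a, b, d=dist):
    """Distance between the two least similar profiles, one from each cluster."""
    return max(d(x, y) for x, y in product(a, b))


def average_link(a, b, d=dist):
    """Average distance over all cross-cluster pairs of profiles."""
    return sum(d(x, y) for x, y in product(a, b)) / (len(a) * len(b))


# Toy clusters of 2-dimensional profiles (made-up values).
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

for name, fn in [("single", single_link),
                 ("complete", complete_link),
                 ("average", average_link)]:
    print(name, fn(cluster_a, cluster_b))
```

For any pair of clusters the single-link distance is at most the average-link distance, which is at most the complete-link distance.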

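Updating distances efficiently
If we just merged $c_u$ and $c_v$ into $c_j$, we can determine the distance to each other cluster $c_k$ as follows:
– single link: $\mathrm{dist}(c_j, c_k) = \min\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– complete link: $\mathrm{dist}(c_j, c_k) = \max\{\mathrm{dist}(c_u, c_k),\ \mathrm{dist}(c_v, c_k)\}$
– average link: $\mathrm{dist}(c_j, c_k) = \dfrac{|c_u|\,\mathrm{dist}(c_u, c_k) + |c_v|\,\mathrm{dist}(c_v, c_k)}{|c_u| + |c_v|}$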
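A minimal sketch of the standard constant-time update rules for the three linkages. It assumes the two clusters just merged are called u and v and the other cluster is k; these names and the function `merged_distance` are hypothetical labels chosen for illustration.

```python
def merged_distance(d_uk, d_vk, size_u, size_v, linkage):
    """Distance from the cluster formed by merging u and v to another cluster k.

    d_uk, d_vk     : distances from u and from v to k before the merge
    size_u, size_v : number of profiles in u and in v
    """
    if linkage == "single":
        return min(d_uk, d_vk)
    if linkage == "complete":
        return max(d_uk, d_vk)
    if linkage == "average":
        # Size-weighted mean of the two old average distances.
        return (size_u * d_uk + size_v * d_vk) / (size_u + size_v)
    raise ValueError(f"unknown linkage: {linkage!r}")


for linkage in ("single", "complete", "average"):
    print(linkage, merged_distance(2.0, 5.0, size_u=1, size_v=3, linkage=linkage))
```

Because each update uses only the two old distances and the cluster sizes, no pairwise profile distances need to be recomputed after a merge.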

Effect of different linkage methods
[Figure: dendrograms in three panels: complete linkage, average linkage, single linkage.]

Flat clustering from a hierarchical clustering
We can always generate a flat clustering from a hierarchical clustering by “cutting” the tree at some distance threshold.
[Figure: dendrogram with two cut lines; cutting at the higher threshold results in 2 clusters, cutting at the lower threshold results in 4 clusters.]
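Cutting the tree can be sketched as replaying the merges whose height does not exceed the threshold. The merge-list encoding below (each merge is `(u, v, height)`, and the i-th merge creates node `n_leaves + i`) is an assumption made for this example, as are the toy values; it also assumes merge heights never decrease, which holds for single, complete, and average linkage.

```python
def cut_tree(n_leaves, merges, threshold):
    """Return the flat clustering obtained by cutting at `threshold`.

    merges: list of (u, v, height) in the order the merges happened;
            the i-th merge creates a new node with id n_leaves + i.
    """
    members = {i: [i] for i in range(n_leaves)}
    for step, (u, v, height) in enumerate(merges):
        if height > threshold:
            break  # every later merge is at least this high
        members[n_leaves + step] = members.pop(u) + members.pop(v)
    return sorted(sorted(m) for m in members.values())


# Hypothetical tree over 5 leaves.
merges = [(0, 1, 1.0), (2, 3, 1.0), (5, 6, 4.0), (4, 7, 14.0)]
print(cut_tree(5, merges, threshold=2.0))
print(cut_tree(5, merges, threshold=5.0))
```

Raising the threshold can only merge clusters, never split them, so the flat clusterings obtained at different thresholds are nested.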

Computational complexity
The naïve implementation of hierarchical clustering has $O(n^3)$ time complexity, where n is the number of objects
– computing the initial distance matrix takes $O(n^2)$ time
– there are $O(n)$ merging steps
– on each step, we have to update the distance matrix and select the next pair of clusters to merge, which takes $O(n^2)$ time
K-means and EM have $O(kn)$ time complexity for each iteration
– reassignment step: compute k × n distances
– recomputation step: loop through n profiles updating k means
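A naïve agglomerative loop makes the cubic cost visible: n - 1 merging steps, each scanning all remaining pairs. This is a sketch under stated assumptions (Euclidean distance, single or complete linkage only, a hypothetical `agglomerate` function and made-up data), not an efficient or definitive implementation.

```python
from itertools import combinations
from math import dist


def agglomerate(points, linkage=min):
    """Naive agglomerative clustering.

    linkage=min gives single link, linkage=max gives complete link.
    Returns a list of merges (u, v, height); the i-th merge creates
    cluster id len(points) + i.
    """
    clusters = {i: [i] for i in range(len(points))}
    # Initial distance matrix: one entry per pair of objects.
    d = {frozenset((i, j)): dist(points[i], points[j])
         for i, j in combinations(clusters, 2)}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:  # n - 1 merging steps
        # Scan every remaining pair of clusters for the closest one.
        u, v = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((u, v))]
        # Update distances from the new cluster to every other cluster.
        for k in clusters:
            if k not in (u, v):
                d[frozenset((next_id, k))] = linkage(d[frozenset((u, k))],
                                                     d[frozenset((v, k))])
        clusters[next_id] = clusters.pop(u) + clusters.pop(v)
        merges.append((u, v, height))
        next_id += 1
    return merges


# Five 1-dimensional profiles (made-up values).
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
for merge in agglomerate(points):
    print(merge)
```

The pair scan inside the loop dominates the running time; faster implementations avoid it by keeping the candidate pairs in a priority queue or by tracking nearest neighbours.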

Choosing the number of clusters
– Picking k to optimize the clustering objective alone will result in k = N (the number of data points), because the objective keeps improving as clusters are added
– Pick k based on a penalized clustering objective
– Pick k based on cross-validation

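Picking k based on cross-validation
[Figure: cross-validation workflow.]
– Split the data set into 3 sets
– Cluster on the training set
– Evaluate on the test set: compute the objective based on the test data
– Average the objective over the splits
– Run the method on all data once k has been determined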
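The cross-validation workflow can be sketched as follows. Everything here is an assumption made for illustration: a small pure-Python K-means stands in for the clustering method, the objective is the sum of squared distances from held-out points to their nearest centroid, and the data are made up. With this particular objective the held-out score may keep decreasing as k grows, so in practice a likelihood-based objective (for example from a GMM) is a more natural fit; the code only shows the mechanics.

```python
import random
from math import dist


def kmeans(points, k, rng, iters=20):
    """Plain K-means; returns the k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:  # reassignment step
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recomputation step; an empty cluster keeps its old centroid.
        centroids = [tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids


def sse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


def cv_objective(points, ks, n_folds=3, seed=0):
    """Average held-out objective for each candidate k."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]
    scores = {}
    for k in ks:
        total = 0.0
        for i in range(n_folds):
            train = [p for j, f in enumerate(folds) if j != i for p in f]
            total += sse(folds[i], kmeans(train, k, rng))
        scores[k] = total / n_folds
    return scores


# Two well-separated groups of 12 profiles each (made-up values).
data = ([(0.1 * i, 0.0) for i in range(12)]
        + [(10.0 + 0.1 * i, 10.0) for i in range(12)])
scores = cv_objective(data, ks=[1, 2, 3])
print(scores)
```

Once a value of k has been chosen from these scores, the method is run one final time on all of the data with that k.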

Evaluation of clusters
– Internal validation: How well does the clustering optimize intra-cluster similarity and inter-cluster dissimilarity?
– External validation: Do genes in the same cluster have similar function?

Internal validation
One measure for assessing cluster quality is the Silhouette index (SI):
$SI = \dfrac{1}{K} \sum_{j=1}^{K} \dfrac{1}{|C_j|} \sum_{x_i \in C_j} \dfrac{b(x_i) - a(x_i)}{\max\{a(x_i),\ b(x_i)\}}$
The more positive the SI, the better the clusters.
– K: number of clusters
– $C_j$: set representing the j-th cluster
– $b(x_i)$: average distance of $x_i$ to instances in the next closest cluster
– $a(x_i)$: average distance of $x_i$ to other instances in the same cluster
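A minimal sketch of the Silhouette index in its usual form: for each instance, (b - a) / max(a, b), averaged within each cluster and then across clusters. It assumes Euclidean distance, at least two clusters, and the common convention that an instance in a singleton cluster scores 0; the function name and the toy clusterings are hypothetical.

```python
from math import dist


def silhouette_index(clusters):
    """clusters: list of clusters, each a list of points (tuples)."""
    def avg_dist(p, others):
        return sum(dist(p, q) for q in others) / len(others)

    total = 0.0
    for j, cj in enumerate(clusters):
        s_sum = 0.0
        for i, x in enumerate(cj):
            if len(cj) == 1:
                continue  # singleton cluster: score 0 by convention
            # a: average distance to the other instances in the same cluster
            a = avg_dist(x, cj[:i] + cj[i + 1:])
            # b: average distance to the instances of the next closest cluster
            b = min(avg_dist(x, ck) for k, ck in enumerate(clusters) if k != j)
            s_sum += (b - a) / max(a, b)
        total += s_sum / len(cj)
    return total / len(clusters)


# The same four points grouped sensibly and grouped badly (made-up values).
good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 1.0), (10.0, 1.0)]]
print(silhouette_index(good), silhouette_index(bad))
```

The sensible grouping scores close to 1, while the bad grouping scores below 0 because each point is on average closer to the other cluster than to its own.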

External validation
Are genes in the same cluster associated with similar function?
– Gene Ontology (GO) is a controlled vocabulary of terms used to annotate genes of a particular category
– One can use GO terms to study whether the genes in a cluster are associated with the same GO term more than expected by chance
– One can also see if genes in a cluster are associated with similar transcription factor binding sites
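One common way to quantify "more than expected by chance" is a hypergeometric tail probability; the slides do not name a specific test, so this choice is an assumption. The sketch below computes the probability of seeing at least the observed number of annotated genes in a cluster drawn at random from the genome. The function name and all counts are hypothetical.

```python
from math import comb


def enrichment_pvalue(n_genome, n_annotated, n_cluster, n_overlap):
    """P(X >= n_overlap) when n_cluster genes are drawn without replacement
    from n_genome genes, of which n_annotated carry the GO term."""
    upper = min(n_annotated, n_cluster)
    tail = sum(comb(n_annotated, x) * comb(n_genome - n_annotated, n_cluster - x)
               for x in range(n_overlap, upper + 1))
    return tail / comb(n_genome, n_cluster)


# Hypothetical counts: 6000 genes, 100 annotated with the term,
# a cluster of 50 genes, 10 of which carry the annotation.
print(enrichment_pvalue(6000, 100, 50, 10))
```

When many GO terms are tested against many clusters, the resulting p-values would normally need a multiple-testing correction before being interpreted.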

The Gene Ontology
A controlled vocabulary of more than 30K concepts describing molecular functions, biological processes, and cellular components

Using Gene Ontology to assess the quality of a cluster
[Figure labels: Genes, Conditions, GO terms, transcription factor binding sites for HAP4 and MSN2/4.]

Summary of clustering
– Many different methods to cluster
  – Flat clustering
  – Hierarchical clustering
  – Distance metrics among objects can influence clustering results a lot
– Picking the number of clusters is difficult, but there are some ways to do this
– Evaluation of clusters is sometimes hard
  – Comparison with other sources of information can help assess cluster quality