Clustering (Gene Expression Data). 6.095/6.895 - Computational Biology: Genomes, Networks, Evolution. Lecture, October 4, 2005.



Challenges in Computational Biology
[Figure: course roadmap relating DNA and RNA transcripts to the course topics: genome assembly, gene finding, regulatory motif discovery, database lookup, gene expression analysis, sequence alignment, evolutionary theory, cluster discovery, Gibbs sampling, protein network analysis, emerging network properties, regulatory network inference, comparative genomics, RNA folding]

Plan
–Gene expression data / DNA microarrays
–Feature selection and clustering

DNA Microarrays
To measure levels of messages in a cell:
–Construct an array with DNA sequences for multiple genes
–Hybridize each RNA in your sample to a sequence in your array (all sequences from the same gene hybridize to the same spot)
–Measure the number of hybridizations for each spot
[Figure: RNA from the sample is reverse-transcribed (RT) to cDNA, hybridized to the array spots for Gene 1 to Gene 6, and measured]

Result
–6000 genes in one shot
–Entire transcriptome observable in one experiment
–Can perform multiple experiments under varying conditions: temperature, time, sugar level, other chemicals, gene knock-outs, …

Noise
Sources of noise:
–Cross-hybridization
–Non-uniform hybridization kinetics
–Non-linearity of array response to concentration
–Non-linear amplification
–Improper probe sequence
–Difference in materials/procedures

Expression Value Normalization
Noise model: $y_{ij} = n_i \, \alpha_{ij} \, (c_j t_{ij}) + \varepsilon_{ij}$
–$y_{ij}$: observed level for gene j on chip i
–$t_{ij}$: true level
–$c_j$: gene constant
–$n_i$: multiplicative chip normalization
–$\alpha_{ij}$, $\varepsilon_{ij}$: multiplicative and additive noise terms
Estimating the parameters:
–$n_i$: spiked-in control probes, not present in the genome studied
–$c_j$: control experiments of known concentrations for gene j
–$\varepsilon_{ij}$: un-spiked control probes should be zero
–$\alpha_{ij}$: spiked controls that are constant across chips
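As a rough sketch of how the noise model could be inverted once its parameters have been estimated, the Python snippet below recovers an approximate true level from an observed one. The function name `estimate_true_level`, the example numbers, and the simplification of treating the noise terms as known constants (defaulting to 1 and 0) are assumptions made for illustration only; they are not from the lecture.

```python
def estimate_true_level(y, n_i, c_j, alpha=1.0, eps=0.0):
    """Solve y = n_i * alpha * (c_j * t) + eps for t.

    alpha and eps default to their noise-free values (1 and 0);
    these defaults are assumptions made for this sketch.
    """
    return (y - eps) / (n_i * alpha * c_j)


# Hypothetical numbers: chip factor 2, gene constant 3, observed level 12.
t_hat = estimate_true_level(12.0, n_i=2.0, c_j=3.0)
```

In practice the noise terms are random, so this only gives a point estimate of the true level.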

Gene expression data
For each gene j we have a vector $t_j = (t_{1j}, t_{2j}, \ldots, t_{dj})$.
Now what? I.e., what can we do with this data?

Supervised vs. unsupervised “learning”
Make the parallel with modeling biological sequences that we saw last week.
What do we do when we don’t have any models?
–We can look for patterns, i.e. similarities between the different genes
–We can look for recurring themes
→ Clustering

The problem
Group genes into co-regulated sets:
–Observe cells under different environmental changes
–Find genes whose expression profiles are affected in a similar way
–These genes are potentially co-regulated, i.e. regulated by the same transcription factor
Clustering!

Clustering expression levels
Clustering process:
1. How to tell if two expression profiles are similar?
–Define the (dis)-similarity measure between two profiles
2. How to group multiple profiles into meaningful subsets?
–Describe the clustering procedure
3. Are the results meaningful?
–Evaluate statistical significance of a clustering
And don’t forget about:
–De-noising
–Choice of experiments/features

(Dis)-similarity measures
Distance metrics (between vectors x and y):
–“Manhattan” distance: $MD(x,y) = \sum_i |x_i - y_i|$
–Euclidean distance: $ED(x,y) = \left[ \sum_i (x_i - y_i)^2 \right]^{1/2}$
–SSE: $SSE(x,y) = \sum_i (x_i - y_i)^2$
Correlation: $C(x,y) = \sum_i x_i y_i$ (possibly take absolute value)
Data pre-processing: instead of clustering on direct observation of expression values…
–… can cluster based on differential expression from the mean, e.g., $\sum_i |(x_i - \mathrm{avg}(x)) - (y_i - \mathrm{avg}(y))|$
–… or differential expression normalized by standard deviation, e.g., $\sum_i |(x_i - \mathrm{avg}(x))/\mathrm{stdev}(x) - (y_i - \mathrm{avg}(y))/\mathrm{stdev}(y)|$
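The measures above translate almost directly into code. The following Python sketch uses plain lists as expression profiles. The function names are chosen here for illustration, and the use of the population standard deviation in `standardize` is an assumption, since the slide does not say which variant is meant.

```python
import math


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def sse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(sse(x, y))


def correlation(x, y):
    # Dot product, as written on the slide (no centering or scaling).
    return sum(a * b for a, b in zip(x, y))


def center(x):
    # Differential expression from the mean.
    m = sum(x) / len(x)
    return [a - m for a in x]


def standardize(x):
    # Centered profile divided by its (population) standard deviation.
    c = center(x)
    sd = math.sqrt(sum(a * a for a in c) / len(x))
    return [a / sd for a in c]


# Two hypothetical profiles with the same shape but different baselines.
x = [1.0, 2.0, 3.0]
y = [11.0, 12.0, 13.0]
raw = manhattan(x, y)                         # large: baselines differ
centered = manhattan(center(x), center(y))    # zero: same shape
```

Centering removes the baseline difference, so two genes that respond identically but at different absolute levels end up at distance zero.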

Clustering Algorithms
–Hierarchical: merge data successively to construct a tree
–Non-hierarchical: place k means (cluster centers) to best explain the data
[Figure: points a to h clustered two ways: as a tree, and around three centers c1, c2, c3]

Hierarchical clustering
Bottom-up algorithm:
–Initialization: each point in a separate cluster
–At each step: choose the pair of closest clusters and merge them
The exact behavior of the algorithm depends on how we define the distance CD(X,Y) between clusters X and Y.
Avoids the problem of specifying the number of clusters.
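The bottom-up procedure can be sketched as follows in Python. This is a deliberately naive version that recomputes all cluster distances at every step. The names `agglomerate` and `single_link`, and the choice of single-link as the default cluster distance, are assumptions made for this sketch.

```python
def single_link(X, Y, dist):
    # Distance between the closest pair of points, one from each cluster.
    return min(dist(x, y) for x in X for y in Y)


def agglomerate(points, dist, cluster_dist=single_link):
    """Return the list of merges as (cluster_a, cluster_b, distance)."""
    clusters = [[p] for p in points]  # each point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # Choose the pair of closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Merge them and keep every other cluster unchanged.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical one-dimensional data.
merges = agglomerate([0.0, 1.0, 5.0, 6.0, 20.0], lambda a, b: abs(a - b))
```

The sequence of merges defines the tree; cutting it at any level gives a clustering without fixing the number of clusters in advance.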

Distance between clusters
–Single-link method: $CD(X,Y) = \min_{x \in X,\, y \in Y} D(x,y)$
–Complete-link method: $CD(X,Y) = \max_{x \in X,\, y \in Y} D(x,y)$
–Average-link method: $CD(X,Y) = \mathrm{avg}_{x \in X,\, y \in Y} D(x,y)$
–Centroid method: $CD(X,Y) = D(\mathrm{avg}(X), \mathrm{avg}(Y))$
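The four cluster distances differ only in how they aggregate point-to-point distances, which a small Python sketch makes concrete. Points are tuples of coordinates, `math.dist` (Euclidean distance, Python 3.8+) stands in for D, and the function names and example clusters are illustrative assumptions.

```python
import math


def centroid(X):
    # Coordinate-wise mean of the points in X.
    return tuple(sum(col) / len(X) for col in zip(*X))


def single_link(X, Y, D=math.dist):
    return min(D(x, y) for x in X for y in Y)


def complete_link(X, Y, D=math.dist):
    return max(D(x, y) for x in X for y in Y)


def average_link(X, Y, D=math.dist):
    return sum(D(x, y) for x in X for y in Y) / (len(X) * len(Y))


def centroid_link(X, Y, D=math.dist):
    return D(centroid(X), centroid(Y))


# Two hypothetical clusters: vertical pairs of points, 3 units apart.
X = [(0.0, 0.0), (0.0, 2.0)]
Y = [(3.0, 0.0), (3.0, 2.0)]
distances = {
    "single": single_link(X, Y),
    "complete": complete_link(X, Y),
    "average": average_link(X, Y),
    "centroid": centroid_link(X, Y),
}
```

For any pair of clusters the average-link value lies between the single-link and complete-link values, since it averages the same set of distances that the other two take the minimum and maximum of.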

Example I

Example II

K-means algorithm
Each cluster $X_i$ has a center $c_i$.
Define the clustering cost criterion: $COST(X_1, \ldots, X_k) = \sum_{X_i} \sum_{x \in X_i} SSE(x, c_i)$, where $SSE(x,y) = \sum_i (x_i - y_i)^2$.
The algorithm tries to find clusters $X_1, \ldots, X_k$ and centers $c_1, \ldots, c_k$ that minimize COST.
K-means algorithm:
–Initialize centers “somehow”
–Repeat:
  –Compute best clusters for given centers → attach each point to the closest center
  –Compute best centers for given clusters (how?) → choose the centroid of points in cluster
–Until the COST is “small”
[Figure: points a to h grouped around three centers c1, c2, c3]
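A minimal Python sketch of the two alternating steps is given below. Two choices are assumptions of this sketch rather than part of the slide: centers are initialized by sampling k distinct data points, and the loop stops when assignments no longer change (or after `max_iter` rounds) instead of testing whether COST is “small”.

```python
import random


def sse(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def centroid(points):
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize "somehow": k data points
    assignment = None
    for _ in range(max_iter):
        # Best clusters for given centers: attach each point to the closest.
        new_assignment = [
            min(range(k), key=lambda j: sse(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the cost cannot decrease further
        assignment = new_assignment
        # Best centers for given clusters: the centroid of each cluster.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = centroid(members)
    cost = sum(sse(p, centers[a]) for p, a in zip(points, assignment))
    return centers, assignment, cost


# Hypothetical data: two well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, assignment, cost = kmeans(points, k=2)
```

Each step can only lower COST or leave it unchanged, so the loop terminates, but the result is a local optimum that depends on the initialization.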

Choosing optimal center
Consider a cluster X and a center c (not necessarily a centroid).
Want to minimize $\sum_{x \in X} SSE(x,c) = \sum_{x \in X} \sum_i (x_i - c_i)^2 = \sum_i \sum_{x \in X} (x_i - c_i)^2$
Can optimize each $c_i$ separately:
$\sum_{x \in X} (x_i - c_i)^2 = \sum_{x \in X} (x_i^2 - 2 x_i c_i + c_i^2) = \sum_{x \in X} x_i^2 - c_i \sum_{x \in X} 2 x_i + |X| c_i^2$
Optimum: $c_i = \sum_{x \in X} x_i / |X|$
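The step from the expanded sum to the optimum is a one-line calculus argument, written out here for completeness: the expression is a quadratic in $c_i$, so setting its derivative to zero gives the minimizer.

```latex
\frac{d}{dc_i}\left[\sum_{x \in X} x_i^2 \;-\; 2 c_i \sum_{x \in X} x_i \;+\; |X|\, c_i^2\right]
  = -2 \sum_{x \in X} x_i + 2\,|X|\, c_i = 0
\quad\Longrightarrow\quad
c_i = \frac{1}{|X|} \sum_{x \in X} x_i
```

The second derivative is $2|X| > 0$, so this stationary point is a minimum: the best center for a fixed cluster is its centroid.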

Links
–http:// /tutorial_html/AppletKM.html
–http:// /tutorial_html/AppletH.html

Relationship between k-means and EM: both optimize two sets of variables at the same time by alternating. Knowing one, compute the other; make the parallel.

Clustering Algorithms: Running time
–Hierarchical: merge data successively to construct a tree
–Non-hierarchical: place k means (cluster centers) to best explain the data

Running time: Hierarchical methods
Repeat:
–Choose the pair of closest clusters
–Merge
Number of iterations:
–Exactly n-1
Iteration cost:
–At most $n^2$ computations of CD( , )
–How many point-point distance computations? At most $n^2$ as well!
Total running time: $O(n^3)$

What about the running time for k-means?

Improvements
–Single-link = Minimum Spanning Tree
–$O(n^2)$ time
…
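One way to see the equivalence: the merge distances of single-link clustering are exactly the edge weights of a minimum spanning tree of the points, so cutting the k-1 longest tree edges leaves k single-link clusters. The Python sketch below uses the array-based version of Prim's algorithm, which runs in O(n^2) time; the choice of Prim's algorithm and all function names are assumptions of this sketch, as the slide only states the result.

```python
def prim_mst(points, dist):
    """Return MST edges as (weight, u, v) using O(n^2) array-based Prim."""
    n = len(points)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known edge from the tree to i
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def single_link_clusters(points, dist, k):
    """Label points by cutting the k-1 longest MST edges."""
    edges = sorted(prim_mst(points, dist))
    label = list(range(len(points)))  # union-find parent array

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]  # path halving
            i = label[i]
        return i

    for _, u, v in edges[: len(points) - k]:  # keep the n-k shortest edges
        label[find(u)] = find(v)
    return [find(i) for i in range(len(points))]


# Hypothetical one-dimensional data.
data = [0.0, 1.0, 5.0, 6.0, 20.0]
weights = sorted(w for w, _, _ in prim_mst(data, lambda a, b: abs(a - b)))
labels = single_link_clusters(data, lambda a, b: abs(a - b), k=2)
```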

What have we learned?
Gene expression data
–Microarray technology
–De-noising
Two methods for clustering
–Hierarchical clustering: non-parametric, general, bottom-up
–K-means clustering: ‘model’-based
–Relationship with HMMs, alignment
Distance metrics
What’s next?
–Evaluate clustering results
–Visualizing clustering output

Evaluating clustering output
Computing statistical significance of clusters:
–N experiments, p labeled +, (N-p) labeled –
–Cluster: k elements, m positive
–P-value of a single cluster containing k elements out of which m carry the same label: the probability that a randomly chosen set of k experiments would result in m positive and k-m negative
–P-value of uniformity in computed cluster
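The formula on the slide did not survive in the transcript. The description of drawing k experiments at random from N, of which p are positive, matches the hypergeometric distribution, so the Python sketch below uses that form. Treating the P-value as the upper tail (m or more positives) is an assumption of this sketch, and the function names are illustrative.

```python
from math import comb


def prob_exactly(N, p, k, m):
    """Probability that k experiments drawn at random contain exactly m
    positives, given p positives among N experiments (hypergeometric)."""
    return comb(p, m) * comb(N - p, k - m) / comb(N, k)


def p_value_at_least(N, p, k, m):
    """Probability of seeing m or more positives in a random set of k."""
    return sum(prob_exactly(N, p, k, r) for r in range(m, min(k, p) + 1))


# Hypothetical numbers: 10 experiments, 5 positive, a cluster of 3,
# all of which are positive.
p_uniform = p_value_at_least(N=10, p=5, k=3, m=3)
```

A small value indicates that a cluster this uniform would rarely arise from a random grouping of the experiments.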

Visualizing clustering output

Rearranging tree branches
Optimizing the one-dimensional ordering of tree leaves
[Figure: two leaf orderings, abghcdef and abdefghc]

Ziv Bar-Joseph published a linear-time DP algorithm (from what I remember) to calculate branch re-ordering. It’d be fun to show it in lecture, if you have time.