CSCE555 Bioinformatics Lecture 14 Cluster analysis for microarray data


CSCE555 Bioinformatics Lecture 14 Cluster analysis for microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page: http://www.scigen.org/csce555 University of South Carolina Department of Computer Science and Engineering 2008 www.cse.sc.edu.

Outline: Why microarray data analysis; Understanding microarray data; Clustering in microarray analysis; Project proposals and groups

Functional Genomics – Why? You can’t learn everything from DNA sequence. Levels of organization contain different information. Want to know how a whole cell works. May not know a priori which pieces of information are going to be the most valuable to solve a particular problem. Need to know both analytical tools and biology. What are the great questions? What techniques are available to answer them? What computational/statistical techniques can get you what you want? Start with the whole genome, annotated so you know where most of the genes are. Make probes to 3’ ends of genes generally.

Microarrays: Extract RNA and incorporate label; combine; hybridize; scanned image.

Types of Microarrays: For measuring mRNA: Affymetrix, Nimblegen, Illumina (short oligos made on the slide); oligonucleotides (short DNA – typically 70-mers). For identifying where proteins bind DNA and non-coding or other RNAs: tiling arrays. For identifying SNPs: http://www.illumina.com/ There are even protein arrays!

Why microarrays? You can get a great deal of data: multiplexed, uses very little reagent, can use standard detection devices. Similar techniques are used for working with RNA and DNA, with similar attachment chemistries. Can carry out enzyme reactions on the arrays for sequencing and other analyses.

Microarray experiments have multiple steps, and each step introduces some variability: cell harvest, slide manufacture, RNA preparation, slide post-processing, labeling reaction, pre-hybridization, hybridization, washing, scanning, data analysis.

What questions are you asking? First of all, you are measuring mRNA abundance, so questions are about gene expression at the mRNA level. What are the levels of all mRNAs? How do they change? Are there interesting changes or groups that provide insight into some process or state?

Design and analysis: Experimental design is critical. Analysis tools: hierarchical clustering; gene lists based on expression; time course analysis; statistics; Gene Ontology; pathway analysis; classifiers (to identify biomarkers).

Aim of clustering: Group objects according to their similarity. Cluster: a set of objects that are similar to each other and separated from the other objects. Example: green/red data points were generated from two different normal distributions.

Clustering microarray data: Genes and experiments/samples are given as the row and column vectors of a gene expression data matrix (p genes as rows, n experiments as columns). Clustering may be applied either to genes (regarded as vectors in $\mathbb{R}^n$) or to experiments (regarded as vectors in $\mathbb{R}^p$).
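As a small illustration of the two views of the data matrix, the following sketch builds a toy matrix and extracts gene vectors (rows) and experiment vectors (columns). The values and the helper names `gene_vectors` and `sample_vectors` are made up for this example.

```python
# Toy gene expression matrix: p = 3 genes (rows), n = 4 experiments (columns).
# The numbers are invented purely for illustration.
expr = [
    [1.0, 1.0, 1.5, 1.5],  # gene 1
    [2.5, 2.5, 3.5, 3.5],  # gene 2
    [1.5, 1.5, 1.0, 1.0],  # gene 3
]


def gene_vectors(matrix):
    """Return one vector per gene (a row), each of length n."""
    return [list(row) for row in matrix]


def sample_vectors(matrix):
    """Return one vector per experiment (a column), each of length p."""
    return [list(col) for col in zip(*matrix)]
```

Clustering genes means grouping the output of `gene_vectors`; clustering experiments means grouping the output of `sample_vectors`.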

Why cluster genes? Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data). Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data). Arrange a set of genes in a linear order that is at least not totally meaningless.

Why cluster experiments/samples? Quality control: detect experimental artifacts/bad hybridizations. Check whether samples are grouped according to known categories (though this might be better addressed using a supervised approach: statistical tests, classification). Identify new classes of biological samples (e.g. tumor subtypes).

Alizadeh et al., Nature 403:503-11, 2000

Cluster analysis Generally, cluster analysis is based on two ingredients: Distance measure: Quantification of (dis-)similarity of objects. Cluster algorithm: A procedure to group objects. Aim: small within-cluster distances, large between-cluster distances.

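Some distance measures: Given vectors $x = (x_1, \dots, x_n)$, $y = (y_1, \dots, y_n)$. Euclidean distance: $d_E(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. Manhattan distance: $d_M(x, y) = \sum_{i=1}^{n} |x_i - y_i|$. Correlation distance: $d_c(x, y) = 1 - r(x, y)$, where $r(x, y)$ is the Pearson correlation coefficient of $x$ and $y$.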

Which distance measure to use? The choice of distance measure should be based on the application area. What sort of similarities would you like to detect? Correlation distance $d_c$ measures trends/relative differences: $d_c(x, y) = d_c(ax + b, y)$ if $a > 0$. Example: $x = (1, 1, 1.5, 1.5)$, $y = (2.5, 2.5, 3.5, 3.5) = 2x + 0.5$, $z = (1.5, 1.5, 1, 1)$. Then $d_c(x, y) = 0$ and $d_c(x, z) = 2$, while the Euclidean distances are $d_E(x, z) = 1$ and $d_E(x, y) \approx 3.54$.
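The numbers in this example can be checked with a few lines of Python. This is a minimal sketch that assumes the correlation distance is one minus the Pearson correlation, which agrees with the value 2 quoted for the anticorrelated pair; the function names are chosen here for illustration.

```python
import math


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation (ranges from 0 to 2)."""
    return 1.0 - pearson(x, y)


# Vectors from the slide.
x = [1, 1, 1.5, 1.5]
y = [2.5, 2.5, 3.5, 3.5]  # y = 2x + 0.5
z = [1.5, 1.5, 1, 1]
```

Here `y` has the same trend as `x` and is far away in absolute terms, while `z` is close to `x` in absolute terms and has the opposite trend.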

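Which distance measure to use? Euclidean and Manhattan distance both measure absolute differences between vectors. Manhattan distance is more robust against outliers. May apply standardization to the observations, i.e. subtract the mean and divide by the standard deviation: $x'_i = (x_i - \bar{x}) / \sigma_x$. After standardization, Euclidean and correlation distance are equivalent: $d_E(x', y')^2 = 2n \, d_c(x, y)$, taking the correlation distance as $d_c = 1 - r$ and computing $\sigma_x$ with the $1/n$ convention (with the $1/(n-1)$ convention the factor becomes $2(n-1)$).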
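The equivalence after standardization can be verified numerically. The sketch below assumes the standard deviation uses the 1/n convention and that the correlation distance is one minus the Pearson correlation; under those assumptions the squared Euclidean distance of the standardized vectors equals 2n times the correlation distance. The test vectors are arbitrary.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the standard deviation (1/n convention)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((a - mean) ** 2 for a in x) / n)
    return [(a - mean) / sd for a in x]


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def correlation_distance(x, y):
    """Assumed definition: 1 - Pearson correlation, computed from raw vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


# Arbitrary example vectors (not from the slides).
u = [2.0, 4.0, 1.0, 7.0, 3.0]
v = [1.0, 0.5, 3.0, 2.0, 6.0]

lhs = squared_euclidean(standardize(u), standardize(v))
rhs = 2 * len(u) * correlation_distance(u, v)
```

The two quantities `lhs` and `rhs` agree up to floating-point rounding.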

K-means clustering: Input: N objects given as data points in $\mathbb{R}^p$. Specify the number k of clusters. Initialize k cluster centers. Iterate until convergence: - Assign each object to the cluster with the closest center (with respect to Euclidean distance). - The centroids/mean vectors of the obtained clusters are taken as new cluster centers. K-means can be seen as an optimization problem: minimize the sum of squared within-cluster distances, $W(C) = \sum_{j=1}^{k} \sum_{i:\, C(i) = j} \lVert x_i - \bar{x}_j \rVert^2$, where $C(i)$ is the cluster of object $i$ and $\bar{x}_j$ is the mean vector of cluster $j$. Results depend on the initialization. Use several starting points and choose the “best” solution (with minimal W(C)).
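The iteration and the restart strategy can be sketched as follows. This is a minimal illustration, not a tuned implementation: initial centers are drawn from the data points (one common choice, assumed here), and the names `kmeans_once` and `kmeans` and the parameters `n_starts` and `seed` are invented for this example.

```python
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def mean_vector(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (assignment, centers, W)."""
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: closest center in Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no object changed cluster
        assignment = new_assignment
        # Update step: centroids become the new centers.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_vector(members)
    w = sum(sq_dist(p, centers[a]) for p, a in zip(points, assignment))
    return assignment, centers, w


def kmeans(points, k, n_starts=10, seed=0):
    """Run several random starts and keep the solution with minimal W(C)."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_starts)]
    return min(runs, key=lambda run: run[2])
```

For example, `kmeans([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], 2)` separates the two obvious groups.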

K-means/PAM: How to choose K (the number of clusters)? There is no easy answer. Many heuristic approaches try to compare the quality of clustering results for different values of K (for an overview see Dudoit/Fridlyand 2002). The problem can be better addressed in model-based clustering, where each cluster represents a probability distribution, and a likelihood-based framework can be used.

Self-organizing maps (a neural network method): K = r*s clusters are arranged as nodes of a two-dimensional grid. Nodes represent cluster centers/prototype vectors. This makes it possible to represent similarity between clusters. Algorithm: Initialize nodes at random positions. Iterate: - Randomly pick one data point (gene) x. - Move nodes towards x: the closest node most, remote nodes (in terms of the grid) less. Decrease the amount of movement with the number of iterations. (Figure from Tamayo et al. 1999.)
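The update rule can be sketched in a few lines. The slide does not specify the neighborhood function or the decay schedule, so the Gaussian neighborhood on the grid, the linearly decaying learning rate and radius, and the initialization at randomly chosen data points are all assumptions of this sketch, as are the function and parameter names.

```python
import math
import random


def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def best_node(nodes, x):
    """Grid coordinates of the node whose prototype is closest to x."""
    return min(nodes, key=lambda g: sq_dist(nodes[g], x))


def train_som(data, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train an r*s self-organizing map; returns {(row, col): prototype}."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = max(rows, cols) / 2
    # Assumed initialization: each node starts at a randomly chosen data point.
    nodes = {(r, c): list(rng.choice(data))
             for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        frac = 1 - t / n_iter                # decays from 1 towards 0
        lr = lr0 * frac                      # movements shrink over time
        radius = max(radius0 * frac, 0.5)    # neighborhood shrinks over time
        x = rng.choice(data)                 # randomly pick one data point
        winner = best_node(nodes, x)
        for g, w in nodes.items():
            grid_d2 = (g[0] - winner[0]) ** 2 + (g[1] - winner[1]) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))  # 1 for the winner
            for j in range(dim):
                w[j] += lr * h * (x[j] - w[j])
    return nodes
```

After training, each gene is assigned to the cluster of its closest node via `best_node`, and neighboring nodes on the grid tend to hold similar prototypes.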

Self-organizing maps from Tamayo et al. 1999 (yeast cell cycle data)

Agglomerative hierarchical clustering Bottom-up algorithm (top-down (divisive) methods are less common). Start with the objects as clusters. In each iteration, merge the two clusters with the minimal distance from each other - until you are left with a single cluster comprising all objects. But what is the distance between two clusters?

Distances between clusters used for hierarchical clustering: Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters. Complete linkage: largest distance. Average linkage: average distance. Single linkage: smallest distance. Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long stretched clusters.
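The bottom-up merging procedure with the three linkage rules can be sketched as below. This is a naive version that recomputes all cluster distances in every iteration, written for readability rather than speed; the function names and the format of the returned merge list are choices made for this example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


# Each linkage rule reduces the list of pairwise distances to one number.
LINKAGES = {
    "single": min,                           # smallest pairwise distance
    "complete": max,                         # largest pairwise distance
    "average": lambda d: sum(d) / len(d),    # average pairwise distance
}


def cluster_distance(c1, c2, points, linkage):
    pairwise = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    return LINKAGES[linkage](pairwise)


def agglomerative(points, linkage="average"):
    """Return the list of merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(points))]  # start: one object per cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges
```

The merge distances returned here correspond to the node heights of the dendrogram; n objects always produce n-1 merges.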

Hierarchical clustering: Golub data: different types of leukemia. Clustering based on the 150 genes with highest variance across all samples. The height of a node in the dendrogram represents the distance between its two child clusters. Loss of information: n objects have n(n-1)/2 pairwise distances, but the tree has only n-1 inner nodes. The ordering of the leaves is not uniquely defined by the dendrogram: $2^{n-2}$ possible choices (the two subtrees at each of the n-1 inner nodes can be swapped, and an ordering and its mirror image are counted once).

Alternative: direct visualization of similarity/distance matrices. Useful if one wants to investigate a specific factor (advantage: no loss of information). Sort experiments according to that factor. (Figure: distance matrix with experiments sorted by array batch 1 and array batch 2.)

Clustering of time course data: Suppose we have expression data from different time points $t_1, \dots, t_n$, and want to identify typical temporal expression profiles by clustering the genes. Usual clustering methods/distance measures don’t take the ordering of the time points into account: the result would be the same if the time points were permuted. Simple modification: Consider the difference $x_{i(j+1)} - x_{ij}$ between consecutive time points as an additional observation $y_{ij}$. Then apply a clustering algorithm such as K-means to the augmented data matrix.
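The augmentation step can be sketched as follows. The example also illustrates the motivation: permuting the time points leaves the Euclidean distance between two raw profiles unchanged, but changes it once the consecutive differences are appended. The two gene profiles and the helper names are invented for this illustration.

```python
import math


def augment_with_differences(matrix):
    """Append y_ij = x_i(j+1) - x_ij to each gene's row of the data matrix."""
    augmented = []
    for row in matrix:
        diffs = [row[j + 1] - row[j] for j in range(len(row) - 1)]
        augmented.append(list(row) + diffs)
    return augmented


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permute_columns(matrix, order):
    """Reorder the time points (columns) of every row."""
    return [[row[j] for j in order] for row in matrix]


# Two hypothetical genes over four time points: one rising, one falling.
genes = [[0.0, 1.0, 2.0, 3.0],
         [3.0, 2.0, 1.0, 0.0]]
shuffled = permute_columns(genes, [0, 2, 1, 3])

raw_original = euclidean(genes[0], genes[1])
raw_shuffled = euclidean(shuffled[0], shuffled[1])
aug_original = euclidean(*augment_with_differences(genes))
aug_shuffled = euclidean(*augment_with_differences(shuffled))
```

`raw_original` and `raw_shuffled` are equal, whereas `aug_original` and `aug_shuffled` differ, so a clustering of the augmented matrix depends on the order of the time points.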

Biclustering: Usual clustering algorithms are based on global similarities of rows or columns of an expression data matrix. But the similarity of the expression profiles of a group of genes may be restricted to certain experimental conditions. Goal of biclustering: identify “homogeneous” submatrices. Difficulties: computational complexity, assessing the statistical significance of results. Example: Tanay et al. 2002.

The role of feature selection Sometimes, people first select genes that appear to be differentially expressed between groups of samples. Then they cluster the samples based on the expression levels of these genes. Is it remarkable if the samples then cluster into the two groups? No, this doesn’t prove anything, because the genes were selected with respect to the two groups! Such effects can even be obtained with a matrix of i.i.d. random numbers.
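The effect on random data can be reproduced with a small simulation. The sketch below draws an i.i.d. standard normal matrix, splits the samples into two arbitrary groups, selects the genes with the largest absolute difference in group means, and compares between-group with within-group sample distances. The matrix size, the number of selected genes, the seed, and the distance-ratio summary are all choices made for this illustration.

```python
import math
import random


def simulate(n_genes=2000, n_per_group=5, n_selected=30, seed=1):
    """Random data with no structure, plus genes selected by group difference."""
    rng = random.Random(seed)
    n = 2 * n_per_group
    data = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_genes)]

    def mean_diff(row):
        mean_a = sum(row[:n_per_group]) / n_per_group
        mean_b = sum(row[n_per_group:]) / n_per_group
        return mean_a - mean_b

    ranked = sorted(range(n_genes),
                    key=lambda g: abs(mean_diff(data[g])), reverse=True)
    return data, ranked[:n_selected]


def distance_ratio(data, genes, n_per_group):
    """Mean between-group over mean within-group Euclidean sample distance."""
    n = 2 * n_per_group
    within, between = [], []
    for s in range(n):
        for t in range(s + 1, n):
            d = math.sqrt(sum((data[g][s] - data[g][t]) ** 2 for g in genes))
            same_group = (s < n_per_group) == (t < n_per_group)
            (within if same_group else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))


data, selected = simulate()
ratio_all_genes = distance_ratio(data, range(len(data)), 5)
ratio_selected = distance_ratio(data, selected, 5)
```

With all genes the ratio stays close to 1, as expected for structureless data. With the selected genes the between-group distances are clearly larger than the within-group distances, so a clustering of the samples would appear to recover the two groups even though the grouping was arbitrary.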

Conclusions Clustering has been very popular in the analysis of microarray data. However, many typical questions in microarray studies (e.g. identifying expression signatures of tumor types) are better addressed by other methods (classification, statistical tests). Clustering algorithms are easy to apply (you cannot really do it wrong), and they are useful for exploratory analysis. But it is difficult to assess the validity/significance of the results. Even “random” data with no structure can yield clusters or exhibit interesting looking patterns.

Final Project Proposal: Aims: investigate a bioinformatics problem. Group: one or two people. Proposal: one page (problem description, data sources, algorithms, expected results). Test: presentation and final report.

Acknowledgement: Anja von Heydebreck