Clustering in Microarray Data-mining and Challenges Beyond Qing-jun Wang Center for Biophysics & Computational Biology University of Illinois at Urbana-Champaign.

Presentation transcript:

Clustering in Microarray Data-mining and Challenges Beyond Qing-jun Wang Center for Biophysics & Computational Biology University of Illinois at Urbana-Champaign CS491jh presentation March 7, 2002

Clustering What? Where? How? Challenges beyond clustering

Data Acquisition
- Experimental design
  - MIAME
  - Replicates
  - Single/multiple slides
- Perform experiment
- Collect data

Data Processing
- Grid alignment
- Data quality, e.g. bad data, S/N
- Missing data
- Normalization
  - Total intensity normalization
  - Regression techniques
  - Ratio statistics
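As a rough illustration of the total intensity normalization item above, the sketch below rescales the sample channel so that its summed intensity matches the reference channel, then reports log2 ratios. The function name and the toy intensities are hypothetical, and the sketch assumes the simplest form of the method (one global scaling factor per slide).

```python
import math

def total_intensity_normalize(sample, reference):
    """Scale the sample channel so its total intensity equals the reference total."""
    factor = sum(reference) / sum(sample)
    scaled = [s * factor for s in sample]
    # log2(sample / reference) after scaling; 0 means no change in expression.
    log_ratios = [math.log2(s / r) for s, r in zip(scaled, reference)]
    return factor, log_ratios

# Toy two-channel intensities (assumed values): the sample channel is
# uniformly twice as bright, which the global factor removes.
sample = [200.0, 400.0, 800.0, 600.0]
reference = [100.0, 200.0, 400.0, 300.0]
factor, log_ratios = total_intensity_normalize(sample, reference)
print(factor, log_ratios)
```

A single global factor only corrects a uniform brightness difference between channels; intensity-dependent effects are what the regression techniques in the list are meant to address.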

Gene Expression Matrix (Affymetrix GeneChip® oligonucleotide arrays) sam/ref

Gene Expression Matrix (glass slides)

Data Acquisition
- MIAME
- Experiment design
  - Replicates
  - Single/multiple slides

Data Processing
- Grid alignment
- Re-scale
- Data quality, e.g. bad data, S/N
- Missing data
- Normalization
  - Total intensity normalization
  - Regression techniques
  - Ratio statistics

Data Analysis
- Distance matrices
- Unsupervised analysis (clustering)
  - Hierarchical
  - Non-hierarchical (e.g. K-means, PCA-based clustering, self-organizing maps, block clustering, gene shaving, plaid models)
- Supervised analysis, e.g. SVM, K-nearest neighbor, decision trees, voted classification, weighted gene voting, Bayesian classification

Data Validation

Hierarchical clustering

Protocol
1. Calculate the pairwise distance matrix.
2. Find the two most similar genes or clusters.
3. Merge the two selected clusters to produce a new cluster.
4. Recalculate the pairwise distances involving the new cluster.
5. Repeat steps 2-4 until all objects are in one cluster.
6. The clustering sequence is represented by a hierarchical tree (dendrogram).

[Figure: dendrogram over objects a, b, c, d, e. Read from Step 0 to Step 4 it shows agglomerative clustering (AGNES); read in reverse, from Step 4 to Step 0, it shows divisive clustering (DIANA).]
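The protocol above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and average linkage (the slides do not fix either choice); the function names and the toy expression profiles are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerate(profiles):
    """Average-linkage agglomerative clustering; returns the merge history."""
    # Each cluster is a tuple of gene indices; start with one gene per cluster.
    clusters = [(i,) for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Steps 1-2: scan all cluster pairs for the smallest average distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(euclidean(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge the closest pair. Step 4 (distances involving the
        # new cluster) is recomputed on the next pass of the loop.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges

# Toy profiles (assumed values): genes 0-1 and genes 2-3 form tight pairs.
profiles = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 7.0], [20.0, 0.0]]
merges = agglomerate(profiles)
for left, right, dist in merges:
    print(left, right, round(dist, 3))
```

The merge history is the information a dendrogram draws: each entry gives the two clusters joined and the height at which they join. Recomputing every pairwise distance on each pass keeps the sketch short but is slower than updating a stored distance matrix.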

Hierarchical clustering

Variations (they differ in how distances between clusters are calculated)
- Single-linkage clustering: minimum distance
- Complete-linkage clustering: maximum distance
- Average-linkage clustering (UPGMA)
- Weighted pair-group average: uses the sizes of the clusters as the weights in computing averages
- Within-groups clustering
- Ward's method: smallest possible increase in the sum of squared errors
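To make the differences between these variations concrete, the sketch below computes the single-, complete- and average-linkage distances between two small clusters, plus the increase in the sum of squared errors that Ward's method would minimize. Function names and the toy clusters are hypothetical; Euclidean distance is assumed.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def pairwise(cluster_a, cluster_b):
    # All distances between a member of one cluster and a member of the other.
    return [euclidean(u, v) for u in cluster_a for v in cluster_b]

def single_linkage(a, b):
    return min(pairwise(a, b))      # minimum distance

def complete_linkage(a, b):
    return max(pairwise(a, b))      # maximum distance

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)          # mean of all pairwise distances

def centroid(cluster):
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def ward_increase(a, b):
    # Increase in the total sum of squared errors if a and b are merged.
    gap = euclidean(centroid(a), centroid(b))
    return len(a) * len(b) / (len(a) + len(b)) * gap ** 2

# Toy one-dimensional clusters (assumed values).
a = [[0.0], [1.0]]
b = [[4.0], [6.0]]
print(single_linkage(a, b), complete_linkage(a, b),
      average_linkage(a, b), ward_increase(a, b))
```

Any of these functions could replace the distance used when choosing which pair of clusters to merge; the choice changes the shape of the resulting tree.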

Hierarchical clustering
- Bottom-up (agglomerative) approach
- One-way clustering
- Deterministic clustering
- Produces a greater number of clusters than k-means clustering, a valuable feature for discovery
- Produces an order for objects, which is informative for data display

Difficulties
1. As clusters grow in size, the expression vector that represents the cluster might no longer represent any of the genes in the cluster (an artifact).
2. If a bad assignment is made early on, it cannot be corrected.

K-means clustering (non-hierarchical clustering)
- Top-down (divisive) approach
- Used when the number of clusters is known in advance
- One-way clustering
- Non-deterministic owing to the random initialization
- Produces tighter clusters than hierarchical clustering

Protocol
1. Initial reference vectors are assigned randomly or according to previous knowledge.
2. Assign each object to one of k clusters randomly.
3. Calculate the average expression vector for each cluster (used as the reference vector) and the distances between clusters.
4. Iteratively move objects between clusters; an object stays in its new cluster when it is closer to that cluster than to the old one.
5. Repeat steps 3-4 until convergence, i.e. until moving any more objects would increase intra-cluster distances.
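A minimal sketch of the iteration above, assuming squared Euclidean distance and one common initialization: k randomly chosen objects serve as the initial reference vectors, rather than a random assignment of every object. The function names, the `seed` and `max_iter` parameters, and the toy data are hypothetical.

```python
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Initial reference vectors: k distinct objects picked at random.
    centers = [list(p) for p in rng.sample(profiles, k)]
    labels = None
    for _ in range(max_iter):
        # Move every object to the cluster whose reference vector is closest.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                      for p in profiles]
        if new_labels == labels:
            break                    # converged: no object changed cluster
        labels = new_labels
        # Recompute each reference vector as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:              # keep the old center if a cluster empties
                centers[c] = mean_vector(members)
    return labels, centers

# Toy profiles (assumed values): two well-separated groups of three genes.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = k_means(profiles, k=2)
print(labels)
```

Which group receives label 0 depends on the random start, which is the non-determinism noted above; only the grouping itself is stable for data this well separated.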

K-means clustering (non-hierarchical clustering), example with K=2
1. Arbitrarily choose K objects as the initial cluster centers.
2. Assign each object to the most similar center.
3. Update the cluster means.
4. Reassign the objects and repeat.
(Borrowed from Dr. Jiawei Han, March 5, 2002)

K-means clustering (non-hierarchical clustering)

Difficulty
How to determine whether there are really only k distinct clusters represented in the data.

Solutions
- Use K-means clustering with principal component analysis (PCA), which allows visual estimation of the number of clusters represented in the data.
- Try a sequential k-means approach, which finds the number of clusters based on the dataset.

Self-organizing map clustering (non-hierarchical clustering)
- Top-down (divisive) approach
- One-way clustering
- Neural-network-based clustering approach
- Non-deterministic owing to the random order in which genes are used to move the reference vectors
- Similar to k-means clustering, except that the cluster centers are restricted to lie in a one- or two-dimensional manifold
- Models the complexity within a dataset more effectively than k-means clustering

Self-organizing map clustering (non-hierarchical clustering)
(Borrowed from Joshua Unger, Feb. 28, 2002)

Protocol
1. Define a geometric configuration for the partitions, e.g. a 2D rectangular or hexagonal grid.
2. Construct and assign random vectors to each partition.
3. Pick a gene randomly; identify the reference vector that is closest to the gene.
4. Adjust the reference vectors so that they are more similar to the gene vector.
5. Repeat steps 3-4 until the reference vectors converge.
6. Map genes to the relevant partitions based on the reference vectors to which they are most similar.
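The protocol can be sketched as follows. This is a simplified illustration that assumes a one-dimensional line of nodes instead of a 2D grid, a Gaussian neighbourhood, and linearly decaying learning rate and radius; none of these choices come from the slides, and all names, parameters and data values are hypothetical.

```python
import math
import random

def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest(refs, gene):
    # Index of the reference vector closest to the gene (step 3).
    return min(range(len(refs)), key=lambda n: sq_dist(refs[n], gene))

def train_som(profiles, n_nodes=3, n_steps=500, seed=0):
    rng = random.Random(seed)
    dim = len(profiles[0])
    # Steps 1-2: nodes sit on a line; each gets a random reference vector.
    refs = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for t in range(n_steps):
        frac = 1.0 - t / n_steps
        rate = 0.5 * frac                          # learning rate decays
        radius = max(1e-3, (n_nodes / 2) * frac)   # neighbourhood shrinks
        gene = rng.choice(profiles)                # step 3: random gene
        winner = nearest(refs, gene)
        # Step 4: pull the winner, and its neighbours more weakly, toward the gene.
        for node in range(n_nodes):
            influence = math.exp(-((node - winner) ** 2) / (2 * radius ** 2))
            refs[node] = [r + rate * influence * (g - r)
                          for r, g in zip(refs[node], gene)]
    return refs

# Toy profiles scaled to [0, 1] (assumed values).
profiles = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9], [0.5, 0.5]]
refs = train_som(profiles)
labels = [nearest(refs, p) for p in profiles]      # step 6: map genes to nodes
print(labels)
```

A fixed number of steps stands in for the convergence test of step 5. Updating the winner's neighbours along the line is what distinguishes this from k-means, where each center moves independently.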

One-way clustering: used to group genes with similar behavior across samples, or samples with similar gene expression vectors
- Hierarchical clustering
- K-means clustering (non-hierarchical)
- Self-organizing maps (non-hierarchical)

Two-way clustering: simultaneously clusters both genes and samples
- Block clustering (non-hierarchical)
- Gene shaving (non-hierarchical)
- Plaid models (non-hierarchical)
- …

Block clustering (non-hierarchical clustering)
- Top-down approach
- Two-way clustering
- Produces a matrix with homogeneous blocks of the outcomes
- Produces hierarchical clustering trees for the rows and columns

Protocol
1. Begin with the entire matrix in one block.
2. Sort rows and columns by row and column means.
3. Examine the possible row and column splits of all existing blocks and choose the one that produces the largest reduction in the total within-block variance.
4. If there are existing row/column splits that intersect the block, one of them must be used. Otherwise all split points are tried.
5. Continue splitting until a large number of blocks is obtained.
6. Apply weakest-link pruning to recombine some of the blocks until the optimal number of blocks is obtained.
7. The optimal number of blocks is estimated by the "maximum gap" approach.

[Figure: expression matrix partitioned into blocks, with axes Gene and Sample]
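Steps 2 and 3 of the protocol can be illustrated for a single block and for row splits only. The sketch below sorts the rows by their means and tries every split point, keeping the one with the largest reduction in within-block sum of squares. It ignores column splits, the constraint in step 4, and pruning; the function names and toy values are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values)

def flatten(rows):
    return [x for row in rows for x in row]

def best_row_split(block):
    """Return (sorted_rows, cut, reduction) for the best row split of one block."""
    # Step 2: order the rows by their means.
    rows = sorted(block, key=lambda r: sum(r) / len(r))
    total = sse(flatten(rows))
    best_cut, best_reduction = None, 0.0
    # Step 3: try every split point and keep the largest variance reduction.
    for cut in range(1, len(rows)):
        reduction = total - sse(flatten(rows[:cut])) - sse(flatten(rows[cut:]))
        if reduction > best_reduction:
            best_cut, best_reduction = cut, reduction
    return rows, best_cut, best_reduction

# Toy block (assumed values): two low-expression and two high-expression genes.
block = [[5.0, 5.2], [0.1, 0.0], [4.9, 5.1], [0.2, 0.1]]
rows, cut, reduction = best_row_split(block)
print(cut, round(reduction, 3))
```

Sorting first means only contiguous splits need to be tried, which is why the procedure struggles when all row means are nearly equal.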

Block clustering (non-hierarchical clustering)

Difficulty
When applied to median-centered data, all row and column means are approximately zero at the start, so the procedure has difficulty getting started.

Gene shaving (non-hierarchical clustering)

The two-way clustering approach seeks a single re-ordering of the samples for all genes. However, one set of genes might cluster the samples in one way while another set of genes clusters them in a very different way.

The gene shaving approach finds the linear combination of genes having maximal variation among samples. This linear combination of genes is viewed as a "super gene". The genes having the lowest correlation with the "super gene" are removed (shaved). The process is continued until the subset of genes contains only one gene. This process produces a sequence of gene blocks, each containing genes that are similar to one another and that display large variance across samples.

- A statistical approach
- Two-way clustering
- Identifies subsets of genes with coherent expression patterns and large variation across conditions
- A gene may belong to more than one cluster
- Can be either unsupervised or supervised

Gene shaving

Protocol
1. Start with all data in one block.
2. Find the first principal component of the genes.
3. For each gene i, compute the absolute value of its correlation with the first principal component.
4. Remove the fraction α of genes having the smallest absolute correlation.
5. Repeat steps 2-4 until only one gene remains.
6. This procedure produces a set of nested gene groups G1 ⊃ G2 ⊃ … ⊃ G* ⊃ … ⊃ Gn, from which G* is selected as the optimal gene block (small ), where the optimal shave size is estimated using the "maximum gap" method.
7. The rows of the gene expression matrix are orthogonalised with respect to the average of all genes in cluster G* to obtain a new gene expression matrix, which encourages discovery of a different second cluster. Repeat steps 2-7 until no interesting gene shaves can be found.
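Steps 1-5 of the protocol, which produce the nested sequence of gene groups, can be sketched as below. The sketch assumes genes are rows and samples are columns, finds the first principal component by power iteration, and uses a hypothetical default shave fraction `alpha`; the selection of G* and the orthogonalization in steps 6-7 are not implemented. All names and data values are illustrative.

```python
import math

def center(row):
    m = sum(row) / len(row)
    return [x - m for x in row]

def leading_component(rows, n_iter=200):
    """First principal component across samples, by power iteration."""
    p = len(rows[0])
    v = [1.0 / (j + 1) for j in range(p)]   # arbitrary non-constant start
    for _ in range(n_iter):
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        new = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in new))
        if norm == 0.0:
            break
        v = [w / norm for w in new]
    return v

def abs_corr(row, v):
    # |correlation| between a centered gene row and the component ("super gene").
    cv = center(v)
    den = math.sqrt(sum(a * a for a in row) * sum(b * b for b in cv))
    return abs(sum(a * b for a, b in zip(row, cv)) / den) if den else 0.0

def shave_sequence(expr, alpha=0.1):
    genes = list(range(len(expr)))
    rows = {g: center(expr[g]) for g in genes}
    nested = [list(genes)]
    while len(genes) > 1:
        v = leading_component([rows[g] for g in genes])
        ranked = sorted(genes, key=lambda g: abs_corr(rows[g], v), reverse=True)
        n_drop = max(1, int(alpha * len(genes)))      # shave at least one gene
        genes = sorted(ranked[:len(ranked) - n_drop])
        nested.append(list(genes))
    return nested

# Toy matrix (assumed values): genes 0-2 are coherent, gene 3 is low-variance noise.
expr = [[-3.0, -2.0, -1.0, 1.0, 2.0, 3.0],
        [-6.0, -4.0, -2.0, 2.0, 4.0, 6.0],
        [3.0, 2.0, 1.0, -1.0, -2.0, -3.0],
        [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]]
nested = shave_sequence(expr, alpha=0.25)
print(nested)
```

Because the absolute correlation is used, the anti-correlated gene 2 is kept alongside genes 0 and 1, while the noise gene is shaved first.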

Plaid models (non-hierarchical clustering)

A cellular process may involve a relatively small subset of genes in the dataset. The process may take place only in a small number of samples. Therefore, when the full dataset is analyzed, the signal of this process may be completely overwhelmed by the noise of the vast majority of unrelated data.

Plaid models search for interpretable biological structures in microarray data, i.e. subsets of the genes/samples, one of which can be used to cluster the other to yield stable and significant partitions/layers.

- Two-way clustering
- Allows a gene to be in more than one cluster or in none at all
- Allows a cluster of genes to be defined with respect to only a subset of samples, not necessarily all of them
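For readers who want the model behind these properties, the plaid model as formulated by Lazzeroni & Owen can be written as a sum of layers. The notation below is a sketch from that formulation rather than from the slides: Y_ij is the expression of gene i in sample j, K is the number of layers, and the indicators ρ_ik and κ_jk mark whether gene i and sample j belong to layer k.

```latex
Y_{ij} = \mu_0 + \sum_{k=1}^{K} \left( \mu_k + \alpha_{ik} + \beta_{jk} \right) \rho_{ik} \, \kappa_{jk},
\qquad \rho_{ik}, \kappa_{jk} \in \{0, 1\}
```

Because no constraint forces the ρ_ik to sum to one over k, a gene can fall in several layers or in none, and each layer covers only the samples with κ_jk = 1.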

Plaid models: Non-hierarchical clustering Ideal reordering: Every gene and every sample are in exactly one cluster

Plaid models: Non-hierarchical clustering

Evaluate clustering
- Clarity of cluster definitions
- Computational cost
- Robustness
- Reproducibility

Cancer research
- Cancer typing
- Correlating whole-genome expression patterns with particular clinical implications
- Diagnosing malignant tissue from normal tissue

Drug effect study

Pathway discovery
- Assign functions to unknown genes
- Gene network & regulation: metabolism, photosynthesis, cell cycle, …

Challenges beyond clustering
- Understand sources of noise and variation in microarray experiments
- Combine expression data with other sources of information
  - Published literature
  - DNA & protein sequence databases
  - Protein Data Bank
  - Phylogenetic profiles
  - Metabolic function
  - Annotated experimental functional studies

Clustering

Assumption: guilt-by-association
Genes that are contained in a particular pathway, or that respond to a common environmental challenge, should be co-regulated and, consequently, should show similar patterns of expression.

This is a controversial hypothesis because of the existence of:
- Convergent regulation (similar temporal expression patterns, different control strategies)
- Divergent regulation (similar control regions, different ways of taking effect)

Challenges beyond clustering
- Understand sources of noise and variation in microarray experiments
- Combine expression data with other sources of information
  - Published literature
  - DNA & protein sequence databases
  - Protein Data Bank
  - Phylogenetic profiles
  - Metabolic function
  - Annotated experimental functional studies
- Reconstruct networks of genetic interactions to create integrated and systematic models of biological systems
  - Boolean networks
  - Linear modeling
  - Genetic programming
  - Bayesian belief networks

References
1. Quackenbush (2001) Nature Reviews Genetics 2.
2. Altman & Raychaudhuri (2001) Curr. Opin. Struct. Biol. 11.
3. Lazzeroni & Owen (2000) Tech. Report, Stanford Univ.
4. Aas (2001) SAMBA.
5. Tibshirani et al. (1999) Tech. Report, Stanford Univ.
6. Hastie et al. (2000) Genome Biol. 1(2).