Clustering Gene Expression Data

Presentation transcript:

Mar 2002 (GG)1 Clustering Gene Expression Data
Gene Expression Data
Clustering of Genes and Conditions
Methods
–Agglomerative Hierarchical: Average Linkage
–Centroids: K-Means
–Physically motivated: Super-Paramagnetic Clustering
Coupled Two-Way Clustering
EMBnet: DNA Microarrays Workshop, Mar. 4 – Mar. 8, 2002, UNIL & EPFL, Lausanne
Gaddy Getz, Weizmann Institute, Israel

Mar 2002 (GG)2 Gene Expression Technologies
DNA chips (Affymetrix) and microarrays can measure the mRNA concentration of thousands of genes simultaneously.
General scheme: extract RNA, synthesize labeled cDNA, hybridize with the DNA on the chip.

Mar 2002 (GG)3 Single Experiment
After hybridization
–Scan the chip and obtain an image file
–Image analysis (find spots, measure signal and noise). Tools: ScanAlyze, Affymetrix, …
Output file
–Affymetrix chips: for each gene, a reading proportional to the concentration and a present/absent call (Average Difference, Absent Call)
–cDNA microarrays: competing hybridization of target and control; for each gene, the log ratio of target and control (CH1I-CH1B, CH2I-CH2B)

Mar 2002 (GG)4 Preprocessing: From one experiment to many
Chip and Channel Normalization
–Aim: bring the readings of all experiments onto the same scale
–Cause: different RNA amounts, labeling efficiency and image acquisition parameters
–Method: multiply the readings of each array/channel by a scaling factor such that:
  The sum of the scaled readings is the same for all arrays
  Find the scaling factor by a linear fit of the highly expressed genes
–Note: in multi-channel experiments, normalize each channel separately.
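A minimal sketch of the first scaling rule on this slide (equal sums across arrays), in Python. The function name `normalize_arrays`, the example readings, and the choice of the mean total as the common target are assumptions made for illustration; the linear-fit variant is not shown.

```python
def normalize_arrays(arrays, target_sum=None):
    """Scale each array so that every array has the same total reading."""
    totals = [sum(a) for a in arrays]
    if target_sum is None:
        # Assumption: use the mean total over all arrays as the common scale.
        target_sum = sum(totals) / len(totals)
    # One multiplicative scaling factor per array/channel.
    factors = [target_sum / t for t in totals]
    scaled = [[x * f for x in a] for a, f in zip(arrays, factors)]
    return scaled, factors


# Two hypothetical arrays: same sample, different overall intensity.
arrays = [[100.0, 200.0, 700.0], [50.0, 100.0, 350.0]]
scaled, factors = normalize_arrays(arrays)
```

After scaling, both arrays have the same total, so readings of the same gene can be compared across experiments.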

Mar 2002 (GG)5 Preprocessing: From one experiment to many
Filtering of Genes
–Remove genes that are absent in most experiments
–Remove genes that are constant in all experiments
–Remove genes with low readings, which are not reliable.
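The three filters could be sketched as follows. The function `filter_genes`, the thresholds (`min_present_frac`, `min_range`, `min_level`) and the example data are hypothetical; the slides do not give cut-off values.

```python
def filter_genes(readings, present, min_present_frac=0.5, min_range=0.0, min_level=20.0):
    """Return the names of genes that pass all three filters.

    readings: gene name -> list of readings, one per experiment
    present:  gene name -> list of present (True) / absent (False) calls
    """
    kept = []
    for gene, values in readings.items():
        calls = present[gene]
        if sum(calls) / len(calls) < min_present_frac:
            continue  # absent in most experiments
        if max(values) - min(values) <= min_range:
            continue  # constant in all experiments
        if max(values) < min_level:
            continue  # readings are low everywhere, hence unreliable
        kept.append(gene)
    return kept


readings = {
    "g1": [120.0, 300.0, 80.0, 500.0],
    "g2": [10.0, 12.0, 9.0, 11.0],
    "g3": [200.0, 200.0, 200.0, 200.0],
    "g4": [5.0, 15.0, 8.0, 12.0],
}
present = {
    "g1": [True, True, True, True],
    "g2": [False, False, False, True],
    "g3": [True, True, True, True],
    "g4": [True, True, True, True],
}
kept = filter_genes(readings, present)
```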

Mar 2002 (GG)6 Noise and Repeats
[Figure: log–log plot of repeat experiments; labels: ">90%", "2 to 3 fold"]
Multiplicative noise
Log scale: dist(4,2) = dist(2,1)
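A small check of why the log scale suits multiplicative noise: equal fold changes become equal distances. The helper `log_distance` and the use of base 2 are assumptions for illustration.

```python
import math


def log_distance(a, b):
    """Distance between two positive readings on a log2 scale."""
    return abs(math.log2(a) - math.log2(b))


# A two-fold change is the same distance wherever it occurs on the scale.
d_high = log_distance(4, 2)
d_low = log_distance(2, 1)

# On the linear scale the same two fold changes look different.
linear_high = abs(4 - 2)
linear_low = abs(2 - 1)
```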

Mar 2002 (GG)7 We can ask many questions
Which genes are expressed differently in two known types of conditions?
What is the minimal set of genes needed to distinguish one type of condition from the others?
Which genes behave similarly in the experiments?
How many different types of conditions are there?
Supervised Methods (use predefined labels)
Unsupervised Methods (use only the data)

Mar 2002 (GG)8 Unsupervised Analysis
Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated.
Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
Clustering Methods

Mar 2002 (GG)9 What is clustering?

Mar 2002 (GG)10 Cluster Analysis Yields Dendrogram
[Figure: dendrogram; axis label: T (resolution)]

Mar 2002 (GG)11 What is clustering? More Mathematically
Input: N data points, X_i, i=1,2,…,N, in a D-dimensional space.
Goal: Find “natural” groups or clusters. Data points in the same cluster are “more similar”.
Tasks:
–Determine the number of clusters
–Generate a dendrogram
–Identify significant “stable” clusters

Mar 2002 (GG)12 Clustering is ill-posed
Problem-specific definitions
Similarity: which points should be considered close?
–Correlation coefficient
–Euclidean distance
Resolution: specify/hierarchical results
Shape of clusters: general, spherical.

Mar 2002 (GG)13 Adjusting Data

Mar 2002 (GG)14 Similarity Measure
Similarity measures
–Centered correlation
–Uncentered correlation
–Absolute correlation
–Euclidean

Mar 2002 (GG)15 Similarity Measure
Similarity measures
–Centered correlation
–Uncentered correlation
–Absolute correlation
–Euclidean
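The four measures can be written out as below. This is a sketch under common definitions (centered correlation as the Pearson coefficient, uncentered correlation as the cosine of the angle between the vectors); taking the absolute value of the centered correlation for "absolute correlation" is an assumption, since the slides do not state which variant is meant.

```python
import math


def uncentered_correlation(x, y):
    """Cosine of the angle between x and y (no mean subtraction)."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


def centered_correlation(x, y):
    """Pearson correlation: subtract each vector's mean, then correlate."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return uncentered_correlation([a - mx for a in x], [b - my for b in y])


def absolute_correlation(x, y):
    """Treats correlated and anti-correlated profiles as equally similar."""
    return abs(centered_correlation(x, y))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Two hypothetical expression profiles that are perfectly anti-correlated.
x = [1.0, 2.0, 3.0]
y = [3.0, 2.0, 1.0]
results = {
    "centered": centered_correlation(x, y),
    "uncentered": uncentered_correlation(x, y),
    "absolute": absolute_correlation(x, y),
    "euclidean": euclidean_distance(x, y),
}
```

The same pair of profiles is maximally dissimilar under centered correlation and maximally similar under absolute correlation, which is why the choice of measure is problem specific.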

Mar 2002 (GG)16 Agglomerative Hierarchical Clustering
Distance between joined clusters
Need to define the distance between the new cluster and the other clusters.
–Single Linkage: distance between closest pair.
–Complete Linkage: distance between farthest pair.
–Average Linkage: average distance between all pairs, or distance between cluster centers
Dendrogram
The dendrogram induces a linear ordering of the data points
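The three linkage rules differ only in how they summarize the pairwise distances between two clusters. A sketch with hypothetical one-dimensional points (average linkage is shown in its "all pairs" form, not the cluster-center form):

```python
def pair_distances(cluster_a, cluster_b, dist):
    """All distances between a point of cluster_a and a point of cluster_b."""
    return [dist(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b, dist):
    return min(pair_distances(cluster_a, cluster_b, dist))  # closest pair


def complete_linkage(cluster_a, cluster_b, dist):
    return max(pair_distances(cluster_a, cluster_b, dist))  # farthest pair


def average_linkage(cluster_a, cluster_b, dist):
    d = pair_distances(cluster_a, cluster_b, dist)
    return sum(d) / len(d)  # average over all pairs


def abs_diff(p, q):
    return abs(p - q)


a, b = [0.0, 1.0], [3.0, 5.0]
single = single_linkage(a, b, abs_diff)
complete = complete_linkage(a, b, abs_diff)
average = average_linkage(a, b, abs_diff)
```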

Mar 2002 (GG)17 Agglomerative Hierarchical Clustering
Results depend on the distance update method
–Single Linkage: elongated clusters
–Complete Linkage: sphere-like clusters
Greedy iterative process
NOT robust against noise
No inherent measure to choose the clusters
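The greedy iterative process can be sketched as a loop that repeatedly merges the two closest clusters. The function `agglomerate` and the example points are illustrative assumptions; this naive version recomputes all pairwise distances at every step and is meant for reading, not for large data sets.

```python
def mean(values):
    return sum(values) / len(values)


def agglomerate(points, dist, linkage=mean):
    """Greedy agglomerative clustering.

    linkage summarizes the pairwise distances between two clusters:
    min = single, max = complete, mean = average linkage.
    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage([dist(points[i], points[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


points = [0.0, 1.0, 5.0, 6.0, 20.0]
merges = agglomerate(points, lambda p, q: abs(p - q), linkage=mean)
heights = [m[2] for m in merges]
```

Each merge is final once made, which is one reason a single noisy point can change the rest of the tree.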

Mar 2002 (GG)18 Centroid Methods - K-means
Iteration = 0
Start with random positions of K centroids.
Iterate until centroids are stable:
–Assign points to centroids
–Move centroids to the center of their assigned points

Mar 2002 (GG)19 Centroid Methods - K-means
Iteration = 1
Start with random positions of K centroids.
Iterate until centroids are stable:
–Assign points to centroids
–Move centroids to the center of their assigned points

Mar 2002 (GG)20 Centroid Methods - K-means
Iteration = 1
Start with random positions of K centroids.
Iterate until centroids are stable:
–Assign points to centroids
–Move centroids to the center of their assigned points

Mar 2002 (GG)21 Centroid Methods - K-means
Iteration = 3
Start with random positions of K centroids.
Iterate until centroids are stable:
–Assign points to centroids
–Move centroids to the center of their assigned points
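The two alternating steps can be sketched in plain Python. Choosing the initial centroids as K randomly picked data points, the `init` override, and the `max_iter` cap are assumptions added for illustration.

```python
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(group):
    return tuple(sum(coords) / len(group) for coords in zip(*group))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Alternate assignment and centroid update until the centroids are stable."""
    rng = random.Random(seed)
    # Assumption: random start = K randomly chosen data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    groups = []
    for _ in range(max_iter):
        # Step 1: assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Step 2: move each centroid to the center of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = [mean_point(g) if g else centroids[c]
                         for c, g in enumerate(groups)]
        if new_centroids == centroids:
            break  # centroids are stable
        centroids = new_centroids
    return centroids, groups


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, groups = kmeans(points, k=2, init=[(0.0, 0.0), (10.0, 10.0)])
```

Passing a different `init` (or a different `seed`) can lead to a different final partition on less well-separated data.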

Mar 2002 (GG)22 Centroid Methods - K-means
Result depends on the initial centroids’ positions
Fast algorithm: compute distances from data points to centroids
No way to choose K. Example: 3 clusters / K=2, 3, 4
Breaks long clusters

Mar 2002 (GG)23 Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
The idea behind SPC is based on the physical properties of dilute magnets.
Calculating the correlation between magnet orientations at different temperatures (T).
T=Low

Mar 2002 (GG)24 Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
The idea behind SPC is based on the physical properties of dilute magnets.
Calculating the correlation between magnet orientations at different temperatures (T).
T=High

Mar 2002 (GG)25 Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Wiseman and E. Domany (1996) Neural Computation
The idea behind SPC is based on the physical properties of dilute magnets.
Calculating the correlation between magnet orientations at different temperatures (T).
T=Intermediate

Mar 2002 (GG)26 Super-Paramagnetic Clustering (SPC)
The algorithm simulates the magnets' behavior over a range of temperatures and calculates their correlations
The temperature (T) controls the resolution
Example: N=4800 points in D=2

Mar 2002 (GG)27 Output of SPC
Size of the largest clusters as a function of T
Dendrogram
Stable clusters “live” for large ΔT
A function χ(T) that peaks when stable clusters break

Mar 2002 (GG)28 Choosing a value for T

Mar 2002 (GG)29 Advantages of SPC
Scans all resolutions (T)
Robust against noise and initialization: calculates collective correlations
Identifies “natural” (χ) and stable (ΔT) clusters
No need to pre-specify the number of clusters
Clusters can be any shape

Mar 2002 (GG)30 Many clustering methods applied to expression data
Agglomerative Hierarchical
–Average Linkage (Eisen et al., PNAS 1998)
Centroid (representative)
–K-Means (Golub et al., Science 1999)
–Self Organized Maps (Tamayo et al., PNAS 1999)
Physically motivated
–Deterministic Annealing (Alon et al., PNAS 1999)
–Super-Paramagnetic Clustering (Getz et al., Physica A 2000)

Mar 2002 (GG)31 Available Tools
Software packages:
–M. Eisen’s programs for clustering and display of results (Cluster, TreeView): predefined set of normalizations and filtering; agglomerative, K-means, 1D SOM
Web sites:
–Coupled Two-Way Clustering (CTWC) website: both CTWC and SPC
General mathematical tools:
–MATLAB: agglomerative, public m-files
–Statistical programs (SPSS, SAS, S-plus)

Mar 2002 (GG)32 Back to gene expression data
2 goals: cluster genes and conditions
2 independent clusterings:
–Genes are represented as vectors of expression in all conditions
–Conditions are represented as vectors of expression of all genes
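In matrix terms, the two representations are the rows and the columns of the same expression matrix. A short sketch with made-up values (rows are genes, columns are conditions):

```python
# Hypothetical expression matrix: 3 genes (rows) x 4 conditions (columns).
matrix = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [5.0, 1.0, 5.0, 1.0],
]

# Clustering genes: each gene is a vector over all conditions (a row).
gene_vectors = [list(row) for row in matrix]

# Clustering conditions: each condition is a vector over all genes (a column),
# obtained by transposing the matrix.
condition_vectors = [list(col) for col in zip(*matrix)]
```

Any clustering method that accepts a list of vectors can then be run once on `gene_vectors` and once on `condition_vectors`.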

Mar 2002 (GG)33 First Clustering - Experiments
1. Identify tissue classes (tumor/normal)

Mar 2002 (GG)34 Second Clustering - Genes
2. Find differentiating and correlated genes
[Figure labels: Ribosomal proteins, Cytochrome C, HLA2, metabolism]

Mar 2002 (GG)35 Two-way Clustering

Mar 2002 (GG)36 Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
Motivation:
–Only a small subset of genes plays a role in a particular biological process; the other genes introduce noise, which may mask the signal of the important players.
–Only a subset of the samples exhibits the expression patterns of interest.
New Goal: Use subsets of genes to study subsets of samples (and vice versa)
A non-trivial task: there is an exponential number of subsets.
CTWC is a heuristic to solve this problem.
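The motivation (not the CTWC algorithm itself) can be illustrated with a toy example: the nearest neighbor of a sample changes once only a relevant gene subset is used. All gene names and values below are made up, and the "other" genes carry an unrelated pattern that stands in for the masking effect.

```python
import math

# Hypothetical expression values: gene -> readings for samples S1..S4.
expression = {
    "signal_a": [0.0, 0.0, 5.0, 5.0],
    "signal_b": [0.0, 0.0, 5.0, 5.0],
    "other_1": [9.0, 0.0, 9.0, 0.0],
    "other_2": [0.0, 8.0, 0.0, 8.0],
    "other_3": [7.0, 0.0, 7.0, 0.0],
}
N_SAMPLES = 4


def sample_vector(sample, genes):
    """Represent a sample by its expression over the chosen genes only."""
    return [expression[g][sample] for g in genes]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_sample(sample, genes):
    """Index of the sample closest to `sample` when only `genes` are used."""
    others = [j for j in range(N_SAMPLES) if j != sample]
    return min(others, key=lambda j: euclidean(sample_vector(sample, genes),
                                               sample_vector(j, genes)))


all_genes = list(expression)
signal_genes = ["signal_a", "signal_b"]

neighbor_all = nearest_sample(0, all_genes)        # S1 pairs with S3 (index 2)
neighbor_subset = nearest_sample(0, signal_genes)  # S1 pairs with S2 (index 1)
```

With all genes, S1 groups with S3 because the unrelated genes dominate the distance; restricted to the two signal genes, S1 groups with S2. Finding such subsets without knowing them in advance is the task CTWC addresses.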

Mar 2002 (GG)37 Football Booing Cheering

Mar 2002 (GG)38 CTWC of colon cancer data (A) (B)

Mar 2002 (GG)39 CTWC of colon cancer - genes
Using only the tumor tissues to cluster genes reveals a correlation between two gene clusters: cell growth and epithelial.
COLON CANCER - ASSOCIATED WITH EPITHELIAL CELLS

Mar 2002 (GG)40 CTWC of Glioblastoma Data – S1(G5)
Godard, Getz, Kobayashi, Nozaki, Diserens, Hamon, Stupp, Janzer, Bucher, de Tribolet, Domany & Hegi (2002) Submitted
[Figure labels: Glioma cell line, Low grade astrocytoma, Secondary GBM, Primary GBM, p53 mutation; S11 S12 S14 S10 S13]
–AB STAT-induced STAT inhibitor 3
–M32977 VEGF (ANGIOGENESIS)
–M35410 IGFBP2
–X51602 VEGFR1 (ANGIOGENESIS)
–M96322 Gravin
–AB STAT-induced STAT inhibitor 2
–X52946 PTN
–J04111 C-JUN
–X79067 TIS11B

Mar 2002 (GG)41 Biological Work
Literature search for the genes
Genomics: search for a common regulatory signal upstream of the genes
Proteomics: infer functions.
Design the next experiment: get more data to validate the result.
Find what sets of experiments/conditions have in common.

Mar 2002 (GG)42 Summary
Clustering methods are used to
–find genes from the same biological process
–group the experiments into similar conditions
Different clustering methods can give different results. The physically motivated ones are more robust.
Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions.