Discrimination and clustering with microarray gene expression data Terry Speed, Jane Fridlyand, Yee Hwa Yang and Sandrine Dudoit* Department of Statistics,

Presentation transcript:

Discrimination and clustering with microarray gene expression data Terry Speed, Jane Fridlyand, Yee Hwa Yang and Sandrine Dudoit* Department of Statistics, UC Berkeley, *Department of Biochemistry, Stanford University ENAR, Charlotte NC, March

Outline: Introductory comments; Classification; Clustering; A synthesis; Concluding remarks

Tumor classification A reliable and precise classification of tumors is essential for successful treatment of cancer. Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables. In spite of recent progress, there are still uncertainties in diagnosis. Also, it is likely that the existing classes are heterogeneous. DNA microarrays may be used to characterize the molecular variations among tumors by monitoring gene expression on a genomic scale.

Tumor classification, ctd There are three main types of statistical problems associated with tumor classification: 1. The identification of new/unknown tumor classes using gene expression profiles; 2. The classification of malignancies into known classes; 3. The identification of “marker” genes that characterize the different tumor classes. These issues are relevant to other questions we meet, e.g. characterising/classifying neurons or the toxicity of chemicals administered to cells or model animals.

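Gene Expression Data Gene expression data on p genes for n samples, arranged as a matrix with genes as rows and mRNA samples as columns (sample1, sample2, sample3, sample4, sample5, …). Gene expression level of gene i in mRNA sample j = Log(Red intensity / Green intensity) for two-colour cDNA arrays, or Log(Avg. PM - Avg. MM) for oligonucleotide arrays (PM = perfect match, MM = mismatch).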
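A minimal sketch of how the log-ratio expression measure above might be computed for two-colour arrays. The function names, the toy intensities and the choice of base 2 are assumptions made for illustration; the slide only says "Log".

```python
import math

def log_ratio(red, green):
    """Log2 ratio of red to green intensity for one spot.

    Base 2 is an assumption; both intensities must be positive.
    """
    return math.log2(red / green)

def expression_matrix(red, green):
    """Build the genes x samples matrix of log ratios.

    red and green are lists of rows (one row per gene, one column per sample).
    """
    return [[log_ratio(r, g) for r, g in zip(red_row, green_row)]
            for red_row, green_row in zip(red, green)]

# Toy intensities for 2 genes and 2 samples (made-up numbers).
red_intensity = [[200.0, 400.0], [100.0, 50.0]]
green_intensity = [[100.0, 100.0], [100.0, 200.0]]
expression = expression_matrix(red_intensity, green_intensity)
```

A value of 0 means equal red and green signal; positive values mean more red than green.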

Comparison of discrimination methods In this field many people are inventing new classification methods or using quite complex ones (e.g. SVMs). Is this necessary? We did a study comparing several methods on three publicly available tumor data sets: the Leukemia data set, the Lymphoma data set, and the NCI 60 tumor cell line data, as well as some unpublished data sets. We compared NN (nearest neighbours), FLDA (Fisher linear discriminant analysis), DLDA (diagonal linear discriminant analysis), DQDA (diagonal quadratic discriminant analysis) and CART (classification and regression trees), the last with or without aggregation (bagging or boosting). The results were unequivocal: simplest is best!

Lymphoma data set: 29 B-CLL, 9 FL, 43 DLBCL. [Figure: images of the correlation matrix between the 81 samples, based on 4,682 genes and on 50 genes.]

Cluster Analysis Can cluster genes, cell samples, or both. Strengthens signal when averages are taken within clusters of genes (Eisen). Useful (essential?) when seeking new subclasses of cells, tumors, etc. Leads to readily interpreted figures.

Clusters Taken from the Nature paper by A. Alizadeh et al., February 2000: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling.

Discovering sub-groups

Clustering problems Suppose we have gene expression data on p genes for n tumor mRNA samples in the form of gene expression profiles x_i = (x_i1, …, x_ip), i = 1, …, n. Three related tasks are: 1. Estimating the number of tumor clusters; 2. Assigning each tumor sample to a cluster; 3. Assessing the strength/confidence of cluster assignments for individual tumors. These are generic clustering problems.

Assessing the strength/confidence of cluster assignments The silhouette width of an observation is s = (b - a)/max(a, b), where a is the average dissimilarity between the observation and all other observations in the cluster to which it belongs, and b is the smallest of the average dissimilarities between the observation and the observations in each other cluster. s lies between -1 and 1; a large s means the observation is well clustered.
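The silhouette width defined above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the toy data, and the convention of returning 0 for an observation that is alone in its cluster (where a is undefined) are assumptions.

```python
def silhouette_widths(dist, labels):
    """Silhouette width s = (b - a) / max(a, b) for every observation.

    dist   : symmetric n x n matrix of pairwise dissimilarities
    labels : cluster label of each observation
    """
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        # Average dissimilarity from observation i to each cluster,
        # leaving i itself out of its own cluster.
        avg = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                avg[c] = sum(dist[i][j] for j in members) / len(members)
        own = labels[i]
        others = [value for c, value in avg.items() if c != own]
        if own not in avg or not others:
            # Assumed convention: singleton cluster or only one cluster.
            widths.append(0.0)
            continue
        a = avg[own]
        b = min(others)
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths

# Toy example: four points on a line forming two tight pairs.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(x - y) for y in points] for x in points]
good = silhouette_widths(dist, [0, 0, 1, 1])   # sensible clustering
bad = silhouette_widths(dist, [0, 1, 1, 1])    # point 1.0 is misassigned
```

In the toy example the sensible clustering gives widths close to 1, while the misassigned point gets a negative width.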

Bagging In discriminant analysis, it is well known that gains in accuracy can be obtained by aggregating predictors built from perturbed versions of the learning set (cf. bagging and boosting). In the bootstrap aggregating or bagging procedure, perturbed learning sets of the same size as the original learning set are formed by drawing at random with replacement from the learning set, i.e., by forming nonparametric bootstrap replicates of the learning set. Predictors are built for each perturbed data set and aggregated by plurality voting.
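The bagging procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the base predictor is a one-nearest-neighbour rule chosen for brevity (the slides do not fix a particular predictor here), and all function names, the seed and the toy learning set are hypothetical.

```python
import random
from collections import Counter

def nearest_neighbour_predictor(learning_set):
    """Assumed base predictor: label of the closest learning observation."""
    def predict(x):
        nearest = min(
            learning_set,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)),
        )
        return nearest[1]
    return predict

def bagged_predict(learning_set, x, n_boot=50, seed=0,
                   fit=nearest_neighbour_predictor):
    """Aggregate predictors built on bootstrap replicates by plurality vote.

    Returns the winning class and its share of the votes.
    """
    rng = random.Random(seed)
    n = len(learning_set)
    votes = Counter()
    for _ in range(n_boot):
        # Nonparametric bootstrap: draw n observations with replacement.
        replicate = [learning_set[rng.randrange(n)] for _ in range(n)]
        votes[fit(replicate)(x)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_boot

# Toy learning set: (feature vector, class label) pairs.
learning_set = [
    ((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.2, 0.1), "A"),
    ((5.0, 5.0), "B"), ((5.1, 4.9), "B"), ((4.9, 5.2), "B"),
]
predicted_class, vote_share = bagged_predict(learning_set, (0.05, 0.05))
```

The vote share gives a rough indication of how stable the prediction is across perturbed learning sets.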

Bagging a clustering algorithm For a fixed number k of clusters:
– Generate multiple bootstrap learning sets (B=50);
– Apply the clustering algorithm to each bootstrap learning set;
– Re-label the clusters for the bootstrap learning sets so that there is maximum overlap with the original clustering of these observations;
– The cluster assignment of each observation is then obtained by plurality voting.
Record for each observation its cluster vote (CV), which is the proportion of votes in favour of the “winning” cluster.
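A rough sketch of the bagged clustering steps listed above. Several things here are assumptions rather than the authors' method: the clustering algorithm is a toy k-means with a deterministic start, re-labelling tries every permutation of the k labels (only practical for small k), and an observation votes only in the bootstrap sets that contain it.

```python
import random
from collections import Counter
from itertools import permutations

def sqdist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def kmeans(points, k, n_iter=20):
    """Toy k-means (assumed stand-in for 'the clustering algorithm')."""
    centres = [points[0]]
    while len(centres) < k:  # farthest-first initialisation
        centres.append(max(points,
                           key=lambda p: min(sqdist(p, c) for c in centres)))
    labels = []
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sqdist(p, centres[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return labels

def bagged_clustering(points, k, n_boot=50, seed=0, cluster=kmeans):
    """Return (cluster, CV) per observation, CV being the winning vote share."""
    rng = random.Random(seed)
    n = len(points)
    original = cluster(points, k)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap learning set
        boot = cluster([points[i] for i in idx], k)
        # Re-label for maximum overlap with the original clustering.
        best = max(permutations(range(k)),
                   key=lambda perm: sum(perm[b] == original[i]
                                        for i, b in zip(idx, boot)))
        seen = {i: best[b] for i, b in zip(idx, boot)}
        for i, lab in seen.items():  # one vote per observation per replicate
            votes[i][lab] += 1
    result = []
    for i in range(n):
        if votes[i]:
            lab, count = votes[i].most_common(1)[0]
            result.append((lab, count / sum(votes[i].values())))
        else:
            result.append((original[i], 0.0))  # never drawn: keep original
    return result

# Toy data: two well-separated groups of three observations.
samples = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
bagged = bagged_clustering(samples, k=2)
```

Observations with a low CV are the ones whose cluster membership changes across bootstrap sets, so they are natural candidates for closer inspection.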

Lymphoma data set

Leukemia data set

Comparison of clustering and other approaches to microarray data analysis Cluster analyses: 1) are usually outside the normal framework of statistical inference; 2) are less appropriate when only a few genes are likely to change; 3) need lots of experiments. Single gene approaches: 1) may be too noisy in general to show much; 2) may not reveal coordinated effects of positively correlated genes; 3) are harder to relate to pathways.

Clustering as a means to an end We and others (Stanford) are working on methods which try to combine clustering with more traditional approaches to microarray data analysis. Idea: find clusters of genes and average their responses to reduce noise and enhance interpretability. Use testing to assign significance to averages of clusters of genes, as we would with single genes.

Clustering genes Let p = number of genes. 1. Calculate within-class correlation. 2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes (the p single genes plus the p-1 merged clusters). 3. Average within clusters of genes. 4. Perform testing on averages of clusters of genes as if they were single genes. E.g. p=5: Cluster 6=(1,2), Cluster 7=(1,2,3), Cluster 8=(4,5), Cluster 9=(1,2,3,4,5).
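The four steps above can be sketched as follows. The details are assumptions made for illustration: the dissimilarity is one minus the correlation of class-centred profiles, the linkage is average linkage, and the test is a Welch-type two-sample t-statistic. The slide names none of these choices, and the data are made up.

```python
import math
from statistics import mean, variance

def centre_within_class(profile, classes):
    """Subtract each class mean so correlation reflects within-class variation."""
    means = {c: mean(v for v, k in zip(profile, classes) if k == c)
             for c in set(classes)}
    return [v - means[k] for v, k in zip(profile, classes)]

def correlation(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def gene_clusters(data, classes):
    """All 2p-1 clusters (tuples of gene indices): p singletons, p-1 merges."""
    p = len(data)
    centred = [centre_within_class(gene, classes) for gene in data]
    d = [[1.0 - correlation(centred[i], centred[j]) for j in range(p)]
         for i in range(p)]

    def linkage(a, b):  # average linkage (an assumption)
        return sum(d[i][j] for i in a for j in b) / (len(a) * len(b))

    active = [(i,) for i in range(p)]
    clusters = list(active)
    while len(active) > 1:
        a, b = min(((a, b) for pos, a in enumerate(active)
                    for b in active[pos + 1:]),
                   key=lambda pair: linkage(*pair))
        active = [c for c in active if c not in (a, b)] + [a + b]
        clusters.append(a + b)
    return clusters

def cluster_average(data, cluster):
    """Average the expression profiles of the genes in one cluster."""
    return [mean(data[g][j] for g in cluster) for j in range(len(data[0]))]

def t_statistic(profile, classes, group_a, group_b):
    """Welch-type two-sample t-statistic (assumed test), group_b minus group_a."""
    xa = [v for v, k in zip(profile, classes) if k == group_a]
    xb = [v for v, k in zip(profile, classes) if k == group_b]
    se = math.sqrt(variance(xa) / len(xa) + variance(xb) / len(xb))
    return (mean(xb) - mean(xa)) / se

# Made-up data: p = 5 genes, 4 control and 4 treatment samples.
classes = ["ctl"] * 4 + ["trt"] * 4
data = [
    [1.0, 2.0, 1.5, 2.5, 4.0, 5.0, 4.5, 5.5],
    [1.1, 2.1, 1.4, 2.6, 4.2, 5.1, 4.4, 5.6],
    [2.0, 1.0, 2.5, 1.5, 2.0, 1.0, 2.5, 1.5],
    [2.1, 0.9, 2.4, 1.6, 2.2, 1.1, 2.4, 1.4],
    [3.0, 3.1, 2.9, 3.0, 3.1, 2.9, 3.0, 3.2],
]
clusters = gene_clusters(data, classes)
t_values = {c: t_statistic(cluster_average(data, c), classes, "ctl", "trt")
            for c in clusters}
```

Note that gene indices in the code start at 0, whereas the slide numbers genes from 1. Because the clusters are nested, the resulting statistics are strongly dependent, which matters for any multiple-testing adjustment.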

Data - Ro1 Transgenic mice with a modified Gi-coupled receptor (Ro1). Experiment: induced expression of Ro1 in mice. 8 control (ctl) mice and 9 treatment mice, eight weeks after Ro1 was induced. Long-term question: which groups of genes work together? Based on the paper: Conditional expression of a Gi-coupled receptor causes ventricular conduction delay and a lethal cardiomyopathy, see Redfern C. et al. PNAS, April 25, also (Conklin lab, UCSF)

[Figure: histogram for the cluster of genes (1703, 3754).]

Top 15 averages of gene clusters [Table with columns Group ID, T and Correlation; the values are not recoverable.] Cluster memberships shown: (1703, 3754); (6194, 1703, 3754); (4572, 4772, 5809); (2534, 1343, 1954); (6089, 5455, 3236, 4014). Annotation: might be influenced by 3754.

Closing remarks More sophisticated classification methods may become justified when data sets are larger. There seems to be considerable room for approaches which bring cluster analysis into a more traditional statistical framework. The idea of using clustering to obtain derived variables seems promising, but has yet to realise this promise.

Acknowledgments UCB: Yee Hwa Yang, Jane Fridlyand. WEHI: Natalie Thorne. PMCI: David Bowtell, Chuang Fong Kong. Stanford: Sandrine Dudoit. UCSF: Bruce Conklin, Karen Vranizan.