02/21/00 V1.2 Clustering Large Data Sets in Gene expression analysis Daniel Weaver.


Overview
What is “Gene Expression”?
Scientific questions and clustering techniques

“The Central Dogma”
DNA → RNA → Protein (DNA → RNA is Transcription; RNA → Protein is Translation)
The arrows represent the transfer or flow of information.
DNA and RNA store information in a base-4 code (the four nucleotides).
Proteins store information in a base-20 code (the 20 amino acids).

What’s in a name?
DNA → RNA = “Transcription”
  – because the information is exactly copied (or “transcribed”) from one base-4 system (DNA) to an equivalent base-4 system (RNA). Think of a monk transcribing a scroll.
RNA → Protein = “Translation”
  – because the information is converted from a base-4 system (RNA) to a base-20 system (protein). Think of a monk translating a scroll into a new language.
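The two steps named above can be illustrated with a minimal Python sketch. It is not from the slides. It assumes the input is the coding strand of the DNA, so transcription reduces to replacing T with U. The names `CODON_TABLE`, `transcribe`, and `translate` are hypothetical, and the codon table is deliberately partial (six of the 64 codons).

```python
# Partial codon table (assumption: only the codons used in this example).
CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "AAA": "K",  # lysine
    "UAA": "*",  # stop
}

def transcribe(dna):
    """Copy a coding-strand DNA string into RNA: same base-4 code, T becomes U."""
    return dna.upper().replace("T", "U")

def translate(rna):
    """Convert RNA codons (base-4 triplets) into one-letter amino acids (base-20)."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino_acid = CODON_TABLE[rna[i:i + 3]]  # KeyError for codons not in the partial table
        if amino_acid == "*":                   # stop codon ends the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = transcribe("ATGTTTGGCTAA")
protein = translate(rna)
```

Transcription keeps the message length and alphabet size; translation maps three base-4 symbols onto one base-20 symbol.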

What is a “gene”?
“A gene is a segment of DNA that contains all the information necessary to code for some function.”
A gene is also the unit of information that is transferred through Transcription and Translation.

Switching genes on (or off)
[Figure with labels: Promoter, Enhancer. Taken, with permission, from Alberts et al., Molecular Biology of the Cell.]
Purpose: to correctly control the amount of active functional (protein) product present in the cell or organism.

Presence vs. expression
All cells have the same set of genes.
Different cell types express different subsets of their genes.
Constitutive genes are expressed in most cell types.
Cell-type specific genes are expressed in only a few cell types.
[Figure labels: A, B, C]

Gene expression responds to the environment
Changes to the cell’s internal or external environment can lead to changes in gene expression.
Most human diseases manifest through a misregulation of gene expression.
[Figure labels: A, B, C]

Microarrays and related technologies

Example - raw microarray data
[Figure legend: spots are color-coded as more abundant in cell type A, more abundant in cell type B, or equally abundant in both cell types.]

Interpreting raw data
Most gene expression detection data sets are expressed as a ratio of Red:Green (experiment:control) signal.
Frequently use a normalized log(red:green) ratio. For gene X:
  $X_i = \dfrac{\log(\mathrm{ratio}_i)}{\left[\sum_j \log^2(\mathrm{ratio}_j)\right]^{1/2}}$
such that the Euclidean length of X is 1.
Interpreted raw data are tabulated in an Entity-by-Entity table, Genes-by-Experiments.
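The normalization above can be sketched in a few lines of Python. This is an illustration, not the presenter's code. The function name `normalized_log_ratios` is hypothetical, and base-2 logarithms are an assumption (the slide writes only "log").

```python
import math

def normalized_log_ratios(ratios):
    """Scale the log(red:green) ratios of one gene to Euclidean length 1.

    ratios: one red:green intensity ratio per experiment (all > 0).
    Assumption: base-2 logarithm; the slide does not state the base.
    """
    logs = [math.log2(r) for r in ratios]
    length = math.sqrt(sum(v * v for v in logs))
    if length == 0.0:
        # Every ratio is exactly 1, so there is no direction to normalize.
        raise ValueError("log-ratio vector has zero length")
    return [v / length for v in logs]

# One gene measured in four experiments.
profile = normalized_log_ratios([2.0, 0.5, 4.0, 1.0])
```

Because changing the logarithm base multiplies every entry by the same constant, the normalized vector does not depend on the base chosen.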

Gene-by-Experiment table
Gene expression analysis is a variant of classic data mining: looking for informative patterns in the rows and columns of this type of table.

Data volumes
~120,000 genes in the human genome.
Expression detection techniques can take from measurement simultaneously on each gene.
Many, diverse Gene and Experiment attributes.
In 3-5 years, data sets will be available for analysis.
Data volumes ranging from tens of Gb to a few Tb.

Analyzing Gene expression data
What genes are (or are not) expressed?
  – In different cells
  – Under different external conditions
  – In different disease states
How much does their expression change?
Does the change in expression correlate with other observed parameters?
Handled with descriptive statistics.

Clustering and Classifying gene expression
Scientific questions to be answered
Clustering techniques that are being applied
Lots of room and need for novel statistical and computational analyses

Clustering Gene expression data
Functionally classify novel genes
Identify co-regulated gene groups
Identify diagnostic gene expression patterns

Functionally Classifying Genes
Problem:
  – Genome sequencing projects identify many previously unstudied genes.
  – Can one use the genes’ expression patterns to cluster genes that have similar function?

Inputs and outputs
Inputs
  – A set of genes whose functional classification is known.
  – A set of genes whose functional classification is unknown.
  – Gene expression data sets for all the genes.
Desired Output
  – A “best fit” functional classification for each of the novel genes.

Examples
Brown et al. (2000) PNAS 97(1),
Input:
  – Log normalized data from 79 experiments on 2,467 genes
Trained on 2/3 of the genes, tested on the remaining third.
Classifiers tried include Support Vector Machines and four machine learning algorithms (Parzen, FLD, C4.5, MOC1).
SVMs performed the best, using the kernel $K(X,Y) = (X \cdot Y + 1)^d$ ($d$ = 1, 2, or 3).
This kernel transforms the data into a higher-dimensional space where it is easier to identify a separating hyperplane.
Sensitivity = ~0.6
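To make the kernel concrete, here is a small Python sketch of the polynomial kernel with the dot product written out. It is an illustration under stated assumptions, not the code used by Brown et al. The helper names `polynomial_kernel` and `phi_degree2` are hypothetical. For d = 2 and two-dimensional inputs, the same value can be obtained from an explicit six-dimensional feature map, which is one way to see the "higher dimensional space" the slide refers to.

```python
import math

def polynomial_kernel(x, y, d=1):
    """K(X, Y) = (X . Y + 1)^d for two equal-length expression vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d

def phi_degree2(v):
    """Explicit feature map for d = 2 on 2-D input (illustration only)."""
    x1, x2 = v
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2]

x, y = [1.0, 2.0], [3.0, 4.0]
implicit = polynomial_kernel(x, y, d=2)  # (1*3 + 2*4 + 1)^2
explicit = sum(a * b for a, b in zip(phi_degree2(x), phi_degree2(y)))
```

The kernel never builds the six-dimensional vectors; it gets the same inner product from the original two coordinates, which is why it stays cheap for 79-dimensional expression profiles.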

Examples
Hierarchical clustering, average linkage (DeRisi et al.)
  – Cluster the genes.
  – Examine the clusters (through human intervention) to determine whether a cluster has genes with known functions.

Co-regulated genes
Problem:
  – Biological processes typically involve genes of many functional categories.
  – Knowledge of what genes act coordinately can help direct drug development.
[Figure labels: Expression Group 1, Expression Group 2, Expression Group 3]

Inputs and Outputs
Inputs
  – Gene expression data for all genes of interest
  – (Information about the experimental conditions in which the gene expression data sets were collected)
Desired Outputs
  – Ordering of the input genes into sets of genes with related expression patterns

Examples
Eisen et al. (1998) PNAS 95:
Input:
  – Log normalized data from 12 experiments on 2,467 genes
Performed pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric.
Genes that cluster together are displayed in a dendrogram wherein the branch lengths reflect the degree of similarity.
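A rough Python sketch of average-linkage clustering on a correlation metric follows. It rests on two assumptions. It uses the standard Pearson coefficient rather than the modified form in Eisen et al., and it uses the textbook average-linkage rule (mean of all pairwise similarities between two clusters), which may differ in detail from the published procedure. The names `pearson` and `average_linkage` and the toy profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Standard Pearson correlation (assumes neither profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Repeatedly merge the two most similar clusters.

    profiles: dict mapping gene name -> list of expression values.
    Returns the merge history as (label_a, label_b, average_similarity);
    read in order, the history describes the dendrogram from leaves to root.
    """
    clusters = {name: [name] for name in profiles}  # label -> member genes
    merges = []
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                sims = [pearson(profiles[p], profiles[q])
                        for p in clusters[a] for q in clusters[b]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        clusters["(" + a + "," + b + ")"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, avg))
    return merges

# Toy data: two rising and two falling profiles over four experiments.
toy = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
merges = average_linkage(toy)
```

On the toy data the rising genes merge first, then the falling genes, and the final merge joins the two anti-correlated groups at a negative similarity. This all-pairs search is only practical for small inputs; thousands of genes call for an optimized library routine.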

Examples
Tavazoie et al. (1999) Nature Genetics 22:
Inputs:
  – “Variance-normalized” data from 15 experiments on 6,220 genes. Variance normalization is $X_{ij} = (X_{ij} - \bar{X}_i) / \mathrm{stdev}(X_i)$ for gene $i$ in experiment $j$, where $\bar{X}_i$ is the mean of gene $i$ across experiments.
Used Euclidean distance as the metric and performed k-means clustering, programmed to find 10, 30, and 60 centroids.
Gene clusters were shown to contain functionally related genes, as expected.
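The two steps on this slide, variance normalization followed by k-means with Euclidean distance, can be sketched as below. This is an illustration, not the Tavazoie et al. implementation. The function names are hypothetical, the sample standard deviation (n - 1 denominator) is an assumption, and seeding the centroids with the first k rows is a simplification chosen to keep the example deterministic.

```python
import math

def variance_normalize(row):
    """(X_ij - mean_i) / stdev_i for one gene (assumption: sample stdev, n - 1)."""
    n = len(row)
    mean = sum(row) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (n - 1))
    return [(v - mean) / sd for v in row]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100):
    """Plain k-means; centroids are seeded with the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # converged
            break
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Toy data: rows are genes, columns are experiments.
raw = [
    [1.0, 2.0, 3.0, 4.0],      # rising
    [8.0, 6.0, 4.0, 2.0],      # falling
    [10.0, 20.0, 30.0, 41.0],  # rising, much larger scale
    [4.0, 3.0, 2.0, 1.2],      # falling
]
normalized = [variance_normalize(row) for row in raw]
assignment, centroids = kmeans(normalized, k=2)
```

Normalizing first means the third gene, despite its much larger magnitude, groups with the first because both have the same shape across experiments.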

Diagnostic expression patterns
Problem:
  – Many diseases cannot be reliably distinguished through traditional techniques (microscopy, pathology, etc.).
  – Given gene expression data from diseased tissue, is there a set of genes that correctly distinguishes the diseases (as judged by other criteria)?

Inputs and Outputs
Inputs
  – Gene expression data for all genes (available)
  – Information about the patients afflicted with the complex disease of interest.
Desired output
  – The minimal set of genes that accurately partitions the disease, i.e. the minimal diagnostic gene expression pattern.

Examples
Alizadeh et al. (2000) Nature 403:
Input:
  – Log normalized data from 96 experiments on 4,026 genes (out of 17,856 measured). The 96 experiments were performed on cancer biopsies from patients with Diffuse Large B-cell Lymphoma (DLBCL).
Pair-wise average linkage cluster analysis, using a modified Pearson correlation coefficient metric (Eisen et al., 1998).
Two previously unknown DLBCL sub-types distinguished by small gene clusters (~40 genes and ~70 genes).
Subtypes correspond to prognosis:
  – “GC B-like” → 76% survivorship
  – “Activated B-like” → 16% survivorship
(Overhead)

Summary
Current techniques include supervised and unsupervised classification.
Three main scientific questions:
  – Functionally classifying genes
  – Identifying co-regulated sets of genes
  – Identifying diagnostic expression “fingerprints”
Data sets are relatively small now, but growing rapidly.
Classification draws from the expression data and from other domain knowledge.
Lots of room and need for novel statistical and computational analyses.

Further Reading
Clustering Gene Expression Data
1. Alizadeh, et al. (2000) Nature 403:
2. Alon, et al. (1999) PNAS 96:
3. Butte and Kohane. (2000) Proceedings of the Pacific Symposium on Biocomputing.
4. Brown, et al. (2000) PNAS 97:
5. Eisen, et al. (1998) PNAS 95:
6. Iyer, et al. (1999) Science 283:
7. Raychaudhuri, et al. (2000) Proceedings of the Pacific Symposium on Biocomputing.
8. Roberts, et al. (2000) Science 287:
9. Ross, et al. (2000) Nature Genetics 24:
10. Scherf, et al. (2000) Nature Genetics 24:
11. Spellman, et al. (1998) Mol Biol Cell 9:
12. Tamayo, et al. (1999) PNAS 96:
13. Tavazoie, et al. (1999) Nature Genetics 22:
14. Zhu and Zhang. (2000) Proceedings of the Pacific Symposium on Biocomputing.

Further Reading
Other related gene expression papers:
Holstege, et al. (1998) Cell 95:
DeRisi, et al. (1996) Nature Genetics 14:
Schena, et al. (1995) Science 270:
DeRisi, et al. (1997) Science 278:
Hilsenbeck, et al. (1999) J. Natl. Cancer Inst. 91:

Expression Data sets
European Bioinformatics Institute (EBI) (links to refs. 4, 5, 6, 11)
  – Main microarray page
  – Microarray public data set page (this is a great portal site from which you can browse to many of the published data sets)
National Human Genome Research Institute (NHGRI)
  – Main page
  – Data set download page: ftp://kronos.nhgri.nih.gov/pub/outgoing/olga/old/
National Cancer Institute (NCI) (refs. 9 and 10)
  – Main page
  – Data set download page
Lymphoma data set (ref. 1)
  – Main page
  – Data set download page
