‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.


‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E May 15, 2003

Presentation Outline: Biology Background; Reminder of Principal Component Analysis; What is Gene Shaving?; The ‘Gene Shaving’ Algorithm; Applications of Gene Shaving; Conclusions

What is “gene expression”? Each cell contains a complete copy of all genes. The difference between a skin cell and a bone cell is determined by which genes are producing proteins, i.e., which genes are being “expressed”. The expression of DNA information occurs in two steps: transcription (DNA → mRNA) and translation (mRNA → protein). DNA microarrays measure transcription (i.e., the mRNA produced).

[Figure: microarray experiment workflow. Panel labels: reference cells sample, test cells sample, transcription, label with dye, hybridize to array.]

The Dataset: an $N \times p$ expression matrix $X = [x_{ij}]$ with $N$ rows (genes) and $p$ columns (patients). In the display, green marks under-expressed genes and red marks over-expressed genes.

The ratio of the red and green intensities for each spot indicates the relative abundance of the corresponding DNA probe in the two nucleic acid target samples: $x_{ij} = \log_2(R/G)$. With red marking over-expression (as in the dataset display), $x_{ij} > 0$ means the gene is over-expressed in the test sample relative to the reference sample, $x_{ij} = 0$ means the gene is expressed equally, and $x_{ij} < 0$ means the gene is under-expressed in the test sample relative to the reference sample.
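The log-ratio can be sketched in a few lines of Python. This is a minimal illustration, assuming the red channel measures the test sample and the green channel the reference sample, so that a positive value means higher expression in the test sample. The names `log_ratio` and `describe` are hypothetical and do not come from the presentation.

```python
import math

def log_ratio(red, green):
    """Return log2(R/G) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

def describe(x):
    """Label a log-ratio, assuming red = test sample and green = reference."""
    if x > 0:
        return "over-expressed in test sample"
    if x < 0:
        return "under-expressed in test sample"
    return "equally expressed"

# A spot four times brighter in red than in green gives a log-ratio of 2.
print(log_ratio(400, 100), describe(log_ratio(400, 100)))
```

Working on the log scale makes a fourfold increase and a fourfold decrease symmetric about zero (+2 and -2).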

Remarks: Knowing the list of human genes does not mean we know what they do. cDNA arrays help study the variation of gene expression across samples (e.g., tissues or patients). A major challenge is interpreting data that consist of the expression levels of, say, 6000 genes across 50 patients. Present goal: create a clustering that organizes genes with coherent behavior across samples.

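1st eigengene (principal component of $X^T$). Singular value decomposition of $X^T$: $X^T = U \Sigma V^T$, with $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ and $u_1$, $v_1$ the first columns of $U$ and $V$. Since $X^T V = U \Sigma$, we have $\sigma_1 u_1 = X^T v_1$: the linear combination of the columns of $X^T$ (the genes $g_1, g_2, \ldots, g_N$) with highest variance.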
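The relation $\sigma_1 u_1 = X^T v_1$ can be checked numerically. The sketch below is an illustration only: it swaps the full SVD for plain power iteration on $X X^T$, which recovers just the leading triple $(\sigma_1, u_1, v_1)$. It assumes that each gene row of `X` has already been centered, that `X` is not all zeros, and that the starting vector is not orthogonal to $v_1$. The function name `first_eigengene` is hypothetical.

```python
import math

def first_eigengene(X, iters=500):
    """X: list of N gene rows, each holding p sample values (rows centered).

    Returns (sigma1, u1, v1) such that sigma1 * u1 = X^T v1, using power
    iteration instead of a full singular value decomposition.
    """
    N, p = len(X), len(X[0])
    # Non-uniform starting vector, normalized to unit length.
    norm = math.sqrt(sum((i + 1) ** 2 for i in range(N)))
    v = [(i + 1) / norm for i in range(N)]
    for _ in range(iters):
        # w = X^T v: a weighted combination of the gene rows (length p).
        w = [sum(X[i][j] * v[i] for i in range(N)) for j in range(p)]
        # z = X w = (X X^T) v (length N), then renormalize.
        z = [sum(X[i][j] * w[j] for j in range(p)) for i in range(N)]
        norm = math.sqrt(sum(a * a for a in z))
        v = [a / norm for a in z]
    w = [sum(X[i][j] * v[i] for i in range(N)) for j in range(p)]
    sigma1 = math.sqrt(sum(a * a for a in w))
    u1 = [a / sigma1 for a in w]  # the first eigengene, one value per sample
    return sigma1, u1, v

# Three genes measured on two samples; the third gene is flat.
X = [[2.0, -2.0], [1.0, -1.0], [0.0, 0.0]]
sigma1, u1, v1 = first_eigengene(X)
print(sigma1, u1, v1)
```

The entries of `v1` are the gene weights: genes with weights near zero contribute little to the eigengene, and the flat gene in the example receives weight zero.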

Introduction: What is Gene Shaving? It is a new statistical method that identifies subsets of genes with coherent expression patterns and large variation across different conditions. It differs from hierarchical clustering and other widely used methods for analyzing gene expression in that genes may belong to more than one cluster.

The Gene Shaving Algorithm

Estimating the Optimal Cluster Size K: Gene shaving requires a quality measure for a cluster. To select a good cluster, the method focuses on high coherence between members of the cluster.

Estimating the Optimal Cluster Size K (cont.): The method defines the following measures of variance for a cluster $S_k$. The ‘Between Variance’ is the variance of the mean gene. The ‘Within Variance’ measures the variability of each gene about the average.

Estimating the Optimal Cluster Size K (cont.): A useful measure for choosing cluster size is the percent variance, $R^2$: the between variance expressed as a percentage of the total (between plus within) variance. A large $R^2$ implies a tight cluster of coherent genes. Gene shaving uses this measure for selecting a cluster from the shaving sequence $S_k$.
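The slide formulas for these variances did not survive in the transcript, so the following Python sketch rests on an assumption: the usual decomposition in which the total variance of a cluster is the sum of the between and within variances, and $R^2$ is the between variance as a percentage of that total. The function name `cluster_variances` is hypothetical.

```python
def cluster_variances(S):
    """S: list of k gene rows (each with p sample values) forming one cluster.

    Returns (between, within, r2), where r2 is the percent variance
    100 * between / (between + within). Assumed definitions, see lead-in.
    """
    k, p = len(S), len(S[0])
    # The "mean gene": the average of the cluster's genes in each sample.
    mean_gene = [sum(row[j] for row in S) / k for j in range(p)]
    grand_mean = sum(mean_gene) / p
    # Between variance: variance of the mean gene across samples.
    between = sum((m - grand_mean) ** 2 for m in mean_gene) / p
    # Within variance: spread of each gene about the mean gene.
    within = sum(
        (row[j] - mean_gene[j]) ** 2 for row in S for j in range(p)
    ) / (k * p)
    total = between + within
    r2 = 100.0 * between / total if total > 0 else 0.0
    return between, within, r2

# Two identical genes form a perfectly coherent cluster.
print(cluster_variances([[1.0, -1.0], [1.0, -1.0]]))
# Two opposite genes cancel in the mean gene, leaving only within variance.
print(cluster_variances([[1.0, -1.0], [-1.0, 1.0]]))
```

The second example shows why genes with opposite patterns need their signs aligned before being averaged into one cluster.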

Estimating the Optimal Cluster Size K (cont.): Once a cluster is selected from the sequence, we can proceed to finding the optimal cluster size. Let $D_k$ be the $R^2$ measure for the $k$-th sequence member. We wish to find the “Gap” between this value $D_k$ and $D_k^{*b}$, which is the $R^2$ measure for cluster $S_k^{*b}$. This $S_k^{*b}$ is the clustering sequence from a permuted matrix $X^{*b}$.

Estimating the Optimal Cluster Size K (cont.): The “Gap” function is defined as $\mathrm{Gap}(k) = D_k - \bar{D}_k^{*}$, where $\bar{D}_k^{*}$ is the average of $D_k^{*b}$ over $b$. The optimal cluster size $K$ is selected such that this “Gap” is the largest: $K = \arg\max_k \mathrm{Gap}(k)$.
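The Gap computation can be sketched as follows, taking the gap to be the observed $R^2$ minus the average $R^2$ over the permuted matrices. The sketch assumes the $R^2$ values have already been computed for the observed shaving sequence and for each permuted run. It also assumes that each permuted matrix is built by shuffling the entries within each gene row, which the slides do not spell out. All function names are hypothetical.

```python
import random

def permute_within_rows(X, rng):
    """Return a copy of X with the entries of each gene row shuffled.

    Assumption: this is how a permuted matrix X*b is produced.
    """
    return [rng.sample(row, len(row)) for row in X]

def gap_curve(D, D_perm):
    """D[k]: R^2 of the k-th cluster in the observed shaving sequence.
    D_perm[b][k]: R^2 of the k-th cluster from the b-th permuted matrix.
    Returns Gap(k) = D[k] - mean over b of D_perm[b][k].
    """
    B = len(D_perm)
    D_star = [sum(run[k] for run in D_perm) / B for k in range(len(D))]
    return [d - ds for d, ds in zip(D, D_star)]

def best_index(gap):
    """Index k of the sequence member with the largest gap."""
    return max(range(len(gap)), key=gap.__getitem__)

# Made-up R^2 values for a three-member sequence and two permutations.
D = [40.0, 70.0, 90.0]
D_perm = [[30.0, 40.0, 85.0], [30.0, 50.0, 85.0]]
gap = gap_curve(D, D_perm)
print(gap, best_index(gap))
```

Comparing against permuted data guards against picking a very small cluster merely because small clusters tend to have a high $R^2$ by chance.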

The Gene Shaving Algorithm (cont.)

So far: form clusters $S_k$ with high variance across samples, high correlation among genes within a cluster, and low correlation between genes in different clusters. The procedure seeks clusters $S_k$ by maximizing $v(S_k) = \mathrm{var}(\text{vector of column averages})$. Now incorporate supervision: use information, $y$, about the patients, and seek $S_k$ by maximizing $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$.

The goal is predicting patient survival: find genes whose expression correlates with patient survival, and produce groupings of patients that are statistically different in survival. Use additional information about the patients, $y = (y_1, \ldots, y_p)$, and combine unsupervised and supervised criteria into the objective function $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$, with $0 \le \alpha \le 1$.

Maximize $(1-\alpha)\, v(S_k) + \alpha\, J(v(S_k), y)$. The information measure $J(v(S_k), y)$ is a quadratic function that depends on the type of patient information, $y$. For example, $y = (y_1, \ldots, y_p)$ may identify categories of patients. Used here: $y$ = ($p$ patient survival times), and $J(v(S_k), y) = g g^T$, where $g$ is the score vector of the Cox model for predicting survival.

They chose $\alpha = 0.1$ as it “seemed to give a good mix of high gene correlation and low p-value for the Cox model”.

This produced a cluster of 234 genes. It includes “strong” genes for predicting survival (130 of the 200 strongest) as well as some “weak” genes (e.g., #1332).

(a) Gap curve for supervised shaving. (b) Survival curves in the two groups defined by the low or high expression of the 234 genes. Group 1 has high expression of positive genes and low expression of negative genes; Group 2 has low expression of positive genes and high expression of negative genes. Negative genes are those preceded by a minus sign in Table 2.

Conclusions: The proposed gene shaving methods search for clusters of genes showing both high variation across the samples and correlation across the genes. This method is a potentially useful tool for exploring gene expression data and identifying interesting clusters of genes worth further investigation.