CSCE555 Bioinformatics
Lecture 16: Identifying Differentially Expressed Genes from Microarray Data
Meeting: MW 4:00PM-5:15PM, SWGN2A21
Instructor: Dr. Jianjun Hu
Course page:
University of South Carolina, Department of Computer Science and Engineering
Outline
The problem: identifying differentially expressed genes
Statistical methods: t-test
Non-parametric: rank product
Summary
10/9/2015
The Biological Problem: Identify Differentially Expressed Genes
[Figure: no treatment vs. treatment]
Which pathways will be affected? Which genes are involved?
Identify differentially expressed genes
One of the core goals of microarray data analysis is to identify which genes show good evidence of being differentially expressed (DE). This goal has two parts.
1. The first is to select a statistic that ranks the genes in order of evidence for differential expression, from strongest to weakest.
2. The second is to choose a critical value for the ranking statistic, above which any value is considered significant.
k-fold change
1. Differential expression is measured by the ratio of expression levels between two samples.
2. Genes with ratios above a fixed cut-off k, that is, those whose expression underwent a k-fold change, were said to be differentially expressed.
3. This is not a statistical test: there is no associated value that indicates the level of confidence in designating a gene as differentially expressed or not.
k-fold change
4. Replication is essential in experimental design because it allows variability to be estimated.
5. The ability to assess this variability allows identification of biologically reproducible changes in gene expression levels.
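The k-fold rule above can be sketched in a few lines. This is a minimal illustration, not a recommended analysis: the gene names and intensities are made up, and treating ratios at or below 1/k as down-regulated is an added assumption (the slide only mentions ratios above k).

```python
from statistics import mean

def fold_change_calls(treated, control, k=2.0):
    """Return {gene: ratio} for genes whose mean expression ratio
    (treated / control) is >= k or <= 1/k.

    The 1/k branch for down-regulation is an assumption added here.
    """
    calls = {}
    for gene in treated:
        ratio = mean(treated[gene]) / mean(control[gene])
        if ratio >= k or ratio <= 1.0 / k:
            calls[gene] = ratio
    return calls

# Hypothetical raw intensities, three replicates per condition.
treated = {
    "geneA": [820.0, 790.0, 805.0],   # consistent increase
    "geneB": [210.0, 190.0, 205.0],   # essentially unchanged
    "geneC": [40.0, 400.0, 60.0],     # one outlying replicate
}
control = {
    "geneA": [400.0, 390.0, 410.0],
    "geneB": [200.0, 195.0, 205.0],
    "geneC": [50.0, 55.0, 45.0],
}

calls = fold_change_calls(treated, control, k=2.0)
for gene, ratio in sorted(calls.items()):
    print(f"{gene}: {ratio:.2f}-fold")
```

In this toy data geneC passes the cut-off only because of a single outlying replicate, which is the weakness points 3 to 5 describe: the ratio alone carries no information about variability.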
Standard statistical tests
1. More typically, researchers now rely on variants of common statistical tests.
2. These generally involve two parts: calculating a test statistic and determining the significance of the observed statistic.
3. A standard statistical test for detecting significant change between repeated measurements of a variable in two groups is the t-test.
4. This can be generalized to multiple groups via the ANOVA F statistic.
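As a sketch of the first part (calculating a test statistic), the following computes a two-sample t statistic per gene and ranks genes by its absolute value. The unequal-variance (Welch) form is used here as one common variant; the slides do not say which variant is intended, and the log2 expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample t statistic with unpooled (Welch) standard error."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Hypothetical log2 expression values: (treated replicates, control replicates).
genes = {
    "geneA": ([9.6, 9.7, 9.8], [8.6, 8.7, 8.8]),   # consistent change
    "geneB": ([7.6, 7.4, 7.7], [7.5, 7.6, 7.5]),   # no clear change
    "geneC": ([5.3, 8.6, 5.9], [5.6, 5.8, 5.5]),   # large but noisy change
}

t_stats = {g: welch_t(x, y) for g, (x, y) in genes.items()}
ranking = sorted(t_stats, key=lambda g: abs(t_stats[g]), reverse=True)
for g in ranking:
    print(f"{g}: t = {t_stats[g]:.2f}")
```

Unlike the fold-change rule, the noisy geneC is ranked well below geneA because its difference in means is divided by a large standard error.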
Standard statistical tests
1. For most practical cases, computing a standard t or F statistic is appropriate, although referring to the t or F distributions to determine significance is often not.
2. The main hazard in using such methods occurs when there are too few replicates to obtain an accurate estimate of experimental variances. In such cases, modeling methods that use pooled variance estimates may be helpful.
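The idea of borrowing variance information across genes can be illustrated with a deliberately crude sketch. It blends each gene's own variance with the average variance over all genes; the blending rule, the `weight` parameter, and the data are assumptions made for illustration and do not correspond to any specific published method.

```python
from math import sqrt
from statistics import mean, variance

def own_pooled_variance(x, y):
    """Equal-variance (Student) pooled variance for one gene."""
    nx, ny = len(x), len(y)
    return ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)

def shrunken_t(x, y, prior_var, weight=0.5):
    """t-like statistic whose variance is a weighted blend of the gene's
    own pooled variance and a cross-gene variance (hypothetical rule)."""
    blended = weight * prior_var + (1.0 - weight) * own_pooled_variance(x, y)
    return (mean(x) - mean(y)) / sqrt(blended * (1.0 / len(x) + 1.0 / len(y)))

# Hypothetical log2 expression values: (treated, control).
genes = {
    "geneA": ([9.6, 9.7, 9.8], [8.6, 8.7, 8.8]),        # 1.0 change, modest variance
    "geneB": ([7.01, 7.02, 7.03], [6.91, 6.92, 6.93]),  # 0.1 change, tiny variance
    "geneC": ([5.3, 8.6, 5.9], [5.6, 5.8, 5.5]),        # noisy
}

# Cross-gene variance estimate: the mean of the per-gene pooled variances.
prior_var = mean(own_pooled_variance(x, y) for x, y in genes.values())

for g, (x, y) in genes.items():
    plain = shrunken_t(x, y, prior_var, weight=0.0)   # ordinary pooled t
    blended = shrunken_t(x, y, prior_var)             # variance shrunk toward prior_var
    print(f"{g}: ordinary t = {plain:.2f}, shrunken t = {blended:.2f}")
```

With only three replicates, geneB's variance happens to be very small, so its ordinary t statistic is about as large as geneA's despite a tenfold smaller change. Blending in the cross-gene variance tempers that effect.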
Standard statistical tests
1. Regardless of the test statistic used, one must determine its significance.
2. Standard interpretations of t-like tests assume that the data are sampled from normal populations with equal variances.
3. Expression data may fail to satisfy either or both of these constraints.
Standard statistical tests
The use of non-parametric rank-based statistics is also common, via both traditional statistical methods and ad hoc ones designed specifically for microarray data.
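As one example of a traditional rank-based statistic, the Mann-Whitney U statistic (equivalent to the Wilcoxon rank-sum statistic) can be computed directly from pairwise comparisons. The slide does not name a specific test, so this choice is an assumption, and the data are hypothetical.

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U for sample x versus sample y.

    Counts the (x_i, y_j) pairs with x_i > y_j; ties count as one half.
    U ranges from 0 to len(x) * len(y).
    """
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical log2 expression values for one gene.
treated = [9.6, 9.7, 12.5]   # the outlier 12.5 does not change U
control = [8.6, 8.7, 8.8]

u = mann_whitney_u(treated, control)
print(f"U = {u} out of a maximum of {len(treated) * len(control)}")
```

Because only the ordering of the values matters, replacing 12.5 with 9.8 gives the same U, which is the sense in which rank-based statistics are robust to outliers.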
RankProd: a non-parametric method to detect differentially regulated genes in replicated experiments
What does it do? What method is implemented in the package?
RankProd uses the so-called rank product non-parametric method (Breitling et al., 2004) to identify genes that are up-regulated or down-regulated under one condition relative to another. The rank product is a non-parametric statistic that detects items that are consistently highly ranked in a number of lists, for example genes that are consistently found among the most strongly upregulated genes in a number of replicate experiments.
How does it compare to other methods for a similar purpose?
(1) It originates from an analysis of biological reasoning and is easy to understand.
(2) It is fast, simple, and robust to outliers (suitable for noisy data).
(3) It provides statistical significance for each gene and allows control of the overall significance (e.g., the false discovery rate).
(4) It provides a straightforward way to do cross-platform meta-analysis (integrating data generated at different laboratories or under different environments into one study, and achieving increased power).
Rank Product
Calculate RP
Calculate significance
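The RP calculation step can be sketched as follows, assuming the geometric-mean form of the rank product: within each replicate, genes are ranked by fold change (rank 1 = most strongly up-regulated), and a gene's RP is the geometric mean of its ranks across replicates. The function name and data are hypothetical, ties are not handled, and the significance step is not shown.

```python
from math import prod

def rank_products(fold_changes):
    """fold_changes: {gene: [log fold change in replicate 1, ..., replicate k]}.

    Returns {gene: RP}, where RP is the geometric mean of the gene's ranks
    and rank 1 is the largest fold change in a replicate.
    """
    genes = list(fold_changes)
    k = len(fold_changes[genes[0]])
    ranks = {g: [] for g in genes}
    for i in range(k):
        # Rank genes within replicate i, largest fold change first.
        ordered = sorted(genes, key=lambda g: fold_changes[g][i], reverse=True)
        for r, g in enumerate(ordered, start=1):
            ranks[g].append(r)
    return {g: prod(ranks[g]) ** (1.0 / k) for g in genes}

# Hypothetical log2 fold changes for four genes in three replicates.
fc = {
    "g1": [2.1, 1.8, 2.5],    # ranks 1, 2, 1
    "g2": [1.9, 2.0, 0.1],    # ranks 2, 1, 3
    "g3": [0.2, 0.3, 1.0],    # ranks 3, 3, 2
    "g4": [-0.5, 0.1, -0.2],  # ranks 4, 4, 4
}

rp = rank_products(fc)
for g in sorted(rp, key=rp.get):
    print(f"{g}: RP = {rp[g]:.2f}")
```

A small RP means the gene is near the top of the list in every replicate. To look for down-regulated genes, the same sketch can be applied with the sort order reversed.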
Permutation tests for calculating significance levels
Permutation tests, generally carried out by repeatedly scrambling the samples' class labels and computing t statistics for all genes in the scrambled data, best capture the unknown structure of the data.
Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, (2001).
Golub, T.R. et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286, (1999).
Dudoit, S., Yang, Y.-H., Callow, M.J. & Speed, T.P. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578 (Department of Statistics, University of California at Berkeley, Berkeley, CA, 2000).
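The permutation procedure described on this slide can be sketched for a single gene as follows. This is a simplified illustration, not the procedure of any of the cited papers: it handles one gene at a time rather than all genes jointly, samples random relabelings instead of enumerating them, and uses the (hits + 1) / (n_perm + 1) convention for the p-value. The function names, parameters, and data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, variance

def t_stat(x, y):
    """Two-sample t statistic with unpooled standard error."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for |t|, obtained by shuffling class labels."""
    rng = random.Random(seed)
    observed = abs(t_stat(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # scramble the class labels
        px, py = pooled[:len(x)], pooled[len(x):]
        if abs(t_stat(px, py)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical log2 expression values, four replicates per class.
changed_p = permutation_p_value([9.6, 9.7, 9.8, 9.9], [8.6, 8.7, 8.8, 8.9])
unchanged_p = permutation_p_value([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(f"clearly separated groups: p = {changed_p:.3f}")
print(f"overlapping groups:       p = {unchanged_p:.3f}")
```

With four replicates per class there are only 70 distinct relabelings, so the p-value for a single gene cannot fall much below 2/70. This is one reason to compute the statistics for all genes in each scrambled data set, as the slide describes.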
Summary
The problem: identify differentially expressed genes from microarray data
How to identify: t-test and rank product
How to evaluate the significance of identified genes