6/10/2015 Microarray Data Analysis
Copyright notice
Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!

Gene Expression Matrix
After image processing, we obtain a data matrix. The final gene expression matrix (genes in rows, samples in columns, entries are gene expression levels) is needed for higher-level analysis and mining.
[Figure labels: Images, Spots, Spot/Image quantitations; Samples, Genes, Gene expression levels]

Missing data in microarray
– Randomly missing values: the fact that the value is missing is independent of its value
  – methods are available for dealing with randomly missing data
– Non-randomly missing values: the fact that the value is missing is dependent on its value (i.e. the value is missing because it is low expression, or because it is high expression)
  – available methods do not adequately deal with non-randomly missing data

Missing data in microarray
– Randomly missing data:
  – spotting problems
  – dust
  – finger prints
  – poor hybridization
  – inadequate resolution
  – fabrication errors (e.g. scratches)
  – image corruption
  – omission of suspect values (could also be non-random)
– Non-randomly missing data:
  – low expression, e.g. background exceeds signal
  – censored data, e.g. expression above the maximum observable intensity
[Figure: expression across arrays, with the maximum observable intensity marked]

Dealing with missing data
– The problem: many analyses require complete data matrices
  – classification algorithms
  – clustering algorithms
  – dimension-reduction methods
– Solutions:
  – remove all genes (rows) and arrays (columns) with missing values
  – estimate missing values

Imputation methods
– Naive approaches
  – missing values = row (gene) average
  – missing values = column (array) average
– Smarter approaches have been proposed:
  – K-nearest neighbors
  – regression-based methods
  – singular value decomposition (like principal components, for matrices with unequal numbers of rows and columns)

K-Nearest Neighbors (KNN)
– choose the k genes that are most similar to the gene with the missing value (MV)
– estimate the MV as the weighted mean of the neighbors
– considerations:
  – number of neighbors
  – distance metric
  – normalization step
[Figure: expression across arrays for several genes; "?" marks a randomly missing datum]
KNN - considerations
– parameter k
  – 10 usually works (5-15)
– distance metric
  – euclidean distance
  – correlation-based distance
[Figure: expression across arrays; "?" marks the missing value]
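The KNN idea can be sketched in a few lines of plain Python. This is only an illustrative sketch, not a reference implementation: the function name `knn_impute`, the use of `None` for missing values, the Euclidean distance averaged over shared columns, and the inverse-distance weights are all assumptions made for the example.

```python
import math

def knn_impute(matrix, k=10):
    """Fill None entries of a genes x arrays matrix from the k most similar genes.

    Assumptions of this sketch: Euclidean distance over the columns both genes
    have observed, and inverse-distance weights for the weighted mean.
    """
    imputed = [row[:] for row in matrix]
    for g, row in enumerate(matrix):
        observed = [c for c, v in enumerate(row) if v is not None]
        missing = [c for c, v in enumerate(row) if v is None]
        for j in missing:
            candidates = []
            for h, other in enumerate(matrix):
                # A neighbour must be a different gene with a value in column j.
                if h == g or other[j] is None:
                    continue
                shared = [c for c in observed if other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbour: leave the value missing
            weights = [1.0 / (dist + 1e-9) for dist, _ in nearest]
            imputed[g][j] = (
                sum(w * value for w, (_, value) in zip(weights, nearest))
                / sum(weights)
            )
    return imputed
```

A correlation-based distance or a normalization step, as listed on the slide, could be swapped in where `dist` is computed.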
Ordinary Least Squares (OLS)
– regression-based approach that also uses k neighbors
– algorithm:
  – choose k neighbors (euclidean or correlation; normalize or not)
  – the gene with the MV is regressed over the neighbor genes (one at a time, i.e. simple regression)
  – for each neighbor, the MV is predicted from the regression model
  – the MV is imputed as the weighted average of the k predictions
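A minimal sketch of the regression step, assuming the k neighbor genes have already been chosen and have no missing values themselves. The names `simple_regression` and `ols_impute` are hypothetical, and the default of equal weights is an assumption, since the slide does not say how the k predictions are weighted.

```python
def simple_regression(x, y):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y - slope * mean_x, slope

def ols_impute(target, neighbours, j, weights=None):
    """Impute target[j] from k complete neighbour genes, one regression each."""
    # Columns where the target gene is observed (column j itself is missing).
    cols = [c for c, v in enumerate(target) if v is not None and c != j]
    predictions = []
    for neighbour in neighbours:
        a, b = simple_regression(
            [neighbour[c] for c in cols], [target[c] for c in cols]
        )
        predictions.append(a + b * neighbour[j])
    if weights is None:
        weights = [1.0] * len(predictions)  # assumption: equal weights
    return sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
```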
Singular Value Decomposition (SVD)
– goal: use the strongest patterns of correlation within the data matrix to estimate the MVs
– algorithm:
  – set MVs to the row average (need a starting point)
  – decompose the expression matrix into orthogonal components, "eigengenes"
  – use the proportion, p, of eigengenes corresponding to the largest eigenvalues to reconstruct the MVs from the original matrix (i.e. improve your estimate)
  – use an EM approach to iteratively improve the estimates of the MVs until convergence

Other Imputation Methods
– Local Singular Value Decomposition (LSVD)
  – combines KNN and SVD
  – algorithm: start with an n genes x m arrays matrix; select k neighbor genes (euclidean or correlation; normalize or not); perform SVD on the k x m matrix
– Partial Least Squares (PLS) regression
  – uses all genes and the available data from the target gene
– Factor Analysis (FA) regression

Which imputation method to use?
– KNN is the most widely used; current standard
– many alternative choices: OLS, SVD, LSVD, PLS, (FA)
– algorithms require user-supplied parameters: k, p, distance metric, etc.
– No set of rules for choosing which method to use

Characteristics of data that may affect choice of imputation method
– dimensionality
– percentage of values missing
– experimental design (time series, case/control, etc.)
– entropy: patterns of correlation in the data
– others?

Data Analysis
– Determine differential gene expression
  – Identify up- and down-regulated genes
  – Gene lists produced using the factor-2 rule or t-test based methods
– Co-regulation of genes
  – Clustering algorithms
– Identify genes that regulate other genes
  – Networks (e.g. Bayesian)

Methods to Decide Differential Expression
– Compare treatment to the control
  – The fold approach
  – The t-test
  – Variations of the t-test
  – SAM: significance analysis of microarrays
– Compare several treatments
  – ANOVA: analysis of variance
  – MAANOVA: a/index.html

Fold Change
– Measure ratios of gene expression levels.
– Ratio = T_i / C_i: the ratio of measured treatment intensity to control intensity for the i-th spot
– The log2 ratio treats up- and down-regulated genes equally
  – e.g. when looking for genes with more than 2-fold variation in expression

The Fold Approach
– In northern analysis, a 2-fold change can be seen with the naked eye
– Thus biologists tend to use 2-fold as the threshold of differential expression
– On the log2 scale, with replicate log ratios x_1 and x_2: up-regulated if mean(x_1, x_2) > 1, down-regulated if mean(x_1, x_2) < -1

[Figure: illustration of the benefit of using log ratios]

Two-fold up-regulation
Problems with this approach:
– Only identifies the most changed genes.
– Also identifies noise and highly variable genes.
– The ratio is unstable when the denominator is small.
Ratios are unstable
– Initial measurements: 30/60 = 0.5 and 500/1000 = 0.5
– Add random noise (+15 to the numerator and -15 to the denominator): 45/45 = 1.0 and 515/985 = 0.52
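The fold-change rule and the instability of ratios at low intensity can be checked with a short sketch. The function names and the default threshold of 1 on the log2 scale (a 2-fold change) are choices made for this illustration.

```python
import math

def log2_ratio(treatment, control):
    """Log2 of the treatment/control intensity ratio for one spot."""
    return math.log2(treatment / control)

def fold_call(log_ratios, threshold=1.0):
    """Call a gene from replicate log2 ratios; threshold 1.0 means 2-fold."""
    mean_lr = sum(log_ratios) / len(log_ratios)
    if mean_lr > threshold:
        return "up"
    if mean_lr < -threshold:
        return "down"
    return "unchanged"

# Up- and down-regulation are symmetric on the log2 scale.
print(log2_ratio(2, 1), log2_ratio(1, 2))

# The same +15/-15 noise moves a low-intensity ratio far more than a
# high-intensity one.
for t, c in [(30, 60), (500, 1000)]:
    print(t / c, round((t + 15) / (c - 15), 2))
```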
Types of tests
– The standard t-test assumes the samples are drawn from normal distributions with equal variance, and tests whether their means differ.
– Welch's t-test allows for different variances between classes.
– Mann-Whitney (Wilcoxon) converts the data to ranks, and does not assume a particular distribution.
– The permutation test computes the t-statistic for many random permutations of the labels.

The Student's t-test
– For sample sizes less than 30 we have to make use of a t-distribution
– We make use of this distribution in the two-sample Student's t-test.
– This test is used to test whether two samples come from distributions with the same means.
– The samples are assumed to come from Gaussian (normal) distributions.
– The two samples must have similar dispersions

The Student's t distribution
– is mound shaped
– is symmetrical about zero
– is more widely dispersed than the standard normal distribution
– its actual shape depends on the sample size
– different t distributions are identified by their degrees of freedom (df), where df = n-1

The Student's t distribution (cont.)
[Figure: example t distributions for df = 15, df = 30 and df = 120 (= z); not to scale]

Mean and Median
– The mean is the most common measure of the location of a set of points.
– However, the mean is very sensitive to outliers.
– Thus, the median or a trimmed mean is also commonly used.

Range and Variance
– Range is the difference between the max and min
– The variance or standard deviation s_x is the most common measure of the spread of a set of points.
– Because of outliers, other measures are often used.
Statistical Analysis
[Figure: control group mean vs. treatment group mean] Is there a difference?

What does difference mean?
[Figure: three pairs of distributions with medium, high and low variability]
The mean difference is the same for all three cases

What does difference mean?
[Figure: the same three cases] Which one shows the greatest difference?

What does difference mean?
– a statistical difference is a function of the difference between means relative to the variability
– a small difference between means with large variability could be due to chance
– like a signal-to-noise ratio
– the low-variability case shows the greatest difference

So we estimate
$$ t = \frac{\text{signal}}{\text{noise}} = \frac{\text{difference between group means}}{\text{variability of groups}} = \frac{\bar{X}_T - \bar{X}_C}{SE(\bar{X}_T - \bar{X}_C)} $$

Probability - p
– With t we check the probability p
– Reject or do not reject the null hypothesis
– You reject if p < 0.05 (or a smaller threshold)
– The smaller p is, the more significant the difference between the group means

Important notes on two-sample comparisons
– Type I errors (false positives): we accept a difference as real when it is not (at the 95% confidence level we are, of course, wrong 5% of the time)
  – We can use a more stringent threshold (a higher confidence level, i.e. a smaller α) to decrease these errors
– Type II errors (false negatives): if we make the threshold more stringent we risk missing some real differences
– Convention is that we should reduce Type I errors and be conservative
– Both can be reduced by increasing the sample size

Paired and unpaired tests
There are different formulas for the t-test depending on whether we have paired or unpaired data
– Paired: making observations of N individuals in two different situations
  – In this situation we can consider the difference for each individual rather than calculate separate means and SEs for the two effects
– Unpaired: two separate samples drawn from the same parent population
  – Can have different sample sizes

Tails
– Two-tailed: Do set A and set B come from different distributions?
– One-tailed: Does set A come from a distribution with a larger mean than set B?
– This corresponds to finding differentially regulated genes versus finding up-regulated genes.
Selecting genes with a t-test
$$ t = \frac{\mu_1 - \mu_2}{\sqrt{v/n_1 + v/n_2}} $$
– μ_i = mean expression value in class i
– n_i = number of examples in class i
– v = pooled variance across both classes
(Zar, Biostatistical Analysis)

Standard t-test: An example
– Observed gene expression values: four replicates each for Treatment A and Treatment B
– Compute the means: mean(A) = 3.01 / 4 = 0.7525, mean(B) = 5.71 / 4 = 1.4275

Pooled variance
– The standard t-test assumes samples are drawn from distributions with the same variance.
– Pooled variance = (SS_1 + SS_2) / (n_1 + n_2 - 2) = (SS_1 + SS_2) / (4 + 4 - 2) = 0.2574
– SS_i: sum of squared deviations from the mean in class i

Selecting genes with a t-test
t = (0.7525 - 1.4275) / sqrt(0.2574/4 + 0.2574/4) ≈ -1.88
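As a rough sketch, the pooled-variance t statistic can be computed as below. The function names are made up for this illustration, and the sign convention (class 1 minus class 2) is an assumption. The last line plugs in the summary numbers from the example (means 3.01/4 and 5.71/4, pooled variance 0.2574, four replicates per class).

```python
import math

def t_from_summary(mean1, mean2, pooled_var, n1, n2):
    """Two-sample t statistic from class means and the pooled variance."""
    return (mean1 - mean2) / math.sqrt(pooled_var / n1 + pooled_var / n2)

def pooled_t(a, b):
    """Standard (equal-variance) t statistic, pooled variance and dof."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    # SS_i: sum of squared deviations from the class mean.
    ss1 = sum((x - mean1) ** 2 for x in a)
    ss2 = sum((x - mean2) ** 2 for x in b)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    t = t_from_summary(mean1, mean2, pooled_var, n1, n2)
    return t, pooled_var, n1 + n2 - 2

# Summary values from the worked example.
print(round(t_from_summary(3.01 / 4, 5.71 / 4, 0.2574, 4, 4), 2))
```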
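If the Sample Variances are Unlikely to be Equal
Use Welch's t-test:
$$ t = \frac{\mu_1 - \mu_2}{\sqrt{v_1/n_1 + v_2/n_2}} $$
where v_i is the variance in class i.

Welch's approximation
For the example with four replicates per class:
t = | mean(A) - mean(B) | / sqrt(var(A)/4 + var(B)/4)
[Table: t-test vs. Welch's]

Degrees of freedom
– For the t-test, dof = n_1 + n_2 - 2
– For Welch's approximation, it is not so simple. Let A_i = var_i / n_i. Then
$$ \text{dof} = \frac{(A_1 + A_2)^2}{\dfrac{A_1^2}{n_1 - 1} + \dfrac{A_2^2}{n_2 - 1}} $$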
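A small sketch of Welch's statistic and its approximate degrees of freedom, using A_i = var_i / n_i as defined above. The function name and the example data are hypothetical.

```python
import math

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    n1, n2 = len(a), len(b)
    mean1, mean2 = sum(a) / n1, sum(b) / n2
    # Unbiased sample variances (divide by n - 1).
    var1 = sum((x - mean1) ** 2 for x in a) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in b) / (n2 - 1)
    a1, a2 = var1 / n1, var2 / n2
    t = (mean1 - mean2) / math.sqrt(a1 + a2)
    dof = (a1 + a2) ** 2 / (a1 ** 2 / (n1 - 1) + a2 ** 2 / (n2 - 1))
    return t, dof

# Hypothetical data: the second class is four times as variable as the first.
t, dof = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
print(round(t, 3), round(dof, 2))
```

With equal sample sizes and equal variances the formula gives n_1 + n_2 - 2, the same as the standard t-test; unequal variances pull the degrees of freedom below that.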
Non-parametric p-value
– The t-test assumes the t-distribution
  – a parametric method
  – compute the test statistic
  – use the t pdf to determine the p-value
– A non-parametric method
  – data are labeled as X and Y
  – compute the test statistic with the true labels
  – randomly permute the individual labels 10,000 times, and compute the test statistic each time
  – find the rank of the true test statistic among the test statistics of the random permutations
  – for example, if there are 10 permutations with test statistics larger than the true test statistic, then the p-value is 10 / 10,000 = 0.001
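The permutation procedure might look like the following sketch. For brevity it uses the absolute difference in means as the test statistic instead of the t statistic, and it counts permutations that are at least as extreme as the observed value; both are simplifying assumptions, as are the function name and the fixed random seed.

```python
import random

def permutation_p_value(x, y, n_perm=10000, seed=0):
    """Estimate a two-sided p-value by shuffling the class labels."""
    rng = random.Random(seed)

    def statistic(a, b):
        # Stand-in statistic for this sketch: absolute difference in means.
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of the X / Y labels
        if statistic(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return hits / n_perm
```

The smallest non-zero p-value this can return is 1 / n_perm, which is why very small p-values require many permutations.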
Mann-Whitney u-test
– Mann-Whitney, also known as Wilcoxon, is a non-parametric test.
– Begin by converting the pooled observations from Treatment A and Treatment B to ranks.

Mann-Whitney u statistic
One standard form of the u statistic is
$$ U_i = n_1 n_2 + \frac{n_i (n_i + 1)}{2} - R_i $$
where R_i is the sum of the ranks in class i. In the example, U = 13.
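A sketch of the rank conversion and the U statistic, using the common form U_1 = n_1 n_2 + n_1(n_1 + 1)/2 - R_1 (it is an assumption that this is the form the original slide showed). The data below are invented so that the rank sum for class A is 13, which reproduces U = 13 for four replicates per class; they are not the values from the slide.

```python
def mann_whitney_u(a, b):
    """Return (U1, U2) for two samples, using average ranks for ties."""
    pooled = sorted(
        [(v, 0) for v in a] + [(v, 1) for v in b], key=lambda pair: pair[0]
    )
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based rank shared by tied values
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    r1 = sum(rank for rank, (_, group) in zip(ranks, pooled) if group == 0)
    n1, n2 = len(a), len(b)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return u1, n1 * n2 - u1

# Hypothetical values: class A receives ranks 1, 2, 4 and 6 (sum 13).
print(mann_whitney_u([0.1, 0.2, 0.4, 0.6], [0.3, 0.5, 0.7, 0.8]))
```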
Permutation test
[Figure]

Cost-benefits analysis
– The t-test assumes both samples are drawn from normal distributions with the same variance.
– Welch's approximation allows the samples to be drawn from normal distributions with different variances.
– Mann-Whitney makes no assumption about the distribution.
– The tests, as listed, yield decreasing power.
– The permutation test gives the most flexibility in choosing a test statistic that reflects prior knowledge, but it can be computationally expensive for small p-values.

Multiple testing correction
– On an array of 10,000 spots, a p-value that looks small for a single test may not be significant.
– For significance of 0.05 with 10,000 spots, you need a p-value of 5 × 10^-6 (0.05 / 10,000).

Family-wise Error Rate (FWER)
– Chance of any false positives
– Assume a 0.01 significance level for one gene
– Multiply by the number of genes to get the expected number of false positives: many false positives
– Bonferroni correction: divide 0.01 by the number of genes
– Bonferroni is conservative: it holds under any dependence between genes, so it is stricter than necessary when genes are correlated.

Types of errors
– False positive (Type I error): the experiment indicates that the gene has changed, but it actually has not.
– False negative (Type II error): the gene has changed, but the experiment failed to indicate the change.
– Typically, researchers are more concerned about false positives.
– Without doing many (expensive) replicates, there will always be many false negatives.

False discovery rate
– The false discovery rate (FDR) is the percentage of genes above a given position in the ranked list that are expected to be false positives.
– False positive rate: percentage of non-differentially expressed genes that are flagged.
– False discovery rate: percentage of flagged genes that are not differentially expressed.
– Example: 5 FP, 13 TP, 33 TN, 5 FN
  – FDR = FP / (FP + TP) = 5/18 = 27.8%
  – FPR = FP / (FP + TN) = 5/38 = 13.2%

Bonferroni vs. FDR
– Bonferroni controls the family-wise error rate; i.e., the probability of at least one false positive.
– FDR is the proportion of false positives among the genes that are flagged as differentially expressed.

Controlling the FDR
– Order the unadjusted p-values: p_1 ≤ p_2 ≤ … ≤ p_m.
– To control the FDR at level α, let
$$ j^* = \max \left\{ j : p_j \le \frac{j}{m} \alpha \right\} $$
– Reject the null hypothesis for j = 1, …, j*.
– This approach is conservative if many genes are differentially expressed.
(Benjamini & Hochberg, 1995)
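The step-up rule can be sketched as follows, with a Bonferroni cutoff alongside for comparison. The function names and the example p-values are made up for this illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    j_star = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank j with p_(j) <= (j / m) * alpha.
        if p_values[index] <= rank / m * alpha:
            j_star = rank
    rejected = [False] * m
    for index in order[:j_star]:
        rejected[index] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]

# Hypothetical p-values for eight genes.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sum(benjamini_hochberg(p)), sum(bonferroni(p)))
```

Because the rule takes the largest qualifying rank, a p-value can be rejected even if it misses its own threshold, as long as a larger p-value further down the list meets its threshold.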
q-value
– The p-value for a particular gene G is the probability that a randomly generated expression profile would be as or more extremely differentially expressed.
– The q-value for a particular gene G is the proportion of false positives among all genes that are as or more extremely differentially expressed.
– Equivalently, the q-value is the minimal FDR at which this gene appears significant.

Q-value software
[Figure]

SAM
Significance analysis of microarrays applied to the ionizing radiation response. Virginia Goss Tusher, Robert Tibshirani, and Gilbert Chu. Proc. Natl. Acad. Sci. USA, Vol. 98, Issue 9, April 24, 2001

Abstract
– Method for gene filtering: find genes that change significantly across samples
– Significance Analysis of Microarrays (SAM) assigns a score to each gene on the basis of change in gene expression relative to the standard deviation of repeated measurements.
– For genes with scores greater than an adjustable threshold, SAM uses permutations of the repeated measurements to estimate the percentage of genes identified by chance, the false discovery rate (FDR).

Introduction
– Suitable for oligo, cDNA, protein arrays
– Does not normalize the data!
– Challenge:
  – methods based on conventional t tests provide the probability (P) that a difference in gene expression occurred by chance. For an array with 10,000 genes, a significance level of alpha = 0.01 would identify 100 genes by chance.
  – Experiments are expensive.

Introduction
– Solution based on SAM:
  – assimilate a set of gene-specific t tests. Each gene is assigned a score on the basis of its change in gene expression relative to the standard deviation of repeated measurements for that gene.
  – Instead of more replicates, generate permutations of the data (mix the labels)
– Genes with scores greater than a threshold are deemed potentially significant.
– The percentage of such genes identified by chance is the false discovery rate (FDR).
– To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements.
– The threshold can be adjusted to identify smaller or larger sets of genes, and FDRs are calculated for each set.
– To demonstrate its utility, SAM was used to analyze a biologically important problem: the transcriptional response of lymphoblastoid cells to ionizing radiation (IR).
Motivating Experiment
– Two human cell lines (1 and 2), two treatments: Irradiated (I) and Unirradiated (U)
– One RNA sample for each combination of cell line and treatment

Motivating Experiment
– After labeling, each RNA sample was split into two aliquots denoted A and B.

  Cell line   Unirradiated (U)   Irradiated (I)
  1           U1A, U1B           I1A, I1B
  2           U2A, U2B           I2A, I2B

Motivating Experiment
– 8 GeneChips, one for each sample in the table above, were used to obtain measures of expression.

First glance at the data
[Figure: linear scatter plot of gene expression; cube root scatter plot of gene expression]

How to find the significant changes? Naïve method
[Figure: cube root scatter plot of average gene expression from the four hybridizations with uninduced cells (avg xU) and induced cells 4 h after exposure to 5 Gy of IR (avg xI). Some of the genes that responded to IR are indicated by arrows.]

Test Statistic for the i-th Gene
$$ d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0} $$
– \bar{x}_I(i): average of the 4 normalized measures from irradiated samples
– \bar{x}_U(i): average of the 4 normalized measures from unirradiated samples
– s(i): the usual standard deviation in the denominator of a two-sample t-statistic
– s_0: a constant common to all genes, added to make the variation in d(i) similar across genes of all intensity levels
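A minimal sketch of the relative difference d(i) for a single gene. It assumes s(i) is the pooled standard error from the two-sample t statistic, which is how the slide describes it; the function name and the input values are hypothetical, and only s_0 = 3.3 is taken from the ionizing radiation analysis.

```python
import math

def sam_d(irradiated, unirradiated, s0):
    """Return (d, s) for one gene: relative difference and gene-specific scatter."""
    n1, n2 = len(irradiated), len(unirradiated)
    mean_i = sum(irradiated) / n1
    mean_u = sum(unirradiated) / n2
    ss = sum((x - mean_i) ** 2 for x in irradiated) + sum(
        (x - mean_u) ** 2 for x in unirradiated
    )
    # Pooled standard error, as in the denominator of a two-sample t statistic.
    s = math.sqrt((1 / n1 + 1 / n2) * ss / (n1 + n2 - 2))
    return (mean_i - mean_u) / (s + s0), s

# Hypothetical measurements for one gene.
d_plain, s = sam_d([3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0], s0=0.0)
d_sam, _ = sam_d([3.0, 4.0, 5.0, 6.0], [1.0, 2.0, 3.0, 4.0], s0=3.3)
print(round(d_plain, 3), round(d_sam, 3))
```

With s0 = 0 the score is the ordinary t statistic; a positive s0 shrinks the scores of genes with small s(i) the most.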
Selecting the constant s_0
– At low expression levels, variance in d(i) can be high because of small values of s(i).
– To stabilize the variance of d(i) across genes, a small positive constant s_0 was used in the denominator of the test statistic.
– "The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s_0 was chosen to minimize the coefficient of variation."
– s_0 was chosen to be 3.3 for the ionizing radiation data.

More Detail on Selecting s_0
– The d(i) are separated into approximately 100 groups.
– The 1% of the d(i) values with the smallest s(i) values are placed in the first group, the 1% of the d(i) values with the next smallest s(i) are placed in the second group, etc.
– The median absolute deviation (MAD) of the d(i) values is computed separately for each group.
– The coefficient of variation (CV) of these 100 MAD values is computed.

More Detail on Selecting s_0 (continued)
– This process is repeated for values of s_0 equal to the minimum of s(i) over i, the 5th percentile of the s(i) values, the 10th percentile of the s(i) values, ..., the 95th percentile of the s(i) values, and the maximum of the s(i) values.
– The value of s_0 that minimizes the CV of the 100 MAD values over the candidate values described above is selected as the constant s_0.

Balancing the Permutations
– There are differences between the two cell lines.
– Balanced permutations are used to minimize the effects of these differences.
– A permutation is balanced if each group of four experiments contains two experiments from line 1 and two from line 2.
– There are 36 balanced permutations.
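The count of 36 can be checked by enumeration: choose which four of the eight chips are labeled "irradiated", and keep only the choices with two chips from each cell line. The chip names follow the design (treatment, cell line, aliquot); the helper names are invented for this sketch.

```python
from itertools import combinations

CHIPS = ["U1A", "U1B", "I1A", "I1B", "U2A", "U2B", "I2A", "I2B"]

def cell_line(chip):
    """Second character of a chip name is the cell line ('1' or '2')."""
    return chip[1]

def balanced_permutations(chips):
    """All splits into two groups of four with two chips per cell line each."""
    splits = []
    for group in combinations(chips, 4):
        if sum(cell_line(c) == "1" for c in group) == 2:
            rest = tuple(c for c in chips if c not in group)
            splits.append((group, rest))
    return splits

splits = balanced_permutations(CHIPS)
print(len(splits))  # C(4,2) * C(4,2)
```

The true labeling, with all four I chips in the first group, is one of the 36 splits.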
Example Permutations
[Table: one example permutation of the Irradiated (I) / Unirradiated (U) labels; chips shown: I1A I1B U1A U1B for cell line 1, I2A I2B U2A U2B for cell line 2]

[Figure: scatter plots of relative difference in gene expression d(i) vs. gene-specific scatter s(i)]

A Permutation Procedure for Assessing Significance
1. The irradiated and unirradiated GeneChips were shuffled within each cell line.
2. The d(i) statistic was computed for each gene and ordered across genes from smallest to largest to obtain d_1(1) < d_1(2) < … < d_1(g), where g denotes the number of genes.
3. Steps 1 and 2 were repeated for all possible data permutations described in step 1 to obtain d_p(1) < d_p(2) < … < d_p(g) for p = 1, ..., 36.

A Permutation Procedure for Assessing Significance (continued)
4. For each i, d_1(i), ..., d_36(i) were averaged to obtain d_E(i), the "expected relative difference."
5. The original d(i) statistics were also sorted so that d(1) < d(2) < … < d(g).
6. Genes for which | d(i) - d_E(i) | > Δ were declared significant, where Δ is a user-specified cutoff for significance.

Example
[Figure]

Plot of Observed vs. "Expected" Test Statistics
[Figure: d(i) vs. d_E(i); points for genes with evidence of induction and points for genes with evidence of repression are marked]

Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data
[Figure: d(i) vs. log10 s(i); 24 induced genes, 22 repressed genes]

Estimating FDR for a Selected Δ
1. Find the smallest d(i) among those d(i) for which d(i) - d_E(i) > Δ and call it d_up.
2. Find the largest d(i) among those d(i) for which d(i) - d_E(i) < -Δ and call it d_down.
3. For each permuted data set, find the number of genes with d(i) >= d_up or d(i) <= d_down and denote these counts by n_1, ..., n_36.
4. FDR is estimated by n̄ / n, where n̄ is the average of n_1, ..., n_36 and n is the number of genes identified as significant in the original data.
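The permutation procedure and the FDR estimate can be sketched together, assuming the d statistics for the original data and for each permuted data set have already been computed. The function name, the return convention and the toy numbers are all hypothetical.

```python
def sam_fdr(d_observed, d_permuted, delta):
    """Return (n_significant, mean_permutation_count, fdr_estimate).

    d_observed: d(i) for every gene in the original data.
    d_permuted: one list of d(i) values per permuted data set.
    """
    g = len(d_observed)
    observed = sorted(d_observed)
    permuted = [sorted(values) for values in d_permuted]
    # Expected relative difference: average of the i-th smallest permuted value.
    d_expected = [sum(p[i] for p in permuted) / len(permuted) for i in range(g)]
    up = [d for d, e in zip(observed, d_expected) if d - e > delta]
    down = [d for d, e in zip(observed, d_expected) if d - e < -delta]
    n_significant = len(up) + len(down)
    if n_significant == 0:
        return 0, 0.0, None
    d_up = min(up) if up else float("inf")
    d_down = max(down) if down else float("-inf")
    counts = [sum(1 for d in p if d >= d_up or d <= d_down) for p in permuted]
    mean_count = sum(counts) / len(counts)
    return n_significant, mean_count, mean_count / n_significant

# Toy data: five genes, two permuted data sets.
print(sam_fdr(
    [-5.0, -0.5, 0.0, 0.5, 6.0],
    [[-1.0, -0.5, 0.0, 0.5, 1.0], [-1.2, -0.4, 0.1, 0.6, 6.5]],
    delta=2.0,
))
```

Because d_up and d_down are found separately, the two cutoffs need not be symmetric around zero.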
FDR cont'd
Note: Cutoffs are asymmetric

Counts of Genes beyond the Threshold for Each Permutation
[Table: permutation number vs. count]

Mean count = 8.472
FDR estimate = 8.472/46 = 18.4%
[Table: permutation number vs. count]

How to choose Δ?
Omitting s_0 caused higher FDR.

Plot of Observed vs. "Expected" Test Statistics
[Figure: d(i) vs. d_E(i)]

Plot of d(i) vs. log10 s(i) for the Ionizing Radiation Data
[Figure: d(i) vs. log10 s(i)]

Same Plot for One of the Permuted Data Sets
[Figure: d(i) vs. log10 s(i)] Only 5 genes are beyond the thresholds, compared to 46 for the original data.

SAM vs. R-fold
– R-fold method: gene i is significant if r(i) > R or r(i) < 1/R
  – FDR 73%-84%: unacceptable.
– Pairwise fold change: at least 12 out of 16 pairings satisfying the criteria.
  – FDR 60%-71%: unacceptable.
– Why doesn't it work?

Fold-change, SAM - Validation
[Figure]

SAM vs. Multiple t-Tests
– Multiple t-tests try to control the FDR or the FWER (family-wise error rate). Why doesn't it work?
  – FWER: too stringent (Bonferroni, Westfall and Young)
  – FDR: too granular (Benjamini and Hochberg)
– SAM does not assume a normal distribution of the data
– SAM works effectively even with small sample sizes.

Conclusion
– SAM is a method for identifying genes on a microarray with statistically significant changes in expression.
– SAM provides an estimate of the FDR for each value of the tuning parameter.
– The estimated FDR is computed from permutations of the data.
– SAM can be generalized to other types of experiments and outcomes by redefining d(i).
– class.stanford.edu/SAM/SAMServlet
ANOVA
– The t-test and its variants only work when there are two sample pools.
– Analysis of variance (ANOVA) is a general technique for handling multiple variables, with replicates.
– A tutorial is available here:

A simple experiment
– Measure response to a drug treatment in two different mouse strains.
– Repeat each measurement five times.
– Total experiment = 2 strains * 2 treatments * 5 repetitions = 20 arrays
– If you look for treatment effects using a t-test, then you ignore the strain effects.

ANOVA lingo
– Factor: a variable that is under the control of the experimenter (strain, treatment).
– Level: a possible value of a factor (drug, no drug).
– Main effect: an effect that involves only one factor.
– Interaction effect: an effect that involves two or more factors simultaneously.
– Balanced design: an experiment in which each factor and level is measured an equal number of times.

Two-factor design
[Figure]

Fixed and random effects
– Fixed effect: a factor for which the levels would be repeated exactly if the experiment were repeated.
– Random effect: a term for which the levels would not repeat in a replicated experiment.
– In the simple experiment, treatment and strain are fixed effects, and we include a random effect to account for biological and experimental variability.

ANOVA model
$$ E_{ijk} = \mu + T_i + S_j + (TS)_{ij} + \varepsilon_{ijk}, \quad i = 1, \dots, n; \; j = 1, \dots, m; \; k = 1, \dots, p $$
– μ is the mean expression level of the gene.
– T and S are main effects (treatment, strain) with n and m levels, respectively.
– TS is an interaction effect.
– p is the number of replicates per group.
– ε represents random error (to be minimized).

ANOVA steps
For each gene on the array
– Fit the parameters T and S, minimizing ε.
– Test T, S and TS for difference from zero, yielding three F statistics.
– Convert the F statistics into p-values.
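For one gene in a balanced two-factor design, the three F statistics can be sketched with plain sums of squares. The function name and the input layout (`data[i][j]` holds the p replicates for treatment level i and strain level j) are assumptions of this example. Converting F to a p-value needs the F distribution, which the Python standard library does not provide, so that step is left out.

```python
def mean(values):
    return sum(values) / len(values)

def two_way_anova_f(data):
    """F statistics for treatment (T), strain (S) and interaction (TS).

    Assumes a balanced design: every cell data[i][j] has p > 1 replicates.
    """
    n, m, p = len(data), len(data[0]), len(data[0][0])
    cell = [[mean(data[i][j]) for j in range(m)] for i in range(n)]
    t_mean = [mean(cell[i]) for i in range(n)]
    s_mean = [mean([cell[i][j] for i in range(n)]) for j in range(m)]
    grand = mean(t_mean)
    ss_t = m * p * sum((t - grand) ** 2 for t in t_mean)
    ss_s = n * p * sum((s - grand) ** 2 for s in s_mean)
    ss_ts = p * sum(
        (cell[i][j] - t_mean[i] - s_mean[j] + grand) ** 2
        for i in range(n) for j in range(m)
    )
    ss_error = sum(
        (x - cell[i][j]) ** 2
        for i in range(n) for j in range(m) for x in data[i][j]
    )
    ms_error = ss_error / (n * m * (p - 1))
    return {
        "T": ss_t / (n - 1) / ms_error,
        "S": ss_s / (m - 1) / ms_error,
        "TS": ss_ts / ((n - 1) * (m - 1)) / ms_error,
    }

# Hypothetical gene: strong treatment effect, weak strain effect, no interaction.
gene = [[[1.0, 3.0], [2.0, 4.0]], [[5.0, 7.0], [6.0, 8.0]]]
print(two_way_anova_f(gene))
```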
ANOVA assumptions
– For a given gene, the random error terms are independent, normally distributed and have uniform variance.
– The main effects and their interactions are linear.

Summary
– Individual measurements from microarray experiments are not trustworthy.
– Repetition or independent verification (e.g., RT-PCR) are the best means of verification.
– For simple designs, use Welch's approximation of the t-test.
– For complex designs, use ANOVA.
– Correct for multiple comparisons using FDR and q-values.

Bioconductor
– Bioconductor is an open source project to design and provide high quality software and documentation for bioinformatics.
– Current focus: microarrays and gene (transcript) annotation
– Most of the early developments are in the form of R packages.
– Open to (your?) contributions
– Software and documentation are available from

Bioconductor packages
– General infrastructure
  – Biobase
  – annotate, AnnBuilder
  – tkWidgets
– Pre-processing for Affymetrix data
  – affy
– Pre-processing for cDNA data
  – marrayClasses, marrayInput, marrayNorm, marrayPlots
– Differential expression
  – edd, genefilter, multtest, ROC
– etc.