Published by Flora O'Neal. Modified over 9 years ago.
1 Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data
C. F. Jeff Wu, University of Michigan (joint work with G. Dyson)
2 Outline
Current Methods
Proposed Methodology
Analysis Plan
Example
Conclusions
3 What are microarrays?
Two major types
– Oligonucleotide gene chips
– Spotted glass arrays
Perfect match (PM) and mismatch (MM) probes are spotted onto a gene chip
– ~20 probes make up a probe set (or gene)
– The MM probe for each gene has its middle base set to the complement of the corresponding PM base
– Hybridize labeled RNA corresponding to PM probes
Glass arrays involve the competitive hybridization of two RNA pools to cDNA spotted onto a glass slide
Typically thousands of genes on a slide
4 Multiplicity Problem
When we make more than one comparison in a hypothesis testing situation, the usual interpretation of an individual p-value breaks down
Control of the family-wise error rate is necessary in order to preserve the nominal type I error rate
Various approaches correct the chance of making a type I error for multiplicity, including Tukey, Bonferroni and Holm
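To make two of the corrections named above concrete, here is a minimal sketch of Bonferroni and Holm adjusted p-values. The function names and the example p-values are illustrative assumptions, not taken from the slides.

```python
def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvalues)
    # Visit the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The multiplier shrinks from m down to 1 as we step down.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Adjusted p-values must not decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for illustration
print(bonferroni(raw))
print(holm(raw))
```

Holm never gives a larger adjusted p-value than Bonferroni for the same test, which is why it is usually preferred when a step-down procedure is acceptable.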
5 Microarray Analysis Techniques
Westfall Young step down (WY)
Significance Analysis of Microarrays (SAM)
Empirical Bayes (EB)
Bayesian (MCMC)
Mixture Modeling
Dimension reduction techniques
Machine learning
6 Westfall Young (WY)
Compute the ranks r_j of the original test statistics such that [equation not preserved]
Construct balanced permutations of the samples, computing the same test statistic as above for each permutation b
Compute [equation not preserved]
Repeat B times and calculate the adjusted p-value as [equation not preserved]
Less conservative than Bonferroni
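The slide's formulas were lost, so the following is only a sketch of the step-down maxT form of the Westfall-Young procedure as it is commonly described, not a reconstruction of the slide. It assumes a Welch-type absolute t-statistic and plain random relabelings rather than the balanced permutations mentioned above; `abs_t`, `westfall_young_maxt` and the toy data are hypothetical names and values.

```python
import random
from statistics import mean, variance


def abs_t(values, labels):
    """Absolute Welch-type t-statistic for one gene (assumed test statistic)."""
    a = [v for v, g in zip(values, labels) if g == 0]
    b = [v for v, g in zip(values, labels) if g == 1]
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se


def westfall_young_maxt(data, labels, n_perm=200, seed=0):
    """Step-down maxT adjusted p-values, returned in the original gene order."""
    rng = random.Random(seed)
    m = len(data)
    observed = [abs_t(row, labels) for row in data]
    # order[0] is the gene with the largest observed statistic.
    order = sorted(range(m), key=lambda j: -observed[j])
    counts = [0] * m
    for _ in range(n_perm):
        perm = labels[:]
        rng.shuffle(perm)  # random relabeling (not forced to be balanced)
        stats = [abs_t(row, perm) for row in data]
        running = 0.0
        # Successive maxima, built from the least significant gene upward.
        for pos in range(m - 1, -1, -1):
            running = max(running, stats[order[pos]])
            if running >= observed[order[pos]]:
                counts[pos] += 1
    adjusted = [0.0] * m
    previous = 0.0
    for pos in range(m):
        # Enforce monotonicity along the observed ordering.
        previous = max(previous, counts[pos] / n_perm)
        adjusted[order[pos]] = previous
    return adjusted


labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],  # clearly shifted gene
    [1.0, 2.0, 1.5, 1.8, 1.2, 1.9, 1.4, 1.7],  # no real difference
    [0.5, 0.7, 0.4, 0.9, 0.6, 0.8, 0.5, 0.7],  # no real difference
]
print(westfall_young_maxt(data, labels))
```

Because each gene is compared against the maximum over the genes ranked at or below it, the adjustment accounts for correlation between genes, which is the usual reason the procedure is less conservative than Bonferroni.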
7 Significance Analysis of Microarrays (SAM)
Use a t-like statistic
Use the balanced permutation method from the previous slide to estimate the null distribution, assuming all effects are null
Call genes that fall outside the bars significant
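As an illustration of what a "t-like statistic" means here, the sketch below follows the form usually given for SAM: a difference in group means divided by the pooled standard error plus a small positive constant `s0`. The function name and the example values (including `s0 = 0.5`) are assumptions for illustration only.

```python
from statistics import mean


def sam_statistic(group_a, group_b, s0):
    """Difference in means over (pooled standard error + fudge constant s0)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = mean(group_a), mean(group_b)
    # Pooled within-group sum of squares.
    ss = sum((x - mean_a) ** 2 for x in group_a)
    ss += sum((x - mean_b) ** 2 for x in group_b)
    s = ((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2)) ** 0.5
    return (mean_a - mean_b) / (s + s0)


# With s0 = 0 this is the ordinary pooled two-sample t-statistic.
print(sam_statistic([1, 2, 3], [4, 5, 6], s0=0.0))
# A positive s0 shrinks the statistic, most strongly for low-variance genes.
print(sam_statistic([1, 2, 3], [4, 5, 6], s0=0.5))
```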
8 Half-Normal Analysis
9 Microarray Specific Problem
10 Analysis Plan
Robust measures of location and scale
Summary statistic
Two half-normal plots (for upward-regulated and downward-regulated genes)
Segment determination
– Find [symbol not preserved]
– insignificant, borderline, significant
Repeat the procedure, using [symbol not preserved] as base
11 Robust Measures of Location and Scale
Perform transformation and suitable normalization
Compute the median and the Median Absolute Deviation (MAD) for each gene
– Reasonable estimates
– Less affected by outliers than the mean and SD
– Interested in robustness rather than efficiency
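A minimal sketch of the two robust summaries for a single gene follows. It returns the raw MAD; whether the slides rescale it (for example by the usual normal-consistency constant) is not stated, so no scaling is applied. The function name and the example values are illustrative.

```python
from statistics import median


def median_and_mad(values):
    """Return (median, median absolute deviation from the median)."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return center, mad


# One gross outlier (9.0) barely moves either summary.
expression = [2.1, 2.3, 2.2, 2.4, 9.0]
print(median_and_mad(expression))
```

For comparison, the mean of the same five values is 3.6, pulled well away from the bulk of the data by the single outlier.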
12 Summary Statistic
Compute a quasi two-sample t-statistic using the robust values from above: [equation not preserved]
c is chosen to minimize [expression not preserved] for the middle 100(1 − 2[symbol not preserved])% of the ss_l
Tusher et al. (2001) chose c to minimize the coefficient of variation
Efron et al. (2001) used the 90th percentile of the gene standard error estimates for c
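The slide's formula for the summary statistic did not survive, so the sketch below shows only one plausible shape for a "quasi two-sample t-statistic": medians in place of means, MADs in place of standard deviations, and the constant c added to the denominator. The exact form, the names `mad` and `quasi_t`, and the value c = 0.5 are all assumptions, not the authors' definition.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    center = median(values)
    return median([abs(v - center) for v in values])


def quasi_t(group_a, group_b, c):
    """Assumed form: median difference over (MAD-based scale + c)."""
    scale = (mad(group_a) ** 2 / len(group_a)
             + mad(group_b) ** 2 / len(group_b)) ** 0.5
    # c > 0 keeps the denominator away from zero for low-variance genes.
    return (median(group_a) - median(group_b)) / (scale + c)


print(quasi_t([1, 2, 3, 4, 5], [3, 4, 5, 6, 7], c=0.5))
```

Both choices of c cited above (Tusher et al. and Efron et al.) would enter this sketch only through the `c` argument.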
13 Two Half-Normal Plots
Construct two half-normal plots: one for the p positive and one for the r negative ss_l
Run the procedure separately on each set
Denote the ordered p positive effects by [notation not preserved]
Plot abss_i against half-normal distribution quantiles, i.e. the points [equation not preserved]
Goal: obtain a set of noise effects
These yield a baseline against which to test the rest of the effects
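The plotting positions were lost with the slide's equation. The sketch below uses the common textbook convention for half-normal plots, quantile i of p equal to the standard normal inverse CDF at 0.5 + 0.5(i − 0.5)/p; this convention is an assumption and the slides may have used a different one. Function names and the example effects are illustrative.

```python
from statistics import NormalDist


def half_normal_quantiles(p):
    """Half-normal plotting positions for p ordered absolute effects."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(0.5 + 0.5 * (i - 0.5) / p)
            for i in range(1, p + 1)]


def half_normal_plot_points(effects):
    """Pair each ordered absolute effect with its half-normal quantile."""
    ordered = sorted(abs(e) for e in effects)
    return list(zip(half_normal_quantiles(len(ordered)), ordered))


for quantile, effect in half_normal_plot_points([0.3, 1.1, 0.2, 4.5, 0.6]):
    print(round(quantile, 3), effect)
```

Noise effects should fall roughly on a straight line through the origin in such a plot, while real effects sit above the line at the upper end.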
14 Segment Determination:
Given k, initialize the null set as the points abss_1 : abss_k
Regress the null set on the first k half-normal quantiles (Q_1 : Q_k)
Produce predicted values at the remaining quantile values (Q_h : h > k)
Compute predicted statistics [equation not preserved]
Find [equation not preserved]
15 Segment Determination: (cont)
The initial null set of k genes becomes k + m (= [symbol not preserved]) null genes
Now redo the segment determination procedure, using the k + m genes as the base null set
Continue until no new genes are added
Do this for each k less than p − 1
Store the end point [symbol not preserved]
Set the most frequent [symbol not preserved] to [symbol not preserved]
16 Sample
Let k = 200, total effects = 500
– The first 200 ordered positive effects are regressed on the first 200 half-normal quantiles
– Test ordered effects 201 to 500 using the absolute value of the predicted statistics
– For example, suppose effect 239 is the largest h whose predicted statistic is less than the t-critical value
– So [symbol not preserved] would initially be 239
Redo the above with k = 239 effects, so we test effects 240 to 500
– Say statistic 242 is the largest h less than the t-critical value based on the new regression line
– So the new [symbol not preserved] would be 242
Redo the above again with k = 242, testing effects 243 to 500
– No statistics are less than the t-critical value
So [symbol not preserved] is 242
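The iteration walked through above can be sketched as follows, for a single starting k. Several details are assumptions because the slide formulas are missing: the regression is ordinary least squares with an intercept, the predicted statistic is the residual divided by the usual OLS prediction standard error, the plotting positions follow the 0.5 + 0.5(i − 0.5)/p convention, and the critical value is passed in as a plain number instead of being looked up from a t-distribution. All names are hypothetical.

```python
from statistics import NormalDist, mean


def half_normal_quantiles(p):
    nd = NormalDist()
    return [nd.inv_cdf(0.5 + 0.5 * (i - 0.5) / p) for i in range(1, p + 1)]


def extend_null_set(abss, quantiles, k, critical):
    """Fit a line to the first k points; return the largest effect number
    whose predicted statistic is below `critical` (or k if there is none)."""
    x, y = quantiles[:k], abss[:k]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma = (rss / (k - 2)) ** 0.5  # requires k >= 3 and a non-perfect fit
    new_k = k
    for h in range(k, len(abss)):  # 0-based index h is effect number h + 1
        se = sigma * (1 + 1 / k + (quantiles[h] - x_bar) ** 2 / sxx) ** 0.5
        stat = abs(abss[h] - (intercept + slope * quantiles[h])) / se
        if stat < critical:
            new_k = h + 1
    return new_k


def find_null_boundary(abss, k, critical=2.0):
    """Repeat the extension step until the null set stops growing."""
    abss = sorted(abss)
    quantiles = half_normal_quantiles(len(abss))
    while True:
        new_k = extend_null_set(abss, quantiles, k, critical)
        if new_k == k:
            return k
        k = new_k


# Synthetic check: 25 near-null effects plus 5 clearly shifted ones.
q = half_normal_quantiles(30)
example = [q[i] + 0.01 * (-1) ** (i + 1) for i in range(25)]
example += [q[i] + 5.0 for i in range(25, 30)]
print(find_null_boundary(example, k=10))
```

The slides repeat this for every starting k below p − 1 and keep the most frequent end point; that outer loop would simply call `find_null_boundary` once per k and tally the results.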
17 Example
18 Will test all effects after [symbol not preserved] using the same statistics
To adjust for multiple testing, define NC as the number of consecutive significant effects necessary to call all subsequent effects significant
Use the Bonferroni adjustment (does not require independence): [equation not preserved]
Instead of doing thousands of comparisons, we only need to do NC to determine significance
Define [equation not preserved]
Now we have identified the change points in the graph for segment detection
Find [equation not preserved]
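One way to read the NC rule is sketched below: test each effect beyond the null set at level alpha / NC, and locate the first run of NC consecutive significant effects. The per-test level alpha / NC is an assumed reading of the lost Bonferroni formula, and a normal critical value stands in for the t critical value the slides appear to use; the function names and example statistics are hypothetical.

```python
from statistics import NormalDist


def bonferroni_critical(alpha, nc):
    """Two-sided normal critical value at per-test level alpha / nc."""
    return NormalDist().inv_cdf(1 - alpha / nc / 2)


def first_significant_run(stats, nc, critical):
    """Index where the first run of nc consecutive significant stats starts,
    or None if no such run exists."""
    run = 0
    for i, s in enumerate(stats):
        run = run + 1 if abs(s) > critical else 0
        if run == nc:
            return i - nc + 1
    return None


critical = bonferroni_critical(alpha=0.05, nc=3)
predicted = [0.5, 3.5, 1.0, 3.2, 3.6, 4.0, 5.0]  # made-up predicted statistics
print(first_significant_run(predicted, nc=3, critical=critical))
```

In this made-up example the isolated large value at index 1 is not enough; the run starting at index 3 is the first with three consecutive significant statistics.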
19 Example: Downward-regulated Speed Mouse Data
20 Example: Downward-regulated Speed Mouse Data (cont)
21 Error Rate Estimation: FDR
False Discovery Rate (FDR) is the expected proportion of rejected hypotheses that are falsely rejected
Permute the condition labels, maintaining balance
– Example: 8 replicates in conditions A and B
– Each A' and B' will have 4 replicates from A and 4 from B
– Compute the robust statistics, keeping the same c from the actual data
Determine the average number of effects that fall above the positive or below the negative boundary of the significant sets
Divide that number by the total number of effects called significant
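The permutation estimate described above can be sketched as follows. The statistic is passed in as a function so that the same robust statistic (with c fixed from the real data) can be reused; `balanced_relabel`, `estimate_fdr`, the boundaries and the toy data are hypothetical, and each row is assumed to list the A samples first and the B samples second.

```python
import random


def balanced_relabel(n_a, n_b, rng):
    """Sample indices for A' and B', each mixing half of A with half of B."""
    a_idx = list(range(n_a))
    b_idx = list(range(n_a, n_a + n_b))
    rng.shuffle(a_idx)
    rng.shuffle(b_idx)
    half_a, half_b = n_a // 2, n_b // 2
    new_a = a_idx[:half_a] + b_idx[:half_b]
    new_b = a_idx[half_a:] + b_idx[half_b:]
    return new_a, new_b


def estimate_fdr(data, n_a, n_b, statistic, upper, lower, n_called,
                 n_perm=100, seed=0):
    """Average count of permuted statistics beyond the significance
    boundaries, divided by the number of effects called significant."""
    rng = random.Random(seed)
    total_false = 0
    for _ in range(n_perm):
        new_a, new_b = balanced_relabel(n_a, n_b, rng)
        for row in data:
            s = statistic([row[i] for i in new_a], [row[i] for i in new_b])
            if s > upper or s < lower:
                total_false += 1
    # n_called must be positive; with no calls the ratio is undefined.
    return (total_false / n_perm) / n_called


def mean_difference(a, b):
    return sum(a) / len(a) - sum(b) / len(b)


toy = [[0.0] * 16, [0.0] * 16]  # two genes, 8 + 8 samples, no signal
print(estimate_fdr(toy, 8, 8, mean_difference, upper=1.0, lower=-1.0, n_called=4))
```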
22 Speed Data: Analysis and Comparison
WY found 8 genes significant, with Type I error = 0.05
23 Lemon Data: Analysis and Comparison
WY found 253 genes significant, with Type I error = 0.05
24 Conclusions
Proposed a new method for determining differential expression in genes
Dealt with the multiplicity problem by using only a small subset of genes
Can extend to other large data sets
Allow scientists to play a role in sequential decision making
Incorporate a priori knowledge of experiment with selection of c