1-3-20041 Assume we have two experimental conditions (j=1,2) We measure expression of all genes n times under both experimental conditions (n two- channel.

1-3-2004, slide 1: Identifying Differentially Expressed Genes
Assume we have two experimental conditions (j = 1, 2).
We measure expression of all genes n times under both experimental conditions (n two-channel microarrays).
For a specific gene (focusing on a single gene), x_ij = i-th measurement under condition j.
Statistical models for expression measurements under two different conditions: μ_1, μ_2 and σ are unknown model parameters. μ_j represents the average expression measurement in a large number of replicated experiments under condition j; σ represents the variability of measurements.
The question of whether the gene is differentially expressed corresponds to assessing whether μ_1 ≠ μ_2.
The strength of evidence in the observed data that this is the case is expressed in terms of a p-value.
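
To make the roles of μ_1, μ_2 and σ concrete, here is a minimal sketch that simulates measurements for a single gene. The normal distribution, the parameter values, and the names `mu`, `sigma` and `x` are assumptions made for illustration only; the transcript does not preserve the model equations themselves.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
n     <- 6             # replicate measurements per condition
mu    <- c(6.4, 6.6)   # mu_1, mu_2: long-run average log2 expression
sigma <- 0.45          # common measurement variability

# x[i, j] = i-th measurement under condition j
# (normally distributed measurements are an assumption, not from the slides)
x <- sapply(mu, function(m) rnorm(n, mean = m, sd = sigma))
colnames(x) <- c("condition1", "condition2")

colMeans(x)       # sample estimates of mu_1 and mu_2
apply(x, 2, sd)   # per-condition estimates of sigma
```

With only six replicates the estimates can sit noticeably away from the true values, which is why the difference between conditions has to be judged against the variability.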

1-3-2004, slide 2: P-value
Estimate the model parameters based on the data.
Calculate the t-statistic, which summarizes the information about our hypothesis of interest (μ_1 ≠ μ_2).
Establish the null distribution of the t-statistic (its distribution assuming the “null hypothesis” that μ_1 = μ_2).
The null distribution in this case turns out to be the t-distribution with n-1 degrees of freedom (n-1 applies when the n two-channel arrays are treated as n paired differences; the pooled two-sample test worked through on the later slides has N_W + N_C - 2 degrees of freedom).
The p-value is the probability, under the null distribution, of observing a value as extreme as or more extreme than the one calculated from the data (t*).
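
The idea of a null distribution can be checked by simulation. The sketch below is an illustration under stated assumptions (normal data, equal variances, two independent groups of six, and the hypothetical names `n_w`, `n_c`, `t_null`). It generates many data sets in which μ_1 = μ_2 and compares the resulting pooled two-sample t-statistics with the t-distribution on n_w + n_c - 2 degrees of freedom.

```r
set.seed(42)
n_w   <- 6       # hypothetical group sizes
n_c   <- 6
n_sim <- 10000   # number of simulated experiments

# t-statistics computed from data for which the null hypothesis is true
t_null <- replicate(n_sim, {
  w   <- rnorm(n_w)   # both groups share the same mean and variance
  ctl <- rnorm(n_c)
  unname(t.test(w, ctl, var.equal = TRUE)$statistic)
})

# Simulated quantiles next to the theoretical t quantiles
probs <- c(0.025, 0.5, 0.975)
rbind(simulated   = quantile(t_null, probs),
      theoretical = qt(probs, df = n_w + n_c - 2))

# Two-tailed p-value for an observed statistic, computed both ways
t_star <- 0.7974
p_sim  <- mean(abs(t_null) >= t_star)
p_theo <- 2 * pt(t_star, df = n_w + n_c - 2, lower.tail = FALSE)
c(simulated = p_sim, theoretical = p_theo)
```

The simulated p-value is simply the fraction of null t-statistics at least as extreme as t*, which is the definition given on the slide.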

1-3-2004, slide 3: t-distribution
The number of experimental replicates affects the precision at two levels:
1. Everything else being equal, an increase in sample size increases t*.
2. Everything else being equal, an increase in sample size “shrinks” the null distribution.
Suppose that t* = 3. What is the difference in p-values depending on the sample size alone?
[Figure: t-distribution densities for four sample sizes, with two-tailed p-values for t* = 3: df = 1, p-value = 0.2; df = 2, p-value = 0.1; df = 10, p-value = 0.01; df = 100, p-value = 0.003]
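
Both effects of sample size can be seen together in a small calculation. This is a sketch under assumptions: the observed difference in means (`diff_means`) and the pooled standard deviation (`s`) are hypothetical values held fixed, and the two groups are taken to have equal size n, so the pooled two-sample statistic reduces to diff_means / (s * sqrt(2/n)).

```r
# Hypothetical values, held fixed while only the sample size changes
diff_means <- 0.5                  # observed difference in mean log2 expression
s          <- 0.45                 # pooled standard deviation
n          <- c(2, 3, 6, 12, 50)   # replicates per condition

t_star  <- diff_means / (s * sqrt(2 / n))   # level 1: t* grows with n
df      <- 2 * n - 2                        # level 2: more df, narrower null distribution
p_value <- 2 * pt(t_star, df, lower.tail = FALSE)

data.frame(n, df, t_star = round(t_star, 2), p_value = signif(p_value, 3))
```

The same difference and the same variability give a steadily smaller p-value as n grows, because the statistic increases and the tails of the reference distribution thin out at the same time.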

1-3-2004, slide 4: t-distribution
# Plotting t-distributions with different degrees of freedom
x <- seq(-5, 5, 0.1)
plot(x, dt(x, 100), type = "l", col = "black", lwd = 2, ylab = "")
lines(x, dt(x, 10), col = "green", lwd = 2)
lines(x, dt(x, 2), col = "blue", lwd = 2)
lines(x, dt(x, 1), col = "red", lwd = 2)
legend(2, y = 0.4, c("df = 1", "df = 2", "df = 10", "df = 100"), col = c("red", "blue", "green", "black"), lty = rep("solid", 4), lwd = 2)
# Calculating two-tailed p-values for t* = 3
> 2*pt(3,100,lower.tail=FALSE)
[1] 0.003407915
> 2*pt(3,10,lower.tail=FALSE)
[1] 0.01334366
> 2*pt(3,2,lower.tail=FALSE)
[1] 0.09546597
> 2*pt(3,1,lower.tail=FALSE)
[1] 0.2048328

1-3-2004, slide 5: Performing t-test
# Load the example data set from the course URL
> load(url("http://eh3.uc.edu/SimpleData.RData"))
> SimpleData[1,]
     Name         ID W1 C1 W2 C2 W3  C3 C4 W4 C5  W5  C6  W6
1 no name Rn30000100 85 57 91 71 67 111 72 86 88 108 124 171
# Work on the log2 scale
> LSimpleData<-SimpleData
> LSimpleData[,3:14]<-log(SimpleData[,3:14],base=2)
> LSimpleData[1,]
     Name         ID       W1      C1       W2       C2       W3       C3
1 no name Rn30000100 6.409391 5.83289 6.507795 6.149747 6.066089 6.794416
        C4       W4       C5       W5       C6       W6
1 6.169925 6.426265 6.459432 6.754888 6.954196 7.417853
# Column indices of the W and C measurements
> grep("W",dimnames(SimpleData)[[2]])
[1]  3  5  7 10 12 14
> grep("C",dimnames(SimpleData)[[2]])
[1]  4  6  8  9 11 13
> W<-grep("W",dimnames(SimpleData)[[2]])
> C<-grep("C",dimnames(SimpleData)[[2]])
# Two-sample t-test with pooled variance for the first gene
> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)
        Two Sample t-test
data:  LSimpleData[1, W] and LSimpleData[1, C]
t = 0.7974, df = 10, p-value = 0.4437
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3653337  0.7725582
sample estimates:
mean of x mean of y
 6.597047  6.393434

1-3-2004, slide 6: Performing t-test
# Group means
> MW<-mean(t(LSimpleData[1,W]))
> MW
[1] 6.597047
> MC<-mean(t(LSimpleData[1,C]))
> MC
[1] 6.393434
# Group variances
> VW<-var(t(LSimpleData[1,W]))
> VW
          1
1 0.2105798
> VC<-var(t(LSimpleData[1,C]))
> VC
          1
1 0.1806291
# Number of non-missing measurements in each group
> NW<-sum(!is.na(LSimpleData[1,W]))
> NW
[1] 6
> NC<-sum(!is.na(LSimpleData[1,C]))
> NC
[1] 6
# Pooled variance and degrees of freedom
> VWC<-(((NW-1)*VW)+((NC-1)*VC))/(NC+NW-2)
> VWC
          1
1 0.1956045
> DF<-NW+NC-2
> DF
[1] 10
# t-statistic and two-tailed p-value
> TStat<-abs(MW-MC)/((VWC*((1/NW)+(1/NC)))^0.5)
> TStat
          1
1 0.7973981
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> TPvalue
          1
1 0.4437415
# The built-in test gives the same result
> t.test(LSimpleData[1,W],LSimpleData[1,C],var.equal=TRUE)
        Two Sample t-test
data:  LSimpleData[1, W] and LSimpleData[1, C]
t = 0.7974, df = 10, p-value = 0.4437
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.3653337  0.7725582
sample estimates:
mean of x mean of y
 6.597047  6.393434
source("http://eh3.uc.edu/RSimpleTTest.R")
source("http://eh3.uc.edu/MySimpleTTest.R")
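
The hand calculation can be wrapped in a small function. The sketch below is self-contained: it retypes the twelve raw values for the first gene from the SimpleData[1,] listing instead of downloading the data. The function name `pooled_t_test` and the variable names are hypothetical; this is not the content of the sourced scripts, which is not shown in the slides.

```r
# Raw intensities for gene Rn30000100, copied from the SimpleData[1,] listing
w   <- log2(c(85, 91, 67, 86, 108, 171))   # W1..W6
ctl <- log2(c(57, 71, 111, 72, 88, 124))   # C1..C6

# Two-sample t-test with pooled variance (hypothetical helper)
pooled_t_test <- function(x, y) {
  x <- x[!is.na(x)]                 # drop missing measurements
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_var <- ((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2)
  t_stat <- (mean(x) - mean(y)) / sqrt(pooled_var * (1 / nx + 1 / ny))
  df <- nx + ny - 2
  c(t = t_stat, df = df,
    p.value = 2 * pt(abs(t_stat), df, lower.tail = FALSE))
}

res <- pooled_t_test(w, ctl)
round(res, 4)   # t = 0.7974, df = 10, p.value = 0.4437, as on the slides

# Cross-check against the built-in test
t.test(w, ctl, var.equal = TRUE)$p.value
```

Working with plain numeric vectors avoids the one-by-one matrices that `var(t(...))` returns in the console session, so the results print as ordinary numbers.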

1-3-2004, slide 7: Statistical Inference and Statistical Significance – P-value
Statistical inference consists of drawing conclusions about the measured phenomenon (e.g. gene expression) in terms of probabilistic statements based on observed data. The p-value is one way of doing this.
The p-value is NOT the probability of the null hypothesis being true.
Rigorous interpretation of the p-value is tricky. It was introduced to measure the level of evidence against the “null hypothesis” or, better to say, in favor of a “positive experimental finding”. In this context a p-value of 0.0001 could be interpreted as stronger evidence than a p-value of 0.01.
Establishing statistical significance (is a difference in expression level statistically significant or not?) requires that we establish “cut-off” points for our “measure of significance” (the p-value).
For various historical reasons the cut-off 0.05 is generally used to establish “statistical significance”. It is a rather arbitrary cut-off, but it is treated as a gold standard.
Originally the p-value was introduced as a descriptive measure to be used in conjunction with other criteria to judge the strength of evidence one way or another.

1-3-2004, slide 8: Statistical Inference and Statistical Significance – Hypothesis Testing
The 5% cut-off point comes from the hypothesis-testing world.
In this world the exact magnitude of the p-value does not matter. It only matters whether it is smaller than the pre-specified statistical significance cut-off (α).
The null hypothesis is rejected in favor of the alternative hypothesis at a significance level of α = 0.05 if p-value < 0.05.
A Type I error is committed when the null hypothesis is rejected although it is true.
A Type II error is committed when the null hypothesis is not rejected although it is false.
By following this “decision-making scheme” you will on average falsely reject 5% of the null hypotheses that are actually true.
If such a scheme is adopted to identify differentially expressed genes on a microarray, on average 5% of non-differentially expressed genes will be falsely implicated as differentially expressed.
A family-wise Type I error is committed if any of a set of null hypotheses is falsely rejected.
Establishing statistical significance is a necessary but not sufficient step in assuring the “reproducibility” of a scientific finding. This is an important point that will be discussed further when we start talking about issues in experimental design.
The other essential ingredient is a “representative sample” from the “population of interest”. This is still a murky point in molecular biology experimentation.
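
The 5% false-rejection rate and the family-wise error can both be illustrated numerically. The sketch below rests on assumptions: 5000 hypothetical genes, none of them differentially expressed, six replicates per condition, and normally distributed measurements. The family-wise formula 1 - (1 - α)^k additionally assumes that the k tests are independent, which is generally not true for genes on a real microarray.

```r
set.seed(123)
m     <- 5000   # hypothetical number of genes, all with mu_1 = mu_2
n     <- 6      # replicates per condition
alpha <- 0.05

# One pooled two-sample t-test per gene, all null hypotheses true
p_values <- replicate(m, t.test(rnorm(n), rnorm(n), var.equal = TRUE)$p.value)

false_positive_rate <- mean(p_values < alpha)
false_positive_rate    # close to alpha
sum(p_values < alpha)  # number of genes falsely called differentially expressed

# Probability of at least one false rejection among k independent tests
k    <- c(1, 10, 100, 1000)
fwer <- 1 - (1 - alpha)^k
data.frame(k, fwer = round(fwer, 4))
```

Even with no truly changed genes, a per-gene cut-off of 0.05 produces a few hundred "significant" genes here, and the chance of at least one false rejection approaches 1 once more than a few dozen genes are tested.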
