Statistical Genomics
Lecture 4: Statistical inference
Zhiwu Zhang
Washington State University
Administration
Homework 1, due Wednesday, Feb 3, 3:10 PM
Outline: χ² test on contingency table; empirical null distribution; χ² test on variance; t test; hypothesis test; two types of error; power
Observed and expected frequency

Observed:
                 Transgenic   Non-transgenic   SUM
Herbicide        35           5                40
No herbicide
SUM

Expected:
                 Transgenic   Non-transgenic   SUM
Herbicide
No herbicide
SUM
Approximate Distributions
Poisson distribution: Mean = Var = Expected
(Observed - Expected) / sqrt(Expected) ~ N(0,1)
SUM (Observed - Expected)^2 / Expected ~ χ²(df)
df = number of independent cells
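As a quick check of the first approximation, the sketch below simulates Poisson counts and standardizes them. The expected count of 28 and the number of draws are illustrative assumptions, not values taken from this slide.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
expected <- 28         # ASSUMED expected count (illustrative)
n_draws  <- 100000     # ASSUMED number of simulated cells

# Poisson counts have mean = variance = expected
observed <- rpois(n_draws, lambda = expected)

# Standardize: (Observed - Expected) / sqrt(Expected)
z <- (observed - expected) / sqrt(expected)

# If the approximation holds, z has mean near 0 and variance near 1,
# so the mean of z^2 is near 1, the mean of a chi-square with 1 df.
c(mean = mean(z), var = var(z), mean_z2 = mean(z^2))
```

The approximation tends to be worse when the expected count is small, which is why small expected cells are usually treated with caution.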
Observed and expected frequency

Observed:
                 Transgenic   Non-transgenic   SUM
Herbicide        35           5                40
No herbicide
SUM

Expected:
                 Transgenic   Non-transgenic   SUM
Herbicide        28           12
No herbicide     42           18
SUM

χ² = 49/28 + 49/12 + 49/42 + 49/18 = 9.72
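The statistic can be reproduced in R along these lines. Only the herbicide row (35, 5) survives in the observed table; the second row (35, 25) is an assumption, chosen so that the expected counts come out as the denominators 28, 12, 42 and 18 in the sum.

```r
# Observed counts. Row 1 is from the slide; row 2 is ASSUMED (see lead-in).
obs <- matrix(c(35,  5,
                35, 25),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("Herbicide", "No herbicide"),
                              c("Transgenic", "Non-transgenic")))

# Expected count of a cell = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic, degrees of freedom, and upper-tail P value
stat <- sum((obs - expected)^2 / expected)
df   <- (nrow(obs) - 1) * (ncol(obs) - 1)
p    <- pchisq(stat, df, lower.tail = FALSE)

round(stat, 2)   # 9.72, matching the sum on the slide
```

Calling chisq.test(obs, correct = FALSE) should give the same statistic; the default continuity correction for 2 x 2 tables would give a smaller one.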
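Distribution of χ²(1)
    k=10000   # number of random draws; k is not defined on the slide, 10000 is an assumed value
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    x=rchisq(k,1)
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.99)
The observed statistic, 9.72, lies above the simulated 99% percentile (6.97), so P < 1%.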
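For comparison with the simulated percentile, the same quantities are available in closed form through qchisq and pchisq. A minimal sketch, using the observed statistic 9.72 from the slide:

```r
# Theoretical 99% percentile of the chi-square distribution with 1 df
q99 <- qchisq(0.99, df = 1)

# Upper-tail probability of the observed statistic
p_obs <- pchisq(9.72, df = 1, lower.tail = FALSE)

c(q99 = q99, p_obs = p_obs)
```

A simulated percentile varies from run to run around the theoretical value, more so when the number of draws is small.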
Tests on samples
A sample has a mean of 103.6 and a variance of 27.82. The sample has 10 observations.
Q1: What is the probability that the sample was from a normal distribution with a variance of 25?
Q2: What is the probability that the sample was from a normal distribution with a mean of 100?
Q1: distribution with variance of 25
Empirical solution:
1. Sample ten observations from a normal distribution with a variance of 25.
2. Calculate the variance of the sample.
3. Repeat the sampling to get the null distribution of the sample variances.
4. Find the percentile of the observed variance on the null distribution.
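    x=replicate(10000, {
      s=rnorm(10,0,5)   # ten observations from a normal distribution with sd 5 (variance 25)
      var(s)            # sample variance of this draw
    })
    length(x[x>27.82])/10000   # proportion of simulated variances above the observed 27.82
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.75)
The observed variance, 27.82, lies below the 75% percentile of the null distribution (31.6), so P > 25%.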
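Q1: distribution with variance of 25
Theoretical solution: under H0, (n-1) * (sample variance) / 25 follows χ²(n-1).
    v=(10-1)*27.82/25
    1-pchisq(10.026,9)
Compare the theoretical P value with the one from the empirical null distribution.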
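The theoretical and empirical answers to Q1 can be placed side by side as follows. This is a sketch under the values used on the slides (n = 10, observed variance 27.82, null variance 25); the seed and the number of replicates are arbitrary assumptions.

```r
n      <- 10      # sample size
s2     <- 27.82   # observed sample variance
sigma2 <- 25      # variance under the null hypothesis

# Theoretical: (n - 1) * s2 / sigma2 follows chi-square(n - 1) under H0
v        <- (n - 1) * s2 / sigma2
p_theory <- pchisq(v, df = n - 1, lower.tail = FALSE)

# Empirical: simulate the null distribution of the sample variance
set.seed(1)       # arbitrary seed, for reproducibility only
null_var    <- replicate(10000, var(rnorm(n, mean = 0, sd = sqrt(sigma2))))
p_empirical <- mean(null_var > s2)

c(v = v, theoretical = p_theory, empirical = p_empirical)
```

The two P values should agree to within simulation error, which shrinks as the number of replicates grows.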
Q2: distribution with mean of 100
Empirical solution:
1. Sample ten observations from N(100, 25).
2. Calculate the mean.
3. Repeat the process 10,000 times.
4. Use the 10,000 means as the null distribution.
5. Determine the percentile of the testing mean (103.6) on the null distribution.
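Q2: distribution with mean of 100
    x=replicate(10000, {
      s=rnorm(10,100,5)   # ten observations from N(100, 25), i.e. sd 5
      mean(s)             # sample mean of this draw
    })
    length(x[x>103.6])/10000   # proportion of simulated means above the observed 103.6
    par(mfrow=c(2,2),mar=c(3,4,1,1))
    d=density(x)
    plot(x)
    plot(d)
    hist(x)
    plot(ecdf(x))
    quantile(x,.95)
    quantile(x,.99)
The plots mark the observed mean and the 95% and 99% percentiles of the null distribution. The observed mean, 103.6, lies above the 95% percentile (102.6), so P < 5%.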
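Because the null distribution here is normal with a known standard deviation, the simulated percentiles can be checked against the exact distribution of the sample mean, N(100, 25/10). The sketch below is offered for comparison only and is not part of the slide; 103.6 is the observed mean used in the simulation.

```r
mu0   <- 100              # mean under the null hypothesis
sigma <- 5                # standard deviation under the null hypothesis
n     <- 10               # sample size
se    <- sigma / sqrt(n)  # standard deviation of the sample mean

# Exact 95% and 99% percentiles of the sample mean under H0
q <- qnorm(c(0.95, 0.99), mean = mu0, sd = se)

# Exact upper-tail probability of the observed mean
p <- pnorm(103.6, mean = mu0, sd = se, lower.tail = FALSE)

round(q, 1)
p
```

The simulated quantile(x,.95) and quantile(x,.99) should land close to these exact values.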
t test
    T=( )/(5/sqrt(10))
    P=1-pt(T,9)
    c(T,P)
At the 5% threshold, reject the hypothesis that the sample came from a distribution with a mean of 100.
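A minimal sketch of the same calculation as a function. The numerator of T is blank on the slide; the sketch assumes it is the observed mean minus the hypothesized mean (103.6 - 100, values taken from the earlier slides), and it keeps the slide's denominator of 5/sqrt(10). The function name t_upper is hypothetical.

```r
# One-sided, one-sample t statistic and P value from summary numbers.
# xbar: observed mean, mu0: hypothesized mean,
# s: standard deviation used in the denominator, n: sample size.
t_upper <- function(xbar, mu0, s, n) {
  t_stat <- (xbar - mu0) / (s / sqrt(n))
  p      <- 1 - pt(t_stat, df = n - 1)   # upper tail, n - 1 degrees of freedom
  c(T = t_stat, P = p)
}

# ASSUMED inputs: 103.6 and 100 come from the earlier slides, not this one
res <- t_upper(xbar = 103.6, mu0 = 100, s = 5, n = 10)
res
```

With the raw observations in a vector s, t.test(s, mu = 100, alternative = "greater") performs the corresponding test using the sample standard deviation.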
F test
Hypothesis test
Null hypothesis (H0): initial assumption
Alternative hypothesis (Ha): opposite to the assumption
1. Find the probability of the observed result under H0.
2. If the probability is too low (e.g. below 5%), reject H0 and accept Ha.
3. Otherwise, accept H0.
Two types of errors and power
Type I error: reject a true H0 (false positive). Its probability is the threshold used, e.g. α = 5%.
Type II error: accept a false H0 (false negative). Its probability is β.
Power: the probability of rejecting a false H0, 1-β.
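Both error rates can be estimated by simulation. The sketch below assumes a one-sided, one-sample t test with n = 10 and standard deviation 5, as in the earlier example; the true mean of 105 used for the power estimate is an arbitrary assumption, and reject_h0 is a hypothetical helper.

```r
set.seed(1)   # arbitrary seed, for reproducibility only

# Draw one sample with the given true mean and report whether
# the test of H0: mean = mu0 rejects at level alpha.
reject_h0 <- function(mu_true, mu0 = 100, n = 10, sd = 5, alpha = 0.05) {
  s <- rnorm(n, mean = mu_true, sd = sd)
  t.test(s, mu = mu0, alternative = "greater")$p.value < alpha
}

n_rep <- 5000   # ASSUMED number of simulated experiments

# H0 true: the rejection rate estimates the Type I error rate (alpha)
type1 <- mean(replicate(n_rep, reject_h0(mu_true = 100)))

# H0 false (true mean ASSUMED to be 105): the rejection rate estimates power
power <- mean(replicate(n_rep, reject_h0(mu_true = 105)))

c(type1 = type1, power = power, type2 = 1 - power)
```

The estimated Type I error rate should sit near the 5% threshold, while power depends on how far the true mean is from 100, on the sample size, and on the variance.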
Summary

Test                    H0 is true              H0 is false
Positive (reject H0)    False positive          Power = 1-β
                        Type I: α
Negative (accept H0)    Specificity = 1-α       False negative
                                                Type II: β
Sum                     100%                    100%
Highlight: χ² test on contingency table; empirical null distribution; χ² test on variance; t test; hypothesis test; two types of error; power