Opinionated Lessons in Statistics by Bill Press: #13 The Yeast Genome



Saccharomyces cerevisiae = baker's yeast

For practice with p- and t-values, let's look at the SacCer genome. We'll use as a data set all of Chromosome 4. Yeast and Human are very close relatives in the great scheme of things.

Chromosome 4: ACACCACACC…(1531894 omitted)…TAGCTTTTGG

Count nucleotides A, C, G, T on SacCer Chr4: take the file SacSerChr4.txt (on the course web site) and count the letters A, C, G, T. You should get:

A = 476750
C = 289341
G = 291352
T = 474471

Are these counts consistent with the model $p_A = p_C = p_G = p_T = 0.25$? (Of course not! But we'll check.) Are they consistent with the model $p_A = p_T \approx 0.31$, $p_C = p_G \approx 0.19$? That's a deeper question! You might think yes, because of A-T and C-G base pairing.
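A minimal sketch of the counting step in MATLAB (assuming, hypothetically, that SacSerChr4.txt contains just the bare A/C/G/T characters of the chromosome, with no header or other lines):

g = upper(fileread('SacSerChr4.txt'));    % read the whole file as one character array
count = [sum(g=='A'); sum(g=='C'); sum(g=='G'); sum(g=='T')]
len = sum(count)                          % should be 1531914 under the assumption above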

As always, the starting point is to write down a model. Bayesian: what is the probability of the model, given the data? Frequentist: what is the probability of a test statistic, under a null hypothesis?

A possible model is multinomial: at each position, an i.i.d. choice of A, C, G, T, with respective probabilities adding up to 1. Almost equivalent (and simpler for now) are 4 separate binomial models: at each position, an i.i.d. choice of A vs. not-A with some probability $p_A$; then do the same separately for $p_C$, $p_G$, $p_T$.

The counts are all so large that the normal approximation is highly accurate:

$\mathrm{Bin}(N, p) \approx \mathrm{Normal}\bigl(Np,\; Np(1-p)\bigr)$

Why? The CLT applies to the binomial because it is a sum of Bernoulli r.v.'s: $N$ tries of an r.v. with values 1 (prob $p$) or 0 (prob $1-p$), for which $\mu = p \times 1 + (1-p) \times 0 = p$ and $\sigma^2 = p(1-p)$.
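How good is that approximation at this N? A quick numerical check (a sketch, not from the slides; it assumes the Statistics Toolbox for binocdf and normcdf):

N = 1531914; p = 0.25;
mu = N*p; sig = sqrt(N*p*(1-p));
k = round(mu + 2*sig);                 % a count two standard deviations above the mean
exact = 1 - binocdf(k, N, p)           % exact binomial upper tail
approx = 1 - normcdf((k - mu)/sig)     % normal approximation; agrees to several decimals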

Can we rule out the silly hypothesis that all p's = 0.25? The test statistic: the value of the observed count under the null hypothesis that it is binomially (or, equivalently, normally) distributed with p = 0.25:

$\mu = 0.25\,N, \qquad \sigma = \sqrt{0.25 \times 0.75 \times N}$

$t = \frac{n - \mu}{\sigma}, \qquad p = 2\,\bigl[1 - P_{\mathrm{Normal}}(|t|)\bigr]$

t-value = number of standard deviations; p-value = tail probability (here, 2-tailed).

     t-value     p-value
A    174.965     ≈ 0
C   -174.715     ≈ 0
G   -170.963     ≈ 0
T    170.713     ≈ 0

The null hypothesis is (totally, infinitely, beyond any possibility of redemption!) ruled out.
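These numbers are easy to reproduce (a sketch; count is assumed ordered A, C, G, T):

count = [476750; 289341; 291352; 474471];
N = sum(count);                    % 1531914 bases in all
mu = 0.25*N;                       % expected count under the null
sig = sqrt(0.25*0.75*N);           % binomial s.d. under the null
tval = (count - mu)./sig           % 174.97, -174.72, -170.96, 170.71
pval = 2*(1 - normcdf(abs(tval)))  % underflows to exactly 0 in double precision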

The not-silly model: A and T occur with identical probabilities, as do C and G. The test statistic: the difference between the A and T (or C and G) counts under the null hypothesis that they have the same p. We estimate p by its maximum likelihood estimate (see the Bernoulli Trials lecture):

$\hat{p}_{AT} = \frac{1}{2}\,\frac{n_A + n_T}{N}, \qquad \hat{p}_{CG} = \frac{1}{2}\,\frac{n_C + n_G}{N}$

Then, since each count is approximately Normal, the difference of two Normals is itself Normal, and the variance of the sum (or difference) is the sum of the variances:

$n_A - n_T \sim \mathrm{Normal}\bigl(0,\; 2\,\hat{p}_{AT}(1-\hat{p}_{AT})\,N\bigr)$

It makes Bayesians nervous to see parameters estimated by MLE, then re-used in estimating other parameters. People do this all the time, and it's usually OK. But Bayesians feel more secure estimating the full posterior probability of all the parameters at once!
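Spelled out, for independent counts under the null:

$\mathrm{Var}(n_A - n_T) = \mathrm{Var}(n_A) + \mathrm{Var}(n_T) = \hat{p}(1-\hat{p})N + \hat{p}(1-\hat{p})N = 2\,\hat{p}(1-\hat{p})N$

which is exactly the sig line in the MATLAB calculation on the next slide.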

Calculate the answer (here in MATLAB):

% with count ordered A, C, G, T (as counted above) and len = 1531914 bases:
dif = [count(4)-count(1); count(2)-count(3)]       % T-A and C-G count differences
pdiff = [0.5*(count(1)+count(4))/len; 0.5*(count(2)+count(3))/len]   % MLE of the shared p for each pair
mu = [0; 0];                                       % expected differences under the null
sig = sqrt(2 .* pdiff .* (1 - pdiff) .* len)       % s.d. of each difference under the null
tval = (dif - mu) ./ sig
pval = 2*(1-normcdf(abs(tval),0,1))                % 2-tailed

With the counts A = 476750, C = 289341, G = 291352, T = 474471, the output is:

dif = -2279  -2011
pdiff = 0.3097  0.1889
sig = 809.3402  685.1154
tval = -2.8159  -2.9353
pval = 0.0049  0.0033

Surprise! The model is ruled out with high significance (small p-value)!

Why? Because we're discovering genes! The fluctuating "units" are indeed not single bases. Rather, they are genes, which individually do not have (or prefer) A=T, C=G. Their placement on one strand or the other is random.
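A toy illustration of that last point (all numbers hypothetical, not fit to yeast): if strand-specific composition comes in gene-sized blocks whose strand assignment is random, the A-T count difference fluctuates far more than the single-base binomial model predicts.

rng(0);                                  % reproducible toy simulation
ntrial = 10000;                          % Monte Carlo realizations
ngene = 400;                             % hypothetical number of genes
skew = 400;                              % hypothetical per-gene excess of A over T on its own strand
strand = 2*(rand(ngene,ntrial)>0.5)-1;   % each gene lands on a random strand, +1 or -1
difAT = skew * sum(strand,1);            % net A-T difference in each realization
std(difAT)                               % ~ skew*sqrt(ngene) = 8000, vs ~809 for single bases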