Estimating parameters from data
Gil McVean, Department of Statistics
Tuesday 3rd November 2009

Questions to ask…
– How can I estimate model parameters from data?
– What should I worry about when choosing between estimators?
– Is there some optimal way of estimating parameters from data?
– How can I compare different parameter values?
– How should I make statements about certainty regarding estimates and hypotheses?

Motivating example I
– I conduct an experiment where I measure the weight of 100 mice that were exposed to a normal diet and 50 mice exposed to a high-energy diet
– I want to estimate the expected gain in weight due to the change in diet
[Figure: weight distributions for the normal and high-calorie diet groups]

Motivating example II
– I observe the co-segregation of two traits (e.g. a visible trait and a genetic marker) in a cross
– I want to estimate the recombination rate between the two markers

Bateson and Punnett experiment
Phenotype and genotype     Observed    Expected from 9:3:3:1 ratio
Purple, long (P_L_)
Purple, round (P_ll)       21          72
Red, long (ppL_)           21          72
Red, round (ppll)          55          24

Parameter estimation
– We can formulate most questions in statistics in terms of making statements about underlying parameters
– We want to devise a framework for estimating those parameters and making statements about our certainty
– In this lecture we will look at several different approaches to making such statements:
  – Moment estimators
  – Likelihood
  – Bayesian estimation

Moment estimation
– You have already come across one way of estimating parameter values – moment methods
– In such techniques, parameter values are found that match sample moments (mean, variance, etc.) to their expected values under the model
– E.g. for random variables X1, X2, … sampled from a N(μ, σ²) distribution, the method of moments sets μ̂ equal to the sample mean and σ̂² equal to the sample variance (see the sketch below)
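A minimal sketch of moment estimation for the normal model; the simulated data, sample size and true parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100)  # illustrative data from N(5, 4)

# Method of moments for N(mu, sigma^2): match the first two sample moments
mu_hat = x.mean()                        # sample mean estimates mu
sigma2_hat = np.mean((x - mu_hat) ** 2)  # sample variance (divisor n) estimates sigma^2

print(mu_hat, sigma2_hat)
```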

Example: fitting a gamma distribution
– The gamma distribution is parameterised by a shape parameter, α, and a rate (inverse scale) parameter, β
– The mean of the distribution is α/β and the variance is α/β²
– We can fit a gamma distribution by looking at the first two sample moments
[Figure: histogram of alkaline phosphatase measurements in 2019 mice]
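A sketch of the moment fit under the shape/rate parameterisation above; the simulated data stand in for the alkaline phosphatase measurements and the true parameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=2019)  # stand-in data, true alpha=3, beta=2

m = x.mean()   # estimates alpha / beta
v = x.var()    # estimates alpha / beta^2

beta_hat = m / v           # solve the two moment equations
alpha_hat = m * beta_hat   # alpha = mean * beta

print(alpha_hat, beta_hat)
```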

Bias
– Although the moment method looks sensible, it can lead to biased estimators
– In the previous example, estimates of both parameters are upwardly biased
– Bias is measured by the difference between the expected estimate and the truth
– However, bias is not the only thing to worry about
  – For example, the value of the first observation is an unbiased estimator of the mean for a Normal distribution. However, it is a rubbish estimator
– We also need to worry about the variance of an estimator

Example: estimating the population mutation rate
– In population genetics, a parameter of interest is the population-scaled mutation rate
– There are two common estimators for this parameter (a sketch of both is given below):
  – The average number of differences between two sequences
  – The total number of polymorphic sites in the sample divided by a constant that is approximately the log of the sample size
– Which is better? The first estimator has larger variance than the second – suggesting that it is an inferior estimator
– It is actually worse than this – it is not even guaranteed to converge on the truth as the sample size gets infinitely large
  – A property called consistency
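A minimal sketch of the two estimators, assuming the data are a binary (0/1) haplotype matrix with one row per sequence; the second estimator is the one commonly attributed to Watterson:

```python
import numpy as np

def theta_pairwise(haps):
    """Average number of pairwise differences between sequences."""
    n = haps.shape[0]
    diffs = [np.sum(haps[i] != haps[j]) for i in range(n) for j in range(i + 1, n)]
    return np.mean(diffs)

def theta_watterson(haps):
    """Number of polymorphic sites divided by the harmonic number a_n ~ log(n)."""
    n = haps.shape[0]
    s = np.sum(haps.min(axis=0) != haps.max(axis=0))  # count of segregating sites
    a_n = np.sum(1.0 / np.arange(1, n))
    return s / a_n

# Toy example: 4 sequences, 6 sites
haps = np.array([[0, 1, 0, 0, 1, 0],
                 [0, 1, 1, 0, 0, 0],
                 [1, 0, 0, 0, 0, 0],
                 [0, 1, 0, 1, 0, 0]])
print(theta_pairwise(haps), theta_watterson(haps))
```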

The bias-variance trade-off
– Some estimators may be biased
– Some estimators may have large variance
– Which is better?
– A simple way of combining both metrics is to consider the mean-squared error of an estimator: MSE(θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + Bias(θ̂)²

Example
– Consider two ways of estimating the variance of a Normal distribution from the sample: dividing the sum of squared deviations from the sample mean by n (the first estimator) or by n − 1 (the second)
– The second estimator is unbiased, but the first estimator has lower MSE
– Actually, there is a third estimator – dividing by n + 1 – which is even more biased than the first, but which has even lower MSE (see the simulation sketch below)
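A small simulation sketch comparing the three divisors; the sample size, true variance and number of replicates are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 100_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)  # sum of squared deviations

for divisor in (n - 1, n, n + 1):
    est = ss / divisor
    bias = est.mean() - sigma2
    mse = np.mean((est - sigma2) ** 2)
    print(f"divisor {divisor}: bias {bias:+.3f}, MSE {mse:.3f}")
```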

Problems with moment estimation
– It is not always possible to exactly match sample moments with their expectation
– It is not clear when using moment methods how much of the information in the data about the parameters is being used
  – Often not much…
– Why should MSE be the best way of measuring the value of an estimator?

Is there an optimal way to estimate parameters?
– For any model, the maximum information about model parameters is obtained by considering the likelihood function
– The likelihood function is proportional to the probability of observing the data given a specified parameter value
– One natural choice for point estimation is the maximum likelihood estimate: the parameter values that maximise the probability of observing the data
– The maximum likelihood estimate (mle) has some useful properties (though it is not always optimal in every sense)

An intuitive view on likelihood

An example
– Suppose we have data generated from a Poisson distribution, and we want to estimate the parameter of the distribution
– The probability of observing a particular value is given by the Poisson probability mass function
– If we have observed a series of iid Poisson RVs, we obtain the joint likelihood by multiplying the individual probabilities together (see below)
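In standard notation (writing λ for the Poisson parameter), the probability mass function and joint likelihood referred to above are:

```latex
P(X = x \mid \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad
L(\lambda) = \prod_{i=1}^{n} \frac{e^{-\lambda}\lambda^{x_i}}{x_i!}
\;\propto\; e^{-n\lambda}\,\lambda^{\sum_i x_i}.
```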

Comments
– Note that in the likelihood function the factorials have disappeared. This is because they contribute a constant that does not influence the relative likelihood of different values of the parameter
– It is usual to work with the log-likelihood rather than the likelihood. Note that maximising the log-likelihood is equivalent to maximising the likelihood
– We can find the mle of the parameter analytically: take the natural log of the likelihood function, then find where the derivative of the log-likelihood is zero
– Note that here the mle is the same as the moment estimator
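Spelling out that calculation for the Poisson likelihood above:

```latex
\ell(\lambda) = \log L(\lambda) = -n\lambda + \Big(\sum_{i=1}^{n} x_i\Big)\log\lambda + \text{const},
\qquad
\frac{d\ell}{d\lambda} = -n + \frac{\sum_i x_i}{\lambda} = 0
\;\Longrightarrow\;
\hat{\lambda} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}.
```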

Sufficient statistics
– In this example we could write the likelihood as a function of a simple summary of the data – the mean
– This is an example of a sufficient statistic. These are statistics that contain all the information in the data about the parameter(s) under the specified model
– For example, suppose we have a series of iid normal RVs: the sample mean and the mean square are jointly sufficient for the mean and variance
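To see this, the normal log-likelihood depends on the data only through the sums of the observations and of their squares (equivalently, the mean and the mean square):

```latex
\ell(\mu,\sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
- \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
= -\frac{n}{2}\log(2\pi\sigma^2)
- \frac{1}{2\sigma^2}\Big(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\Big).
```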

Properties of the maximum likelihood estimate
– The maximum likelihood estimate can be found either analytically or by numerical maximisation
– The mle is consistent, in that it converges to the truth as the sample size gets infinitely large
– The mle is asymptotically efficient, in that it achieves the minimum possible variance (the Cramér-Rao lower bound) as n→∞
– However, the mle is often biased for finite sample sizes
  – For example, the mle for the variance parameter of a normal distribution is the sample variance with divisor n, which is biased downwards

Comparing parameter estimates
– Obtaining a point estimate of a parameter is just one problem in statistical inference
– We might also like to ask how good different parameter values are
– One way of comparing parameters is through relative likelihood
– For example, suppose we observe counts of 12, 22, 14 and 8 from a Poisson process. The maximum likelihood estimate is 14
– The relative likelihood of a value λ is given by the ratio L(λ)/L(λ̂), the likelihood at λ divided by the likelihood at the mle (a sketch computing this is given below)
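A minimal sketch of the relative likelihood and relative log-likelihood curves for these counts; the grid of λ values is an arbitrary choice:

```python
import numpy as np
from scipy.stats import poisson

counts = np.array([12, 22, 14, 8])
lam_hat = counts.mean()  # mle = 14

lam_grid = np.linspace(8, 22, 200)
loglik = np.array([poisson.logpmf(counts, lam).sum() for lam in lam_grid])
rel_loglik = loglik - poisson.logpmf(counts, lam_hat).sum()
rel_lik = np.exp(rel_loglik)  # relative likelihood L(lam) / L(lam_hat)

print(lam_hat, rel_lik.max())  # relative likelihood is close to 1 near the mle
```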

Using relative likelihood
– The relative likelihood and log-likelihood surfaces for the Poisson example are shown in the accompanying figure

Interval estimation
– In most cases, the chance that the point estimate you obtain for a parameter is actually the correct one is zero
– We can generalise the idea of point estimation to interval estimation
– Here, rather than estimating a single value of a parameter, we estimate a region of parameter space
  – We make the inference that the parameter of interest lies within the defined region
– The coverage of an interval estimator is the fraction of times the parameter actually lies within the interval
– The idea of interval estimation is intimately linked to the notion of confidence intervals

Example
– Suppose I'm interested in estimating the mean of a normal distribution with known variance of 1 from a sample of 10 observations
– I construct an interval estimator of the form (x̄ − a, x̄ + a)
– The chart below shows how the coverage properties of this estimator vary with a
– If I choose a to be 0.62, I would have coverage of 95%
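A simulation sketch of the coverage of the interval (x̄ − a, x̄ + a) for a few values of a; the choice μ = 0 is arbitrary, since coverage does not depend on it here:

```python
import numpy as np

rng = np.random.default_rng(4)
n, mu, reps = 10, 0.0, 100_000

xbar = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)

for a in (0.3, 0.5, 0.62, 0.8):
    coverage = np.mean((xbar - a < mu) & (mu < xbar + a))
    print(f"a = {a:.2f}: coverage = {coverage:.3f}")  # a = 0.62 gives about 95%
```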

Confidence intervals
– It is a short step from here to the notion of confidence intervals
– We find an interval estimator of the parameter that, for any value of the parameter that might be possible, has the desired coverage properties
– We then apply this interval estimator to our observed data to get a confidence interval
– We can guarantee that among repeat performances of the same experiment the true value of the parameter would be in this interval 95% of the time
– We cannot say "There is a 95% chance of the true parameter being in this interval"

Example – confidence intervals for the normal distribution
– Creating confidence intervals for the mean of a normal distribution is relatively easy because the coverage properties of interval estimators do not depend on the mean (for a fixed variance)
– For example, the interval estimator x̄ ± 1.96σ/√n has 95% coverage for any mean
– As you'll see later, there is an intimate link between confidence intervals and hypothesis testing

Example: confidence intervals for the exponential distribution
– For most distributions, the coverage properties of an estimator will depend on the true underlying parameter
– However, we can make use of the CLT to make confidence intervals for means
– For example, for the exponential distribution with different means, the graph shows the coverage properties of the CLT-based interval estimator for the mean (n = 100); a simulation sketch is given below
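A simulation sketch, assuming the CLT-based interval x̄ ± 1.96·x̄/√n (which uses the fact that an exponential's standard deviation equals its mean); the interval on the original slide may have been constructed slightly differently:

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 100, 50_000

for mean in (0.5, 1.0, 2.0, 5.0):
    x = rng.exponential(scale=mean, size=(reps, n))
    xbar = x.mean(axis=1)
    half = 1.96 * xbar / np.sqrt(n)   # estimated sd of the sample mean is xbar / sqrt(n)
    coverage = np.mean((xbar - half < mean) & (mean < xbar + half))
    print(f"true mean {mean}: coverage = {coverage:.3f}")
```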

Confidence intervals and likelihood
– Thanks to the CLT, there is another useful result that allows us to define confidence intervals from the log-likelihood surface
– Specifically, the set of parameter values for which the log-likelihood is not more than 1.92 less than its maximum defines an approximate 95% confidence interval
  – In the limit of large sample size the likelihood-ratio test statistic is approximately chi-squared distributed under the null (1.92 is half of 3.84, the 95% point of the χ²₁ distribution)
– This is a very useful result, but shouldn't be assumed to hold
  – i.e. check with simulation (a sketch for the Poisson example is given below)
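A sketch of this likelihood-based interval for the Poisson counts used earlier, using a simple grid search over λ (the grid range is an arbitrary choice):

```python
import numpy as np
from scipy.stats import poisson

counts = np.array([12, 22, 14, 8])
lam_grid = np.linspace(5, 30, 2001)
loglik = np.array([poisson.logpmf(counts, lam).sum() for lam in lam_grid])

# Keep values whose log-likelihood is within 1.92 of the maximum
keep = loglik >= loglik.max() - 1.92
print(f"approx 95% CI for lambda: ({lam_grid[keep].min():.2f}, {lam_grid[keep].max():.2f})")
```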

Bayesian estimators
– As you may notice, the notion of a confidence interval is very hard to grasp and has remarkably little connection to the data that you have collected
– It seems much more natural to attempt to make statements about which parameter values are likely given the data you have collected
– To put this on a rigorous probabilistic footing, we want to make statements about the probability (density) of any particular parameter value given our data
– We use Bayes' theorem: the posterior is given by the likelihood times the prior, divided by a normalising constant (see below)
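Written out, with the roles of each term labelled as on the slide:

```latex
\underbrace{\pi(\theta \mid x)}_{\text{posterior}}
= \frac{\overbrace{P(x \mid \theta)}^{\text{likelihood}}\;
        \overbrace{\pi(\theta)}^{\text{prior}}}
       {\underbrace{\int P(x \mid \theta')\,\pi(\theta')\,d\theta'}_{\text{normalising constant}}}.
```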

Bayes estimators
– The single most important conceptual difference between Bayesian statistics and frequentist statistics is the notion that the parameters you are interested in are themselves random variables
– This notion is encapsulated in the use of a subjective prior for your parameters
– Remember that to construct a confidence interval we have to define the set of possible parameter values; a prior does the same thing, but also gives a weight to different values

Example: coin tossing
– I toss a coin twice and observe two heads
– I want to perform inference about the probability of obtaining a head on a single throw for the coin in question
– The point estimate/mle for the probability is 1.0 – yet I have a very strong prior belief that the answer is 0.5
– Bayesian statistics forces the researcher to be explicit about prior beliefs but, in return, can be very specific about what information has been gained by performing the experiment
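One standard way to formalise this (the choice of a Beta prior here is an illustrative assumption, not necessarily what the original slides used): with a Beta(a, b) prior on the head probability p and k heads in n tosses, the posterior is again a Beta distribution:

```latex
\pi(p \mid k) \propto p^{k}(1-p)^{n-k} \cdot p^{a-1}(1-p)^{b-1}
\;\Longrightarrow\; p \mid k \sim \mathrm{Beta}(a + k,\; b + n - k).
```

So with a uniform Beta(1, 1) prior and two heads in two tosses, the posterior is Beta(3, 1), with mean 3/4 rather than the mle of 1.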

The posterior
– Bayesian inference about parameters is contained in the posterior distribution
– The posterior can be summarised in various ways, for example by the posterior mean or a credible interval, and compared with the prior
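A sketch of these summaries for the coin-tossing example above, under the assumed Beta(1, 1) prior:

```python
from scipy.stats import beta

a, b = 1, 1          # uniform prior on the head probability
heads, tosses = 2, 2
post = beta(a + heads, b + tosses - heads)  # Beta(3, 1) posterior

print("posterior mean:", post.mean())                 # 0.75
print("95% credible interval:", post.interval(0.95))  # central 2.5%-97.5% interval
```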

Bayesian inference and the notion of shrinkage
– The notion of shrinkage is that you can obtain better estimates by assuming a certain degree of similarity among the things you want to estimate
– Practically, this means two things:
  – Borrowing information across observations
  – Penalising inferences that are very different from anything else
– The notion of shrinkage is implicit in the use of priors in Bayesian statistics
– There are also forms of frequentist inference where shrinkage is used
  – But NOT MLE