Statistical Decision Theory


Statistical Decision Theory
Bayes’ theorem:
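For discrete events A and B with Pr(B) > 0:

    Pr(A | B) = Pr(B | A) · Pr(A) / Pr(B)

For probability density functions:

    f(y | x) = f(x | y) · f(y) / f(x),   where   f(x) = ∫ f(x | y) f(y) dy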

The Bayesian “philosophy”
The classical approach (the frequentist’s view): The random sample X = (X1, … , Xn) is assumed to come from a distribution with probability density function f(x; θ), where θ is an unknown but fixed parameter. The sample is investigated through its random-variable properties relating to f(x; θ). The uncertainty about θ is assessed solely on the basis of the sample properties.

The Bayesian approach: The random sample X = (X1, … , Xn) is assumed to come from a distribution with probability density function f(x; θ), where the uncertainty about θ is modelled with a probability distribution (i.e. a p.d.f.) called the prior distribution. The obtained values of the sample, i.e. x = (x1, … , xn), are used to update the information in the prior distribution to a posterior distribution for θ.
Main differences: In the classical approach θ is fixed, while in the Bayesian approach θ is a random variable. In the classical approach the focus is on the sampling distribution of X, while in the Bayesian approach the focus is on the variation of θ.
Bayesian: “What we observe is fixed, what we do not observe is random.” Frequentist: “What we observe is random, what we do not observe is fixed.”

Concepts of the Bayesian framework
Prior density: p(θ)
Likelihood: L(θ; x), “as before”
Posterior density: q(θ | x) = q(θ; x)  (the book uses the second notation)
Relation through Bayes’ theorem:
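    q(θ | x) = p(θ) · L(θ; x) / ∫ p(θ) L(θ; x) dθ

i.e. the posterior is proportional to prior × likelihood; for a discrete parameter the integral is replaced by a sum.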

The textbook writes q(θ; x). Still, the posterior is referred to as the distribution of θ conditional on x.

Decision-theoretic elements
One of a number of actions is to be decided on.
State of nature: a number of states are possible, usually represented by θ.
For each state of nature, the relative desirability of each of the possible actions can be quantified.
Prior information about the different states of nature may be available: a prior distribution of θ.
Data may be available, usually represented by x; it can be used to update the knowledge about the relative desirability of the different actions.

In mathematical notation for this course:
True state of nature: θ, with uncertainty described by the prior p(θ).
Data: x, an observation of X, whose p.d.f. depends on θ (data is thus assumed to be available).
Decision procedure: δ.
Action: δ(x); the decision procedure becomes an action when applied to given data x.
Loss function: LS(θ, δ(x)) measures the loss from taking action δ(x) when θ holds.
Risk function:
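    R(θ, δ) = E[ LS(θ, δ(X)) ] = ∫ LS(θ, δ(x)) f(x; θ) dx

where the expectation is taken over the distribution of the sample X for the given θ.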

Note that the risk function is the expected loss with respect to the joint (simultaneous) distribution of X1, … , Xn. Note also that the risk function belongs to the decision procedure, not to a particular action.
Admissible procedures: A procedure δ1 is inadmissible if there exists another procedure δ2 such that R(θ, δ1) ≥ R(θ, δ2) for all values of θ (with strict inequality for at least one θ). A procedure which is not inadmissible (i.e. no other procedure with a uniformly lower risk function can be found) is said to be admissible.

Minimax procedure: A procedure δ* is a minimax procedure if it minimizes the maximum risk, i.e. θ is taken to be the “worst” possible value, and under that value the procedure that gives the lowest possible risk is chosen. The minimax procedure uses no prior information about θ; thus it is not a Bayesian procedure.
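Formally,

    max_θ R(θ, δ*) = min_δ max_θ R(θ, δ)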

Example
Suppose you are about to decide whether you should buy or rent a new TV.
δ1 = “Buy the TV”, δ2 = “Rent the TV”.
Now assume θ is the mean time (in months) until the TV breaks down for the first time, and let θ take one of three possible values: 6, 12 and 24 months.
The cost of the TV is $500 if you buy it and $30 per month if you rent it. If the TV breaks down after 12 months you will have to replace it at the same cost as the original if you bought it; if you rented it you get a new TV at no cost, provided you continue your contract.
Let X be the time in months until the TV breaks down, and assume X is exponentially distributed with mean θ. A loss function for an ownership horizon of at most 24 months may then be defined as
LS(θ, δ1(X)) = 500 + 500 · H(X – 12), where H is the step function (H(u) = 1 if u > 0, and 0 otherwise), and
LS(θ, δ2(X)) = 30 · 24 = 720.

Now compare the risks for the three possible values of θ. Since X is exponential with mean θ,

    R(θ, δ1) = 500 + 500 · Pr(X > 12) = 500 + 500 · e^(–12/θ)   and   R(θ, δ2) = 720.

θ      R(θ, δ1)   R(θ, δ2)
6      568        720
12     684        720
24     803        720

Clearly the risk for the first procedure increases with θ, while the risk for the second is constant. In searching for the minimax procedure we therefore focus on the largest possible value of θ, where δ2 has the smallest risk ⇒ δ2 is the minimax procedure.
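A minimal Python sketch that reproduces the risk table above (the function names are illustrative):

```python
import math

# Risk of delta_1 ("buy"): expected loss 500 + 500*Pr(X > 12) when X ~ Exp(mean theta)
def risk_buy(theta):
    return 500 + 500 * math.exp(-12 / theta)

# Risk of delta_2 ("rent"): constant cost of 30 dollars/month for 24 months
def risk_rent(theta):
    return 30 * 24

for theta in (6, 12, 24):
    print(theta, round(risk_buy(theta)), risk_rent(theta))
# Output: 6 568 720 / 12 684 720 / 24 803 720
# The worst-case risk (at theta = 24) is 803 for "buy" and 720 for "rent",
# so delta_2 ("rent") is the minimax procedure.
```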

Bayes procedure
Bayes risk: the risk averaged over the prior distribution of the unknown parameter. A Bayes procedure is a procedure that minimizes the Bayes risk.
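In the notation above,

    B(δ) = Eθ[ R(θ, δ) ] = ∫ R(θ, δ) p(θ) dθ

(a sum over the possible values of θ when the prior is discrete).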

Example cont.
Assume the three possible values of θ (6, 12 and 24) have the prior probabilities 0.2, 0.3 and 0.5. Weighting the risks in the table above by these prior probabilities gives the Bayes risk of each procedure. Thus the Bayes risk is minimized by δ1, and therefore δ1 is the Bayes procedure.

Decision theory applied to point estimation
The action is a particular point estimator.
The state of nature is the true value of θ.
The loss function is a measure of how good (desirable) the estimator of θ is.
Prior information is quantified by the prior distribution (p.d.f.) p(θ).
Data is the random sample x from a distribution with p.d.f. f(x; θ).

Three simple loss functions: zero-one loss, absolute error loss, and quadratic (error) loss.
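In one common formulation, with a constant a > 0 (and a small tolerance ε > 0 for the zero-one loss):

    Zero-one loss:        LS(θ, δ(x)) = 0 if |δ(x) – θ| ≤ ε, and a otherwise
    Absolute error loss:  LS(θ, δ(x)) = a · |δ(x) – θ|
    Quadratic loss:       LS(θ, δ(x)) = a · (δ(x) – θ)²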

Minimax estimators: Find the value of θ that maximizes the expected loss with respect to the sample values, i.e. that maximizes the risk R(θ, δ). Then the particular estimator that minimizes the risk for that value of θ is the minimax estimator. Not so easy to find!

Bayes estimators
A Bayes estimator is the estimator that minimizes the Bayes risk. For any given value of x, what has to be minimized is the posterior expected loss:
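    E[ LS(θ, δ(x)) | x ] = ∫ LS(θ, δ(x)) q(θ | x) dθ

i.e. the loss averaged over the posterior distribution of θ.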

The Bayes philosophy is that the data x should be considered given, so the minimization is carried out for the observed x, i.e. over the posterior distribution of θ. Minimization with respect to the different loss functions then results in measures of location in the posterior distribution of θ:
Zero-one loss: the posterior mode
Absolute error loss: the posterior median
Quadratic loss: the posterior mean

About prior distributions
Conjugate prior distributions
Example: Assume the parameter of interest is π, the proportion of units in the population with some property of interest (i.e. the probability for this property to occur). A reasonable prior density for π is the Beta density:
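In standard form, with parameters α, β > 0:

    p(π) = Γ(α + β) / (Γ(α) Γ(β)) · π^(α–1) (1 – π)^(β–1),   0 ≤ π ≤ 1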

Now assume a sample of size n from the population in which y of the values possess the property of interest. The likelihood is then proportional to π^y (1 – π)^(n–y). Multiplying it by the prior shows that the posterior density is also a Beta density, with parameters y + α and n – y + β.
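A minimal sketch of this conjugate update in Python; the prior parameters and data below are illustrative, not from the slides:

```python
from scipy import stats

alpha, beta = 2.0, 2.0   # assumed prior: Beta(2, 2)
n, y = 20, 7             # assumed data: 7 of 20 sampled units have the property

# Posterior is Beta(y + alpha, n - y + beta)
posterior = stats.beta(y + alpha, n - y + beta)

print("posterior mean  :", posterior.mean())    # Bayes estimate under quadratic loss
print("posterior median:", posterior.median())  # Bayes estimate under absolute error loss
print("posterior mode  :", (y + alpha - 1) / (n + alpha + beta - 2))  # zero-one loss (unimodal case)
```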

Prior distributions that, combined with the likelihood, give a posterior in the same distributional family are called conjugate priors. (Note that by a distributional family we mean distributions that go under a common name: normal distribution, binomial distribution, Poisson distribution, etc.) A conjugate prior always goes together with a particular likelihood to produce the posterior. We sometimes refer to a conjugate pair of distributions, meaning (prior distribution, sample distribution = likelihood).

In particular, if the sample distribution, i.e. f(x; θ), belongs to the k-parameter exponential family of distributions, we may use a conjugate prior of the matching exponential-family form, where α1, … , αk+1 are parameters of this prior distribution and K(α1, … , αk+1) is a function of α1, … , αk+1 only (see the sketch below).

Then the posterior distribution is of the same form as the prior distribution, but with updated parameters in place of α1, … , αk+1.
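In one common parameterization (the textbook’s notation may differ slightly), the scheme reads:

    f(x; θ) = exp{ A1(θ)B1(x) + … + Ak(θ)Bk(x) + C(x) + D(θ) }

    prior:      p(θ) = exp{ α1 A1(θ) + … + αk Ak(θ) + αk+1 D(θ) } / K(α1, … , αk+1)

    posterior:  q(θ | x) ∝ exp{ (α1 + Σi B1(xi)) A1(θ) + … + (αk + Σi Bk(xi)) Ak(θ) + (αk+1 + n) D(θ) }

so the posterior has parameters αj + Σi Bj(xi), j = 1, … , k, and αk+1 + n instead of α1, … , αk+1.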

Some common cases:

Conjugate prior   Sample distribution   Posterior
Beta              Binomial              Beta
Normal            Normal, known σ²      Normal
Gamma             Poisson               Gamma
Pareto            Uniform               Pareto

Example
Assume we have a sample x = (x1, … , xn) from U(0, θ) and that the prior density for θ is a Pareto density. What is the Bayes estimator of θ under quadratic loss? The Bayes estimator is the posterior mean, and the posterior distribution is also a Pareto density with updated parameters.
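A minimal sketch of this update in Python, assuming the Pareto prior is parameterized as p(θ) = a·m^a / θ^(a+1) for θ ≥ m; the prior parameters and data below are illustrative:

```python
import numpy as np

a, m = 3.0, 1.0                      # assumed prior: Pareto with shape a and scale m
x = np.array([0.8, 2.3, 1.7, 0.5])   # assumed sample from U(0, theta)
n = len(x)

# The likelihood is theta**(-n) for theta >= max(x), so the posterior is again Pareto:
a_post = a + n
m_post = max(m, x.max())

# Posterior mean = Bayes estimator under quadratic loss (finite when a_post > 1)
theta_hat = a_post * m_post / (a_post - 1)
print(theta_hat)   # 7 * 2.3 / 6 ≈ 2.68 with these numbers
```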

Non-informative (uninformative) priors
A prior distribution that gives no more information about θ than possibly the parameter space itself is called a non-informative or uninformative prior.
Example: Beta(1, 1) for an unknown proportion π simply says that the parameter can be any value between 0 and 1 (which coincides with its definition).
A non-informative prior is characterized by the property that all values in the parameter space are equally likely.
Proper non-informative priors: the prior is a true density or mass function.
Improper non-informative priors: the prior is a constant value over R^k. Example: a flat (constant) prior over the whole real line for the mean of a normal population, i.e. the limit of a N(μ0, σ0²) prior as σ0² → ∞.

Decision theory applied to hypothesis testing
Test of H0: θ = θ0 vs. H1: θ = θ1
Decision procedure: δC = “Use a test with critical region C”
Action: δC(x) = “Reject H0 if x ∈ C, otherwise accept H0”
Loss function:

              H0 true   H1 true
Accept H0     0         b
Reject H0     a         0

Risk function: see the expressions below. Assume a prior setting p0 = Pr(H0 is true) = Pr(θ = θ0) and p1 = Pr(H1 is true) = Pr(θ = θ1). The prior expected risk then becomes a weighted average of the two risks.
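With the loss table above, and writing α = Pr(X ∈ C; θ0) for the size and β = Pr(X ∉ C; θ1) for the type II error probability:

    R(θ0, δC) = a · Pr(X ∈ C; θ0) = a·α
    R(θ1, δC) = b · Pr(X ∉ C; θ1) = b·β

and the prior expected risk is  p0·a·α + p1·b·β.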

Bayes test: the test that minimizes the prior expected risk. Minimax test: the test that minimizes the maximum of the two risks.
Lemma 6.1: Bayes tests and most powerful tests (Neyman–Pearson lemma) are equivalent, in that every most powerful test is a Bayes test for some values of p0 and p1, and every Bayes test is a most powerful test with a particular critical constant (see below).
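With the quantities above, the Bayes test rejects H0 when the likelihood ratio is large enough:

    reject H0  if  L(θ1; x) / L(θ0; x) ≥ (p0 · a) / (p1 · b)

i.e. it is a most powerful test with critical constant K = p0·a/(p1·b). The minimax test is usually characterized as the test whose critical region equalizes the two risks, a·α = b·β.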

Example: Assume x = (x1, x2) is a random sample from Exp(θ). We would like to test H0: θ = 1 vs. H1: θ = 2 with a Bayes test with losses a = 2 and b = 1 and with prior probabilities p0 and p1.

Now the Bayes test follows from the likelihood ratio as above. A fixed size α gives conditions on p0 and p1, and a certain choice will give a minimized β.
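A minimal sketch of this Bayes test in Python, assuming the rate parameterization f(x; θ) = θ·e^(–θx); the prior probabilities are illustrative, not from the slides:

```python
import math

a, b = 2.0, 1.0      # losses from the example
p0, p1 = 0.5, 0.5    # assumed prior probabilities of H0 and H1

def likelihood(x, theta):
    # Joint density of the sample under Exp(theta), rate parameterization
    return math.prod(theta * math.exp(-theta * xi) for xi in x)

def bayes_test(x):
    # Reject H0 when the likelihood ratio exceeds the Bayes threshold p0*a/(p1*b)
    lr = likelihood(x, 2.0) / likelihood(x, 1.0)
    return "reject H0" if lr >= (p0 * a) / (p1 * b) else "accept H0"

print(bayes_test([0.2, 0.3]))   # small observations favour theta = 2 -> reject H0
print(bayes_test([1.5, 2.0]))   # large observations favour theta = 1 -> accept H0
```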

Sequential probability ratio test (SPRT)
Suppose that we consider the sampling to be the observation of values in a “stream” x1, x2, … , i.e. we do not consider a sample of fixed size. We would like to test H0: θ = θ0 vs. H1: θ = θ1. After n observations have been taken we have xn = (x1, … , xn), and we use the likelihood ratio as the current test statistic:
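    LR(n) = L(θ1; xn) / L(θ0; xn) = ∏ f(xi; θ1) / f(xi; θ0),   the product taken over i = 1, … , n.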

The frequentist approach: Specify two numbers K1 and K2, not depending on n, such that 0 < K1 < K2 < ∞. Then:
If LR(n) ≤ K1 ⇒ stop sampling, accept H0
If LR(n) ≥ K2 ⇒ stop sampling, reject H0
If K1 < LR(n) < K2 ⇒ take another observation
Usual choice of K1 and K2 (Property 6.3): if the size α and the power 1 – β are pre-specified, put K1 and K2 as below. This gives approximately true size α and approximately true power 1 – β.
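The usual Wald choices (presumably the content of Property 6.3) are approximately

    K1 ≈ β / (1 – α)   and   K2 ≈ (1 – β) / α

A minimal Python sketch of the resulting stopping rule; the exponential model, parameter values and thresholds below are illustrative assumptions:

```python
import math, random

alpha, beta = 0.05, 0.10
K1 = beta / (1 - alpha)         # accept-H0 boundary
K2 = (1 - beta) / alpha         # reject-H0 boundary

def f(x, theta):
    # Exp(theta) density, rate parameterization (illustrative choice of model)
    return theta * math.exp(-theta * x)

def sprt(stream, theta0=1.0, theta1=2.0):
    lr = 1.0
    for n, x in enumerate(stream, start=1):
        lr *= f(x, theta1) / f(x, theta0)   # update LR(n) one observation at a time
        if lr <= K1:
            return "accept H0", n
        if lr >= K2:
            return "reject H0", n
    return "undecided", n

random.seed(1)
data = (random.expovariate(2.0) for _ in range(1000))   # data generated under H1
print(sprt(data))
```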

The Bayesian approach: The structure is the same, but the choices of K1 and K2 are different. Let c be the cost of taking one observation, let a and b as before be the loss values for taking the wrong decisions, and let p0 and p1 be the prior probabilities of H0 and H1 respectively. The Bayesian choices of K1 and K2 are then functions of c, a, b, p0 and p1.