Bayesian Inference, Basics
Professor Wei Zhu

Bayes Theorem

Bayesian statistics is named after Thomas Bayes (1702-1761), an English statistician, philosopher and Presbyterian minister.

Bayes Theorem

Bayes theorem for probability events $A$ and $B$:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

Or, for a set of mutually exclusive and exhaustive events $A_1, \dots, A_k$ (i.e. $A_i \cap A_j = \emptyset$ for $i \neq j$, and $\bigcup_{i} A_i$ is the whole sample space), then

$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\,P(A_j)}$$

Example – Diagnostic Test

A new Ebola (EB) test is claimed to have "95% sensitivity and 98% specificity". In a population with an EB prevalence of 1/1000, what is the chance that a patient testing positive actually has EB?

Let $A$ be the event that the patient is truly positive, and $A'$ the event that they are truly negative. Let $B$ be the event that they test positive.

Diagnostic Test, ctd.

We want $P(A \mid B)$. "95% sensitivity" means that $P(B \mid A) = 0.95$; "98% specificity" means that $P(B' \mid A') = 0.98$, i.e. $P(B \mid A') = 0.02$. So, from Bayes theorem,

$$P(A \mid B) = \frac{0.95 \times 0.001}{0.95 \times 0.001 + 0.02 \times 0.999} \approx 0.045.$$

Thus over 95% of those testing positive will, in fact, not have EB.
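As a quick numerical check, the calculation above can be reproduced directly; this is a minimal Python sketch (the variable names are ours, not part of the slides):

```python
# Bayes theorem for the Ebola diagnostic-test example.
prevalence = 0.001        # P(A): prior probability of disease
sensitivity = 0.95        # P(B|A): P(test positive | diseased)
false_positive = 0.02     # P(B|A') = 1 - specificity

# Law of total probability: P(B) = P(B|A)P(A) + P(B|A')P(A')
p_positive = sensitivity * prevalence + false_positive * (1 - prevalence)

# Bayes theorem: P(A|B) = P(B|A)P(A) / P(B)
p_disease_given_positive = sensitivity * prevalence / p_positive
print(p_disease_given_positive)   # ~0.0454: under 5% of positives are true cases
```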

Bayesian Inference

In Bayesian inference there is a fundamental distinction between:
- Observable quantities $x$, i.e. the data
- Unknown quantities $\theta$, which can be statistical parameters, missing data, latent variables, ...

Parameters are treated as random variables. In the Bayesian framework we make probability statements about model parameters. In the frequentist framework, parameters are fixed non-random quantities and the probability statements concern the data.

Prior distributions

As with all statistical analyses, we start by positing a model which specifies $f(x \mid \theta)$. This is the likelihood, which relates all variables in a 'full probability model'. However, from a Bayesian point of view, $\theta$ is unknown, so it should have a probability distribution reflecting our uncertainty about it before seeing the data. Therefore we specify a prior distribution $f(\theta)$. Note this is like the prevalence in the example.

Posterior Distributions

Also, $x$ is known, so it should be conditioned on; here we use Bayes theorem to obtain the conditional distribution for the unobserved quantities given the data, which is known as the posterior distribution:

$$f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{\int f(x \mid \theta)\,f(\theta)\,d\theta} \propto f(x \mid \theta)\,f(\theta)$$

The prior distribution expresses our uncertainty about $\theta$ before seeing the data. The posterior distribution expresses our uncertainty about $\theta$ after seeing the data.

Examples of Bayesian Inference using the Normal distribution

Known variance, unknown mean. It is easier to consider first a model with one unknown parameter. Suppose we have a sample of Normal data $x_i \sim N(\mu, \sigma^2)$, $i = 1, \dots, n$, independent. Let us assume we know the variance $\sigma^2$, and we assume a prior distribution for the mean $\mu$ based on our prior beliefs:

$$\mu \sim N(\mu_0, \tau^2)$$

Now we wish to construct the posterior distribution $f(\mu \mid x)$.

Posterior for Normal distribution mean

So we have

$$f(\mu \mid x) \propto f(\mu)\,f(x \mid \mu) = (2\pi\tau^2)^{-1/2}\exp\!\left(-\frac{(\mu-\mu_0)^2}{2\tau^2}\right)\prod_{i=1}^{n}(2\pi\sigma^2)^{-1/2}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$$

$$\propto \exp\!\left(-\frac{\mu^2}{2}\left(\frac{1}{\tau^2}+\frac{n}{\sigma^2}\right)+\mu\left(\frac{\mu_0}{\tau^2}+\frac{\sum_i x_i}{\sigma^2}\right)\right)$$

Posterior for Normal distribution mean (continued)

For a Normal distribution with response $y$ with mean $\theta$ and variance $\phi$ we have

$$f(y) \propto \exp\!\left(-\frac{(y-\theta)^2}{2\phi}\right) \propto \exp\!\left(-\frac{y^2}{2\phi}+\frac{y\theta}{\phi}\right)$$

We can equate this to our posterior as follows:

$$\frac{1}{\phi} = \frac{1}{\tau^2}+\frac{n}{\sigma^2}, \qquad \frac{\theta}{\phi} = \frac{\mu_0}{\tau^2}+\frac{n\bar{x}}{\sigma^2}$$

so that

$$\mu \mid x \sim N(\theta, \phi), \quad \theta = \frac{\mu_0/\tau^2 + n\bar{x}/\sigma^2}{1/\tau^2 + n/\sigma^2}, \quad \phi = \left(\frac{1}{\tau^2}+\frac{n}{\sigma^2}\right)^{-1}$$

Large sample properties

As $n \to \infty$, the posterior variance $(1/\tau^2 + n/\sigma^2)^{-1} \to \sigma^2/n$ and the posterior mean $\to \bar{x}$. And so the posterior distribution approaches

$$\mu \mid x \sim N(\bar{x}, \sigma^2/n)$$

compared to $\bar{x} \sim N(\mu, \sigma^2/n)$ in the frequentist setting.
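The update formulas above are easy to encode. Here is a hedged sketch (the helper function is ours, not from the slides) that also illustrates the large-sample behaviour: as $n$ grows, the posterior mean tracks $\bar{x}$ and the posterior variance shrinks like $\sigma^2/n$.

```python
import numpy as np

def normal_posterior(x, sigma2, mu0, tau2):
    """Posterior of a Normal mean with known data variance sigma2,
    given the prior mu ~ N(mu0, tau2). Returns (mean, variance)."""
    n = len(x)
    precision = 1 / tau2 + n / sigma2                 # posterior precision
    mean = (mu0 / tau2 + np.sum(x) / sigma2) / precision
    return mean, 1 / precision

rng = np.random.default_rng(0)
for n in (10, 1000, 100000):
    x = rng.normal(loc=5.0, scale=2.0, size=n)        # true mu = 5, sigma2 = 4
    m, v = normal_posterior(x, sigma2=4.0, mu0=0.0, tau2=1.0)
    print(n, round(m, 3), round(v, 6))                # mean -> xbar, variance -> sigma2/n
```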

Sufficient Statistic

Intuitively, a sufficient statistic for a parameter is a statistic that captures all the information about that parameter contained in the sample.

Sufficiency Principle: If $T(X)$ is a sufficient statistic for $\theta$, then any inference about $\theta$ should depend on the sample $X$ only through the value of $T(X)$. That is, if $x$ and $y$ are two sample points such that $T(x) = T(y)$, then the inference about $\theta$ should be the same whether $X = x$ or $X = y$.

Definition: A statistic $T(X)$ is a sufficient statistic for $\theta$ if the conditional distribution of the sample $X$ given $T(X)$ does not depend on $\theta$.

Definition: Let $X_1, X_2, \dots, X_n$ denote a random sample of size $n$ from a distribution that has pdf $f(x; \theta)$, $\theta \in \Omega$. Let $Y_1 = u_1(X_1, X_2, \dots, X_n)$ be a statistic whose pdf or pmf is $f_{Y_1}(y_1; \theta)$. Then $Y_1$ is a sufficient statistic for $\theta$ if and only if

$$\frac{f(x_1; \theta)\,f(x_2; \theta)\cdots f(x_n; \theta)}{f_{Y_1}\!\left(u_1(x_1, \dots, x_n); \theta\right)} = H(x_1, x_2, \dots, x_n)$$

where $H(x_1, x_2, \dots, x_n)$ does not depend on $\theta \in \Omega$.

Example: Normal sufficient statistic. Let $X_1, X_2, \dots, X_n$ be independently and identically distributed $N(\mu, \sigma^2)$, where the variance is known. The sample mean $\bar{x}$ is the sufficient statistic for $\mu$.

Starting with the joint distribution function,

$$f(x \mid \mu) = \prod_{i=1}^{n}(2\pi\sigma^2)^{-1/2}\exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{2\sigma^2}\right)$$

Next, we add and subtract the sample average, yielding

$$f(x \mid \mu) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right)$$

where the last equality derives from the cross term vanishing:

$$\sum_{i=1}^{n}(x_i-\bar{x})(\bar{x}-\mu) = (\bar{x}-\mu)\sum_{i=1}^{n}(x_i-\bar{x}) = 0.$$

Given that the distribution of the sample mean is $\bar{x} \sim N(\mu, \sigma^2/n)$, with density

$$f_{\bar{x}}(\bar{x} \mid \mu) = \left(\frac{2\pi\sigma^2}{n}\right)^{-1/2}\exp\!\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right),$$

the ratio of the information in the sample to the information in the statistic becomes

$$\frac{f(x \mid \mu)}{f_{\bar{x}}(\bar{x} \mid \mu)} = \frac{(2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{\sum_i(x_i-\bar{x})^2 + n(\bar{x}-\mu)^2}{2\sigma^2}\right)}{(2\pi\sigma^2/n)^{-1/2}\exp\!\left(-\frac{n(\bar{x}-\mu)^2}{2\sigma^2}\right)} = n^{-1/2}(2\pi\sigma^2)^{-(n-1)/2}\exp\!\left(-\frac{\sum_i(x_i-\bar{x})^2}{2\sigma^2}\right)$$

which is a function of the data $x_1, x_2, \dots, x_n$ only, and does not depend on $\mu$. Thus we have shown that the sample mean is a sufficient statistic for $\mu$.
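One way to see this concretely is to evaluate the ratio $f(x \mid \mu)/f_{\bar{x}}(\bar{x} \mid \mu)$ at several values of $\mu$ and observe that it stays constant. A small numerical check under the same model assumptions (the sample and the trial values of $\mu$ are ours):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma = 2.0
x = rng.normal(loc=3.0, scale=sigma, size=8)   # a small N(3, 4) sample
xbar, n = x.mean(), len(x)

for mu in (0.0, 3.0, 10.0):
    joint = norm.pdf(x, loc=mu, scale=sigma).prod()            # f(x | mu)
    stat = norm.pdf(xbar, loc=mu, scale=sigma / np.sqrt(n))    # f_xbar(xbar | mu)
    print(mu, joint / stat)   # the same value printed for every mu
```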

Theorem (Factorization Theorem): Let $f(x \mid \theta)$ denote the joint pdf or pmf of a sample $X$. A statistic $T(X)$ is a sufficient statistic for $\theta$ if and only if there exist functions $g(t \mid \theta)$ and $h(x)$ such that, for all sample points $x$ and all parameter points $\theta$,

$$f(x \mid \theta) = g(T(x) \mid \theta)\,h(x).$$

Posterior Distribution Through Sufficient Statistics

Theorem: The posterior distribution depends only on sufficient statistics.

Proof: Let $T(X)$ be a sufficient statistic for $\theta$. Then, by the factorization theorem,

$$f(\theta \mid x) = \frac{f(x \mid \theta)\,f(\theta)}{\int f(x \mid \theta)\,f(\theta)\,d\theta} = \frac{g(T(x) \mid \theta)\,h(x)\,f(\theta)}{\int g(T(x) \mid \theta)\,h(x)\,f(\theta)\,d\theta} = \frac{g(T(x) \mid \theta)\,f(\theta)}{\int g(T(x) \mid \theta)\,f(\theta)\,d\theta},$$

which depends on the data only through $T(x)$.

Posterior Distribution Through Sufficient Statistics

Example: Posterior for Normal distribution mean (with known variance). Now, instead of using the entire sample, we can derive the posterior distribution using the sufficient statistic $\bar{x}$, with $\bar{x} \mid \mu \sim N(\mu, \sigma^2/n)$.

Exercise: Please derive the posterior distribution using this approach.

Girls Heights Example

10 girls aged 18 had both their heights and weights measured. Their heights (in cm) were as follows:

169.6, 166.8, 157.1, 181.1, 158.4, 165.6, 166.7, 156.5, 168.1, 165.3

We will assume the variance is known to be 50. Two individuals gave the following prior distributions for the mean height:

Individual 1: $\mu \sim N(165, 2^2)$
Individual 2: $\mu \sim N(170, 3^2)$

Constructing posterior 1

To construct the posterior we use the formulae we have just calculated.

From the prior, $\mu_0 = 165$, $\tau^2 = 4$.
From the data, $\bar{x} = 165.52$, $\sigma^2/n = 50/10 = 5$.

The posterior is therefore

$$\mu \mid x \sim N(\theta, \phi), \quad \phi = \left(\frac{1}{4}+\frac{1}{5}\right)^{-1} = 2.222, \quad \theta = 2.222\left(\frac{165}{4}+\frac{165.52}{5}\right) = 165.23.$$
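The same numbers can be reproduced directly from the data; a minimal sketch (variable names are ours):

```python
import numpy as np

heights = np.array([169.6, 166.8, 157.1, 181.1, 158.4,
                    165.6, 166.7, 156.5, 168.1, 165.3])
sigma2 = 50.0                                  # assumed known data variance
n, xbar = len(heights), heights.mean()         # n = 10, xbar = 165.52

mu0, tau2 = 165.0, 4.0                         # individual 1's prior: N(165, 2^2)
precision = 1 / tau2 + n / sigma2              # 0.25 + 0.2 = 0.45
post_var = 1 / precision                       # 2.222
post_mean = (mu0 / tau2 + n * xbar / sigma2) / precision   # 165.23
print(post_mean, post_var)
```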

Prior and posterior comparison

[Figure: densities of the prior $N(165, 4)$ and the posterior $N(165.23, 2.22)$.]

Constructing posterior 2

Again, to construct the posterior we use the earlier formulae we have just calculated.

From the prior, $\mu_0 = 170$, $\tau^2 = 9$.
From the data, $\bar{x} = 165.52$, $\sigma^2/n = 5$.

The posterior is therefore

$$\mu \mid x \sim N(\theta, \phi), \quad \phi = \left(\frac{1}{9}+\frac{1}{5}\right)^{-1} = 3.214, \quad \theta = 3.214\left(\frac{170}{9}+\frac{165.52}{5}\right) = 167.12.$$

Prior 2 comparison

Note this prior is not as close to the data as prior 1, and hence the posterior sits between the prior and the likelihood, represented by the pdf of $N(\bar{x}, \sigma^2/n) = N(165.52, 5)$.

[Figure: densities of the prior $N(170, 9)$, the likelihood, and the posterior $N(167.12, 3.21)$.]

Other conjugate examples

When the posterior is in the same family as the prior we have conjugacy. Examples include:

Likelihood | Parameter   | Prior  | Posterior
Normal     | Mean        | Normal | Normal
Binomial   | Probability | Beta   | Beta
Poisson    | Mean        | Gamma  | Gamma
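For instance, the Binomial/Beta row means a Beta(a, b) prior on the success probability updates in closed form: observing s successes and f failures gives a Beta(a + s, b + f) posterior. A short illustrative sketch (the prior and counts are made up for illustration):

```python
from scipy.stats import beta

a, b = 2.0, 2.0                    # Beta(2, 2) prior on the success probability
successes, failures = 7, 3         # hypothetical Binomial data, n = 10

# Conjugate update: posterior is Beta(a + successes, b + failures)
posterior = beta(a + successes, b + failures)
print(posterior.mean())            # (a + 7) / (a + b + 10) = 9/14
print(posterior.interval(0.95))    # central 95% credible interval
```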

In all cases

The posterior mean is a compromise between the prior mean and the MLE. The posterior s.d. is less than both the prior s.d. and the s.e.(MLE).

'A Bayesian is one who, vaguely expecting a horse and catching a glimpse of a donkey, strongly concludes he has seen a mule.' -- Senn

As $n \to \infty$:
- The posterior mean $\to$ the MLE
- The posterior s.d. $\to$ the s.e.(MLE)
- The posterior does not depend on the prior.

Non-informative priors

We often do not have any prior information, although true Bayesians would argue we always have some prior information! We would hope to have good agreement between the frequentist approach and the Bayesian approach with a non-informative prior. Diffuse or flat priors are often better terms to use, as no prior is strictly non-informative! For our example of an unknown mean, candidate priors are a Uniform distribution over a large range or a Normal distribution with a huge variance.

Improper priors

The limiting prior of both the Uniform and the Normal is a Uniform prior on the whole real line. Such a prior is defined as improper, as it is not strictly a probability distribution and doesn't integrate to 1. Some care has to be taken with improper priors; however, in many cases they are acceptable provided they result in a proper posterior distribution.

Uniform priors are often used as non-informative priors; however, it is worth noting that a uniform prior on one scale can be very informative on another. For example: if we have an unknown variance we may put a uniform prior on the variance, the standard deviation, or log(variance), which will all have different effects.
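The last point follows from the usual change-of-variables rule for densities. As one worked instance (our example, not from the slides): a flat prior on $\phi = \log\sigma^2$, $f_\phi(\phi) \propto 1$, implies a decidedly non-flat prior on $\sigma^2$,

$$f_{\sigma^2}(\sigma^2) = f_\phi(\log\sigma^2)\left|\frac{d}{d\sigma^2}\log\sigma^2\right| \propto \frac{1}{\sigma^2},$$

so the 'uninformative' choice on the log scale places ever more mass near $\sigma^2 = 0$ on the variance scale.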

Point and Interval Estimation

In Bayesian inference the outcome of interest for a parameter is its full posterior distribution; however, we may be interested in summaries of this distribution. A simple point estimate would be the mean of the posterior (although the median and mode are alternatives). Interval estimates are also easy to obtain from the posterior distribution and are given several names, for example credible intervals, Bayesian confidence intervals, and highest density regions (HDR). All of these refer to the same quantity.

Definition (Mean Square Error): Let $x = (x_1, x_2, x_3, \dots, x_n)$ denote the vector of observations having joint density $f(x \mid \theta)$, where $\theta \in \Omega$ is the unknown parameter vector. Let $T(x)$ be an estimator of the parameter $\tau(\theta)$. Then the mean square error of $T(x)$ is defined to be:

$$\mathrm{MSE}(T) = E\!\left[\left(T(x) - \tau(\theta)\right)^2\right]$$

Bayes Estimator

Theorem: The Bayes estimator that minimizes the mean square error is

$$\hat{\theta} = E(\theta \mid x),$$

that is, the posterior mean.

Bayes Estimator

Lemma: Suppose $Z$ and $W$ are real random variables. Then, for any function $g$,

$$E\!\left[(Z - E(Z \mid W))^2\right] \le E\!\left[(Z - g(W))^2\right].$$

That is, the conditional (posterior) mean $E(Z \mid W)$ will minimize the quadratic loss (mean square error).

Bayes Estimator

Proof:

$$E\!\left[(Z - g(W))^2\right] = E\!\left[(Z - E(Z \mid W) + E(Z \mid W) - g(W))^2\right]$$
$$= E\!\left[(Z - E(Z \mid W))^2\right] + 2\,E\!\left[(Z - E(Z \mid W))(E(Z \mid W) - g(W))\right] + E\!\left[(E(Z \mid W) - g(W))^2\right].$$

Conditioning on $W$, the cross term is zero. Thus

$$E\!\left[(Z - g(W))^2\right] = E\!\left[(Z - E(Z \mid W))^2\right] + E\!\left[(E(Z \mid W) - g(W))^2\right] \ge E\!\left[(Z - E(Z \mid W))^2\right].$$
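A quick Monte Carlo check of the lemma (our own simulation, not from the slides): with $Z = W + e$ for independent standard Normal $W$ and $e$, the conditional mean is $E(Z \mid W) = W$, and any other function of $W$ should do worse in squared error.

```python
import numpy as np

rng = np.random.default_rng(2)
w = rng.normal(size=1_000_000)
z = w + rng.normal(size=1_000_000)     # Z = W + e, so E(Z|W) = W

mse_conditional_mean = np.mean((z - w) ** 2)     # predictor E(Z|W) = W
mse_other = np.mean((z - 0.5 * w) ** 2)          # some other g(W)
print(mse_conditional_mean, mse_other)           # ~1.0 vs ~1.25
```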

Bayes Estimator

Recall the theorem above: the posterior distribution depends only on sufficient statistics. Therefore, letting $T(X)$ be a sufficient statistic,

$$\hat{\theta} = E(\theta \mid X) = E(\theta \mid T(X)).$$

Credible Intervals

If we consider the heights example with our first prior, then our posterior is $\mu \mid x \sim N(165.23, 2.222)$, and a 95% credible interval for $\mu$ is $165.23 \pm 1.96\sqrt{2.222} = (162.31, 168.15)$. Similarly, prior 2 results in a 95% credible interval for $\mu$ of $(163.61, 170.63)$.

Note that credible intervals can be interpreted in the more natural way that there is a probability of 0.95 that the interval contains $\mu$, rather than the frequentist conclusion that 95% of such intervals contain $\mu$.
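These intervals are just Normal quantiles of the two posteriors; a one-line sketch for each:

```python
from scipy.stats import norm

# 95% credible intervals from the two posteriors in the heights example
print(norm.interval(0.95, loc=165.23, scale=2.222 ** 0.5))  # ~(162.31, 168.15)
print(norm.interval(0.95, loc=167.12, scale=3.214 ** 0.5))  # ~(163.61, 170.63)
```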

Hypothesis Testing

Another big issue in statistical modelling is the ability to test hypotheses, and to perform model comparisons in general. The Bayesian approach is in some ways more straightforward: for an unknown parameter $\theta$ we simply calculate the posterior probabilities

$$P(H_0 \mid x) = P(\theta \in \Theta_0 \mid x), \qquad P(H_1 \mid x) = P(\theta \in \Theta_1 \mid x)$$

and decide between $H_0$ and $H_1$ accordingly. We also require the prior probabilities $P(H_0)$ and $P(H_1)$ to achieve this.
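Continuing the heights example with posterior 1, the posterior probabilities of, say, $H_0: \mu \le 165$ versus $H_1: \mu > 165$ reduce to a single tail computation (these particular hypotheses are ours, chosen for illustration):

```python
from scipy.stats import norm

posterior = norm(loc=165.23, scale=2.222 ** 0.5)   # posterior 1 from the heights example
p_h0 = posterior.cdf(165.0)     # P(mu <= 165 | x)
p_h1 = 1 - p_h0                 # P(mu > 165 | x)
print(p_h0, p_h1)               # ~0.44 vs ~0.56
```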

Acknowledgement

Part of this presentation was based on publicly available resources posted online.