Bayesian Reasoning: Maximum Entropy
A/Prof Geraint F. Lewis, Rm 560:

Lecture 8: Common Sense
We have spent quite a bit of time exploring the posterior probability distribution, but of course, to calculate it we need the likelihood function and our prior knowledge. How that prior knowledge is encoded is the biggest source of argument about Bayesian statistics, with critics complaining that a subjective choice influences the outcome (but shouldn't this be the case?). Realistically, we could consider a wealth of prior probability distributions that all agree with our constraints (e.g. a specified mean), so which one do we choose? Answer: we pick the one that is maximally non-committal about the missing information.

Lecture 8: Shannon's Theorem
In 1948, Shannon developed a measure of the uncertainty of a probability distribution, which he labelled entropy. He showed that the uncertainty of a discrete probability distribution {p_i} is

  S = -Σ_i p_i ln(p_i)

Jaynes argued that the maximally non-committal probability distribution is the one with maximum entropy; hence, of all the probability distributions consistent with our constraints, we should choose the one that maximizes S. Any other distribution implies some sort of correlation (we'll see this in a moment).

Lecture 8: Example
You are told that an experiment has two possible outcomes; what is the maximally non-committal distribution you should assign to them? Clearly, if we assign p_1 = x, then p_2 = (1 - x) and the entropy is

  S = -x ln(x) - (1 - x) ln(1 - x)

The maximum of the entropy occurs at p_1 = p_2 = 1/2. But isn't this what you would have guessed? If we have any further information (e.g. the existence of a correlation between outcomes 1 and 2), we can build this into the measure above and re-maximize.
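A quick numerical check of this result (a minimal Python sketch; the grid resolution is arbitrary):

```python
import numpy as np

# Entropy of a two-outcome distribution {x, 1 - x}:  S(x) = -x ln(x) - (1 - x) ln(1 - x)
def binary_entropy(x):
    return -(x * np.log(x) + (1.0 - x) * np.log(1.0 - x))

x = np.linspace(1e-6, 1.0 - 1e-6, 100001)   # avoid the endpoints, where ln(0) diverges
S = binary_entropy(x)

print("entropy maximized at p1 =", x[np.argmax(S)])   # 0.5
print("maximum entropy         =", S.max())           # ln 2 ~ 0.693
```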

Lecture 8: The Kangaroo Justification
Suppose you are given some basic information about the population of Australian kangaroos:
1) 1/3 of kangaroos have blue eyes
2) 1/3 of kangaroos are left-handed
How many kangaroos are both blue-eyed and left-handed? Writing the joint probabilities as

                       Left-handed: True   Left-handed: False
  Blue eyes: True            p_1                  p_2
  Blue eyes: False           p_3                  p_4

we know that p_1 + p_2 = 1/3, p_1 + p_3 = 1/3, and p_1 + p_2 + p_3 + p_4 = 1.

Lecture 8: The Kangaroo Justification
What are the options?
1) Independent case (no correlation): p_1 = 1/3 × 1/3 = 1/9
2) Maximal positive correlation: p_1 = 1/3 (every blue-eyed kangaroo is left-handed)
3) Maximal negative correlation: p_1 = 0 (no blue-eyed kangaroo is left-handed)

Lecture 8: The Kangaroo Justification
So there is a range of potential p_1 values between 0 and 1/3 (each of which sets all the other values), but which do we choose? Again, we wish to be non-committal and not assume any prior correlation (unless we have evidence to support one). What variational function of {p_i} selects this particular case?

  Variational function     Optimal p_1      Implied correlation
  -Σ p_i ln(p_i)           1/9 ≈ 0.111      uncorrelated
  -Σ p_i²                  1/12 ≈ 0.083     negative
   Σ ln(p_i)               > 1/9            positive
   Σ √p_i                  > 1/9            positive

So the variational function that selects the non-committal case is the entropy. As we will see, this is very important for image reconstruction.
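The table can be reproduced numerically. A minimal Python sketch, assuming SciPy is available; it parametrizes the problem by z = p_1 and maximizes each candidate function over the allowed range 0 < z < 1/3:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# With p1 = z, the constraints fix p2 = p3 = 1/3 - z and p4 = 1/3 + z.
def probs(z):
    return np.array([z, 1/3 - z, 1/3 - z, 1/3 + z])

candidates = {
    "-sum p ln p (entropy)": lambda p: -np.sum(p * np.log(p)),
    "-sum p^2":              lambda p: -np.sum(p ** 2),
    "sum ln p":              lambda p: np.sum(np.log(p)),
    "sum sqrt(p)":           lambda p: np.sum(np.sqrt(p)),
}

eps = 1e-9   # stay away from z = 0 and z = 1/3, where the logarithms diverge
for name, func in candidates.items():
    res = minimize_scalar(lambda z: -func(probs(z)),
                          bounds=(eps, 1/3 - eps), method="bounded")
    print(f"{name:24s} optimal p1 = {res.x:.4f}")
# Only the Shannon entropy gives p1 = 1/9 ~ 0.111, the uncorrelated value;
# the other choices imply a positive or negative correlation we have no evidence for.
```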

Lecture 8: Incorporating a Prior
Section 8.4 of the textbook discusses a justification of the MaxEnt approach, considering the rolling of a weighted die and examining the "multiplicity" of the outcomes (i.e. some potential outcomes are more likely than others). Suppose you want to incorporate some prior information into the entropy measure, in the form of prior estimates {m_i} of our probabilities {p_i}. Following that argument, the quantity we want to maximize is the Shannon-Jaynes entropy

  S = -Σ_i p_i ln(p_i / m_i)

If the m_i are all equal, they have no influence on the maximization; we will see that this is important when considering image reconstruction.

Lecture 8: Incorporating a Prior
When considering a continuous probability distribution, the entropy becomes

  S = -∫ p(y) ln[ p(y) / m(y) ] dy

where m(y) is known as the Lebesgue measure. This quantity (which still encodes our prior) ensures that the entropy is insensitive to a change of coordinates, since m(y) and p(y) transform in the same way.
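To see this invariance explicitly, here is a short LaTeX sketch of the (standard) argument:

```latex
% Under a change of variable y -> z(y), the density and the measure transform with
% the same Jacobian: p'(z) = p(y)|dy/dz| and m'(z) = m(y)|dy/dz|.  Hence
\[
S' = -\int p'(z)\,\ln\!\left[\frac{p'(z)}{m'(z)}\right]\mathrm{d}z
   = -\int p(y)\left|\frac{\mathrm{d}y}{\mathrm{d}z}\right|
        \ln\!\left[\frac{p(y)}{m(y)}\right]\mathrm{d}z
   = -\int p(y)\,\ln\!\left[\frac{p(y)}{m(y)}\right]\mathrm{d}y = S,
\]
% so the entropy does not depend on the choice of coordinates.
```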

Lecture 8: Some Examples
Suppose you are told that an experiment has n possible outcomes. Without further information, what prior distribution would you assign to the outcomes? Your prior estimates (with no additional information) would be {m_i} = 1/n; what does MaxEnt then say the values of {p_i} should be? The quantity we maximize is the entropy plus a Lagrange multiplier term enforcing the normalization constraint:

  S_c = -Σ_i p_i ln(p_i / m_i) + λ (1 - Σ_i p_i)

Lecture 8: Some Examples
Taking the partial derivatives of S_c with respect to the p_i and the multiplier, we can show that

  ∂S_c/∂p_i = -ln(p_i / m_i) - 1 - λ = 0

and so p_i = m_i exp[-(1 + λ)]. All that is left is to evaluate λ, which we get from the normalization constraint:

  Σ_i p_i = exp[-(1 + λ)] Σ_i m_i = 1

Since the {m_i} also sum to one, this gives λ = -1 and hence {p_i} = {m_i}: with no extra information, MaxEnt simply returns our prior estimates.

Lecture 8: Nicer Examples
What if you have additional constraints, such as knowing the mean μ of the outcomes x_i? Then your constrained entropy is

  S_c = -Σ_i p_i ln(p_i / m_i) + λ_0 (1 - Σ_i p_i) + λ_1 (Σ_i x_i p_i - μ)

where we now have two Lagrange multipliers, one for each constraint. Through the same procedure, we look for the maximum and find a solution of exponential form,

  p_i ∝ m_i exp(λ_1 x_i)

with λ_0 absorbed into the normalization. In general, solving for the multipliers analytically is difficult, but it is straightforward numerically.

Lecture 8: Nicer Examples
Suppose you are told that a die has a mean score of μ dots per roll; what is the probability weighting of each face? If the weights are all equal, the die is unweighted and fair; if the probabilities differ, we should suppose that the die is loaded. If μ = 3.5, it is easy to show from the two constraints that λ_0 = -1 and λ_1 = 0 (write out the two constraints in terms of the previous equation and divide out the λ_0 factor). With no prior reason to think otherwise, each face is weighted equally in the prior, and so the final result is {p_i} = {m_i}: for an (unweighted) average of 3.5, the most probable assignment is that all faces have equal weight.

Lecture 8: Nicer Examples
Suppose, however, you were told that the mean was μ = 4.5; what is the most probable distribution {p_i}? We can follow the same procedure as in the previous example, but now find non-zero multipliers (λ_0 = -0.37 and λ_1 = 0.49 in the textbook's parametrization). With these, the distribution {p_i} is skewed towards the higher die faces, as we would expect for a die whose rolls average more than 3.5.
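Both die examples can be checked numerically without solving for the multipliers at all, by maximizing the Shannon-Jaynes entropy directly under the two constraints. A minimal Python sketch, assuming SciPy is available (the SLSQP settings are illustrative):

```python
import numpy as np
from scipy.optimize import minimize

faces = np.arange(1, 7)
m = np.full(6, 1 / 6)                       # uniform prior estimates {m_i}

def neg_entropy(p):
    return np.sum(p * np.log(p / m))        # negative Shannon-Jaynes entropy

def maxent_die(mu):
    """Maximize the entropy subject to sum(p) = 1 and sum(p * faces) = mu."""
    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},
        {"type": "eq", "fun": lambda p: np.sum(p * faces) - mu},
    ]
    res = minimize(neg_entropy, x0=m, method="SLSQP",
                   bounds=[(1e-9, 1.0)] * 6, constraints=constraints)
    return res.x

print(np.round(maxent_die(3.5), 3))   # ~[0.167]*6: the uniform (fair) die
print(np.round(maxent_die(4.5), 3))   # weights increase towards the higher faces
```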

Lecture 8: Additional Constraints
Additional information provides additional constraints on the probability distribution. If we know a mean μ and a variance σ², then the constraints are

  ∫ p(x) dx = 1,   ∫ x p(x) dx = μ,   ∫ (x - μ)² p(x) dx = σ²

Given what we have seen previously, we should expect the solution (in the continuum limit) to be the exponential of a quadratic in x,

  p(x) ∝ exp( λ_1 x + λ_2 x² )

which, when the multipliers are fixed by the constraints and the result is appropriately normalized, is the (expected) Gaussian distribution.
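For completeness, here is the standard calculation sketched in LaTeX (a derivation of the stated result, assuming a uniform measure m(x)):

```latex
% Maximize the entropy with one Lagrange multiplier per constraint:
\[
Q = -\int p\,\ln\frac{p}{m}\,\mathrm{d}x
    + \lambda_0\!\left(1 - \int p\,\mathrm{d}x\right)
    + \lambda_1\!\left(\mu - \int x\,p\,\mathrm{d}x\right)
    + \lambda_2\!\left(\sigma^2 - \int (x-\mu)^2 p\,\mathrm{d}x\right).
\]
% Setting the functional derivative \delta Q/\delta p(x) to zero:
\[
-\ln\frac{p}{m} - 1 - \lambda_0 - \lambda_1 x - \lambda_2 (x-\mu)^2 = 0
\quad\Longrightarrow\quad
p(x) \propto m(x)\,e^{-\lambda_1 x - \lambda_2 (x-\mu)^2}.
\]
% With m(x) constant, the mean constraint forces \lambda_1 = 0 and the variance
% constraint gives \lambda_2 = 1/(2\sigma^2), so
\[
p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,
       \exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right].
\]
```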

Lecture 8: Image Reconstruction
In science, we are interested in gleaning underlying physical properties from data sets, but in general the data contain signals that have been blurred (through optics or other physical effects) and have added noise (such as photon-counting noise or detector noise). So how do we extract our image from the blurry, noisy data?

Lecture 8: Image Reconstruction
Naively, you might assume that you can simply "invert" the process and recover the original image. However, the problem is ill-posed, and a direct "deconvolution" will amplify the noise in a (usually) catastrophic way. We could attempt to suppress the noise (e.g. Wiener filtering), but isn't there another way?

Lecture 8: Image Reconstruction
Our image consists of a series of pixels, each with a photon count I_i. We can treat the image as a probability distribution by normalizing:

  p_i = I_i / Σ_j I_j

The value in each pixel is then the probability that the next photon will arrive in that pixel. Note that for an image p_i ≥ 0, and so we are dealing with a "positive, additive distribution" (this is important, as some techniques like to add negative flux in regions to improve a reconstruction).

Lecture 8: Image Reconstruction
We can apply Bayes' theorem to calculate the posterior probability of a proposed "true" image {I_i}, given the data. Following the argument given in the text, the posterior is the product of an entropic prior and the likelihood,

  p({I_i} | D) ∝ exp( αS - χ²/2 )

where S is the entropy of the proposed image relative to the prior image {m_i}, χ² measures the misfit between the blurred proposal and the data, and α sets their relative weight.

Lecture 8: Image Reconstruction
So we aim to maximize

  Q = αS - χ²/2

over proposed images. The method therefore requires a way of generating proposal images (i.e. throwing down blobs of light), convolving them with our blurring function to give the predicted (blurred) image, and comparing this to the data through χ². The requirement p_i ≥ 0 ensures that the proposal image is everywhere positive (which is good!). What does the entropy term do? It provides a "regularization" that drives the solution towards our prior distribution {m_i}, while the χ² term drives the fit to the data. Note, however, that we sometimes need to add additional regularization terms to enforce smoothness on the solution.
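As a toy illustration of this objective, the Python sketch below builds a 1-D "image", blurs it, adds noise, and then maximizes αS - χ²/2 over a positive proposal image. The test image, Gaussian PSF, noise level, α and the use of Skilling's form of the entropy are all assumptions made for this example, not the textbook's implementation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(1)

# Toy 1-D "truth": two point sources on an empty background
true = np.zeros(64)
true[20], true[40] = 5.0, 3.0

# Blur with a Gaussian PSF and add Gaussian noise (both invented for this example)
sigma_psf, sigma_noise = 2.0, 0.1
data = gaussian_filter1d(true, sigma_psf) + rng.normal(0.0, sigma_noise, true.size)

# Flat prior image {m_i} carrying roughly the same total flux as the data
m = np.full(true.size, max(data.sum(), 1.0) / true.size)

def neg_objective(I, alpha=1.0):
    I = np.clip(I, 1e-9, None)                 # keep the proposal image positive
    # Skilling's entropy for positive, additive images; it reduces to
    # -sum p ln(p/m) when I and m are normalized distributions.
    S = np.sum(I - m - I * np.log(I / m))
    model = gaussian_filter1d(I, sigma_psf)    # blur the proposal image
    chi2 = np.sum((data - model) ** 2) / sigma_noise**2
    return -(alpha * S - 0.5 * chi2)           # we maximize alpha*S - chi^2/2

res = minimize(neg_objective, x0=m.copy(), method="L-BFGS-B",
               bounds=[(1e-9, None)] * true.size)
recon = res.x   # MaxEnt-regularized reconstruction; compare with `true` and `data`
```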

Lecture 8: Image Reconstruction
Here is an example of MaxEnt reconstruction with differing point-spread functions (PSFs) and added noise. Exactly what you get back depends on the quality of your data, but in each case you can read the recovered message.

Lecture 8: Image Reconstruction
Reconstruction of the radio galaxy M87 (Bryan & Skilling 1980) using MaxEnt. Note the reduction in the noise and the higher detail visible in the radio jet.

Lecture 8: Image Reconstruction
MaxEnt is not always a good thing!