580.704 Mathematical Foundations of BME Reza Shadmehr Bayesian learning 1: Bayes rule, priors and maximum a posteriori

Frequentist vs. Bayesian Statistics
Frequentist thinking: there is a true parameter w, and from the data we form an estimate of that parameter. There are many different ways in which we can come up with estimates (e.g. the maximum likelihood estimate), and we can evaluate them.
Bayesian thinking: does not have the concept of a single true parameter. Rather, at every given time we have knowledge about w (the prior), gain new data, and then update our knowledge using Bayes rule (the posterior):
p(w | D) = p(D | w) p(w) / p(D)
where p(D | w) is the conditional distribution, p(w) is the prior distribution, and p(w | D) is the posterior distribution. Given Bayes rule, there is only ONE correct way of learning.
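A minimal sketch (not part of the original slides) of this update on a small discrete set of candidate parameter values; the particular prior and likelihood numbers are illustrative assumptions.

```python
import numpy as np

# Candidate values of the parameter w and our prior belief in each (assumed values).
w_values = np.array([0.2, 0.5, 0.8])
prior = np.array([0.3, 0.4, 0.3])          # p(w)

# Likelihood of one observed data set D under each candidate w (assumed values).
likelihood = np.array([0.10, 0.40, 0.25])  # p(D | w)

# Bayes rule: posterior is proportional to likelihood times prior.
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()   # dividing by p(D) normalizes

for w, po in zip(w_values, posterior):
    print(f"p(w={w} | D) = {po:.3f}")
```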

Binomial distribution and discrete random variables
Suppose a random variable can take only one of two values (e.g., 0 and 1, success and failure, etc.). Such trials are termed Bernoulli trials, with distribution p(x=1) = q and p(x=0) = 1-q. The probability of a specific sequence x of successes and failures containing h successes in n trials is p(x) = q^h (1-q)^(n-h), while the probability of observing h successes in n trials regardless of order is the binomial distribution p(h) = C(n,h) q^h (1-q)^(n-h).
Why is p(h) different from p(x)? Suppose we have h heads. p(x) is the probability of one particular sequence of throws that gave us h heads. There may be other sequences, say y, with the same number of heads, and p(x) = p(y) since both have h heads. So p(h) sums over the various ways one can get h heads, and it is never smaller than p(x) for any particular sequence x containing h heads.
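A short sketch of this distinction (the values q = 0.6, n = 10, h = 7 are assumed for illustration): the probability of one particular sequence with h heads versus the binomial probability of h heads in any order.

```python
from math import comb

q, n, h = 0.6, 10, 7                     # assumed illustrative values

p_sequence = q**h * (1 - q)**(n - h)     # p(x): one specific sequence with h heads
p_count = comb(n, h) * p_sequence        # p(h): h heads in any order

print(f"p(specific sequence)      = {p_sequence:.5f}")
print(f"p(h={h} heads in n={n} tosses) = {p_count:.5f}")
print(f"ratio = C(n, h) = {comb(n, h)}")  # p(h) is C(n,h) times larger
```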

Poor performance of ML estimators with small data samples
Suppose we have a coin and wish to estimate the probability of the outcome (head or tail) from observing a series of coin tosses. Let q = probability of tossing a head. After observing n coin tosses, we note that h of the trials came up heads. The probability of observing a particular sequence D of heads and tails is p(D | q) = q^h (1-q)^(n-h). To estimate whether the next toss will be head or tail, we form the ML estimator:
q_ML = argmax_q p(D | q) = h/n
After one toss, if it comes up tails, our ML estimate predicts zero probability of seeing heads. If the first n tosses are all tails, ML continues to predict zero probability of seeing heads.
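A minimal sketch of the ML estimator q_ML = h/n and its failure mode on a short, all-tails run (the toss sequence here is an assumed example).

```python
def q_ml(tosses):
    """Maximum likelihood estimate of p(head) from a list of 0/1 tosses (1 = head)."""
    return sum(tosses) / len(tosses)

# Assumed example: the first three tosses all come up tails.
tosses = [0, 0, 0]
print(q_ml(tosses))   # 0.0 -- ML predicts zero probability of ever seeing a head
```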

Including prior knowledge in the estimation process
Even though the ML estimator might say q = 0, we "know" that the coin can come up both heads and tails, i.e. 0 < q < 1. The starting point for our consideration is that q is no longer just a number: we give q a full probability distribution. Suppose we know that the coin is either fair (q = 0.5) with probability p, or biased in favor of tails (q = 0.4) with probability 1-p. We want to combine this prior knowledge with new data D (i.e. the number of heads in n throws) to arrive at a posterior distribution for q. We apply Bayes rule:
p(q | D) = p(D | q) p(q) / p(D)
where p(q) is the prior distribution, p(D | q) is the conditional distribution, and p(q | D) is the posterior distribution. The numerator is just the joint distribution of q and D, evaluated at the particular D we observed. The denominator is the marginal distribution of the data D; it is just a number that makes the posterior integrate (here, sum) to one.
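A sketch of this two-hypothesis update: the coin is fair (q = 0.5) with prior probability p, or biased toward tails (q = 0.4) with probability 1-p. The prior p and the data (h heads in n tosses) used in the example call are assumed values.

```python
def posterior_fair(h, n, p_fair=0.5):
    """Posterior probability that the coin is fair, given h heads in n tosses."""
    like_fair = 0.5**h * 0.5**(n - h)        # p(D | q = 0.5)
    like_bias = 0.4**h * 0.6**(n - h)        # p(D | q = 0.4)
    marginal = p_fair * like_fair + (1 - p_fair) * like_bias   # p(D)
    return p_fair * like_fair / marginal

# Assumed example: 4 heads in 10 tosses, prior p = 0.5 on the coin being fair.
print(f"p(fair | D) = {posterior_fair(4, 10):.3f}")
```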

Bayesian estimation for a potentially biased coin
Suppose that we believe the coin is either fair (q = 0.5) or biased toward tails (q = 0.4), where q = probability of tossing a head. After observing n coin tosses, h of the trials came up heads, and Bayes rule gives the posterior probability of each value of q. With this prior, if more than half of the trials turn up heads the coin is judged fair; otherwise our MAP estimate is that the coin is unfair. Now we can calculate exactly the probability that we have a fair coin, given some data D. In contrast to the ML estimate, which only gave us the single number q_ML, we have here a full probability distribution, so we also know how certain we are that the coin is fair or unfair. In some situations we would like a single number that represents our best guess of q. One possibility for this best guess is the maximum a posteriori (MAP) estimate.

Maximum a posteriori estimate
We define the MAP estimate as the maximum (i.e. mode) of the posterior distribution:
q_MAP = argmax_q p(q | D) = argmax_q p(D | q) p(q)
where the denominator p(D) can be dropped because it does not depend on q. The latter form makes the comparison to the maximum likelihood estimate, q_ML = argmax_q p(D | q), easy: ML and MAP are identical if p(q) is a constant that does not depend on q. In that case our prior is a uniform distribution over the domain of q. For obvious reasons we call such a prior a flat or uninformative prior.
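A small sketch comparing ML and MAP numerically on a grid of q values: with a flat prior the two estimates coincide, with a non-flat prior they differ. The prior shape and data here are assumed illustration values.

```python
import numpy as np

h, n = 2, 3                                # assumed data: 2 heads in 3 tosses
q = np.linspace(0.001, 0.999, 999)
likelihood = q**h * (1 - q)**(n - h)       # p(D | q)

flat_prior = np.ones_like(q)               # uniform prior over q
peaked_prior = q * (1 - q)                 # assumed prior favoring q near 0.5

q_ml = q[np.argmax(likelihood)]
q_map_flat = q[np.argmax(likelihood * flat_prior)]
q_map_peaked = q[np.argmax(likelihood * peaked_prior)]

print(q_ml, q_map_flat, q_map_peaked)      # ML equals MAP only for the flat prior
```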

Formulating a continuous prior for the coin toss problem
In the last example the probability of tossing a head, represented by q, could only be either 0.5 or 0.4. How should we choose a prior distribution if q can be anywhere between 0 and 1? Suppose we observe n tosses. The probability that exactly h of those tosses are heads is given by the binomial distribution:
p(h | q, n) = C(n, h) q^h (1-q)^(n-h)
[Figure: binomial distribution of the number of heads h, for a fixed q = probability of tossing a head.]

Formulating a continuous prior for the coin toss problem
q represents the probability of a head. We want a continuous distribution that is defined between q = 0 and q = 1, and is 0 at q = 0 and q = 1. The beta distribution has this form:
p(q) = Beta(q; a, b) = [Gamma(a+b) / (Gamma(a) Gamma(b))] q^(a-1) (1-q)^(b-1)
where the leading term is the normalizing constant that makes the density integrate to one over 0 <= q <= 1.
[Figure: beta densities for different choices of a and b, plotted against q = probability of tossing a head.]
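A sketch evaluating beta densities on a grid for a couple of (a, b) choices; the parameter values are assumptions, picked only to show the shapes, and scipy.stats.beta uses the standard parameterization written above.

```python
import numpy as np
from scipy.stats import beta

q = np.linspace(0, 1, 5)                  # coarse grid over q for display
for a, b in [(2, 2), (3, 5)]:             # assumed illustrative shape parameters
    print(f"Beta(a={a}, b={b}):", np.round(beta.pdf(q, a, b), 3))
```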

Formulating a continuous prior for the coin toss problem
In general, let us assume our knowledge comes in the form of a beta distribution, p(q) = Beta(q; a, b). When we apply Bayes rule to combine this old knowledge (the prior), a beta distribution with parameters a and b, with new knowledge h and n (coming from a binomial distribution), we find that the posterior distribution also has the form of a beta distribution, now with parameters a+h and b+n-h:
p(q | h, n) = Beta(q; a+h, b+n-h)
The beta and binomial distributions are therefore called conjugate distributions.
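A sketch that checks this conjugacy statement numerically: multiplying a Beta(a, b) prior by the binomial likelihood of h heads in n tosses and renormalizing on a grid reproduces the Beta(a+h, b+n-h) density. The parameter and data values are assumptions for illustration.

```python
import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 2.0          # assumed prior parameters
h, n = 7, 10             # assumed data: 7 heads in 10 tosses

q = np.linspace(0.001, 0.999, 999)
dq = q[1] - q[0]
prior = beta.pdf(q, a, b)
likelihood = binom.pmf(h, n, q)

numer = prior * likelihood
posterior_grid = numer / (numer.sum() * dq)        # normalize on the grid
posterior_beta = beta.pdf(q, a + h, b + n - h)     # conjugate closed form

print(np.max(np.abs(posterior_grid - posterior_beta)))   # small numerical error
```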

MAP estimator for the coin toss problem
Let us look at the MAP estimator if we start with a prior of a=1, n=2 pseudo-observations (one head in two tosses), i.e. a slight belief that the coin is fair. Our posterior is then proportional to q^(h+1) (1-q)^(n-h+1), and its mode gives the MAP estimate
q_MAP = (h+1)/(n+2)
which we can compare to the ML estimate q_ML = h/n. [Figure: posterior density over q after a small number of tosses.] We can impose a stronger bias by using a prior equivalent to having already seen 10 heads in 20 tosses; then q_MAP = (h+10)/(n+20). Note that after one toss, if we get a tail, our estimated probability of tossing a head is (0+1)/(1+2) = 0.33, not zero as in the ML case.
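A sketch of this MAP estimate under the pseudo-count reading of the prior used above; the argument names pseudo_heads and pseudo_tosses are introduced here for illustration.

```python
def q_map(h, n, pseudo_heads=1, pseudo_tosses=2):
    """MAP estimate of p(head) with a prior worth pseudo_heads heads in pseudo_tosses tosses."""
    return (h + pseudo_heads) / (n + pseudo_tosses)

print(q_map(0, 1))             # one toss, tails: 0.33, not 0 as with ML
print(q_map(0, 1, 10, 20))     # stronger bias toward a fair coin: about 0.48
```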

Classification with a continuous conditional distribution
Assume you only know the height of a person, but not their gender. Can height tell you something about gender? Let y = height and x = gender (0 = male, 1 = female). What we have are the class-conditional densities, e.g. the density of height given that the person is female. What we want is the probability of gender given the observed height. Height is normally distributed in the population of men and in the population of women, with different means and similar variances. With x an indicator variable for being female, the conditional distribution of y (the height) becomes:
p(y | x=1) = N(y; mu_1, sigma^2),  p(y | x=0) = N(y; mu_0, sigma^2)
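A sketch of these class-conditional densities, using assumed means and a shared standard deviation (mu_male = 178 cm, mu_female = 165 cm, sigma = 7 cm are illustrative values, not from the slides).

```python
from scipy.stats import norm

mu_male, mu_female, sigma = 178.0, 165.0, 7.0     # assumed illustrative values (cm)
y = 170.0                                         # observed height

p_y_given_male = norm.pdf(y, mu_male, sigma)      # p(y | x = 0)
p_y_given_female = norm.pdf(y, mu_female, sigma)  # p(y | x = 1)

# These are densities of height given gender -- not yet the probability of gender.
print(p_y_given_male, p_y_given_female)
```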

Classification with a continuous conditional distribution
Let us further assume that we start with a prior distribution such that x is 1 with probability p. By Bayes rule, the probability of the subject being female, given that you have observed height y, is
p(x=1 | y) = p N(y; mu_1, sigma^2) / [ p N(y; mu_1, sigma^2) + (1-p) N(y; mu_0, sigma^2) ]
which is a logistic function of a linear function of the data and parameters (remember this result from the section on classification!). A maximum-likelihood argument would only have decided under which model the data are more likely. The posterior distribution gives us the full probability that the subject is male or female, and it also lets us include prior knowledge in our scheme.
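A sketch that computes the posterior p(x=1 | y) directly from Bayes rule and checks that it equals a logistic function of a linear function of y, under the same assumed means, shared variance, and an assumed prior p.

```python
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, p = 178.0, 165.0, 7.0, 0.5   # assumed: male mean, female mean, sd, prior p(x=1)
y = 170.0                                     # observed height

# Direct Bayes rule.
num = p * norm.pdf(y, mu1, sigma)
den = num + (1 - p) * norm.pdf(y, mu0, sigma)
post_bayes = num / den

# Same posterior written as a logistic function of a linear function of y
# (coefficients follow from expanding the Gaussian log-likelihood ratio).
w1 = (mu1 - mu0) / sigma**2
w0 = np.log(p / (1 - p)) + (mu0**2 - mu1**2) / (2 * sigma**2)
post_logistic = 1.0 / (1.0 + np.exp(-(w0 + w1 * y)))

print(post_bayes, post_logistic)              # the two agree
```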

Classification with a continuous conditional distribution
Computing the probability that the subject is female, given that we observed height y.
[Figure: the prior probability and the posterior probability that the subject is female, plotted as a function of the observed height y.]

Summary
Bayesian estimation involves the application of Bayes rule to combine a prior density and a conditional density to arrive at a posterior density. Maximum a posteriori (MAP) estimation: if we need a "best guess" from our posterior distribution, the maximum (mode) of the posterior distribution is often used. The MAP and ML estimates are identical when our prior is uniformly distributed on q, i.e. is flat or uninformative. With a two-way classification problem and data that are Gaussian given the category membership, the posterior is a logistic function of a linear function of the data.