Applied Bayesian Inference: A Short Course
Lecture 3 – Computational Techniques
Kevin H. Knuth, Ph.D.
Center for Advanced Brain Imaging and Cognitive Neuroscience and Schizophrenia, The Nathan Kline Institute, NY

Outline
- Exploring the Posterior Probability
- Obtaining Answers to Questions
- Cost Functions
- MAP Estimates
- Markov Chain Monte Carlo
- More Examples

Bayesian Inference
Bayes' Theorem describes how our prior knowledge about a model M, based on our prior information I, is modified by the acquisition of new information or data D:

p(M | D, I) = p(D | M, I) p(M | I) / p(D | I)

Here p(M | D, I) is the posterior probability, p(D | M, I) is the likelihood, p(M | I) is the prior probability, and p(D | I) is the evidence. [Portrait: Rev. Thomas Bayes, 1702-1761]
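
As a minimal numerical sketch (not from the slides), Bayes' Theorem can be applied on a grid of parameter values; the Gaussian prior and likelihood below are made-up stand-ins:

```python
import numpy as np

# Hypothetical single-parameter problem: posterior over x on a grid.
x = np.linspace(-5.0, 5.0, 1001)                      # candidate parameter values
prior = np.exp(-0.5 * (x / 2.0) ** 2)                 # p(x | I): broad Gaussian prior (unnormalized)
likelihood = np.exp(-0.5 * ((1.3 - x) / 0.5) ** 2)    # p(d | x, I) for a single made-up datum d = 1.3

unnormalized = prior * likelihood
evidence = np.trapz(unnormalized, x)                  # p(d | I): the normalization constant
posterior = unnormalized / evidence                   # p(x | d, I)
```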

Exploring the Posterior
We have looked at one-dimensional problems: given the data and a signal model with a single unknown parameter, the posterior over that parameter gives our estimate. [Figure: posterior probability versus phase angle for the one-parameter example]

Exploring the Posterior
...and at two-dimensional problems. [Figure: log posterior (Log P) surface over the two parameters a and b]

Looking at the Mode
The mode of the probability density is the parameter value at which the peak occurs; this is often called the Maximum A Posteriori (MAP) estimate. It is often most easily found by looking for the parameter value at which the derivative of the probability (or of log p) is zero. The variance can be estimated from the local curvature at the peak.
One-dimensional case: the variance is approximately the negative reciprocal of the second derivative of log p, evaluated at the mode.
Multi-dimensional case: the covariance matrix is approximately the negative inverse of the Hessian of log p, evaluated at the mode.
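
A hedged numerical sketch (not the course code; the data, noise level, and flat prior are assumptions for illustration) of a MAP estimate with a curvature-based error bar:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up data with an assumed known noise standard deviation of 0.3 and a flat prior.
data = np.array([1.1, 0.9, 1.4, 1.2, 0.8])

def log_post(x):
    # Log posterior (up to a constant): Gaussian likelihood, flat prior.
    return -0.5 * np.sum(((data - x) / 0.3) ** 2)

# MAP estimate: maximize log_post by minimizing its negative.
res = minimize_scalar(lambda x: -log_post(x), bounds=(-10, 10), method="bounded")
x_map = res.x

# Curvature-based variance: sigma^2 ~ -1 / (d^2 log p / dx^2) at the mode,
# with the second derivative approximated by central finite differences.
h = 1e-4
d2 = (log_post(x_map + h) - 2.0 * log_post(x_map) + log_post(x_map - h)) / h**2
sigma = np.sqrt(-1.0 / d2)
print(f"MAP = {x_map:.3f} +/- {sigma:.3f}")
```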

Looking at the Mean
The mean is also commonly reported; for a symmetric density it is equal to the mode. The variance can be computed by calculating the expected squared deviation from the mean, or sometimes confidence intervals are defined to delineate regions of high probability.

An Example with Difficulties

Stellar Distances from Parallax (Hajian, Knuth, Armstrong)
Here we consider a "simple" one-dimensional problem: given a measurement of the stellar parallax angle θ, infer the distance D to the star. [Figure: parallax geometry, with a star at distance D and parallax angle θ]

What is the Answer??
If we are given a mean and standard deviation for the parallax, we can derive a Gaussian prior probability for θ. The probability for the distance can readily be found by the change of variables

p(D | I) = p(θ | I) |dθ/dD|, where θ = 1/D

(with θ in arcseconds and D in parsecs).

What is the Answer??
Resulting in

p(D | I) ∝ (1/D^2) exp[ −(1/D − θ_0)^2 / (2σ^2) ],

where θ_0 and σ are the measured parallax mean and standard deviation. The density in D is skewed, and its peak does not sit at D = 1/θ_0. Do we report the mode? The mean? The most probable distance estimate does not correspond to the best parallax estimate! The result could make a big difference for someone planning a trip. [Figure: the skewed probability density versus distance D]
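
A short numerical sketch (the parallax measurement below is made up, purely for illustration) of how the mode and mean of the transformed density separate:

```python
import numpy as np

# Hypothetical parallax measurement: mean 0.05 arcsec, standard deviation 0.01 arcsec.
theta0, sigma = 0.05, 0.01

# Distance grid in parsecs and the transformed (unnormalized) density
# p(D | I) = p(theta | I) |d theta / d D| with theta = 1/D.
D = np.linspace(5.0, 60.0, 20000)
p = np.exp(-0.5 * ((1.0 / D - theta0) / sigma) ** 2) / D**2
p /= np.trapz(p, D)                 # normalize on the (truncated) grid

mode = D[np.argmax(p)]              # most probable distance
mean = np.trapz(D * p, D)           # expected distance; note it depends on the upper grid
                                    # cutoff, since the tail of p falls off only as 1/D^2
print(f"1/theta0 = {1/theta0:.1f} pc, mode = {mode:.1f} pc, mean = {mean:.1f} pc")
```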

Warning!
Be careful: the mode of the posterior for a parameter will not, in general, correspond to the mode of the posterior for a transform of that parameter! Researchers often compute an optimal parameter value and then transform it, assuming that the result is still optimal. A physics example of this difficulty is the comparison of the frequency spectrum and the wavelength spectrum of blackbody radiation, whose peaks do not occur at corresponding points (λ_peak ≠ c/ν_peak).

Cost Functions
To single out a definite answer we must employ an additional optimality criterion. Minimizing the expected squared error,

⟨(D − D*)^2⟩ = ∫ dD (D − D*)^2 p(D | I),

with respect to our chosen estimate D* yields the mean as the "best" estimate of the distance.

Cost Functions
Other optimality criteria lead to different "best" estimates. Minimizing the expected absolute error, ⟨|D − D*|⟩, gives the median, whereas maximizing the posterior density itself gives the mode. Note the distinction: the mean minimizes the expected squared deviation of our chosen answer from the correct solution, whereas the mode found in the least-squares approximation minimizes the expected squared deviation between our predicted results and the data.
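
These estimators can be checked empirically from posterior samples; the sketch below is illustrative only, with samples drawn from a made-up skewed density, and it evaluates the squared and absolute costs on a grid of candidate answers:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=3.0, scale=2.0, size=20_000)   # stand-in for posterior samples

candidates = np.linspace(0.0, 20.0, 801)                 # candidate "best" answers
sq_cost = [np.mean((samples - c) ** 2) for c in candidates]    # expected squared error
abs_cost = [np.mean(np.abs(samples - c)) for c in candidates]  # expected absolute error

print("argmin squared cost :", candidates[np.argmin(sq_cost)], " vs mean  ", samples.mean())
print("argmin absolute cost:", candidates[np.argmin(abs_cost)], " vs median", np.median(samples))
```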

Marginalizing Problems Away
When one has a multi-dimensional posterior in which many of the parameters are uninteresting, one can marginalize over those nuisance parameters to reduce the dimensionality of the problem:

p(x | d, I) = ∫ dy p(x, y | d, I).

However, more than three or four parameters can rarely be marginalized over analytically, and conventional numerical integration techniques often fail due to the accumulation of round-off error.
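
On a grid, marginalization is just a numerical integral over the nuisance axis; a sketch with a toy two-parameter posterior (a correlated Gaussian, chosen only for illustration):

```python
import numpy as np

# Toy joint posterior p(x, y | d, I) evaluated on a grid.
x = np.linspace(-4.0, 4.0, 401)
y = np.linspace(-4.0, 4.0, 401)
X, Y = np.meshgrid(x, y, indexing="ij")
log_p = -0.5 * (X**2 - 1.2 * X * Y + Y**2)   # unnormalized log joint (correlated Gaussian)
p_joint = np.exp(log_p - log_p.max())

# Marginalize over the nuisance parameter y: p(x | d, I) = integral dy p(x, y | d, I).
p_x = np.trapz(p_joint, y, axis=1)
p_x /= np.trapz(p_x, x)
```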

Other Difficulties
How does one handle multiply-peaked distributions? [Figure: a multimodal posterior P(x | data, I) plotted against x]

Other Difficulties
...or other multi-dimensional difficulties.

Sampling from the Posterior
There is a way to sample from the posterior using a dynamical system that evolves according to the probability density: Markov chain Monte Carlo. These techniques:
- allow exploration of high-dimensional parameter spaces;
- can actually increase in accuracy with higher-dimensional problems;
- converge slowly, with errors decreasing as O(n^(-1/2)) in the number of samples n.

Markov chain Monte Carlo
Start with a position X in parameter space. Make a transition to a new point Y with transition probability T(Y | X). We want the relative occurrences of X and Y in the chain to be proportional to the ratio of their probabilities; we can control this by either accepting or rejecting the transition with an acceptance probability α. [Figure: a proposed move from X to Y within the model parameter space M]

Strolling along Hypothesis Space
Metropolis-Hastings Algorithm: the transition is performed by incrementing X with a random vector. We employ a symmetric transition probability, T(Y | X) = T(X | Y), with an acceptance probability of

α = min(1, p(Y | d, I) / p(X | d, I)).
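
A compact random-walk Metropolis sampler along these lines (a sketch, not the course code; the target density, step size, and chain length are made-up choices):

```python
import numpy as np

def log_post(x):
    # Illustrative target: an unnormalized, correlated two-parameter log posterior.
    return -0.5 * (x[0] ** 2 - 1.2 * x[0] * x[1] + x[1] ** 2)

def metropolis(log_p, x0, step=0.8, n_steps=50_000, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_p(x)
    chain, accepted = [], 0
    for _ in range(n_steps):
        y = x + step * rng.standard_normal(x.shape)   # symmetric proposal T(Y | X)
        lp_y = log_p(y)
        # Accept with probability alpha = min(1, p(Y)/p(X)), evaluated in log space.
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y
            accepted += 1
        chain.append(x.copy())
    return np.array(chain), accepted / n_steps

chain, acc_rate = metropolis(log_post, x0=[0.0, 0.0])
print(f"acceptance rate = {acc_rate:.2f}, posterior mean ~ {chain[10_000:].mean(axis=0)}")
```

Tuning `step` so that the reported acceptance rate lands in the 30% to 70% range mentioned on the next slide is the usual rule of thumb.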

Markov chain Monte Carlo
If the step size for the transitions is too small, almost every step is accepted, but it takes many steps for the samples to become independent. If the step size is too large, the transitions are rarely accepted. It is best to keep the acceptance rate between 30% and 70%.

Running the Simulation
We evolve many (here, 50) simulations simultaneously. In my MCMC I change one parameter at a time. The numerous simulations allow one to adjust the step size for each parameter to make sure that the acceptance rate is high enough. Bad runs (low-probability simulations) can be abandoned and good ones duplicated; duplicated runs will diverge from one another over time.

Simulated Annealing
Bring in the data slowly by writing the annealed posterior as

p_β(x | d, I) ∝ p(x | I) p(d | x, I)^β.

By varying β from 0 to 1 we can slowly turn on the effect of the data. [Figure: the annealed density at β = 0 (prior only), β = 0.5, and β = 1 (full posterior)]

Benefits of Annealing
Now, with the annealed posterior above, we define the function

Z(β) = ∫ dx p(x | I) p(d | x, I)^β,

so that Z(0) = 1 (the prior is normalized) and Z(1) = p(d | I), the evidence. Differentiating gives

d ln Z(β) / dβ = ⟨ ln p(d | x, I) ⟩_β,

the average of the log-likelihood over samples drawn from the annealed posterior at a given β.

Estimating the Evidence
We then find

ln p(d | I) = ln Z(1) − ln Z(0) = ∫ dβ ⟨ ln p(d | x, I) ⟩_β,

with the integral running from β = 0 to β = 1, so we can calculate the evidence using simulated annealing: run MCMC at a sequence of β values and average the log-likelihood at each. This is sometimes called Thermodynamic Integration.
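
A rough sketch of thermodynamic integration on a toy one-parameter model (the data, priors, β ladder, and sampler settings are all illustrative assumptions, not the course's):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(1.0, 0.5, size=20)       # made-up data with known noise sigma = 0.5

def log_prior(x):                          # Gaussian prior, mean 0, standard deviation 2
    return -0.5 * (x / 2.0) ** 2 - np.log(2.0 * np.sqrt(2.0 * np.pi))

def log_like(x):                           # Gaussian likelihood for the whole data set
    return -0.5 * np.sum(((data - x) / 0.5) ** 2) - len(data) * np.log(0.5 * np.sqrt(2.0 * np.pi))

def sample_tempered(beta, n=20_000, step=0.5):
    """Metropolis samples from the annealed density p_beta(x) ~ prior(x) * likelihood(x)**beta."""
    x, lp = 0.0, log_prior(0.0) + beta * log_like(0.0)
    out = np.empty(n)
    for i in range(n):
        y = x + step * rng.standard_normal()
        lp_y = log_prior(y) + beta * log_like(y)
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y
        out[i] = x
    return out

betas = np.linspace(0.0, 1.0, 11)
mean_loglike = [np.mean([log_like(x) for x in sample_tempered(b)[5_000:]]) for b in betas]
log_evidence = np.trapz(mean_loglike, betas)   # integrate <ln p(d|x,I)>_beta over beta
print(f"ln evidence ~ {log_evidence:.2f}")
```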

An MCMC Example

Estimating the BOLD Response (Knuth, Ardekani, Helpern, ISMRM 2001)
Estimate the shape of the hemodynamic response function, or Blood Oxygenation Level Dependent (BOLD) response, in an event-related functional Magnetic Resonance Imaging (er-fMRI) experiment. [Figure: example response amplitude (arb. units) versus time (sec)]

Modeling the Experiment
We assume a parameterized form of the response (a unit-amplitude normalized Gamma function) with amplitude A, time dilation τ, and shape parameter α; the peak occurs at a time determined jointly by τ and α. [Figure: the model response amplitude (arb. units) versus time (sec)] Given a series of stimuli at times {t_1, t_2, ..., t_S}, we expect the total hemodynamic response to be the sum of single-event responses,

r(t) = Σ_i h(t − t_i; A, τ, α).

Modeling the Experiment
We expect that there will be a baseline voxel intensity, as well as a possible linear drift of the intensity due to the magnetic field drifting during the experiment. The intensity of a voxel is modeled as

v(t) = r(t) + a t + b + n(t),

where r(t) is the predicted hemodynamic response above, a is the linear drift, b is the baseline intensity, and n(t) is an unpredictable noise component.
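
A sketch of this forward model in code; the gamma-shaped response below is one plausible reading of the "unit-amplitude normalized Gamma function" (its exact form is not given in the transcript, so treat it as an assumption):

```python
import numpy as np

def hrf(t, A, tau, alpha):
    """Hypothetical unit-amplitude gamma-shaped response; peaks at t = alpha * tau with height A."""
    t = np.maximum(t, 0.0)                     # no response before the stimulus
    peak = alpha * tau
    return A * (t / peak) ** alpha * np.exp(alpha - t / tau)

def voxel_model(t, stim_times, A, tau, alpha, a, b):
    """Predicted voxel intensity: summed event responses plus linear drift a*t and baseline b."""
    r = sum(hrf(t - ti, A, tau, alpha) for ti in stim_times)
    return r + a * t + b
```

For example, `voxel_model(np.arange(0, 512, 2.0), stim_times=[10, 44, 90], A=3.4, tau=1.3, alpha=4.8, a=0.002, b=405.0)` would generate a synthetic time series at the TR = 2 s sampling of the experiment (the stimulus times here are invented).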

The Data
The experiment was a visual oddball experiment in which the subject pressed a button in response to the oddball. The data, v(t), were taken from a single voxel in the motor strip. The presentation times of the S = 36 oddball stimuli are used in modeling the expected response. A total of 256 gradient-echo single-shot EPI images were obtained using a Siemens 1.5 T Magnetom Vision system with TR = 2 s, TE = 60 ms, image size 64x64, 20 axial slices, 5 mm slice thickness, and FOV = 250 mm. [Figure: voxel intensity versus time (seconds)]

Assigning Probabilities
Using Bayes' Theorem, the posterior for the model parameters A, τ, α, a, b and the noise standard deviation σ is proportional to their prior probability times the likelihood. We assign a Gaussian likelihood for the noise n(t), with standard deviation σ. The prior probabilities are assigned with cutoffs: A, τ, and α are nonnegative, σ is positive, and the cutoff on the response time scale reflects time scales shorter than 30 sec.

Solving the Problem
The posterior is then written out explicitly, and we marginalize over the noise standard deviation σ to obtain a posterior over the remaining parameters alone. We still have a five-dimensional space (A, τ, α, a, b) to work in and must use MCMC.
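
A hedged sketch of what the marginalized log posterior could look like in code: assuming (my assumption, not stated in the transcript) a Jeffreys prior of 1/σ on the noise scale, marginalizing the Gaussian likelihood over σ leaves a log posterior proportional to −(N/2) ln Σ residuals^2, which could then be handed to a sampler like the Metropolis sketch earlier:

```python
import numpy as np

def log_marginal_post(params, t, v, stim_times):
    """Log posterior over (A, tau, alpha, a, b) with sigma marginalized out (Jeffreys prior assumed)."""
    A, tau, alpha, a, b = params
    # Flat priors with cutoffs: reject disallowed parameter values outright.
    if A < 0.0 or tau <= 0.0 or alpha <= 0.0:
        return -np.inf
    # voxel_model is the hypothetical forward-model sketch defined above.
    resid = v - voxel_model(t, stim_times, A, tau, alpha, a, b)
    return -0.5 * len(v) * np.log(np.sum(resid ** 2))
```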

Looking at the Posterior
We can take some slices of the logarithm of the five-dimensional posterior through some likely parameter values... [Figure, left panel: log probability versus amplitude A, slice taken through τ = 1, α = 5, a = 0, b = 400. Right panel: log probability versus baseline intensity b, slice taken through A = 1, τ = 1, α = 5, a = 0]

Monte Carlo Results
The MCMC ran for 225,000 iterations, with the last 25,000 kept.
- A = 3.38 ± 1.28 (arb. intensity units)
- α = 4.84 ± 4.10 (unitless)
- τ = 1.28 ± 1.73 s
- a = 0.00212 ± 0.00161 (arb. int./sec)
- b = 405.00 ± 0.517 (arb. int.)
- peak time = 2.98 ± 0.27 s (but notice the OTHER histogram peak!)

Monte Carlo Results
A density plot of the 25,000 sampled waveforms gives a feel for the response and our uncertainty: A = 3.38 ± 1.28 (arb. intensity units), peak time = 2.98 ± 0.27 s.