580.691 Learning Theory, Reza Shadmehr. Bayesian Learning 2: Gaussian distribution & linear regression; causal inference.

Presentation transcript:

Learning Theory, Reza Shadmehr. Bayesian Learning 2: Gaussian distribution & linear regression; causal inference

For today's lecture we will attack the problem of how to apply Bayes rule when both our prior p(w) and our conditional distribution p(y|w) are Gaussian:

$$\underbrace{p(w \mid y^{(n)})}_{\text{posterior}} \;=\; \frac{\overbrace{p(y^{(n)} \mid w)}^{\text{conditional}}\;\overbrace{p(w)}^{\text{prior}}}{\underbrace{p(y^{(n)})}_{\text{marginal}}}$$

The numerator is just the joint distribution of w and y, p(w, y), evaluated at a particular y^(n). The denominator is the marginal distribution of y, evaluated at y^(n); that is, it is just a number that makes the numerator integrate to one.

[Figure on slide: contours of the joint distribution p(w, y), with the prior p(w) and the marginal p(y) along the axes and the joint distribution evaluated at y^(n) indicated.]
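To make the normalizing role of the denominator concrete, here is a minimal numerical sketch (not from the slides; the grid, the prior parameters, the noise variance, and the observed value y^(n) are made up) that applies Bayes rule on a grid for a scalar weight w with a Gaussian prior and a Gaussian conditional:

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Arbitrary example values (not from the lecture)
w_grid = np.linspace(-5, 5, 2001)        # grid over the scalar weight w
dw = w_grid[1] - w_grid[0]
prior = gaussian_pdf(w_grid, mean=0.0, var=1.0)        # p(w)
y_n = 1.3                                              # observed y^(n)
likelihood = gaussian_pdf(y_n, mean=w_grid, var=0.5)   # p(y^(n) | w)

numerator = likelihood * prior              # joint p(w, y) evaluated at y = y^(n)
marginal = np.sum(numerator) * dw           # p(y^(n)): just a number
posterior = numerator / marginal            # p(w | y^(n))

print("posterior integrates to", np.sum(posterior) * dw)  # ~1.0
```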

Example: Linear regression with a prior

In the linear regression example the prior on the weights w is Gaussian and the observation y given w is Gaussian, so the joint probability p(w, y) is normally distributed. Now what we would like to do is to factor this expression so that we can write it as a conditional distribution times a marginal, p(w, y) = p(w|y) p(y). If we can do this, then the conditional distribution is the posterior that we are looking for.

For the rest of the lecture we will solve this problem when our prior and the conditional distribution are both normally distributed. The multivariate normal distribution is

$$p(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$

where x is a d × 1 vector, μ is the d × 1 mean, and Σ is the d × d variance-covariance matrix. The distribution has two parts: the exponential part is a quadratic form that determines the shape of the Gaussian, and the factor in front is just a constant that makes the exponential part integrate to 1 (it does not depend on x). Now let's start with two variables that have a joint Gaussian distribution, where x_1 is a p × 1 vector and x_2 is a q × 1 vector, with cross-covariance Σ_12:

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\; \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right), \qquad \Sigma_{11}: p\times p,\;\; \Sigma_{12}: p\times q,\;\; \Sigma_{21}: q\times p,\;\; \Sigma_{22}: q\times q.$$
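As a quick check of the density formula, the sketch below (the mean, covariance, and query point are made-up values) evaluates the constant factor and the quadratic-form exponential directly and compares the result with scipy.stats.multivariate_normal:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    """Multivariate normal density: constant factor times the exponential quadratic form."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)                 # (x-mu)' Sigma^{-1} (x-mu)
    const = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return const * np.exp(-0.5 * quad)

# Made-up example values
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
x = np.array([0.5, -1.5])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal.pdf(x, mean=mu, cov=Sigma))  # should match
```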

How would we calculate p(x_1 | x_2)? The following calculation for Gaussians will be a little long, but it is worth it, because the result will be extremely useful: often the quantities we work with are Gaussian, and often we can use the Gaussian distribution as an approximation. To calculate the posterior probability, we need to know how to factorize the joint probability into a part that depends on both x_1 and x_2 and a part that depends only on x_2. So we need to learn how to block-diagonalize the variance-covariance matrix. For a block matrix

$$M = \begin{bmatrix} E & F \\ G & H \end{bmatrix}, \qquad \begin{bmatrix} I & -FH^{-1} \\ 0 & I \end{bmatrix}\begin{bmatrix} E & F \\ G & H \end{bmatrix}\begin{bmatrix} I & 0 \\ -H^{-1}G & I \end{bmatrix} = \begin{bmatrix} E - FH^{-1}G & 0 \\ 0 & H \end{bmatrix}.$$

The block M/H = E - FH^{-1}G is called the Schur complement of the matrix M with respect to H.

Now let’s take the determinant of the above equation. Remember for square matrices A and B: det(AB)=det(A)*det(B). Also remember that the determinant of a block-triangular matrix is just the product of the determinants of the diagonal blocks. As a second result, what is M -1 ? Result 1 Result 2

We use Result 1 to split the constant factor in front of the multivariate Gaussian into two factors, and Result 2 to factorize the exponential part into two factors (writing Σ/Σ_22 = Σ_11 - Σ_12 Σ_22^{-1} Σ_21):

$$(2\pi)^{-(p+q)/2}|\Sigma|^{-1/2} = \underbrace{(2\pi)^{-p/2}\,|\Sigma/\Sigma_{22}|^{-1/2}}_{(A)}\;\underbrace{(2\pi)^{-q/2}\,|\Sigma_{22}|^{-1/2}}_{(B)}$$

$$\exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right) = \underbrace{\exp\!\left(-\tfrac{1}{2}\big(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\big)^{\top}(\Sigma/\Sigma_{22})^{-1}\big(x_1-\mu_1-\Sigma_{12}\Sigma_{22}^{-1}(x_2-\mu_2)\big)\right)}_{(C)}\;\underbrace{\exp\!\left(-\tfrac{1}{2}(x_2-\mu_2)^{\top}\Sigma_{22}^{-1}(x_2-\mu_2)\right)}_{(D)}$$

Now see that parts A and C combine to one normal distribution and parts B and D to another, so the joint factorizes as p(x_1, x_2) = p(x_1 | x_2) p(x_2). Thus we can write: if x_1 and x_2 are jointly normally distributed with

$$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\!\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix},\; \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right),$$

then x_1 given x_2 has a normal distribution with

$$\text{mean:}\quad \mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2), \qquad \text{variance:}\quad \Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
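The conditional mean and variance can be verified numerically; in the sketch below (the joint mean and covariance are made-up values) the analytic formulas are compared with a crude Monte-Carlo estimate obtained by sampling the joint and keeping samples whose x_2 is near the conditioning value:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary joint Gaussian over x = [x1, x2] with x1 scalar (p=1) and x2 scalar (q=1)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.7],
                  [0.7, 2.0]])
mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2_obs = np.array([0.5])

# Conditional distribution of x1 given x2 = x2_obs
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2_obs - mu2)
cond_cov  = S11 - S12 @ np.linalg.solve(S22, S21)
print("analytic:", cond_mean, cond_cov)

# Monte-Carlo check: sample the joint and keep samples with x2 near x2_obs
samples = rng.multivariate_normal(mu, Sigma, size=500_000)
near = samples[np.abs(samples[:, 1] - x2_obs[0]) < 0.02, 0]
print("sampled :", near.mean(), near.var())
```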

Linear regression with a prior and the relationship to the Kalman gain: posterior mean and variance
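Applying the conditional-Gaussian result above to the joint distribution of the weights w and a single observation y^(n) = x^(n)ᵀ w + ε, with prior w ~ N(μ, P) and noise ε ~ N(0, σ²), gives a Gaussian posterior whose mean update can be written with a Kalman gain k = P x^(n) / (x^(n)ᵀ P x^(n) + σ²). The sketch below (all numbers made up) computes the posterior mean and variance in this gain form:

```python
import numpy as np

# Made-up example: prior over 2 weights and a single new observation
mu = np.array([0.0, 0.0])            # prior mean of w
P  = np.array([[1.0, 0.0],
               [0.0, 1.0]])          # prior covariance of w
x  = np.array([1.0, 2.0])            # regressor x^(n)
sigma2 = 0.25                        # observation noise variance
y  = 1.7                             # observed y^(n)

# Joint Gaussian of (w, y): Cov(w, y) = P x,  Var(y) = x' P x + sigma^2
s = x @ P @ x + sigma2

# Conditional-Gaussian result, written in Kalman-gain form
k = P @ x / s                        # Kalman gain
post_mean = mu + k * (y - x @ mu)    # posterior mean
post_cov  = P - np.outer(k, x) @ P   # posterior variance, (I - k x') P

print(post_mean)
print(post_cov)
```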

Causal inference

Recall that in the hiking problem we had two GPS devices that measured our position, and we combined the readings from the two devices to form an estimate of our location. This approach makes sense if the two readings are close to each other. However, we can hardly be expected to combine the two readings if one of them tells us that we are on the north bank of the river and the other that we are on the south bank; we know that we are not in the middle of the river! In this case the idea of combining the two readings makes little sense.

Wallace and colleagues (2004) examined this question by placing people in a room where LEDs and small speakers were arranged around a semi-circle (Fig. 1A). A volunteer sat at the center of the semi-circle and held a pointer. The experiment began with the volunteer fixating a location (fixation LED, Fig. 1A). An auditory stimulus was presented from one of the speakers, and then one of the LEDs was turned on 200, 500, or 800 ms later. The volunteer estimated the location of the sound by pointing (pointer, Fig. 1A), and then pressed a switch with their foot if they thought that the light and the sound came from the same location.

The results of the experiment are plotted in Fig. 1B and C. The perception of unity was highest when the two events occurred in close temporal and spatial proximity. Importantly, when the volunteers perceived a common source, their perception of the location of the sound was strongly affected by the location of the light. If the reading of the sound's location is x_a and the location of the LED is x_v, with measurement variances σ_a² and σ_v², the combined estimate of the sound's location is the reliability-weighted average

$$\hat{x}_a = \frac{\sigma_v^2\, x_a + \sigma_a^2\, x_v}{\sigma_a^2 + \sigma_v^2}.$$

The estimate of the location of the sound was biased by the location of the LED when the volunteer thought that there was a common source (Fig. 1C). This bias fell to near zero when the volunteer perceived the light and the sound to originate from different sources.

People were asked to report their perception of unity, i.e., whether the light and the sound came from the same location. Wallace et al. (2004) Exp Brain Res 158:
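The bias toward the LED in the common-source condition is what the reliability-weighted average above predicts; here is a minimal sketch of that combination (the noise variances and the stimulus locations are made-up numbers):

```python
def fuse(x_a, x_v, var_a, var_v):
    """Reliability-weighted combination of an auditory and a visual reading."""
    return (var_v * x_a + var_a * x_v) / (var_a + var_v)

# Made-up numbers: sound reading at 10 deg, LED at 0 deg; audition is much
# noisier than vision, so the combined estimate is pulled toward the LED.
print(fuse(x_a=10.0, x_v=0.0, var_a=16.0, var_v=1.0))   # ~0.6 deg
```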

When our various sensory organs produce reports that are temporally and spatially in agreement, we tend to believe that a single source was responsible for both observations. In this case we combine the readings from the sensors to estimate the state of the source. On the other hand, if our sensory measurements are temporally or spatially inconsistent, then we view the events as having disparate sources, and we do not combine the measurements. In reality, our belief as to whether or not there was a common source is not black or white; rather, there is some probability that there was a common source, and this probability should have a lot to do with how we combine the information from the various sensors.
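The transcript does not spell out a specific model, but one simple way to implement "weight the combined estimate by the probability of a common source" is sketched below; the Gaussian source prior, the noise variances, the 50% prior on a common cause, and the fall-back of trusting audition alone are all illustrative assumptions, not the lecture's model:

```python
import numpy as np

def gauss(x, var):
    """Zero-mean Gaussian density."""
    return np.exp(-0.5 * x ** 2 / var) / np.sqrt(2 * np.pi * var)

def estimate_sound(x_a, x_v, var_a=16.0, var_v=1.0, var_src=400.0, p_common=0.5):
    """Model-averaged estimate of the sound location.

    Common cause: x_a and x_v are noisy readings of one source, so their
    difference is ~ N(0, var_a + var_v).
    Separate causes: the two sources are drawn independently from a broad
    prior, so the difference is ~ N(0, var_a + var_v + 2 * var_src).
    """
    like_common = gauss(x_a - x_v, var_a + var_v)
    like_separate = gauss(x_a - x_v, var_a + var_v + 2.0 * var_src)
    post_common = (p_common * like_common /
                   (p_common * like_common + (1 - p_common) * like_separate))

    fused = (var_v * x_a + var_a * x_v) / (var_a + var_v)   # combine readings
    alone = x_a                                             # trust audition alone
    return post_common * fused + (1 - post_common) * alone, post_common

# Nearby events: strong belief in a common source, estimate pulled toward the LED
print(estimate_sound(x_a=5.0, x_v=0.0))
# Far-apart events: a common source is unlikely, so there is little bias toward the LED
print(estimate_sound(x_a=40.0, x_v=0.0))
```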