Integration of sensory modalities

Integration of sensory modalities. 580.691 Learning Theory, Reza Shadmehr: maximum likelihood and the integration of sensory modalities.

We showed that linear regression, the steepest descent algorithm, and LMS all minimize the same sum-of-squared-errors cost function. This is just one possible cost function, so what is its justification? Today we will see that this cost function gives rise to the maximum likelihood estimate if the data are normally distributed.
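For reference, a minimal sketch of that cost function, assuming the usual sum-of-squared-errors form over n observations:

$$J(\mathbf{w}) \;=\; \sum_{i=1}^{n}\bigl(y^{(i)} - \mathbf{w}^{\mathsf{T}}\mathbf{x}^{(i)}\bigr)^{2}$$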

Expected value and variance of scalar random variables
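A minimal sketch of the two definitions this slide title refers to, for a scalar random variable x with density p(x):

$$E[x] \;=\; \int x\,p(x)\,dx, \qquad \operatorname{var}(x) \;=\; E\bigl[(x - E[x])^{2}\bigr] \;=\; E[x^{2}] - E[x]^{2}$$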

Statistical view of regression. Suppose the outputs y were actually produced by some process: there is the "true" underlying process, what we measured, and our model of the process. Given a constant X, the underlying process would give us a different y every time we observe it. Given each "batch", we fit our parameters. What is the probability of observing the particular y in trial i?
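A hedged reconstruction of the three quantities named above, assuming the standard linear model with additive noise:

$$\underbrace{y^{(i)} = \mathbf{w}^{*\mathsf{T}}\mathbf{x}^{(i)} + \varepsilon^{(i)}}_{\text{the "true" underlying process}}, \qquad \underbrace{\{\mathbf{x}^{(i)}, y^{(i)}\}_{i=1}^{n}}_{\text{what we measured}}, \qquad \underbrace{\hat{y} = \mathbf{w}^{\mathsf{T}}\mathbf{x}}_{\text{our model of the process}}$$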

Probabilistic view of linear regression. Linear regression expresses the random variable y(n) in terms of an input-independent variation e around the mean of y given x. Let us assume that the residual e is normally distributed with mean zero and some variance; then the outputs y given x are also normally distributed. The probability of observing some specific value of y, given our hypothesized model structure, a specific input x, and specific weights w, has its peak at our expectation yhat and a variance that depends on the variance of the residual.
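A minimal sketch of this conditional density in code, assuming the Gaussian-noise model y = wᵀx + e with e ~ N(0, σ²); the function name and inputs are illustrative:

```python
import numpy as np

def gaussian_likelihood(y, x, w, sigma):
    """p(y | x, w, sigma): a Gaussian centered on the model prediction w.x."""
    y_hat = w @ x  # the expectation yhat for this input
    return np.exp(-(y - y_hat) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
```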

Probabilistic view of linear regression. As the variance (i.e., the spread) of the residual e increases, our confidence in our model's guess decreases.

Probabilistic view of linear regression. Example: suppose we knew the underlying process. Given some data points, we estimate w and also guess the variance of the noise; we can then compute the probability of each y that we observed. We want to find the set of parameters that maximizes P for all the data.

Maximum likelihood estimation We view the outputs y(n) as random variables that were generated by a probabilistic process that had some distribution with unknown parameters θ (e.g., mean and variance). The “best” guess for θ is one that maximizes the joint probability that the observed data came from that distribution.
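In symbols, assuming the observations are drawn i.i.d. from a density p(y | θ):

$$\hat{\theta}_{\text{ML}} \;=\; \arg\max_{\theta}\; p\bigl(y^{(1)},\dots,y^{(n)} \mid \theta\bigr) \;=\; \arg\max_{\theta}\;\prod_{i=1}^{n} p\bigl(y^{(i)} \mid \theta\bigr)$$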

Maximum likelihood estimation: uniform distribution. Suppose that n numbers y(i) were drawn from a distribution and we need to estimate the parameters of that distribution. Suppose that the distribution was uniform. The likelihood is the probability that the data came from a model with our specific parameter value.
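A small illustrative sketch, assuming a Uniform(0, θ) parameterization (the slide's exact parameterization may differ): the likelihood is θ⁻ⁿ whenever θ ≥ max y and zero otherwise, so the smallest admissible θ, the sample maximum, maximizes it.

```python
import numpy as np

# Illustrative data; theta_ml is the ML estimate of the upper bound.
y = np.array([0.8, 2.3, 1.1, 3.7, 2.9])
theta_ml = y.max()  # likelihood theta**(-n) is largest at the smallest feasible theta
```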

Maximum likelihood estimation: exponential distribution. The parameter is found by maximizing the log-likelihood. An example of an exponentially distributed quantity is the time between two events; a Poisson process gives rise to such a probability density function.
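A minimal sketch for one common parameterization, p(y) = λ e^(−λy) (the slide may instead use the mean 1/λ): setting the derivative of the log-likelihood n log λ − λ Σ y to zero gives λ̂ = n / Σ y, the reciprocal of the sample mean.

```python
import numpy as np

# Illustrative inter-event times; lam_ml is the ML estimate of the rate.
y = np.array([0.4, 1.2, 0.7, 2.5, 0.9])
lam_ml = 1.0 / y.mean()  # equals n / sum(y)
```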

Maximum likelihood estimation: Normal distribution. Now we see that if σ is a constant, the log-likelihood is, up to sign and an additive constant, proportional to our cost function (the sum of squared errors!)
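A hedged reconstruction of the log-likelihood behind that statement, assuming n i.i.d. observations with y^(i) = wᵀx^(i) + e and e ~ N(0, σ²):

$$\log L(\mathbf{w},\sigma) \;=\; -\frac{n}{2}\log\bigl(2\pi\sigma^{2}\bigr) \;-\; \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\bigl(y^{(i)} - \mathbf{w}^{\mathsf{T}}\mathbf{x}^{(i)}\bigr)^{2}$$

For a fixed σ, maximizing this over w is the same as minimizing the sum of squared errors.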

Maximum likelihood estimation: Normal distribution

Probabilistic view of linear regression. If we assume that the y(i) are independently and identically distributed (i.i.d.), conditional on x(i), then the joint conditional distribution of the data y is obtained by taking the product of the individual conditional probabilities. When you have been given a batch of data and a model, you can compute the probability of observing the data points that were given to you, given the input X and model w. Given our model, we can assign a probability to our observation. We want to find parameters that maximize the probability that we will observe data like those that we were given.
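A minimal sketch of that joint quantity in code, computed as a log-likelihood for numerical stability; the Gaussian-noise assumption and the function name are illustrative:

```python
import numpy as np

def log_likelihood(w, sigma, X, y):
    """Joint log-probability of i.i.d. outputs y given inputs X, weights w, noise sigma."""
    residuals = y - X @ w
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma ** 2) - np.sum(residuals ** 2) / (2 * sigma ** 2)
```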

Probabilistic view of linear regression. Given some data D, and two models, (w1, σ) and (w2, σ), the better model has the larger joint probability for the actually observed data. Given a variance σ, the probability density function is smaller the farther away the point is from the mean (the predicted value).

Probabilistic view of linear regression. Given some data D, and two models, (w, σ1) and (w, σ2), the better model has the larger joint probability for the actually observed data. The underlying process here was generated with σ = 1, our model was second order, and our joint probability on this data set happened to peak near σ = 1.

The same underlying process will generate different D on each run, resulting in different estimates of w and σ, despite the fact that the underlying process did not change.

Likelihood of our model. Given some observed data and a model structure, we try to find the parameters w and σ that maximize the joint probability over the observed data: the likelihood that the data came from a model with our specific parameter values.

Maximizing the likelihood. It is easier to maximize the log of the likelihood function (the log-likelihood). Finding the w that maximizes the likelihood is equivalent to finding the w that minimizes the sum-of-squared-errors loss function; minimizing this particular loss function therefore maximizes the likelihood of our model if the data came from a probability density that is normally distributed around our model's prediction.

Finding the weights that maximize the likelihood. Differentiate the log-likelihood with respect to w (an m×1 vector; X is an n×m matrix; all remaining terms are scalars) and set the derivative to zero. The result is the ML estimate of w for this model.
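A minimal sketch, assuming the Gaussian-noise model above, in which case the ML weights coincide with the least-squares solution (XᵀX)⁻¹Xᵀy; a least-squares solver is used here for numerical stability:

```python
import numpy as np

def ml_weights(X, y):
    """ML estimate of w under Gaussian noise: the least-squares solution."""
    w_ml, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves min_w ||y - X w||^2
    return w_ml
```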

Finding the noise variance that maximizes the likelihood. Setting the derivative of the log-likelihood with respect to σ² to zero gives the ML estimate of σ² for this model.
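A minimal sketch, assuming the same Gaussian model: the ML estimate of σ² is the mean squared residual (it divides by n rather than n − 1, so it is biased slightly downward):

```python
import numpy as np

def ml_noise_variance(X, y, w_ml):
    """ML estimate of the noise variance: mean squared residual."""
    residuals = y - X @ w_ml
    return np.mean(residuals ** 2)
```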

The hiking in the woods problem: combining information from various sources. We have gone on a hiking trip and taken with us two GPS devices, one from a European manufacturer and the other from a US manufacturer. These devices use different satellites for positioning. Our objective is to figure out how to combine the information from the two sensors (the combined reading is a 4×1 vector). We want to find the position x that maximizes the likelihood function of the true position.

Our most likely location is the one that weights the reading from each device by the inverse of that device's noise covariance. In other words, we should discount the reading from each device according to the inverse of that device's uncertainty. If we stay still and do not move, the variance in our readings is simply due to noise in the devices. By combining the information from the two devices, the variance of our estimate becomes smaller than the variance of either device alone.
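A minimal sketch of that fusion rule, assuming two unbiased readings x1 and x2 with noise covariances S1 and S2; the function and variable names are illustrative:

```python
import numpy as np

def fuse(x1, S1, x2, S2):
    """Precision-weighted (inverse-covariance) combination of two position readings."""
    P1, P2 = np.linalg.inv(S1), np.linalg.inv(S2)  # precisions
    fused_cov = np.linalg.inv(P1 + P2)             # smaller than either S1 or S2
    fused_x = fused_cov @ (P1 @ x1 + P2 @ x2)      # weight each reading by its precision
    return fused_x, fused_cov
```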

ML estimate

Marc Ernst and Marty Banks (2002) were the first to demonstrate that when our brain makes a decision about a physical property of an object, it does so by combining the various sources of sensory information about that object in a way that is consistent with maximum likelihood state estimation. Ernst and Banks began by considering a hypothetical situation in which one has to estimate the height of an object. Suppose that you use your index finger and thumb to hold an object. Your haptic system and your visual system each report its height. If the noise in the two sensors is equal, then the weights that you apply to the two sensors are equal as well. This case is illustrated in the left column of the next figure. On the other hand, if the noise is larger for proprioception, your uncertainty is greater for that sensor and so you apply a smaller weight to its reading (right column of the next figure).

Left column: equal uncertainty in vision and proprioception. Right column: more uncertainty in proprioception.

Measuring the noise in a biological sensor. If one were to ask you to report the height of the object, of course you would not report your belief as a probability distribution. To estimate this distribution, Ernst and Banks acquired a psychometric function, shown in the lower part of the graph. To acquire this function, they provided their subjects with a standard object of height 5.5 cm. They then presented a second object of variable height and asked whether it was taller than the first object. If the subject represents the height of the standard object with a maximum likelihood estimate, then the probability of classifying the second object as taller is simply the cumulative probability distribution; this is the psychometric function. The point of subjective equality (PSE) is the height at which this function equals 0.5.
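A minimal sketch of such a psychometric function, assuming the internal estimate of the standard's height is Gaussian; the 5.5 cm mean comes from the slide, while the standard deviation here is purely illustrative:

```python
import numpy as np
from scipy.stats import norm

def p_taller(h, mu=5.5, sigma=0.5):
    """Probability of judging a comparison object of height h as taller than the standard."""
    return norm.cdf(h, loc=mu, scale=sigma)  # cumulative Gaussian; equals 0.5 at h == mu (the PSE)
```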

The authors estimated that the noise in the haptic sense was four times larger than the noise in the visual sense. This implies that in integrating visual and haptic information about an object, the brain should weight the visual information four times as much as the haptic information. To test this, subjects were presented with a standard object for which the haptic information and the visual information indicated different heights. Subjects would assign a weight of around 0.8 to the visual information and around 0.2 to the haptic information. To estimate these weights, the authors presented a second object (for which the haptic and visual information agreed) and asked which one was taller.
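A short check of those weights, assuming "four times larger noise" refers to the variance, so that σ²_hap = 4 σ²_vis and the ML weights are inverse-variance weights:

$$w_{\text{vis}} \;=\; \frac{1/\sigma_{\text{vis}}^{2}}{1/\sigma_{\text{vis}}^{2} + 1/\sigma_{\text{hap}}^{2}} \;=\; \frac{1}{1 + \tfrac{1}{4}} \;=\; 0.8, \qquad w_{\text{hap}} \;=\; 1 - w_{\text{vis}} \;=\; 0.2$$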

Summary: the "true" underlying process, what we measured, our model of the process, and the ML estimate of the model parameters, given X.