Learning Theory Reza Shadmehr LMS with Newton-Raphson, weighted least squares, choice of loss function.

Review of regression. Multivariate regression: the batch algorithm, steepest descent, and LMS.
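As a refresher, the batch solution and the sample-by-sample LMS rule can be sketched as follows. This is a minimal NumPy illustration on synthetic data; the learning rate and number of passes are illustrative choices, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear data: y = X w_true + noise
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

# Batch solution: solve the normal equations X^T X w = X^T y
w_batch = np.linalg.solve(X.T @ X, X.T @ y)

# LMS (steepest descent on one sample at a time):
#   w <- w + eta * x * (y - x^T w)
w_lms = np.zeros(3)
eta = 0.05
for _ in range(20):                      # several passes over the data
    for x_i, y_i in zip(X, y):
        w_lms += eta * x_i * (y_i - x_i @ w_lms)

print(w_batch, w_lms)                    # both approach w_true
```

The batch estimate jumps straight to the minimizer; LMS approaches the same solution incrementally, which is why it is the natural online counterpart.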

Finding the minimum of a function in a single step. Taylor series expansion: J(w) ≈ J(w0) + ∇J(w0)ᵀ(w − w0) + ½(w − w0)ᵀH(w0)(w − w0). (This is exact if J is quadratic; otherwise higher-order terms remain.)

Newton-Raphson method

The gradient of the loss function, and the Newton-Raphson update: w ← w − H⁻¹∇J(w), where H is the Hessian of J.

The gradient of the loss function

LMS algorithm with Newton-Raphson versus the steepest descent algorithm. Note: in LMS, where the update comes from a single data point, the matrix x xᵀ is singular (rank one), so it cannot serve as an invertible Hessian.
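The contrast can be made concrete. For the quadratic loss J(w) = ½‖y − Xw‖², the Hessian is XᵀX and a single Newton-Raphson step lands on the batch solution, while the rank-one matrix from a single sample is singular. A short NumPy sketch (synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=100)

w = np.zeros(2)
grad = -X.T @ (y - X @ w)          # gradient of J(w) = 0.5 * ||y - X w||^2
H = X.T @ X                        # Hessian (constant for a quadratic loss)
w = w - np.linalg.solve(H, grad)   # one Newton-Raphson step = batch solution

# A single sample's contribution x x^T is rank one, hence singular:
x = X[0]
print(np.linalg.matrix_rank(np.outer(x, x)))  # -> 1
```

Because J is quadratic, the quadratic Taylor expansion is exact and one Newton step suffices; steepest descent would need many small steps to reach the same point.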

Weighted least squares. Suppose some data points are more important than others; we want to weight the errors at those data points more heavily.

How to handle artifacts in fMRI data (Diedrichsen and Shadmehr, NeuroImage, 2005). In fMRI, we typically measure the signal intensity from N voxels at acquisition times t = 1…T; each of these T measurements constitutes an image. We assume that the time series of voxel n is an arbitrary linear function of the design matrix plus a noise term: y_n = X β_n + ε_n, where y_n is a T × 1 column vector, X is a T × p design matrix, and β_n is a p × 1 vector. If one source of noise is due to random discrete events, for example artifacts arising from the participant moving their jaw, then only some images will be influenced, violating the assumption of a stationary noise process. To relax this assumption, a simple approach is to allow the variance of the noise in each image to be scaled by a separate parameter. Under the temporal-independence assumption, the variance-covariance matrix of the noise process is then diagonal, with entry s_i, a variance scaling parameter for the i-th time that the voxel was imaged.

Discrete events (e.g., swallowing) will impact only those images that were acquired during the event. What should be done with these images once they are identified? A typical approach is to discard images based on some fixed threshold. If we knew V, the optimal approach would be to weight the images by the inverse of their variance. But how do we get V? We can use the residuals from our model. This is a good start, but it has some issues regarding the bias of our estimator of the variance; to improve on it, see Diedrichsen and Shadmehr (2005).

"Normal equations" for weighted least squares: minimizing the weighted loss J(w) = ½(y − Xw)ᵀV⁻¹(y − Xw) gives w = (XᵀV⁻¹X)⁻¹XᵀV⁻¹y. The corresponding online update is the weighted LMS rule.
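The weighted normal equations are a one-liner in practice. The sketch below (synthetic data; the corruption pattern and variances are illustrative) down-weights "images" with inflated noise variance, exactly the inverse-variance weighting motivated above:

```python
import numpy as np

rng = np.random.default_rng(2)
T, p = 120, 3
X = rng.normal(size=(T, p))
beta = np.array([1.0, 0.5, -1.5])

# Heteroscedastic noise: some observations are corrupted by large variance
s = np.ones(T)
s[::10] = 25.0                     # every 10th observation is 25x noisier
y = X @ beta + rng.normal(size=T) * np.sqrt(s)

# Weighted normal equations: w = (X^T V^-1 X)^-1 X^T V^-1 y, with V = diag(s)
Vinv = np.diag(1.0 / s)
w_wls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Ordinary least squares, for comparison, treats all observations equally
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(w_wls, w_ols)
```

With the true s_i known, the weighted estimate is the minimum-variance linear estimator; in the fMRI setting the s_i must themselves be estimated from residuals, as discussed above.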

Regression with basis functions. In general, predictions can be based on a linear combination of a set of basis functions: ŷ = Σ_i w_i g_i(x), for a basis set {g_1, …, g_m}. Examples: a linear basis set, g_i(x) = x_i, and a Gaussian, or radial, basis set (RBF), g_i(x) = exp(−‖x − c_i‖²/(2σ²)). Each Gaussian basis is a local expert: its output measures how close the features of the input are to those preferred by expert i.

[Figure: the input space feeds a collection of experts (basis functions), whose weighted outputs are combined to form the output.]

Regression with basis functions
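Because the model is linear in the weights w, fitting an RBF regression is still ordinary least squares on the basis outputs. A minimal sketch (1-D toy target, illustrative centers and width; the small ridge term is an added numerical-stability assumption, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D target function sampled with noise
x = np.linspace(-3, 3, 80)
y = np.sin(x) + 0.1 * rng.normal(size=80)

# Gaussian (radial) basis: each center c_i acts as a "local expert"
centers = np.linspace(-3, 3, 10)
sigma = 0.8
G = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))  # 80 x 10

# Linear regression on the basis outputs (normal equations, with a tiny
# ridge term purely for numerical stability)
w = np.linalg.solve(G.T @ G + 1e-6 * np.eye(10), G.T @ y)
y_hat = G @ w

print(np.mean((y_hat - np.sin(x)) ** 2))  # small residual error
```

The design matrix G plays the role of X in the earlier slides; everything from the normal equations to weighted LMS carries over unchanged.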

Choice of loss function. In learning, our aim is to find parameters w that minimize the expected loss: E[loss] = ∫ loss(ỹ) p(ỹ | w) dỹ, where p(ỹ | w) is the probability density of the error given our model parameters. This is a weighted sum: each loss is weighted by the likelihood of observing that error.

Inferring the choice of loss function from behavior (Kording & Wolpert, PNAS, 2004). A trial lasted 6 seconds. Over this period, a series of 'peas' appeared near the target, drawn from a distribution that depended on the finger position. The objective was to "place the finger so that, on average, the peas land as close as possible to the target."

The delta loss function. Imagine that the learner cannot arbitrarily change the density of the errors through learning; all the learner can do is shift the density left or right by setting the parameter w. If the learner uses the delta loss function, then the smallest possible expected loss occurs when p(ỹ) has its peak at ỹ = 0. Therefore, in the plot above, the choice w2 is better than w1. In effect, the w that the learner chooses will depend on the exact shape of p(ỹ).

Behavior with the delta loss function. Suppose the "outside" system (e.g., the teacher) sets the distribution of errors. Given the loss function, we can then predict what the best w will be for the learner.

Behavior with the squared-error loss function. We have a p(ỹ) whose variance is independent of w, so to minimize the expected loss we should pick the w that produces the smallest E[ỹ²]. That happens at the w that sets the mean of p(ỹ) equal to zero.
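The two loss functions therefore predict different behavior whenever p(ỹ) is skewed: the delta loss puts the mode of the error density at zero, while the squared loss puts the mean at zero. A numerical sketch of this contrast (the gamma-shaped error density is an illustrative assumption, not the distribution used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(4)

# A skewed error density the learner cannot reshape, only shift by w
errors = rng.gamma(shape=2.0, scale=1.0, size=100_000)   # mode = 1, mean = 2

ws = np.linspace(-4.0, 0.0, 401)                         # candidate shifts

# Squared-error loss: expected loss minimized when the mean sits at zero
sq_loss = [np.mean((errors + w) ** 2) for w in ws]
w_sq = ws[np.argmin(sq_loss)]

# Delta loss (reward only for exactly zero error): expected loss is -p(0),
# approximated here by counting samples in a narrow bin around zero
def neg_density_at_zero(w, width=0.05):
    return -np.mean(np.abs(errors + w) < width) / (2 * width)

delta_loss = [neg_density_at_zero(w) for w in ws]
w_delta = ws[np.argmin(delta_loss)]

print(w_sq, w_delta)   # near -2 (cancels the mean) and near -1 (the mode)
```

For a symmetric density the two solutions coincide; the experiment below exploits a skewed density precisely so that the subject's choice of w reveals the loss function.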

Results (Kording & Wolpert, PNAS, 2004): large errors are penalized by less than a squared term in the estimated loss function. However, note that the largest errors tended to occur very infrequently in this experiment. [Figure: estimated loss versus error (cm) for typical subjects, compared with the delta loss.]

Mean and variance of mixtures of normal distributions
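For a mixture p(x) = Σ_i π_i N(x; μ_i, σ_i²), the standard moment identities are E[x] = Σ_i π_i μ_i and Var[x] = Σ_i π_i (σ_i² + μ_i²) − E[x]². A short check of these formulas against sampling (the two-component mixture below is an illustrative example):

```python
import numpy as np

rng = np.random.default_rng(5)

# Mixture of two normals
pi  = np.array([0.3, 0.7])        # mixing proportions
mu  = np.array([-1.0, 2.0])       # component means
var = np.array([0.5, 1.0])        # component variances

# Closed-form moments of the mixture:
#   E[x]   = sum_i pi_i mu_i
#   Var[x] = sum_i pi_i (var_i + mu_i^2) - E[x]^2
mix_mean = np.sum(pi * mu)
mix_var = np.sum(pi * (var + mu ** 2)) - mix_mean ** 2

# Check against a large sample: pick a component, then draw from it
z = rng.choice(2, size=500_000, p=pi)
x = rng.normal(mu[z], np.sqrt(var[z]))
print(mix_mean, mix_var)          # -> 1.1 and 2.74
print(x.mean(), x.var())          # sample moments agree
```

Note that the mixture variance exceeds the weighted average of the component variances: the spread between the component means contributes the extra Σ_i π_i μ_i² − (Σ_i π_i μ_i)² term.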