10701 / 15781 Machine Learning. Today: cross validation, maximum likelihood estimation. Next time: mixture models, EM.

Linear regression A simple linear regression function is given below. We can estimate the parameters w by minimizing the empirical (or training) error. The hope is that the resulting linear function has a low "generalization error", i.e., a low error on new examples.
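The equations did not survive into this transcript; a standard form for the simple linear model and its training error, consistent with the parameterization w = [w0, w1]^T used later in the lecture, is:

```latex
f(x; w) = w_0 + w_1 x,
\qquad
J_n(w) = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i; w) \bigr)^2
```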

Beyond linear regression Linear regression extends directly to input vectors x. We can also generalize these classes of functions to be non-linear functions of the inputs x while staying linear in the parameters w.
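The formulas are again missing from the transcript; the usual construction keeps the model linear in the parameters w while applying fixed (possibly non-linear) basis functions to the input, e.g. polynomial terms:

```latex
f(\mathbf{x}; w) = w_0 + \sum_{j=1}^{d} w_j x_j
\qquad\text{and, more generally,}\qquad
f(x; w) = \sum_{j=0}^{m} w_j \, \phi_j(x), \quad \phi_j(x) = x^{j}
```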

Polynomial regression examples

Overfitting With too few training examples our polynomial regression model may achieve zero training error but nevertheless have a large generalization error. When the training error no longer bears any relation to the generalization error we say that the function overfits the (training) data.

Cross validation Cross-validation allows us to estimate the generalization error based on the training examples alone. For example, the leave-one-out cross-validation error averages the squared error of predicting each output y_t using the least squares estimates of the parameters w computed without the t-th training example.
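The leave-one-out formula itself is not reproduced in the transcript: it averages, over t = 1,...,n, the squared error of predicting y_t with parameters fit to the other n-1 examples. A minimal illustrative sketch in Python (the toy data and function names are mine, not from the lecture):

```python
# Illustrative sketch of leave-one-out cross validation for least squares
# polynomial regression.
import numpy as np

def loo_cv_error(x, y, degree):
    """Average squared error, each point predicted by a model fit to all other points."""
    n = len(x)
    errors = []
    for t in range(n):
        mask = np.arange(n) != t                          # drop the t-th example
        X_train = np.vander(x[mask], degree + 1, increasing=True)  # features [1, x, x^2, ...]
        w, *_ = np.linalg.lstsq(X_train, y[mask], rcond=None)      # least squares fit
        x_t = np.vander(x[t:t + 1], degree + 1, increasing=True)
        errors.append((y[t] - x_t @ w).item() ** 2)
    return np.mean(errors)

# toy data: noisy samples from a sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 12))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(12)
for d in (1, 3, 9):
    print(d, loo_cv_error(x, y, d))
```

Comparing the printed values across polynomial degrees mirrors the overfitting discussion above: a high-degree fit can have near-zero training error yet a much larger leave-one-out error.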

Cross validation: Example

Statistical view of linear regression In a statistical regression model we model both the function and the noise, where the noise term ε ~ N(0, σ²). Whatever we cannot capture with our chosen family of functions will be interpreted as noise.
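The model equation itself is missing here; the standard additive-noise form it refers to is:

```latex
y = f(x; w) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```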

Statistical view of linear regression Our function f(x; w) is trying to capture the mean of the observations y given the input, i.e., f(x; w) models E{ y | x, model }, the conditional expectation (mean) of y given x, evaluated according to the model.

Conditional probability According to our statistical model the outputs y given x are normally distributed with mean f(x; w) and variance σ². As a result we can also measure the uncertainty in the predictions, not just the mean. Loss function? Estimation?
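Written out, the Gaussian conditional density this slide points to is:

```latex
p(y \mid x, w, \sigma^2)
= \mathcal{N}\bigl( y;\, f(x; w),\, \sigma^2 \bigr)
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{\bigl( y - f(x; w) \bigr)^2}{2\sigma^2} \right)
```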

Maximum likelihood estimation Given observations Dn = {(x1,y1),...,(xn,yn)}, we find the parameters w that maximize the likelihood of the observed outputs given the inputs.
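The likelihood expression is not reproduced in the transcript; assuming the outputs are conditionally independent given the inputs, it is the product of the per-example conditional densities:

```latex
L(w; D_n) = \prod_{i=1}^{n} p\bigl( y_i \mid x_i, w, \sigma^2 \bigr)
```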

Log likelihood estimation Likelihood of data given model: It is often easier (but equivalent) to try to maximize the log-likelihood:
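For the Gaussian model above, the log-likelihood (the equation omitted from the transcript) takes the standard form:

```latex
\ell(D; w, \sigma^2) = \log L(w; D_n)
= -\frac{n}{2} \log\bigl( 2\pi\sigma^2 \bigr)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - f(x_i; w) \bigr)^2
```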

Maximum likelihood vs. loss Log likelihood: Our model of the noise in the outputs and the resulting (effective) loss function in maximum likelihood estimation are intricately related.
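Concretely: σ² only scales the data-dependent term of the log-likelihood, so maximizing it over w is equivalent to minimizing the squared-error loss:

```latex
\arg\max_{w}\; \ell(D; w, \sigma^2)
\;=\;
\arg\min_{w}\; \sum_{i=1}^{n} \bigl( y_i - f(x_i; w) \bigr)^2
```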

Maximum likelihood estimation The likelihood of the observations is a generic fitting criterion. We can also fit the noise variance σ² by maximizing the log-likelihood l(D; w, σ²) with respect to σ²: if the parameters of f(x; w) are set to their maximum likelihood values, then the optimal choice for σ² is the mean squared prediction error.
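Under these assumptions the optimal noise variance, obtained by setting the derivative of the log-likelihood with respect to σ² to zero, is the mean squared prediction error of the ML fit:

```latex
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \bigl( y_i - f(x_i; \hat{w}) \bigr)^2
```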

Solving for w When the noise is assumed to be zero-mean Gaussian, the maximum likelihood setting of the linear regression parameters w = [w0, w1]^T reduces to least squares fitting.
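A minimal numerical sketch of this reduction (the synthetic data and variable names are illustrative only): the ML estimates of w and σ² are simply the least squares solution and the mean squared residual.

```python
# Sketch: with zero-mean Gaussian noise, the ML estimate of w = [w0, w1]^T
# is the ordinary least squares solution.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 50)
y = 0.5 + 2.0 * x + 0.3 * rng.standard_normal(50)    # y = w0* + w1*x + noise

X = np.column_stack([np.ones_like(x), x])            # design matrix [1, x]
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)        # minimizes the sum of squared errors
sigma2_hat = np.mean((y - X @ w_hat) ** 2)           # ML noise variance = mean squared error
print(w_hat, sigma2_hat)
```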

Properties of w We can study the estimator w further if we assume that the linear model is indeed correct, i.e., that the outputs were generated according to the linear model with some unknown true parameters w* (plus noise).

Bias and variance of estimator Major assumption: the outputs were generated as y = Xw* + e with zero-mean noise e. We keep the training inputs, or X, fixed and study how the estimate varies if we resample the corresponding outputs. Bias: whether the estimate deviates from w* on average. Variance (covariance): how much the estimate varies around its mean.
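Spelled out (the slide's own formulas are not in the transcript, and independent noise across examples is assumed, as in the earlier slides), the assumption and the two quantities are:

```latex
y = X w^{*} + e, \qquad e \sim \mathcal{N}(0, \sigma^2 I),
```

```latex
\text{bias:}\;\; E\{\hat{w} \mid X\} - w^{*},
\qquad
\mathrm{Cov}(\hat{w} \mid X)
= E\bigl\{ (\hat{w} - E\{\hat{w} \mid X\}) (\hat{w} - E\{\hat{w} \mid X\})^{T} \,\big|\, X \bigr\}
```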

Computing the bias We use the fact that y = Xw* + e to represent the estimate in terms of w*. The parameter estimate based on the sampled data is therefore the correct parameter plus an estimate based purely on noise.
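In the standard least squares notation, the chain of equalities this slide builds up is:

```latex
\hat{w} = (X^{T} X)^{-1} X^{T} y
        = (X^{T} X)^{-1} X^{T} ( X w^{*} + e )
        = w^{*} + (X^{T} X)^{-1} X^{T} e
```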

Bias (cont) Since the noise is zero mean by assumption, our parameter estimate is unbiased: its conditional expectation equals w*, where the conditional expectation is over the noisy outputs while keeping the inputs x1,...,xn, or X, fixed.
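Taking the conditional expectation of the expression above and using E{e} = 0 gives:

```latex
E\{ \hat{w} \mid X \}
= w^{*} + (X^{T} X)^{-1} X^{T} E\{ e \}
= w^{*}
```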

Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
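Using the fact (from the bias computation) that the estimate minus its mean equals (X^T X)^{-1} X^T e, together with Cov(e) = σ² I, the calculation this slide steps through gives:

```latex
\mathrm{Cov}( \hat{w} \mid X )
= (X^{T} X)^{-1} X^{T} \, \mathrm{Cov}(e) \, X (X^{T} X)^{-1}
= \sigma^2 (X^{T} X)^{-1}
```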

ML summary When the assumptions in the linear model are correct, the ML estimator follows a simple Gaussian distribution. This Gaussian distribution summarizes the uncertainty that we have about the parameters based on a training set Dn = {(x1,y1),...,(xn,yn)}.
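Combining the bias and covariance results above, the distribution referred to is:

```latex
\hat{w} \mid X \;\sim\; \mathcal{N}\bigl( w^{*}, \; \sigma^2 (X^{T} X)^{-1} \bigr)
```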

Acknowledgment These slides are based in part on slides from previous machine learning classes taught by Andrew Moore at CMU and Tommi Jaakkola at MIT. I thank Andrew and Tommi for letting me use their slides.