Download presentation
Presentation is loading. Please wait.
Published byDylan Sullivan Modified over 6 years ago
1
10701 / 15781 Machine Learning Today: - Cross validation,
- maximum likelihood estimation Next time: Mixtures models, EM
2
Linear regression A simple linear regression function is given by:
We can estimate the parameters w by minimizing the empirical (or training) error The hope here is that the resulting parameters/linear function has a low “generalization error”, i.e., error on the new examples
3
Beyond linear regression
Linear regression for input vectors: Can also generalize these classes of functions to be non-linear functions of the inputs x but still linear in the parameters w.
4
Polynomial regression examples
5
Over fitting With too few training examples our polynomial regression model may achieve zero training error but nevertheless has a large generalization error When the training error no longer bears any relation to the generalization error we say that the function overfits the (training) data
6
Cross validation Cross-validation allows us to estimate the generalization error based on training examples alone. For example, the leave-one-out cross-validation error is given by where are the least squares estimates of the parameters w computed without the tth training example.
7
Cross validation: Example
8
Statistical view of linear regression
In a statistical regression model we model both the function and noise where ~N(0,2) Whatever we cannot capture with our chosen family of functions will be interpreted as noise
9
Statistical view of linear regression
Our function f(x; w) here is trying to capture the mean of the observations y given the input where E{ y | x, model} is the conditional expectation (mean) of y given x, evaluated according to the model.
10
Conditional probability
According to our statistical model the outputs y given x are normally distributed with mean f(x; w) and variance 2 thus,
11
Conditional probability
According to our statistical model the outputs y given x are normally distributed with mean f(x; w) and variance 2 thus, As a result we can also measure the uncertainty in the predictions, not just the mean Loss function? Estimation?
12
Maximum likelihood estimation
Given observations Dn = {(x1,y1),...,(xn,yn)}, we find the parameters w that maximize the likelihood of the outputs where:
13
Log likelihood estimation
Likelihood of data given model: It is often easier (but equivalent) to try to maximize the log-likelihood:
14
Log likelihood estimation
Likelihood of data given model: It is often easier (but equivalent) to try to maximize the log-likelihood:
15
Maximum likelihood vs. loss
Log likelihood: Our model of the noise in the outputs and the resulting (effective) loss-function in maximum likelihood estimation are intricately related
16
Maximum likelihood estimation
The likelihood of observations is a generic fitting criterion. We can also fit the noise variance 2 by maximizing the log-likelihood l(D; w,2) with respect to 2 if are the maximum likelihood parameters for f(x;w), then the optimal choice for 2 is i.e., mean squared prediction error.
17
Solving for w When the noise is assumed to be zero mean Gaussian, maximum likelihood setting of the linear regression parameters w =[w0,w1]T reduces to least squares fitting: where
18
Properties of w We can study the estimator w further if we assume that the linear model is indeed correct, i.e., the outputs were generated according to with some unknown parameters
19
Bias and variance of estimator
Major assumption: where: We keep the training inputs or X fixed and study how varies if we resample the corresponding outputs Bias: whether deviates from w* on average Variance (covariance): how much varies around its mean
20
Computing the bias We use the fact that y=Xw*+e to represent in terms of w*:
21
Computing the bias We use the fact that y=Xw*+e to represent in terms of w*:
22
Computing the bias We use the fact that y=Xw*+e to represent in terms of w*:
23
Computing the bias We use the fact that y=Xw*+e to represent in terms of w*: The parameter estimate based on the sampled data is therefore the correct parameter plus estimate based purely on noise.
24
Bias (cont) Since the noise is zero mean by assumption, our parameter estimate is:
25
Bias (cont) Since the noise is zero mean by assumption, our parameter estimate is unbiased:
26
Bias (cont) Since the noise is zero mean by assumption, our parameter estimate is unbiased:
27
Bias (cont) Since the noise is zero mean by assumption, our parameter estimate is unbiased: where the conditional expectation is over the noisy outputs while keeping the inputs x1,...,xn, or X, fixed.
28
Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
29
Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
30
Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
31
Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
32
Estimator variance We can also evaluate the (conditional) covariance of the parameters, i.e., how the individual parameters co-vary due to the noise in the outputs:
33
ML summary When the assumptions in the linear model are correct, the ML estimator follows a simple Gaussian distribution given by The above Gaussian distribution summarizes the uncertainty that we have about the parameters based on a training set Dn = {(x1,y1),...,(xn,yn)},
34
Acknowledgment These slides are based in part on slides from previous machine learning classes taught by Andrew Moore at CMU and Tommi Jaakkola at MIT. I thank Andrew and Tommi for letting me use their slides.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.