Machine learning overview
Computational methods that improve performance on a task by using training data. The figure shows a neural network, but other ML methods can be used in its place.
Example: Linear regression
Task: predict y from x, using y_hat = w^T x or y_hat = w^T x + b. Other forms are possible, such as a polynomial y_hat = w_0 + w_1 x + w_2 x^2 + ...; this is still linear in the parameters w. Loss: mean squared error (MSE) between predictions and targets.
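As a concrete illustration (my own sketch, not from the slides), the following NumPy code fits w and b by minimizing the MSE with gradient descent; the data, noise level, and learning rate are made up for the example.

import numpy as np

# Hypothetical 1-D data: y is roughly 3*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=100)

w, b = 0.0, 0.0          # parameters of y_hat = w*x + b
lr = 0.1                 # learning rate (arbitrary choice)
for step in range(500):
    y_hat = w * x + b
    err = y_hat - y
    mse = np.mean(err ** 2)            # the MSE loss from the slide
    grad_w = 2 * np.mean(err * x)      # d(MSE)/dw
    grad_b = 2 * np.mean(err)          # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"w = {w:.3f}, b = {b:.3f}, final MSE = {mse:.5f}")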
Capacity and data fitting
Capacity: a measure of the ability to fit complex data. Increased capacity means we can make the training error small. Overfitting: like memorizing the training inputs; the capacity is large enough to reproduce the training data, but the model does poorly on test data. Too much capacity for the available data. Underfitting: like ignoring details; not enough capacity for the available detail.
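A small sketch of the capacity idea (my own example; the sine target and noise level are made up): fit polynomials of increasing degree to noisy data and compare training and test MSE. Low degree underfits, very high degree overfits.

import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=n)  # assumed target
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

for degree in [1, 3, 9, 15]:          # increasing capacity
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")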
Capacity and data fitting
Capacity and generalization error
Regularization
Sometimes minimizing the performance or loss function directly promotes overfitting, e.g., the Runge phenomenon when interpolating with a polynomial at evenly spaced points. In the figure: red = target function, blue = degree-5 fit, green = degree-9 fit. The output is linear in the coefficients.
Regularization
We can get a better fit by using a penalty on the coefficients, e.g., minimizing MSE + lambda * sum_i w_i^2 (an L2 penalty).
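To make this concrete, here is a sketch (my own; it assumes the classic Runge target 1/(1 + 25x^2), which the slide does not name, and an arbitrary penalty weight lam) that fits a degree-9 polynomial at 10 evenly spaced points by plain least squares and by L2-penalized (ridge) least squares, then compares the error of each on a fine grid.

import numpy as np

def runge(x):
    return 1.0 / (1.0 + 25.0 * x ** 2)   # commonly used target for this demo

x = np.linspace(-1, 1, 10)                # evenly spaced sample points
X = np.vander(x, 10)                      # degree-9 polynomial features
y = runge(x)

# Plain least squares (interpolation): coefficients solve X c = y.
c_plain = np.linalg.solve(X, y)

# Ridge: minimize ||X c - y||^2 + lam * ||c||^2  (lam is an arbitrary choice).
lam = 1e-3
c_ridge = np.linalg.solve(X.T @ X + lam * np.eye(10), X.T @ y)

grid = np.linspace(-1, 1, 1000)
G = np.vander(grid, 10)
for name, c in [("plain", c_plain), ("ridge", c_ridge)]:
    max_err = np.max(np.abs(G @ c - runge(grid)))
    print(f"{name}: max error on [-1, 1] = {max_err:.3f}")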
Example: Classification
Task: predict one of several classes for a given input. E.g., decide if a movie review is positive or negative, or identify one of several possible topics for a news piece. Output: a probability distribution over the possible outcomes. Loss: cross-entropy (a way to compare distributions).
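A minimal sketch (my own, not from the slides) of this setup: turn model scores into a probability distribution with a softmax, then compute the cross-entropy against a one-hot target. The scores are made up.

import numpy as np

def softmax(z):
    z = z - np.max(z)                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(p, q):
    # H(p, q) = - sum_x p(x) log q(x); natural log as on the slides
    return -np.sum(p * np.log(q))

scores = np.array([2.0, -1.0, 0.5])      # made-up model outputs for 3 classes
q = softmax(scores)                      # predicted distribution
p = np.array([1.0, 0.0, 0.0])            # one-hot target: true class is 0
print("predicted distribution:", np.round(q, 3))
print("cross-entropy loss:", cross_entropy(p, q))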
Information
For a probability distribution p(X) of a random variable X, define the information of outcome x to be I(x) = - log p(x) (log = natural log). This is 0 if p(x) is 1 (no information from a certain outcome) and is large if p(x) is near 0 (lots of information if the event is not likely). Additivity: if X and Y are independent, then information is additive: I(x, y) = - log p(x, y) = - log p(x)p(y) = - log p(x) - log p(y) = I(x) + I(y).
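A tiny check of these definitions (my own example probabilities): compute I(x) = - log p(x) for a couple of outcomes and verify additivity for two independent events.

import math

def information(p):
    # I(x) = - log p(x), natural log
    return -math.log(p)

print(information(1.0))    # certain outcome: 0 information
print(information(0.01))   # unlikely outcome: large information (~4.6 nats)

# Additivity for independent events with probabilities 0.5 and 0.2:
p_x, p_y = 0.5, 0.2
print(information(p_x * p_y), information(p_x) + information(p_y))  # equal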
Entropy
Entropy is the expected information of a random variable: H(X) = E[I(X)] = - sum_x p(x) log p(x). Note that 0 log 0 = 0. Entropy is a measure of the unpredictability of a random variable. For a given set of states, equal probabilities give the maximum entropy.
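A short sketch (my own example distributions) of the entropy formula, including the 0 log 0 = 0 convention; among the three-state distributions below, the uniform one has the largest entropy.

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0                       # implements the 0 log 0 = 0 convention
    return -np.sum(p[nz] * np.log(p[nz]))

for p in [[1.0, 0.0, 0.0], [0.7, 0.2, 0.1], [1/3, 1/3, 1/3]]:
    print(p, "entropy =", round(entropy(p), 4))
# The uniform distribution gives the maximum, log(3) ~= 1.0986.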
Cross-entropy
Compare one distribution to another. Suppose we have distributions p and q on the same set W. Then H(p, q) = E_p[- log q(X)]. In the discrete case, H(p, q) = - sum_x p(x) log q(x).
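In the discrete case this is a one-line computation; a small sketch (my own made-up numbers) follows.

import numpy as np

def cross_entropy(p, q):
    # H(p, q) = - sum_x p(x) log q(x), natural log
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.5, 0.3, 0.2])             # made-up "true" distribution
q = np.array([0.4, 0.4, 0.2])             # made-up model distribution
print("H(p, q) =", cross_entropy(p, q))
print("H(p, p) =", cross_entropy(p, p))   # equals the entropy of p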
Cross entropy as loss function
Question: given p(x), what q(x) minimizes the cross entropy (in the discrete case)? Constrained optimization: minimize - sum_i p_i log q_i over q, subject to sum_i q_i = 1 and q_i >= 0.
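Before doing the math, the answer can be checked numerically. The sketch below (my own, using scipy.optimize.minimize with an equality constraint and bounds; the target p is made up) minimizes the discrete cross-entropy over q for a fixed p, and the optimizer returns q very close to p.

import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])              # fixed target distribution

def cross_entropy(q):
    return -np.sum(p * np.log(q))

result = minimize(
    cross_entropy,
    x0=np.full(3, 1/3),                    # start from the uniform distribution
    bounds=[(1e-9, 1.0)] * 3,              # q_i >= 0 (kept away from 0 for the log)
    constraints=[{"type": "eq", "fun": lambda q: np.sum(q) - 1.0}],
)
print("optimal q:", np.round(result.x, 4))  # close to p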
Constrained optimization
More general constrained optimization: minimize f(x) subject to g_i(x) = 0 and h_j(x) <= 0, where f is the objective function (loss), the g_i are the equality constraints, and the h_j are the inequality constraints. If there are no constraints, we look for a point where the gradient of f vanishes; but here we need to include the constraints.
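As an illustration of this general form (my own example problem, not from the slides), the sketch below minimizes a simple quadratic objective with one equality and one inequality constraint using scipy.optimize.minimize (SLSQP). Note that scipy's "ineq" convention is fun(x) >= 0, the opposite sign of the h_j(x) <= 0 convention above.

import numpy as np
from scipy.optimize import minimize

# Minimize f(x, y) = (x - 1)^2 + (y - 2)^2
def f(v):
    x, y = v
    return (x - 1.0) ** 2 + (y - 2.0) ** 2

constraints = [
    {"type": "eq",   "fun": lambda v: v[0] + v[1] - 1.0},  # g(x, y) = x + y - 1 = 0
    {"type": "ineq", "fun": lambda v: v[0]},               # x >= 0 (scipy's ineq means fun >= 0)
]

result = minimize(f, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print("solution:", np.round(result.x, 4), "objective:", round(result.fun, 4))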
Intuition
Given the constraint g = 0, walk along the curve g = 0 and look for points where f stops changing, since these might be minima. Two possibilities: we are following a contour line of f (f does not change along contour lines), so the contour lines of f and g are parallel there; or we are at a local minimum of f itself (the gradient of f is 0).
Intuition
If the contours of f and g are parallel, then the gradients of f and g are parallel. Thus we want points (x, y) where g(x, y) = 0 and grad f(x, y) = lambda * grad g(x, y) for some lambda. This is the idea of Lagrange multipliers.
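A sketch of this idea in code (my own example problem): use sympy to solve grad f = lambda * grad g together with g = 0, here minimizing f(x, y) = x^2 + y^2 on the line x + y = 1.

import sympy as sp

x, y, lam = sp.symbols("x y lam", real=True)
f = x**2 + y**2              # objective
g = x + y - 1                # constraint g(x, y) = 0

# Lagrange conditions: grad f = lam * grad g, together with g = 0.
equations = [
    sp.diff(f, x) - lam * sp.diff(g, x),
    sp.diff(f, y) - lam * sp.diff(g, y),
    g,
]
print(sp.solve(equations, [x, y, lam]))   # -> x = y = 1/2, lam = 1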
KKT conditions
Start with: minimize f(x) subject to g_i(x) = 0 and h_j(x) <= 0. Make the Lagrangian function L(x, lambda, mu) = f(x) + sum_i lambda_i g_i(x) + sum_j mu_j h_j(x).
Take the gradient and set it to 0, but other conditions apply as well.
KKT conditions
Make the Lagrangian function L(x, lambda, mu) = f(x) + sum_i lambda_i g_i(x) + sum_j mu_j h_j(x).
Necessary conditions to have a minimum are: stationarity (grad_x L = 0), primal feasibility (g_i(x) = 0 and h_j(x) <= 0), dual feasibility (mu_j >= 0), and complementary slackness (mu_j h_j(x) = 0).
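A concrete check of these conditions (my own 1-D example): minimize f(x) = (x - 2)^2 subject to h(x) = x - 1 <= 0. The minimizer is x = 1 with multiplier mu = 2, and the sketch below evaluates each condition at that point.

# Minimize f(x) = (x - 2)^2 subject to h(x) = x - 1 <= 0.
# Lagrangian: L(x, mu) = (x - 2)^2 + mu * (x - 1).
x_star, mu_star = 1.0, 2.0     # candidate solution and multiplier

stationarity = 2 * (x_star - 2) + mu_star          # dL/dx, should be 0
primal = x_star - 1                                # h(x), should be <= 0
dual = mu_star                                     # should be >= 0
slackness = mu_star * (x_star - 1)                 # should be 0

print("stationarity:", stationarity)
print("primal feasibility (h <= 0):", primal)
print("dual feasibility (mu >= 0):", dual)
print("complementary slackness:", slackness)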
Cross entropy
Exercise for the reader: use the KKT conditions to show that if the p_i are fixed, positive, and sum to 1, then the q_i that solve  minimize - sum_i p_i log q_i  subject to  sum_i q_i = 1, q_i >= 0  are q_i = p_i. That is, the cross entropy H(p, q) is minimized when q = p, and the minimum value is the entropy H(p).
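As a sanity check of the exercise (my own sketch, treating the q_i >= 0 constraints as inactive and keeping only the equality constraint), sympy recovers q_i = p_i for an example 3-outcome distribution.

import sympy as sp

p1, p2, p3 = sp.Rational(1, 2), sp.Rational(1, 3), sp.Rational(1, 6)  # example p
q1, q2, q3, lam = sp.symbols("q1 q2 q3 lam", positive=True)

# Lagrangian for: minimize - sum_i p_i log q_i subject to sum_i q_i = 1
L = -(p1 * sp.log(q1) + p2 * sp.log(q2) + p3 * sp.log(q3)) + lam * (q1 + q2 + q3 - 1)

equations = [sp.diff(L, v) for v in (q1, q2, q3)] + [q1 + q2 + q3 - 1]
print(sp.solve(equations, [q1, q2, q3, lam]))   # -> q_i = p_i, lam = 1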
Regularization and constraints
Regularization is something like a weak constraint. E.g., for an L2 penalty, instead of requiring the weights to be small with a constraint like sum_i w_i^2 < c, we just prefer them to be small by adding lambda * sum_i w_i^2 to the objective function.