Machine learning overview
A computational method that improves its performance on a task by using training data. The figure shows a neural network, but other ML methods can be used in its place.
Example: Linear regression
Task: predict y from x, using \hat{y} = w^T x + b. Other forms are possible, such as \hat{y} = w_0 + w_1 x + w_2 x^2 + \dots; this is still linear in the parameters w.
Loss: mean squared error (MSE) between predictions and targets, \mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2.
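A minimal NumPy sketch of this setup (the toy data and the choice of polynomial features are made up for illustration): the model is nonlinear in x but linear in the parameters w, and the least-squares fit minimizes the MSE.

    import numpy as np

    # Toy data: y depends on x nonlinearly, plus noise.
    rng = np.random.default_rng(0)
    x = np.linspace(-1, 1, 50)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(x.shape)

    # Features [1, x, x^2]: nonlinear in x, but linear in the parameters w.
    X = np.stack([np.ones_like(x), x, x**2], axis=1)

    # The least-squares solution minimizes the MSE over w.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)

    y_hat = X @ w
    mse = np.mean((y_hat - y) ** 2)
    print("w =", w, "MSE =", mse)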
Capacity and data fitting
Capacity: a measure of the ability to fit complex data. Increased capacity means we can make the training error small.
Overfitting: like memorizing the training inputs. The capacity is large enough to reproduce the training data, but the model does poorly on test data. Too much capacity for the available data.
Underfitting: like ignoring details. Not enough capacity for the detail in the data.
Capacity and data fitting (figure)
Capacity and generalization error (figure)
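To make the capacity picture concrete, here is a small sketch (target function, noise level, and degrees are all made up): capacity is the polynomial degree, and we compare training and test MSE as capacity grows.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: a smooth target plus noise.
    target = lambda x: np.sin(3 * x)
    x_train = np.linspace(-1, 1, 10)
    y_train = target(x_train) + 0.2 * rng.standard_normal(x_train.shape)
    x_test = rng.uniform(-1, 1, 500)
    y_test = target(x_test) + 0.2 * rng.standard_normal(x_test.shape)

    # Capacity = polynomial degree. Degree 1 underfits; degree 9 can
    # reproduce the 10 training points exactly but typically does worse on test data.
    for degree in [1, 3, 9]:
        w = np.polyfit(x_train, y_train, degree)
        train_mse = np.mean((np.polyval(w, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(w, x_test) - y_test) ** 2)
        print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")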
Regularization
Sometimes minimizing the performance measure (loss function) directly promotes overfitting. E.g., the Runge phenomenon: interpolating by a polynomial using evenly spaced points. (Figure: red = target function, blue = degree-5 fit, green = degree-9 fit.) The output is linear in the coefficients.
Regularization
We can get a better fit by adding a penalty on the coefficients, e.g. minimizing \mathrm{MSE}(w) + \lambda \|w\|^2 instead of \mathrm{MSE}(w) alone.
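A sketch of this idea on Runge's function (the degree and the values of λ are chosen for illustration; the penalty typically tames the oscillation near the interval ends, though the exact numbers depend on λ):

    import numpy as np

    # Runge's function, sampled at evenly spaced points.
    f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)
    x = np.linspace(-1, 1, 10)
    y = f(x)

    degree = 9
    X = np.vander(x, degree + 1)              # polynomial features, highest power first
    x_dense = np.linspace(-1, 1, 500)
    X_dense = np.vander(x_dense, degree + 1)

    for lam in [0.0, 1e-4, 1e-2]:
        # Ridge as an augmented least-squares problem:
        # minimize ||X w - y||^2 + lam * ||w||^2
        A = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
        b = np.concatenate([y, np.zeros(degree + 1)])
        w, *_ = np.linalg.lstsq(A, b, rcond=None)
        err = np.max(np.abs(X_dense @ w - f(x_dense)))
        print(f"lambda = {lam:g}: max error on [-1, 1] = {err:.3f}")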
Example: Classification
Task: predict one of several classes for a given input. E.g., decide if a movie review is positive or negative, or identify one of several possible topics for a news piece.
Output: a probability distribution over the possible outcomes.
Loss: cross-entropy (a way to compare distributions).
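One common way to turn a model's raw scores into a probability distribution over classes is a softmax. A minimal sketch (the scores are made up, and softmax is just one possible choice, not necessarily the one used in these slides):

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; the output sums to 1.
        z = z - np.max(z)
        e = np.exp(z)
        return e / e.sum()

    scores = np.array([2.0, 0.5, -1.0])   # hypothetical scores for 3 classes
    probs = softmax(scores)
    print(probs, probs.sum())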
Information
For a probability distribution p(X) of a random variable X, define the information of outcome x to be (log = natural log)
I(x) = -\log p(x).
This is 0 if p(x) = 1 (no information for a certain outcome) and is large if p(x) is near 0 (lots of information if the event is unlikely).
Additivity: if X and Y are independent, then information is additive:
I(x, y) = -\log p(x, y) = -\log p(x) p(y) = -\log p(x) - \log p(y) = I(x) + I(y).
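A quick numerical check of the additivity property (the probabilities are made up; the two printed values agree):

    import numpy as np

    def information(p):
        # I(x) = -log p(x), natural log
        return -np.log(p)

    # Two independent events with assumed probabilities.
    p_x, p_y = 0.25, 0.1
    print(information(p_x * p_y))                  # I(x, y)
    print(information(p_x) + information(p_y))     # I(x) + I(y), same value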
Entropy
Entropy is the expected information of a random variable:
H(X) = E[I(X)] = -\sum_x p(x) \log p(x).
Note that 0 \log 0 = 0 by convention.
Entropy is a measure of the unpredictability of a random variable. For a given set of states, equal probabilities give the maximum entropy.
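A small sketch computing the entropy of a few distributions over the same four states (the distributions are made up); the uniform one gives the largest value:

    import numpy as np

    def entropy(p):
        p = np.asarray(p, dtype=float)
        # 0 * log 0 is treated as 0 by masking out zero-probability states.
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log(4) ≈ 1.386, the maximum
    print(entropy([0.7, 0.1, 0.1, 0.1]))      # less uniform: smaller
    print(entropy([1.0, 0.0, 0.0, 0.0]))      # certain outcome: 0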
Cross-entropy
Cross-entropy compares one distribution to another. Suppose we have distributions p, q on the same set \Omega. Then
H(p, q) = E_p[-\log q(X)].
In the discrete case,
H(p, q) = -\sum_{x \in \Omega} p(x) \log q(x).
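A sketch of the discrete formula (the distributions are made up): comparing p to itself gives the entropy H(p), while other choices of q give larger values.

    import numpy as np

    def cross_entropy(p, q):
        # H(p, q) = -sum_x p(x) log q(x); assumes q(x) > 0 wherever p(x) > 0
        p, q = np.asarray(p, float), np.asarray(q, float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(q[nz]))

    p = np.array([0.7, 0.2, 0.1])
    print(cross_entropy(p, p))                     # equals the entropy H(p)
    print(cross_entropy(p, [0.4, 0.4, 0.2]))       # larger than H(p)
    print(cross_entropy(p, [1/3, 1/3, 1/3]))       # also larger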
Cross-entropy as a loss function
Question: given p(x), what q(x) minimizes the cross-entropy (in the discrete case)?
Constrained optimization:
minimize -\sum_i p_i \log q_i subject to \sum_i q_i = 1 and q_i \ge 0.
Constrained optimization
More general constrained optimization:
minimize f(x) subject to g_i(x) = 0 and h_j(x) \le 0.
f is the objective function (loss), the g_i are the equality constraints, and the h_j are the inequality constraints.
If there are no constraints, we look for a point where the gradient of f vanishes. But we need to take the constraints into account.
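As a sanity check of this formulation, a small sketch using scipy.optimize.minimize (the objective and constraints are made up for illustration; note that scipy expresses inequality constraints as fun(x) >= 0, so h(x) <= 0 is passed as -h):

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical problem: minimize f(x) = (x0 - 1)^2 + (x1 - 2)^2
    # subject to g(x) = x0 + x1 - 1 = 0 and h(x) = -x0 <= 0 (i.e., x0 >= 0).
    f = lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2

    constraints = [
        {"type": "eq", "fun": lambda x: x[0] + x[1] - 1},   # g(x) = 0
        {"type": "ineq", "fun": lambda x: x[0]},            # -h(x) >= 0
    ]

    result = minimize(f, x0=np.zeros(2), constraints=constraints)
    print(result.x)   # constrained minimizer, approximately (0, 1)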
Intuition
Given the constraint g = 0, walk along the curve g = 0 and look for points where f stops changing, since these points might be minima. Two possibilities:
We are following a contour line of f (f does not change along contour lines), so the contour lines of f and g are parallel there.
We are at a local minimum of f itself (the gradient of f is 0).
https://en.wikipedia.org/wiki/Lagrange_multiplier
Intuition
If the contours of f and g are parallel, then the gradients of f and g are parallel. Thus we want points (x, y) where g(x, y) = 0 and
\nabla f(x, y) = \lambda \nabla g(x, y)
for some \lambda. This is the idea of Lagrange multipliers.
https://en.wikipedia.org/wiki/Lagrange_multiplier
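A small worked example (not from the slides) showing the Lagrange condition in action:

    % Minimize f(x,y) = x^2 + y^2 subject to g(x,y) = x + y - 1 = 0.
    \nabla f = (2x, \, 2y), \qquad \nabla g = (1, \, 1)
    % The condition \nabla f = \lambda \nabla g together with g = 0 gives:
    2x = \lambda, \quad 2y = \lambda, \quad x + y = 1
    % The first two equations give x = y, so the constraint forces
    x = y = \tfrac{1}{2}, \qquad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{2}.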
KKT conditions
Start with the problem: minimize f(x) subject to g_i(x) = 0 and h_j(x) \le 0.
Make the Lagrangian function
L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x).
Take the gradient and set it to 0 – but there are other conditions as well.
KKT conditions
With the Lagrangian L(x, \lambda, \mu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \mu_j h_j(x), the necessary conditions for a minimum are:
Stationarity: \nabla_x L(x, \lambda, \mu) = 0
Primal feasibility: g_i(x) = 0 and h_j(x) \le 0
Dual feasibility: \mu_j \ge 0
Complementary slackness: \mu_j h_j(x) = 0
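A small worked example (not from the slides) with a single inequality constraint, showing the conditions at work:

    % Minimize f(x) = x^2 subject to h(x) = 1 - x \le 0 (i.e., x \ge 1).
    L(x, \mu) = x^2 + \mu (1 - x)
    % Stationarity:            2x - \mu = 0
    % Dual feasibility:        \mu \ge 0
    % Complementary slackness: \mu (1 - x) = 0
    % If \mu = 0 then x = 0, which violates h(x) \le 0; so 1 - x = 0, giving
    x = 1, \qquad \mu = 2 \ge 0,
    % and the constrained minimum is f(1) = 1.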
Cross-entropy
Exercise for the reader: use the KKT conditions to show that if the p_i are fixed, positive, and sum to 1, then the q_i that solve
minimize -\sum_i p_i \log q_i subject to \sum_i q_i = 1, \; q_i \ge 0
are q_i = p_i. That is, the cross-entropy H(p, q) is minimized over q when q = p, and the minimum value is the entropy H(p).
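A quick numerical sanity check (made-up p, random search over the simplex; it does not replace the KKT derivation): the best q found is close to p, and the minimum value approaches H(p).

    import numpy as np

    rng = np.random.default_rng(2)

    def cross_entropy(p, q):
        return -np.sum(p * np.log(q))

    p = np.array([0.5, 0.3, 0.2])

    # Sample many candidate distributions q from the simplex and keep the best.
    best_q, best_val = None, np.inf
    for _ in range(20000):
        q = rng.dirichlet(np.ones(3))
        val = cross_entropy(p, q)
        if val < best_val:
            best_q, best_val = q, val

    print("best q found:", best_q)              # close to p
    print("best H(p,q): ", best_val)
    print("H(p,p):      ", cross_entropy(p, p)) # the entropy of p, the true minimum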
Regularization and constraints
Regularization is something like a weak constraint. E.g., for an L2 penalty, instead of requiring the weights to be small with a hard constraint like
\|w\|^2 < c,
we just prefer them to be small by adding \lambda \|w\|^2 to the objective function.