10701 / 15781 Machine Learning
The distribution of w: When the assumptions of the linear model are correct, the ML estimator follows a simple Gaussian distribution, given below. This results from our assumption about the distribution of y.
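A reconstruction of the lost formula, assuming the standard linear-Gaussian model y = Xw + \epsilon with \epsilon \sim \mathcal{N}(0, \sigma^2 I) (this assumption is stated here because the slide's equation did not survive extraction):

\hat{w}_{ML} = (X^\top X)^{-1} X^\top y, \qquad \hat{w}_{ML} \sim \mathcal{N}\big(w, \; \sigma^2 (X^\top X)^{-1}\big)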
Bias: Many estimators are unbiased. Among these, we are interested in the minimum variance unbiased estimator (MVUE). Maximum likelihood estimators can be biased (though they are asymptotically unbiased).
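For reference, a compact statement of the definitions used on this slide (added here, not reconstructed from a specific lost formula):

\mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta, \qquad \hat{\theta} \text{ is unbiased when } E[\hat{\theta}] = \theta,

and the MVUE is the unbiased estimator with the smallest variance.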
Topics: Experiment Selection, Modeling, Supervised Learning (Classification), Unsupervised Learning (Clustering), Data Analysis (Estimation and Regression)
Today: Logistic regression, Regularization. Next time: Clustering
Categorical regression: So far we have discussed regression with real-valued outcomes. In some cases, however, we are interested in categorical or binary outcomes. Examples: documents, web pages, protein function, etc., where the output is not real valued.
Logistic regression: Logistic regression represents the probability of category i using a function of a linear regression on the input variables, with one expression for i < k and another for category k (see the equations below), so that the probabilities sum to 1 over all categories.
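A reconstruction of the equations lost in extraction, assuming the usual parameterization with category k as the reference class:

P(y = i \mid x) = \frac{\exp(w_i^\top x)}{1 + \sum_{j=1}^{k-1} \exp(w_j^\top x)} \quad \text{for } i < k, \qquad P(y = k \mid x) = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(w_j^\top x)},

so the k probabilities sum to 1.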
Logistic regression: The name comes from the logit transformation, shown below.
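The transformation itself (the formula was lost in extraction; this is the standard definition):

\mathrm{logit}(p) = \log\frac{p}{1-p}, \qquad \log\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = w^\top x,

i.e., the log-odds is a linear function of the inputs.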
Binary logistic regression: We only need one set of parameters. This results in a “squashing function” that turns linear predictions into probabilities.
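A minimal sketch of the squashing function in Python with NumPy; the function names are illustrative, not taken from the slides:

import numpy as np

def sigmoid(z):
    # Logistic "squashing" function: maps any real-valued prediction into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # P(y = 1 | x) for binary logistic regression; X is assumed to carry a
    # leading column of ones so that w[0] plays the role of the bias term.
    return sigmoid(X @ w)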
Logistic regression vs. Linear regression
Example
Log likelihood: The conditional log likelihood of the training labels is given below. Note: this log likelihood is a concave function of the weights, so any local maximum is the global maximum.
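A reconstruction of the binary-case log likelihood, assuming labels y_i \in \{0, 1\} and p_i = P(y_i = 1 \mid x_i, w) = \sigma(w^\top x_i):

l(w) = \sum_{i=1}^{n} \big[\, y_i \log p_i + (1 - y_i) \log(1 - p_i) \,\big]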
Maximum likelihood estimation: There is no closed-form solution for the weights, so we rely on numerical methods. Common (but not only) approaches: line search, simulated annealing, gradient descent, Newton’s method. The gradient of the log likelihood is driven by the prediction error.
Gradient descent
Gradient ascent: Iteratively updating the weights in this fashion increases the likelihood each round, so we eventually reach the maximum. We are near the maximum when changes in the weights are small; thus, we can stop when the sum of the absolute values of the weight differences is less than some small threshold. A sketch of the procedure follows.
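A minimal sketch of this procedure for binary logistic regression in Python/NumPy; the step size eta, the tolerance tol, and the iteration cap are illustrative choices, not values from the lecture:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gradient_ascent(X, y, eta=0.5, tol=1e-6, max_iter=10000):
    # X: (n, d) inputs with a leading column of ones; y: (n,) labels in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = sigmoid(X @ w)                 # current predictions P(y = 1 | x, w)
        grad = X.T @ (y - p) / len(y)      # average of (prediction error) * input
        w_new = w + eta * grad             # ascent step: move uphill on the log likelihood
        if np.sum(np.abs(w_new - w)) < tol:  # stop when weight changes become small
            return w_new
        w = w_new
    return w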
Example: We get a monotonically increasing log likelihood of the training labels as a function of the number of iterations.
Convergence: The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction, i.e., when the gradient is zero (see below). This condition means (again) that the prediction error is uncorrelated with the components of the input vector.
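The convergence condition, reconstructed for the binary case:

\sum_{i=1}^{n} \big( y_i - P(y = 1 \mid x_i, w) \big)\, x_{ij} = 0 \quad \text{for every component } j,

i.e., the prediction errors are orthogonal to (uncorrelated with) each component of the input.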
Additive models: We can extend the logistic regression model to additive (logistic) models. We are free to choose the basis functions φ_i(x); for example, each φ_i can act as an adjustable “feature detector” (see the sketch below).
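A sketch of the additive form; the particular feature-detector choice shown is an assumption, one natural instance of an adjustable detector:

P(y = 1 \mid x) = \sigma\Big( w_0 + \sum_{i=1}^{m} w_i \, \phi_i(x) \Big), \qquad \text{e.g. } \phi_i(x) = \sigma(v_{i0} + v_i^\top x).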
Regularization: With more basis functions and parameters, the models become flexible enough to fit any relation (or noise) in the data. We need ways of constraining or “regularizing” such models other than limiting the number of parameters. We can instead limit the “effective” number of parameter choices, where that number depends on the resolution at which we distinguish parameter settings.
Regularization: basic idea. Consider the set of (0/1) coins parameterized by the probability p of getting “1”. How many coins are there? - Case 1: an uncountable number. - Case 2: at resolution ε = 0.1, only 5 coins (p1,...,p5), chosen so that the prediction of any other coin (indexed by p) is no more than ε away: for any p, |p − pj| ≤ 0.1 for at least one j. - Case 3: only 1 coin if ε = 0.5.
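A back-of-the-envelope check of those counts (added here, not from the original slide): covering [0, 1] with intervals of half-width \varepsilon needs about \lceil 1/(2\varepsilon) \rceil centers, so \varepsilon = 0.1 gives 5 centers (e.g., p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}) and \varepsilon = 0.5 gives a single center at p = 0.5.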
Regularization: logistic regression. Consider a simple logistic regression model parameterized by w = (w0, w1), where we assume that x ∈ [−1, 1], i.e., that the inputs remain bounded. We can now divide the parameter space into regions with centers w1, w2, ... such that the predictions of any w (for any x ∈ [−1, 1]) are close to those of one of the centers.
Regularization: logistic regression. We can view the different discrete parameter choices w1, w2, ..., wn, ... as “centroids” of regions in the parameter space such that, within each region, the predictions stay close to those of the centroid.
Regularizing logistic regression: We can regularize the model by imposing a penalty in the estimation criterion that encourages ||w|| to remain small. Maximum penalized log likelihood criterion (below), where larger values of C impose stronger regularization. More generally, we can assign penalties based on prior distributions over the parameters, i.e., add log P(w) to the log likelihood criterion.
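A reconstruction of the penalized criterion; the exact form of the penalty term, here (C/2)\lVert w \rVert^2, is an assumption consistent with the update on the next slide:

\max_w \;\; \sum_{i=1}^{n} \log P(y_i \mid x_i, w) \; - \; \frac{C}{2} \lVert w \rVert^2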
Regularization: gradient ascent. As with the non-regularized version, we maximize the penalized log likelihood by gradient ascent. Taking the derivative with respect to wj gives the new update rule, shown below.
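The derivative and the resulting update, reconstructed under the (C/2)\lVert w \rVert^2 penalty above (\eta denotes the step size):

\frac{\partial}{\partial w_j} \Big[\, l(w) - \frac{C}{2}\lVert w \rVert^2 \,\Big] = \sum_{i=1}^{n} \big( y_i - P(y = 1 \mid x_i, w) \big)\, x_{ij} - C w_j,

\qquad w_j \leftarrow w_j + \eta \Big[ \sum_{i=1}^{n} \big( y_i - P(y = 1 \mid x_i, w) \big)\, x_{ij} - C w_j \Big].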
Training vs. test errors: How do the training/test conditional log likelihoods behave as a function of the regularization parameter C?
Acknowledgment These slides are based in part on slides from previous machine learning classes taught by Andrew Moore at CMU and Tommi Jaakkola at MIT. I thank Andrew and Tommi for letting me use their slides.