1 10701 / 15781 Machine Learning

2 The distribution of w When the assumptions in the linear model are correct, the ML estimator follows a simple Gaussian distribution, given below. This results from our assumption about the distribution of y.
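The formula on the slide is an image and is not in the transcript; a standard reconstruction, assuming a design matrix X, true weights w, and Gaussian noise of variance \sigma^2 (this notation is an assumption, not from the slide):

\hat{w}_{ML} = (X^T X)^{-1} X^T y, \qquad \hat{w}_{ML} \sim \mathcal{N}\!\left(w,\; \sigma^2 (X^T X)^{-1}\right)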

3 Bias Many estimators have no bias. Of these, we are interested in the minimum variance unbiased estimator (MVUE). Maximum likelihood estimators can be biased (though they are asymptotically unbiased).

4 Topics
- Experiment Selection
- Modeling
- Supervised Learning (Classification)
- Unsupervised Learning (Clustering)
- Data Analysis (Estimation and Regression)

5 Today:
- Logistic regression
- Regularization
Next time: Clustering

6 Categorical regression
So far we have discussed regression with real valued outcomes. However, in some cases we are interested in categorical or binary outcomes, for example classifying documents, web pages, or protein function. These outputs are not real valued.

7 Logistic regression Logistic regression represents the probability of category i using a function of a linear regression on the input variables, with one expression for i < k and another for category k.

8 Logistic regression Logistic regression represents the probability of category i using a function of a linear regression on the input variables, with one expression for i < k and another for category k. These probabilities sum to 1 over all categories.
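The expressions themselves are images on the slides; a standard reconstruction, assuming k categories with category k as the reference and weight vectors w_1, \dots, w_{k-1} (the notation is an assumption):

P(y = i \mid x) = \frac{\exp(w_i^T x)}{1 + \sum_{j=1}^{k-1} \exp(w_j^T x)} \quad \text{for } i < k, \qquad P(y = k \mid x) = \frac{1}{1 + \sum_{j=1}^{k-1} \exp(w_j^T x)}

These quantities are nonnegative and sum to 1 over the k categories.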

9 Logistic regression The name comes from the logit transformation:
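The transformation on the slide is an image; in the binary case the logit is the log odds, which the model equates with the linear predictor (a reconstruction under the notation above):

\text{logit}\big(P(y = 1 \mid x)\big) = \log \frac{P(y = 1 \mid x)}{1 - P(y = 1 \mid x)} = w^T x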

10 Binary logistic regression
We only need one set of parameters. This results in a "squashing function" which turns linear predictions into probabilities.

11 Binary logistic regression
We only need one set of parameters. This results in a "squashing function" which turns linear predictions into probabilities.
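A minimal sketch of the squashing function and the resulting binary prediction in Python (function and variable names here are illustrative, not from the slides):

import numpy as np

def sigmoid(z):
    # Squashing function: maps any real-valued linear prediction into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, X):
    # P(y = 1 | x) for each row of X; the bias term is assumed to be folded
    # into w by appending a column of ones to X.
    return sigmoid(X @ w)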

12 Logistic regression vs. Linear regression

13 Example

14 Log likelihood
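The likelihood on the slide is an image; for binary labels y_i \in \{0, 1\} the conditional log likelihood is (a standard reconstruction, notation assumed):

l(w) = \sum_{i=1}^{n} \Big[\, y_i \log P(y_i = 1 \mid x_i, w) + (1 - y_i) \log\big(1 - P(y_i = 1 \mid x_i, w)\big) \Big]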

15 Log likelihood Note: this log likelihood is concave in w.

16 Maximum likelihood estimation
There is no closed form solution! Common (but not only) numerical approaches:
- Line Search
- Simulated Annealing
- Gradient Descent
- Newton's Method
The gradient of the log likelihood involves the prediction error.

17 Gradient descent

18 Gradient ascent
Iteratively updating the weights in this fashion increases the likelihood each round, so we eventually reach the maximum. We are near the maximum when changes in the weights are small. Thus, we can stop when the sum of the absolute values of the weight differences is less than some small number.
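A minimal sketch of this update loop in Python, using the stopping rule described above (the learning rate, tolerance, and names are illustrative assumptions, not from the slides):

import numpy as np

def gradient_ascent(X, y, lr=0.1, tol=1e-4, max_iter=1000):
    # X: (n, d) inputs with a leading column of ones for the bias term.
    # y: (n,) binary labels in {0, 1}.
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))   # current predictions P(y = 1 | x)
        grad = X.T @ (y - p)                 # prediction error correlated with the inputs
        w_new = w + lr * grad                # move uphill on the log likelihood
        if np.abs(w_new - w).sum() < tol:    # stop when weight changes are small
            return w_new
        w = w_new
    return w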

19 Example We get a monotonically increasing log likelihood of the training labels as a function of the number of iterations.

20 Convergence The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction, i.e., when the gradient is zero. This condition means (again) that the prediction error is uncorrelated with the components of the input vector.
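The condition on the slide is an image; setting the gradient of the log likelihood to zero gives (a reconstruction, indexing assumed):

\sum_{i=1}^{n} \big( y_i - P(y_i = 1 \mid x_i, w) \big)\, x_{ij} = 0 \quad \text{for every input component } j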

21 Additive models We can extend the logistic regression models to additive (logistic) models. We are free to choose the basis functions φ_i(x). For example, each φ_i can act as an adjustable "feature detector":
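The additive model on the slide is an image; a standard reconstruction with m basis functions (notation assumed):

P(y = 1 \mid x, w) = g\Big( w_0 + \sum_{i=1}^{m} w_i\, \varphi_i(x) \Big), \qquad g(z) = \frac{1}{1 + e^{-z}}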

22 Regularization With more basis functions and parameters, the models become flexible enough to fit any relation (or noise) in the data. We need to find ways of constraining or "regularizing" such models other than limiting the number of parameters. We can limit the "effective" number of parameter choices, where the number depends on the resolution.

23 Regularization: basic idea
Consider the set of (0/1) coins parameterized by the probability p of getting "1". How many coins are there?
- Case 1: an uncountable number
- Case 2: 5 coins (p1,...,p5), so that the predictions of any other coin (indexed by p) are no more than ε = 0.1 away: for any p, |p − pj| ≤ 0.1 for at least one j
- Case 3: only 1 coin if ε = 0.5

24 Regularization: logistic regression
Simple logistic regression model parameterized by w = (w0, w1); we assume that x ∈ [-1, 1], i.e., that the inputs remain bounded. We can now divide the parameter space into regions with centers w1, w2, ... such that the predictions of any w (for any x ∈ [-1, 1]) are close to those of one of the centers:

25 Regularization: logistic regression
We can view the different discrete parameter choices w1, w2, ..., wn, ... as "centroids" of regions in the parameter space such that, within each region, predictions remain close to those of the centroid.

26 Regularizing logistic regression
We can regularize the models by imposing a penalty in the estimation criterion that encourages ||w|| to remain small. This gives a maximum penalized log likelihood criterion, where larger values of C impose stronger regularization. More generally, we can assign penalties based on prior distributions over the parameters, i.e., add log P(w) to the log likelihood criterion.
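The criterion on the slide is an image; a standard reconstruction of the penalized log likelihood (the exact scaling of the penalty term is an assumption):

\sum_{i=1}^{n} \log P(y_i \mid x_i, w) \;-\; \frac{C}{2}\, \|w\|^2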

27 Regularization: gradient ascent
As with the non-regularized version, we maximize the penalized log likelihood by gradient ascent. Taking the derivative with respect to wj gives the new update rule:
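The rule on the slide is an image; differentiating the penalized criterion above with respect to w_j gives (a reconstruction, with step size \eta assumed):

w_j \leftarrow w_j + \eta \left[ \sum_{i=1}^{n} \big( y_i - P(y_i = 1 \mid x_i, w) \big)\, x_{ij} \;-\; C\, w_j \right]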

28 Training vs. test errors
How do the training/test conditional log likelihoods behave as a function of the regularization parameter C?

29 Acknowledgment These slides are based in part on slides from previous machine learning classes taught by Andrew Moore at CMU and Tommi Jaakkola at MIT. I thank Andrew and Tommi for letting me use their slides.

