Download presentation
Presentation is loading. Please wait.
1
Logistic Regression Rong Jin
2
Logistic Regression Model In Gaussian generative model: Generalize the ratio to a linear model Parameters: w and c
3
Logistic Regression Model In Gaussian generative model: Generalize the ratio to a linear model Parameters: w and c
4
Logistic Regression Model The log-ratio of positive class to negative class Results
5
Logistic Regression Model The log-ratio of positive class to negative class Results
6
Logistic Regression Model Assume the inputs and outputs are related in the log linear function Estimate weights: MLE approach
7
Example 1: Heart Disease Input feature x: age group id output y: having heart disease or not +1: having heart disease -1: no heart disease 1: 25-29 2: 30-34 3: 35-39 4: 40-44 5: 45-49 6: 50-54 7: 55-59 8: 60-64
8
Example 1: Heart Disease Logistic regression model Learning w and c: MLE approach Numerical optimization: w = 0.58, c = -3.34
9
Example 1: Heart Disease W = 0.58 An old person is more likely to have heart disease C = -3.34 xw+c < 0 p(+|x) < p(-|x) xw+c > 0 p(+|x) > p(-|x) xw+c = 0 decision boundary x* = 5.78 53 year old
10
Naïve Bayes Solution Inaccurate fitting Non Gaussian distribution i* = 5.59 Close to the estimation by logistic regression Even though naïve Bayes does not fit input patterns well, it still works fine for the decision boundary
11
Problems with Using Histogram Data?
12
Uneven Sampling for Different Ages
13
Solution w = 0.63, c = -3.56 i* = 5.65
14
Solution w = 0.63, c = -3.56 i* = 5.65 < 5.78
15
Solution w = 0.63, c = -3.56 i* = 5.65 < 5.78
16
Example: Text Classification Learn to classify text into predefined categories Input x: a document Represented by a vector of words Example: {(president, 10), (bush, 2), (election, 5), …} Output y: if the document is politics or not +1 for political document, -1 for not political document Training data:
17
Example 2: Text Classification Logistic regression model Every term t i is assigned with a weight w i Learning parameters: MLE approach Need numerical solutions
18
Example 2: Text Classification Logistic regression model Every term t i is assigned with a weight w i Learning parameters: MLE approach Need numerical solutions
19
Example 2: Text Classification Weight w i w i > 0: term t i is a positive evidence w i < 0: term t i is a negative evidence w i = 0: term t i is irrelevant to the category of documents The larger the | w i |, the more important t i term is determining whether the document is interesting. Threshold c
20
Example 2: Text Classification Weight w i w i > 0: term t i is a positive evidence w i < 0: term t i is a negative evidence w i = 0: term t i is irrelevant to the category of documents The larger the | w i |, the more important t i term is determining whether the document is interesting. Threshold c
21
Example 2: Text Classification Dataset: Reuter-21578 Classification accuracy Naïve Bayes: 77% Logistic regression: 88%
22
Why Logistic Regression Works better for Text Classification? Optimal linear decision boundary Generative model Weight ~ logp(w|+) - logp(w|-) Sub-optimal weights Independence assumption Naive Bayes assumes that each word is generated independently Logistic regression is able to take into account of the correlation of words
23
Discriminative Model Logistic regression model is a discriminative model Models the conditional probability p(y|x), i.e., the decision boundary Gaussian generative model Models p(x|y), i.e., input patterns of different classes
24
Comparison Generative Model Model P(x|y) Model the input patterns Usually fast converge Cheap computation Robust to noise data But Usually performs worse Discriminative Model Model P(y|x) directly Model the decision boundary Usually good performance But Slow convergence Expensive computation Sensitive to noise data
25
Comparison Generative Model Model P(x|y) Model the input patterns Usually fast converge Cheap computation Robust to noise data But Usually performs worse Discriminative Model Model P(y|x) directly Model the decision boundary Usually good performance But Slow convergence Expensive computation Sensitive to noise data
26
A Few Words about Optimization Convex objective function Solution could be non-unique
27
Problems with Logistic Regression? How about words that only appears in one class?
28
Overfitting Problem with Logistic Regression Consider word t that only appears in one document d, and d is a positive document. Let w be its associated weight Consider the derivative of l(D train ) with respect to w w will be infinite !
29
Overfitting Problem with Logistic Regression Consider word t that only appears in one document d, and d is a positive document. Let w be its associated weight Consider the derivative of l(D train ) with respect to w w will be infinite !
30
Example of Overfitting for LogRes Iteration Decrease in the classification accuracy of test data
31
Solution: Regularization Regularized log-likelihood s||w|| 2 is called the regularizer Favors small weights Prevents weights from becoming too large
32
The Rare Word Problem Consider word t that only appears in one document d, and d is a positive document. Let w be its associated weight
33
The Rare Word Problem Consider the derivative of l(D train ) with respect to w When s is small, the derivative is still positive But, it becomes negative when w is large
34
The Rare Word Problem Consider the derivative of l(D train ) with respect to w When w is small, the derivative is still positive But, it becomes negative when w is large
35
Regularized Logistic Regression Using regularization Without regularization Iteration
36
Interpretation of Regularizer Many interpretation of regularizer Bayesian stat.: model prior Statistical learning: minimize the generalized error Robust optimization: min-max solution
37
Regularizer: Robust Optimization assume each data point is unknown-but-bounded in a sphere of radius s and center x i find the classifier w that is able to classify the unknown-but- bounded data point with high classification confidence
38
Sparse Solution What does the solution of regularized logistic regression look like ? A sparse solution Most weights are small and close to zero
39
Sparse Solution What does the solution of regularized logistic regression look like ? A sparse solution Most weights are small and close to zero
40
Why do We Need Sparse Solution? Two types of solutions 1. Many non-zero weights but many of them are small 2. Only a small number of non-zero weights, and many of them are large Occam’s Razor: the simpler the better A simpler model that fits data unlikely to be coincidence A complicated model that fit data might be coincidence Smaller number of non-zero weights less amount of evidence to consider simpler model case 2 is preferred
41
Occam’s Razer
42
Occam’s Razer: Power = 1
43
Occam’s Razer: Power = 3
44
Occam’s Razor: Power = 10
45
Finding Optimal Solutions Concave objective function No local maximum Many standard optimization algorithms work
46
Gradient Ascent Maximize the log-likelihood by iteratively adjusting the parameters in small increments In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient) Predication ErrorsPreventing weights from being too large
47
Graphical Illustration No regularization case
48
Gradient Ascent Maximize the log-likelihood by iteratively adjusting the parameters in small increments In each iteration, we adjust w in the direction that increases the log-likelihood (toward the gradient)
49
When should Stop? Log-likelihood will monotonically increase during the gradient ascent iterations When should we stop?
50
Using regularization Without regularization Iteration
51
When should Stop? The gradient ascent learning method converges when there is no incentive to move the parameters in any particular direction:
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.