Statistical Learning
Dong Liu, Dept. EEIS, USTC
Chapter 2. Linear Classification
- The ABC of classification
- Logistic regression formulation
- The exponential family and maximum entropy
- Logistic regression solution
- Fisher's linear discriminant analysis
- Perceptron
- Multi-class / Multi-label
Classification in practice
An example table of loan applications:

Attribute       | Applicant 1 | Applicant 2 | Applicant 3 | …
Age             | 26          | 22          | 19          | …
Marital status  | M           | S           |             |
Children number | 1           |             |             |
Salary          | 25000       | 18000      | N/A         |
Loan amount     | 100000      | 10000      | 1000        |
Profession      | Teacher     | Student    |             |
Defaulted?      | N           | Y          |             |

Predicting "Defaulted?" (N/Y) is binary classification; predicting "Profession" is multi-class classification; predicting several non-exclusive tags at once is multi-label classification.
Classification versus Regression
- Both characterize the relation between two variables.
- When the dependent variable is treated as continuous → regression.
- When the dependent variable is treated as discrete → classification.
- A necessary step in classification is to ensure discrete output (i.e. quantization).
Example
- Binary classification on a 2-D plane.
- Independent variable: $x \in \mathbb{R}^2$; dependent variable: $y \in \{-1, +1\}$.
- We may use linear regression to fit $f(x) = w^T x + b$.
- And then produce a classification function $\hat{y} = \mathrm{sign}(w^T x + b)$.
- We have a decision boundary $w^T x + b = 0$, a straight line on the plane.
- And this is linear (binary) classification.
Failure of linear regression for classification
[Figure: a two-class dataset with two decision boundaries. Purple line: linear regression; green line: logistic regression.]
From regression to classification
- Why is the regression method not suitable for classification? There is a training/using mismatch: we train on continuous outputs but use quantized ones.
- But it is difficult to involve quantization in regression. Consider trying to solve $\min_{w,b} \sum_n \left( \mathrm{sign}(w^T x_n + b) - y_n \right)^2$: the sign function makes the objective piecewise constant, so it is hard to optimize directly.
Logistic regression
- Use the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$ to replace the sign function, giving rise to an easier problem.
- Remap the class variables from $\{-1, +1\}$ to $\{0, 1\}$.
- Use the cross-entropy instead of SSE: $\min_{w,b} -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$, where $\hat{y}_n = \sigma(w^T x_n + b)$.
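A minimal NumPy sketch of this formulation (the toy data and variable names are my own, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    """Smooth replacement for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss for labels remapped to {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy data: two 2-D points per class, labels in {0, 1}
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = np.array([1.0, -1.0]), 0.0

y_hat = sigmoid(X @ w + b)      # predicted probabilities
print(cross_entropy(y, y_hat))  # loss under the current (w, b)
```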
Interpretation of cross-entropy
- Logistic regression does not regress the target class, but rather regresses the probability.
- The predicted probability is $P(y = 1 \mid x) = \sigma(w^T x + b) = \hat{y}$.
- And the likelihood function is $\prod_n P(y_n \mid x_n)$, which can be rewritten as $\prod_n \hat{y}_n^{\,y_n} (1 - \hat{y}_n)^{1 - y_n}$.
- Maximizing the log-likelihood $\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ is exactly minimizing the cross-entropy.
More about cross-entropy
- For the datum $(x_n, y_n)$:
- The "ground-truth" probability distribution is $p = (y_n, 1 - y_n)$ over the two classes.
- The "predicted" probability distribution is $q = (\hat{y}_n, 1 - \hat{y}_n)$.
- How to measure the difference between two distributions?
- Cross-entropy: $H(p, q) = -\sum_i p_i \ln q_i$.
- Entropy: $H(p) = -\sum_i p_i \ln p_i$.
- Kullback-Leibler divergence: $D_{KL}(p \,\|\, q) = \sum_i p_i \ln \frac{p_i}{q_i} = H(p, q) - H(p)$.
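A small numeric check of the three quantities and the identity $H(p, q) = H(p) + D_{KL}(p \| q)$ (the toy distributions are my own):

```python
import numpy as np

p = np.array([0.8, 0.2])   # "ground-truth" distribution
q = np.array([0.6, 0.4])   # "predicted" distribution

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

# Cross-entropy decomposes into entropy plus the KL divergence
assert np.isclose(cross_entropy, entropy + kl)
print(entropy, cross_entropy, kl)
```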
Interpretation of logistic regression
- Consider the predicted probability $P(y = 1 \mid x)$. By Bayes' rule it equals $\sigma(a)$ with $a = \ln \frac{P(x \mid y=1) P(y=1)}{P(x \mid y=0) P(y=0)}$.
- If assuming Gaussian class-conditional distributions with a shared covariance, $P(x \mid y = k) = \mathcal{N}(x; \mu_k, \Sigma)$:
- Then $a$ is linear in $x$, i.e. $a = w^T x + b$ with $w = \Sigma^{-1} (\mu_1 - \mu_0)$, which recovers the logistic regression model.
Illustration of logistic regression
[Figure: two Gaussian distributions corresponding to two classes.]
Generalized logistic regression
- Using basis functions $\phi(x)$, the predicted probability is $P(y = 1 \mid x) = \sigma(w^T \phi(x))$.
- Then solve $\min_w -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- Similar to the case of logistic regression, if we assume the class-conditional density of $\phi$ is a distribution that belongs to the exponential family (with a shared scale parameter), then the posterior is again a sigmoid of a linear function of $\phi$.
Exponential family
- Probability distributions that can be written as $p(x \mid \eta) = h(x)\, g(\eta) \exp(\eta^T u(x))$, where $\eta$ is termed the "natural parameter" and $u(x)$ is termed the "sufficient statistic".
- For example, the Gaussian distribution can be written as $\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} \right)$.
- So it belongs to the exponential family with $\eta = \left( \frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2} \right)^T$ and $u(x) = (x, x^2)^T$.
Table of exponential family
Distribution | Natural parameter $\eta$ | Sufficient statistic $u(x)$
Bernoulli distribution (parameter $\mu$) | $\ln \frac{\mu}{1 - \mu}$ | $x$
Poisson distribution (rate $\lambda$) | $\ln \lambda$ | $x$
Exponential distribution (rate $\lambda$) | $-\lambda$ | $x$
Laplace distribution (fixed location $\mu$, scale $b$) | $-\frac{1}{b}$ | $|x - \mu|$
…
Properties of exponential family
- Sufficient statistic: such a statistic is sufficient to estimate the parameter of the distribution.
- Cumulant function: writing the density as $p(x \mid \eta) = h(x) \exp(\eta^T u(x) - A(\eta))$, the function $A(\eta) = -\ln g(\eta)$ is the cumulant function. It can be proved that $\nabla A(\eta) = E[u(x)]$ and $\nabla^2 A(\eta) = \mathrm{cov}[u(x)]$.
- Among all distributions satisfying given constraints on the moments of $u(x)$, such a distribution has the maximum entropy.
Maximum entropy
- Consider a probability function $p(x)$ for a discrete random variable.
- Assume we have observed some information as constraints $\sum_x p(x) f_j(x) = c_j$, together with normalization $\sum_x p(x) = 1$.
- We want to estimate the probability function that has the maximum entropy (uncertainty): maximize $-\sum_x p(x) \ln p(x)$ subject to the constraints.
- The solution is the Gibbs distribution $p(x) \propto \exp\left( \sum_j \lambda_j f_j(x) \right)$, which belongs to the exponential family; see the derivation below.
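A brief reconstruction of the standard Lagrangian argument behind the Gibbs form (the notation $f_j$, $\lambda_j$ follows the constraints above):

```latex
L[p] = -\sum_x p(x)\ln p(x)
     + \lambda_0 \Big(\sum_x p(x) - 1\Big)
     + \sum_j \lambda_j \Big(\sum_x p(x) f_j(x) - c_j\Big)

\frac{\partial L}{\partial p(x)}
  = -\ln p(x) - 1 + \lambda_0 + \sum_j \lambda_j f_j(x) = 0
\quad\Rightarrow\quad
p(x) = \frac{1}{Z} \exp\Big(\sum_j \lambda_j f_j(x)\Big),
\qquad Z = \sum_x \exp\Big(\sum_j \lambda_j f_j(x)\Big).
```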
Maximum differential entropy
- For a continuous random variable, we cannot define the ordinary entropy (why? the discretized entropy diverges as the bin width goes to zero), so we define the differential entropy $H[p] = -\int p(x) \ln p(x)\, dx$.
- Maximizing differential entropy also gives exponential-family distributions.
- For example, if constrained by a fixed mean and a fixed variance (plus normalization), we obtain the Gaussian distribution as the maximizer. Details: cf. PRML.
Logistic regression: closed-form solution?
- Generalized logistic regression: $\min_w E(w) = -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- If we directly calculate the gradient, we have $\nabla E(w) = \sum_n (\hat{y}_n - y_n)\, \phi(x_n)$.
- Setting the gradient to zero yields equations that are nonlinear in $w$ (because of the sigmoid), so we have no closed-form solution.
Numerical algorithms for optimization problems
- Many optimization problems have no closed-form solution; we have to resort to numerical algorithms.
- The more tractable the problem, the more efficient the algorithm that can be applied.
- Second-order: Newton-Raphson. First-order: gradient descent, Frank-Wolfe.
Newton-Raphson method
- Taylor's expansion: $f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2$.
- If we want to find a local minimum/maximum, set the derivative of this quadratic approximation to zero: $x \leftarrow x_0 - \frac{f'(x_0)}{f''(x_0)}$.
- Generalized to vectors: $x \leftarrow x_0 - H^{-1} \nabla f(x_0)$, where $H = \nabla^2 f(x_0)$ is the Hessian matrix.
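A one-variable sketch of the update $x \leftarrow x - f'(x)/f''(x)$ (the example function is my own choice):

```python
# Newton-Raphson on f(x) = x^4 - 3x^2 + 2, seeking a stationary point
def f1(x):  # first derivative of f
    return 4 * x**3 - 6 * x

def f2(x):  # second derivative of f
    return 12 * x**2 - 6

x = 2.0  # initial guess
for _ in range(10):
    x = x - f1(x) / f2(x)  # Newton update
print(x)  # converges to the local minimum at sqrt(3/2) ~ 1.2247
```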
Newton method for logistic regression 1/2
- First-order (gradient): $\nabla E(w) = \Phi^T (\hat{y} - y)$, where $\Phi$ is the design matrix with rows $\phi(x_n)^T$.
- Second-order (Hessian): $H = \nabla^2 E(w) = \Phi^T R \Phi$, where $R = \mathrm{diag}\big( \hat{y}_n (1 - \hat{y}_n) \big)$.
Newton method for logistic regression 2/2
- Then the update formula is $w^{\mathrm{new}} = w - (\Phi^T R \Phi)^{-1} \Phi^T (\hat{y} - y) = (\Phi^T R \Phi)^{-1} \Phi^T R z$ with the working response $z = \Phi w - R^{-1} (\hat{y} - y)$.
- It is actually the solution of a weighted least squares problem $\min_w \sum_n R_{nn} \left( z_n - w^T \phi(x_n) \right)^2$.
- The weight $R_{nn} = \hat{y}_n (1 - \hat{y}_n)$ is higher for more "uncertain" data (predictions near 0.5).
- The method is termed iterative reweighted least squares (IRLS).
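A minimal IRLS sketch following the formulas above, assuming NumPy and a small toy design matrix (names and the ridge guard are mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, y, n_iter=10, ridge=1e-8):
    """Newton / IRLS updates: w_new = (Phi^T R Phi)^(-1) Phi^T R z."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        y_hat = sigmoid(Phi @ w)
        R = y_hat * (1.0 - y_hat)                           # diagonal weights
        z = Phi @ w - (y_hat - y) / np.maximum(R, ridge)    # working response
        A = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(d)  # Phi^T R Phi
        w = np.linalg.solve(A, Phi.T @ (R * z))
    return w

# Toy non-separable data with a bias column appended to Phi
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(irls(Phi, y))
```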
Gradient descent method
- Taylor's expansion: $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$.
- If we want to find a local minimum, step against the derivative: $x \leftarrow x_0 - \eta f'(x_0)$ with a small step size $\eta > 0$.
- Generalized to vectors: $x \leftarrow x_0 - \eta \nabla f(x_0)$.
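Applied to the logistic-regression objective, the vector update $w \leftarrow w - \eta \nabla E(w)$ with $\nabla E(w) = \Phi^T (\hat{y} - y)$ becomes (a sketch; toy data and step size are mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy design matrix (bias column included) and {0,1} labels
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(Phi.shape[1])
eta = 0.1                                   # step size
for _ in range(1000):
    grad = Phi.T @ (sigmoid(Phi @ w) - y)   # gradient of the cross-entropy
    w -= eta * grad                         # gradient-descent step
print(w)
```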
Frank-Wolfe method
- For constrained optimization: $\min_x f(x)$ subject to $x \in S$, with $S$ a convex feasible set.
- Taylor's expansion: $f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0)$.
- We may consider the subproblem $s^* = \arg\min_{s \in S} \nabla f(x_0)^T s$; this is linear and easy.
- And search along the line toward $s^*$: $x \leftarrow x_0 + \gamma (s^* - x_0)$, $\gamma \in [0, 1]$, which keeps $x$ feasible.
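A sketch on the probability simplex, where the linear subproblem has a closed-form solution (pick the vertex with the most negative gradient coordinate); the quadratic objective is my own example:

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 over the probability simplex
c = np.array([0.1, 0.6, 0.9])

def grad(x):
    return 2.0 * (x - c)

x = np.ones(3) / 3.0                        # feasible starting point
for k in range(100):
    g = grad(x)
    s = np.zeros(3)
    s[np.argmin(g)] = 1.0                   # linear subproblem: best vertex
    gamma = 2.0 / (k + 2.0)                 # standard step-size schedule
    x = x + gamma * (s - x)                 # move along the line toward s
print(x)  # close to the projection of c onto the simplex
```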
Bayesian logistic regression 1/2
- Using the Bayesian approach, we need to specify a prior, e.g. a Gaussian prior $p(w) = \mathcal{N}(w; m_0, S_0)$.
- Note that the likelihood is $p(\mathbf{y} \mid w) = \prod_n \hat{y}_n^{\,y_n} (1 - \hat{y}_n)^{1 - y_n}$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- So the posterior is $p(w \mid \mathbf{y}) \propto p(w)\, p(\mathbf{y} \mid w)$, which is no longer Gaussian and has no closed form.
Bayesian logistic regression 2/2
- Under the Bayesian framework, we can have MAP estimation: $w_{\mathrm{MAP}} = \arg\max_w \ln p(w \mid \mathbf{y})$.
- It can be solved by numerical algorithms like Newton's method.
- We can have the predictive distribution $p(y^* = 1 \mid x^*) = \int \sigma(w^T \phi(x^*))\, p(w \mid \mathbf{y})\, dw$.
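With a Gaussian prior $\mathcal{N}(0, \alpha^{-1} I)$ (a choice of mine for illustration), MAP estimation just adds $\alpha w$ to the gradient and $\alpha I$ to the Hessian of the Newton steps; a sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_newton(Phi, y, alpha=1.0, n_iter=20):
    """Newton on the negative log-posterior: cross-entropy + alpha/2 ||w||^2."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        y_hat = sigmoid(Phi @ w)
        grad = Phi.T @ (y_hat - y) + alpha * w              # prior adds alpha*w
        R = y_hat * (1.0 - y_hat)
        H = Phi.T @ (R[:, None] * Phi) + alpha * np.eye(d)  # prior adds alpha*I
        w -= np.linalg.solve(H, grad)
    return w

Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(map_newton(Phi, y))
```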
Laplace approximation 1/2
- We may use a Gaussian distribution to approximate another distribution around its mode: find the mode $z_0$ of $p(z)$, then set $q(z) = \mathcal{N}(z; z_0, A^{-1})$ with $A = -\nabla \nabla \ln p(z) \big|_{z = z_0}$.
- Applied to the logistic-regression posterior: $q(w) = \mathcal{N}(w; w_{\mathrm{MAP}}, S_N)$ with $S_N^{-1} = -\nabla \nabla \ln p(w \mid \mathbf{y}) \big|_{w_{\mathrm{MAP}}}$.
Laplace approximation 2/2
- Now we estimate the predictive distribution: $p(y^* = 1 \mid x^*) \approx \int \sigma(a)\, \mathcal{N}(a; \mu_a, \sigma_a^2)\, da$, where $\mu_a = w_{\mathrm{MAP}}^T \phi(x^*)$ and $\sigma_a^2 = \phi(x^*)^T S_N\, \phi(x^*)$.
- Another approximation: sigmoid ≈ integral of Gaussian (the probit function), which gives $p(y^* = 1 \mid x^*) \approx \sigma\left( \mu_a \big/ \sqrt{1 + \pi \sigma_a^2 / 8} \right)$. Details: cf. PRML.
Geometric methods for classification
- In logistic regression, we have $\hat{y} = \sigma(w^T x + b)$, i.e. predict class 1 when $w^T x + b > 0$.
- It can be viewed as a projection $z = w^T x$ followed by a thresholding.
- How to decide the projection direction $w$?
Linear discriminant analysis (LDA)
- Fisher proposed to maximize inter-class separation and minimize intra-class variance: $\max_w J(w) = \frac{w^T S_B w}{w^T S_W w}$, where $S_B = (m_2 - m_1)(m_2 - m_1)^T$ is the between-class scatter and $S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T$ is the within-class scatter.
- The solution is $w \propto S_W^{-1} (m_2 - m_1)$.
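A direct NumPy sketch of $w \propto S_W^{-1}(m_2 - m_1)$ (the toy data are mine):

```python
import numpy as np

# Toy 2-D data for two classes
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]])  # class 1
X2 = np.array([[4.0, 0.5], [5.0, 1.0], [4.5, 0.0]])  # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)            # class means
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter

w = np.linalg.solve(Sw, m2 - m1)   # Fisher direction, up to scale
w /= np.linalg.norm(w)
print(w)
```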
Analogy between LDA and Gaussian-Logistic
- If the two classes have Gaussian class-conditional densities with a shared covariance, the Gaussian-Logistic view gives the direction $w = \Sigma^{-1} (\mu_2 - \mu_1)$.
- Then LDA is the empirical counterpart: $S_W$ estimates $\Sigma$ (up to a constant) and the class means estimate $\mu_1, \mu_2$, giving $w \propto S_W^{-1} (m_2 - m_1)$.
- So LDA is an approximation of Gaussian-Logistic.
Optimization methods for classification
- The original problem $\min_{w,b} \sum_n \left( \mathrm{sign}(w^T x_n + b) - y_n \right)^2$ is quite difficult.
- In logistic regression, sign is replaced by sigmoid, and squared error is replaced by cross-entropy.
- Many different optimization problems have been formulated as tractable surrogates of the original one.
Perceptron
- Rosenblatt proposed to solve $\min_w E(w) = -\sum_{n \in \mathcal{M}} y_n\, w^T \phi(x_n)$, where $\mathcal{M}$ is the set of misclassified points and $y_n \in \{-1, +1\}$.
- Using the standard gradient descent method: $w \leftarrow w + \eta \sum_{n \in \mathcal{M}} y_n\, \phi(x_n)$.
- Using the stochastic/incremental gradient descent method: pick one misclassified point $n$ and update $w \leftarrow w + \eta\, y_n\, \phi(x_n)$.
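A stochastic-update sketch with labels in $\{-1, +1\}$ (the toy data and the cycling scheme are mine):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])
X_aug = np.hstack([X, np.ones((4, 1))])  # append 1 to fold the bias into w

w = np.zeros(3)
eta = 1.0
for _ in range(100):                     # passes over the data
    updated = False
    for xn, yn in zip(X_aug, y):
        if yn * (w @ xn) <= 0:           # misclassified point
            w += eta * yn * xn           # perceptron update
            updated = True
    if not updated:                      # converged: all points correct
        break
print(w)
```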
Illustration of perceptron
- [Figure reproduced from PRML: the black arrow represents $w$, the black line is the decision boundary, red points are positive, and blue points are negative.]
- Set the learning rate, e.g. $\eta = 1$; for the perceptron, the scale of $w$ does not affect the decisions.
- If a linear classifier exists for the given dataset (i.e. the data are linearly separable), then the perceptron is guaranteed to converge (the perceptron convergence theorem).
Dual form of perceptron
- Set the initial value $w = 0$; after the iterations, $w = \sum_n \alpha_n y_n \phi(x_n)$, where $\alpha_n$ accumulates $\eta$ each time point $n$ triggers an update.
- So we can learn the $\alpha_n$ directly. If point $n$ is misclassified, i.e. $y_n \sum_m \alpha_m y_m \langle \phi(x_m), \phi(x_n) \rangle \le 0$: update $\alpha_n \leftarrow \alpha_n + \eta$.
- After learning, for a new point $x$ to classify, we have $\hat{y} = \mathrm{sign}\left( \sum_n \alpha_n y_n \langle \phi(x_n), \phi(x) \rangle \right)$.
- In the dual form, we use inner products of data rather than raw data.
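The same toy problem in the dual form, precomputing the Gram matrix of inner products (a sketch; data as above):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])
X_aug = np.hstack([X, np.ones((4, 1))])
G = X_aug @ X_aug.T                      # Gram matrix: only inner products used

alpha, eta = np.zeros(4), 1.0
for _ in range(100):
    updated = False
    for n in range(4):
        if y[n] * np.sum(alpha * y * G[:, n]) <= 0:  # misclassified in dual form
            alpha[n] += eta
            updated = True
    if not updated:
        break

# Classify a new point using only inner products with the training data
x_new = np.append([2.0, 0.5], 1.0)
print(np.sign(np.sum(alpha * y * (X_aug @ x_new))))
```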
Multi-class by binary classifier
- Method 1: one-versus-the-rest, i.e. train $K$ binary classifiers, each separating one class from all the others.
- Method 2: one-versus-one, i.e. train $K(K-1)/2$ binary classifiers, one per pair of classes, and vote.
- Both have limitations, e.g. regions of the input space where the binary decisions conflict or are ambiguous.
Multi-class, extended from logistic 1/2
- Recall logistic regression: $P(y = 1 \mid x) = \frac{\exp(w^T x + b)}{1 + \exp(w^T x + b)}$ and $P(y = 0 \mid x) = \frac{1}{1 + \exp(w^T x + b)}$.
Multi-class, extended from logistic 2/2
- Extended to multi-class: if we define $P(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_{j=1}^{K} \exp(w_j^T x + b_j)}$, then it is known as softmax regression.
- Learning: cross-entropy with a one-hot vector, $\min -\sum_n \sum_k t_{nk} \ln P(y = k \mid x_n)$, where $t_n$ is the one-hot encoding of $y_n$.
- Using: predict $\arg\max_k P(y = k \mid x)$; the decision boundaries are linear and each decision region is convex.
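A sketch of the softmax probabilities and the one-hot cross-entropy (the toy setup, including random weights, is mine):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: 4 points in 2-D, 3 classes; W holds one (w_k, b_k) per class row
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
X_aug = np.hstack([X, np.ones((4, 1))])
W = np.random.default_rng(0).normal(size=(3, 3))

P = softmax(X_aug @ W.T)       # P[n, k] = P(y = k | x_n)
y = np.array([0, 0, 1, 2])
T = np.eye(3)[y]               # one-hot targets
loss = -np.sum(T * np.log(P))  # cross-entropy with one-hot vectors
print(loss, P.argmax(axis=1))  # loss and predicted classes
```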
Multi-label by binary classifier
- Since the classes are not exclusive, it is natural to use yes-versus-no for each class.
- Thus, it is simply a concatenation of multiple binary classifiers.
- Using: we may rank the predicted probabilities to decide which labels to output.
- Learning: we may also learn to rank, using e.g. a pairwise objective function.
Chapter summary

Dictionary:
- Binary classification
- Decision boundary
- Entropy; cross entropy; differential entropy
- Exponential family
- Gibbs distribution
- Kullback-Leibler divergence
- Multi-class classification
- Multi-label classification

Toolbox:
- Frank-Wolfe
- Gradient descent; stochastic gradient descent
- Laplace approximation
- (Fisher's) Linear discriminant analysis
- Logistic regression; generalized logistic regression
- Newton-Raphson
- Perceptron
- Sigmoid
- Softmax