Statistical Learning
Dong Liu, Dept. EEIS, USTC
Chapter 2. Linear Classification
- The ABC of classification
- Logistic regression formulation
- The exponential family and maximum entropy
- Logistic regression solution
- Fisher's linear discriminant analysis
- Perceptron
- Multi-class / Multi-label
Classification in practice
An example table of loan applications:

Attribute       | Applicant 1 | Applicant 2 | Applicant 3 | …
Age             | 26          | 22          | 19          | …
Marital status  | M           | S           |             |
Children number | 1           |             |             |
Salary          | 25000       | 18000      | N/A         |
Loan amount     | 100000      | 10000      | 1000        |
Profession      | Teacher     | Student    |             |
Defaulted?      | N           | Y          |             |

Predicting "Defaulted?" (N/Y) is binary classification; predicting "Profession" is multi-class classification; predicting several non-exclusive tags at once is multi-label classification.
Classification versus Regression
- Both characterize the relation between two variables.
- When the dependent variable is treated as continuous → regression.
- When the dependent variable is treated as discrete → classification.
- A necessary step in classification is to ensure discrete output (i.e. quantization).
Example
- Binary classification on a 2-D plane.
- Independent variable: $x \in \mathbb{R}^2$; dependent variable: $y \in \{-1, +1\}$.
- We may use linear regression to fit $f(x) = w^T x + b$.
- And then produce a classification function $\hat{y} = \mathrm{sign}(w^T x + b)$.
- We have a decision boundary $w^T x + b = 0$, a straight line on the plane.
- And this is linear (binary) classification.
Failure of linear regression for classification
[Figure: a two-class dataset with two decision boundaries. Purple line: linear regression; green line: logistic regression.]
From regression to classification
- Why is the regression method not suitable for classification? There is a training/using mismatch: we train on continuous outputs but use quantized ones.
- But it is difficult to involve quantization in regression. Consider trying to solve $\min_{w,b} \sum_n \left( \mathrm{sign}(w^T x_n + b) - y_n \right)^2$: the sign function makes the objective piecewise constant, so it is hard to optimize directly.
Logistic regression
- Use the sigmoid function $\sigma(a) = \frac{1}{1 + e^{-a}}$ to replace the sign function, giving rise to an easier problem.
- Remap the class variables from $\{-1, +1\}$ to $\{0, 1\}$.
- Use the cross-entropy instead of SSE: $\min_{w,b} -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$, where $\hat{y}_n = \sigma(w^T x_n + b)$.
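A minimal NumPy sketch of this formulation (the toy data and variable names are my own, not from the lecture):

```python
import numpy as np

def sigmoid(a):
    """Smooth replacement for the sign function."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss for labels remapped to {0, 1}."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy data: two 2-D points per class, labels in {0, 1}
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = np.array([1.0, -1.0]), 0.0

y_hat = sigmoid(X @ w + b)      # predicted probabilities
print(cross_entropy(y, y_hat))  # loss under the current (w, b)
```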
Interpretation of cross-entropy
- Logistic regression does not regress the target class, but rather regresses the probability.
- The predicted probability is $P(y = 1 \mid x) = \sigma(w^T x + b) = \hat{y}$.
- And the likelihood function is $\prod_n P(y_n \mid x_n)$, which can be rewritten as $\prod_n \hat{y}_n^{\,y_n} (1 - \hat{y}_n)^{1 - y_n}$.
- Maximizing the log-likelihood $\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ is exactly minimizing the cross-entropy.
More about cross-entropy
- For the datum $(x_n, y_n)$:
- The "ground-truth" probability distribution is $p = (y_n, 1 - y_n)$ over the two classes.
- The "predicted" probability distribution is $q = (\hat{y}_n, 1 - \hat{y}_n)$.
- How to measure the difference between two distributions?
- Cross-entropy: $H(p, q) = -\sum_i p_i \ln q_i$.
- Entropy: $H(p) = -\sum_i p_i \ln p_i$.
- Kullback-Leibler divergence: $D_{KL}(p \,\|\, q) = \sum_i p_i \ln \frac{p_i}{q_i} = H(p, q) - H(p)$.
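A small numeric check of the three quantities and the identity $H(p, q) = H(p) + D_{KL}(p \| q)$ (the toy distributions are my own):

```python
import numpy as np

p = np.array([0.8, 0.2])   # "ground-truth" distribution
q = np.array([0.6, 0.4])   # "predicted" distribution

entropy = -np.sum(p * np.log(p))        # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q)

# Cross-entropy decomposes into entropy plus the KL divergence
assert np.isclose(cross_entropy, entropy + kl)
print(entropy, cross_entropy, kl)
```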
Interpretation of logistic regression
- Consider the predicted probability $P(y = 1 \mid x)$. By Bayes' rule it equals $\sigma(a)$ with $a = \ln \frac{P(x \mid y=1) P(y=1)}{P(x \mid y=0) P(y=0)}$.
- If assuming Gaussian class-conditional distributions with a shared covariance, $P(x \mid y = k) = \mathcal{N}(x; \mu_k, \Sigma)$:
- Then $a$ is linear in $x$, i.e. $a = w^T x + b$ with $w = \Sigma^{-1} (\mu_1 - \mu_0)$, which recovers the logistic regression model.
Illustration of logistic regression
[Figure: two Gaussian distributions corresponding to two classes.]
Generalized logistic regression
- Using basis functions $\phi(x)$, the predicted probability is $P(y = 1 \mid x) = \sigma(w^T \phi(x))$.
- Then solve $\min_w -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- Similar to the case of logistic regression, if we assume the class-conditional density of $\phi$ is a distribution that belongs to the exponential family (with a shared scale parameter), then the posterior is again a sigmoid of a linear function of $\phi$.
Exponential family
- Probability distributions that can be written as $p(x \mid \eta) = h(x)\, g(\eta) \exp(\eta^T u(x))$, where $\eta$ is termed the "natural parameter" and $u(x)$ is termed the "sufficient statistic".
- For example, the Gaussian distribution can be written as $\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( \frac{\mu}{\sigma^2} x - \frac{1}{2\sigma^2} x^2 - \frac{\mu^2}{2\sigma^2} \right)$.
- So it belongs to the exponential family with $\eta = \left( \frac{\mu}{\sigma^2},\, -\frac{1}{2\sigma^2} \right)^T$ and $u(x) = (x, x^2)^T$.
Table of exponential family
Distribution | Natural parameter $\eta$ | Sufficient statistic $u(x)$
Bernoulli distribution (parameter $\mu$) | $\ln \frac{\mu}{1 - \mu}$ | $x$
Poisson distribution (rate $\lambda$) | $\ln \lambda$ | $x$
Exponential distribution (rate $\lambda$) | $-\lambda$ | $x$
Laplace distribution (fixed location $\mu$, scale $b$) | $-\frac{1}{b}$ | $|x - \mu|$
…
Properties of exponential family
- Sufficient statistic: such a statistic is sufficient to estimate the parameter of the distribution.
- Cumulant function: writing the density as $p(x \mid \eta) = h(x) \exp(\eta^T u(x) - A(\eta))$, the function $A(\eta) = -\ln g(\eta)$ is the cumulant function. It can be proved that $\nabla A(\eta) = E[u(x)]$ and $\nabla^2 A(\eta) = \mathrm{cov}[u(x)]$.
- Among all distributions satisfying given constraints on the moments of $u(x)$, such a distribution has the maximum entropy.
Maximum entropy
- Consider a probability function $p(x)$ for a discrete random variable.
- Assume we have observed some information as constraints $\sum_x p(x) f_j(x) = c_j$, together with normalization $\sum_x p(x) = 1$.
- We want to estimate the probability function that has the maximum entropy (uncertainty): maximize $-\sum_x p(x) \ln p(x)$ subject to the constraints.
- The solution is the Gibbs distribution $p(x) \propto \exp\left( \sum_j \lambda_j f_j(x) \right)$, which belongs to the exponential family; see the derivation below.
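A brief reconstruction of the standard Lagrangian argument behind the Gibbs form (the notation $f_j$, $\lambda_j$ follows the constraints above):

```latex
L[p] = -\sum_x p(x)\ln p(x)
     + \lambda_0 \Big(\sum_x p(x) - 1\Big)
     + \sum_j \lambda_j \Big(\sum_x p(x) f_j(x) - c_j\Big)

\frac{\partial L}{\partial p(x)}
  = -\ln p(x) - 1 + \lambda_0 + \sum_j \lambda_j f_j(x) = 0
\quad\Rightarrow\quad
p(x) = \frac{1}{Z} \exp\Big(\sum_j \lambda_j f_j(x)\Big),
\qquad Z = \sum_x \exp\Big(\sum_j \lambda_j f_j(x)\Big).
```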
Maximum differential entropy
- For a continuous random variable, we cannot define the ordinary entropy (why? the discretized entropy diverges as the bin width goes to zero), so we define the differential entropy $H[p] = -\int p(x) \ln p(x)\, dx$.
- Maximizing differential entropy also gives exponential-family distributions.
- For example, if constrained by a fixed mean and a fixed variance (plus normalization), we obtain the Gaussian distribution as the maximizer. Details: cf. PRML.
Logistic regression: closed-form solution?
- Generalized logistic regression: $\min_w E(w) = -\sum_n \left[ y_n \ln \hat{y}_n + (1 - y_n) \ln(1 - \hat{y}_n) \right]$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- If we directly calculate the gradient, we have $\nabla E(w) = \sum_n (\hat{y}_n - y_n)\, \phi(x_n)$.
- Setting the gradient to zero yields equations that are nonlinear in $w$ (because of the sigmoid), so we have no closed-form solution.
Numerical algorithms for optimization problems
- Many optimization problems have no closed-form solution; we have to resort to numerical algorithms.
- The more tractable the problem, the more efficient the algorithm that can be applied.
- Second-order: Newton-Raphson. First-order: gradient descent, Frank-Wolfe.
Newton-Raphson method
- Taylor's expansion: $f(x) \approx f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2$.
- If we want to find a local minimum/maximum, set the derivative of this quadratic approximation to zero: $x \leftarrow x_0 - \frac{f'(x_0)}{f''(x_0)}$.
- Generalized to vectors: $x \leftarrow x_0 - H^{-1} \nabla f(x_0)$, where $H = \nabla^2 f(x_0)$ is the Hessian matrix.
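A one-variable sketch of the update $x \leftarrow x - f'(x)/f''(x)$ (the example function is my own choice):

```python
# Newton-Raphson on f(x) = x^4 - 3x^2 + 2, seeking a stationary point
def f1(x):  # first derivative of f
    return 4 * x**3 - 6 * x

def f2(x):  # second derivative of f
    return 12 * x**2 - 6

x = 2.0  # initial guess
for _ in range(10):
    x = x - f1(x) / f2(x)  # Newton update
print(x)  # converges to the local minimum at sqrt(3/2) ~ 1.2247
```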
Newton method for logistic regression 1/2
- First-order (gradient): $\nabla E(w) = \Phi^T (\hat{y} - y)$, where $\Phi$ is the design matrix with rows $\phi(x_n)^T$.
- Second-order (Hessian): $H = \nabla^2 E(w) = \Phi^T R \Phi$, where $R = \mathrm{diag}\big( \hat{y}_n (1 - \hat{y}_n) \big)$.
Newton method for logistic regression 2/2
- Then the update formula is $w^{\mathrm{new}} = w - (\Phi^T R \Phi)^{-1} \Phi^T (\hat{y} - y) = (\Phi^T R \Phi)^{-1} \Phi^T R z$ with the working response $z = \Phi w - R^{-1} (\hat{y} - y)$.
- It is actually the solution of a weighted least squares problem $\min_w \sum_n R_{nn} \left( z_n - w^T \phi(x_n) \right)^2$.
- The weight $R_{nn} = \hat{y}_n (1 - \hat{y}_n)$ is higher for more "uncertain" data (predictions near 0.5).
- The method is termed iterative reweighted least squares (IRLS).
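A minimal IRLS sketch following the formulas above, assuming NumPy and a small toy design matrix (names and the ridge guard are mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, y, n_iter=10, ridge=1e-8):
    """Newton / IRLS updates: w_new = (Phi^T R Phi)^(-1) Phi^T R z."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        y_hat = sigmoid(Phi @ w)
        R = y_hat * (1.0 - y_hat)                           # diagonal weights
        z = Phi @ w - (y_hat - y) / np.maximum(R, ridge)    # working response
        A = Phi.T @ (R[:, None] * Phi) + ridge * np.eye(d)  # Phi^T R Phi
        w = np.linalg.solve(A, Phi.T @ (R * z))
    return w

# Toy non-separable data with a bias column appended to Phi
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(irls(Phi, y))
```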
Gradient descent method
- Taylor's expansion: $f(x) \approx f(x_0) + f'(x_0)(x - x_0)$.
- If we want to find a local minimum, step against the derivative: $x \leftarrow x_0 - \eta f'(x_0)$ with a small step size $\eta > 0$.
- Generalized to vectors: $x \leftarrow x_0 - \eta \nabla f(x_0)$.
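Applied to the logistic-regression objective, the vector update $w \leftarrow w - \eta \nabla E(w)$ with $\nabla E(w) = \Phi^T (\hat{y} - y)$ becomes (a sketch; toy data and step size are mine):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy design matrix (bias column included) and {0,1} labels
Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(Phi.shape[1])
eta = 0.1                                   # step size
for _ in range(1000):
    grad = Phi.T @ (sigmoid(Phi @ w) - y)   # gradient of the cross-entropy
    w -= eta * grad                         # gradient-descent step
print(w)
```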
Frank-Wolfe method
- For constrained optimization: $\min_x f(x)$ subject to $x \in S$, with $S$ a convex feasible set.
- Taylor's expansion: $f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0)$.
- We may consider the subproblem $s^* = \arg\min_{s \in S} \nabla f(x_0)^T s$; this is linear and easy.
- And search along the line toward $s^*$: $x \leftarrow x_0 + \gamma (s^* - x_0)$, $\gamma \in [0, 1]$, which keeps $x$ feasible.
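A sketch on the probability simplex, where the linear subproblem has a closed-form solution (pick the vertex with the most negative gradient coordinate); the quadratic objective is my own example:

```python
import numpy as np

# Minimize f(x) = ||x - c||^2 over the probability simplex
c = np.array([0.1, 0.6, 0.9])

def grad(x):
    return 2.0 * (x - c)

x = np.ones(3) / 3.0                        # feasible starting point
for k in range(100):
    g = grad(x)
    s = np.zeros(3)
    s[np.argmin(g)] = 1.0                   # linear subproblem: best vertex
    gamma = 2.0 / (k + 2.0)                 # standard step-size schedule
    x = x + gamma * (s - x)                 # move along the line toward s
print(x)  # close to the projection of c onto the simplex
```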
Bayesian logistic regression 1/2
- Using the Bayesian approach, we need to specify a prior, e.g. a Gaussian prior $p(w) = \mathcal{N}(w; m_0, S_0)$.
- Note that the likelihood is $p(\mathbf{y} \mid w) = \prod_n \hat{y}_n^{\,y_n} (1 - \hat{y}_n)^{1 - y_n}$ with $\hat{y}_n = \sigma(w^T \phi(x_n))$.
- So the posterior is $p(w \mid \mathbf{y}) \propto p(w)\, p(\mathbf{y} \mid w)$, which is no longer Gaussian and has no closed form.
Bayesian logistic regression 2/2
- Under the Bayesian framework, we can have MAP estimation: $w_{\mathrm{MAP}} = \arg\max_w \ln p(w \mid \mathbf{y})$.
- It can be solved by numerical algorithms like Newton's method.
- We can have the predictive distribution $p(y^* = 1 \mid x^*) = \int \sigma(w^T \phi(x^*))\, p(w \mid \mathbf{y})\, dw$.
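With a Gaussian prior $\mathcal{N}(0, \alpha^{-1} I)$ (a choice of mine for illustration), MAP estimation just adds $\alpha w$ to the gradient and $\alpha I$ to the Hessian of the Newton steps; a sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def map_newton(Phi, y, alpha=1.0, n_iter=20):
    """Newton on the negative log-posterior: cross-entropy + alpha/2 ||w||^2."""
    n, d = Phi.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        y_hat = sigmoid(Phi @ w)
        grad = Phi.T @ (y_hat - y) + alpha * w              # prior adds alpha*w
        R = y_hat * (1.0 - y_hat)
        H = Phi.T @ (R[:, None] * Phi) + alpha * np.eye(d)  # prior adds alpha*I
        w -= np.linalg.solve(H, grad)
    return w

Phi = np.array([[1.0, 0.2], [1.0, 1.5], [1.0, 0.8], [1.0, 2.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(map_newton(Phi, y))
```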
Laplace approximation 1/2
- We may use a Gaussian distribution to approximate another distribution around its mode: find the mode $z_0$ of $p(z)$, then set $q(z) = \mathcal{N}(z; z_0, A^{-1})$ with $A = -\nabla \nabla \ln p(z) \big|_{z = z_0}$.
- Applied to the logistic-regression posterior: $q(w) = \mathcal{N}(w; w_{\mathrm{MAP}}, S_N)$ with $S_N^{-1} = -\nabla \nabla \ln p(w \mid \mathbf{y}) \big|_{w_{\mathrm{MAP}}}$.
Laplace approximation 2/2
- Now we estimate the predictive distribution: $p(y^* = 1 \mid x^*) \approx \int \sigma(a)\, \mathcal{N}(a; \mu_a, \sigma_a^2)\, da$, where $\mu_a = w_{\mathrm{MAP}}^T \phi(x^*)$ and $\sigma_a^2 = \phi(x^*)^T S_N\, \phi(x^*)$.
- Another approximation: sigmoid ≈ integral of Gaussian (the probit function), which gives $p(y^* = 1 \mid x^*) \approx \sigma\left( \mu_a \big/ \sqrt{1 + \pi \sigma_a^2 / 8} \right)$. Details: cf. PRML.
Geometric methods for classification
- In logistic regression, we have $\hat{y} = \sigma(w^T x + b)$, i.e. predict class 1 when $w^T x + b > 0$.
- It can be viewed as a projection $z = w^T x$ followed by a thresholding.
- How to decide the projection direction $w$?
Linear discriminant analysis (LDA)
- Fisher proposed to maximize inter-class separation and minimize intra-class variance: $\max_w J(w) = \frac{w^T S_B w}{w^T S_W w}$, where $S_B = (m_2 - m_1)(m_2 - m_1)^T$ is the between-class scatter and $S_W = \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T$ is the within-class scatter.
- The solution is $w \propto S_W^{-1} (m_2 - m_1)$.
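A direct NumPy sketch of $w \propto S_W^{-1}(m_2 - m_1)$ (the toy data are mine):

```python
import numpy as np

# Toy 2-D data for two classes
X1 = np.array([[1.0, 2.0], [2.0, 3.0], [1.5, 2.5]])  # class 1
X2 = np.array([[4.0, 0.5], [5.0, 1.0], [4.5, 0.0]])  # class 2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)            # class means
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter

w = np.linalg.solve(Sw, m2 - m1)   # Fisher direction, up to scale
w /= np.linalg.norm(w)
print(w)
```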
Analogy between LDA and Gaussian-Logistic
- If the two classes have Gaussian class-conditional densities with a shared covariance, the Gaussian-Logistic view gives the direction $w = \Sigma^{-1} (\mu_2 - \mu_1)$.
- Then LDA is the empirical counterpart: $S_W$ estimates $\Sigma$ (up to a constant) and the class means estimate $\mu_1, \mu_2$, giving $w \propto S_W^{-1} (m_2 - m_1)$.
- So LDA is an approximation of Gaussian-Logistic.
Optimization methods for classification
- The original problem $\min_{w,b} \sum_n \left( \mathrm{sign}(w^T x_n + b) - y_n \right)^2$ is quite difficult.
- In logistic regression, sign is replaced by sigmoid, and squared error is replaced by cross-entropy.
- Many different optimization problems have been formulated as tractable surrogates of the original one.
Perceptron
- Rosenblatt proposed to solve $\min_w E(w) = -\sum_{n \in \mathcal{M}} y_n\, w^T \phi(x_n)$, where $\mathcal{M}$ is the set of misclassified points and $y_n \in \{-1, +1\}$.
- Using the standard gradient descent method: $w \leftarrow w + \eta \sum_{n \in \mathcal{M}} y_n\, \phi(x_n)$.
- Using the stochastic/incremental gradient descent method: pick one misclassified point $n$ and update $w \leftarrow w + \eta\, y_n\, \phi(x_n)$.
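A stochastic-update sketch with labels in $\{-1, +1\}$ (the toy data and the cycling scheme are mine):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])
X_aug = np.hstack([X, np.ones((4, 1))])  # append 1 to fold the bias into w

w = np.zeros(3)
eta = 1.0
for _ in range(100):                     # passes over the data
    updated = False
    for xn, yn in zip(X_aug, y):
        if yn * (w @ xn) <= 0:           # misclassified point
            w += eta * yn * xn           # perceptron update
            updated = True
    if not updated:                      # converged: all points correct
        break
print(w)
```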
Illustration of perceptron
- [Figure reproduced from PRML: the black arrow represents $w$, the black line is the decision boundary, red points are positive, and blue points are negative.]
- Set the learning rate, e.g. $\eta = 1$; for the perceptron, the scale of $w$ does not affect the decisions.
- If a linear classifier exists for the given dataset (i.e. the data are linearly separable), then the perceptron is guaranteed to converge (the perceptron convergence theorem).
Dual form of perceptron
- Set the initial value $w = 0$; after the iterations, $w = \sum_n \alpha_n y_n \phi(x_n)$, where $\alpha_n$ accumulates $\eta$ each time point $n$ triggers an update.
- So we can learn the $\alpha_n$ directly. If point $n$ is misclassified, i.e. $y_n \sum_m \alpha_m y_m \langle \phi(x_m), \phi(x_n) \rangle \le 0$: update $\alpha_n \leftarrow \alpha_n + \eta$.
- After learning, for a new point $x$ to classify, we have $\hat{y} = \mathrm{sign}\left( \sum_n \alpha_n y_n \langle \phi(x_n), \phi(x) \rangle \right)$.
- In the dual form, we use inner products of data rather than raw data.
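The same toy problem in the dual form, precomputing the Gram matrix of inner products (a sketch; data as above):

```python
import numpy as np

X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
y = np.array([-1, -1, 1, 1])
X_aug = np.hstack([X, np.ones((4, 1))])
G = X_aug @ X_aug.T                      # Gram matrix: only inner products used

alpha, eta = np.zeros(4), 1.0
for _ in range(100):
    updated = False
    for n in range(4):
        if y[n] * np.sum(alpha * y * G[:, n]) <= 0:  # misclassified in dual form
            alpha[n] += eta
            updated = True
    if not updated:
        break

# Classify a new point using only inner products with the training data
x_new = np.append([2.0, 0.5], 1.0)
print(np.sign(np.sum(alpha * y * (X_aug @ x_new))))
```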
Multi-class by binary classifier
- Method 1: one-versus-the-rest, i.e. train $K$ binary classifiers, each separating one class from all the others.
- Method 2: one-versus-one, i.e. train $K(K-1)/2$ binary classifiers, one per pair of classes, and vote.
- Both have limitations, e.g. regions of the input space where the binary decisions conflict or are ambiguous.
Multi-class, extended from logistic 1/2
- Recall logistic regression: $P(y = 1 \mid x) = \frac{\exp(w^T x + b)}{1 + \exp(w^T x + b)}$ and $P(y = 0 \mid x) = \frac{1}{1 + \exp(w^T x + b)}$.
Multi-class, extended from logistic 2/2
- Extended to multi-class: if we define $P(y = k \mid x) = \frac{\exp(w_k^T x + b_k)}{\sum_{j=1}^{K} \exp(w_j^T x + b_j)}$, then it is known as softmax regression.
- Learning: cross-entropy with a one-hot vector, $\min -\sum_n \sum_k t_{nk} \ln P(y = k \mid x_n)$, where $t_n$ is the one-hot encoding of $y_n$.
- Using: predict $\arg\max_k P(y = k \mid x)$; the decision boundaries are linear and each decision region is convex.
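A sketch of the softmax probabilities and the one-hot cross-entropy (the toy setup, including random weights, is mine):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy setup: 4 points in 2-D, 3 classes; W holds one (w_k, b_k) per class row
X = np.array([[0.0, 1.0], [1.0, 2.0], [3.0, 0.0], [4.0, 1.0]])
X_aug = np.hstack([X, np.ones((4, 1))])
W = np.random.default_rng(0).normal(size=(3, 3))

P = softmax(X_aug @ W.T)       # P[n, k] = P(y = k | x_n)
y = np.array([0, 0, 1, 2])
T = np.eye(3)[y]               # one-hot targets
loss = -np.sum(T * np.log(P))  # cross-entropy with one-hot vectors
print(loss, P.argmax(axis=1))  # loss and predicted classes
```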
Multi-label by binary classifier
- Since the classes are not exclusive, it is natural to use yes-versus-no for each class.
- Thus, it is simply a concatenation of multiple binary classifiers.
- Using: we may rank the predicted probabilities to decide which labels to output.
- Learning: we may also learn to rank, using e.g. a pairwise objective function.
Chapter summary

Dictionary:
- Binary classification
- Decision boundary
- Entropy; cross entropy; differential entropy
- Exponential family
- Gibbs distribution
- Kullback-Leibler divergence
- Multi-class classification
- Multi-label classification

Toolbox:
- Frank-Wolfe
- Gradient descent; stochastic gradient descent
- Laplace approximation
- (Fisher's) Linear discriminant analysis
- Logistic regression; generalized logistic regression
- Newton-Raphson
- Perceptron
- Sigmoid
- Softmax