Linear Model
Ying Shen, SSE, Tongji University, Sep. 2016
The basic form of the linear model
Given a sample $\mathbf{x} = (x_1, x_2, \ldots, x_d)^T$ with $d$ attributes, the linear model tries to learn a prediction function using a linear combination of all the attributes, i.e.
$f(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \ldots + w_d x_d + b$
The vector form of the function is
$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$, where $\mathbf{w} = (w_1, w_2, \ldots, w_d)^T$.
Once $\mathbf{w}$ and $b$ have been learned from samples, $f$ is determined. For example,
$f_{\text{ripe}}(\mathbf{x}) = 0.2 \cdot x_{\text{color}} + 0.5 \cdot x_{\text{root}} + 0.3 \cdot x_{\text{knock}} + 1$
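A minimal sketch of evaluating $f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$, reusing the example coefficients above; the attribute values of the sample are made up for illustration:

```python
import numpy as np

# Example coefficients from the slide; the sample's attribute values are illustrative.
w = np.array([0.2, 0.5, 0.3])   # w = (w_1, w_2, w_3)^T
b = 1.0
x = np.array([0.8, 0.6, 0.7])   # x = (x_1, x_2, x_3)^T

# f(x) = w^T x + b
f_x = w @ x + b
print(f_x)   # 1.67
```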
Linear regression
Given a dataset $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$, $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})^T$, the task of linear regression is to learn a linear model that, for a new sample $\mathbf{x}'$, predicts a value close to its true value $y'$.
When $d = 1$, each sample reduces to a scalar $x_i$.
Example data: hours spent studying: 4, 9, 10, 14, 7, 12, 22, 1, 3, 8; Math SAT score: 390, 580, 650, 730, 410, 530, 600, 790, 350, 400, 590.
Linear regression
We will learn a linear regression model $f(x_i) = w x_i + b$, such that $f(x_i) \approx y_i$.
How do we determine $w$ and $b$?
Linear regression
Mean squared error (MSE) is a commonly used performance measure:
$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i' - y_i)^2$
We want to minimize the MSE between $f(x_i)$ and $y_i$:
$(w^*, b^*) = \arg\min_{(w,b)} \sum_{i=1}^{n} (f(x_i) - y_i)^2 = \arg\min_{(w,b)} \sum_{i=1}^{n} (y_i - w x_i - b)^2$
Linear regression
The method of determining the fitting model based on MSE is called the least squares method.
In a linear regression problem, the least squares method aims to find a line such that the sum of the squared residuals of all the samples is smallest.
Pre-requisite
A stationary point of a differentiable function of one variable is a point of the domain of the function where the derivative is zero.
Single-variable function: $f(x)$ is differentiable in $(a, b)$. At a stationary point $x_0$, $\left.\frac{\partial f}{\partial x}\right|_{x_0} = 0$.
Two-variable function: $f(x, y)$ is differentiable in its domain. At a stationary point $(x_0, y_0)$, $\left.\frac{\partial f}{\partial x}\right|_{(x_0, y_0)} = 0$ and $\left.\frac{\partial f}{\partial y}\right|_{(x_0, y_0)} = 0$.
Pre-requisite
In the general case, if $\mathbf{x}_0 \in \mathbb{R}^{n \times 1}$ is a stationary point of $f(\mathbf{x})$, then
$\left.\frac{\partial f}{\partial x_1}\right|_{\mathbf{x}_0} = 0,\ \left.\frac{\partial f}{\partial x_2}\right|_{\mathbf{x}_0} = 0,\ \ldots,\ \left.\frac{\partial f}{\partial x_n}\right|_{\mathbf{x}_0} = 0$
Proposition: Let $f$ be a differentiable function of $n$ variables defined on the convex set $S$, and let $\mathbf{x}_0$ be in the interior of $S$. If $f$ is convex, then $\mathbf{x}_0$ is a global minimizer of $f$ in $S$ if and only if it is a stationary point of $f$ (i.e. $\left.\frac{\partial f}{\partial x_i}\right|_{\mathbf{x}_0} = 0$ for $i = 1, \ldots, n$).
Parameter estimation
The function $E(w, b) = \sum_{i=1}^{n} (y_i - w x_i - b)^2$ is a convex function.
The minimum is achieved at the stationary point, i.e. where $\frac{\partial E}{\partial w} = 0$ and $\frac{\partial E}{\partial b} = 0$:
$\frac{\partial E}{\partial w} = 2\left(w \sum_{i=1}^{n} x_i^2 - \sum_{i=1}^{n} (y_i - b) x_i\right) = 0$
$\frac{\partial E}{\partial b} = 2\left(n b - \sum_{i=1}^{n} (y_i - w x_i)\right) = 0$
Parameter estimation
Solving these equations gives closed-form expressions for $w$ and $b$:
$w = \dfrac{\sum_{i=1}^{n} y_i (x_i - \bar{x})}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2}$
$b = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - w x_i) = \bar{y} - w \bar{x}$
where $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$ are the means of $x$ and $y$.
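A minimal sketch of these closed-form estimates for the one-dimensional case, which also reports the resulting MSE. The pairing of hours and scores below is only illustrative, since the two rows of the extracted table above do not line up exactly:

```python
import numpy as np

# Illustrative pairing of study hours and scores (any paired 1-D data works).
x = np.array([4.0, 9.0, 10.0, 14.0, 7.0, 12.0, 22.0, 1.0, 3.0, 8.0])
y = np.array([390., 580., 650., 730., 410., 530., 600., 790., 350., 400.])

n = len(x)
x_bar, y_bar = x.mean(), y.mean()

# w = sum_i y_i (x_i - x_bar) / ( sum_i x_i^2 - (1/n)(sum_i x_i)^2 )
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / n)
# b = y_bar - w * x_bar
b = y_bar - w * x_bar

mse = np.mean((w * x + b - y)**2)   # mean squared error of the fitted line
print(w, b, mse)
```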
Multivariate linear regression
In the general case, given a dataset $D$ with $d \ge 1$, we try to learn a model $f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i + b$, such that $f(\mathbf{x}_i) \approx y_i$.
We can again use the least squares method to estimate $\mathbf{w}$ and $b$.
First, denote $\hat{\mathbf{w}} = (\mathbf{w}; b)$ and
$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} & 1 \\ x_{21} & x_{22} & \cdots & x_{2d} & 1 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} & 1 \end{pmatrix} = \begin{pmatrix} \mathbf{x}_1^T & 1 \\ \mathbf{x}_2^T & 1 \\ \vdots & \vdots \\ \mathbf{x}_n^T & 1 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$
Then $f(\mathbf{X}) = \left(\mathbf{x}_1^T \mathbf{w} + b,\ \mathbf{x}_2^T \mathbf{w} + b,\ \ldots,\ \mathbf{x}_n^T \mathbf{w} + b\right)^T = \mathbf{X}\hat{\mathbf{w}}$
Pre-requisite: matrix differentiation
Function is a vector and the variable is a scalar: $\mathbf{y}(t) = (y_1(t), y_2(t), \ldots, y_n(t))^T$.
Definition: $\dfrac{d\mathbf{y}}{dt} = \left(\dfrac{d y_1(t)}{dt}, \dfrac{d y_2(t)}{dt}, \ldots, \dfrac{d y_n(t)}{dt}\right)^T$
Pre-requisite: matrix differentiation
Function is a matrix and the variable is a scalar: $\mathbf{A}(t) = [a_{ij}(t)]$.
Definition (element-wise): $\dfrac{d\mathbf{A}}{dt} = \left[\dfrac{d a_{ij}(t)}{dt}\right]$
Pre-requisite: matrix differentiation
Function is a scalar and the variable is a vector: $f(\mathbf{x})$, $\mathbf{x} = (x_1, \ldots, x_n)^T$.
Definition: $\dfrac{\partial f}{\partial \mathbf{x}} = \left(\dfrac{\partial f}{\partial x_1}, \ldots, \dfrac{\partial f}{\partial x_n}\right)^T$
In a similar way, $\dfrac{\partial f}{\partial \mathbf{x}^T} = \left(\dfrac{\partial f}{\partial x_1}, \ldots, \dfrac{\partial f}{\partial x_n}\right)$, a row vector.
Pre-requisite: matrix differentiation
Function is a vector and the variable is a vector: $\mathbf{y}(\mathbf{x}) = (y_1(\mathbf{x}), \ldots, y_m(\mathbf{x}))^T$, $\mathbf{x} = (x_1, \ldots, x_n)^T$.
Definition: $\dfrac{\partial \mathbf{y}^T}{\partial \mathbf{x}}$ is the $n \times m$ matrix whose $(i, j)$ entry is $\dfrac{\partial y_j(\mathbf{x})}{\partial x_i}$.
Pre-requisite: matrix differentiation
Function is a vector and the variable is a vector.
In a similar way, $\dfrac{\partial \mathbf{y}}{\partial \mathbf{x}^T}$ is the $m \times n$ matrix whose $(i, j)$ entry is $\dfrac{\partial y_i(\mathbf{x})}{\partial x_j}$, i.e. the transpose of $\dfrac{\partial \mathbf{y}^T}{\partial \mathbf{x}}$.
Pre-requisite: matrix differentiation
Function is a vector and the variable is a vector. Example:
$\mathbf{y} = \begin{pmatrix} y_1(\mathbf{x}) \\ y_2(\mathbf{x}) \end{pmatrix}, \quad \mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}, \quad y_1(\mathbf{x}) = x_1^2 - x_2, \quad y_2(\mathbf{x}) = x_2 x_3$
$\dfrac{\partial \mathbf{y}^T}{\partial \mathbf{x}} = \begin{pmatrix} \frac{\partial y_1(\mathbf{x})}{\partial x_1} & \frac{\partial y_2(\mathbf{x})}{\partial x_1} \\ \frac{\partial y_1(\mathbf{x})}{\partial x_2} & \frac{\partial y_2(\mathbf{x})}{\partial x_2} \\ \frac{\partial y_1(\mathbf{x})}{\partial x_3} & \frac{\partial y_2(\mathbf{x})}{\partial x_3} \end{pmatrix} = \begin{pmatrix} 2x_1 & 0 \\ -1 & x_3 \\ 0 & x_2 \end{pmatrix}$
Pre-requisite
Useful results: for $\mathbf{a}, \mathbf{x} \in \mathbb{R}^{n \times 1}$,
$\dfrac{\partial (\mathbf{x}^T \mathbf{a})}{\partial \mathbf{x}} = \mathbf{a}, \qquad \dfrac{\partial (\mathbf{a}^T \mathbf{x})}{\partial \mathbf{x}} = \mathbf{a}$
Pre-requisite
Useful result: for $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{x} \in \mathbb{R}^{n \times 1}$,
$\dfrac{\partial (\mathbf{A}\mathbf{x})}{\partial \mathbf{x}^T} = \mathbf{A}$
Multivariate linear regression
Similarly,
$\hat{\mathbf{w}}^* = \arg\min_{\hat{\mathbf{w}}} (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$
Let $E_{\hat{\mathbf{w}}} = (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})^T (\mathbf{y} - \mathbf{X}\hat{\mathbf{w}})$. Setting its derivative to zero,
$\dfrac{\partial E_{\hat{\mathbf{w}}}}{\partial \hat{\mathbf{w}}} = 2\mathbf{X}^T(\mathbf{X}\hat{\mathbf{w}} - \mathbf{y}) = 2\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} - 2\mathbf{X}^T\mathbf{y} = 0$
$\Rightarrow \mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}^T\mathbf{y}$
Multivariate linear regression
Discussion:
If $\mathbf{X}^T\mathbf{X}$ is a full-rank matrix (equivalently, positive definite), then $\hat{\mathbf{w}}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. Letting $\hat{\mathbf{x}}_i = (\mathbf{x}_i; 1)$, the learned model is $f(\hat{\mathbf{x}}_i) = \hat{\mathbf{x}}_i^T (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
If $\mathbf{X}^T\mathbf{X}$ is not a full-rank matrix, the normal equations have multiple solutions; which one is chosen depends on the inductive bias of the learning algorithm (e.g. regularization).
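A minimal sketch of the closed-form solution on synthetic data, assuming $\mathbf{X}^T\mathbf{X}$ is invertible (for ill-conditioned or rank-deficient cases, `np.linalg.lstsq` would be the safer choice):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
A = rng.normal(size=(n, d))                       # raw attributes
true_w, true_b = np.array([1.5, -2.0, 0.5]), 0.7
y = A @ true_w + true_b + 0.01 * rng.normal(size=n)

# Augmented design matrix X = [A, 1], so w_hat = (w; b).
X = np.hstack([A, np.ones((n, 1))])

# Normal equations: (X^T X) w_hat = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
w, b = w_hat[:-1], w_hat[-1]
print(w, b)   # close to true_w and true_b
```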
Generalized linear model
$f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b \ \Leftrightarrow\ y = \mathbf{w}^T\mathbf{x} + b$
Log-linear regression: $\ln y = \mathbf{w}^T\mathbf{x} + b$
More generally, if a monotone function $g(\cdot)$ is differentiable, let $y = g^{-1}(\mathbf{w}^T\mathbf{x} + b)$. Then $y$ is a generalized linear model, and $g(\cdot)$ is called the link function.
Logistic regression
How do we perform a classification task using a linear model?
First, let's consider a binary classification task with labels from $\{0, 1\}$: $z = \mathbf{w}^T\mathbf{x} + b \in \mathbb{R}\ \rightarrow\ \{0, 1\}$.
Unit-step function:
$y = \begin{cases} 0, & z < 0; \\ 0.5, & z = 0; \\ 1, & z > 0. \end{cases}$
Logistic regression
But the unit-step function is not continuous, so it cannot be used as $g^{-1}(\cdot)$. We therefore have to find a "surrogate function".
Logistic function:
$y = \dfrac{1}{1 + e^{-z}} = \dfrac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$
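A small sketch of the logistic function; the input values are illustrative:

```python
import numpy as np

def sigmoid(z):
    # y = 1 / (1 + e^{-z}), maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))   # approx. [0.0067 0.2689 0.5 0.7311 0.9933]
```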
Logistic regression
$y = \dfrac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}} \quad\Leftrightarrow\quad \ln\dfrac{y}{1-y} = \mathbf{w}^T\mathbf{x} + b$
$y$: the probability of $\mathbf{x}$ being a positive sample
$1 - y$: the probability of $\mathbf{x}$ being a negative sample
$\dfrac{y}{1-y}$ (odds): the relative probability of $\mathbf{x}$ being a positive sample
$\ln\dfrac{y}{1-y}$: log odds (logit)
Advantages of logistic regression: it models the class probability directly, without assuming any particular distribution of the data, and the resulting optimization problem is convex and easy to solve (see below).
Logistic regression
Task: determine $\mathbf{w}$ and $b$ in $y = \dfrac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}$, i.e. $\ln\dfrac{y}{1-y} = \mathbf{w}^T\mathbf{x} + b$.
Solution:
1. Regard $y$ as $p(y = 1 \mid \mathbf{x})$. Then
$\ln\dfrac{p(y=1 \mid \mathbf{x})}{p(y=0 \mid \mathbf{x})} = \mathbf{w}^T\mathbf{x} + b$
$p(y=1 \mid \mathbf{x}) = p_1(\mathbf{x}) = \dfrac{e^{\mathbf{w}^T\mathbf{x} + b}}{1 + e^{\mathbf{w}^T\mathbf{x} + b}}, \qquad p(y=0 \mid \mathbf{x}) = p_0(\mathbf{x}) = \dfrac{1}{1 + e^{\mathbf{w}^T\mathbf{x} + b}} = 1 - p_1(\mathbf{x})$
2. Estimate $\mathbf{w}$ and $b$ using the maximum likelihood method.
Pre-requisite: maximum likelihood estimation
MLE is a method of estimating the parameters of a statistical model given observations, by finding the parameter values that maximize the likelihood of making the observations given the parameters.
Let $D_c$ be the set containing all the samples from class $c$. Suppose these samples are independent and identically distributed (i.i.d.). The likelihood of the samples in $D_c$ given a parameter $\theta_c$ is:
$P(D_c \mid \theta_c) = \prod_{\mathbf{x} \in D_c} P(\mathbf{x} \mid \theta_c)$
Pre-requisite: maximum likelihood estimation
We want to maximize $P(D_c \mid \theta_c)$. The log-likelihood $LL(\theta_c)$ is often used instead of $P(D_c \mid \theta_c)$:
$LL(\theta_c) = \log P(D_c \mid \theta_c) = \sum_{\mathbf{x} \in D_c} \log P(\mathbf{x} \mid \theta_c)$
The maximum likelihood estimate of $\theta_c$ is
$\hat{\theta}_c = \arg\max_{\theta_c} LL(\theta_c)$
Pre-requisite: maximum likelihood estimation
Example: if the probability density function is Gaussian, $p(\mathbf{x} \mid c) \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^2)$, the MLEs of $\boldsymbol{\mu}_c$ and $\boldsymbol{\sigma}_c^2$ are
$\hat{\boldsymbol{\mu}}_c = \dfrac{1}{|D_c|} \sum_{\mathbf{x} \in D_c} \mathbf{x}$
$\hat{\boldsymbol{\sigma}}_c^2 = \dfrac{1}{|D_c|} \sum_{\mathbf{x} \in D_c} (\mathbf{x} - \hat{\boldsymbol{\mu}}_c)(\mathbf{x} - \hat{\boldsymbol{\mu}}_c)^T$
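A minimal sketch of these MLE formulas on synthetic class samples; note the $1/|D_c|$ factor, i.e. the biased covariance estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
D_c = rng.normal(loc=[1.0, -2.0], scale=0.5, size=(500, 2))   # samples of class c

mu_hat = D_c.mean(axis=0)                 # (1/|D_c|) sum_x x
diff = D_c - mu_hat
sigma_hat = diff.T @ diff / len(D_c)      # (1/|D_c|) sum_x (x - mu)(x - mu)^T
print(mu_hat)      # close to [1.0, -2.0]
print(sigma_hat)   # close to 0.25 * identity
```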
Logistic regression
2. Estimate $\mathbf{w}$ and $b$ using the maximum likelihood method.
Let $\boldsymbol{\beta} = (\mathbf{w}; b)$ and $\hat{\mathbf{x}} = (\mathbf{x}; 1)$, so that $\mathbf{w}^T\mathbf{x} + b = \boldsymbol{\beta}^T\hat{\mathbf{x}}$, and
$p_1(\hat{\mathbf{x}}_i) = \dfrac{e^{\boldsymbol{\beta}^T\hat{\mathbf{x}}_i}}{1 + e^{\boldsymbol{\beta}^T\hat{\mathbf{x}}_i}}, \qquad p_0(\hat{\mathbf{x}}_i) = \dfrac{1}{1 + e^{\boldsymbol{\beta}^T\hat{\mathbf{x}}_i}}$
Given a training set $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, the log-likelihood is then
$\ell(\mathbf{w}, b) = \ln \prod_{i=1}^{n} p_1(\hat{\mathbf{x}}_i \mid \mathbf{w}, b)^{y_i}\, p_0(\hat{\mathbf{x}}_i \mid \mathbf{w}, b)^{1 - y_i} = \sum_{i=1}^{n} \left[ y_i \ln p_1(\hat{\mathbf{x}}_i \mid \mathbf{w}, b) + (1 - y_i) \ln p_0(\hat{\mathbf{x}}_i \mid \mathbf{w}, b) \right]$
Logistic regression
Then
$\ell(\mathbf{w}, b) = \ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i \boldsymbol{\beta}^T\hat{\mathbf{x}}_i - \ln\left(1 + e^{\boldsymbol{\beta}^T\hat{\mathbf{x}}_i}\right) \right)$
$-\ell(\boldsymbol{\beta})$ is a continuous convex function and has higher-order derivatives, so the target function of the optimization can be written as
$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \left(-\ell(\boldsymbol{\beta})\right)$
The minimum of the target function can be computed using the gradient descent method or Newton's method.
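A minimal gradient-descent sketch for this objective (maximizing $\ell(\boldsymbol{\beta})$ by minimizing $-\ell(\boldsymbol{\beta})$); the data, learning rate and iteration count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 2-D binary data: two Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X0 = rng.normal([-1.0, -1.0], 1.0, size=(100, 2))
X1 = rng.normal([+1.0, +1.0], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.hstack([np.zeros(100), np.ones(100)])

X_hat = np.hstack([X, np.ones((len(X), 1))])   # x_hat = (x; 1)
beta = np.zeros(X_hat.shape[1])                # beta = (w; b)

lr = 0.1
for _ in range(2000):
    p1 = sigmoid(X_hat @ beta)                 # p(y = 1 | x_hat; beta)
    grad = X_hat.T @ (p1 - y)                  # gradient of -l(beta)
    beta -= lr * grad / len(y)                 # gradient descent step
w, b = beta[:-1], beta[-1]
print(w, b)
```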
Pre-requisite: Newton's method
Newton's method was first published in 1685 in A Treatise of Algebra both Historical and Practical by John Wallis. In 1690, Joseph Raphson published a simplified description in Analysis aequationum universalis.
Pre-requisite: Newton's method
Consider $\min f(\mathbf{x})$. A second-order Taylor series expansion around $\bar{\mathbf{x}}$ gives
$f(\mathbf{x}) \approx h(\mathbf{x}) = f(\bar{\mathbf{x}}) + \nabla f(\bar{\mathbf{x}})^T (\mathbf{x} - \bar{\mathbf{x}}) + \frac{1}{2} (\mathbf{x} - \bar{\mathbf{x}})^T \nabla^2 f(\bar{\mathbf{x}}) (\mathbf{x} - \bar{\mathbf{x}})$
Instead of $\min f(\mathbf{x})$, solve $\min h(\mathbf{x})$, i.e. $\nabla h(\mathbf{x}) = 0$:
$\nabla f(\bar{\mathbf{x}}) + \nabla^2 f(\bar{\mathbf{x}}) (\mathbf{x} - \bar{\mathbf{x}}) = 0 \ \Rightarrow\ \mathbf{x} - \bar{\mathbf{x}} = -\left[\nabla^2 f(\bar{\mathbf{x}})\right]^{-1} \nabla f(\bar{\mathbf{x}})$
The direction $\mathbf{d} = -\left[\nabla^2 f(\bar{\mathbf{x}})\right]^{-1} \nabla f(\bar{\mathbf{x}})$ is the Newton direction.
Pre-requisite: Newton's method
The algorithm:
Step 0: Given $\mathbf{x}^0$, set $k := 0$.
Step 1: Compute $\mathbf{d}^k = -\left[\nabla^2 f(\mathbf{x}^k)\right]^{-1} \nabla f(\mathbf{x}^k)$; if $\|\mathbf{d}^k\| \le \epsilon$, then stop.
Step 2: Set $\mathbf{x}^{k+1} := \mathbf{x}^k + \mathbf{d}^k$, $k := k + 1$, go to Step 1.
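A minimal sketch of the algorithm on a simple smooth convex function; the function, starting point and tolerance are illustrative:

```python
import numpy as np

# Illustrative smooth convex function f(x) = x1^2 + x2^2 + exp(x1).
def grad_f(x):
    return np.array([2 * x[0] + np.exp(x[0]), 2 * x[1]])

def hess_f(x):
    return np.array([[2 + np.exp(x[0]), 0.0],
                     [0.0,              2.0]])

x = np.array([2.0, 3.0])   # Step 0: given x0, set k := 0
eps = 1e-10
for k in range(50):
    d = -np.linalg.solve(hess_f(x), grad_f(x))   # Step 1: Newton direction d_k
    if np.linalg.norm(d) <= eps:                 # stop when ||d_k|| <= eps
        break
    x = x + d                                    # Step 2: x_{k+1} = x_k + d_k
print(x)   # approx. [-0.3517, 0.0], where the gradient vanishes
```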
Logistic regression
Equivalently to the formulation above, define the negative log-likelihood
$\ell(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left( -y_i \boldsymbol{\beta}^T\hat{\mathbf{x}}_i + \ln\left(1 + e^{\boldsymbol{\beta}^T\hat{\mathbf{x}}_i}\right) \right)$
$\ell(\boldsymbol{\beta})$ is a continuous convex function and has higher-order derivatives, so the target function of the optimization can be written as
$\boldsymbol{\beta}^* = \arg\min_{\boldsymbol{\beta}} \ell(\boldsymbol{\beta})$
The minimum of the target function can be computed using the gradient descent method or Newton's method. Using Newton's method, the update in iteration $t+1$ is
$\boldsymbol{\beta}^{t+1} = \boldsymbol{\beta}^t - \left( \dfrac{\partial^2 \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}^T} \right)^{-1} \dfrac{\partial \ell(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}}$
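A minimal sketch of this Newton update, using the standard first- and second-order derivatives of $\ell(\boldsymbol{\beta})$, namely $\hat{\mathbf{X}}^T(\mathbf{p}_1 - \mathbf{y})$ and $\hat{\mathbf{X}}^T \mathrm{diag}\!\left(p_1(1-p_1)\right) \hat{\mathbf{X}}$; the data are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative binary data (same construction as the gradient-descent sketch).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.hstack([np.zeros(100), np.ones(100)])
X_hat = np.hstack([X, np.ones((len(X), 1))])

beta = np.zeros(X_hat.shape[1])
for t in range(10):                         # a few Newton iterations usually suffice
    p1 = sigmoid(X_hat @ beta)              # p(y = 1 | x_hat; beta)
    grad = X_hat.T @ (p1 - y)               # dl/dbeta
    W = p1 * (1.0 - p1)                     # diagonal weights of the Hessian
    H = X_hat.T @ (W[:, None] * X_hat)      # d^2 l / (dbeta dbeta^T)
    beta = beta - np.linalg.solve(H, grad)  # beta_{t+1} = beta_t - H^{-1} grad
print(beta)
```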
Logistic regression
The first- and second-order derivatives of $\ell(\boldsymbol{\beta})$ are given in the book.
Assignment 1: Implement a logistic regression model using MATLAB (or R, Python, or any language you are familiar with). You can use any dataset in the UCI repository to validate your model. Plot a figure like the one on the slide.
Pre-requisite: Lagrange multipliers
The Lagrange multiplier method is a strategy for finding the local extrema of a function subject to equality constraints.
Problem: $\max f(\mathbf{x})$ (or $\min -f(\mathbf{x})$), $\mathbf{x} \in \mathbb{R}^{n \times 1}$, s.t. $g_i(\mathbf{x}) = 0$, $i = 1, \ldots, m$.
Pre-requisite: Lagrange multipliers
Solution: construct
$F(\mathbf{x}; \lambda_1, \ldots, \lambda_m) = f(\mathbf{x}) + \sum_{i=1}^{m} \lambda_i g_i(\mathbf{x})$
If $(\mathbf{x}_0, \lambda_{10}, \lambda_{20}, \ldots, \lambda_{m0})$ is a stationary point of $F$, then $\mathbf{x}_0$ is a stationary point of $f(\mathbf{x})$ subject to the constraints.
$(\mathbf{x}_0, \lambda_{10}, \lambda_{20}, \ldots, \lambda_{m0})$ is a stationary point of $F$ when
$\dfrac{\partial F}{\partial x_1} = 0,\ \dfrac{\partial F}{\partial x_2} = 0,\ \ldots,\ \dfrac{\partial F}{\partial x_n} = 0,\ \dfrac{\partial F}{\partial \lambda_1} = 0,\ \ldots,\ \dfrac{\partial F}{\partial \lambda_m} = 0$
That is $n + m$ equations!
Pre-requisite: Lagrange multipliers
Example. Problem: for a given point $P_0 = (1, 0)$, among all the points lying on the line $y = x$, identify the one having the least distance to $P_0$.
The squared distance is $d(x, y) = (x - 1)^2 + (y - 0)^2$.
We want to find the stationary point of $d(x, y)$ under the constraint $g(x, y) = y - x = 0$.
Following the Lagrange multiplier method, construct
$F(x, y, \lambda) = d(x, y) + \lambda g(x, y) = (x - 1)^2 + (y - 0)^2 + \lambda (y - x)$
Pre-requisite: Lagrange multipliers
Example (continued). Find the stationary point of $F(x, y, \lambda)$:
$\dfrac{\partial F}{\partial x} = 2(x - 1) - \lambda = 0, \qquad \dfrac{\partial F}{\partial y} = 2y + \lambda = 0, \qquad \dfrac{\partial F}{\partial \lambda} = y - x = 0$
$\Rightarrow\ x = 0.5, \quad y = 0.5, \quad \lambda = -1$
So the point on the line $y = x$ closest to $P_0$ is $(0.5, 0.5)$.
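A small sketch that checks the stationary-point conditions above by solving them as a linear system in $(x, y, \lambda)$; the signs follow $F(x, y, \lambda) = (x-1)^2 + y^2 + \lambda(y - x)$:

```python
import numpy as np

# Stationary-point conditions of F(x, y, lam) = (x - 1)^2 + y^2 + lam * (y - x):
#   dF/dx   = 2(x - 1) - lam = 0
#   dF/dy   = 2y + lam       = 0
#   dF/dlam = y - x          = 0
A = np.array([[ 2.0, 0.0, -1.0],
              [ 0.0, 2.0,  1.0],
              [-1.0, 1.0,  0.0]])
b = np.array([2.0, 0.0, 0.0])
x, y, lam = np.linalg.solve(A, b)
print(x, y, lam)   # 0.5 0.5 -1.0  -> closest point on y = x is (0.5, 0.5)
```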
Linear discriminant analysis
In a two-class classification problem, we are given $n$ samples in a $d$-dimensional feature space, with $n_1$ samples belonging to class 1 and $n_2$ samples belonging to class 2.
Goal: find a vector $\mathbf{w}$ and project the $n$ samples onto the axis $y = \mathbf{w}^T\mathbf{x}$, so that the projected samples are well separated (the slide's figure compares two candidate directions $y = \mathbf{w}_1^T\mathbf{x}$ and $y = \mathbf{w}_2^T\mathbf{x}$).
Linear discriminant analysis
Given a dataset $D = \{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $y_i \in \{0, 1\}$, denote by $X_i$, $\boldsymbol{\mu}_i$, $\boldsymbol{\Sigma}_i$ the samples, the sample mean vector and the covariance matrix of class $i$, respectively.
The sample mean of the projected points of the $i$-th class is
$m_i = \dfrac{1}{n_i} \sum_{\mathbf{x} \in X_i} \mathbf{w}^T\mathbf{x} = \mathbf{w}^T\boldsymbol{\mu}_i$
The variance of the projected points of the $i$-th class is
$s_i^2 = \sum_{\mathbf{x} \in X_i} \left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu}_i\right)\left(\mathbf{w}^T\mathbf{x} - \mathbf{w}^T\boldsymbol{\mu}_i\right)^T = \mathbf{w}^T\boldsymbol{\Sigma}_i\mathbf{w}$
Linear discriminant analysis
The between-class scatter matrix is:
$\mathbf{S}_b = (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T$
The within-class scatter matrix is:
$\mathbf{S}_w = \boldsymbol{\Sigma}_0 + \boldsymbol{\Sigma}_1$
Fisher linear discriminant analysis chooses the $\mathbf{w}$ which maximizes
$J(\mathbf{w}) = \dfrac{\left\|\mathbf{w}^T\boldsymbol{\mu}_0 - \mathbf{w}^T\boldsymbol{\mu}_1\right\|_2^2}{\mathbf{w}^T\boldsymbol{\Sigma}_0\mathbf{w} + \mathbf{w}^T\boldsymbol{\Sigma}_1\mathbf{w}} = \dfrac{\mathbf{w}^T\mathbf{S}_b\mathbf{w}}{\mathbf{w}^T\mathbf{S}_w\mathbf{w}}$
i.e. the between-class distance should be as large as possible, while the within-class scatter should be as small as possible.
Linear discriminant analysis
Without loss of generality, let $\mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$. The optimization problem can then be rewritten as:
$\min_{\mathbf{w}} -\mathbf{w}^T\mathbf{S}_b\mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T\mathbf{S}_w\mathbf{w} = 1$
Using the Lagrange multiplier method, we look for a stationary point of the function
$F(\mathbf{w}) = -\mathbf{w}^T\mathbf{S}_b\mathbf{w} + \lambda\left(\mathbf{w}^T\mathbf{S}_w\mathbf{w} - 1\right)$
Setting $\dfrac{\partial F}{\partial \mathbf{w}} = 0$ gives
$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w}$
Linear discriminant analysis
Since $(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)^T\mathbf{w}$ is a scalar, $\mathbf{S}_b\mathbf{w}$ always points in the direction of $\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1$, so let $\mathbf{S}_b\mathbf{w} = \lambda(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$. Then
$\mathbf{S}_b\mathbf{w} = \lambda\mathbf{S}_w\mathbf{w} \ \Rightarrow\ \lambda(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1) = \lambda\mathbf{S}_w\mathbf{w} \ \Rightarrow\ \mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$
(Only the direction of $\mathbf{w}$ matters, so the scale can be dropped.)
In practice, we compute the singular value decomposition of $\mathbf{S}_w$, i.e. $\mathbf{S}_w = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$. Then $\mathbf{S}_w^{-1} = \mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{U}^T$.
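A minimal sketch of Fisher LDA as derived above, $\mathbf{w} = \mathbf{S}_w^{-1}(\boldsymbol{\mu}_0 - \boldsymbol{\mu}_1)$, on illustrative two-class data; a direct solve is used here, with the SVD route reserved for an ill-conditioned $\mathbf{S}_w$:

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0.0, 0.0], 1.0, size=(80, 2))   # class 0 samples
X1 = rng.normal([3.0, 2.0], 1.0, size=(80, 2))   # class 1 samples

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = (X0 - mu0).T @ (X0 - mu0)                   # within-class scatter of class 0
S1 = (X1 - mu1).T @ (X1 - mu1)                   # within-class scatter of class 1
S_w = S0 + S1

w = np.linalg.solve(S_w, mu0 - mu1)              # w = S_w^{-1} (mu0 - mu1)
w /= np.linalg.norm(w)                           # only the direction matters

# Projected class means should be well separated along w.
print(w, (X0 @ w).mean(), (X1 @ w).mean())
```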
Multiclass classification
Binary classification → multiclass classification:
One vs. One (OvO): train a classifier $f_{ij}$ for each pair of classes $C_i$ and $C_j$, $N(N-1)/2$ classifiers in total; a test sample is assigned to the class that receives the most votes.
One vs. Rest (OvR): train $N$ classifiers, one per class, each treating that class as positive ("+") and all remaining classes as negative ("-"); a code sketch follows below.
(The slide's figure illustrates both schemes with classes $C_1, \ldots, C_4$ and classifiers $f_1, \ldots, f_6$.)
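A compact sketch of One vs. Rest, with a small gradient-descent logistic regression (as in the earlier sketch) as the base binary classifier; the data and hyper-parameters are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X_hat, y, lr=0.1, iters=2000):
    # Simple gradient-descent logistic regression on labels in {0, 1}.
    beta = np.zeros(X_hat.shape[1])
    for _ in range(iters):
        beta -= lr * X_hat.T @ (sigmoid(X_hat @ beta) - y) / len(y)
    return beta

# Illustrative 3-class data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(c, 0.7, size=(60, 2)) for c in centers])
y = np.repeat([0, 1, 2], 60)
X_hat = np.hstack([X, np.ones((len(X), 1))])

# One vs. Rest: one binary classifier per class (class c vs. all other classes).
betas = np.array([fit_logistic(X_hat, (y == c).astype(float)) for c in range(3)])

# Predict the class whose classifier gives the highest score.
pred = np.argmax(X_hat @ betas.T, axis=1)
print((pred == y).mean())   # training accuracy, should be high on these blobs
```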
Multiclass classification
Binary classification → multiclass classification:
Many vs. Many (MvM): error-correcting output codes (ECOC).
Encode: split the $N$ classes several times; each split assigns $+1$ or $-1$ to every class (in the slide's table, classes $C_1, \ldots, C_4$ and one column per classifier $f_1, \ldots, f_5$) and trains a binary classifier.
Decode: the classifiers' predictions on a test sample form a code word, which is compared with each class's code word; the class with the smallest Hamming distance or Euclidean distance is predicted.
Class imbalance
In the previous problems, we assumed that the numbers of samples from different classes are about the same. However, if the proportions of samples from different classes differ greatly, the learning process will be affected, e.g. 998 negatives vs. 2 positives.
Consider class imbalance when using logistic regression to perform a classification task, where $\ln\dfrac{y}{1-y} = \mathbf{w}^T\mathbf{x} + b$.
When the numbers of positives and negatives are the same:
$y > 0.5$ (or $\dfrac{y}{1-y} > 1$): positive; $\quad y < 0.5$ (or $\dfrac{y}{1-y} < 1$): negative.
Class imbalance
However, when the numbers of positives and negatives are not equal, let $m^+$ be the number of positives and $m^-$ the number of negatives. The observed odds of a positive are $\dfrac{m^+}{m^-}$.
Therefore, the classification criterion becomes:
$\dfrac{y}{1-y} > \dfrac{m^+}{m^-}$: positive; $\quad \dfrac{y}{1-y} < \dfrac{m^+}{m^-}$: negative.
Rescaling:
$\dfrac{y'}{1-y'} = \dfrac{y}{1-y} \times \dfrac{m^-}{m^+}$
so that the rescaled odds can again be compared against 1.
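A small sketch of this rescaled decision rule (threshold-moving); the class counts and predicted probabilities are illustrative:

```python
import numpy as np

m_pos, m_neg = 2, 998                     # e.g. 2 positives vs. 998 negatives
y_hat = np.array([0.30, 0.01, 0.001])     # predicted p(y=1|x) for three test samples

odds = y_hat / (1.0 - y_hat)
# Standard rule (balanced classes): positive iff odds > 1.
print(odds > 1.0)                         # [False False False]
# Rescaled rule: positive iff odds > m_pos / m_neg.
print(odds > m_pos / m_neg)               # [ True  True False]
```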
Class imbalance
Other common strategies: undersampling, oversampling, and threshold-moving.