
1 Linear Model
Ying Shen, SSE, Tongji University, Sep. 2016

2 The basic form of the linear model
Given a sample x = (x_1, x_2, …, x_d)^T with d attributes, the linear model tries to learn a prediction function using a linear combination of all the attributes, i.e.
f(x) = w_1 x_1 + w_2 x_2 + … + w_d x_d + b
The vector form of the function is f(x) = w^T x + b, where w = (w_1, w_2, …, w_d)^T.
Once w and b have been learned from samples, f is determined.
For example: f(good melon) = 0.2 · x_color + 0.5 · x_root + 0.3 · x_knock + 1
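As a minimal sketch (not from the slides), the model is just a dot product plus a bias; the attribute values below are made-up numbers for the melon example:

import numpy as np

# Weights from the slide example (color, root, knock sound) and an assumed bias
w = np.array([0.2, 0.5, 0.3])
b = 1.0
# Hypothetical attribute values for one sample
x = np.array([0.7, 0.4, 0.6])

f_x = w @ x + b   # f(x) = w^T x + b
print(f_x)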

3 Linear regression
Given a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where x_i = (x_{i1}, x_{i2}, …, x_{id})^T, the task of linear regression is to learn a linear model that predicts, for a new sample x', a value close to its true value y'.
When d = 1, each x_i is a single scalar x_i. Example data (hours spent studying vs. math SAT score):
Hours spent studying: 4, 9, 10, 14, 7, 12, 22, 1, 3, 8
Math SAT score: 390, 580, 650, 730, 410, 530, 600, 790, 350, 400, 590

4 Linear regression
We will learn a linear regression model f(x_i) = w x_i + b, such that f(x_i) ≈ y_i.
How do we determine w and b?

5 Linear regression
Mean squared error (MSE) is a commonly used performance measure:
MSE = (1/m) Σ_{i=1}^m (y_i' − y_i)^2
We want to minimize the MSE between f(x_i) and y_i:
(w*, b*) = arg min_{(w,b)} Σ_{i=1}^m (f(x_i) − y_i)^2 = arg min_{(w,b)} Σ_{i=1}^m (y_i − w x_i − b)^2

6 Linear regression
The method of determining the fitted model by minimizing the MSE is called the least square method.
In the linear regression problem, the least square method aims to find a line such that the sum of squared distances from all the samples to it is smallest.

7 Pre-requisite
A stationary point of a differentiable function of one variable is a point of the domain of the function where the derivative is zero.
Single-variable function: f(x) is differentiable in (a, b). At a stationary point x_0: df/dx |_{x_0} = 0
Two-variable function: f(x, y) is differentiable in its domain. At a stationary point (x_0, y_0): ∂f/∂x |_{(x_0, y_0)} = 0 and ∂f/∂y |_{(x_0, y_0)} = 0

8 Pre-requisite
In the general case, if x_0 is a stationary point of f(x), x ∈ ℝ^{n×1}, then
∂f/∂x_1 |_{x_0} = 0, ∂f/∂x_2 |_{x_0} = 0, …, ∂f/∂x_n |_{x_0} = 0
Proposition: Let f be a differentiable function of n variables defined on the convex set S, and let x_0 be in the interior of S. If f is convex, then x_0 is a global minimizer of f in S if and only if it is a stationary point of f (i.e. ∂f/∂x_i |_{x_0} = 0 for i = 1, …, n).

9 Parameter estimation
The function E(w, b) = Σ_{i=1}^m (y_i − w x_i − b)^2 is a convex function.
The extremum is therefore achieved at the stationary point, i.e. where ∂E/∂w = 0 and ∂E/∂b = 0, with
∂E/∂w = 2 ( w Σ_{i=1}^m x_i^2 − Σ_{i=1}^m (y_i − b) x_i )
∂E/∂b = 2 ( m b − Σ_{i=1}^m (y_i − w x_i) )

10 Parameter estimation
Solving these equations gives closed-form expressions for w and b:
w = Σ_{i=1}^m y_i (x_i − x̄) / ( Σ_{i=1}^m x_i^2 − (1/m) (Σ_{i=1}^m x_i)^2 )
b = (1/m) Σ_{i=1}^m (y_i − w x_i) = ȳ − w x̄
where x̄ = (1/m) Σ_{i=1}^m x_i and ȳ = (1/m) Σ_{i=1}^m y_i are the means of x and y.
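A minimal sketch of these closed-form formulas in Python; the study-time pairs below are illustrative stand-ins, not the exact slide data:

import numpy as np

x = np.array([4, 9, 10, 14, 7, 12, 22, 1, 3, 8], dtype=float)
y = np.array([390, 580, 650, 730, 530, 600, 790, 350, 400, 590], dtype=float)

m = len(x)
x_bar, y_bar = x.mean(), y.mean()

# w = sum_i y_i (x_i - x_bar) / ( sum_i x_i^2 - (1/m)(sum_i x_i)^2 )
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
b = y_bar - w * x_bar   # b = y_bar - w * x_bar
print(w, b)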

11 Multivariate linear regression
In the general case, given a dataset D with d ≥ 1, we try to learn a model f(x_i) = w^T x_i + b, such that f(x_i) ≈ y_i.
We can also use the least square method to estimate w and b.
First, denote ŵ = (w; b), let X be the m × (d+1) matrix whose i-th row is (x_{i1}, x_{i2}, …, x_{id}, 1) = (x_i^T, 1), and let y = (y_1, y_2, …, y_m)^T.
Then X ŵ = ( x_1^T w + b, x_2^T w + b, …, x_m^T w + b )^T.

12 Pre-requisite Matrix differentiation
Function is a vector and the variable is a scalar: f(t) = ( f_1(t), f_2(t), …, f_n(t) )^T
Definition: df/dt = ( df_1(t)/dt, df_2(t)/dt, …, df_n(t)/dt )^T

13 Pre-requisite Matrix differentiation
Function is a matrix and the variable is a scalar: F(t) = [ f_ij(t) ]
Definition (entry-wise, as in the vector case): ( dF/dt )_{ij} = d f_ij(t) / dt

14 Pre-requisite Matrix differentiation
Function is a scalar and the variable is a vector: f(x), x = ( x_1, x_2, …, x_n )^T
Definition: df/dx = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n )^T
In a similar way, df/dx^T = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ), a row vector.

15 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector: y(x) = ( y_1(x), …, y_m(x) )^T, x = ( x_1, …, x_n )^T
Definition: dy^T/dx is the n × m matrix with entries ( dy^T/dx )_{ij} = ∂y_j/∂x_i

16 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector.
In a similar way, dy/dx^T is the m × n matrix with entries ( dy/dx^T )_{ij} = ∂y_i/∂x_j, i.e. ( dy^T/dx )^T.

17 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector. Example:
y = ( y_1(x), y_2(x) )^T, x = ( x_1, x_2, x_3 )^T, with y_1(x) = x_1^2 − x_2 and y_2(x) = x_2 x_3
dy^T/dx = [ ∂y_1/∂x_1  ∂y_2/∂x_1
            ∂y_1/∂x_2  ∂y_2/∂x_2
            ∂y_1/∂x_3  ∂y_2/∂x_3 ] = [ 2x_1  0
                                       −1    x_3
                                       0     x_2 ]

18 Pre-requisite Useful results
Let x, a ∈ ℝ^{n×1}. Then
d(a^T x)/dx = a,   d(x^T a)/dx = a

19 Pre-requisite Useful results
Let A ∈ ℝ^{m×n} and x ∈ ℝ^{n×1}. Then d(Ax)/dx^T = A.
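A quick numerical sanity check of the two identities above (not from the slides), using central finite differences on arbitrary a, A and x:

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
a, A, x = rng.normal(size=n), rng.normal(size=(m, n)), rng.normal(size=n)
eps = 1e-6

# Gradient of the scalar a^T x with respect to x: should equal a
grad = np.array([(a @ (x + eps * e) - a @ (x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(grad, a))   # True

# Jacobian of the vector A x with respect to x^T: should equal A
jac = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(jac, A))    # True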

20 Multivariate linear regression
Similarly, ŵ* = arg min_ŵ (y − X ŵ)^T (y − X ŵ). Denote E_ŵ = (y − X ŵ)^T (y − X ŵ). Then
∂E_ŵ/∂ŵ = 2 X^T (X ŵ − y) = 2 X^T X ŵ − 2 X^T y = 0
⇒ X^T X ŵ = X^T y

21 Multivariate linear regression
Discussion
If X^T X is a full-rank (equivalently, positive definite) matrix, then X^T X ŵ = X^T y gives
ŵ* = (X^T X)^{-1} X^T y
Letting x̂_i = (x_i; 1), the learned model is f(x̂_i) = x̂_i^T (X^T X)^{-1} X^T y.
If X^T X is not full-rank, the equation has multiple solutions → an inductive bias (e.g. regularization) is needed to pick one.
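A short sketch of the normal equations on synthetic data (the data-generating weights are assumptions for illustration); np.linalg.lstsq covers the rank-deficient case:

import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X_raw = rng.normal(size=(m, d))
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 0.7 + 0.01 * rng.normal(size=m)

X = np.hstack([X_raw, np.ones((m, 1))])   # append the constant-1 column for b

# Full-rank case: solve X^T X w_hat = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # approximately [1.5, -2.0, 0.5, 0.7]

# Rank-deficient or ill-conditioned case: minimum-norm least-squares solution
w_hat_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat_ls)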

22 Generalized linear model
f(x) = w^T x + b  →  y = w^T x + b
Log-linear regression: ln y = w^T x + b
More generally, if a monotone function g(·) is differentiable, let y = g^{-1}( w^T x + b ).
Then y is a generalized linear model, and g(·) is called the link function.

23 Logistic regression
How do we perform a classification task using a linear model?
First, consider a binary classification task with labels from {0, 1}. We need to map z = w^T x + b ∈ ℝ to {0, 1}.
Unit-step function: y = 0 if z < 0; y = 0.5 if z = 0; y = 1 if z > 0.

24 Logistic regression
But the unit-step function is not continuous, so it cannot be used as g^{-1}(·).
We therefore need a "surrogate function": the logistic function
y = 1 / (1 + e^{−z}) = 1 / (1 + e^{−(w^T x + b)})
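A small sketch of the logistic surrogate in Python; the weights, bias, and sample points are arbitrary values chosen for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[0.2, 0.1], [1.0, 3.0], [-2.0, 0.0]])

z = X @ w + b      # z = w^T x + b for each row of X
y = sigmoid(z)     # smooth values in (0, 1) instead of the hard unit step
print(y)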

25 Logistic regression
y = 1 / (1 + e^{−(w^T x + b)})  ⟺  ln( y / (1 − y) ) = w^T x + b
y: the probability of x being a positive sample
1 − y: the probability of x being a negative sample
y / (1 − y) (the odds): the relative likelihood of x being a positive sample
ln( y / (1 − y) ): the log odds, or logit
Advantages of logistic regression

26 Logistic regression
Task: determine w and b in y = 1 / (1 + e^{−(w^T x + b)}), i.e. ln( y / (1 − y) ) = w^T x + b.
Solution:
1. Interpret y as p( y = 1 | x )
2. Estimate w and b using the maximum likelihood method
ln( p(y = 1 | x) / p(y = 0 | x) ) = w^T x + b
p( y = 1 | x ) = p_1(x) = e^{w^T x + b} / (1 + e^{w^T x + b})
p( y = 0 | x ) = p_0(x) = 1 / (1 + e^{w^T x + b}) = 1 − p_1(x)

27 Pre-requisite: maximum likelihood estimation
MLE is a method of estimating the parameters of a statistical model from observations, by finding the parameter values that maximize the likelihood of the observations given the parameters.
Let D_c be the set containing all the samples from class c. Supposing these samples are independent and identically distributed (i.i.d.), the likelihood of the samples in D_c given a parameter θ_c is:
P( D_c | θ_c ) = ∏_{x ∈ D_c} P( x | θ_c )

28 Pre-requisite: maximum likelihood estimation
We want to maximize P( D_c | θ_c ). The log-likelihood LL(θ_c) is often used instead of P( D_c | θ_c ):
LL(θ_c) = log P( D_c | θ_c ) = Σ_{x ∈ D_c} log P( x | θ_c )
The maximum likelihood estimate of θ_c is
θ̂_c = arg max_{θ_c} LL(θ_c)

29 Pre-requisite: maximum likelihood estimation
Example: if the probability density function p( x | c ) ~ N( μ_c, σ_c^2 ), the maximum likelihood estimates of μ_c and σ_c^2 are
μ̂_c = (1 / |D_c|) Σ_{x ∈ D_c} x
σ̂_c^2 = (1 / |D_c|) Σ_{x ∈ D_c} ( x − μ̂_c )( x − μ̂_c )^T
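A short sketch of these Gaussian MLE formulas on synthetic class-c samples (the data below are assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
D_c = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=500)

mu_c = D_c.mean(axis=0)                 # (1/|D_c|) sum_x x
diff = D_c - mu_c
Sigma_c = diff.T @ diff / len(D_c)      # MLE covariance: divides by |D_c|, not |D_c| - 1

print(mu_c)
print(Sigma_c)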

30 Logistic regression
2. Estimate w and b using the maximum likelihood method.
Given a training set { (x_i, y_i) }_{i=1}^m, the log-likelihood is
ℓ(w, b) = ln ∏_{i=1}^m p_1( x_i | w, b )^{y_i} p_0( x_i | w, b )^{1 − y_i}
        = Σ_{i=1}^m [ y_i ln p_1( x_i | w, b ) + (1 − y_i) ln p_0( x_i | w, b ) ]
Let β = (w; b) and x̂ = (x; 1); then w^T x + b = β^T x̂ and
p_1( x̂_i ) = e^{β^T x̂_i} / (1 + e^{β^T x̂_i}),   p_0( x̂_i ) = 1 / (1 + e^{β^T x̂_i})

31 Logistic regression
Then the log-likelihood becomes
ℓ(w, b) = ℓ(β) = Σ_{i=1}^m ( y_i β^T x̂_i − ln( 1 + e^{β^T x̂_i} ) )
−ℓ(β) is a continuous convex function with derivatives of all orders, so the optimization target can be written as
β* = arg min_β −ℓ(β)
The minimum of the target function can be found with the gradient descent method or Newton's method.
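A minimal gradient-descent sketch for arg min_β −ℓ(β); the synthetic data, step size, and iteration count are assumptions chosen for illustration (the gradient of −ℓ is X̂^T (p_1 − y)):

import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 2
X = rng.normal(size=(m, d))
y = (X @ np.array([2.0, -3.0]) + 0.5 > 0).astype(float)   # labels in {0, 1}

X_hat = np.hstack([X, np.ones((m, 1))])   # x_hat = (x; 1), beta = (w; b)
beta = np.zeros(d + 1)
lr = 0.1

for _ in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))   # p(y = 1 | x_hat; beta)
    grad = X_hat.T @ (p1 - y)                     # gradient of -l(beta)
    beta -= lr * grad / m                         # averaged gradient step
print(beta)   # roughly proportional to [2, -3, 0.5]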

32 Pre-requisite: Newton's method
Newton's method was first published in 1685 in A Treatise of Algebra both Historical and Practical by John Wallis. In 1690, Joseph Raphson published a simplified description in Analysis aequationum universalis.

33 Pre-requisite: Newton's method
Consider min f(x). Take the Taylor series expansion around x̄:
f(x) ≈ g(x) = f(x̄) + ∇f(x̄)^T (x − x̄) + (1/2) (x − x̄)^T ∇^2 f(x̄) (x − x̄)
Instead of solving min f(x), solve min g(x), i.e. set ∇g(x) = 0:
∇f(x̄) + ∇^2 f(x̄) (x − x̄) = 0  ⇒  x − x̄ = −[ ∇^2 f(x̄) ]^{-1} ∇f(x̄)
The direction d = −[ ∇^2 f(x̄) ]^{-1} ∇f(x̄) is the Newton direction.

34 Pre-requisite: Newton's method
The algorithm
Step 0: Given x_0, set k := 0.
Step 1: Compute d_k = −[ ∇^2 f(x_k) ]^{-1} ∇f(x_k); if ‖d_k‖ ≤ ε, stop.
Step 2: Set x_{k+1} ← x_k + d_k, k ← k + 1, and go to Step 1.
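A small sketch of this iteration on an assumed two-variable test function f(x) = (x_1 − 1)^2 + 2(x_2 + 3)^2 (not from the slides); because f is quadratic, one Newton step reaches the minimizer:

import numpy as np

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

def hess_f(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x = np.array([10.0, 10.0])   # Step 0: starting point x_0
eps = 1e-8
for k in range(100):
    d = -np.linalg.solve(hess_f(x), grad_f(x))   # Step 1: Newton direction d_k
    if np.linalg.norm(d) <= eps:                 # stop when ||d_k|| <= eps
        break
    x = x + d                                    # Step 2: x_{k+1} = x_k + d_k
print(x)   # -> [1, -3]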

35 Logistic regression
Recall ℓ(w, b) = ℓ(β) = Σ_{i=1}^m ( y_i β^T x̂_i − ln( 1 + e^{β^T x̂_i} ) ).
−ℓ(β) is a continuous convex function with derivatives of all orders, so the optimization target is
β* = arg min_β −ℓ(β)
The minimum can be found with the gradient descent method or Newton's method.
Using Newton's method, the update at iteration t + 1 is
β^{t+1} = β^t − ( ∂^2 (−ℓ(β)) / ∂β ∂β^T )^{-1} ∂ (−ℓ(β)) / ∂β

36 Logistic regression
The first- and second-order derivatives of ℓ(β) are given in the book.
Assignment 1: Implement a logistic regression model using Matlab (or R, Python, or any language you are familiar with). You can use any dataset from the UCI repository to validate your model, and plot a figure like the one shown on the slide.

37 Pre-requisite: Lagrange multiplier
The Lagrange multiplier method is a strategy for finding the local extrema of a function subject to equality constraints.
Problem: max f(x) (or min −f(x)), x ∈ ℝ^{n×1}, s.t. g_k(x) = 0, k = 1, …, m

38 Pre-requisite: Lagrange multiplier
Solution: construct F( x; λ_1, …, λ_m ) = f(x) + Σ_{k=1}^m λ_k g_k(x).
If ( x_0, λ_10, λ_20, …, λ_m0 ) is a stationary point of F, then x_0 is a stationary point of f(x) under the constraints.
( x_0, λ_10, λ_20, …, λ_m0 ) is a stationary point of F when
∂F/∂x_1 = 0, ∂F/∂x_2 = 0, …, ∂F/∂x_n = 0, ∂F/∂λ_1 = 0, …, ∂F/∂λ_m = 0
i.e. n + m equations in total.

39 Pre-requisite: Lagrange multiplier
Example. Problem: for a given point p_0 = (1, 0), among all the points lying on the line y = x, identify the one having the least distance to p_0.
The (squared) distance is f(x, y) = (x − 1)^2 + (y − 0)^2.
We want to find the stationary point of f(x, y) under the constraint g(x, y) = y − x = 0.
Following the Lagrange multiplier method, construct the function
F(x, y, λ) = f(x, y) + λ g(x, y) = (x − 1)^2 + (y − 0)^2 + λ (y − x)

40 Pre-requisite: Lagrange multiplier
Example (continued): find the stationary point of F(x, y, λ):
∂F/∂x = 2(x − 1) − λ = 0
∂F/∂y = 2y + λ = 0
∂F/∂λ = y − x = 0
⇒ x = 0.5, y = 0.5, λ = −1
So the closest point on the line y = x to p_0 = (1, 0) is (0.5, 0.5).
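A symbolic check of this worked example (assuming SymPy is available); it solves the three stationarity equations for F directly:

import sympy as sp

x, y, lam = sp.symbols('x y lam')
F = (x - 1)**2 + (y - 0)**2 + lam * (y - x)

# Stationary point: dF/dx = dF/dy = dF/dlam = 0
sol = sp.solve([sp.diff(F, x), sp.diff(F, y), sp.diff(F, lam)], [x, y, lam], dict=True)
print(sol)   # -> x = 1/2, y = 1/2, lam = -1: the closest point on y = x is (0.5, 0.5)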

41 Linear discriminant analysis
In a two-class classification problem, we are given n samples in a d-dimensional feature space, with n_1 samples belonging to class 1 and n_2 samples belonging to class 2.
Goal: find a vector w and project the n samples onto the axis y = w^T x, so that the projected samples are well separated.
(The slide figure compares two candidate projection directions, y = w_1^T x and y = w_2^T x.)

42 Linear discriminant analysis
Given a dataset D = { (x_k, y_k) }_{k=1}^m, y_k ∈ {0, 1}, denote by X_i, μ_i, Σ_i the samples, the sample mean vector, and the covariance matrix of class i, respectively.
The sample mean of the projected points in the i-th class is (1/n_i) Σ_{x ∈ X_i} w^T x = w^T μ_i.
The variance of the projected points in the i-th class is w^T Σ_i w.

43 Linear discriminant analysis
The between-class scatter matrix is: S_b = ( μ_0 − μ_1 )( μ_0 − μ_1 )^T
The within-class scatter matrix is: S_w = Σ_0 + Σ_1
Fisher linear discriminant analysis chooses the w that maximizes
J(w) = ‖ w^T μ_0 − w^T μ_1 ‖_2^2 / ( w^T Σ_0 w + w^T Σ_1 w ) = ( w^T S_b w ) / ( w^T S_w w )
i.e. the between-class distance should be as large as possible, while the within-class scatter should be as small as possible.

44 Linear discriminant analysis
Without loss of generality, let w^T S_w w = 1. The optimization problem can then be rewritten as
min_w − w^T S_b w   s.t.  w^T S_w w = 1
Using the Lagrange multiplier method, we look for a stationary point of
F(w) = − w^T S_b w + λ ( w^T S_w w − 1 )
Setting ∂F/∂w = 0 (and ∂F/∂λ = 0) gives S_b w = λ S_w w.

45 Linear discriminant analysis
Note that S_b w = ( μ_0 − μ_1 )( μ_0 − μ_1 )^T w always points in the direction of μ_0 − μ_1, and only the direction of w matters, so we may write S_b w = λ ( μ_0 − μ_1 ). Substituting into S_b w = λ S_w w gives
λ ( μ_0 − μ_1 ) = λ S_w w  ⇒  w = S_w^{-1} ( μ_0 − μ_1 )
In practice, we compute the singular value decomposition of S_w, i.e. S_w = U Σ V^T, and then S_w^{-1} = V Σ^{-1} U^T.
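A small sketch of the two-class Fisher direction w = S_w^{-1}(μ_0 − μ_1) on synthetic data (the class means and covariances below are assumptions for illustration), using the SVD-based inverse suggested above:

import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]], size=100)   # class 0
X1 = rng.multivariate_normal([2.0, 1.5], [[1.0, 0.3], [0.3, 0.5]], size=100)   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)   # within-class scatter

# S_w = U diag(s) V^T, so S_w^{-1} = V diag(1/s) U^T
U, s, Vt = np.linalg.svd(S_w)
S_w_inv = Vt.T @ np.diag(1.0 / s) @ U.T

w = S_w_inv @ (mu0 - mu1)   # projection direction
print(w / np.linalg.norm(w))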

46 Multiclass classification
Binary classification ⇒ multiclass classification:
One vs. One (OvO): train a binary classifier for every pair of classes, N(N − 1)/2 classifiers in total; a test sample is assigned to the class receiving the most votes.
One vs. Rest (OvR): train N binary classifiers, each taking one class as positive ("+") and the remaining classes as negative ("−"); a test sample is assigned by the most confident classifier.
(The slide figure illustrates the OvO and OvR decompositions of classes C_1–C_4 into binary classifiers f_1–f_6.)
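A rough OvR prediction sketch (the per-class weights and biases below are hypothetical, already-trained parameters, not from the slides): each class keeps its own linear scorer f_c(x) = w_c^T x + b_c, and the most confident class wins.

import numpy as np

def ovr_predict(X, W, b):
    # W: one weight row per class, b: one bias per class
    scores = X @ W.T + b              # score of every per-class classifier for every sample
    return np.argmax(scores, axis=1)  # index of the most confident class

W = np.array([[ 2.0,  0.1],
              [-1.5,  1.0],
              [ 0.2, -2.0]])          # hypothetical weights for 3 classes in 2-D
b = np.array([0.0, 0.3, -0.1])

X_test = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(ovr_predict(X_test, W, b))      # predicted class index for each test sample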

47 Multiclass classification
Binary classification ⇒ multiclass classification:
Many vs. Many (MvM): error correcting output codes (ECOC).
Encode: split the N classes into positive ("+1") and negative ("−1") groups several times and train one binary classifier per split (the slide shows a ±1 coding matrix over classes C_1–C_4 and classifiers f_1–f_5).
Decode: run all the classifiers on a test sample and assign it to the class whose codeword is closest, e.g. in Hamming distance or Euclidean distance.

48 Class-imbalance
In the previous problems, we assumed that the numbers of samples from different classes are about the same. However, if the proportions of samples from different classes differ greatly, the learning process will be influenced, e.g. 998 negatives vs. 2 positives.
Consider class imbalance when using logistic regression, ln( y / (1 − y) ) = w^T x + b, to perform a classification task. When the numbers of positives and negatives are the same:
y > 0.5, i.e. y / (1 − y) > 1: classify as positive
y < 0.5, i.e. y / (1 − y) < 1: classify as negative

49 Class-imbalance
However, when the numbers of positives and negatives are not equal, let m^+ be the number of positives and m^− be the number of negatives. The odds of observing a positive are m^+ / m^−. Therefore, the classification criterion becomes:
y / (1 − y) > m^+ / m^−: classify as positive
y / (1 − y) < m^+ / m^−: classify as negative
Rescaling: y' / (1 − y') = ( y / (1 − y) ) × ( m^− / m^+ ), then compare y' / (1 − y') with 1 as before.
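A short sketch of this rescaling (threshold-moving) rule; m^+, m^−, and the predicted probabilities below are assumed values for illustration:

import numpy as np

m_pos, m_neg = 2, 998
y = np.array([0.4, 0.75, 0.9, 0.001])   # logistic outputs p(y = 1 | x)

odds = y / (1.0 - y)
rescaled_odds = odds * (m_neg / m_pos)   # y'/(1 - y') = y/(1 - y) * m^-/m^+
pred_positive = rescaled_odds > 1.0      # equivalent to y/(1 - y) > m^+/m^-
print(pred_positive)                     # [ True  True  True False]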

50 Class-imbalance
Common strategies for handling class imbalance: undersampling, oversampling, and threshold-moving.

