
1 Linear Model
Ying Shen, SSE, Tongji University, Sep. 2016

2 The basic form of the linear model
Given a sample x = (x_1, x_2, …, x_d)^T with d attributes, the linear model tries to learn a prediction function using a linear combination of all the attributes, i.e.
f(x) = w_1 x_1 + w_2 x_2 + … + w_d x_d + b
The vector form of the function is f(x) = w^T x + b, where w = (w_1, w_2, …, w_d)^T.
Once w and b have been learned from samples, f is determined.
For example: f(good melon) = 0.2 · x_color + 0.5 · x_root + 0.3 · x_knock + 1
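As a minimal sketch (not from the slides), the model is just a dot product plus a bias; the attribute values below are made-up numbers for the melon example:

import numpy as np

# Weights from the slide example (color, root, knock sound) and an assumed bias
w = np.array([0.2, 0.5, 0.3])
b = 1.0
# Hypothetical attribute values for one sample
x = np.array([0.7, 0.4, 0.6])

f_x = w @ x + b   # f(x) = w^T x + b
print(f_x)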

3 Linear regression
Given a dataset D = {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}, where x_i = (x_{i1}, x_{i2}, …, x_{id})^T, the task of linear regression is to learn a linear model that predicts, for a new sample x', a value close to its true value y'.
When d = 1, each x_i is a single scalar x_i. Example data (hours spent studying vs. math SAT score):
Hours spent studying: 4, 9, 10, 14, 7, 12, 22, 1, 3, 8
Math SAT score: 390, 580, 650, 730, 410, 530, 600, 790, 350, 400, 590

4 Linear regression
We will learn a linear regression model f(x_i) = w x_i + b, such that f(x_i) ≈ y_i.
How do we determine w and b?

5 Linear regression
Mean squared error (MSE) is a commonly used performance measure:
MSE = (1/m) Σ_{i=1}^m (y_i' − y_i)^2
We want to minimize the MSE between f(x_i) and y_i:
(w*, b*) = arg min_{(w,b)} Σ_{i=1}^m (f(x_i) − y_i)^2 = arg min_{(w,b)} Σ_{i=1}^m (y_i − w x_i − b)^2

6 Linear regression
The method of determining the fitted model by minimizing the MSE is called the least square method.
In the linear regression problem, the least square method aims to find a line such that the sum of squared distances from all the samples to it is smallest.

7 Pre-requisite
A stationary point of a differentiable function of one variable is a point of the domain of the function where the derivative is zero.
Single-variable function: f(x) is differentiable in (a, b). At a stationary point x_0: df/dx |_{x_0} = 0
Two-variable function: f(x, y) is differentiable in its domain. At a stationary point (x_0, y_0): ∂f/∂x |_{(x_0, y_0)} = 0 and ∂f/∂y |_{(x_0, y_0)} = 0

8 Pre-requisite
In the general case, if x_0 is a stationary point of f(x), x ∈ ℝ^{n×1}, then
∂f/∂x_1 |_{x_0} = 0, ∂f/∂x_2 |_{x_0} = 0, …, ∂f/∂x_n |_{x_0} = 0
Proposition: Let f be a differentiable function of n variables defined on the convex set S, and let x_0 be in the interior of S. If f is convex, then x_0 is a global minimizer of f in S if and only if it is a stationary point of f (i.e. ∂f/∂x_i |_{x_0} = 0 for i = 1, …, n).

9 Parameter estimation
The function E(w, b) = Σ_{i=1}^m (y_i − w x_i − b)^2 is a convex function.
The extremum is therefore achieved at the stationary point, i.e. where ∂E/∂w = 0 and ∂E/∂b = 0, with
∂E/∂w = 2 ( w Σ_{i=1}^m x_i^2 − Σ_{i=1}^m (y_i − b) x_i )
∂E/∂b = 2 ( m b − Σ_{i=1}^m (y_i − w x_i) )

10 Parameter estimation
Solving these equations gives closed-form expressions for w and b:
w = Σ_{i=1}^m y_i (x_i − x̄) / ( Σ_{i=1}^m x_i^2 − (1/m) (Σ_{i=1}^m x_i)^2 )
b = (1/m) Σ_{i=1}^m (y_i − w x_i) = ȳ − w x̄
where x̄ = (1/m) Σ_{i=1}^m x_i and ȳ = (1/m) Σ_{i=1}^m y_i are the means of x and y.
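A minimal sketch of these closed-form formulas in Python; the study-time pairs below are illustrative stand-ins, not the exact slide data:

import numpy as np

x = np.array([4, 9, 10, 14, 7, 12, 22, 1, 3, 8], dtype=float)
y = np.array([390, 580, 650, 730, 530, 600, 790, 350, 400, 590], dtype=float)

m = len(x)
x_bar, y_bar = x.mean(), y.mean()

# w = sum_i y_i (x_i - x_bar) / ( sum_i x_i^2 - (1/m)(sum_i x_i)^2 )
w = np.sum(y * (x - x_bar)) / (np.sum(x**2) - np.sum(x)**2 / m)
b = y_bar - w * x_bar   # b = y_bar - w * x_bar
print(w, b)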

11 Multivariate linear regression
In the general case, given a dataset D with d ≥ 1, we try to learn a model f(x_i) = w^T x_i + b, such that f(x_i) ≈ y_i.
We can also use the least square method to estimate w and b.
First, denote ŵ = (w; b), let X be the m × (d+1) matrix whose i-th row is (x_{i1}, x_{i2}, …, x_{id}, 1) = (x_i^T, 1), and let y = (y_1, y_2, …, y_m)^T.
Then X ŵ = ( x_1^T w + b, x_2^T w + b, …, x_m^T w + b )^T.

12 Pre-requisite Matrix differentiation
Function is a vector and the variable is a scalar: f(t) = ( f_1(t), f_2(t), …, f_n(t) )^T
Definition: df/dt = ( df_1(t)/dt, df_2(t)/dt, …, df_n(t)/dt )^T

13 Pre-requisite Matrix differentiation
Function is a matrix and the variable is a scalar: F(t) = [ f_ij(t) ]
Definition (entry-wise, as in the vector case): ( dF/dt )_{ij} = d f_ij(t) / dt

14 Pre-requisite Matrix differentiation
Function is a scalar and the variable is a vector: f(x), x = ( x_1, x_2, …, x_n )^T
Definition: df/dx = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n )^T
In a similar way, df/dx^T = ( ∂f/∂x_1, ∂f/∂x_2, …, ∂f/∂x_n ), a row vector.

15 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector: y(x) = ( y_1(x), …, y_m(x) )^T, x = ( x_1, …, x_n )^T
Definition: dy^T/dx is the n × m matrix with entries ( dy^T/dx )_{ij} = ∂y_j/∂x_i

16 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector.
In a similar way, dy/dx^T is the m × n matrix with entries ( dy/dx^T )_{ij} = ∂y_i/∂x_j, i.e. ( dy^T/dx )^T.

17 Pre-requisite Matrix differentiation
Function is a vector and the variable is a vector. Example:
y = ( y_1(x), y_2(x) )^T, x = ( x_1, x_2, x_3 )^T, with y_1(x) = x_1^2 − x_2 and y_2(x) = x_2 x_3
dy^T/dx = [ ∂y_1/∂x_1  ∂y_2/∂x_1
            ∂y_1/∂x_2  ∂y_2/∂x_2
            ∂y_1/∂x_3  ∂y_2/∂x_3 ] = [ 2x_1  0
                                       −1    x_3
                                       0     x_2 ]

18 Pre-requisite Useful results
Let x, a ∈ ℝ^{n×1}. Then
d(a^T x)/dx = a,   d(x^T a)/dx = a

19 Pre-requisite Useful results
Let A ∈ ℝ^{m×n} and x ∈ ℝ^{n×1}. Then d(Ax)/dx^T = A.
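A quick numerical sanity check of the two identities above (not from the slides), using central finite differences on arbitrary a, A and x:

import numpy as np

rng = np.random.default_rng(0)
n, m = 4, 3
a, A, x = rng.normal(size=n), rng.normal(size=(m, n)), rng.normal(size=n)
eps = 1e-6

# Gradient of the scalar a^T x with respect to x: should equal a
grad = np.array([(a @ (x + eps * e) - a @ (x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(grad, a))   # True

# Jacobian of the vector A x with respect to x^T: should equal A
jac = np.column_stack([(A @ (x + eps * e) - A @ (x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.allclose(jac, A))    # True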

20 Multivariate linear regression
Similarly, ŵ* = arg min_ŵ (y − X ŵ)^T (y − X ŵ). Denote E_ŵ = (y − X ŵ)^T (y − X ŵ). Then
∂E_ŵ/∂ŵ = 2 X^T (X ŵ − y) = 2 X^T X ŵ − 2 X^T y = 0
⇒ X^T X ŵ = X^T y

21 Multivariate linear regression
Discussion
If X^T X is a full-rank (equivalently, positive definite) matrix, then X^T X ŵ = X^T y gives
ŵ* = (X^T X)^{-1} X^T y
Letting x̂_i = (x_i; 1), the learned model is f(x̂_i) = x̂_i^T (X^T X)^{-1} X^T y.
If X^T X is not full-rank, the equation has multiple solutions → an inductive bias (e.g. regularization) is needed to pick one.
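A short sketch of the normal equations on synthetic data (the data-generating weights are assumptions for illustration); np.linalg.lstsq covers the rank-deficient case:

import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 3
X_raw = rng.normal(size=(m, d))
y = X_raw @ np.array([1.5, -2.0, 0.5]) + 0.7 + 0.01 * rng.normal(size=m)

X = np.hstack([X_raw, np.ones((m, 1))])   # append the constant-1 column for b

# Full-rank case: solve X^T X w_hat = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # approximately [1.5, -2.0, 0.5, 0.7]

# Rank-deficient or ill-conditioned case: minimum-norm least-squares solution
w_hat_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat_ls)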

22 Generalized linear model
f(x) = w^T x + b  →  y = w^T x + b
Log-linear regression: ln y = w^T x + b
More generally, if a monotone function g(·) is differentiable, let y = g^{-1}( w^T x + b ).
Then y is a generalized linear model, and g(·) is called the link function.

23 Logistic regression
How do we perform a classification task using a linear model?
First, consider a binary classification task with labels from {0, 1}. We need to map z = w^T x + b ∈ ℝ to {0, 1}.
Unit-step function: y = 0 if z < 0; y = 0.5 if z = 0; y = 1 if z > 0.

24 Logistic regression
But the unit-step function is not continuous, so it cannot be used as g^{-1}(·).
We therefore need a "surrogate function": the logistic function
y = 1 / (1 + e^{−z}) = 1 / (1 + e^{−(w^T x + b)})
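A small sketch of the logistic surrogate in Python; the weights, bias, and sample points are arbitrary values chosen for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([2.0, -1.0]), 0.5
X = np.array([[0.2, 0.1], [1.0, 3.0], [-2.0, 0.0]])

z = X @ w + b      # z = w^T x + b for each row of X
y = sigmoid(z)     # smooth values in (0, 1) instead of the hard unit step
print(y)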

25 Logistic regression
y = 1 / (1 + e^{−(w^T x + b)})  ⟺  ln( y / (1 − y) ) = w^T x + b
y: the probability of x being a positive sample
1 − y: the probability of x being a negative sample
y / (1 − y) (the odds): the relative likelihood of x being a positive sample
ln( y / (1 − y) ): the log odds, or logit
Advantages of logistic regression

26 Logistic regression
Task: determine w and b in y = 1 / (1 + e^{−(w^T x + b)}), i.e. ln( y / (1 − y) ) = w^T x + b.
Solution:
1. Interpret y as p( y = 1 | x )
2. Estimate w and b using the maximum likelihood method
ln( p(y = 1 | x) / p(y = 0 | x) ) = w^T x + b
p( y = 1 | x ) = p_1(x) = e^{w^T x + b} / (1 + e^{w^T x + b})
p( y = 0 | x ) = p_0(x) = 1 / (1 + e^{w^T x + b}) = 1 − p_1(x)

27 Pre-requisite: maximum likelihood estimation
MLE is a method of estimating the parameters of a statistical model from observations, by finding the parameter values that maximize the likelihood of the observations given the parameters.
Let D_c be the set containing all the samples from class c. Supposing these samples are independent and identically distributed (i.i.d.), the likelihood of the samples in D_c given a parameter θ_c is:
P( D_c | θ_c ) = ∏_{x ∈ D_c} P( x | θ_c )

28 Pre-requisite: maximum likelihood estimation
We want to maximize P( D_c | θ_c ). The log-likelihood LL(θ_c) is often used instead of P( D_c | θ_c ):
LL(θ_c) = log P( D_c | θ_c ) = Σ_{x ∈ D_c} log P( x | θ_c )
The maximum likelihood estimate of θ_c is
θ̂_c = arg max_{θ_c} LL(θ_c)

29 Pre-requisite: maximum likelihood estimation
Example: if the probability density function p( x | c ) ~ N( μ_c, σ_c^2 ), the maximum likelihood estimates of μ_c and σ_c^2 are
μ̂_c = (1 / |D_c|) Σ_{x ∈ D_c} x
σ̂_c^2 = (1 / |D_c|) Σ_{x ∈ D_c} ( x − μ̂_c )( x − μ̂_c )^T
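A short sketch of these Gaussian MLE formulas on synthetic class-c samples (the data below are assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
D_c = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 0.5]], size=500)

mu_c = D_c.mean(axis=0)                 # (1/|D_c|) sum_x x
diff = D_c - mu_c
Sigma_c = diff.T @ diff / len(D_c)      # MLE covariance: divides by |D_c|, not |D_c| - 1

print(mu_c)
print(Sigma_c)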

30 Logistic regression
2. Estimate w and b using the maximum likelihood method.
Given a training set { (x_i, y_i) }_{i=1}^m, the log-likelihood is
ℓ(w, b) = ln ∏_{i=1}^m p_1( x_i | w, b )^{y_i} p_0( x_i | w, b )^{1 − y_i}
        = Σ_{i=1}^m [ y_i ln p_1( x_i | w, b ) + (1 − y_i) ln p_0( x_i | w, b ) ]
Let β = (w; b) and x̂ = (x; 1); then w^T x + b = β^T x̂ and
p_1( x̂_i ) = e^{β^T x̂_i} / (1 + e^{β^T x̂_i}),   p_0( x̂_i ) = 1 / (1 + e^{β^T x̂_i})

31 Logistic regression
Then the log-likelihood becomes
ℓ(w, b) = ℓ(β) = Σ_{i=1}^m ( y_i β^T x̂_i − ln( 1 + e^{β^T x̂_i} ) )
−ℓ(β) is a continuous convex function with derivatives of all orders, so the optimization target can be written as
β* = arg min_β −ℓ(β)
The minimum of the target function can be found with the gradient descent method or Newton's method.
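A minimal gradient-descent sketch for arg min_β −ℓ(β); the synthetic data, step size, and iteration count are assumptions chosen for illustration (the gradient of −ℓ is X̂^T (p_1 − y)):

import numpy as np

rng = np.random.default_rng(0)
m, d = 200, 2
X = rng.normal(size=(m, d))
y = (X @ np.array([2.0, -3.0]) + 0.5 > 0).astype(float)   # labels in {0, 1}

X_hat = np.hstack([X, np.ones((m, 1))])   # x_hat = (x; 1), beta = (w; b)
beta = np.zeros(d + 1)
lr = 0.1

for _ in range(2000):
    p1 = 1.0 / (1.0 + np.exp(-(X_hat @ beta)))   # p(y = 1 | x_hat; beta)
    grad = X_hat.T @ (p1 - y)                     # gradient of -l(beta)
    beta -= lr * grad / m                         # averaged gradient step
print(beta)   # roughly proportional to [2, -3, 0.5]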

32 Pre-requisite: Newton's method
Newton's method was first published in 1685 in A Treatise of Algebra both Historical and Practical by John Wallis. In 1690, Joseph Raphson published a simplified description in Analysis aequationum universalis.

33 Pre-requisite: Newton's method
Consider min f(x). Take the Taylor series expansion around x̄:
f(x) ≈ g(x) = f(x̄) + ∇f(x̄)^T (x − x̄) + (1/2) (x − x̄)^T ∇^2 f(x̄) (x − x̄)
Instead of solving min f(x), solve min g(x), i.e. set ∇g(x) = 0:
∇f(x̄) + ∇^2 f(x̄) (x − x̄) = 0  ⇒  x − x̄ = −[ ∇^2 f(x̄) ]^{-1} ∇f(x̄)
The direction d = −[ ∇^2 f(x̄) ]^{-1} ∇f(x̄) is the Newton direction.

34 Pre-requisite: Newton's method
The algorithm
Step 0: Given x_0, set k := 0.
Step 1: Compute d_k = −[ ∇^2 f(x_k) ]^{-1} ∇f(x_k); if ‖d_k‖ ≤ ε, stop.
Step 2: Set x_{k+1} ← x_k + d_k, k ← k + 1, and go to Step 1.
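A small sketch of this iteration on an assumed two-variable test function f(x) = (x_1 − 1)^2 + 2(x_2 + 3)^2 (not from the slides); because f is quadratic, one Newton step reaches the minimizer:

import numpy as np

def grad_f(x):
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])

def hess_f(x):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

x = np.array([10.0, 10.0])   # Step 0: starting point x_0
eps = 1e-8
for k in range(100):
    d = -np.linalg.solve(hess_f(x), grad_f(x))   # Step 1: Newton direction d_k
    if np.linalg.norm(d) <= eps:                 # stop when ||d_k|| <= eps
        break
    x = x + d                                    # Step 2: x_{k+1} = x_k + d_k
print(x)   # -> [1, -3]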

35 Logistic regression
Recall ℓ(w, b) = ℓ(β) = Σ_{i=1}^m ( y_i β^T x̂_i − ln( 1 + e^{β^T x̂_i} ) ).
−ℓ(β) is a continuous convex function with derivatives of all orders, so the optimization target is
β* = arg min_β −ℓ(β)
The minimum can be found with the gradient descent method or Newton's method.
Using Newton's method, the update at iteration t + 1 is
β^{t+1} = β^t − ( ∂^2 (−ℓ(β)) / ∂β ∂β^T )^{-1} ∂ (−ℓ(β)) / ∂β

36 Logistic regression
The first- and second-order derivatives of ℓ(β) are given in the book.
Assignment 1: Implement a logistic regression model using Matlab (or R, Python, or any language you are familiar with). You can use any dataset from the UCI repository to validate your model, and plot a figure like the one shown on the slide.

37 Pre-requisite: Lagrange multiplier
The Lagrange multiplier method is a strategy for finding the local extrema of a function subject to equality constraints.
Problem: max f(x) (or min −f(x)), x ∈ ℝ^{n×1}, s.t. g_k(x) = 0, k = 1, …, m

38 Pre-requisite: Lagrange multiplier
Solution: construct F( x; λ_1, …, λ_m ) = f(x) + Σ_{k=1}^m λ_k g_k(x).
If ( x_0, λ_10, λ_20, …, λ_m0 ) is a stationary point of F, then x_0 is a stationary point of f(x) under the constraints.
( x_0, λ_10, λ_20, …, λ_m0 ) is a stationary point of F when
∂F/∂x_1 = 0, ∂F/∂x_2 = 0, …, ∂F/∂x_n = 0, ∂F/∂λ_1 = 0, …, ∂F/∂λ_m = 0
i.e. n + m equations in total.

39 Pre-requisite: Lagrange multiplier
Example. Problem: for a given point p_0 = (1, 0), among all the points lying on the line y = x, identify the one having the least distance to p_0.
The (squared) distance is f(x, y) = (x − 1)^2 + (y − 0)^2.
We want to find the stationary point of f(x, y) under the constraint g(x, y) = y − x = 0.
Following the Lagrange multiplier method, construct the function
F(x, y, λ) = f(x, y) + λ g(x, y) = (x − 1)^2 + (y − 0)^2 + λ (y − x)

40 Pre-requisite: Lagrange multiplier
Example (continued): find the stationary point of F(x, y, λ):
∂F/∂x = 2(x − 1) − λ = 0
∂F/∂y = 2y + λ = 0
∂F/∂λ = y − x = 0
⇒ x = 0.5, y = 0.5, λ = −1
So the closest point on the line y = x to p_0 = (1, 0) is (0.5, 0.5).
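A symbolic check of this worked example (assuming SymPy is available); it solves the three stationarity equations for F directly:

import sympy as sp

x, y, lam = sp.symbols('x y lam')
F = (x - 1)**2 + (y - 0)**2 + lam * (y - x)

# Stationary point: dF/dx = dF/dy = dF/dlam = 0
sol = sp.solve([sp.diff(F, x), sp.diff(F, y), sp.diff(F, lam)], [x, y, lam], dict=True)
print(sol)   # -> x = 1/2, y = 1/2, lam = -1: the closest point on y = x is (0.5, 0.5)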

41 Linear discriminant analysis
In a two-class classification problem, we are given n samples in a d-dimensional feature space, with n_1 samples belonging to class 1 and n_2 samples belonging to class 2.
Goal: find a vector w and project the n samples onto the axis y = w^T x, so that the projected samples are well separated.
(The slide figure compares two candidate projection directions, y = w_1^T x and y = w_2^T x.)

42 Linear discriminant analysis
Given a dataset D = { (x_k, y_k) }_{k=1}^m, y_k ∈ {0, 1}, denote by X_i, μ_i, Σ_i the samples, the sample mean vector, and the covariance matrix of class i, respectively.
The sample mean of the projected points in the i-th class is (1/n_i) Σ_{x ∈ X_i} w^T x = w^T μ_i.
The variance of the projected points in the i-th class is w^T Σ_i w.

43 Linear discriminant analysis
The between-class scatter matrix is: S_b = ( μ_0 − μ_1 )( μ_0 − μ_1 )^T
The within-class scatter matrix is: S_w = Σ_0 + Σ_1
Fisher linear discriminant analysis chooses the w that maximizes
J(w) = ‖ w^T μ_0 − w^T μ_1 ‖_2^2 / ( w^T Σ_0 w + w^T Σ_1 w ) = ( w^T S_b w ) / ( w^T S_w w )
i.e. the between-class distance should be as large as possible, while the within-class scatter should be as small as possible.

44 Linear discriminant analysis
Without loss of generality, let w^T S_w w = 1. The optimization problem can then be rewritten as
min_w − w^T S_b w   s.t.  w^T S_w w = 1
Using the Lagrange multiplier method, we look for a stationary point of
F(w) = − w^T S_b w + λ ( w^T S_w w − 1 )
Setting ∂F/∂w = 0 (and ∂F/∂λ = 0) gives S_b w = λ S_w w.

45 Linear discriminant analysis
Note that S_b w = ( μ_0 − μ_1 )( μ_0 − μ_1 )^T w always points in the direction of μ_0 − μ_1, and only the direction of w matters, so we may write S_b w = λ ( μ_0 − μ_1 ). Substituting into S_b w = λ S_w w gives
λ ( μ_0 − μ_1 ) = λ S_w w  ⇒  w = S_w^{-1} ( μ_0 − μ_1 )
In practice, we compute the singular value decomposition of S_w, i.e. S_w = U Σ V^T, and then S_w^{-1} = V Σ^{-1} U^T.
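A small sketch of the two-class Fisher direction w = S_w^{-1}(μ_0 − μ_1) on synthetic data (the class means and covariances below are assumptions for illustration), using the SVD-based inverse suggested above:

import numpy as np

rng = np.random.default_rng(0)
X0 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 0.5]], size=100)   # class 0
X1 = rng.multivariate_normal([2.0, 1.5], [[1.0, 0.3], [0.3, 0.5]], size=100)   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)   # within-class scatter

# S_w = U diag(s) V^T, so S_w^{-1} = V diag(1/s) U^T
U, s, Vt = np.linalg.svd(S_w)
S_w_inv = Vt.T @ np.diag(1.0 / s) @ U.T

w = S_w_inv @ (mu0 - mu1)   # projection direction
print(w / np.linalg.norm(w))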

46 Multiclass classification
Binary classification ⇒ multiclass classification:
One vs. One (OvO): train a binary classifier for every pair of classes, N(N − 1)/2 classifiers in total; a test sample is assigned to the class receiving the most votes.
One vs. Rest (OvR): train N binary classifiers, each taking one class as positive ("+") and the remaining classes as negative ("−"); a test sample is assigned by the most confident classifier.
(The slide figure illustrates the OvO and OvR decompositions of classes C_1–C_4 into binary classifiers f_1–f_6.)
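A rough OvR prediction sketch (the per-class weights and biases below are hypothetical, already-trained parameters, not from the slides): each class keeps its own linear scorer f_c(x) = w_c^T x + b_c, and the most confident class wins.

import numpy as np

def ovr_predict(X, W, b):
    # W: one weight row per class, b: one bias per class
    scores = X @ W.T + b              # score of every per-class classifier for every sample
    return np.argmax(scores, axis=1)  # index of the most confident class

W = np.array([[ 2.0,  0.1],
              [-1.5,  1.0],
              [ 0.2, -2.0]])          # hypothetical weights for 3 classes in 2-D
b = np.array([0.0, 0.3, -0.1])

X_test = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
print(ovr_predict(X_test, W, b))      # predicted class index for each test sample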

47 Multiclass classification
Binary classification ⇒ multiclass classification:
Many vs. Many (MvM): error correcting output codes (ECOC).
Encode: split the N classes into positive ("+1") and negative ("−1") groups several times and train one binary classifier per split (the slide shows a ±1 coding matrix over classes C_1–C_4 and classifiers f_1–f_5).
Decode: run all the classifiers on a test sample and assign it to the class whose codeword is closest, e.g. in Hamming distance or Euclidean distance.

48 Class-imbalance
In the previous problems, we assumed that the numbers of samples from different classes are about the same. However, if the proportions of samples from different classes differ greatly, the learning process will be influenced, e.g. 998 negatives vs. 2 positives.
Consider class imbalance when using logistic regression, ln( y / (1 − y) ) = w^T x + b, to perform a classification task. When the numbers of positives and negatives are the same:
y > 0.5, i.e. y / (1 − y) > 1: classify as positive
y < 0.5, i.e. y / (1 − y) < 1: classify as negative

49 Class-imbalance
However, when the numbers of positives and negatives are not equal, let m^+ be the number of positives and m^− be the number of negatives. The odds of observing a positive are m^+ / m^−. Therefore, the classification criterion becomes:
y / (1 − y) > m^+ / m^−: classify as positive
y / (1 − y) < m^+ / m^−: classify as negative
Rescaling: y' / (1 − y') = ( y / (1 − y) ) × ( m^− / m^+ ), then compare y' / (1 − y') with 1 as before.
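A short sketch of this rescaling (threshold-moving) rule; m^+, m^−, and the predicted probabilities below are assumed values for illustration:

import numpy as np

m_pos, m_neg = 2, 998
y = np.array([0.4, 0.75, 0.9, 0.001])   # logistic outputs p(y = 1 | x)

odds = y / (1.0 - y)
rescaled_odds = odds * (m_neg / m_pos)   # y'/(1 - y') = y/(1 - y) * m^-/m^+
pred_positive = rescaled_odds > 1.0      # equivalent to y/(1 - y) > m^+/m^-
print(pred_positive)                     # [ True  True  True False]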

50 Class-imbalance
Common strategies for handling class imbalance: undersampling, oversampling, and threshold-moving.

