Classification: Logistic Regression


1 Classification: Logistic Regression
Hung-yi Lee 李宏毅

2 About Grouping
Homework is submitted individually; only the final project requires forming groups. If you cannot find group members, that is fine: after the final project is announced, the TAs will help pair up students who have no group.

3 Step 1: Function Set
Function set (including all different w and b):
  f_{w,b}(x) = P_{w,b}(C_1|x) = σ(z), where z = w·x + b = Σ_i w_i x_i + b and σ(z) = 1/(1 + exp(−z))
  If P_{w,b}(C_1|x) ≥ 0.5, output class 1; if P_{w,b}(C_1|x) < 0.5, output class 2.
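A minimal sketch of this function set in Python/NumPy (the function names are illustrative, not from the slides):

    import numpy as np

    def sigmoid(z):
        # sigma(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def f(x, w, b):
        # f_{w,b}(x) = P_{w,b}(C1 | x) = sigma(w . x + b)
        return sigmoid(np.dot(w, x) + b)

    def classify(x, w, b):
        # class 1 if P(C1|x) >= 0.5, otherwise class 2
        return 1 if f(x, w, b) >= 0.5 else 2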

4 Step 1: Function Set
(Figure: the function set drawn as a single unit: each input x_i is multiplied by its weight w_i, the results are summed together with the bias b to give z, and z is passed through the sigmoid function, which here plays the role of the activation function; the output is P_{w,b}(C_1|x).)

5 Step 2: Goodness of a Function
Training data: x^1 (C_1), x^2 (C_1), x^3 (C_2), …, x^N (C_1).
Assume the data is generated based on f_{w,b}(x) = P_{w,b}(C_1|x). Given a set of w and b, what is its probability of generating the data?
  L(w,b) = f_{w,b}(x^1) f_{w,b}(x^2) (1 − f_{w,b}(x^3)) ⋯ f_{w,b}(x^N)
(All w and b give non-zero probability.)
The most likely w* and b* are the ones with the largest L(w,b):
  w*, b* = arg max_{w,b} L(w,b)

6
Encode the labels as y^n: 1 for class 1, 0 for class 2, so for the data above y^1 = 1, y^2 = 1, y^3 = 0, ….
  L(w,b) = f_{w,b}(x^1) f_{w,b}(x^2) (1 − f_{w,b}(x^3)) ⋯
  w*, b* = arg max_{w,b} L(w,b) = arg min_{w,b} −ln L(w,b)
  −ln L(w,b) = −ln f_{w,b}(x^1) − ln f_{w,b}(x^2) − ln(1 − f_{w,b}(x^3)) − ⋯
Each term can be written in the same form:
  −ln f_{w,b}(x^1) = −[y^1 ln f(x^1) + (1 − y^1) ln(1 − f(x^1))]
  −ln f_{w,b}(x^2) = −[y^2 ln f(x^2) + (1 − y^2) ln(1 − f(x^2))]
  −ln(1 − f_{w,b}(x^3)) = −[y^3 ln f(x^3) + (1 − y^3) ln(1 − f(x^3))]

7 Step 2: Goodness of a Function
L(w,b) = f_{w,b}(x^1) f_{w,b}(x^2) (1 − f_{w,b}(x^3)) ⋯ f_{w,b}(x^N)
−ln L(w,b) = −ln f_{w,b}(x^1) − ln f_{w,b}(x^2) − ln(1 − f_{w,b}(x^3)) − ⋯
           = Σ_n −[y^n ln f_{w,b}(x^n) + (1 − y^n) ln(1 − f_{w,b}(x^n))]   (y^n: 1 for class 1, 0 for class 2)
This sum is a cross entropy between two Bernoulli distributions. Where does it come from?
  Distribution p (ground truth): p(x=1) = y^n, p(x=0) = 1 − y^n
  Distribution q (model output): q(x=1) = f(x^n), q(x=0) = 1 − f(x^n)
  Cross entropy: H(p, q) = −Σ_x p(x) ln q(x)
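As a sketch, the negative log-likelihood above can be computed directly as a sum of per-example cross entropies (NumPy; the function name is illustrative):

    import numpy as np

    def neg_log_likelihood(X, y, w, b):
        # -ln L(w,b) = sum_n -[ y^n ln f(x^n) + (1 - y^n) ln(1 - f(x^n)) ]
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f_{w,b}(x^n) for every row of X
        return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

Each summand is exactly H(p, q) for the Bernoulli distributions p (ground truth) and q (model output) defined on this slide.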

8 Step 2: Goodness of a Function
L(w,b) = f_{w,b}(x^1) f_{w,b}(x^2) (1 − f_{w,b}(x^3)) ⋯ f_{w,b}(x^N)
−ln L(w,b) = Σ_n −[y^n ln f_{w,b}(x^n) + (1 − y^n) ln(1 − f_{w,b}(x^n))]   (y^n: 1 for class 1, 0 for class 2)
This is the cross entropy between two Bernoulli distributions. Where does it come from? Minimizing it pulls the output toward the ground truth: for example, when the ground truth is y^n = 1 the term reduces to −ln f(x^n), which is minimized by pushing f(x^n) toward 1.0 (and 1 − f(x^n) toward 0.0).

9 Step 3: Find the best function
−ln L(w,b) = Σ_n −[y^n ln f_{w,b}(x^n) + (1 − y^n) ln(1 − f_{w,b}(x^n))], with f_{w,b}(x) = σ(z), z = w·x + b = Σ_i w_i x_i + b, σ(z) = 1/(1 + exp(−z)).
Differentiate the first term with the chain rule:
  ∂ln f_{w,b}(x)/∂w_i = (∂ln f_{w,b}(x)/∂z)(∂z/∂w_i), where ∂z/∂w_i = x_i
  ∂ln σ(z)/∂z = (1/σ(z)) ∂σ(z)/∂z = (1/σ(z)) σ(z)(1 − σ(z)) = 1 − σ(z)
so ∂ln f_{w,b}(x^n)/∂w_i = (1 − f_{w,b}(x^n)) x_i^n.

10 Step 3: Find the best function
Differentiate the second term in the same way:
  ∂ln(1 − f_{w,b}(x))/∂w_i = (∂ln(1 − f_{w,b}(x))/∂z)(∂z/∂w_i), where ∂z/∂w_i = x_i
  ∂ln(1 − σ(z))/∂z = −(1/(1 − σ(z))) ∂σ(z)/∂z = −(1/(1 − σ(z))) σ(z)(1 − σ(z)) = −σ(z)
so ∂ln(1 − f_{w,b}(x^n))/∂w_i = −f_{w,b}(x^n) x_i^n
(again with f_{w,b}(x) = σ(z), z = w·x + b = Σ_i w_i x_i + b, σ(z) = 1/(1 + exp(−z))).

11 Step 3: Find the best function
Putting the two terms together:
  ∂(−ln L(w,b))/∂w_i = Σ_n −[y^n (1 − f_{w,b}(x^n)) x_i^n − (1 − y^n) f_{w,b}(x^n) x_i^n]
                     = Σ_n −[y^n − y^n f_{w,b}(x^n) − f_{w,b}(x^n) + y^n f_{w,b}(x^n)] x_i^n
                     = Σ_n −(y^n − f_{w,b}(x^n)) x_i^n
Gradient descent update: w_i ← w_i − η Σ_n −(y^n − f_{w,b}(x^n)) x_i^n
The larger the difference between the target y^n and the output f_{w,b}(x^n), the larger the update.
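A minimal gradient-descent loop implementing this update rule (a sketch assuming NumPy; the learning rate, iteration count, and function name are arbitrary choices, not from the slides):

    import numpy as np

    def train_logistic(X, y, eta=0.1, iters=1000):
        # X: N x d matrix of examples, y: length-N vector of 0/1 labels
        N, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(iters):
            f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f_{w,b}(x^n) for all n
            # dL/dw_i = sum_n -(y^n - f(x^n)) x_i^n ;  dL/db = sum_n -(y^n - f(x^n))
            grad_w = -(y - f) @ X
            grad_b = -np.sum(y - f)
            w -= eta * grad_w
            b -= eta * grad_b
        return w, b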

12 Logistic Regression + Square Error
Step 1: f_{w,b}(x) = σ(Σ_i w_i x_i + b)
Step 2: training data (x^n, y^n), y^n: 1 for class 1, 0 for class 2; L(f) = ½ Σ_n (f_{w,b}(x^n) − y^n)²
Step 3: ∂(f_{w,b}(x) − y)²/∂w_i = 2(f_{w,b}(x) − y) (∂f_{w,b}(x)/∂z)(∂z/∂w_i) = 2(f_{w,b}(x) − y) f_{w,b}(x)(1 − f_{w,b}(x)) x_i
For y^n = 1:
  If f_{w,b}(x^n) = 1 (close to target): ∂L/∂w_i = 0
  If f_{w,b}(x^n) = 0 (far from target): ∂L/∂w_i = 0 as well

13 Logistic Regression + Square Error
Step 1: f_{w,b}(x) = σ(Σ_i w_i x_i + b)
Step 2: training data (x^n, y^n), y^n: 1 for class 1, 0 for class 2; L(f) = ½ Σ_n (f_{w,b}(x^n) − y^n)²
Step 3: ∂(f_{w,b}(x) − y)²/∂w_i = 2(f_{w,b}(x) − y) f_{w,b}(x)(1 − f_{w,b}(x)) x_i
For y^n = 0:
  If f_{w,b}(x^n) = 1 (far from target): ∂L/∂w_i = 0
  If f_{w,b}(x^n) = 0 (close to target): ∂L/∂w_i = 0
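A quick numerical check of the point made on these two slides (illustrative only): with the square-error gradient 2(f − y) f (1 − f) x, the gradient is (near) zero both when the prediction matches the target and when it is maximally wrong, so training can stall far from the target.

    def mse_grad(f, y, x=1.0):
        # d(f - y)^2 / dw_i = 2 (f - y) * f * (1 - f) * x_i
        return 2 * (f - y) * f * (1 - f) * x

    print(mse_grad(f=0.999, y=1))   # close to target -> ~0
    print(mse_grad(f=0.001, y=1))   # far from target -> ~0 as well (stuck)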

14 Cross Entropy v.s. Square Error
(Figure: the total loss plotted over the parameters w1 and w2. With cross entropy the surface is steep even when the parameters are far from the minimum, while with square error the surface is very flat both near and far from the minimum, so gradient descent makes little progress when it starts far away.)

15 Logistic Regression Linear Regression
Step 1:
  Logistic regression: f_{w,b}(x) = σ(Σ_i w_i x_i + b); output between 0 and 1.
  Linear regression: f_{w,b}(x) = Σ_i w_i x_i + b; output can be any value.
(Note: the standard logistic function is the logistic function with parameters k = 1, x0 = 0, L = 1, which yields the sigmoid.)
Steps 2 and 3 are compared on the following slides.

16 Logistic Regression Linear Regression
Step 1: as above (logistic regression outputs a value between 0 and 1; linear regression outputs any value).
Step 2:
  Logistic regression: training data (x^n, y^n), y^n: 1 for class 1, 0 for class 2; L(f) = Σ_n l(f(x^n), y^n), where l is the cross entropy: l(f(x^n), y^n) = −[y^n ln f(x^n) + (1 − y^n) ln(1 − f(x^n))].
  Linear regression: training data (x^n, y^n), y^n: a real number; L(f) = ½ Σ_n (f(x^n) − y^n)².

17 Logistic Regression Linear Regression
Step 3: the gradient-descent update turns out to be identical for both:
  Logistic regression: w_i ← w_i − η Σ_n −(y^n − f_{w,b}(x^n)) x_i^n
  Linear regression: w_i ← w_i − η Σ_n −(y^n − f_{w,b}(x^n)) x_i^n

18 Discriminative v.s. Generative
P(C_1|x) = σ(w·x + b)
  Discriminative: directly find w and b (logistic regression).
  Generative: find μ¹, μ², Σ⁻¹, then
    wᵀ = (μ¹ − μ²)ᵀ Σ⁻¹
    b = −½ (μ¹)ᵀ Σ⁻¹ μ¹ + ½ (μ²)ᵀ Σ⁻¹ μ² + ln(N₁/N₂)
Will we obtain the same set of w and b?
They share the same model (function set), but a different function may be selected from it given the same training data.
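A sketch of how the generative solution turns the estimated Gaussian parameters into w and b (assuming the shared-covariance formula above; names are illustrative):

    import numpy as np

    def generative_w_b(mu1, mu2, Sigma, N1, N2):
        Sigma_inv = np.linalg.inv(Sigma)
        w = (mu1 - mu2) @ Sigma_inv                  # w^T = (mu1 - mu2)^T Sigma^{-1}
        b = (-0.5 * mu1 @ Sigma_inv @ mu1
             + 0.5 * mu2 @ Sigma_inv @ mu2
             + np.log(N1 / N2))
        return w, b                                  # then P(C1|x) = sigma(w . x + b)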

19 Generative v.s. Discriminative
On the Pokémon example with all features (HP, Att, SP Att, Def, SP Def, Speed): the generative model reaches 73% accuracy, while the discriminative model reaches 79% accuracy.

20 Generative v.s. Discriminative
Example. Training data (two binary features):
  x = (1, 1): 1 example, class 1
  x = (1, 0): 4 examples, class 2
  x = (0, 1): 4 examples, class 2
  x = (0, 0): 4 examples, class 2
Testing data: x = (1, 1). Class 1 or class 2?
How about Naive Bayes, i.e. assuming P(x|C_i) = P(x_1|C_i) P(x_2|C_i)? (The rest of the "model" is filled in by the method's own assumptions.)

21 Generative v.s. Discriminative
Example (continued). From the training data above, Naive Bayes estimates (the model filled in by its own assumptions):
  P(C_1) = 1/13    P(x_1=1|C_1) = 1     P(x_2=1|C_1) = 1
  P(C_2) = 12/13   P(x_1=1|C_2) = 1/3   P(x_2=1|C_2) = 1/3

22
For the testing point x = (1, 1):
  P(C_1|x) = P(x|C_1) P(C_1) / [P(x|C_1) P(C_1) + P(x|C_2) P(C_2)]
           = (1 × 1 × 1/13) / (1 × 1 × 1/13 + 1/3 × 1/3 × 12/13)
           < 0.5
So the Naive Bayes (generative) model assigns the test point to class 2, even though every identical training example belongs to class 1.
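The posterior on this slide can be checked numerically (a sketch of the same calculation in Python):

    P_C1, P_C2 = 1/13, 12/13
    # Naive Bayes: P(x|Ci) = P(x1=1|Ci) * P(x2=1|Ci) for the test point (1, 1)
    P_x_C1 = 1.0 * 1.0
    P_x_C2 = (1/3) * (1/3)
    posterior = P_x_C1 * P_C1 / (P_x_C1 * P_C1 + P_x_C2 * P_C2)
    print(posterior)   # 3/7 ≈ 0.43 < 0.5, so Naive Bayes predicts class 2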

23 Generative v.s. Discriminative
Usually people believe the discriminative model is better.
Benefits of the generative model:
  With an assumed probability distribution, less training data is needed and the model is more robust to noise.
  Priors and class-dependent probabilities can be estimated from different sources (for example, in speech recognition the prior over word sequences can be estimated from text alone).

24 Multi-class Classification
(3 classes as an example)
  C1: w¹, b₁   z₁ = w¹·x + b₁
  C2: w², b₂   z₂ = w²·x + b₂
  C3: w³, b₃   z₃ = w³·x + b₃
Apply softmax: y_i = exp(z_i) / Σ_j exp(z_j). The outputs behave like probabilities: 1 > y_i > 0 and Σ_i y_i = 1.
Example: z = (3, 1, −3) → exp(z) ≈ (20, 2.7, 0.05) → after normalization y ≈ (0.88, 0.12, ≈0); the class with the largest y_i is predicted.
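A minimal softmax sketch reproducing the numbers on this slide (NumPy):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))   # subtract the max for numerical stability
        return e / e.sum()

    print(softmax(np.array([3.0, 1.0, -3.0])))   # ≈ [0.88, 0.12, 0.00]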

25 Multi-class Classification
[Bishop, P ] Multi-class classification (3 classes as an example):
  z₁ = w¹·x + b₁, z₂ = w²·x + b₂, z₃ = w³·x + b₃, followed by softmax to obtain y = (y₁, y₂, y₃).
Loss: the cross entropy between the softmax output y and the target ŷ, −Σ_{i=1}^{3} ŷ_i ln y_i.
Targets (one-hot):
  If x ∈ class 1: ŷ = [1, 0, 0]ᵀ, so the loss is −ln y₁
  If x ∈ class 2: ŷ = [0, 1, 0]ᵀ, so the loss is −ln y₂
  If x ∈ class 3: ŷ = [0, 0, 1]ᵀ, so the loss is −ln y₃
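The multi-class cross entropy between the one-hot target and the softmax output can be sketched as (illustrative names):

    import numpy as np

    def cross_entropy(y_hat, y):
        # -sum_i y_hat_i * ln(y_i); with a one-hot y_hat this is -ln y_(true class)
        return -np.sum(y_hat * np.log(y))

    y = np.array([0.88, 0.12, 0.002])              # softmax output from the previous slide
    print(cross_entropy(np.array([1, 0, 0]), y))   # -ln 0.88 if x belongs to class 1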

26 Limitation of Logistic Regression
Can logistic regression separate the following data?
Input features x₁, x₂ and label:
  x = (0, 0): class 2
  x = (0, 1): class 1
  x = (1, 0): class 1
  x = (1, 1): class 2
The boundary of logistic regression is a straight line (z ≥ 0 on one side, z < 0 on the other), and no single line can put the two class-1 points on one side and the two class-2 points on the other.
(Speaker note: The preceding example employed typical "opportunistic," or found, data. But even data generated by a designed experiment need external information. A DoD project from the early days of neural networks attempted to distinguish aerial images of forests with and without tanks in them. Perfect performance was achieved on the training set, and then on an out-of-sample set of data that had been gathered at the same time but not used for training. This was celebrated but, wisely, a confirming study was performed. New images were collected, on which the models performed extremely poorly. This drove investigation into the features driving the models and revealed them to be magnitude readings from specific locations of the images, i.e., background pixels. It turns out that the day the tanks had been photographed was sunny, and the day for non-tanks, cloudy! Even resampling the original data wouldn't have protected against this error, as the flaw was inherent in the generating experiment. PBS featured this project in a 1991 documentary series, The Machine That Changed the World: Episode IV, "The Thinking Machine.")

27 Limitation of Logistic Regression
Feature transformation: define new features from the original ones, e.g.
  x₁′: the distance from x to (0, 0)
  x₂′: the distance from x to (1, 1)
After the transformation, the two class-2 points (0,0) and (1,1) map to opposite corners of the (x₁′, x₂′) plane while both class-1 points map to the same point between them, so the data become linearly separable.
Finding a good transformation is not always easy; domain knowledge can be helpful.
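A small sketch of this hand-crafted transformation (assuming Euclidean distance, which is one reading of the figure):

    import numpy as np

    points = {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2}   # input -> class label
    for (x1, x2), c in points.items():
        x = np.array([x1, x2])
        x1p = np.linalg.norm(x - np.array([0, 0]))   # distance to (0, 0)
        x2p = np.linalg.norm(x - np.array([1, 1]))   # distance to (1, 1)
        print((x1, x2), "->", (round(x1p, 2), round(x2p, 2)), "class", c)
    # Class 1 maps to (1, 1); class 2 maps to (0, 1.41) and (1.41, 0),
    # so a straight line such as x1' + x2' = 1.7 now separates the two classes.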

28 Limitation of Logistic Regression
Instead of hand-crafting the transformation, we can cascade logistic regression models: some logistic regression units produce the transformed features x₁′ and x₂′, and another logistic regression unit performs the classification on top of them (bias terms are ignored in this figure).

29
(Figure: the two feature-transforming logistic units in action. With example weights such as (−2, 2) and (2, −2) and biases of −1, the four input points are mapped to x₁′ and x₂′ values of σ(1) ≈ 0.73, σ(−1) ≈ 0.27, and σ(−3) ≈ 0.05.)

30
(Figure: in the (x₁′, x₂′) plane the transformed points lie at (0.73, 0.05), (0.27, 0.27), and (0.05, 0.73); the two class-1 inputs map to the two outer points and both class-2 inputs map to (0.27, 0.27), so a single straight line can now separate the classes.)

31 Feature Transformation
All the parameters of the cascaded logistic regressions are learned jointly. Each logistic regression unit is called a "neuron", and the cascade of feature transformation followed by classification is a neural network. This is deep learning!
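A minimal sketch of such a cascade on the XOR-style data from the earlier slides (the hidden-layer weights are reconstructed from the figure; the output-layer weights are an arbitrary choice that separates the transformed points, whereas in practice all parameters would be learned jointly):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Feature-transformation layer: two logistic units.
    W1 = np.array([[-2.0,  2.0],
                   [ 2.0, -2.0]])
    b1 = np.array([-1.0, -1.0])

    # Classification layer: one logistic unit (hand-picked, illustrative).
    w2 = np.array([10.0, 10.0])
    b2 = -6.6

    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        h = sigmoid(W1 @ np.array(x) + b1)   # (x1', x2')
        p = sigmoid(w2 @ h + b2)             # P(class 1 | x)
        print(x, "->", np.round(h, 2), "P(C1|x) =", round(float(p), 2))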

32 Reference Bishop: Chapter 4.3

33 Acknowledgement
Thanks to 林恩妤 for spotting errors on the slides.

34 Appendix
Notes: Newton's method; gradient descent from another point of view; why not MSE; multi-class; limitations; discriminative v.s. generative.

35 Three Steps
Training data: (x^1, y^1), (x^2, y^2), (x^3, y^3), …, where y^n is class 1 or class 2.
Step 1. Function Set (Model): given a feature vector x, output P(C_1|x) = σ(w·x + b); if P(C_1|x) > 0.5, output y = class 1, otherwise output y = class 2. (In the generative approach, w and b are computed from N₁, N₂, μ¹, μ², Σ.)
Step 2. Goodness of a function: ideally the number of times f gets incorrect results on the training data, L(f) = Σ_n δ(f(x^n) ≠ y^n); in practice a surrogate L(f) = Σ_n l(f(x^n), y^n) is used.
Step 3. Find the best function: gradient descent.

36 Step 2: Loss function
Now write the classifier as f_{w,b}(x) = class 1 if z ≥ 0 and class 2 if z < 0, and label the classes as y^n = +1 (class 1) and y^n = −1 (class 2).
Ideal loss: L(f) = Σ_n δ(f(x^n) ≠ y^n), i.e. the 0/1 error on the training data.
Approximation: L(f) = Σ_n l(f(x^n), y^n), where l(·) is an upper bound of δ(·).
(Speaker note: I would call this figure a "plot of loss functions", drawn as a function of y^n z^n.)

37 Step 2: Loss function, with l(f(x^n), y^n) taken as the cross entropy
(When the ground truth is y^n = +1, minimizing the cross entropy pushes f(x^n) toward 1.0; when y^n = −1, it pushes 1 − f(x^n) toward 1.0.)
If y^n = +1:
  l(f(x^n), y^n) = −ln f(x^n) = −ln σ(z^n) = −ln(1 / (1 + exp(−z^n))) = ln(1 + exp(−z^n)) = ln(1 + exp(−y^n z^n))
If y^n = −1:
  l(f(x^n), y^n) = −ln(1 − f(x^n)) = −ln(1 − σ(z^n)) = −ln(exp(−z^n) / (1 + exp(−z^n))) = −ln(1 / (1 + exp(z^n))) = ln(1 + exp(z^n)) = ln(1 + exp(−y^n z^n))
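A quick numerical check (illustrative) that the ±1-label form ln(1 + exp(−y^n z^n)) really matches the two cross-entropy cases above:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for z in [-2.0, 0.5, 3.0]:
        # y = +1: -ln sigma(z);  y = -1: -ln(1 - sigma(z))
        print(np.isclose(-np.log(sigmoid(z)),     np.log(1 + np.exp(-(+1) * z))),
              np.isclose(-np.log(1 - sigmoid(z)), np.log(1 + np.exp(-(-1) * z))))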

38 Step 2: Loss function, l(f(x^n), y^n): cross entropy
l(f(x^n), y^n) = ln(1 + exp(−y^n z^n))
(Figure: plotted against y^n z^n, this loss, divided by ln 2 so that it equals 1 at y^n z^n = 0, is an upper bound of the ideal loss δ(f(x^n) ≠ y^n).)

