Classification: Logistic Regression


Classification: Logistic Regression
Hung-yi Lee 李宏毅

Good references:
http://www.win-vector.com/blog/2011/09/the-simpler-derivation-of-logistic-regression/
http://www.cs.columbia.edu/~smaskey/CS6998/slides/statnlp_week6.pdf
http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf
Newton's method: http://qwone.com/~jason/writing/lr.pdf

Outline notes: gradient descent (another point of view), why not MSE, multi-class, limitations, discriminative v.s. generative.

About grouping: homework is submitted individually; only the final project requires a group. It is fine if you cannot find teammates. After the final project is announced, the TAs will help match students who have not found a group.

Step 1: Function Set
If $P_{w,b}(C_1|x) \ge 0.5$, output class 1; if $P_{w,b}(C_1|x) < 0.5$, output class 2.
$$P_{w,b}(C_1|x) = \sigma(z), \quad z = w \cdot x + b = \sum_i w_i x_i + b, \quad \sigma(z) = \frac{1}{1+\exp(-z)}$$
The function set includes all different $w$ and $b$.
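
A minimal numerical sketch of this function set and the decision rule (not from the original slides; the feature dimension and the parameter and input values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def f_wb(x, w, b):
    # P_{w,b}(C1 | x) = sigma(w . x + b)
    return sigmoid(np.dot(w, x) + b)

# hypothetical parameters and input, just to show the decision rule
w = np.array([0.5, -1.2])
b = 0.1
x = np.array([2.0, 0.3])
p_c1 = f_wb(x, w, b)
print(p_c1, "class 1" if p_c1 >= 0.5 else "class 2")
```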

Step 1: Function Set
[Figure: each input $x_i$ is multiplied by its weight $w_i$; the weighted inputs are summed with the bias $b$ to give $z$, which is passed through the Sigmoid Function (used here as the activation function) to produce $P_{w,b}(C_1|x) = \sigma(z)$.]

Step 2: Goodness of a Function
Training data: $x^1$ ($C_1$), $x^2$ ($C_1$), $x^3$ ($C_2$), ......, $x^N$ ($C_1$). Assume the data is generated based on $f_{w,b}(x) = P_{w,b}(C_1|x)$. Given a set of $w$ and $b$, what is its probability of generating the data?
$$L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \big(1 - f_{w,b}(x^3)\big) \cdots f_{w,b}(x^N)$$
(All $w$ and $b$ give non-zero probability.) The most likely $w^*$ and $b^*$ is the one with the largest $L(w,b)$:
$$w^*, b^* = \arg\max_{w,b} L(w,b)$$

Relabel the training data with $y^n$: 1 for class 1, 0 for class 2 (so $y^1 = 1$, $y^2 = 1$, $y^3 = 0$, ......). Then
$$w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} -\ln L(w,b)$$
$$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\big(1 - f_{w,b}(x^3)\big) - \cdots$$
and each term can be written uniformly as
$$-\big[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\big(1 - f_{w,b}(x^n)\big)\big].$$

Step 2: Goodness of a Function
$$L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \big(1 - f_{w,b}(x^3)\big) \cdots f_{w,b}(x^N)$$
$$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\big(1 - f_{w,b}(x^3)\big) - \cdots$$
With $y^n$: 1 for class 1, 0 for class 2, this becomes
$$-\ln L(w,b) = \sum_n -\big[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\big(1 - f_{w,b}(x^n)\big)\big],$$
the sum of cross entropies between two Bernoulli distributions. Where does it come from? Distribution $p$: $p(x=1) = y^n$, $p(x=0) = 1 - y^n$. Distribution $q$: $q(x=1) = f(x^n)$, $q(x=0) = 1 - f(x^n)$. Cross entropy: $H(p,q) = -\sum_x p(x)\ln q(x)$.
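
As a sketch, the negative log-likelihood above can be evaluated as a sum of per-example cross entropies; the labels and model outputs below are hypothetical placeholders:

```python
import numpy as np

def cross_entropy_loss(f, y, eps=1e-12):
    # -ln L(w,b) = sum_n -[ y^n ln f(x^n) + (1 - y^n) ln(1 - f(x^n)) ]
    f = np.clip(f, eps, 1 - eps)          # avoid log(0)
    return -np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))

y = np.array([1.0, 1.0, 0.0])             # y^n: 1 for class 1, 0 for class 2
f = np.array([0.9, 0.8, 0.3])             # f_{w,b}(x^n) for the three examples
print(cross_entropy_loss(f, y))           # equals -ln[ f(x1) f(x2) (1 - f(x3)) ]
```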

Step 2: Goodness of a Function
$$-\ln L(w,b) = \sum_n -\big[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\big(1 - f_{w,b}(x^n)\big)\big]$$
This is the cross entropy between the ground truth and the model output. For example, when the ground truth is $y^n = 1$, the target distribution puts probability 1.0 on class 1 and 0.0 on class 2, while the model outputs $\big(f(x^n),\, 1 - f(x^n)\big)$; minimizing the cross entropy pushes $f(x^n)$ toward 1.

Step 3: Find the best function
Differentiate
$$-\ln L(w,b) = \sum_n -\big[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\big(1 - f_{w,b}(x^n)\big)\big]$$
with respect to $w_i$, using $f_{w,b}(x) = \sigma(z)$, $z = w\cdot x + b = \sum_i w_i x_i + b$, $\sigma(z) = \frac{1}{1+\exp(-z)}$. For the first term:
$$\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i}, \qquad \frac{\partial z}{\partial w_i} = x_i,$$
$$\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\,\sigma(z)\big(1 - \sigma(z)\big) = 1 - \sigma(z),$$
so the first term contributes $\big(1 - f_{w,b}(x^n)\big)x_i^n$.

Step 3: Find the best function
For the second term:
$$\frac{\partial \ln\big(1 - f_{w,b}(x)\big)}{\partial w_i} = \frac{\partial \ln\big(1 - f_{w,b}(x)\big)}{\partial z}\frac{\partial z}{\partial w_i}, \qquad \frac{\partial z}{\partial w_i} = x_i,$$
$$\frac{\partial \ln\big(1 - \sigma(z)\big)}{\partial z} = -\frac{1}{1 - \sigma(z)}\frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)}\,\sigma(z)\big(1 - \sigma(z)\big) = -\sigma(z),$$
so the second term contributes $-f_{w,b}(x^n)\,x_i^n$.

Step 3: Find the best function
Putting the two pieces together:
$$\frac{\partial\big(-\ln L(w,b)\big)}{\partial w_i} = \sum_n -\big[y^n\big(1 - f_{w,b}(x^n)\big)x_i^n - (1 - y^n)f_{w,b}(x^n)x_i^n\big]$$
$$= \sum_n -\big[y^n - y^n f_{w,b}(x^n) - f_{w,b}(x^n) + y^n f_{w,b}(x^n)\big]x_i^n = \sum_n -\big(y^n - f_{w,b}(x^n)\big)x_i^n$$
The larger the difference between the target and the output, the larger the update:
$$w_i \leftarrow w_i - \eta \sum_n -\big(y^n - f_{w,b}(x^n)\big)x_i^n$$
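
A sketch of the resulting gradient-descent step; the training data, learning rate, and initialization below are placeholders, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, y, w, b, eta):
    # dL/dw_i = sum_n -(y^n - f_{w,b}(x^n)) x_i^n, and similarly for b
    f = sigmoid(X @ w + b)
    err = -(y - f)                    # larger difference -> larger update
    w_new = w - eta * (X.T @ err)
    b_new = b - eta * np.sum(err)
    return w_new, b_new

X = np.array([[1.0, 2.0], [2.0, 0.5], [0.2, 0.1]])   # placeholder training inputs
y = np.array([1.0, 1.0, 0.0])                        # 1 for class 1, 0 for class 2
w, b = np.zeros(2), 0.0
for _ in range(100):
    w, b = gradient_step(X, y, w, b, eta=0.1)
print(w, b, sigmoid(X @ w + b))
```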

Logistic Regression + Square Error
Step 1: $f_{w,b}(x) = \sigma\big(\sum_i w_i x_i + b\big)$.
Step 2: Training data $(x^n, y^n)$, $y^n$: 1 for class 1, 0 for class 2; $L(f) = \frac{1}{2}\sum_n \big(f_{w,b}(x^n) - y^n\big)^2$.
Step 3:
$$\frac{\partial\big(f_{w,b}(x) - y\big)^2}{\partial w_i} = 2\big(f_{w,b}(x) - y\big)\frac{\partial f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i} = 2\big(f_{w,b}(x) - y\big)\,f_{w,b}(x)\big(1 - f_{w,b}(x)\big)x_i$$
For $y^n = 1$: if $f_{w,b}(x^n) = 1$ (close to target), $\partial L/\partial w_i = 0$; if $f_{w,b}(x^n) = 0$ (far from target), $\partial L/\partial w_i = 0$ as well.

Logistic Regression + Square Error
Step 1: $f_{w,b}(x) = \sigma\big(\sum_i w_i x_i + b\big)$.
Step 2: Training data $(x^n, y^n)$, $y^n$: 1 for class 1, 0 for class 2; $L(f) = \frac{1}{2}\sum_n \big(f_{w,b}(x^n) - y^n\big)^2$.
Step 3:
$$\frac{\partial\big(f_{w,b}(x) - y\big)^2}{\partial w_i} = 2\big(f_{w,b}(x) - y\big)\,f_{w,b}(x)\big(1 - f_{w,b}(x)\big)x_i$$
For $y^n = 0$: if $f_{w,b}(x^n) = 1$ (far from target), $\partial L/\partial w_i = 0$; if $f_{w,b}(x^n) = 0$ (close to target), $\partial L/\partial w_i = 0$. So with square error the gradient vanishes even when the prediction is far from the target.
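
A quick numerical check of the vanishing-gradient behaviour described above, using the square-error gradient formula from the slide; the input value $x_i$ and the probed $f(x)$ values are illustrative:

```python
def mse_grad_wrt_wi(f_x, y, x_i):
    # d(f(x) - y)^2 / dw_i = 2 (f(x) - y) f(x) (1 - f(x)) x_i
    return 2 * (f_x - y) * f_x * (1 - f_x) * x_i

x_i = 1.0
for y in (1.0, 0.0):
    for f_x in (0.9999, 0.0001):      # output close to 1 or close to 0
        print(f"y={y}, f(x)={f_x}: grad ~ {mse_grad_wrt_wi(f_x, y, x_i):.6f}")
# the gradient is ~0 both near the target and far from it, so training stalls
```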

Cross Entropy v.s. Square Error
[Figure: total loss plotted over the parameters $w_1$ and $w_2$. The cross-entropy surface is steep away from the minimum, while the square-error surface is flat there, which makes gradient descent slow to get started.]
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

Logistic Regression v.s. Linear Regression
Step 1: Logistic regression: $f_{w,b}(x) = \sigma\big(\sum_i w_i x_i + b\big)$; output between 0 and 1. Linear regression: $f_{w,b}(x) = \sum_i w_i x_i + b$; output can be any value. (The standard logistic function is the logistic function with parameters $k = 1$, $x_0 = 0$, $L = 1$, which yields the sigmoid.)

Logistic Regression v.s. Linear Regression
Step 2: Logistic regression: training data $(x^n, y^n)$ with $y^n$: 1 for class 1, 0 for class 2; loss $L(f) = \sum_n l\big(f(x^n), y^n\big)$, with the cross entropy $l\big(f(x^n), y^n\big) = -\big[y^n \ln f(x^n) + (1 - y^n)\ln\big(1 - f(x^n)\big)\big]$. Linear regression: training data $(x^n, y^n)$ with $y^n$ a real number; loss $L(f) = \frac{1}{2}\sum_n \big(f(x^n) - y^n\big)^2$.

Logistic Regression v.s. Linear Regression
Step 3: Both end up with the same update rule.
Logistic regression: $w_i \leftarrow w_i - \eta \sum_n -\big(y^n - f_{w,b}(x^n)\big)x_i^n$
Linear regression: $w_i \leftarrow w_i - \eta \sum_n -\big(y^n - f_{w,b}(x^n)\big)x_i^n$

Discriminative v.s. Generative
Both approaches use the same model (function set): $P(C_1|x) = \sigma(w\cdot x + b)$.
Discriminative: directly find $w$ and $b$ (e.g. logistic regression).
Generative: find $\mu^1$, $\mu^2$, $\Sigma^{-1}$, then
$$w^T = (\mu^1 - \mu^2)^T\,\Sigma^{-1}, \qquad b = -\tfrac{1}{2}(\mu^1)^T\Sigma^{-1}\mu^1 + \tfrac{1}{2}(\mu^2)^T\Sigma^{-1}\mu^2 + \ln\frac{N_1}{N_2}$$
Will we obtain the same set of $w$ and $b$? It is the same model (function set), but a different function may be selected from it by the same training data.
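
A sketch of the generative route to $w$ and $b$ under a shared covariance, evaluating the formulas above; the class means, covariance, and counts below are synthetic, not from the lecture:

```python
import numpy as np

def generative_w_b(mu1, mu2, sigma, n1, n2):
    # w^T = (mu1 - mu2)^T Sigma^{-1}
    # b   = -1/2 mu1^T Sigma^{-1} mu1 + 1/2 mu2^T Sigma^{-1} mu2 + ln(N1/N2)
    sigma_inv = np.linalg.inv(sigma)
    w = sigma_inv @ (mu1 - mu2)
    b = (-0.5 * mu1 @ sigma_inv @ mu1
         + 0.5 * mu2 @ sigma_inv @ mu2
         + np.log(n1 / n2))
    return w, b

mu1 = np.array([2.0, 1.0])                    # synthetic class-1 mean
mu2 = np.array([0.0, 0.0])                    # synthetic class-2 mean
sigma = np.array([[1.0, 0.2], [0.2, 1.5]])    # shared covariance
w, b = generative_w_b(mu1, mu2, sigma, n1=40, n2=60)
print(w, b)    # then P(C1|x) = sigmoid(w . x + b)
```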

Generative v.s. Discriminative
Using all the features (HP, Att, SP Att, De, SP De, Speed) on the earlier classification example, the generative model reaches 73% accuracy while the discriminative model reaches 79% accuracy.

Generative v.s. Discriminative
Example. Each example has two binary features $x_1, x_2$. Training data: one example of class 1 with $(x_1, x_2) = (1, 1)$; four examples of class 2 with $(1, 0)$; four with $(0, 1)$; and four with $(0, 0)$. Testing data: $(1, 1)$. Class 1 or class 2? How about Naïve Bayes, which assumes $P(x|C_i) = P(x_1|C_i)\,P(x_2|C_i)$? (This is a model that "imagines" combinations it has never seen.)

Generative v.s. Discriminative
Example (continued). Estimating the Naïve Bayes model from the training data above:
$$P(C_1) = \frac{1}{13}, \quad P(x_1 = 1|C_1) = 1, \quad P(x_2 = 1|C_1) = 1$$
$$P(C_2) = \frac{12}{13}, \quad P(x_1 = 1|C_2) = \frac{1}{3}, \quad P(x_2 = 1|C_2) = \frac{1}{3}$$

For the testing example $(x_1, x_2) = (1, 1)$:
$$P(C_1|x) = \frac{P(x|C_1)P(C_1)}{P(x|C_1)P(C_1) + P(x|C_2)P(C_2)} = \frac{1\times 1\times\frac{1}{13}}{1\times 1\times\frac{1}{13} + \frac{1}{3}\times\frac{1}{3}\times\frac{12}{13}} < 0.5$$
so Naïve Bayes assigns the testing example to class 2, even though its features exactly match the only class-1 training example.
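
The arithmetic above can be checked directly; this sketch hard-codes the counts from the toy example:

```python
# Toy Naive Bayes check for the 13-example training set above
p_c1, p_c2 = 1 / 13, 12 / 13
p_x1_c1, p_x2_c1 = 1.0, 1.0        # P(x1=1|C1), P(x2=1|C1)
p_x1_c2, p_x2_c2 = 1 / 3, 1 / 3    # P(x1=1|C2), P(x2=1|C2)

# testing example x = (1, 1)
joint_c1 = p_x1_c1 * p_x2_c1 * p_c1
joint_c2 = p_x1_c2 * p_x2_c2 * p_c2
posterior_c1 = joint_c1 / (joint_c1 + joint_c2)
print(posterior_c1)                # ~0.43 < 0.5, so Naive Bayes picks class 2
```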

Generative v.s. Discriminative
Usually people believe the discriminative model is better. Benefits of the generative model: with the assumption of a probability distribution, less training data is needed and the model is more robust to noise; also, priors and class-dependent probabilities can be estimated from different sources. Speech recognition is an example of this separation: the prior over word sequences (the language model) can be estimated from text alone, separately from the acoustic model.

Multi-class Classification (3 classes as example)
$C_1$: $w^1, b_1$, $z_1 = w^1\cdot x + b_1$; $C_2$: $w^2, b_2$, $z_2 = w^2\cdot x + b_2$; $C_3$: $w^3, b_3$, $z_3 = w^3\cdot x + b_3$.
Softmax: $y_i = \frac{e^{z_i}}{\sum_{j=1}^3 e^{z_j}}$, so $1 > y_i > 0$ and $\sum_i y_i = 1$, and $y_i$ can be read as the probability of class $i$.
Example: $z = (3, 1, -3)$ gives $e^{z} \approx (20, 2.7, 0.05)$ and, after normalization, $y \approx (0.88, 0.12, \approx 0)$.

Multi-class Classification (3 classes as example) [Bishop, P209-210]
From the input $x$, compute $z_1 = w^1\cdot x + b_1$, $z_2 = w^2\cdot x + b_2$, $z_3 = w^3\cdot x + b_3$, pass them through softmax to get $y = (y_1, y_2, y_3)$, and minimize the cross entropy to the one-hot target $\hat y$:
$$-\sum_{i=1}^3 \hat y_i \ln y_i$$
Targets: if $x \in$ class 1, $\hat y = (1, 0, 0)^T$ and the loss is $-\ln y_1$; if $x \in$ class 2, $\hat y = (0, 1, 0)^T$ and the loss is $-\ln y_2$; if $x \in$ class 3, $\hat y = (0, 0, 1)^T$ and the loss is $-\ln y_3$.
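
A sketch of softmax followed by the cross-entropy loss, reproducing the $z = (3, 1, -3)$ numbers from the previous slide; the one-hot target is chosen arbitrarily for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))          # subtract max for numerical stability
    return e / np.sum(e)

def cross_entropy(y_hat, y):
    # -sum_i y_hat_i ln y_i, with y_hat a one-hot target
    return -np.sum(y_hat * np.log(y + 1e-12))

z = np.array([3.0, 1.0, -3.0])
y = softmax(z)                         # ~ [0.88, 0.12, 0.00]
y_hat = np.array([1.0, 0.0, 0.0])      # target: x belongs to class 1
print(y, cross_entropy(y_hat, y))      # loss = -ln y_1
```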

Limitation of Logistic Regression
Consider two binary input features with the XOR labelling: $(0,0)$ and $(1,1)$ belong to one class, $(0,1)$ and $(1,0)$ to the other. Can logistic regression handle this? Its decision boundary is the straight line $z = w_1 x_1 + w_2 x_2 + b = 0$, with $z \ge 0$ on one side and $z < 0$ on the other, and no single straight line separates these four points.
[Speaker note on data pitfalls:] The preceding example employed typical "opportunistic," or found, data. But even data generated by a designed experiment need external information. A DoD project from the early days of neural networks attempted to distinguish aerial images of forests with and without tanks in them. Perfect performance was achieved on the training set, and then on an out-of-sample set of data that had been gathered at the same time but not used for training. This was celebrated but, wisely, a confirming study was performed. New images were collected, on which the models performed extremely poorly. This drove investigation into the features driving the models and revealed them to be magnitude readings from specific locations of the images, i.e., background pixels. It turns out that the day the tanks had been photographed was sunny, and the day the non-tanks were photographed was cloudy. Even resampling the original data would not have protected against this error, as the flaw was inherent in the generating experiment. PBS featured this project in a 1991 documentary series, The Machine That Changed the World, Episode IV, "The Thinking Machine."

Limitation of Logistic Regression
Feature transformation: let $x_1'$ be the distance of $(x_1, x_2)$ to $(0, 0)$ and $x_2'$ the distance to $(1, 1)$. Then $(0,0) \mapsto (0, \sqrt 2)$, $(1,1) \mapsto (\sqrt 2, 0)$, and both $(0,1)$ and $(1,0) \mapsto (1, 1)$, so the two classes become linearly separable in the $(x_1', x_2')$ space. Finding a good transformation is not always easy; domain knowledge can be helpful.
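
A sketch of this hand-crafted transformation, assuming "distance" means Euclidean distance (which is how I read the slide):

```python
import numpy as np

def transform(x):
    # x1' = distance to (0, 0), x2' = distance to (1, 1)
    x = np.asarray(x, dtype=float)
    return np.array([np.linalg.norm(x - np.array([0.0, 0.0])),
                     np.linalg.norm(x - np.array([1.0, 1.0]))])

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", np.round(transform(x), 3))
# (0,1) and (1,0) both map to (1, 1); (0,0) and (1,1) map to (0, 1.414) and (1.414, 0),
# so a single line in (x1', x2') space can now separate the two classes
```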

Limitation of Logistic Regression
Cascading logistic regression models: instead of hand-crafting the transformation, let two other logistic regressions produce $x_1'$ and $x_2'$ (feature transformation), and feed them into a final logistic regression (classification). (Bias terms are ignored in this figure.)

[Figure: one first-stage logistic regression with weights $(-2, 2)$ and bias $-1$ produces $x_1'$, and another with weights $(2, -2)$ and bias $-1$ produces $x_2'$; on the four inputs the pre-activations take the values $1$, $-1$, and $-3$, so $x_1'$ takes the values 0.73, 0.27, 0.05 and $x_2'$ the values 0.05, 0.27, 0.73.]

[Figure: in the transformed $(x_1', x_2')$ space, $(0,1)$ and $(1,0)$ map to $(0.73, 0.05)$ and $(0.05, 0.73)$, while $(0,0)$ and $(1,1)$ both map to $(0.27, 0.27)$, so the two classes can now be separated by the final logistic regression.]

Feature Transformation + Classification: all the parameters of the cascaded logistic regressions are jointly learned. Each logistic regression is called a "Neuron", the cascaded structure is a Neural Network, and this is Deep Learning!
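
A sketch of the cascade: two logistic regressions do the feature transformation and a third does the classification. The first-stage weights follow the figure above; the final-stage weights are hypothetical values I chose so that the transformed points are separated, not values from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascade(x):
    x1p = sigmoid(np.dot([-2.0, 2.0], x) - 1.0)   # first "neuron"  (weights from the figure)
    x2p = sigmoid(np.dot([2.0, -2.0], x) - 1.0)   # second "neuron" (weights from the figure)
    # final logistic regression on (x1', x2'); these weights are hypothetical
    return sigmoid(-10.0 * x1p - 10.0 * x2p + 7.0)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", round(cascade(np.array(x, dtype=float)), 3))
# (0,0) and (1,1) come out on one side of 0.5, (0,1) and (1,0) on the other
```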

Reference Bishop: Chapter 4.3

Acknowledgement
Thanks to 林恩妤 for finding errors in the slides.

Appendix

Three Steps
Training data: $(x^1, y^1), (x^2, y^2), (x^3, y^3), \ldots$, where $x^n$ is the feature vector and $y^n \in \{\text{class 1}, \text{class 2}\}$ is the class.
Step 1. Function Set (Model): if $P(C_1|x) > 0.5$, output $y =$ class 1; otherwise, output $y =$ class 2, where $P(C_1|x) = \sigma(w\cdot x + b)$ (in the generative view, $w$ and $b$ are related to $N_1, N_2, \mu^1, \mu^2, \Sigma$).
Step 2. Goodness of a function: ideally, the number of times $f$ gets incorrect results on the training data, $L(f) = \sum_n \delta\big(f(x^n) \neq y^n\big)$, approximated by $L(f) = \sum_n l\big(f(x^n), y^n\big)$.
Step 3. Find the best function: gradient descent.

Step 2: Loss function
Here take $f_{w,b}(x) = +1$ (class 1) if $z \ge 0$ and $f_{w,b}(x) = -1$ (class 2) if $z < 0$, with labels $y^n \in \{+1, -1\}$.
Ideal loss: $L(f) = \sum_n \delta\big(f(x^n) \neq y^n\big)$, which is 0 or 1 per example. Approximation: $L(f) = \sum_n l\big(f(x^n), y^n\big)$, where $l(\cdot)$ is an upper bound of $\delta(\cdot)$.
[Figure: the ideal loss $\delta\big(f(x^n) \neq y^n\big)$ and its surrogates plotted against $y^n z^n$; I would call this figure a "plot of loss functions."]

Step 2: Loss function
$l\big(f(x^n), y^n\big)$: cross entropy. The ground-truth distribution puts probability 1.0 on the observed class: on $f(x^n)$ when $y^n = +1$, and on $1 - f(x^n)$ when $y^n = -1$.
If $y^n = +1$:
$$l\big(f(x^n), y^n\big) = -\ln f(x^n) = -\ln \sigma(z^n) = -\ln\frac{1}{1 + \exp(-z^n)} = \ln\big(1 + \exp(-z^n)\big) = \ln\big(1 + \exp(-y^n z^n)\big)$$
If $y^n = -1$:
$$l\big(f(x^n), y^n\big) = -\ln\big(1 - f(x^n)\big) = -\ln\big(1 - \sigma(z^n)\big) = -\ln\frac{\exp(-z^n)}{1 + \exp(-z^n)} = -\ln\frac{1}{1 + \exp(z^n)} = \ln\big(1 + \exp(z^n)\big) = \ln\big(1 + \exp(-y^n z^n)\big)$$

Step 2: Loss function
$l\big(f(x^n), y^n\big)$: cross entropy, $l\big(f(x^n), y^n\big) = \ln\big(1 + \exp(-y^n z^n)\big)$.
[Figure: this curve plotted against $y^n z^n$, together with the ideal loss $\delta\big(f(x^n) \neq y^n\big)$. The curve is divided by $\ln 2$ here so that it passes through 1 at $y^n z^n = 0$ and upper-bounds the ideal loss.]
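
A sketch of the per-example loss in the $\pm 1$ label convention, checking numerically that both branches of the derivation above reduce to $\ln\big(1 + \exp(-y^n z^n)\big)$; the probed $z$ values are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_pm1(z, y):
    # y in {+1, -1}; both cases reduce to ln(1 + exp(-y z))
    return np.log(1.0 + np.exp(-y * z))

for z in (-2.0, 0.0, 3.0):
    for y in (+1, -1):
        direct = -np.log(sigmoid(z)) if y == +1 else -np.log(1.0 - sigmoid(z))
        print(z, y, cross_entropy_pm1(z, y), direct)   # the two columns match
# dividing by ln 2 makes the curve pass through 1 at y z = 0, so it upper-bounds the 0/1 loss
```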