Deep Learning for Non-Linear Control


Deep Learning for Non-Linear Control. Shiyan Hu, Michigan Technological University.

The General Non-Linear Control System
We need to model the nonlinear dynamics of the plant. How?

An Example
An AC dynamics model relates the current outside temperature at time t, the required temperature at t, the available AC power levels at t, and the forecast outside temperature at t+1 to the AC power level at t+1, the actual temperature at t+1, and the electricity bill at t+1. In most cases, one cannot analytically model these dynamics; one can only use historical operation data to approximate them. This is called learning.

What is Learning?
Learning is finding a function f: for stock market forecasting, a function whose output is the Dow Jones Industrial Average tomorrow; for a self-driving car, a function whose output is the wheel control.

Learning Examples
Supervised learning: given inputs and target outputs, fit the model (e.g., classification). Translating one language into another goes beyond classification, since we cannot enumerate all possible sentences; this is called structured learning.
Unsupervised learning: what if we do not know the target outputs? That is, we do not know what we want to learn.
Reinforcement learning: a machine talks to a person, learns what works and what does not (only through a reward function), and gradually learns to speak. That is, it evolves by getting feedback. We do not feed it exact inputs and outputs, so it is not supervised; we still give it some feedback through the reward function, so it is not unsupervised.

Unsupervised Learning
Learn the meanings of words and sentences by reading documents.

Reinforcement Learning

Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean

History of Deep Learning
1958: Perceptron (linear model)
1986: Backpropagation
2006: RBM initialization
2011: Starts to become popular in speech recognition
2012: Wins the ILSVRC image competition
2015.2: Image recognition surpasses human-level performance
2016.3: AlphaGo beats Lee Sedol
2016.10: Speech recognition systems become as good as humans

Neural Network
A network of "neurons"; different connections lead to different network structures.

Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4%; VGG (2014): 19 layers, 7.3%; GoogleNet (2014): 6.7% (ILSVRC error rates; see http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf).

Deep = Many hidden layers
AlexNet (2012): 16.4%; VGG (2014): 7.3%; GoogleNet (2014): 6.7%; Residual Net (2015): 3.57%.

Supervised Learning
Statistical and signal processing techniques: linear regression, logistic regression, nonlinear regression.
Machine learning techniques: SVM, deep learning.

Learning Basics
You are given some training data. You learn a function/model from the training data, and then use this model to process testing data. Training data and testing data do not necessarily share the same properties.

Linear Regression: Input Data
Training data: $(x^1, y^1), (x^2, y^2), \ldots, (x^{10}, y^{10})$, i.e., pairs $(x^n, y^n)$ of real data. The model is $y = b + wx$, a linear function where $w$ and $b$ are scalars.

An Example
The training data are (10, 5), (20, 6), (30, 7), (40, 8), (50, 9), which give the equations $5 = b + 10w$, $6 = b + 20w$, $7 = b + 30w$, $8 = b + 40w$, $9 = b + 50w$. Compute the $b$ and $w$ that best fit these data.
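
Below is a minimal sketch of this small fit using NumPy's least-squares solver; the solver call and variable names are my own illustration, not part of the slides. For these five points the fit is exact.

```python
import numpy as np

# Training data from the example: (x, y) pairs.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

# Solve the over-determined system y = b + w*x in the least-squares sense.
A = np.column_stack([np.ones_like(x), x])    # columns: [1, x]
(b, w), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"b = {b:.3f}, w = {w:.3f}")           # the fit is exact here: b = 4, w = 0.1
```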

Linear Regression: Function to Learn
We want to compute scalars $w$ and $b$ such that $y$ is best approximated by $b + wx$. One can convert this into an optimization problem minimizing a loss function:
$L(w,b) = \sum_{n=1}^{10} \left(y^n - (b + w \cdot x^n)\right)^2$
$(w^*, b^*) = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10} \left(y^n - (b + w \cdot x^n)\right)^2$

Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$: (randomly) pick an initial value $w^0$ and compute $\frac{dL}{dw}\big|_{w=w^0}$. If the derivative is negative, increase $w$; if it is positive, decrease $w$.

Linear Regression: Gradient Descent
Then update $w^1 \leftarrow w^0 - \eta \frac{dL}{dw}\big|_{w=w^0}$, where $\eta$ is called the "learning rate".

Linear Regression: Gradient Descent
Compute $\frac{dL}{dw}\big|_{w=w^1}$ and update $w^2 \leftarrow w^1 - \eta \frac{dL}{dw}\big|_{w=w^1}$, and so on for many iterations, producing $w^0, w^1, w^2, \ldots, w^T$. This reaches a local optimum, not necessarily the global optimum.
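
A minimal sketch of this 1-D update rule; the quadratic loss $L(w) = (w-3)^2$, the learning rate, and the iteration count are assumed values for illustration, not taken from the slides.

```python
# Minimal 1-D gradient descent sketch for w* = argmin_w L(w).
# Hypothetical loss: L(w) = (w - 3)^2, whose minimum is at w = 3.

def L(w):
    return (w - 3.0) ** 2

def dL_dw(w):
    return 2.0 * (w - 3.0)

eta = 0.1        # learning rate
w = 0.0          # (randomly) picked initial value w^0
for t in range(100):
    w = w - eta * dL_dw(w)   # w^{t+1} <- w^t - eta * dL/dw |_{w = w^t}

print(w, L(w))   # w approaches 3 and L(w) approaches 0
```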

Momentum
Gradient descent still does not guarantee reaching the global minimum, but momentum gives some hope: the real movement is the negative of $\partial L/\partial w$ plus the momentum (the accumulated previous movement). Momentum can carry the parameters past points where $\partial L/\partial w = 0$.
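
A sketch of the momentum update described above, reusing the same hypothetical loss; the momentum coefficient of 0.9 is an assumed value, not one from the slides.

```python
# Gradient descent with momentum:
#   movement = beta * previous movement - eta * dL/dw
#   w        = w + movement

def dL_dw(w):
    return 2.0 * (w - 3.0)   # gradient of the hypothetical loss L(w) = (w - 3)^2

eta, beta = 0.1, 0.9         # learning rate and momentum coefficient (assumed values)
w, movement = 0.0, 0.0
for t in range(100):
    movement = beta * movement - eta * dL_dw(w)
    w = w + movement         # momentum can carry w past points where dL/dw = 0

print(w)                     # approaches 3
```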

Linear Regression: Gradient Descent
How about two parameters? $(w^*, b^*) = \arg\min_{w,b} L(w,b)$, with gradient $\nabla L = \left(\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right)$.
(Randomly) pick initial values $w^0, b^0$.
Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, then update $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$.
Compute $\frac{\partial L}{\partial w}\big|_{w=w^1, b=b^1}$ and $\frac{\partial L}{\partial b}\big|_{w=w^1, b=b^1}$, then update $w^2 \leftarrow w^1 - \eta \frac{\partial L}{\partial w}\big|_{w=w^1, b=b^1}$ and $b^2 \leftarrow b^1 - \eta \frac{\partial L}{\partial b}\big|_{w=w^1, b=b^1}$.

2D Gradient Descent
In the $(w, b)$ plane (color = value of the loss $L(w,b)$), each step computes $\frac{\partial L}{\partial b}$ and $\frac{\partial L}{\partial w}$ and moves by $\left(-\eta \frac{\partial L}{\partial b}, -\eta \frac{\partial L}{\partial w}\right)$.

Convex L
This is not an issue in linear regression, where the loss function $L$ is convex, so the optimum found by gradient descent is the global optimum.

Compute the Gradient
Formulas for $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$, where $L(w,b) = \sum_{n=1}^{10} \left(y^n - (b + w \cdot x^n)\right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(y^n - (b + w \cdot x^n)\right)(-x^n)$
$\frac{\partial L}{\partial b} = ?$

Compute the Gradient
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(y^n - (b + w \cdot x^n)\right)(-x^n)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(y^n - (b + w \cdot x^n)\right)(-1)$
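
A sketch implementing gradient descent with these two partial derivatives on the five-point example from earlier; the learning rate and iteration count are assumed values chosen so that the raw (unscaled) inputs converge.

```python
import numpy as np

# Gradient descent on L(w, b) = sum_n (y^n - (b + w*x^n))^2 using
#   dL/dw = sum_n 2*(y^n - (b + w*x^n)) * (-x^n)
#   dL/db = sum_n 2*(y^n - (b + w*x^n)) * (-1)

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

w, b = 0.0, 0.0
eta = 1e-4                     # assumed learning rate
for t in range(200_000):       # assumed iteration count
    residual = y - (b + w * x)
    grad_w = np.sum(2.0 * residual * (-x))
    grad_b = np.sum(2.0 * residual * (-1.0))
    w -= eta * grad_w
    b -= eta * grad_b

print(f"w = {w:.3f}, b = {b:.3f}")   # approaches the exact fit w = 0.1, b = 4
```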

How about the results?
For $y = b + wx$, the best fit is $b = -188.4$, $w = 2.7$. Let $e^n$ denote the square error on the $n$-th training point. The average error on the training data is $\frac{1}{10}\sum_{n=1}^{10} e^n = 31.9$.

Generalization?
What we really care about is the error on new data (testing data). For $y = b + wx$ with $b = -188.4$, $w = 2.7$, the average error on the testing data is $\frac{1}{10}\sum_{n=1}^{10} e^n = 35.0$, which is larger than the average error on the training data (31.9). How can we do better?

A More Complex f
Best fit of $y = b + w_1 x + w_2 x^2$: $b = -10.3$, $w_1 = 1.0$, $w_2 = 2.7 \times 10^{-3}$. Training average error = 15.4; testing average error = 18.4. Better! Could it be even better?

A More Complex f
Best fit of $y = b + w_1 x + w_2 x^2 + w_3 x^3$: training average error = 15.3; testing average error = 18.1. Slightly better. How about an even more complex model?

A More Complex f
Best fit of $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$ (the degree-4 model): training average error = 14.9; testing average error = 28.8. The results become worse...

A More Complex f
Best fit of $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$ (the degree-5 model): training average error = 12.8; testing average error = 232.1. The results are bad.

Training and Testing Errors

Degree   Training error   Testing error
1        31.9             35.0
2        15.4             18.4
3        15.3             18.1
4        14.9             28.8
5        12.8             232.1

A more complex model does not always lead to better performance on testing data; this is due to overfitting. Where does the error come from?

Estimator
Let $\hat{f}$ denote the true function. From the training data we find $f^*$, which is an estimator of $\hat{f}$. Its error comes from bias + variance.

Bias and Variance of an Estimator
Assume that a variable $x$ follows a PDF with mean $\mu$ and variance $\sigma^2$, and we want to estimate them. Estimator: sample $N$ points from the PDF, $x^1, x^2, \ldots, x^N$, and compute
$m = \frac{1}{N}\sum_n x^n$, $\quad s^2 = \frac{1}{N}\sum_n \left(x^n - m\right)^2$
$E[m] = E\left[\frac{1}{N}\sum_n x^n\right] = \frac{1}{N}\sum_n E[x^n] = \mu$ (unbiased)
$E[s^2] = \frac{N-1}{N}\sigma^2$ (biased)
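
A quick simulation sketch checking the two statements above by repeated sampling; the Gaussian source distribution and the values of mu, sigma, N, and the trial count are assumptions for illustration.

```python
import numpy as np

# Empirical check: the sample mean m is unbiased, while the sample variance
# s^2 = (1/N) * sum (x^n - m)^2 is biased: E[s^2] = (N-1)/N * sigma^2.

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(trials, N))
m = samples.mean(axis=1)                          # one estimate m per trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)   # (1/N) * sum (x^n - m)^2

print(m.mean())                  # close to mu = 5
print(s2.mean())                 # close to (N-1)/N * sigma^2 = 3.2
print((N - 1) / N * sigma**2)    # 3.2
```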

$E[f^*] = \bar{f}$. The bias is the gap between $\bar{f}$ and the true function $\hat{f}$; the variance is the spread of $f^*$ around $\bar{f}$.

How to Compute Bias and Variance?
Assume that we have many sets of training data and that we insist on using the linear model $y = b + w \cdot x$.

Training Results
Different training data lead to different fitted functions $f^*$: $y = b + w \cdot x$ versus $y = b' + w' \cdot x$.

Different Functions/Models
$y = b + w \cdot x$
$y = b + w_1 x + w_2 x^2 + w_3 x^3$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$

Black curve: the true function $\hat{f}$. Red curves: 5000 fitted functions $f^*$. Blue curve: the average of the 5000 $f^*$, i.e., $\bar{f}$.

Bias vs. Variance
The simple model $y = b + w \cdot x$ has large bias and small variance; the complex model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$ has small bias and large variance. A simpler model is less influenced by the sampled data, while a more complex model tends to overfit (it is impacted more by changes in the data).
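
A small simulation sketch of this contrast: fit a simple (degree-1) and a complex (degree-5) polynomial to many noisy training sets drawn around an assumed "true" function; the true function, noise level, and sample sizes are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(1.5 * x)      # assumed true function f_hat
x_grid = np.linspace(-2, 2, 50)

def fit_many(degree, n_sets=500, n_points=15, noise=0.3):
    """Fit `n_sets` polynomial models to independent noisy training sets."""
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(-2, 2, n_points)
        y = true_f(x) + rng.normal(0, noise, n_points)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_grid))
    return np.array(preds)

for degree in (1, 5):
    preds = fit_many(degree)
    bias2 = np.mean((preds.mean(axis=0) - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.3f}, variance = {variance:.4f}")
# Typically: the degree-1 model shows larger bias^2 and smaller variance,
# and the degree-5 model the opposite.
```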

Bias vs. Variance
As the model becomes more complex, the error from bias decreases while the error from variance increases; the observed error reflects both. Large bias with small variance corresponds to underfitting; small bias with large variance corresponds to overfitting.

What to do with large bias?
Diagnosis: if your model cannot even fit the training examples, you have large bias (underfitting). If you can fit the training data but have large error on the testing data, you probably have large variance (overfitting). For large bias, redesign your model: add more features as input, or use a more complex function/model.

What to do with large variance?
More data (very effective, but not always practical), or regularization.

Exercise
Suppose that you have 10000 data points, and you will fit them with the model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$.
Plan 1: Put all 10000 data points into your training process.
Plan 2: Partition them into 10 sets, run regression on each of the 10 sets to get 10 different functions, and return the average of these 10 functions.
Which one is better?

High-Dimensional Data
$\boldsymbol{x_1} = (1,5,7,4,12,10,9,20,50)^T$, so $x_1^5 = 12$
$\boldsymbol{x_2} = (2,9,5,15,19,17,9,21,52)^T$, so $x_2^5 = 19$
$\boldsymbol{y} = (10,2,3,7,5,8,35,19,29)^T$, so $y^5 = 5$
The model is $\boldsymbol{y} = w_1 \boldsymbol{x_1} + w_2 \boldsymbol{x_2} + \boldsymbol{b} = (\boldsymbol{x_1}, \boldsymbol{x_2}) \cdot \boldsymbol{w} + \boldsymbol{b}$, where $\boldsymbol{w} = (w_1, w_2)$ is a vector and $\boldsymbol{b}$ is a vector; in general, $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x_i}$. We can still use gradient descent to solve
$\min L = \sum_n \left(y^n - \left(b^n + \sum_i w_i x_i^n\right)\right)^2$

Improve Robustness: Regularization
For $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x_i}$, functions with smaller $w_i$ are better. Add a penalty term to the loss:
$L = \sum_n \left(y^n - \left(b^n + \sum_i w_i x_i^n\right)\right)^2 + \lambda \sum_i w_i^2$
Why are smoother functions preferred? If some noise $\Delta x_i$ is induced on the input $x_i$ at testing time, the output becomes $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i (\boldsymbol{x_i} + \Delta x_i)$, so smaller $w_i$ means less influence on the estimated $y$.
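
A sketch of gradient descent on this regularized, multi-feature loss; the toy data, lambda, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

# Gradient descent on
#   L = sum_n ( y^n - (b + sum_i w_i x_i^n) )^2 + lambda * sum_i w_i^2

rng = np.random.default_rng(0)
N, D = 50, 2
X = rng.normal(size=(N, D))                        # rows = samples, columns = features x_i
true_w, true_b = np.array([2.0, -1.0]), 0.5
y = X @ true_w + true_b + rng.normal(0, 0.1, N)    # noisy targets (assumed toy data)

w, b = np.zeros(D), 0.0
eta, lam = 0.005, 1.0                              # assumed learning rate and lambda
for t in range(2000):
    residual = y - (b + X @ w)
    grad_w = -2.0 * X.T @ residual + 2.0 * lam * w  # data term + regularization term
    grad_b = -2.0 * residual.sum()                  # b is not regularized
    w -= eta * grad_w
    b -= eta * grad_b

print(np.round(w, 2), round(b, 2))   # near (2, -1) and 0.5, shrunk slightly toward 0 by lambda
```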

Regularization Results

lambda    Training error   Testing error
0         1.9              102.3
1         2.3              68.7
10        3.5              25.7
100       4.1              11.1
1000      5.6              12.8
10000     6.3              18.7
100000    8.5              26.8

As $\lambda$ increases the function becomes smoother. We prefer a smooth function, but not one that is too smooth.

Logistic Regression
Training data: $(x^1, y^1), (x^2, y^2), \ldots, (x^{10}, y^{10})$. What if the $y$ value is binary, i.e., each $x^n$ belongs to class $C_1$ or $C_2$? This is a classification problem.

Probabilistic Interpretation
Assume that the training data are generated according to a probability distribution. We aim to estimate $P(C_1|x)$ by $f(x)$, which is characterized by parameters $w$ and $b$, so we also write it as $f_{w,b}(x)$. If $P(C_1|x) = f_{w,b}(x) > 0.5$, output $y$ = class 1; otherwise, output $y$ = class 2.

The Function
$P(C_1|x) = f_{w,b}(x) = \sigma(z) = \sigma\left(\sum_i w_i x_i + b\right)$, where $\sigma$ is the sigmoid function. If $z \ge 0$, then $P(C_1|x) = f_{w,b}(x) = \sigma(z) \ge 0.5$ and the data point is assigned to class 1.
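
A minimal sketch of this model in code; the sigmoid is the standard $\sigma(z) = 1/(1+e^{-z})$ defined later in these slides, and the weights, bias, and input below are made-up values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# P(C1|x) = f_{w,b}(x) = sigmoid(sum_i w_i * x_i + b)
w = np.array([0.5, -1.0])   # assumed weights
b = 0.2                     # assumed bias
x = np.array([2.0, 1.0])    # assumed input

z = np.dot(w, x) + b
p_c1 = sigmoid(z)
print(p_c1, "class 1" if p_c1 > 0.5 else "class 2")   # z = 0.2 >= 0, so class 1
```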

Data Generation Probability
Training data: $x^1, x^2, x^3, \ldots, x^N$, each labeled $C_1$ or $C_2$ (here $x^1, x^2 \in C_1$ and $x^3 \in C_2$). Given a set of $w$ and $b$, what is the probability of generating the data?
$L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \left(1 - f_{w,b}(x^3)\right) \cdots f_{w,b}(x^N)$
Which model (characterized by $w^*$ and $b^*$) generates these data with the largest probability (maximum likelihood)?
$(w^*, b^*) = \arg\max_{w,b} L(w,b)$

With $y^n = 1$ for class 1 and $y^n = 0$ for class 2 (here $y^1 = 1$, $y^2 = 1$, $y^3 = 0$), maximizing the likelihood is equivalent to minimizing its negative log:
$(w^*, b^*) = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} -\ln L(w,b)$
$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\left(1 - f_{w,b}(x^3)\right) - \cdots$
and each term can be written uniformly as $-\left[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$.

Error Function
$-\ln L(w,b) = \sum_n -\left[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$, where $y^n$ is 1 for class 1 and 0 for class 2.
Each term is the cross entropy between two Bernoulli distributions (the true output vs. the estimated output): distribution $p$ with $p(x=1) = y^n$, $p(x=0) = 1 - y^n$; distribution $q$ with $q(x=1) = f(x^n)$, $q(x=0) = 1 - f(x^n)$; cross entropy $H(p,q) = -\sum_x p(x)\ln q(x)$.
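
A small sketch of this loss in code; the predicted values and labels are made-up numbers for illustration, and the clipping is a standard numerical safeguard, not part of the slides.

```python
import numpy as np

# Cross-entropy loss:
#   -ln L(w,b) = sum_n -[ y^n * ln f(x^n) + (1 - y^n) * ln(1 - f(x^n)) ]

def cross_entropy(f_values, y):
    f_values = np.clip(f_values, 1e-12, 1 - 1e-12)   # avoid log(0)
    return np.sum(-(y * np.log(f_values) + (1 - y) * np.log(1 - f_values)))

f_values = np.array([0.9, 0.8, 0.3])   # f_{w,b}(x^1), f_{w,b}(x^2), f_{w,b}(x^3)
y        = np.array([1.0, 1.0, 0.0])   # x^1, x^2 in class 1; x^3 in class 2

print(cross_entropy(f_values, y))      # -(ln 0.9 + ln 0.8 + ln 0.7) ~ 0.69
```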

Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression uses $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$, with output between 0 and 1; linear regression uses $f_{w,b}(x) = \sum_i w_i x_i + b$, with output of any value.
Step 2 (loss): the training data are $(x^n, y^n)$. For logistic regression, $y^n$ is 1 for class 1 and 0 for class 2, and $L(f) = \sum_n C\left(f(x^n), y^n\right)$ with the cross entropy $C\left(f(x^n), y^n\right) = -\left[y^n \ln f(x^n) + (1 - y^n)\ln\left(1 - f(x^n)\right)\right]$. For linear regression, $y^n$ is a real number and $L(f) = \frac{1}{2}\sum_n \left(f(x^n) - y^n\right)^2$.

Gradient Descent
$-\ln L(w,b) = \sum_n -\left[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$, with $f_{w,b}(x) = \sigma(z)$, $z = w \cdot x + b = \sum_i w_i x_i + b$, and $\sigma(z) = \frac{1}{1 + \exp(-z)}$.
For the first term, $\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\sigma(z)\left(1 - \sigma(z)\right) = 1 - \sigma(z)$. Hence $\frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \left(1 - f_{w,b}(x^n)\right)x_i^n$.

Gradient Descent
For the second term, $\frac{\partial \ln\left(1 - f_{w,b}(x)\right)}{\partial w_i} = \frac{\partial \ln\left(1 - \sigma(z)\right)}{\partial z}\frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln\left(1 - \sigma(z)\right)}{\partial z} = -\frac{1}{1 - \sigma(z)}\frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)}\sigma(z)\left(1 - \sigma(z)\right) = -\sigma(z)$. Hence $\frac{\partial \ln\left(1 - f_{w,b}(x^n)\right)}{\partial w_i} = -f_{w,b}(x^n)\, x_i^n$.

Gradient Descent
Putting the two terms together:
$\frac{\partial\left(-\ln L(w,b)\right)}{\partial w_i} = \sum_n -\left[y^n \left(1 - f_{w,b}(x^n)\right)x_i^n - (1 - y^n) f_{w,b}(x^n)\, x_i^n\right] = \sum_n -\left[y^n - y^n f_{w,b}(x^n) - f_{w,b}(x^n) + y^n f_{w,b}(x^n)\right]x_i^n = \sum_n -\left(y^n - f_{w,b}(x^n)\right)x_i^n$
The update rule is $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - f_{w,b}(x^n)\right)x_i^n$: the larger the difference between the target and the output, the larger the update.
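
A sketch of logistic regression trained with this update rule; the toy dataset, learning rate, and iteration count are assumptions for illustration.

```python
import numpy as np

# Update rule: w_i <- w_i - eta * sum_n -(y^n - f_{w,b}(x^n)) * x_i^n

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])  # inputs x^n (assumed)
y = np.array([1.0, 1.0, 0.0, 0.0])                                  # 1 = class 1, 0 = class 2

w, b = np.zeros(2), 0.0
eta = 0.1
for t in range(1000):
    f = sigmoid(X @ w + b)                        # f_{w,b}(x^n) for all n
    error = y - f                                 # y^n - f_{w,b}(x^n)
    w -= eta * np.sum(-error[:, None] * X, axis=0)
    b -= eta * np.sum(-error)                     # same rule with x_i^n = 1 for the bias

print(np.round(sigmoid(X @ w + b), 3))            # close to [1, 1, 0, 0]
```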

Logistic Regression vs. Linear Regression
Steps 1 (model) and 2 (loss) are as summarized earlier. Step 3 (update): logistic regression uses $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - f_{w,b}(x^n)\right)x_i^n$; linear regression uses $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - f_{w,b}(x^n)\right)x_i^n$. The update rules have exactly the same form.

Logistic Regression + Square Error
Step 1: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$. Step 2: training data $(x^n, y^n)$ with $y^n = 1$ for class 1 and 0 for class 2, and $L(f) = \frac{1}{2}\sum_n \left(f_{w,b}(x^n) - y^n\right)^2$. Step 3:
$\frac{\partial \left(f_{w,b}(x) - y\right)^2}{\partial w_i} = 2\left(f_{w,b}(x) - y\right) f_{w,b}(x)\left(1 - f_{w,b}(x)\right) x_i$
Take $y^n = 1$: if $f_{w,b}(x^n) = 1$ (close to the target), $\partial L/\partial w_i = 0$; but if $f_{w,b}(x^n) = 0$ (far from the target), $\partial L/\partial w_i = 0$ as well, so gradient descent barely moves even when the prediction is completely wrong.
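
A tiny numerical check of this effect, comparing the square-error gradient above with the cross-entropy gradient $-(y - f)\,x_i$ derived earlier; the single input $x_i = 1$ and target $y = 1$ are assumed for illustration.

```python
# Square-error gradient:   2 * (f - y) * f * (1 - f) * x_i
# Cross-entropy gradient:  -(y - f) * x_i

def grads(f, y, x_i=1.0):
    square_error_grad = 2 * (f - y) * f * (1 - f) * x_i
    cross_entropy_grad = -(y - f) * x_i
    return square_error_grad, cross_entropy_grad

for f in (0.9999, 0.0001):     # close to the target vs. far from the target (y = 1)
    se, ce = grads(f, y=1.0)
    print(f"f = {f}: square-error grad = {se:.6f}, cross-entropy grad = {ce:.4f}")
# The square-error gradient is ~0 in both cases; the cross-entropy gradient
# stays near -1 when the prediction is far from the target.
```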

Cross Entropy vs. Square Error
Comparison of the total-loss surfaces over $(w_1, w_2)$: the square-error surface is flat far from the minimum, while the cross-entropy surface still provides a useful gradient. Source: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf

Multi-class Classification
Each class has its own weights and bias: $C_1$: $w^1, b^1$, $z_1 = w^1 \cdot x + b^1$; $C_2$: $w^2, b^2$, $z_2 = w^2 \cdot x + b^2$; $C_3$: $w^3, b^3$, $z_3 = w^3 \cdot x + b^3$. The softmax converts $z_1, z_2, z_3$ into probabilities $y_i$ with $0 < y_i < 1$ and $\sum_i y_i = 1$. For example, $z = (3, 1, -3)$ gives $e^z \approx (20, 2.7, 0.05)$ and $y \approx (0.88, 0.12, \approx 0)$.
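
A minimal softmax sketch reproducing the example values above; the max-subtraction is a standard numerical-stability trick, not something from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
y = softmax(z)
print(np.round(np.exp(z), 2))   # [20.09  2.72  0.05]
print(np.round(y, 2))           # [0.88  0.12  0.  ]
print(y.sum())                  # 1.0
```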

Multi-class Classification
Given $x$, compute $z_1 = w^1 \cdot x + b^1$, $z_2 = w^2 \cdot x + b^2$, $z_3 = w^3 \cdot x + b^3$, apply the softmax to obtain $y$, and train by minimizing the cross entropy $-\sum_{i=1}^{3} \hat{y}_i \ln y_i$ against the target $\hat{y}$: if $x \in$ class 1, $\hat{y} = (1, 0, 0)$; if $x \in$ class 2, $\hat{y} = (0, 1, 0)$; if $x \in$ class 3, $\hat{y} = (0, 0, 1)$.

Limitation of Logistic Regression
Input features and labels:

x1   x2   Label
0    0    Class 2
0    1    Class 1
1    0    Class 1
1    1    Class 2

Logistic regression separates the plane with a single linear boundary ($z \ge 0$ on one side, $z < 0$ on the other), so it cannot separate these two classes.

Limitation of Logistic Regression
Feature transformation: let $x_1'$ be the distance of $(x_1, x_2)$ to $(0, 0)$ and $x_2'$ its distance to $(1, 1)$. Then
$(x_1, x_2) = (0, 0) \rightarrow (x_1', x_2') = (0, 2)$
$(0, 1) \rightarrow (1, 1)$
$(1, 0) \rightarrow (1, 1)$
$(1, 1) \rightarrow (2, 0)$
However, it is not always easy to find a good transformation.
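
A small sketch of this transformation in code; the use of the L1 (Manhattan) distance is an assumption inferred from the tabulated values, not stated on the slide.

```python
import numpy as np

# x1' = distance to (0, 0), x2' = distance to (1, 1)

points = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = ["Class 2", "Class 1", "Class 1", "Class 2"]

def transform(p):
    x1p = np.abs(p - np.array([0.0, 0.0])).sum()   # L1 distance to (0, 0)
    x2p = np.abs(p - np.array([1.0, 1.0])).sum()   # L1 distance to (1, 1)
    return x1p, x2p

for p, label in zip(points, labels):
    print(p, "->", transform(p), label)
# (0,0) -> (0,2), (0,1) -> (1,1), (1,0) -> (1,1), (1,1) -> (2,0)
```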

Limitation of Logistic Regression
Cascading logistic regression models: a first layer of logistic regressions performs the feature transformation (producing $x_1'$ and $x_2'$), and a final logistic regression performs the classification (bias terms are ignored in the original figure).
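
A sketch of such a cascade on the example above: two first-layer logistic units transform the inputs and one second-layer unit classifies. The weights and biases are hand-picked for illustration (the slide's figure ignores biases; they are used here to keep the example tiny), not learned values from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cascaded_classifier(x1, x2):
    h1 = sigmoid(20 * (x1 + x2) - 10)        # roughly: "at least one input is 1"
    h2 = sigmoid(20 * (x1 + x2) - 30)        # roughly: "both inputs are 1"
    return sigmoid(20 * h1 - 40 * h2 - 10)   # high only when exactly one input is 1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    p = cascaded_classifier(x1, x2)
    print((x1, x2), round(p, 3), "Class 1" if p > 0.5 else "Class 2")
# Matches the labels above: (0,1) and (1,0) -> Class 1; (0,0) and (1,1) -> Class 2.
```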

Feature Transformation: folding the space.

Deep Learning!
Each cascaded logistic regression unit is a "neuron", and a cascade of neurons is a neural network.