Deep Learning for Non-Linear Control Shiyan Hu Michigan Technological University
The General Non-Linear Control System: we need to model the nonlinear dynamics of the plant. How?
An Example: an AC dynamics model. Its inputs at time t include the current outside temperature, the AC power level, the required temperature, the available AC power levels, and the forecast outside temperature for t+1; its outputs are the actual temperature at t+1 and the electricity bill at t+1. In most cases, one cannot analytically model these dynamics and can only use historical operation data to approximate them. This is called learning.
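As a rough, hypothetical sketch (made-up numbers and a plain linear least-squares fit, not the actual AC dynamics), learning such a model from logged operation data could look like this:

```python
import numpy as np

# Hypothetical logged AC data: each row holds [outside temperature at t,
# AC power level, required temperature]; the target is the indoor temperature at t+1.
X = np.array([[30.0, 2.0, 22.0],
              [32.0, 3.0, 22.0],
              [28.0, 1.0, 23.0],
              [35.0, 4.0, 21.0]])
y = np.array([23.1, 22.4, 23.8, 21.9])   # actual temperature at t+1 (made up)

# Approximate the unknown dynamics with a linear model y ~ X w + b via least squares.
A = np.hstack([X, np.ones((len(X), 1))])          # append a ones column for the bias b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:-1], coef[-1]
print("approximate dynamics parameters:", w, b)
```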
What is Learning? Learning a function f. Stock market forecast: f outputs the Dow Jones Industrial Average tomorrow. Self-driving car: f outputs the wheel control.
Learning Examples
Supervised learning: given inputs and target outputs, fit the model (e.g., classification). Translating one language into another is also supervised but goes beyond classification, since we cannot enumerate all possible sentences; this is called structural learning.
Unsupervised learning: what if we do not know the target outputs? That is, we do not know what we want to learn.
Reinforcement learning: a machine talks to a person, learns what works and what does not only through a reward function, and gradually learns to speak; it evolves by getting feedback. We do not feed it exact inputs and outputs, so it is not supervised; we still give it some feedback through the reward function, so it is not unsupervised.
Unsupervised Learning Learn the meanings of words and sentences through reading the documents.
Reinforcement Learning
Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean
History of Deep Learning 1958: Perceptron (linear model) 1986: Backpropagation 2006: RBM initialization 2011: Starts to become popular in speech recognition 2012: Wins the ILSVRC image competition 2015.2: Image recognition surpassing human-level performance 2016.3: AlphaGo beats Lee Sedol 2016.10: Speech recognition system as good as humans
Neural Network: a network of "neurons". Different connections lead to different network structures.
Deep = Many hidden layers. Error rates: AlexNet (2012, 8 layers): 16.4%; VGG (2014, 19 layers): 7.3%; GoogleNet (2014): 6.7%. Source: http://cs231n.stanford.edu/slides/winter1516_lecture8.pdf
Deep = Many hidden layers. Residual Net (2015): 3.57%, compared with AlexNet (2012): 16.4%, VGG (2014): 7.3%, GoogleNet (2014): 6.7%.
Supervised Learning Statistical and Signal Processing Techniques Linear regression Logistic regression Nonlinear regression Machine Learning Techniques SVM Deep learning
Learning Basics Training Data Testing Data You are given some training data You are to learn a function/model about these training data You will use this model to process testing data Training data and testing data do not necessarily share the same properties
Linear Regression: Input Data. Training data: $(x^1, y^1), (x^2, y^2), \ldots, (x^{10}, y^{10})$, and in general $(x^n, y^n)$. This is real data. The function to fit is the linear model $y = b + wx$, where $w$ and $b$ are scalars.
An Example. Training data are (10, 5), (20, 6), (30, 7), (40, 8), (50, 9), giving 5 = b + 10w, 6 = b + 20w, 7 = b + 30w, 8 = b + 40w, 9 = b + 50w. Compute b and w which best fit these data.
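A minimal NumPy sketch of this example (the solver choice is mine; the slide only asks for the best-fitting b and w):

```python
import numpy as np

# Training data from the slide: (10, 5), (20, 6), (30, 7), (40, 8), (50, 9).
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

# Fit y = b + w*x by least squares over the five equations above.
A = np.vstack([x, np.ones_like(x)]).T      # columns: [x, 1]
w, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b)    # w = 0.1, b = 4.0 fit these five points exactly
```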
Linear Regression: Function to Learn. To compute scalars w and b such that $b + wx$ best approximates y, one can convert this into an optimization problem minimizing a loss function: $L(w,b) = \sum_{n=1}^{10} \left(y^n - (b + w\cdot x^n)\right)^2$, with $(w^*, b^*) = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10} \left(y^n - (b + w\cdot x^n)\right)^2$.
Linear Regression: Gradient Descent. Consider a loss function $L(w)$ with one parameter w, and $w^* = \arg\min_w L(w)$. (Randomly) pick an initial value $w^0$ and compute $\frac{dL}{dw}\big|_{w=w^0}$. If the derivative is negative, increase w; if positive, decrease w.
Linear Regression: Gradient Descent. Update $w^1 \leftarrow w^0 - \eta \frac{dL}{dw}\big|_{w=w^0}$, i.e., move w by $-\eta \frac{dL}{dw}\big|_{w=w^0}$. Here $\eta$ is called the "learning rate".
Linear Regression: Gradient Descent. Then compute $\frac{dL}{dw}\big|_{w=w^1}$ and update $w^2 \leftarrow w^1 - \eta \frac{dL}{dw}\big|_{w=w^1}$, and so on for many iterations, producing $w^0, w^1, w^2, \ldots, w^T$. This may reach a local optimum rather than the global optimum.
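A minimal sketch of this one-parameter update rule in Python (the example loss L(w) = (w - 3)^2, learning rate, and iteration count are illustrative choices, not from the slides):

```python
def dL_dw(w):
    return 2.0 * (w - 3.0)        # derivative of the example loss L(w) = (w - 3)^2

eta = 0.1                         # learning rate
w = 0.0                           # (randomly) picked initial value w^0
for _ in range(100):              # many iterations
    w = w - eta * dL_dw(w)        # w^{t+1} <- w^t - eta * dL/dw
print(w)                          # approaches the minimizer w = 3
```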
Momentum: still no guarantee of reaching the global minimum, but it gives some hope. Movement = negative of $\partial L / \partial w$ + momentum, so the real movement can continue even where $\partial L / \partial w = 0$.
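A sketch of the same loop with a momentum term added (the momentum coefficient and the toy loss are again illustrative assumptions):

```python
def dL_dw(w):
    return 2.0 * (w - 3.0)        # derivative of the example loss L(w) = (w - 3)^2

eta, beta = 0.1, 0.9              # learning rate and momentum coefficient
w, movement = 0.0, 0.0
for _ in range(100):
    movement = beta * movement - eta * dL_dw(w)   # momentum + negative gradient
    w = w + movement                              # real movement
print(w)   # still approaches w = 3; momentum keeps it moving even where dL/dw is near 0
```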
Linear Regression: Gradient Descent. How about two parameters? $(w^*, b^*) = \arg\min_{w,b} L(w,b)$, with gradient $\nabla L = \left[\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right]^T$. (Randomly) pick initial values $w^0, b^0$. Compute $\frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$, then update $w^1 \leftarrow w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0, b=b^0}$ and $b^1 \leftarrow b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0, b=b^0}$. Compute the partial derivatives again at $(w^1, b^1)$ and update to $(w^2, b^2)$, and so on.
2D Gradient Descent. (Figure: contour plot in the (b, w) plane, where color indicates the value of the loss L(w,b); at each point we compute $\partial L/\partial b$ and $\partial L/\partial w$ and move by $(-\eta\,\partial L/\partial b,\; -\eta\,\partial L/\partial w)$.)
Convex L: getting stuck at a local optimum is not a concern in linear regression, because the loss function L is convex, so any local optimum is the global optimum.
Compute Gradient Descent. Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for $L(w,b) = \sum_{n=1}^{10}\left(y^n - (b + w\cdot x^n)\right)^2$: $\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(y^n - (b + w\cdot x^n)\right)(-x^n)$, giving the update $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - w\cdot x^n\right)x_i^n$. And $\frac{\partial L}{\partial b} = ?$
Compute Gradient Descent. For $L(w,b) = \sum_{n=1}^{10}\left(y^n - (b + w\cdot x^n)\right)^2$: $\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2\left(y^n - (b + w\cdot x^n)\right)(-x^n)$ and $\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2\left(y^n - (b + w\cdot x^n)\right)(-1)$.
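A sketch applying these two gradients to the earlier five-point example (the learning rate and iteration count are my own choices):

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

w, b, eta = 0.0, 0.0, 1e-4
for _ in range(100000):
    err = y - (b + w * x)                 # y^n - (b + w * x^n)
    grad_w = np.sum(2.0 * err * (-x))     # dL/dw from the slide
    grad_b = np.sum(2.0 * err * (-1.0))   # dL/db from the slide
    w, b = w - eta * grad_w, b - eta * grad_b
print(w, b)   # approaches the exact fit w = 0.1, b = 4.0
```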
How about the results? The fitted model is y = b + wx with b = -188.4 and w = 2.7. Let $e^n$ denote the square error on the n-th training example. Average error on training data $= \frac{1}{10}\sum_{n=1}^{10} e^n = 31.9$.
Generalization? What we really care about is the error on new data (testing data). With y = b + wx, b = -188.4, w = 2.7, the average error on testing data $= \frac{1}{10}\sum_{n=1}^{10} e^n = 35.0$, larger than the average error on training data (31.9). How can we do better?
More Complex f. Best fit: $y = b + w_1 x + w_2 x^2$ with b = -10.3, $w_1 = 1.0$, $w_2 = 2.7 \times 10^{-3}$. Training average error = 15.4; testing average error = 18.4. Better! Could it be even better?
More Complex f. Best fit: $y = b + w_1 x + w_2 x^2 + w_3 x^3$. Training average error = 15.3; testing average error = 18.1. Slightly better. How about a more complex model?
More Complex f. Best fit: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$. Training average error = 14.9; testing average error = 28.8. The results become worse ...
More Complex f. Best fit: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$. Training average error = 12.8; testing average error = 232.1. The results are bad.
Training and Testing Errors
Model order   Training error   Testing error
1             31.9             35.0
2             15.4             18.4
3             15.3             18.1
4             14.9             28.2
5             12.8             232.1
A more complex model does not always lead to better performance on testing data; this is due to overfitting. Where does the error come from?
Estimator. The true function is $\hat{f}$. From training data, we find $f^*$; $f^*$ is an estimator of $\hat{f}$. The error of $f^*$ comes from bias + variance.
Bias and Variance of an Estimator. Assume that a variable x follows a PDF with mean $\mu$ and variance $\sigma^2$; we want to estimate them. Estimator: sample N points from the PDF: $x^1, x^2, \ldots, x^N$. Sample mean: $m = \frac{1}{N}\sum_n x^n$, with $E[m] = E\left[\frac{1}{N}\sum_n x^n\right] = \frac{1}{N}\sum_n E[x^n] = \mu$ (unbiased). Sample variance: $s^2 = \frac{1}{N}\sum_n (x^n - m)^2$, with $E[s^2] = \frac{N-1}{N}\sigma^2$ (biased).
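A quick simulation (assuming Gaussian data; the parameter values are arbitrary) that illustrates the unbiased mean and the (N-1)/N bias of this variance estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 3.0, 2.0, 5, 100_000

means, variances = [], []
for _ in range(trials):
    x = rng.normal(mu, sigma, size=N)     # sample N points from the PDF
    m = x.mean()                          # estimator of the mean
    s2 = ((x - m) ** 2).mean()            # 1/N estimator of the variance
    means.append(m)
    variances.append(s2)

print(np.mean(means))       # close to mu = 3.0 (unbiased)
print(np.mean(variances))   # close to (N-1)/N * sigma^2 = 3.2, not 4.0 (biased)
```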
(Figure: $E[f^*] = \bar{f}$; bias is the gap between $\bar{f}$ and the true $\hat{f}$, while variance is the spread of $f^*$ around $\bar{f}$.)
How to Compute Bias and Variance? Assume that we have more sets of data, and assume that we insist on using the linear model y = b + w ∙ x.
Training Results. Different training data lead to different functions $f^*$: y = b + w ∙ x versus y = b' + w' ∙ x.
Different Functions/Models: $y = b + w x$; $y = b + w_1 x + w_2 x^2 + w_3 x^3$; $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$.
(Figure) Black curve: the true function $\hat{f}$. Red curves: 5000 $f^*$. Blue curve: the average of the 5000 $f^*$, i.e., $\bar{f}$.
Bias v.s. Variance. Simple model (y = b + w ∙ x): large bias, small variance. Complex model ($y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$): small bias, large variance. A simpler model is less influenced by the sampled data, while a more complex model tends to overfit the data (it is impacted more by changes in the data).
Bias v.s. Variance. As the model becomes more complex, the error from bias decreases (underfitting: large bias, small variance) while the error from variance increases (overfitting: small bias, large variance); the observed error combines both.
What to do with large bias? Diagnosis: if your model cannot even fit the training examples, then you have large bias (underfitting). If you can fit the training data but have large error on testing data, then you probably have large variance (overfitting). For large bias, redesign your model: add more features as input, or use a more complex function/model.
What to do with large variance? Collect more data: very effective, but not always practical (compare fits from 10 examples versus 100 examples). Or use regularization.
Exercise. Suppose that you have 10000 data points and you will use the function/model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$ to fit them. Plan 1: put all 10000 data points into your training process. Plan 2: partition them into 10 sets, run the regression on each of these 10 sets to obtain 10 different functions, then average these 10 functions and return the result. Which one is better?
High Dimensional Data. $\boldsymbol{x_1} = [1,5,7,4,12,10,9,20,50]^T$, so the fifth component $x_{1,5} = 12$. $\boldsymbol{x_2} = [2,9,5,15,19,17,9,21,52]^T$, so $x_{2,5} = 19$. $\boldsymbol{y} = [10,2,3,7,5,8,35,19,29]^T$, so $y_5 = 5$. Model: $\boldsymbol{y} = w_1 \boldsymbol{x_1} + w_2 \boldsymbol{x_2} + \boldsymbol{b} = [\boldsymbol{x_1}, \boldsymbol{x_2}]\cdot\boldsymbol{w} + \boldsymbol{b}$, where $\boldsymbol{w} = [w_1, w_2]^T$ is a vector and $\boldsymbol{b}$ is a vector; in general $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x_i}$. We can still use the gradient descent method to solve $\min L = \sum_n \left(y_n - (b_n + \sum_i w_i x_{i,n})\right)^2$.
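A vectorized gradient-descent sketch for this two-feature model, using the vectors from the slide (the learning rate and iteration count are my own choices):

```python
import numpy as np

x1 = np.array([1, 5, 7, 4, 12, 10, 9, 20, 50], dtype=float)
x2 = np.array([2, 9, 5, 15, 19, 17, 9, 21, 52], dtype=float)
y  = np.array([10, 2, 3, 7, 5, 8, 35, 19, 29], dtype=float)

X = np.stack([x1, x2], axis=1)     # shape (9, 2): one column per feature
w = np.zeros(2)
b = 0.0
eta = 5e-5
for _ in range(200_000):
    err = y - (X @ w + b)          # residual in every component n
    grad_w = -2.0 * X.T @ err      # dL/dw_i
    grad_b = -2.0 * err.sum()      # dL/db
    w -= eta * grad_w
    b -= eta * grad_b
print(w, b)                        # converges to the least-squares w1, w2, b
```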
Improve Robustness: Regularization. For $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x_i}$, the functions with smaller $w_i$ are better, so we minimize $L = \sum_n \left(y^n - (b + \sum_i w_i x_i^n)\right)^2 + \lambda \sum_i w_i^2$. Why are smooth functions preferred? If some noise $\Delta x_i$ is added to the input $x_i$ at testing time, the estimated y changes by only $\sum_i w_i \Delta x_i$, so smaller $w_i$ means less influence on the output.
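With the regularized loss, the only change to the gradient is an extra $2\lambda w_i$ term (the bias b is typically left unregularized); a small sketch under these assumptions:

```python
import numpy as np

def gradients(X, y, w, b, lam):
    """Gradients of sum_n (y_n - (b + X_n . w))^2 + lam * sum_i w_i^2."""
    err = y - (X @ w + b)
    grad_w = -2.0 * X.T @ err + 2.0 * lam * w   # data term + regularization term
    grad_b = -2.0 * err.sum()                   # b is not regularized
    return grad_w, grad_b
```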
Regularization Results
lambda    Training error   Testing error
0         1.9              102.3
1         2.3              68.7
10        3.5              25.7
100       4.1              11.1
1000      5.6              12.8
10000     6.3              18.7
100000    8.5              26.8
Larger lambda gives a smoother function. We prefer a smooth function, but not one that is too smooth.
Logistic Regression. Training data: $(x^1, y^1), (x^2, y^2), \ldots, (x^{10}, y^{10})$. What if the y value is binary, i.e., each $x^n$ belongs to class $C_1$ or $C_2$? This is a classification problem.
Probabilistic Interpretation. Assume that the training data are generated based on a probability distribution function (PDF). We aim to estimate this PDF by $f(x)$, which is characterized by parameters w and b, so we also write it as $f_{w,b}(x)$. If $P(C_1|x) = f(x) > 0.5$, output y = class 1; otherwise, output y = class 2.
Function. $P(C_1|x) = f_{w,b}(x) = \sigma(z) = \sigma\left(\sum_i w_i x_i + b\right)$, where $\sigma$ is the sigmoid function. If $z \geq 0$, then $P(C_1|x) = f_{w,b}(x) = \sigma(z) \geq 0.5$, so the data is assigned to class 1.
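A small sketch of this model on one input (the weights, bias, and input are made-up numbers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([1.0, -2.0])          # made-up weights
b = 0.5                            # made-up bias
x = np.array([3.0, 1.0])           # made-up input

z = w @ x + b                      # z = sum_i w_i x_i + b
p_c1 = sigmoid(z)                  # P(C1 | x)
print(p_c1, "class 1" if p_c1 > 0.5 else "class 2")   # about 0.82 -> class 1
```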
Data Generation Probability. Training data: $x^1, x^2, x^3, \ldots, x^N$, where, e.g., $x^1$ and $x^2$ are in class $C_1$ and $x^3$ is in class $C_2$. Given a set of w and b, what is the probability of generating these data? $L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\, \left(1 - f_{w,b}(x^3)\right) \cdots f_{w,b}(x^N)$. Which model (characterized by $w^*$ and $b^*$) generates these data with the largest probability (maximum likelihood)? $(w^*, b^*) = \arg\max_{w,b} L(w,b)$.
Let $y^n$ be 1 for class 1 and 0 for class 2 (so $y^1 = 1$, $y^2 = 1$, $y^3 = 0$). With $L(w,b) = f_{w,b}(x^1)\, f_{w,b}(x^2)\,\left(1 - f_{w,b}(x^3)\right)\cdots$, we have $(w^*, b^*) = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} -\ln L(w,b)$, where
$-\ln f_{w,b}(x^1) = -\left[y^1 \ln f(x^1) + (1 - y^1)\ln\left(1 - f(x^1)\right)\right]$,
$-\ln f_{w,b}(x^2) = -\left[y^2 \ln f(x^2) + (1 - y^2)\ln\left(1 - f(x^2)\right)\right]$,
$-\ln\left(1 - f_{w,b}(x^3)\right) = -\left[y^3 \ln f(x^3) + (1 - y^3)\ln\left(1 - f(x^3)\right)\right]$, and so on.
Error Function. $-\ln L(w,b) = \sum_n -\left[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$, with $y^n$: 1 for class 1, 0 for class 2. Each term is the cross entropy between two Bernoulli distributions (true output vs. estimated output): distribution p with $p(x{=}1) = y^n$, $p(x{=}0) = 1 - y^n$, and distribution q with $q(x{=}1) = f(x^n)$, $q(x{=}0) = 1 - f(x^n)$; cross entropy $H(p,q) = -\sum_x p(x)\ln q(x)$.
Logistic Regression vs. Linear Regression. Step 1 (model): logistic regression uses $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$ with output between 0 and 1; linear regression uses $f_{w,b}(x) = \sum_i w_i x_i + b$ with output of any value. Step 2 (loss): training data are $(x^n, y^n)$; for logistic regression, $y^n$ is 1 for class 1 and 0 for class 2, and $L(f) = \sum_n C\left(f(x^n), y^n\right)$ with cross entropy $C\left(f(x^n), y^n\right) = -\left[y^n \ln f(x^n) + (1 - y^n)\ln\left(1 - f(x^n)\right)\right]$; for linear regression, $y^n$ is a real number and $L(f) = \frac{1}{2}\sum_n \left(f(x^n) - y^n\right)^2$.
Gradient Descent. For $-\ln L(w,b) = \sum_n -\left[y^n \ln f_{w,b}(x^n) + (1 - y^n)\ln\left(1 - f_{w,b}(x^n)\right)\right]$ with $f_{w,b}(x) = \sigma(z)$, $z = w\cdot x + b = \sum_i w_i x_i + b$, and $\sigma(z) = \frac{1}{1+\exp(-z)}$: for the first term, $\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z}\frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)}\,\sigma(z)\left(1 - \sigma(z)\right) = 1 - \sigma(z)$, so this term contributes $\left(1 - f_{w,b}(x^n)\right) x_i^n$.
Gradient Descent. For the second term, $\frac{\partial \ln\left(1 - f_{w,b}(x)\right)}{\partial w_i} = \frac{\partial \ln\left(1 - f_{w,b}(x)\right)}{\partial z}\frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln\left(1 - \sigma(z)\right)}{\partial z} = -\frac{1}{1 - \sigma(z)}\frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)}\,\sigma(z)\left(1 - \sigma(z)\right) = -\sigma(z)$, so this term contributes $-f_{w,b}(x^n)\, x_i^n$.
Gradient Descent. Putting the two terms together: $\frac{\partial\left(-\ln L(w,b)\right)}{\partial w_i} = \sum_n -\left[y^n \left(1 - f_{w,b}(x^n)\right) x_i^n - (1 - y^n)\, f_{w,b}(x^n)\, x_i^n\right] = \sum_n -\left[y^n - y^n f_{w,b}(x^n) - f_{w,b}(x^n) + y^n f_{w,b}(x^n)\right] x_i^n = \sum_n -\left(y^n - f_{w,b}(x^n)\right) x_i^n$. Update: $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - f_{w,b}(x^n)\right) x_i^n$. The larger the difference between the target and the prediction, the larger the update.
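A sketch of logistic regression trained with exactly this update rule on a tiny made-up 1-D dataset (the learning rate and iteration count are my own choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])   # N x 1 inputs
y = np.array([0, 0, 0, 1, 1, 1])                           # 1 for class 1, 0 for class 2

w, b, eta = np.zeros(1), 0.0, 0.1
for _ in range(5000):
    f = sigmoid(x @ w + b)          # f_{w,b}(x^n) for every n
    grad_w = -(y - f) @ x           # sum_n -(y^n - f(x^n)) x^n
    grad_b = -(y - f).sum()
    w -= eta * grad_w
    b -= eta * grad_b
print(sigmoid(x @ w + b))           # probabilities move toward 0, 0, 0, 1, 1, 1
```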
Logistic Regression vs. Linear Regression. Step 1: logistic regression uses $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$ with output between 0 and 1; linear regression uses $f_{w,b}(x) = \sum_i w_i x_i + b$ with output of any value. Step 2: training data are $(x^n, y^n)$; logistic: $y^n$ is 1 for class 1 and 0 for class 2, $L(f) = \sum_n C\left(f(x^n), y^n\right)$; linear: $y^n$ is a real number, $L(f) = \frac{1}{2}\sum_n \left(f(x^n) - y^n\right)^2$. Step 3: both use the same update rule $w_i \leftarrow w_i - \eta \sum_n -\left(y^n - f_{w,b}(x^n)\right) x_i^n$.
Logistic Regression + Square Error. Step 1: $f_{w,b}(x) = \sigma\left(\sum_i w_i x_i + b\right)$. Step 2: training data $(x^n, y^n)$ with $y^n$: 1 for class 1, 0 for class 2, and $L(f) = \frac{1}{2}\sum_n \left(f_{w,b}(x^n) - y^n\right)^2$. Step 3: $\frac{\partial \left(f_{w,b}(x) - y\right)^2}{\partial w_i} = 2\left(f_{w,b}(x) - y\right)\, f_{w,b}(x)\left(1 - f_{w,b}(x)\right) x_i$. For $y^n = 1$: if $f_{w,b}(x^n) = 1$ (close to the target), $\frac{\partial L}{\partial w_i} = 0$; but if $f_{w,b}(x^n) = 0$ (far from the target), $\frac{\partial L}{\partial w_i} = 0$ as well, so the gradient vanishes even when the prediction is completely wrong and training stalls.
Cross Entropy v.s. Square Error. (Figure: total loss as a function of w1 and w2 for cross entropy and for square error.) Source: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Multi-class Classification. Class scores: $C_1$: $w^1, b^1$, $z_1 = w^1\cdot x + b^1$; $C_2$: $w^2, b^2$, $z_2 = w^2\cdot x + b^2$; $C_3$: $w^3, b^3$, $z_3 = w^3\cdot x + b^3$. Softmax turns the scores into probabilities $y_i$ with $0 < y_i < 1$ and $\sum_i y_i = 1$; e.g., $z = (3, 1, -3)$ gives $e^z \approx (20, 2.7, 0.05)$ and $y \approx (0.88, 0.12, \approx 0)$.
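Softmax applied to the example scores on the slide, z = (3, 1, -3):

```python
import numpy as np

z = np.array([3.0, 1.0, -3.0])
exp_z = np.exp(z)             # approximately [20, 2.7, 0.05]
y = exp_z / exp_z.sum()       # positive values summing to 1
print(y)                      # approximately [0.88, 0.12, 0.002]
```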
Multi-class Classification. Feed x through $z_1 = w^1\cdot x + b^1$, $z_2 = w^2\cdot x + b^2$, $z_3 = w^3\cdot x + b^3$ and softmax, then train with the cross entropy $-\sum_{i=1}^3 \hat{y}_i \ln y_i$ against the one-hot target $\hat{y}$: $(1,0,0)$ if $x \in$ class 1, $(0,1,0)$ if $x \in$ class 2, $(0,0,1)$ if $x \in$ class 3.
Limitation of Logistic Regression. Consider the input features and labels: $x = (0,0)$ is class 2, $x = (0,1)$ is class 1, $x = (1,0)$ is class 1, $x = (1,1)$ is class 2. No single linear boundary ($z \geq 0$ on one side, $z < 0$ on the other) can separate the two classes.
Limitation of Logistic Regression. Feature transformation: let $x_1'$ be the distance to $(0,0)$ and $x_2'$ the distance to $(1,1)$. Then $(0,0) \to (0,2)$, $(0,1) \to (1,1)$, $(1,0) \to (1,1)$, $(1,1) \to (2,0)$, and the two classes become linearly separable in the $(x_1', x_2')$ space. However, it is not always easy to find a good transformation.
Limitation of Logistic Regression. Solution: cascading logistic regression models — a first layer of logistic units performs the feature transformation to $(x_1', x_2')$, and a final logistic unit performs the classification (bias terms are ignored in the figure).
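A sketch of such a cascade with hand-picked (not learned) weights, assuming the XOR-like labels above, i.e., (0,0) and (1,1) in class 2 and (0,1), (1,0) in class 1:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# First layer of logistic units: the feature transformation.
h1 = sigmoid(20 * X[:, 0] + 20 * X[:, 1] - 10)   # roughly "x1 OR x2"
h2 = sigmoid(20 * X[:, 0] + 20 * X[:, 1] - 30)   # roughly "x1 AND x2"

# Second logistic unit: classification on the transformed features.
p_class1 = sigmoid(20 * h1 - 20 * h2 - 10)
print(p_class1.round(2))   # approximately [0, 1, 1, 0]
```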
Feature Transformation Folding the space
Deep Learning! “Neuron” Neural Network