Deep Learning for Non-Linear Control


1 Deep Learning for Non-Linear Control
Shiyan Hu, Michigan Technological University

2 The General Non-Linear Control System
We need to model the nonlinear dynamics of the plant. How?

3 An Example
AC dynamics model (diagram). Inputs at time t: the current outside temperature, the required temperature, the available AC power levels, and the forecast outside temperature at t+1. Outputs at time t+1: the AC power level, the actual temperature, and the electricity bill. In most cases, one cannot analytically model these dynamics; one can only use historical operation data to approximate them. This is called learning.

4 What is Learning?
Learning means finding a function f, e.g., f(...) = the Dow Jones Industrial Average tomorrow (stock market forecast), or f(...) = wheel control (self-driving car).

5 Learning Examples
Supervised learning: Given inputs and target outputs, fit the model (e.g., classification). Translating one language into another goes beyond classification, since we cannot enumerate all possible sentences; this is called structured learning.
Unsupervised learning: What if we do not know the target outputs? That is, we do not know what we want to learn.
Reinforcement learning: A machine talks to a person, learns what works and what does not (only through a reward function), and gradually learns to speak; that is, it evolves through feedback. We do not feed it exact input/output pairs (not supervised), but we still give it feedback through a reward function (not unsupervised).

6 Unsupervised Learning
Learn the meanings of words and sentences by reading documents.

7 Reinforcement Learning

8 Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean

9 History of Deep Learning
1958: Perceptron (linear model)
1986: Backpropagation
2006: RBM initialization
2011: Starts to become popular in speech recognition
2012: Wins the ILSVRC image competition
2015.2: Image recognition surpassing human-level performance
2016.3: AlphaGo beats Lee Sedol
2016: Speech recognition systems as good as humans

10 Neural Network
A neural network is built from connected "neurons". Different connections lead to different network structures.

11 Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error. VGG (2014): 19 layers, 7.3% error. GoogleNet (2014): 6.7% error.

12 Deep = Many hidden layers
AlexNet (2012): 16.4% error. VGG (2014): 7.3% error. GoogleNet (2014): 6.7% error. Residual Net (2015): 3.57% error.

13 Supervised Learning
Statistical and signal processing techniques: linear regression, logistic regression, nonlinear regression.
Machine learning techniques: SVM, deep learning.

14 Learning Basics
You are given some training data.
You learn a function/model from these training data.
You use this model to process testing data.
Training data and testing data do not necessarily share the same properties.

15 Linear Regression: Input Data
Training data: $(x^1, y^1), (x^2, y^2), \dots, (x^{10}, y^{10})$; each pair $(x^n, y^n)$ is real data.
Function to fit: $y = b + wx$, a linear function where $w$ and $b$ are scalars.

16 An Example
Training data: (10, 5), (20, 6), (30, 7), (40, 8), (50, 9).
$5 = b + 10w$, $6 = b + 20w$, $7 = b + 30w$, $8 = b + 40w$, $9 = b + 50w$.
Compute $b$ and $w$ that best fit these data.
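As a quick check of this example, here is a minimal sketch in Python with NumPy (the slides themselves show no code) that solves the least-squares system in closed form:

import numpy as np

# Training data from the example: (x, y) pairs
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

# Solve the least-squares system y = b + w*x in closed form
A = np.column_stack([np.ones_like(x), x])   # columns: [1, x]
(b, w), *_ = np.linalg.lstsq(A, y, rcond=None)
print(b, w)   # for this data the fit is exact: b = 4.0, w = 0.1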

17 Linear Regression: Function to Learn
Compute scalars $w$ and $b$ such that $b + wx$ best approximates $y$. One can convert this into an optimization problem minimizing a loss function:
$L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$
$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$

18 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^0}$.
If the derivative is negative, increase $w$; if it is positive, decrease $w$.

19 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^0}$ and update $w^1 \leftarrow w^0 - \eta \left. \frac{dL}{dw} \right|_{w=w^0}$.
$\eta$ is called the "learning rate".

20 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$; compute $\left. \frac{dL}{dw} \right|_{w=w^0}$ and update $w^1 \leftarrow w^0 - \eta \left. \frac{dL}{dw} \right|_{w=w^0}$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^1}$ and update $w^2 \leftarrow w^1 - \eta \left. \frac{dL}{dw} \right|_{w=w^1}$, and so on.
After many iterations $w^0, w^1, w^2, \dots, w^T$, the procedure reaches a local optimum, which is not necessarily the global optimum.
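A minimal sketch of this update loop, assuming Python; the toy loss L(w) = (w - 3)^2 is only an illustration and does not come from the slides:

# One-parameter gradient descent: w_{t+1} = w_t - eta * dL/dw
def gradient_descent_1d(dL_dw, w0, eta=0.1, iterations=100):
    w = w0
    for _ in range(iterations):
        w = w - eta * dL_dw(w)   # step against the gradient
    return w

# Toy example: L(w) = (w - 3)^2, so dL/dw = 2*(w - 3)
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)   # approaches the minimizer w = 3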

21 Momentum
Momentum still does not guarantee reaching the global minimum, but it gives some hope. Movement = negative of $\partial L / \partial w$ + momentum (part of the previous movement), so the real movement can carry the parameters past points where $\partial L / \partial w = 0$.
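A minimal sketch of the momentum update, assuming Python; the decay factor beta = 0.9 is an assumed typical value, not given on the slide:

# Gradient descent with momentum: movement = momentum - eta * gradient
def momentum_step(w, velocity, grad, eta=0.01, beta=0.9):
    velocity = beta * velocity - eta * grad   # keep part of the previous movement
    w = w + velocity                          # can keep moving even where grad ~ 0
    return w, velocity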

22 Linear Regression: Gradient Descent
How about two parameters? $w^*, b^* = \arg\min_{w,b} L(w,b)$, with gradient $\nabla L = \left[ \frac{\partial L}{\partial w}, \frac{\partial L}{\partial b} \right]^T$.
(Randomly) pick initial values $w^0$, $b^0$.
Compute $\left. \frac{\partial L}{\partial w} \right|_{w=w^0, b=b^0}$ and $\left. \frac{\partial L}{\partial b} \right|_{w=w^0, b=b^0}$; update $w^1 \leftarrow w^0 - \eta \left. \frac{\partial L}{\partial w} \right|_{w=w^0, b=b^0}$ and $b^1 \leftarrow b^0 - \eta \left. \frac{\partial L}{\partial b} \right|_{w=w^0, b=b^0}$.
Compute $\left. \frac{\partial L}{\partial w} \right|_{w=w^1, b=b^1}$ and $\left. \frac{\partial L}{\partial b} \right|_{w=w^1, b=b^1}$; update $w^2 \leftarrow w^1 - \eta \left. \frac{\partial L}{\partial w} \right|_{w=w^1, b=b^1}$ and $b^2 \leftarrow b^1 - \eta \left. \frac{\partial L}{\partial b} \right|_{w=w^1, b=b^1}$, and so on.
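A minimal sketch of this two-parameter procedure for the linear-regression loss above, assuming Python with NumPy; the learning rate and iteration count are illustrative choices:

import numpy as np

def fit_linear(x, y, eta=1e-4, iterations=10000):
    """Gradient descent on L(w, b) = sum_n (y^n - (b + w*x^n))^2."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        err = y - (b + w * x)            # residuals y^n - (b + w*x^n)
        grad_w = -2.0 * np.sum(err * x)  # dL/dw
        grad_b = -2.0 * np.sum(err)      # dL/db
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b                          # gradually approaches the least-squares fit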

23 2D Gradient Descent
Picture the loss as a contour plot over the $(b, w)$ plane, where color indicates the value of $L(w,b)$. At each step, compute $\frac{\partial L}{\partial b}$ and $\frac{\partial L}{\partial w}$ and move by $\left( -\eta \frac{\partial L}{\partial b},\; -\eta \frac{\partial L}{\partial w} \right)$.

24 Convex L
Getting stuck in a local optimum is not a concern in linear regression, where the loss function $L(w,b)$ is convex, so gradient descent reaches the global optimum.

25 Compute Gradient Descent
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for $L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( - x^n \right)$, which gives the update $w \leftarrow w - \eta \sum_n -2 \left( y^n - (b + w \cdot x^n) \right) x^n$.
$\frac{\partial L}{\partial b} = ?$

26 Compute Gradient Descent
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for $L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( - x^n \right)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( -1 \right)$

27 How about the results?
Fitted model: $y = b + wx$ with $b = -188.4$ and $w = 2.7$. Let $e^n$ denote the square error on the $n$-th training point. Average error on the training data $= \frac{1}{10} \sum_{n=1}^{10} e^n = 31.9$.

28 Generalization?
What we really care about is the error on new data (testing data). Applying the same model $y = b + wx$ with $b = -188.4$ and $w = 2.7$ to the testing data gives an average error of 35.0, which is larger than the average error on the training data (31.9). How can we do better?

29 More Complex f
Model: $y = b + w_1 x + w_2 x^2$. Best fit: $b = -10.3$, $w_1 = 1.0$, $w_2 = 2.7 \times 10^{-3}$.
Training average error $= 15.4$; testing average error $= 18.4$. Better! Could it be even better?

30 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3$.
Training average error $= 15.3$; testing average error $= 18.1$. Slightly better. How about a more complex model?

31 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$.
Training average error $= 14.9$; testing average error $= 28.8$. The results become worse ...

32 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$.
Training average error $= 12.8$; testing average error $= 232.1$. The results are bad.

33 Training and Testing Fitting Error
Order 1: training error 31.9, testing error 35.0
Order 2: training error 15.4, testing error 18.4
Order 3: training error 15.3, testing error 18.1
Order 4: training error 14.9, testing error 28.2
Order 5: training error 12.8, testing error 232.1
A more complex model does not always lead to better performance on testing data; this is due to overfitting. Where does the error come from?

34 Estimator
The true function is $\hat{f}$. From the training data we find $f^*$, which is an estimator of $\hat{f}$. The error of $f^*$ comes from bias plus variance.

35 Bias and Variance of Estimator
Assume that a variable $x$ follows a PDF with mean $\mu$ and variance $\sigma^2$, and we want to estimate them.
Estimator: sample $N$ points from the PDF, $x^1, x^2, \dots, x^N$, and compute $m = \frac{1}{N} \sum_n x^n$ and $s^2 = \frac{1}{N} \sum_n (x^n - m)^2$.
$E[m] = E\left[ \frac{1}{N} \sum_n x^n \right] = \frac{1}{N} \sum_n E[x^n] = \mu$, so $m$ is unbiased.
$E[s^2] = \frac{N-1}{N} \sigma^2$, so $s^2$ is biased.
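A quick numerical check of these two results, assuming Python with NumPy; the normal distribution, seed, and sample size are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 100000
mu, sigma2 = 0.0, 1.0

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
m = samples.mean(axis=1)                         # sample mean per trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)  # sample variance per trial

print(m.mean())    # ~ mu: the mean estimator is unbiased
print(s2.mean())   # ~ (N-1)/N * sigma2 = 0.8: the variance estimator is biased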

36 $E[f^*] = \bar{f}$
Bias measures how far $\bar{f}$ is from the true function $\hat{f}$; variance measures how spread out the individual $f^*$ are around $\bar{f}$.

37 How to Compute Bias and Variance?
Assume that we have many sets of training data, and that we insist on using the linear model $y = b + w \cdot x$.

38 Training Results
Different training data lead to different fitted functions $f^*$: $y = b + w \cdot x$ versus $y = b' + w' \cdot x$.

39 Different Functions/Models
$y = b + w \cdot x$
$y = b + w_1 x + w_2 x^2 + w_3 x^3$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$

40 Black curve: the true function $\hat{f}$
Red curves: the fitted functions $f^*$. Blue curve: the average of the $f^*$, i.e., $\bar{f}$.

41 Bias vs. Variance
Simple model $y = b + w \cdot x$: large bias, small variance.
Complex model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$: small bias, large variance.
A simpler model is less influenced by the sampled data, while a more complex model tends to overfit the data (it is impacted more by changes in the data).

42 Bias vs. Variance
As model complexity grows, the error from bias decreases while the error from variance increases; the observed error combines both. Large bias with small variance corresponds to underfitting; small bias with large variance corresponds to overfitting.

43 What to do with large bias?
Diagnosis: if your model cannot even fit the training examples, you have large bias (underfitting). If you can fit the training data but the error on testing data is large, you probably have large variance (overfitting).
For large bias, redesign your model: add more features as input, or use a more complex function/model.

44 What to do with large variance?
Collect more data: very effective, but not always practical (compare fits obtained with 10 examples versus 100 examples).
Regularization.

45 Exercise
Suppose that you have data and will use the model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$ to fit them.
Plan 1: Put all of the data into your training process.
Plan 2: Partition the data into 10 sets, run regression on each of these 10 sets to obtain 10 different functions, then average these 10 functions and return the average.
Which plan is better?

46 High Dimensional Data
$\boldsymbol{x}_1 = (1, 5, 7, 4, 12, 10, 9, 20, 50)^T$, so its 5th component is 12.
$\boldsymbol{x}_2 = (2, 9, 5, 15, 19, 17, 9, 21, 52)^T$, so its 5th component is 19.
$\boldsymbol{y} = (10, 2, 3, 7, 5, 8, 35, 19, 29)^T$, so its 5th component is 5.
Model: $\boldsymbol{y} = w_1 \boldsymbol{x}_1 + w_2 \boldsymbol{x}_2 + \boldsymbol{b} = [\boldsymbol{x}_1, \boldsymbol{x}_2] \cdot \boldsymbol{w} + \boldsymbol{b}$, where $\boldsymbol{w} = (w_1, w_2)^T$ and $\boldsymbol{b}$ are vectors. In general, $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x}_i$.
We can still use gradient descent to solve $\min L = \sum_n \left( y^n - \left( b + \sum_i w_i x_i^n \right) \right)^2$.

47 Improve Robustness: Regularization
Model: $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x}_i$. Functions with smaller $w_i$ are better, so add a penalty to the loss:
$L = \sum_n \left( y^n - \left( b + \sum_i w_i x_i^n \right) \right)^2 + \lambda \sum_i w_i^2$
Why are smooth functions preferred? If the input is perturbed, $y = b + \sum_i w_i (x_i + \Delta x_i) = b + \sum_i w_i x_i + \sum_i w_i \Delta x_i$, so smaller $w_i$ means that noise induced on the input $x_i$ at testing time has less influence on the estimated $y$.

48 Regularization Results
λ = 0: training error 1.9, testing error 102.3
λ = 1: training error 2.3, testing error 68.7
λ = 10: training error 3.5, testing error 25.7
λ = 100: training error 4.1, testing error 11.1
λ = 1000: training error 5.6, testing error 12.8
λ = 10000: training error 6.3, testing error 18.7
λ = 100000: training error 8.5, testing error 26.8
Larger λ gives a smoother function. We prefer smooth functions, but not too smooth.

49 Logistic Regression
What if the y values are binary? This is a classification problem. Training data: $(x^1, y^1), (x^2, y^2), \dots, (x^{10}, y^{10})$, where each $x^n$ belongs to class $C_1$ or $C_2$.

50 Probabilistic Interpretation
Assume that the training data are generated from a probability distribution. We aim to estimate this distribution by $f(x)$, which is characterized by parameters $w$ and $b$, so we also write it as $f_{w,b}(x)$.
If $P(C_1 | x) = f(x) > 0.5$, output $y =$ class 1; otherwise, output $y =$ class 2.

51 Function
$P(C_1 | x) = f_{w,b}(x) = \sigma(z) = \sigma\left( \sum_i w_i x_i + b \right)$, where $\sigma$ is the sigmoid function.
If $z \ge 0$, then $P(C_1 | x) = f_{w,b}(x) = \sigma(z) \ge 0.5$, so the data point is assigned to class 1.
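A minimal sketch of this model, assuming Python with NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class1_prob(x, w, b):
    z = np.dot(w, x) + b   # z = sum_i w_i * x_i + b
    return sigmoid(z)      # P(C1 | x); >= 0.5 exactly when z >= 0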

52 Data Generation Probability
Training data: $x^1, x^2, x^3, \dots, x^N$, each labeled $C_1$ or $C_2$. Given a set of $w$ and $b$, what is the probability of generating these data?
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots f_{w,b}(x^N)$
Which model (characterized by $w^*$ and $b^*$) generates these data with the largest probability (maximum likelihood)?
$w^*, b^* = \arg\max_{w,b} L(w,b)$

53 Let $y^n = 1$ for class 1 and $y^n = 0$ for class 2 (so $y^1 = 1$, $y^2 = 1$, $y^3 = 0$, ...).
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots$
$w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} -\ln L(w,b)$
$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\left( 1 - f_{w,b}(x^3) \right) - \cdots$
Each term can be written as $-\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$.

54 Error Function
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots f_{w,b}(x^N)$, with $y^n = 1$ for class 1 and $y^n = 0$ for class 2.
$-\ln L(w,b) = \sum_n -\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$
This is the cross entropy between two Bernoulli distributions (estimated output vs. true output):
Distribution $p$: $p(x=1) = y^n$, $p(x=0) = 1 - y^n$. Distribution $q$: $q(x=1) = f(x^n)$, $q(x=0) = 1 - f(x^n)$.
Cross entropy: $H(p, q) = -\sum_x p(x) \ln q(x)$.
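A minimal sketch of this cross-entropy loss over a training set, assuming Python with NumPy; the small epsilon clip is added only for numerical safety:

import numpy as np

def cross_entropy(y, f):
    """-sum_n [ y^n ln f(x^n) + (1 - y^n) ln(1 - f(x^n)) ], with y^n in {0, 1}."""
    eps = 1e-12                        # avoid log(0)
    f = np.clip(f, eps, 1.0 - eps)
    return -np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))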

55 Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression uses $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$ with output between 0 and 1; linear regression uses $f_{w,b}(x) = \sum_i w_i x_i + b$ with output of any value.
Step 2 (loss): training data are $(x^n, y^n)$. For logistic regression, $y^n$ is 1 for class 1 and 0 for class 2, and $L(f) = \sum_n C\left( f(x^n), y^n \right)$ with cross entropy $C\left( f(x^n), y^n \right) = -\left[ y^n \ln f(x^n) + (1 - y^n) \ln\left( 1 - f(x^n) \right) \right]$. For linear regression, $y^n$ is a real number and $L(f) = \frac{1}{2} \sum_n \left( f(x^n) - y^n \right)^2$.

56 Gradient Descent
$-\ln L(w,b) = \sum_n -\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$, where $f_{w,b}(x) = \sigma(z)$, $z = w \cdot x + b = \sum_i w_i x_i + b$, and $\sigma(z) = \frac{1}{1 + \exp(-z)}$.
$\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z} \frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \sigma(z) \left( 1 - \sigma(z) \right) = 1 - \sigma(z)$.
Hence $\frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \left( 1 - f_{w,b}(x^n) \right) x_i^n$.

57 Gradient Descent
Similarly, $\frac{\partial \ln\left( 1 - f_{w,b}(x) \right)}{\partial w_i} = \frac{\partial \ln\left( 1 - f_{w,b}(x) \right)}{\partial z} \frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln\left( 1 - \sigma(z) \right)}{\partial z} = -\frac{1}{1 - \sigma(z)} \frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)} \sigma(z) \left( 1 - \sigma(z) \right) = -\sigma(z)$.
Hence $\frac{\partial \ln\left( 1 - f_{w,b}(x^n) \right)}{\partial w_i} = -f_{w,b}(x^n) \, x_i^n$.

58 Gradient Descent
$\frac{\partial \left( -\ln L(w,b) \right)}{\partial w_i} = \sum_n -\left[ y^n \left( 1 - f_{w,b}(x^n) \right) x_i^n - (1 - y^n) f_{w,b}(x^n) x_i^n \right]$
$= \sum_n -\left[ y^n - y^n f_{w,b}(x^n) - f_{w,b}(x^n) + y^n f_{w,b}(x^n) \right] x_i^n = \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$
Update: $w_i \leftarrow w_i - \eta \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$. The larger the difference between the target $y^n$ and the output $f_{w,b}(x^n)$, the larger the update.
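Putting the pieces together, a minimal sketch of the resulting training loop, assuming Python with NumPy; the learning rate and iteration count are illustrative:

import numpy as np

def train_logistic(X, y, eta=0.1, iterations=1000):
    """X: (N, d) inputs; y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iterations):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f_{w,b}(x^n) for all n
        diff = y - f                             # larger difference -> larger update
        w += eta * (X.T @ diff)                  # w_i <- w_i + eta * sum_n (y^n - f) x_i^n
        b += eta * np.sum(diff)
    return w, b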

59 Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$, output between 0 and 1; linear regression $f_{w,b}(x) = \sum_i w_i x_i + b$, output of any value.
Step 2 (loss): for logistic regression ($y^n$ is 1 for class 1, 0 for class 2), $L(f) = \sum_n C\left( f(x^n), y^n \right)$; for linear regression ($y^n$ is a real number), $L(f) = \frac{1}{2} \sum_n \left( f(x^n) - y^n \right)^2$.
Step 3 (update): both use the same form $w_i \leftarrow w_i - \eta \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$.

60 Logistic Regression + Square Error
Step 1: $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$. Step 2: training data $(x^n, y^n)$ with $y^n = 1$ for class 1 and 0 for class 2, and loss $L(f) = \frac{1}{2} \sum_n \left( f_{w,b}(x^n) - y^n \right)^2$.
Step 3: $\frac{\partial \left( f_{w,b}(x) - y \right)^2}{\partial w_i} = 2 \left( f_{w,b}(x) - y \right) f_{w,b}(x) \left( 1 - f_{w,b}(x) \right) x_i$
Take $y^n = 1$. If $f_{w,b}(x^n) = 1$ (close to the target), then $\frac{\partial L}{\partial w_i} = 0$. But if $f_{w,b}(x^n) = 0$ (far from the target), $\frac{\partial L}{\partial w_i} = 0$ as well, so gradient descent makes no progress even though the prediction is wrong.

61 Cross Entropy vs. Square Error
Comparing the total loss surfaces over $w_1$ and $w_2$: with square error the surface is flat far from the minimum, whereas cross entropy keeps a useful gradient there.

62 Multi-class Classification
Class 1: $w^1, b^1$, $z_1 = w^1 \cdot x + b^1$. Class 2: $w^2, b^2$, $z_2 = w^2 \cdot x + b^2$. Class 3: $w^3, b^3$, $z_3 = w^3 \cdot x + b^3$.
The softmax function converts $z_1, z_2, z_3$ into probabilities $y_i$ with $0 < y_i < 1$ and $\sum_i y_i = 1$. Example: $z = (3, 1, -3)$ gives $e^z \approx (20, 2.7, 0.05)$ and softmax outputs $\approx (0.88, 0.12, 0)$.
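A minimal sketch of softmax on the slide's values z = (3, 1, -3), assuming Python with NumPy:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))
# ~ [0.88, 0.12, 0.002], matching the slide; the outputs sum to 1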

63 Multi-class Classification
Given input $x$, compute $z_1 = w^1 \cdot x + b^1$, $z_2 = w^2 \cdot x + b^2$, $z_3 = w^3 \cdot x + b^3$, apply softmax to obtain the output $y$, and train with the cross entropy $-\sum_{i=1}^{3} \hat{y}_i \ln y_i$ against the target $\hat{y}$.
Targets: if $x \in$ class 1, $\hat{y} = (1, 0, 0)^T$; if $x \in$ class 2, $\hat{y} = (0, 1, 0)^T$; if $x \in$ class 3, $\hat{y} = (0, 0, 1)^T$.

64 Limitation of Logistic Regression
Input features $(x_1, x_2)$ with XOR-style labels: $(0,0)$ and $(1,1)$ belong to one class, while $(0,1)$ and $(1,0)$ belong to the other.
A single logistic regression draws one linear boundary ($z \ge 0$ on one side, $z < 0$ on the other), so it cannot separate these two classes.

65 Limitation of Logistic Regression
Feature transformation: let $x_1'$ be the distance from $(x_1, x_2)$ to $(0, 0)$ and $x_2'$ be the distance to $(1, 1)$. In the transformed space $(x_1', x_2')$, the two classes become linearly separable. However, it is not always easy to find a good transformation by hand.

66 Limitation of Logistic Regression
Cascading logistic regression models: earlier models perform the feature transformation (producing $x_1'$ and $x_2'$), and a final model performs the classification (bias terms are ignored in this figure).

67 Feature Transformation
Folding the space

68 Deep Learning!
Cascading many such "neurons" gives a neural network.

