Deep Learning for Non-Linear Control


1 Deep Learning for Non-Linear Control
Shiyan Hu, Michigan Technological University

2 The General Non-Linear Control System
We need to model the nonlinear dynamics of the plant. How?

3 An Example
AC dynamics model (diagram). Inputs at time t: the current outside temperature, the required temperature, the available AC power levels, and the forecast outside temperature at t+1. Outputs at time t+1: the AC power level, the actual temperature, and the electricity bill. In most cases, one cannot analytically model these dynamics; one can only use historical operation data to approximate them. This is called learning.

4 What is Learning?
Learning means finding a function f, e.g., f(...) = the Dow Jones Industrial Average tomorrow (stock market forecast), or f(...) = wheel control (self-driving car).

5 Learning Examples
Supervised learning: Given inputs and target outputs, fit the model (e.g., classification). Translating one language into another goes beyond classification, since we cannot enumerate all possible sentences; this is called structured learning.
Unsupervised learning: What if we do not know the target outputs? That is, we do not know what we want to learn.
Reinforcement learning: A machine talks to a person, learns what works and what does not (only through a reward function), and gradually learns to speak; that is, it evolves through feedback. We do not feed it exact input/output pairs (not supervised), but we still give it feedback through a reward function (not unsupervised).

6 Unsupervised Learning
Learn the meanings of words and sentences by reading documents.

7 Reinforcement Learning

8 Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean

9 History of Deep Learning
1958: Perceptron (linear model)
1986: Backpropagation
2006: RBM initialization
2011: Starts to become popular in speech recognition
2012: Wins the ILSVRC image competition
2015.2: Image recognition surpassing human-level performance
2016.3: AlphaGo beats Lee Sedol
2016: Speech recognition systems as good as humans

10 Neural Network
A neural network is built from connected "neurons". Different connections lead to different network structures.

11 Deep = Many hidden layers
AlexNet (2012): 8 layers, 16.4% error. VGG (2014): 19 layers, 7.3% error. GoogleNet (2014): 6.7% error.

12 Deep = Many hidden layers
AlexNet (2012): 16.4% error. VGG (2014): 7.3% error. GoogleNet (2014): 6.7% error. Residual Net (2015): 3.57% error.

13 Supervised Learning
Statistical and signal processing techniques: linear regression, logistic regression, nonlinear regression.
Machine learning techniques: SVM, deep learning.

14 Learning Basics
You are given some training data.
You learn a function/model from these training data.
You use this model to process testing data.
Training data and testing data do not necessarily share the same properties.

15 Linear Regression: Input Data
Training data: $(x^1, y^1), (x^2, y^2), \dots, (x^{10}, y^{10})$; each pair $(x^n, y^n)$ is real data.
Function to fit: $y = b + wx$, a linear function where $w$ and $b$ are scalars.

16 An Example
Training data: (10, 5), (20, 6), (30, 7), (40, 8), (50, 9).
$5 = b + 10w$, $6 = b + 20w$, $7 = b + 30w$, $8 = b + 40w$, $9 = b + 50w$.
Compute $b$ and $w$ that best fit these data.
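As a quick check of this example, here is a minimal sketch in Python with NumPy (the slides themselves show no code) that solves the least-squares system in closed form:

import numpy as np

# Training data from the example: (x, y) pairs
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

# Solve the least-squares system y = b + w*x in closed form
A = np.column_stack([np.ones_like(x), x])   # columns: [1, x]
(b, w), *_ = np.linalg.lstsq(A, y, rcond=None)
print(b, w)   # for this data the fit is exact: b = 4.0, w = 0.1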

17 Linear Regression: Function to Learn
Compute scalars $w$ and $b$ such that $b + wx$ best approximates $y$. One can convert this into an optimization problem minimizing a loss function:
$L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$
$w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$

18 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^0}$.
If the derivative is negative, increase $w$; if it is positive, decrease $w$.

19 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^0}$ and update $w^1 \leftarrow w^0 - \eta \left. \frac{dL}{dw} \right|_{w=w^0}$.
$\eta$ is called the "learning rate".

20 Linear Regression: Gradient Descent
$w^* = \arg\min_w L(w)$. Consider a loss function $L(w)$ with one parameter $w$:
(Randomly) pick an initial value $w^0$; compute $\left. \frac{dL}{dw} \right|_{w=w^0}$ and update $w^1 \leftarrow w^0 - \eta \left. \frac{dL}{dw} \right|_{w=w^0}$.
Compute $\left. \frac{dL}{dw} \right|_{w=w^1}$ and update $w^2 \leftarrow w^1 - \eta \left. \frac{dL}{dw} \right|_{w=w^1}$, and so on.
After many iterations $w^0, w^1, w^2, \dots, w^T$, the procedure reaches a local optimum, which is not necessarily the global optimum.
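A minimal sketch of this update loop, assuming Python; the toy loss L(w) = (w - 3)^2 is only an illustration and does not come from the slides:

# One-parameter gradient descent: w_{t+1} = w_t - eta * dL/dw
def gradient_descent_1d(dL_dw, w0, eta=0.1, iterations=100):
    w = w0
    for _ in range(iterations):
        w = w - eta * dL_dw(w)   # step against the gradient
    return w

# Toy example: L(w) = (w - 3)^2, so dL/dw = 2*(w - 3)
w_star = gradient_descent_1d(lambda w: 2 * (w - 3), w0=0.0)
print(w_star)   # approaches the minimizer w = 3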

21 Momentum
Momentum still does not guarantee reaching the global minimum, but it gives some hope. Movement = negative of $\partial L / \partial w$ + momentum (part of the previous movement), so the real movement can carry the parameters past points where $\partial L / \partial w = 0$.
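A minimal sketch of the momentum update, assuming Python; the decay factor beta = 0.9 is an assumed typical value, not given on the slide:

# Gradient descent with momentum: movement = momentum - eta * gradient
def momentum_step(w, velocity, grad, eta=0.01, beta=0.9):
    velocity = beta * velocity - eta * grad   # keep part of the previous movement
    w = w + velocity                          # can keep moving even where grad ~ 0
    return w, velocity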

22 Linear Regression: Gradient Descent
How about two parameters? $w^*, b^* = \arg\min_{w,b} L(w,b)$, with gradient $\nabla L = \left[ \frac{\partial L}{\partial w}, \frac{\partial L}{\partial b} \right]^T$.
(Randomly) pick initial values $w^0$, $b^0$.
Compute $\left. \frac{\partial L}{\partial w} \right|_{w=w^0, b=b^0}$ and $\left. \frac{\partial L}{\partial b} \right|_{w=w^0, b=b^0}$; update $w^1 \leftarrow w^0 - \eta \left. \frac{\partial L}{\partial w} \right|_{w=w^0, b=b^0}$ and $b^1 \leftarrow b^0 - \eta \left. \frac{\partial L}{\partial b} \right|_{w=w^0, b=b^0}$.
Compute $\left. \frac{\partial L}{\partial w} \right|_{w=w^1, b=b^1}$ and $\left. \frac{\partial L}{\partial b} \right|_{w=w^1, b=b^1}$; update $w^2 \leftarrow w^1 - \eta \left. \frac{\partial L}{\partial w} \right|_{w=w^1, b=b^1}$ and $b^2 \leftarrow b^1 - \eta \left. \frac{\partial L}{\partial b} \right|_{w=w^1, b=b^1}$, and so on.
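A minimal sketch of this two-parameter procedure for the linear-regression loss above, assuming Python with NumPy; the learning rate and iteration count are illustrative choices:

import numpy as np

def fit_linear(x, y, eta=1e-4, iterations=10000):
    """Gradient descent on L(w, b) = sum_n (y^n - (b + w*x^n))^2."""
    w, b = 0.0, 0.0
    for _ in range(iterations):
        err = y - (b + w * x)            # residuals y^n - (b + w*x^n)
        grad_w = -2.0 * np.sum(err * x)  # dL/dw
        grad_b = -2.0 * np.sum(err)      # dL/db
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b                          # gradually approaches the least-squares fit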

23 2D Gradient Descent
Picture the loss as a contour plot over the $(b, w)$ plane, where color indicates the value of $L(w,b)$. At each step, compute $\frac{\partial L}{\partial b}$ and $\frac{\partial L}{\partial w}$ and move by $\left( -\eta \frac{\partial L}{\partial b},\; -\eta \frac{\partial L}{\partial w} \right)$.

24 Convex L
Getting stuck in a local optimum is not a concern in linear regression, where the loss function $L(w,b)$ is convex, so gradient descent reaches the global optimum.

25 Compute Gradient Descent
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for $L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( - x^n \right)$, which gives the update $w \leftarrow w - \eta \sum_n -2 \left( y^n - (b + w \cdot x^n) \right) x^n$.
$\frac{\partial L}{\partial b} = ?$

26 Compute Gradient Descent
Formulation of $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ for $L(w,b) = \sum_{n=1}^{10} \left( y^n - (b + w \cdot x^n) \right)^2$:
$\frac{\partial L}{\partial w} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( - x^n \right)$
$\frac{\partial L}{\partial b} = \sum_{n=1}^{10} 2 \left( y^n - (b + w \cdot x^n) \right) \left( -1 \right)$

27 How about the results?
Fitted model: $y = b + wx$ with $b = -188.4$ and $w = 2.7$. Let $e^n$ denote the square error on the $n$-th training point. Average error on the training data $= \frac{1}{10} \sum_{n=1}^{10} e^n = 31.9$.

28 Generalization?
What we really care about is the error on new data (testing data). Applying the same model $y = b + wx$ with $b = -188.4$ and $w = 2.7$ to the testing data gives an average error of 35.0, which is larger than the average error on the training data (31.9). How can we do better?

29 More Complex f
Model: $y = b + w_1 x + w_2 x^2$. Best fit: $b = -10.3$, $w_1 = 1.0$, $w_2 = 2.7 \times 10^{-3}$.
Training average error $= 15.4$; testing average error $= 18.4$. Better! Could it be even better?

30 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3$.
Training average error $= 15.3$; testing average error $= 18.1$. Slightly better. How about a more complex model?

31 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$.
Training average error $= 14.9$; testing average error $= 28.8$. The results become worse ...

32 More Complex f
Model: $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$.
Training average error $= 12.8$; testing average error $= 232.1$. The results are bad.

33 Training and Testing Fitting Error
Order 1: training error 31.9, testing error 35.0
Order 2: training error 15.4, testing error 18.4
Order 3: training error 15.3, testing error 18.1
Order 4: training error 14.9, testing error 28.2
Order 5: training error 12.8, testing error 232.1
A more complex model does not always lead to better performance on testing data; this is due to overfitting. Where does the error come from?

34 Estimator
The true function is $\hat{f}$. From the training data we find $f^*$, which is an estimator of $\hat{f}$. The error of $f^*$ comes from bias plus variance.

35 Bias and Variance of Estimator
Assume that a variable $x$ follows a PDF with mean $\mu$ and variance $\sigma^2$, and we want to estimate them.
Estimator: sample $N$ points from the PDF, $x^1, x^2, \dots, x^N$, and compute $m = \frac{1}{N} \sum_n x^n$ and $s^2 = \frac{1}{N} \sum_n (x^n - m)^2$.
$E[m] = E\left[ \frac{1}{N} \sum_n x^n \right] = \frac{1}{N} \sum_n E[x^n] = \mu$, so $m$ is unbiased.
$E[s^2] = \frac{N-1}{N} \sigma^2$, so $s^2$ is biased.
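A quick numerical check of these two results, assuming Python with NumPy; the normal distribution, seed, and sample size are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
N, trials = 5, 100000
mu, sigma2 = 0.0, 1.0

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
m = samples.mean(axis=1)                         # sample mean per trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)  # sample variance per trial

print(m.mean())    # ~ mu: the mean estimator is unbiased
print(s2.mean())   # ~ (N-1)/N * sigma2 = 0.8: the variance estimator is biased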

36 $E[f^*] = \bar{f}$
Bias measures how far $\bar{f}$ is from the true function $\hat{f}$; variance measures how spread out the individual $f^*$ are around $\bar{f}$.

37 How to Compute Bias and Variance?
Assume that we have many sets of training data, and that we insist on using the linear model $y = b + w \cdot x$.

38 Training Results
Different training data lead to different fitted functions $f^*$: $y = b + w \cdot x$ versus $y = b' + w' \cdot x$.

39 Different Functions/Models
$y = b + w \cdot x$
$y = b + w_1 x + w_2 x^2 + w_3 x^3$
$y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$

40 Black curve: the true function $\hat{f}$
Red curves: the fitted functions $f^*$. Blue curve: the average of the $f^*$, i.e., $\bar{f}$.

41 Bias vs. Variance
Simple model $y = b + w \cdot x$: large bias, small variance.
Complex model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$: small bias, large variance.
A simpler model is less influenced by the sampled data, while a more complex model tends to overfit the data (it is impacted more by changes in the data).

42 Bias vs. Variance
As model complexity grows, the error from bias decreases while the error from variance increases; the observed error combines both. Large bias with small variance corresponds to underfitting; small bias with large variance corresponds to overfitting.

43 What to do with large bias?
Diagnosis: if your model cannot even fit the training examples, you have large bias (underfitting). If you can fit the training data but the error on testing data is large, you probably have large variance (overfitting).
For large bias, redesign your model: add more features as input, or use a more complex function/model.

44 What to do with large variance?
Collect more data: very effective, but not always practical (compare fits obtained with 10 examples versus 100 examples).
Regularization.

45 Exercise
Suppose that you have data and will use the model $y = b + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4 + w_5 x^5$ to fit them.
Plan 1: Put all of the data into your training process.
Plan 2: Partition the data into 10 sets, run regression on each of these 10 sets to obtain 10 different functions, then average these 10 functions and return the average.
Which plan is better?

46 High Dimensional Data
$\boldsymbol{x}_1 = (1, 5, 7, 4, 12, 10, 9, 20, 50)^T$, so its 5th component is 12.
$\boldsymbol{x}_2 = (2, 9, 5, 15, 19, 17, 9, 21, 52)^T$, so its 5th component is 19.
$\boldsymbol{y} = (10, 2, 3, 7, 5, 8, 35, 19, 29)^T$, so its 5th component is 5.
Model: $\boldsymbol{y} = w_1 \boldsymbol{x}_1 + w_2 \boldsymbol{x}_2 + \boldsymbol{b} = [\boldsymbol{x}_1, \boldsymbol{x}_2] \cdot \boldsymbol{w} + \boldsymbol{b}$, where $\boldsymbol{w} = (w_1, w_2)^T$ and $\boldsymbol{b}$ are vectors. In general, $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x}_i$.
We can still use gradient descent to solve $\min L = \sum_n \left( y^n - \left( b + \sum_i w_i x_i^n \right) \right)^2$.

47 Improve Robustness: Regularization
Model: $\boldsymbol{y} = \boldsymbol{b} + \sum_i w_i \boldsymbol{x}_i$. Functions with smaller $w_i$ are better, so add a penalty to the loss:
$L = \sum_n \left( y^n - \left( b + \sum_i w_i x_i^n \right) \right)^2 + \lambda \sum_i w_i^2$
Why are smooth functions preferred? If the input is perturbed, $y = b + \sum_i w_i (x_i + \Delta x_i) = b + \sum_i w_i x_i + \sum_i w_i \Delta x_i$, so smaller $w_i$ means that noise induced on the input $x_i$ at testing time has less influence on the estimated $y$.

48 Regularization Results
λ = 0: training error 1.9, testing error 102.3
λ = 1: training error 2.3, testing error 68.7
λ = 10: training error 3.5, testing error 25.7
λ = 100: training error 4.1, testing error 11.1
λ = 1000: training error 5.6, testing error 12.8
λ = 10000: training error 6.3, testing error 18.7
λ = 100000: training error 8.5, testing error 26.8
Larger λ gives a smoother function. We prefer smooth functions, but not too smooth.

49 Logistic Regression
What if the y values are binary? This is a classification problem. Training data: $(x^1, y^1), (x^2, y^2), \dots, (x^{10}, y^{10})$, where each $x^n$ belongs to class $C_1$ or $C_2$.

50 Probabilistic Interpretation
Assume that the training data are generated from a probability distribution. We aim to estimate this distribution by $f(x)$, which is characterized by parameters $w$ and $b$, so we also write it as $f_{w,b}(x)$.
If $P(C_1 | x) = f(x) > 0.5$, output $y =$ class 1; otherwise, output $y =$ class 2.

51 Function
$P(C_1 | x) = f_{w,b}(x) = \sigma(z) = \sigma\left( \sum_i w_i x_i + b \right)$, where $\sigma$ is the sigmoid function.
If $z \ge 0$, then $P(C_1 | x) = f_{w,b}(x) = \sigma(z) \ge 0.5$, so the data point is assigned to class 1.
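A minimal sketch of this model, assuming Python with NumPy:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_class1_prob(x, w, b):
    z = np.dot(w, x) + b   # z = sum_i w_i * x_i + b
    return sigmoid(z)      # P(C1 | x); >= 0.5 exactly when z >= 0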

52 Data Generation Probability
Training data: $x^1, x^2, x^3, \dots, x^N$, each labeled $C_1$ or $C_2$. Given a set of $w$ and $b$, what is the probability of generating these data?
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots f_{w,b}(x^N)$
Which model (characterized by $w^*$ and $b^*$) generates these data with the largest probability (maximum likelihood)?
$w^*, b^* = \arg\max_{w,b} L(w,b)$

53 Let $y^n = 1$ for class 1 and $y^n = 0$ for class 2 (so $y^1 = 1$, $y^2 = 1$, $y^3 = 0$, ...).
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots$
$w^*, b^* = \arg\max_{w,b} L(w,b) = \arg\min_{w,b} -\ln L(w,b)$
$-\ln L(w,b) = -\ln f_{w,b}(x^1) - \ln f_{w,b}(x^2) - \ln\left( 1 - f_{w,b}(x^3) \right) - \cdots$
Each term can be written as $-\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$.

54 Error Function
$L(w,b) = f_{w,b}(x^1) \, f_{w,b}(x^2) \, \left( 1 - f_{w,b}(x^3) \right) \cdots f_{w,b}(x^N)$, with $y^n = 1$ for class 1 and $y^n = 0$ for class 2.
$-\ln L(w,b) = \sum_n -\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$
This is the cross entropy between two Bernoulli distributions (estimated output vs. true output):
Distribution $p$: $p(x=1) = y^n$, $p(x=0) = 1 - y^n$. Distribution $q$: $q(x=1) = f(x^n)$, $q(x=0) = 1 - f(x^n)$.
Cross entropy: $H(p, q) = -\sum_x p(x) \ln q(x)$.
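A minimal sketch of this cross-entropy loss over a training set, assuming Python with NumPy; the small epsilon clip is added only for numerical safety:

import numpy as np

def cross_entropy(y, f):
    """-sum_n [ y^n ln f(x^n) + (1 - y^n) ln(1 - f(x^n)) ], with y^n in {0, 1}."""
    eps = 1e-12                        # avoid log(0)
    f = np.clip(f, eps, 1.0 - eps)
    return -np.sum(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))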

55 Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression uses $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$ with output between 0 and 1; linear regression uses $f_{w,b}(x) = \sum_i w_i x_i + b$ with output of any value.
Step 2 (loss): training data are $(x^n, y^n)$. For logistic regression, $y^n$ is 1 for class 1 and 0 for class 2, and $L(f) = \sum_n C\left( f(x^n), y^n \right)$ with cross entropy $C\left( f(x^n), y^n \right) = -\left[ y^n \ln f(x^n) + (1 - y^n) \ln\left( 1 - f(x^n) \right) \right]$. For linear regression, $y^n$ is a real number and $L(f) = \frac{1}{2} \sum_n \left( f(x^n) - y^n \right)^2$.

56 Gradient Descent
$-\ln L(w,b) = \sum_n -\left[ y^n \ln f_{w,b}(x^n) + (1 - y^n) \ln\left( 1 - f_{w,b}(x^n) \right) \right]$, where $f_{w,b}(x) = \sigma(z)$, $z = w \cdot x + b = \sum_i w_i x_i + b$, and $\sigma(z) = \frac{1}{1 + \exp(-z)}$.
$\frac{\partial \ln f_{w,b}(x)}{\partial w_i} = \frac{\partial \ln f_{w,b}(x)}{\partial z} \frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \sigma(z) \left( 1 - \sigma(z) \right) = 1 - \sigma(z)$.
Hence $\frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} = \left( 1 - f_{w,b}(x^n) \right) x_i^n$.

57 Gradient Descent
Similarly, $\frac{\partial \ln\left( 1 - f_{w,b}(x) \right)}{\partial w_i} = \frac{\partial \ln\left( 1 - f_{w,b}(x) \right)}{\partial z} \frac{\partial z}{\partial w_i}$, where $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial \ln\left( 1 - \sigma(z) \right)}{\partial z} = -\frac{1}{1 - \sigma(z)} \frac{\partial \sigma(z)}{\partial z} = -\frac{1}{1 - \sigma(z)} \sigma(z) \left( 1 - \sigma(z) \right) = -\sigma(z)$.
Hence $\frac{\partial \ln\left( 1 - f_{w,b}(x^n) \right)}{\partial w_i} = -f_{w,b}(x^n) \, x_i^n$.

58 Gradient Descent
$\frac{\partial \left( -\ln L(w,b) \right)}{\partial w_i} = \sum_n -\left[ y^n \left( 1 - f_{w,b}(x^n) \right) x_i^n - (1 - y^n) f_{w,b}(x^n) x_i^n \right]$
$= \sum_n -\left[ y^n - y^n f_{w,b}(x^n) - f_{w,b}(x^n) + y^n f_{w,b}(x^n) \right] x_i^n = \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$
Update: $w_i \leftarrow w_i - \eta \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$. The larger the difference between the target $y^n$ and the output $f_{w,b}(x^n)$, the larger the update.
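Putting the pieces together, a minimal sketch of the resulting training loop, assuming Python with NumPy; the learning rate and iteration count are illustrative:

import numpy as np

def train_logistic(X, y, eta=0.1, iterations=1000):
    """X: (N, d) inputs; y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(iterations):
        f = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # f_{w,b}(x^n) for all n
        diff = y - f                             # larger difference -> larger update
        w += eta * (X.T @ diff)                  # w_i <- w_i + eta * sum_n (y^n - f) x_i^n
        b += eta * np.sum(diff)
    return w, b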

59 Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$, output between 0 and 1; linear regression $f_{w,b}(x) = \sum_i w_i x_i + b$, output of any value.
Step 2 (loss): for logistic regression ($y^n$ is 1 for class 1, 0 for class 2), $L(f) = \sum_n C\left( f(x^n), y^n \right)$; for linear regression ($y^n$ is a real number), $L(f) = \frac{1}{2} \sum_n \left( f(x^n) - y^n \right)^2$.
Step 3 (update): both use the same form $w_i \leftarrow w_i - \eta \sum_n -\left( y^n - f_{w,b}(x^n) \right) x_i^n$.

60 Logistic Regression + Square Error
Step 1: $f_{w,b}(x) = \sigma\left( \sum_i w_i x_i + b \right)$. Step 2: training data $(x^n, y^n)$ with $y^n = 1$ for class 1 and 0 for class 2, and loss $L(f) = \frac{1}{2} \sum_n \left( f_{w,b}(x^n) - y^n \right)^2$.
Step 3: $\frac{\partial \left( f_{w,b}(x) - y \right)^2}{\partial w_i} = 2 \left( f_{w,b}(x) - y \right) f_{w,b}(x) \left( 1 - f_{w,b}(x) \right) x_i$
Take $y^n = 1$. If $f_{w,b}(x^n) = 1$ (close to the target), then $\frac{\partial L}{\partial w_i} = 0$. But if $f_{w,b}(x^n) = 0$ (far from the target), $\frac{\partial L}{\partial w_i} = 0$ as well, so gradient descent makes no progress even though the prediction is wrong.

61 Cross Entropy vs. Square Error
Comparing the total loss surfaces over $w_1$ and $w_2$: with square error the surface is flat far from the minimum, whereas cross entropy keeps a useful gradient there.

62 Multi-class Classification
Class 1: $w^1, b^1$, $z_1 = w^1 \cdot x + b^1$. Class 2: $w^2, b^2$, $z_2 = w^2 \cdot x + b^2$. Class 3: $w^3, b^3$, $z_3 = w^3 \cdot x + b^3$.
The softmax function converts $z_1, z_2, z_3$ into probabilities $y_i$ with $0 < y_i < 1$ and $\sum_i y_i = 1$. Example: $z = (3, 1, -3)$ gives $e^z \approx (20, 2.7, 0.05)$ and softmax outputs $\approx (0.88, 0.12, 0)$.
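A minimal sketch of softmax on the slide's values z = (3, 1, -3), assuming Python with NumPy:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(softmax(np.array([3.0, 1.0, -3.0])))
# ~ [0.88, 0.12, 0.002], matching the slide; the outputs sum to 1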

63 Multi-class Classification
Given input $x$, compute $z_1 = w^1 \cdot x + b^1$, $z_2 = w^2 \cdot x + b^2$, $z_3 = w^3 \cdot x + b^3$, apply softmax to obtain the output $y$, and train with the cross entropy $-\sum_{i=1}^{3} \hat{y}_i \ln y_i$ against the target $\hat{y}$.
Targets: if $x \in$ class 1, $\hat{y} = (1, 0, 0)^T$; if $x \in$ class 2, $\hat{y} = (0, 1, 0)^T$; if $x \in$ class 3, $\hat{y} = (0, 0, 1)^T$.

64 Limitation of Logistic Regression
Input features $(x_1, x_2)$ with XOR-style labels: $(0,0)$ and $(1,1)$ belong to one class, while $(0,1)$ and $(1,0)$ belong to the other.
A single logistic regression draws one linear boundary ($z \ge 0$ on one side, $z < 0$ on the other), so it cannot separate these two classes.

65 Limitation of Logistic Regression
Feature transformation: let $x_1'$ be the distance from $(x_1, x_2)$ to $(0, 0)$ and $x_2'$ be the distance to $(1, 1)$. In the transformed space $(x_1', x_2')$, the two classes become linearly separable. However, it is not always easy to find a good transformation by hand.

66 Limitation of Logistic Regression
Cascading logistic regression models: earlier models perform the feature transformation (producing $x_1'$ and $x_2'$), and a final model performs the classification (bias terms are ignored in this figure).

67 Feature Transformation
Folding the space

68 Deep Learning!
Cascading many such "neurons" gives a neural network.

