1
Deep Learning for Non-Linear Control
Shiyan Hu Michigan Technological University
2
The General Non-Linear Control System
We need to model the nonlinear dynamics of the plant. How?
3
An Example: AC Dynamics Model
At time t, the model takes the current outside temperature, the required temperature, the available AC power levels, and the forecast outside temperature at t+1; it produces the AC power level, the actual temperature, and the electricity bill at t+1.
In most cases, one cannot analytically model these dynamics and can only use historical operation data to approximate them. This is called learning.
4
What is Learning? Learning means finding a function f:
Stock market forecast: f(·) = the Dow Jones Industrial Average tomorrow
Self-driving car: f(·) = wheel control
5
Learning Examples
Supervised learning: given inputs and target outputs, fit the model (e.g., classification). Translating one language into another goes beyond classification, since we cannot enumerate all possible sentences; this is called structured learning.
Unsupervised learning: what if we do not know the target outputs? That is, we do not know exactly what we want to learn.
Reinforcement learning: a machine talks to a person, learns what works and what does not only through a reward function, and gradually learns to speak. It evolves by getting feedback. We do not feed it exact input-output pairs (so it is not supervised), but we still give it feedback through a reward function (so it is not unsupervised).
6
Unsupervised Learning
Learn the meanings of words and sentences by reading documents.
7
Reinforcement Learning
8
Deep learning trends at Google. Source: SIGMOD 2016/Jeff Dean
9
History of Deep Learning
1958: Perceptron (linear model)
1986: Backpropagation
2006: RBM initialization
2011: Deep learning starts to become popular in speech recognition
2012: Wins the ILSVRC image competition
2015.2: Image recognition surpasses human-level performance
2016.3: AlphaGo beats Lee Sedol
Speech recognition systems become as good as humans
10
Neural Network
A neural network is built from units called "neurons".
Different connections lead to different network structures.
11
Deep = Many hidden layers
Error rates in the ILSVRC image competition: AlexNet (2012, 8 layers) 16.4%; VGG (2014, 19 layers) 7.3%; GoogleNet (2014) 6.7%.
12
Deep = Many hidden layers
AlexNet (2012) 16.4%; VGG (2014) 7.3%; GoogleNet (2014) 6.7%; Residual Net (2015) 3.57%.
13
Supervised Learning
Statistical and signal processing techniques: linear regression, logistic regression, nonlinear regression
Machine learning techniques: SVM, deep learning
14
Learning Basics: Training Data and Testing Data
You are given some training data.
You learn a function/model from these training data.
You then use this model to process testing data.
Training data and testing data do not necessarily share the same properties.
15
Linear Regression: Input Data
Training data: (x^1, y^1), (x^2, y^2), …, (x^10, y^10). Each (x^n, y^n) is a real data point. The model to fit is the linear function y = b + wx, where w and b are scalars.
16
An Example
Training data: (10, 5), (20, 6), (30, 7), (40, 8), (50, 9), which give the equations
5 = b + 10w
6 = b + 20w
7 = b + 30w
8 = b + 40w
9 = b + 50w
Compute the b and w that best fit these data; a minimal sketch of this fit follows.
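Below is a minimal sketch (Python with NumPy, not part of the original slides) of the least-squares fit for these five points; it assumes the third point is (30, 7), as the equations imply.

```python
import numpy as np

# Minimal sketch: fit y = b + w*x to the five training points from the slide.
# The slide's equations imply the third point is (30, 7), so that value is used here.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

# Least-squares solution of the overdetermined system y = b + w*x.
A = np.column_stack([np.ones_like(x), x])   # columns: [1, x]
(b, w), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"b = {b:.3f}, w = {w:.3f}")          # this data is exactly linear: b = 4, w = 0.1
```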
17
Linear Regression: Function to Learn
To compute scalars w and b such that b + wx best approximates y, one can convert the task into an optimization problem that minimizes a loss function:
L(w, b) = Σ_n (y^n − (b + w·x^n))²
w*, b* = arg min_{w,b} L(w, b) = arg min_{w,b} Σ_n (y^n − (b + w·x^n))²
18
Linear Regression: Gradient Descent
Consider a loss function L(w) with a single parameter w, and w* = arg min_w L(w):
(Randomly) pick an initial value w^0.
Compute dL/dw at w = w^0.
If the derivative is negative, increase w; if it is positive, decrease w.
19
Linear Regression: Gradient Descent
Consider the loss function L(w) with one parameter w:
(Randomly) pick an initial value w^0.
Compute dL/dw at w = w^0.
Update w^1 ← w^0 − η·dL/dw |_{w=w^0}, where η is called the "learning rate".
20
Linear Regression: Gradient Descent
Continue the iteration:
Compute dL/dw at w = w^1 and update w^2 ← w^1 − η·dL/dw |_{w=w^1}.
After many iterations w^0, w^1, w^2, …, w^T, the procedure reaches a local optimum, which is not necessarily the global optimum.
21
Momentum still does not guarantee reaching the global minimum, but it gives some hope.
With momentum, the movement is the negative of ∂L/∂w plus a momentum term (a fraction of the previous movement). The real movement therefore keeps going even where ∂L/∂w = 0, which can carry the search past plateaus and shallow local minima. A minimal sketch follows.
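A minimal sketch of the momentum update, assuming a made-up one-dimensional loss and illustrative values for the learning rate η and the momentum coefficient:

```python
import numpy as np

# Minimal sketch of gradient descent with momentum on a 1-D loss L(w).
# The loss below is a made-up example (not from the slides); eta and beta
# are illustrative hyperparameter choices.
def dL_dw(w):
    return 2.0 * (w - 3.0)          # gradient of L(w) = (w - 3)^2

eta, beta = 0.1, 0.9                # learning rate and momentum coefficient
w, v = 0.0, 0.0                     # initial weight and initial "velocity"

for _ in range(200):
    v = beta * v - eta * dL_dw(w)   # movement = momentum term + negative gradient
    w = w + v                       # real movement keeps some of the previous direction

print(round(w, 4))                  # close to the minimizer w = 3
```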
22
Linear Regression: Gradient Descent
How about two parameters? w*, b* = arg min_{w,b} L(w, b). The gradient is ∇L = (∂L/∂w, ∂L/∂b).
(Randomly) pick initial values w^0, b^0.
Compute ∂L/∂w and ∂L/∂b at (w^0, b^0), then update
w^1 ← w^0 − η·∂L/∂w |_{w=w^0, b=b^0}
b^1 ← b^0 − η·∂L/∂b |_{w=w^0, b=b^0}
Compute the gradients again at (w^1, b^1), then update
w^2 ← w^1 − η·∂L/∂w |_{w=w^1, b=b^1}
b^2 ← b^1 − η·∂L/∂b |_{w=w^1, b=b^1}
23
2D Gradient Descent
Figure: contour plot over (w, b), where color encodes the value of the loss L(w, b). At each point, compute ∂L/∂w and ∂L/∂b and move in the direction (−η·∂L/∂b, −η·∂L/∂w).
24
Convex L: this is not a concern in linear regression, where the loss function L is convex, so gradient descent reaches the global optimum.
25
Compute Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b, where L(w, b) = Σ_n (y^n − (b + w·x^n))²:
∂L/∂w = Σ_n 2·(y^n − (b + w·x^n))·(−x^n)
∂L/∂b = ?
These gradients drive the update w ← w − η·∂L/∂w (and similarly for b).
26
Compute Gradient Descent
Formulation of ∂L/∂w and ∂L/∂b, where L(w, b) = Σ_n (y^n − (b + w·x^n))²:
∂L/∂w = Σ_n 2·(y^n − (b + w·x^n))·(−x^n)
∂L/∂b = Σ_n 2·(y^n − (b + w·x^n))·(−1)
A minimal sketch applying these gradients follows.
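A minimal sketch that plugs these two partial derivatives into the update rule; the data reuses the earlier five-point example, and η and the iteration count are illustrative choices:

```python
import numpy as np

# Minimal sketch: gradient descent on L(w, b) = sum_n (y^n - (b + w*x^n))^2
# using the partial derivatives derived above. Many iterations are needed
# because w and b live on very different scales for this data.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([5.0, 6.0, 7.0, 8.0, 9.0])

w, b = 0.0, 0.0
eta = 1e-4                                   # learning rate

for _ in range(200000):
    residual = y - (b + w * x)               # y^n - (b + w*x^n)
    dL_dw = np.sum(2.0 * residual * (-x))    # dL/dw
    dL_db = np.sum(2.0 * residual * (-1.0))  # dL/db
    w -= eta * dL_dw
    b -= eta * dL_db

print(f"w = {w:.3f}, b = {b:.3f}")           # approaches w = 0.1, b = 4
```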
27
How about the results? Fitting y = b + wx to the training data gives b = −188.4 and w = 2.7.
Let e^n denote the squared error on the n-th training point. Average error on the training data = Σ_{n=1}^{10} e^n = 31.9.
28
Generalization? What we really care about is the error on new data (testing data). Using the same model y = b + wx with b = −188.4 and w = 2.7, the average error on the testing data is Σ_{n=1}^{10} e^n = 35.0, which is larger than the average error on the training data (31.9). How can we do better?
29
More Complex f: Best Training vs. Testing
Model: y = b + w1·x + w2·(x)^2, with b = −10.3, w1 = 1.0, w2 = 2.7 × 10^−3.
Training average error = 15.4; testing average error = 18.4. Better! Could it be even better?
30
More Complex f: Best Training vs. Testing
Model: y = b + w1·x + w2·(x)^2 + w3·(x)^3.
Training average error = 15.3; testing average error = 18.1. Slightly better. How about an even more complex model?
31
More Complex f: Best Training vs. Testing
Model: y = b + w1·x + w2·(x)^2 + w3·(x)^3 + w4·(x)^4 (the fourth-order model).
Training average error = 14.9; testing average error = 28.8. The results become worse ...
32
More Complex f: Best Training vs. Testing
Model: y = b + w1·x + w2·(x)^2 + w3·(x)^3 + w4·(x)^4 + w5·(x)^5 (the fifth-order model).
Training average error = 12.8; testing average error = 232.1. The results are bad.
33
Training and Testing Fitting Error
Model order | Training error | Testing error
1 | 31.9 | 35.0
2 | 15.4 | 18.4
3 | 15.3 | 18.1
4 | 14.9 | 28.8
5 | 12.8 | 232.1
A more complex model does not always lead to better performance on testing data; this is due to overfitting. A sketch of this kind of experiment follows. Where does the error come from?
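A minimal sketch of this kind of experiment, using synthetic data (a noisy quadratic) rather than the slide's dataset, so the exact numbers will differ:

```python
import numpy as np

# Minimal sketch of the overfitting experiment: fit polynomials of degree 1..5
# and compare training and testing error on synthetic data.
rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 10, n)
    y = 3.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0, 2.0, n)   # true curve + noise
    return x, y

x_train, y_train = make_data(10)
x_test, y_test = make_data(10)

for degree in range(1, 6):
    coeffs = np.polyfit(x_train, y_train, degree)            # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # Training error never increases with degree; testing error typically
    # blows up once the model starts fitting the noise.
    print(f"degree {degree}: train {train_err:8.2f}   test {test_err:8.2f}")
```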
34
Estimator
Let f̂ denote the true function; the error of the estimate comes from bias plus variance.
From the training data, we find f*, which is an estimator of f̂.
35
Bias and Variance of Estimator
Assume that a variable x follows a distribution with mean μ and variance σ², and we want to estimate these from samples.
Estimator: sample N points x^1, x^2, …, x^N from the distribution.
m = (1/N)·Σ_n x^n
s² = (1/N)·Σ_n (x^n − m)²
E[m] = E[(1/N)·Σ_n x^n] = (1/N)·Σ_n E[x^n] = μ, so m is unbiased.
E[s²] = ((N − 1)/N)·σ², so s² is biased.
A minimal simulation sketch follows.
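A minimal simulation sketch checking both claims; the Gaussian distribution and the sample size are illustrative choices:

```python
import numpy as np

# Minimal sketch: the sample mean m is unbiased, while the variance estimator
# s^2 with a 1/N factor has expectation ((N-1)/N) * sigma^2.
rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 5, 200_000

samples = rng.normal(mu, sigma, size=(trials, N))
m = samples.mean(axis=1)                          # sample mean of each trial
s2 = ((samples - m[:, None]) ** 2).mean(axis=1)   # biased variance estimator (1/N)

print(m.mean())    # close to mu = 5
print(s2.mean())   # close to ((N-1)/N) * sigma^2 = 0.8 * 4 = 3.2
```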
36
Figure: E[f*] = f̄. The bias is the distance between f̄ and the true function f̂; the variance is the spread of f* around f̄.
37
How to Compute Bias and Variance?
Assume that we have several sets of training data and that we insist on using the linear model y = b + w·x.
38
Training Results
Different training data lead to different fitted functions f*: one training set gives y = b + w·x, another gives y = b′ + w′·x.
39
Different Functions/Models
y = b + w·x
y = b + w1·x + w2·(x)^2 + w3·(x)^3
y = b + w1·x + w2·(x)^2 + w3·(x)^3 + w4·(x)^4 + w5·(x)^5
40
Figure: the black curve is the true function f̂, the red curves are the fitted functions f* from different training sets, and the blue curve is the average of the f*, denoted f̄.
41
Bias v.s. Variance
Simple model, e.g., y = b + w·x: large bias, small variance.
Complex model, e.g., y = b + w1·x + w2·(x)^2 + w3·(x)^3 + w4·(x)^4 + w5·(x)^5: small bias, large variance.
A simpler model is less influenced by the sampled data, while a more complex model tends to overfit the data (it is impacted more by changes in the data).
42
Bias v.s. Variance: Underfitting and Overfitting
The observed error is the sum of the error from bias and the error from variance. Underfitting corresponds to large bias and small variance; overfitting corresponds to small bias and large variance.
43
What to do with large bias?
Diagnosis:
If your model cannot even fit the training examples, you have large bias (underfitting).
If you can fit the training data but the error on the testing data is large, you probably have large variance (overfitting).
For large bias, redesign your model: add more features as input, or use a more complex function/model.
44
What to do with large variance?
Collect more data: very effective (compare fitting with 10 examples vs. 100 examples), but not always practical.
Use regularization.
45
Exercise: Suppose that you have data and you will use the function/model y = b + w1·x + w2·(x)^2 + w3·(x)^3 + w4·(x)^4 + w5·(x)^5 to fit them.
Plan 1: Put all of the data into your training process.
Plan 2: Partition the data into 10 sets, run the regression on each of these 10 sets to get 10 different functions, then average these 10 functions and return the average.
Which plan is better?
46
High Dimensional Data
Example with two input features observed over nine samples:
x_1 = (1, 5, 7, 4, 12, 10, 9, 20, 50)^T, so the fifth observation of the first feature is x_1^5 = 12
x_2 = (2, 9, 5, 15, 19, 17, 9, 21, 52)^T, so x_2^5 = 19
y = (10, 2, 3, 7, 5, 8, 35, 19, 29)^T, so y^5 = 5
The model becomes y = w1·x_1 + w2·x_2 + b = (x_1, x_2)·w + b, where w = (w1, w2) is a vector and b is a scalar. More generally, y = b + Σ_i w_i·x_i, and we can still use the gradient descent method to minimize
L = Σ_n (y^n − b − Σ_i w_i·x_i^n)²
47
Improve Robustness: Regularization
For the model y = b + Σ_i w_i·x_i, functions with smaller w_i are preferred. Regularization adds a penalty to the loss:
L = Σ_n (y^n − (b + Σ_i w_i·x_i^n))² + λ·Σ_i (w_i)²
Why are smooth functions preferred? Since y = b + Σ_i w_i·(x_i + Δx_i) changes by Σ_i w_i·Δx_i when the input changes by Δx_i, smaller w_i means that noise on the input x_i at testing time has less influence on the estimated y. A minimal regularization sketch follows.
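A minimal sketch of this regularized loss for a one-feature model, using the closed-form ridge solution rather than gradient descent for brevity; the data and λ values are illustrative:

```python
import numpy as np

# Minimal sketch of L2 regularization (ridge regression) on y = b + w*x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 20)
y = 4.0 + 0.1 * x + rng.normal(0, 0.5, 20)

for lam in [0, 1, 10, 100, 1000]:
    # Penalize w only (the bias b is conventionally left unregularized).
    A = np.column_stack([np.ones_like(x), x])
    reg = lam * np.diag([0.0, 1.0])
    b, w = np.linalg.solve(A.T @ A + reg, A.T @ y)
    print(f"lambda = {lam:5d}: w = {w:6.3f}, b = {b:6.3f}")   # w shrinks as lambda grows
```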
48
Regularization Results
λ | Training error | Testing error
0 | 1.9 | 102.3
1 | 2.3 | 68.7
10 | 3.5 | 25.7
100 | 4.1 | 11.1
1000 | 5.6 | 12.8
10000 | 6.3 | 18.7
100000 | 8.5 | 26.8
As λ increases, the fitted function becomes smoother. We prefer a smooth function, but not one that is too smooth.
49
Logistic Regression
What if the y values are binary? Then this is a classification problem. The training data are (x^1, y^1), (x^2, y^2), …, (x^10, y^10), where each y^n is a class label, C1 or C2.
50
Probabilistic Interpretation
Assume that the training data are generated from a probability distribution. We aim to estimate this distribution by f(x), which is characterized by parameters w and b, so we also write it as f_{w,b}(x).
If P(C1|x) = f(x) > 0.5, output y = class 1; otherwise, output y = class 2.
51
The function to learn:
P(C1|x) = f_{w,b}(x) = σ(z) = σ(Σ_i w_i·x_i + b)
where σ is the sigmoid function, σ(z) = 1 / (1 + exp(−z)).
If z ≥ 0, then P(C1|x) = f_{w,b}(x) = σ(z) ≥ 0.5, and the data point is assigned to class 1.
52
Data Generation Probability
Training data: x^1, x^2, x^3, …, x^N with classes C1, C1, C2, …, C1. Given a set of w and b, what is the probability of generating these data?
L(w, b) = f_{w,b}(x^1) · f_{w,b}(x^2) · (1 − f_{w,b}(x^3)) ⋯ f_{w,b}(x^N)
The most likely model (characterized by w* and b*) is found by maximum likelihood:
w*, b* = arg max_{w,b} L(w, b)
53
Define ŷ^n = 1 if x^n is in class 1 and ŷ^n = 0 if it is in class 2 (so ŷ^1 = 1, ŷ^2 = 1, ŷ^3 = 0, …). Then
L(w, b) = f_{w,b}(x^1) · f_{w,b}(x^2) · (1 − f_{w,b}(x^3)) ⋯
w*, b* = arg max_{w,b} L(w, b) = arg min_{w,b} −ln L(w, b)
−ln L(w, b) = −ln f_{w,b}(x^1) − ln f_{w,b}(x^2) − ln(1 − f_{w,b}(x^3)) − ⋯
and each term can be written uniformly as −[ŷ^n·ln f_{w,b}(x^n) + (1 − ŷ^n)·ln(1 − f_{w,b}(x^n))].
54
Error Function
L(w, b) = f_{w,b}(x^1) · f_{w,b}(x^2) · (1 − f_{w,b}(x^3)) ⋯ f_{w,b}(x^N), with ŷ^n = 1 for class 1 and 0 for class 2, gives
−ln L(w, b) = Σ_n −[ŷ^n·ln f_{w,b}(x^n) + (1 − ŷ^n)·ln(1 − f_{w,b}(x^n))]
Each term is the cross entropy between two Bernoulli distributions (estimated output vs. true output):
Distribution p: p(x = 1) = ŷ^n, p(x = 0) = 1 − ŷ^n
Distribution q: q(x = 1) = f(x^n), q(x = 0) = 1 − f(x^n)
Cross entropy: H(p, q) = −Σ_x p(x)·ln q(x)
A minimal sketch of this per-example loss follows.
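A minimal sketch of the per-example cross entropy between the target Bernoulli distribution p and the model output distribution q:

```python
import numpy as np

# Minimal sketch: cross entropy between the Bernoulli target distribution p
# (defined by the label y_hat in {0, 1}) and the model output distribution q
# (defined by f(x) in (0, 1)).
def cross_entropy(y_hat, f_x, eps=1e-12):
    f_x = np.clip(f_x, eps, 1 - eps)                       # avoid log(0)
    return -(y_hat * np.log(f_x) + (1 - y_hat) * np.log(1 - f_x))

print(cross_entropy(1, 0.9))   # small loss: confident and correct
print(cross_entropy(1, 0.1))   # large loss: confident and wrong
```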
55
Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression uses f_{w,b}(x) = σ(Σ_i w_i·x_i + b), with output between 0 and 1; linear regression uses f_{w,b}(x) = Σ_i w_i·x_i + b, with output of any value.
Step 2 (loss): logistic regression has training data (x^n, ŷ^n) with ŷ^n = 1 for class 1 and 0 for class 2, and uses the cross-entropy loss L(f) = Σ_n C(f(x^n), ŷ^n), where C(f(x^n), ŷ^n) = −[ŷ^n·ln f(x^n) + (1 − ŷ^n)·ln(1 − f(x^n))]; linear regression has training data (x^n, ŷ^n) with real-valued ŷ^n, and uses the square error L(f) = (1/2)·Σ_n (f(x^n) − ŷ^n)².
56
Gradient Descent
−ln L(w, b) = Σ_n −[ŷ^n·ln f_{w,b}(x^n) + (1 − ŷ^n)·ln(1 − f_{w,b}(x^n))]
With f_{w,b}(x) = σ(z), z = w·x + b = Σ_i w_i·x_i + b, and σ(z) = 1/(1 + exp(−z)):
∂ln f_{w,b}(x)/∂w_i = (∂ln σ(z)/∂z)·(∂z/∂w_i), where ∂z/∂w_i = x_i
∂ln σ(z)/∂z = (1/σ(z))·(∂σ(z)/∂z) = (1/σ(z))·σ(z)·(1 − σ(z)) = 1 − σ(z)
So the first term contributes (1 − f_{w,b}(x^n))·x_i^n.
57
Gradient Descent (continued)
∂ln(1 − f_{w,b}(x))/∂w_i = (∂ln(1 − σ(z))/∂z)·(∂z/∂w_i), where ∂z/∂w_i = x_i
∂ln(1 − σ(z))/∂z = −(1/(1 − σ(z)))·(∂σ(z)/∂z) = −(1/(1 − σ(z)))·σ(z)·(1 − σ(z)) = −σ(z)
So the second term contributes −f_{w,b}(x^n)·x_i^n.
58
Putting the two terms together:
∂(−ln L(w, b))/∂w_i = Σ_n −[ŷ^n·(1 − f_{w,b}(x^n))·x_i^n − (1 − ŷ^n)·f_{w,b}(x^n)·x_i^n]
= Σ_n −[ŷ^n − ŷ^n·f_{w,b}(x^n) − f_{w,b}(x^n) + ŷ^n·f_{w,b}(x^n)]·x_i^n
= Σ_n −(ŷ^n − f_{w,b}(x^n))·x_i^n
The update rule is therefore w_i ← w_i − η·Σ_n −(ŷ^n − f_{w,b}(x^n))·x_i^n: the larger the difference between the target and the prediction, the larger the update. A minimal training-loop sketch follows.
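A minimal sketch of logistic regression trained with exactly this update; the toy dataset, η, and iteration count are illustrative choices:

```python
import numpy as np

# Minimal sketch of logistic regression trained with the update derived above:
#   w_i <- w_i - eta * sum_n -(y_hat^n - f_{w,b}(x^n)) * x_i^n
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 2))
y_hat = (X[:, 0] + X[:, 1] > 0).astype(float)    # 1 for class 1, 0 for class 2

w = np.zeros(2)
b = 0.0
eta = 0.01

for _ in range(2000):
    f = sigmoid(X @ w + b)                       # f_{w,b}(x^n) for every n
    err = y_hat - f                              # y_hat^n - f_{w,b}(x^n)
    w -= eta * -(X.T @ err)                      # gradient step for each w_i
    b -= eta * -np.sum(err)                      # same form for the bias

print((sigmoid(X @ w + b).round() == y_hat).mean())   # training accuracy, near 1.0
```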
59
Logistic Regression vs. Linear Regression
Step 1 (model): logistic regression f_{w,b}(x) = σ(Σ_i w_i·x_i + b), output between 0 and 1; linear regression f_{w,b}(x) = Σ_i w_i·x_i + b, output any value.
Step 2 (loss): logistic regression uses cross entropy L(f) = Σ_n C(f(x^n), ŷ^n) with ŷ^n = 1 for class 1 and 0 for class 2; linear regression uses L(f) = (1/2)·Σ_n (f(x^n) − ŷ^n)² with real-valued ŷ^n.
Step 3 (update): both use the same form, w_i ← w_i − η·Σ_n −(ŷ^n − f_{w,b}(x^n))·x_i^n.
60
Logistic Regression + Square Error
Step 1: f_{w,b}(x) = σ(Σ_i w_i·x_i + b)
Step 2: training data (x^n, ŷ^n) with ŷ^n = 1 for class 1 and 0 for class 2, and loss L(f) = (1/2)·Σ_n (f_{w,b}(x^n) − ŷ^n)²
Step 3: ∂L/∂w_i = Σ_n 2·(f_{w,b}(x^n) − ŷ^n)·f_{w,b}(x^n)·(1 − f_{w,b}(x^n))·x_i^n
For a target ŷ^n = 1: if f_{w,b}(x^n) = 1 (close to the target), ∂L/∂w_i = 0; but if f_{w,b}(x^n) = 0 (far from the target), ∂L/∂w_i = 0 as well, so the update stalls exactly when it is needed most. A minimal comparison sketch follows.
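A minimal sketch comparing the gradient with respect to z for the two losses on one example whose prediction is far from the target:

```python
import numpy as np

# Minimal sketch of why square error stalls: compare dL/dz for one example
# with target y_hat = 1 when the prediction f = sigmoid(z) is very wrong (f ~ 0).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = -10.0                                       # f(x) close to 0, far from the target 1
f = sigmoid(z)

grad_cross_entropy = -(1 - f)                   # d/dz of -ln f(x)      -> about -1
grad_square_error = 2 * (f - 1) * f * (1 - f)   # d/dz of (f(x) - 1)^2  -> about 0

print(grad_cross_entropy, grad_square_error)
```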
61
Cross Entropy v.s. Square Error
Figure: total loss surface over the parameters (w1, w2). The cross-entropy loss is steep far from the minimum, while the square-error surface is flat there, so gradient descent makes faster progress with cross entropy.
62
Multi-class Classification
Three classes with parameters (w^1, b1), (w^2, b2), (w^3, b3):
z1 = w^1·x + b1, z2 = w^2·x + b2, z3 = w^3·x + b3
Softmax: y_i = exp(z_i) / Σ_j exp(z_j), so each output satisfies 0 < y_i < 1 and Σ_i y_i = 1.
Example: z = (3, 1, −3) gives exp(z) ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, ≈0). A minimal sketch follows.
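A minimal sketch of the softmax step, reproducing the slide's example z = (3, 1, −3):

```python
import numpy as np

# Minimal sketch of softmax: exponentiate, then normalize so the outputs sum to 1.
def softmax(z):
    e = np.exp(z - np.max(z))        # subtract max for numerical stability
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z).round(2))           # approximately [0.88, 0.12, 0.00]
```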
63
Multi-class Classification
With z1 = w^1·x + b1, z2 = w^2·x + b2, z3 = w^3·x + b3 passed through the softmax, the loss is the cross entropy −Σ_{i=1}^{3} ŷ_i·ln y_i against the one-hot target ŷ:
If x ∈ class 1: ŷ = (1, 0, 0)
If x ∈ class 2: ŷ = (0, 1, 0)
If x ∈ class 3: ŷ = (0, 0, 1)
64
Limitation of Logistic Regression
Consider two binary input features x1 and x2 with labels: (0, 0) and (1, 1) belong to class 2, while (0, 1) and (1, 0) belong to class 1. A single logistic regression boundary z = 0 (predicting class 1 where z ≥ 0 and class 2 where z < 0) cannot separate these points.
65
Limitation of Logistic Regression
Feature transformation: define x1′ as the distance of the input to (0, 0) and x2′ as its distance to (1, 1). In the transformed space (x1′, x2′), the two classes become linearly separable, so logistic regression on (x1′, x2′) works. However, it is not always easy to find a good transformation by hand. A minimal sketch follows.
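A minimal sketch of this transformation; Euclidean distance and the class assignment of the four points are assumptions based on the figure residue:

```python
import numpy as np

# Minimal sketch: map each input to (distance to (0,0), distance to (1,1)).
# Euclidean distance and the class labels below are assumptions.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([2, 1, 1, 2])          # class 2 at the corners (0,0) and (1,1)

x1_prime = np.linalg.norm(X - np.array([0.0, 0.0]), axis=1)   # distance to (0, 0)
x2_prime = np.linalg.norm(X - np.array([1.0, 1.0]), axis=1)   # distance to (1, 1)

for point, a, b, c in zip(X, x1_prime, x2_prime, labels):
    print(point, "->", (round(a, 2), round(b, 2)), "class", c)

# In the new space, class 1 points sit at (1, 1) while class 2 points sit at
# (0, 1.41) and (1.41, 0), so a single line such as x1' + x2' = 1.7 separates
# them, which logistic regression on the raw inputs could not do.
```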
66
Limitation of Logistic Regression
Instead of hand-crafting the transformation, we can cascade logistic regression models: a first layer of logistic regressions produces the transformed features x1′ and x2′, and a final logistic regression performs the classification (bias terms are omitted in the figure).
67
Feature Transformation
Folding the space
68
Deep Learning! Each cascaded logistic regression unit is a "neuron", and the whole cascade is a neural network.