1
Machine Learning – Regression
David Fenyő
2
Linear Regression – One Independent Variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
Data: $(y_j, x_{1j})$ for $j = 1, \dots, n$
Loss function (sum of squared errors): $L = \sum_j \epsilon_j^2 = \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2$
3
Linear Regression – One Independent Variable
Minimizing the loss function:
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = 0$
$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = 0$
4
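Linear Regression – One Independent Variable
Minimizing the loss function, $L$ (sum of squared errors):
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_1} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$
$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_0} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$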
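Setting both partial derivatives to zero gives two linear equations in $w_0$ and $w_1$. Their solution is the standard least-squares result $w_1 = \sum_j (x_{1j} - \bar{x})(y_j - \bar{y}) / \sum_j (x_{1j} - \bar{x})^2$ and $w_0 = \bar{y} - w_1 \bar{x}$, where $\bar{x}$ and $\bar{y}$ are the sample means. A minimal Python sketch of that result follows; the function name `fit_line` and the sample data are illustrative assumptions, not taken from the slides.

```python
def fit_line(xs, ys):
    """Return (w0, w1) minimizing the sum of squared errors for y = w1*x + w0."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Solving dL/dw1 = 0 and dL/dw0 = 0 yields these two expressions.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


# Illustrative data lying exactly on y = 2x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = fit_line(xs, ys)
```

Because the sample points lie exactly on a line, the fit should recover an intercept of 1 and a slope of 2 up to rounding.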
5
Linear Regression – One Independent Variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
$\mathbf{x} = (1, x_1)$, $\mathbf{w} = (w_0, w_1)$
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
6
Linear Regression – Sum of Functions
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, f_1(x_1), f_2(x_1), f_3(x_1), \dots, f_l(x_1))$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_l)$
7
Linear Regression – Polynomial
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, x_1, x_1^2, x_1^3, \dots, x_1^k)$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$
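Because the model stays linear in the weights, a polynomial fit is an ordinary least-squares problem on a matrix whose rows are the feature vectors $(1, x_1, x_1^2, \dots, x_1^k)$. The sketch below assumes NumPy; the helper names and the noise-free sample data are illustrative, not from the slides. Swapping in other feature functions would only change how the columns are built.

```python
import numpy as np


def polynomial_features(x, degree):
    """Rows are (1, x, x^2, ..., x^degree), one row per data point."""
    return np.vander(x, N=degree + 1, increasing=True)


def fit_polynomial(x, y, degree):
    """Least-squares weights w = (w_0, ..., w_k) for y ~ X @ w."""
    X = polynomial_features(x, degree)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


# Illustrative noise-free data from y = 1 - 2x + 0.5x^2 (an assumption).
x = np.linspace(-2.0, 2.0, 9)
y = 1.0 - 2.0 * x + 0.5 * x**2
w = fit_polynomial(x, y, degree=2)
```

With noise-free quadratic data and `degree=2`, the recovered weights should match the generating coefficients up to rounding.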
8
Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, x_1, x_2, x_3, \dots, x_k)$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$
9
Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, f_1(x_1, x_2, \dots, x_k), f_2(x_1, x_2, \dots, x_k), \dots, f_l(x_1, x_2, \dots, x_k))$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \dots, w_l)$
10
Gradient Descent
$\min_{\mathbf{w}} L(\mathbf{w})$
11
Gradient Descent
$\mathbf{w}_{n+1} = \mathbf{w}_n - \eta \nabla L(\mathbf{w}_n)$
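As a rough illustration of this update rule, the sketch below applies it to the sum-of-squared-errors loss for one independent variable, using the analytic gradient. The function names, the starting point $(0, 0)$, the learning rate, and the sample data are assumptions chosen for the demo.

```python
def sse_gradient(w0, w1, xs, ys):
    """Gradient of L = sum_j (y_j - (w1*x_j + w0))^2 with respect to (w0, w1)."""
    g0 = 0.0
    g1 = 0.0
    for x, y in zip(xs, ys):
        residual = y - (w1 * x + w0)
        g0 += -2.0 * residual
        g1 += -2.0 * residual * x
    return g0, g1


def gradient_descent(xs, ys, eta=0.01, steps=5000):
    """Repeat w_{n+1} = w_n - eta * grad L(w_n), starting from w = (0, 0)."""
    w0, w1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = sse_gradient(w0, w1, xs, ys)
        w0 -= eta * g0
        w1 -= eta * g1
    return w0, w1


# Illustrative data on the line y = 2x + 1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = gradient_descent(xs, ys)
```

For this small data set the chosen learning rate is well inside the stable range, so the iterates should approach an intercept of 1 and a slope of 2.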
12
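Gradient Descent
$w_1 = w_0 - \eta \frac{d}{dw} L(w_0) \implies w_1 = w_0 - \eta \frac{L(w_0 + \Delta w) - L(w_0)}{\Delta w}$
Here $w_0, w_1, \dots$ are successive iterates of a single weight $w$, and the derivative is approximated by a finite difference with a small step $\Delta w$.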
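The finite-difference form of the update can be sketched in a few lines of Python. The quadratic loss, the step $\Delta w = 10^{-6}$, and the learning rate are illustrative assumptions; the slides do not specify them.

```python
def finite_difference_step(loss, w, eta, delta_w=1e-6):
    """One update: w - eta * (L(w + dw) - L(w)) / dw."""
    slope = (loss(w + delta_w) - loss(w)) / delta_w
    return w - eta * slope


def loss(w):
    # Illustrative one-dimensional loss with its minimum at w = 3.
    return (w - 3.0) ** 2


w = 0.0
for _ in range(200):
    w = finite_difference_step(loss, w, eta=0.1)
```

The forward difference slightly overestimates the slope of this loss, so the iterates settle very close to, but not exactly at, the true minimum.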
13
Gradient Descent
$w_2 = w_1 - \eta \frac{L(w_1 + \Delta w) - L(w_1)}{\Delta w}$
14
Gradient Descent
$w_3 = w_2 - \eta \frac{L(w_2 + \Delta w) - L(w_2)}{\Delta w}$
15
Gradient Descent
$w_4 = w_3 - \eta \frac{L(w_3 + \Delta w) - L(w_3)}{\Delta w}$
We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.
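One simple way to shrink the rate over time is an inverse-time schedule, $\eta_n = \eta_0 / (1 + d\,n)$. The slides do not name a schedule, so this choice, the constants, and the quadratic loss below are all assumptions made for illustration.

```python
def decayed_rate(eta0, decay, step):
    """Inverse-time decay: eta_n = eta0 / (1 + decay * n)."""
    return eta0 / (1.0 + decay * step)


def grad(w):
    # Derivative of the illustrative loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


w = 0.0
rates = []
for n in range(300):
    eta = decayed_rate(eta0=0.4, decay=0.05, step=n)
    rates.append(eta)       # large steps early, smaller steps later
    w -= eta * grad(w)
```

Early iterations take large steps toward the minimum at 3, and later iterations make progressively smaller corrections.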
16
Training: Gradient Descent
If the gradient is small in an extended region, gradient descent becomes very slow.
17
Training: Gradient Descent
Gradient descent can get stuck in local minima. To improve the behavior for shallow local minima, we can modify gradient descent to take the average of the gradient for the last few steps (similar to momentum and friction).
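A common way to average recent gradients is an exponentially weighted running average, in which older gradients fade by a factor $\beta$ at each step. The sketch below assumes that variant; the parameter names `beta` and `velocity`, the constants, and the quadratic test loss are illustrative choices, not from the slides.

```python
def momentum_descent(grad, w, eta=0.1, beta=0.9, steps=500):
    """Gradient descent that steps along a running average of past gradients."""
    velocity = 0.0
    for _ in range(steps):
        # Exponentially weighted average: recent gradients count the most,
        # and (1 - beta) plays the role of friction.
        velocity = beta * velocity + (1.0 - beta) * grad(w)
        w -= eta * velocity
    return w


def grad(w):
    # Derivative of the illustrative loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


w_final = momentum_descent(grad, w=0.0)
```

On this convex loss the averaged update simply converges to the minimum at 3; the intended benefit is that the accumulated average can carry the iterate through shallow dips where a single gradient would stall.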
18
Linear Regression – Error Landscape
[Figures, slides 18–24: sum of squared errors vs. slope; sum of squared errors vs. slope and intercept; sum of absolute errors.]
25
Gradient Descent
29
Linear Regression – Gradient Descent
34
Gradient Descent
Batch gradient descent: uses the whole training set to calculate the gradient for each step.
Stochastic or mini-batch gradient descent: uses a randomly chosen subset of the training set (a single example in the purely stochastic case) to calculate the gradient for each step.
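The two variants differ only in how many examples feed each gradient estimate, which the sketch below makes explicit through a `batch_size` parameter. The function names, the use of the mean (rather than the sum) of per-example gradients, the learning rate, and the noise-free sample data are assumptions for illustration.

```python
import random


def minibatch_gradient(w0, w1, batch):
    """Mean squared-error gradient over a batch of (x, y) pairs."""
    g0 = g1 = 0.0
    for x, y in batch:
        residual = y - (w1 * x + w0)
        g0 += -2.0 * residual
        g1 += -2.0 * residual * x
    return g0 / len(batch), g1 / len(batch)


def minibatch_descent(data, batch_size, eta=0.02, epochs=2000, seed=0):
    """Shuffle each epoch, then take one step per batch of `batch_size` examples."""
    rng = random.Random(seed)
    data = list(data)
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g0, g1 = minibatch_gradient(w0, w1, batch)
            w0 -= eta * g0
            w1 -= eta * g1
    return w0, w1


# Illustrative noise-free data on y = 2x + 1 (an assumption for the demo).
data = [(float(x), 2.0 * x + 1.0) for x in range(6)]
w_batch = minibatch_descent(data, batch_size=len(data))  # whole set per step
w_mini = minibatch_descent(data, batch_size=2)           # two examples per step
```

Setting `batch_size` to the size of the data set gives batch gradient descent, and setting it to 1 gives the purely stochastic case. With noisy data the mini-batch iterates would fluctuate around the minimum rather than settle exactly.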
35
Gradient Descent – Learning Rate
[Figure panels: learning rate too small; learning rate too large.]
36
Gradient Descent – Learning Rate Decay
[Figure panels: constant learning rate; decaying learning rate.]
37
Gradient Descent – Unequal Gradients
[Figure panels: constant learning rate; decaying learning rate; partially remembering previous gradients.]
38
Gradient Descent
[Figure panels: sum of squared errors; sum of absolute errors.]
39
Outliers
[Figure panels: sum of squared errors; sum of absolute errors.]
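The difference between the two losses under outliers is easiest to see for a constant model $y = c$: the mean minimizes the sum of squared errors, while the median minimizes the sum of absolute errors. The small data set below is an illustrative assumption, not taken from the slides.

```python
import statistics

clean = [1.0, 2.0, 3.0, 4.0, 5.0]
with_outlier = clean[:-1] + [50.0]   # replace the last point with an outlier

# Best constant under squared errors is the mean; under absolute errors, the median.
mean_shift = statistics.mean(with_outlier) - statistics.mean(clean)
median_shift = statistics.median(with_outlier) - statistics.median(clean)
```

A single outlier moves the squared-error fit substantially (the mean goes from 3 to 12), while the absolute-error fit does not move at all (the median stays at 3).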
40
Variable Variance
41
Model Capacity: Overfitting and Underfitting
[Figures, slides 41–44; the last plots training error (error on the training set) against degree of polynomial.]
45
Training and Testing
[Diagram: data set split into training and test sets.]
46
Training and Testing – Linear Relationship
[Figure: training error and testing error vs. degree of polynomial.]
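A comparison of this kind can be reproduced roughly as follows: fit polynomials of increasing degree to a training set and measure the error on both the training set and a separate test set. The sketch assumes NumPy; the generating line, noise level, sample sizes, and helper names are illustrative choices, not the settings behind the slides.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Illustrative linear relationship y = 2x + 1 with Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=n)
    return x, y


def mean_squared_error(w, x, y):
    """Mean squared error of the polynomial with weights w on (x, y)."""
    X = np.vander(x, N=len(w), increasing=True)
    return float(np.mean((y - X @ w) ** 2))


x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

train_errors, test_errors = [], []
for degree in range(1, 10):
    X = np.vander(x_train, N=degree + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y_train, rcond=None)
    train_errors.append(mean_squared_error(w, x_train, y_train))
    test_errors.append(mean_squared_error(w, x_test, y_test))
```

The training error cannot increase with the degree, because each higher-degree model contains the lower-degree ones. The test error carries no such guarantee, which is what makes it useful for detecting overfitting.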
47
Testing Error and Training Set Size
[Figure, three panels from low variance to high variance: testing error vs. degree of polynomial, with curves labeled 10, 30, 100, 300, 1000, and 3000 (training set size).]
48
Coefficients and Training Set Size
[Figure, three panels for training set sizes 10, 100, and 1000 (degree of polynomial = 9): absolute value of each coefficient.]
49
Training and Testing – Non-linear Relationship
[Figure: training error and testing error vs. degree of polynomial.]
50
Testing Error and Training Set Size
[Figure, three panels from low variance to high variance: log(error) on the test set vs. degree of polynomial, with curves labeled 10, 30, 100, 300, 1000, and 3000 (training set size).]
51
Regularization
No regularization: $\min_{\mathbf{w}} L(\mathbf{w})$
Regularization: $\min_{\mathbf{w}} L(\mathbf{w}) + \lambda g(\mathbf{w})$
Ridge regression: $\min_{\mathbf{w}} \lVert \mathbf{y} - X\mathbf{w} \rVert^2 + \lambda \lVert \mathbf{w} \rVert^2$
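The ridge objective has a closed-form minimizer: setting its gradient to zero gives $(X^\top X + \lambda I)\,\mathbf{w} = X^\top \mathbf{y}$. The sketch below solves that system with NumPy. The function name `ridge_fit`, the degree-9 polynomial features, and the sample data are illustrative assumptions; as in the formula above, the intercept weight is penalized along with the others.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 via (X^T X + lam I) w = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=15)
y = 2.0 * x + 1.0 + rng.normal(scale=0.3, size=15)   # illustrative noisy line
X = np.vander(x, N=10, increasing=True)              # degree-9 polynomial features

w_small = ridge_fit(X, y, lam=1e-6)   # almost no regularization
w_large = ridge_fit(X, y, lam=1.0)    # stronger regularization
```

Increasing $\lambda$ shrinks the norm of the weight vector, which is how ridge regression tames the large coefficients of a high-degree polynomial fitted to few points.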
52
Regularization: Coefficients and Training Set Size
[Figure, three panels for training set sizes 10, 100, and 1000 (degree of polynomial = 9): fitted coefficients with regularization.]
53
Nearest Neighbor Regression – Fixed Distance
54
Nearest Neighbor Regression – Fixed Number
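The two variants differ in how the neighborhood is chosen: either all training points within a fixed distance of the query, or a fixed number of closest points. The sketch below assumes one input variable and an unweighted average of the neighbors' targets; the slides do not spell out the averaging rule, so that choice, the function names, and the sample data are assumptions.

```python
def knn_predict(x_query, xs, ys, k):
    """Fixed number: average the targets of the k nearest training points."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def radius_predict(x_query, xs, ys, radius):
    """Fixed distance: average the targets of all points within `radius`."""
    close = [y for x, y in zip(xs, ys) if abs(x - x_query) <= radius]
    if not close:
        raise ValueError("no training points within the given radius")
    return sum(close) / len(close)


# Illustrative training data from y = x^2 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.0, 1.0, 4.0, 9.0, 16.0]
pred_k = knn_predict(2.1, xs, ys, k=3)             # uses the points at x = 2, 3, 1
pred_r = radius_predict(2.1, xs, ys, radius=1.0)   # uses the points at x = 2, 3
```

The fixed-distance variant can find no neighbors at all in sparse regions, which is why the sketch raises an error in that case; the fixed-number variant always returns a prediction.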
55
Nearest Neighbor Regression
[Figures, slides 55–56, panels for linear data (error) and non-linear data (log(error)): error vs. number of neighbors, with curves labeled 10, 30, 100, 300, 1000, and 3000.]
57
Validation: Choosing Hyperparameters
[Diagram: data set split into training, validation, and test sets.]
Examples of hyperparameters: learning rate schedule, regularization parameter, number of nearest neighbors.
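As one possible illustration, the sketch below picks the number of nearest neighbors by comparing errors on a validation set, and only then evaluates once on the test set. The 60/20/20 split, the candidate values, the synthetic data, and the helper names are assumptions, not from the slides.

```python
import random


def knn_predict(x_query, train, k):
    """Average the targets of the k training points nearest to x_query."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x_query))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(data, train, k):
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in data) / len(data)


rng = random.Random(0)
points = []
for _ in range(300):
    x = rng.uniform(-3.0, 3.0)
    points.append((x, x * x + rng.gauss(0.0, 1.0)))  # illustrative non-linear data

# 60% training, 20% validation, 20% test (the split sizes are an assumption).
train, validation, test = points[:180], points[180:240], points[240:]

candidates = [1, 3, 10, 30, 100]
validation_errors = {k: mean_squared_error(validation, train, k) for k in candidates}
best_k = min(validation_errors, key=validation_errors.get)

# The test set is used only once, after the hyperparameter has been fixed.
test_error = mean_squared_error(test, train, best_k)
```

Keeping the test set out of the selection loop is the point of the three-way split: the validation error guides the choice, and the test error remains an unbiased estimate for the chosen model.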