
1 Machine Learning – Regression David Fenyő
Contact:

2 Linear Regression – One Independent Variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
Data: $(y_j, x_{1j})$ for $j = 1, \ldots, n$
Loss function (sum of squared errors): $L = \sum_j \epsilon_j^2 = \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2$

3 Linear Regression – One Independent Variable
Minimizing the loss function:
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = 0, \qquad \frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = 0$

4 Linear Regression – One Independent Variable
Minimizing the loss function L (sum of squared errors):
$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_1} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$
$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_0} \sum_j \left( y_j - (w_1 x_{1j} + w_0) \right)^2 = 0$
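Solving these two equations gives the standard closed-form least-squares estimates (not spelled out on the slide, added here for reference):

$\hat{w}_1 = \frac{\sum_j (x_{1j} - \bar{x})(y_j - \bar{y})}{\sum_j (x_{1j} - \bar{x})^2}, \qquad \hat{w}_0 = \bar{y} - \hat{w}_1 \bar{x}$

where $\bar{x}$ and $\bar{y}$ are the means of the $x_{1j}$ and the $y_j$.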

5 Linear Regression – One Independent Variable
Relationship: $y = w_1 x_1 + w_0 + \epsilon$
With $\mathbf{x} = (1, x_1)$ and $\mathbf{w} = (w_0, w_1)$: $y = \mathbf{x} \cdot \mathbf{w} + \epsilon$

6 Linear Regression – Sum of Functions
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, f_1(x_1), f_2(x_1), f_3(x_1), \ldots, f_l(x_1))$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$

7 Linear Regression – Polynomial
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, x_1, x_1^2, x_1^3, \ldots, x_1^k)$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)$
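A minimal NumPy sketch of fitting such a polynomial model by building the design matrix of powers of $x_1$; the data and the degree are hypothetical placeholders, not values from the slides:

import numpy as np

def polynomial_design_matrix(x, degree):
    # Columns: 1, x, x^2, ..., x^degree
    return np.vander(x, degree + 1, increasing=True)

# Hypothetical example data
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
y = 2.0 * x**3 - x + 0.1 * rng.normal(size=x.shape)

X = polynomial_design_matrix(x, degree=3)
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimate of (w0, w1, ..., wk)
print(w)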

8 Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, x_1, x_2, x_3, \ldots, x_k)$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)$

9 Linear Regression – Multiple Independent Variables
$y = \mathbf{x} \cdot \mathbf{w} + \epsilon$
$\mathbf{x} = (1, f_1(x_1, x_2, \ldots, x_k), f_2(x_1, x_2, \ldots, x_k), \ldots, f_l(x_1, x_2, \ldots, x_k))$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_l)$

10 Linear Regression – Matrix Notation
$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
Data: $(y_j, x_{1j}, x_{2j}, \ldots, x_{kj})$ for $j = 1, \ldots, n$
$\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$
$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)^T$

11 Linear Regression
$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
Minimizing the loss function L (sum of squared errors):
$\frac{dL}{d\mathbf{w}} = \frac{d}{d\mathbf{w}} \boldsymbol{\epsilon}^T \boldsymbol{\epsilon} = 0$
$\Rightarrow \frac{d}{d\mathbf{w}} (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = \frac{d}{d\mathbf{w}} \left( \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{w} - (\mathbf{X}\mathbf{w})^T\mathbf{y} + (\mathbf{X}\mathbf{w})^T\mathbf{X}\mathbf{w} \right) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\hat{\mathbf{w}} = 0$
$\Rightarrow \hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$
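A minimal NumPy sketch of this closed-form solution; the data are hypothetical, and the linear system is solved with np.linalg.solve rather than forming the explicit inverse:

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3
X_raw = rng.normal(size=(n, k))              # k independent variables
true_w = np.array([0.5, 2.0, -1.0, 3.0])     # hypothetical (w0, w1, w2, w3)
X = np.hstack([np.ones((n, 1)), X_raw])      # prepend the column of ones
y = X @ true_w + 0.1 * rng.normal(size=n)    # y = Xw + eps

# Normal equations: (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                 # close to true_w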

12 Gradient Descent
$\min_{\mathbf{w}} L(\mathbf{w})$

13 Gradient Descent
$\mathbf{w}_{n+1} = \mathbf{w}_n - \eta \nabla L(\mathbf{w}_n)$

14 Gradient Descent
$w_1 = w_0 - \eta \frac{dL}{dw}(w_0)$
$w_1 = w_0 - \eta \frac{L(w_0 + \Delta w) - L(w_0)}{\Delta w}$

15 Gradient Descent
$w_2 = w_1 - \eta \frac{L(w_1 + \Delta w) - L(w_1)}{\Delta w}$

16 Gradient Descent
$w_3 = w_2 - \eta \frac{L(w_2 + \Delta w) - L(w_2)}{\Delta w}$

17 Gradient Descent
$w_4 = w_3 - \eta \frac{L(w_3 + \Delta w) - L(w_3)}{\Delta w}$
We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.
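A minimal sketch of this iteration for a single weight, using the finite-difference approximation of the derivative shown on the slides; the loss, starting point, learning rate, and $\Delta w$ are hypothetical choices:

def gradient_descent_1d(L, w, eta=0.1, dw=1e-6, steps=100):
    # Repeatedly apply w_{n+1} = w_n - eta * (L(w_n + dw) - L(w_n)) / dw
    for _ in range(steps):
        grad = (L(w + dw) - L(w)) / dw    # numerical estimate of dL/dw
        w = w - eta * grad
    return w

# Hypothetical quadratic loss with its minimum at w = 3
L = lambda w: (w - 3.0) ** 2
print(gradient_descent_1d(L, w=0.0))      # approaches 3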

18 Linear Regression – Error Landscape
[Figure: sum of squared errors as a function of the slope]

19 Linear Regression – Error Landscape
[Figure: sum of squared errors as a function of slope and intercept]

20 Linear Regression – Error Landscape
[Figure: sum of squared errors as a function of slope and intercept]

21 Linear Regression – Error Landscape
[Figure: sum of squared errors]

22 Linear Regression – Error Landscape
[Figure: sum of squared errors]

23 Linear Regression – Error Landscape
[Figure: sum of absolute errors]

24 Linear Regression – Error Landscape

25 Gradient Descent

26 Gradient Descent

27 Gradient Descent

28 Gradient Descent

29 Linear Regression – Gradient Descent

30 Linear Regression – Gradient Descent

31 Linear Regression – Gradient Descent

32 Linear Regression – Gradient Descent

33 Gradient Descent

34 Gradient Descent
Batch gradient descent: uses the whole training set to calculate the gradient for each step.
Stochastic gradient descent: uses a single randomly chosen training example to calculate the gradient for each step.
Mini-batch gradient descent: uses a small random subset (mini-batch) of the training set to calculate the gradient for each step.
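A minimal NumPy sketch of mini-batch gradient descent for linear regression with a squared-error loss; the hyperparameter defaults are arbitrary, and setting batch_size to the full data size or to 1 recovers batch and stochastic gradient descent, respectively:

import numpy as np

def minibatch_gradient_descent(X, y, eta=0.01, batch_size=32, epochs=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)                        # shuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of the mean squared error
            w -= eta * grad
    return w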

35 Gradient Descent – Learning Rate
[Figure: learning rate too small vs. too large]

36 Gradient Descent – Learning Rate Decay
[Figure: constant learning rate vs. decaying learning rate]
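The slides do not specify a particular decay schedule; as one common example, an inverse-time decay can be sketched as follows (the names and constants here are illustrative assumptions):

def inverse_time_decay(eta0, decay_rate, step):
    # The learning rate shrinks as the iteration count grows
    return eta0 / (1.0 + decay_rate * step)

# Hypothetical use inside a gradient descent loop:
# for n in range(steps):
#     eta = inverse_time_decay(0.1, 0.01, n)
#     w = w - eta * grad_L(w)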

37 Gradient Descent – Unequal Gradients
[Figure: constant learning rate vs. decaying learning rate vs. partially remembering previous gradients]

38 Gradient Descent – Momentum and Friction
Gradient descent: $\mathbf{w}_{n+1} = \mathbf{w}_n - \eta \nabla L(\mathbf{w}_n)$
Partially remembering previous gradients (momentum): $\mathbf{v}_n = \gamma \mathbf{v}_{n-1} + \eta \nabla L(\mathbf{w}_n)$, $\mathbf{w}_{n+1} = \mathbf{w}_n - \mathbf{v}_n$
Nesterov accelerated gradient: $\mathbf{v}_n = \gamma \mathbf{v}_{n-1} + \eta \nabla L(\mathbf{w}_n - \gamma \mathbf{v}_{n-1})$, $\mathbf{w}_{n+1} = \mathbf{w}_n - \mathbf{v}_n$
Adagrad: decreases the learning rate monotonically based on the sum of squared past gradients.
Adadelta and RMSprop: extensions of Adagrad that slowly forget past gradients.
Adaptive Moment Estimation (Adam): adaptive learning rates and momentum.
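A minimal sketch of the momentum and Nesterov updates defined above; grad_L stands for any function returning $\nabla L(\mathbf{w})$, and the default $\eta$ and $\gamma$ values are arbitrary:

def momentum_step(w, v, grad_L, eta=0.01, gamma=0.9):
    # v_n = gamma * v_{n-1} + eta * grad L(w_n);  w_{n+1} = w_n - v_n
    v = gamma * v + eta * grad_L(w)
    return w - v, v

def nesterov_step(w, v, grad_L, eta=0.01, gamma=0.9):
    # Same update, but the gradient is evaluated at the look-ahead point w - gamma * v
    v = gamma * v + eta * grad_L(w - gamma * v)
    return w - v, v

# Hypothetical training loop:
# v = 0.0
# for n in range(steps):
#     w, v = momentum_step(w, v, grad_L)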

39 Gradient Descent
[Figure: gradient descent paths for the sum of squared errors vs. the sum of absolute errors]

40 Outliers
[Figure: effect of outliers on fits minimizing the sum of squared errors vs. the sum of absolute errors]

41 Variable Variance

42 Model Capacity: Overfitting and Underfitting

43 Model Capacity: Overfitting and Underfitting

44 Model Capacity: Overfitting and Underfitting

45 Model Capacity: Overfitting and Underfitting
[Figure: error on the training set vs. degree of polynomial]

46 Model Capacity: Overfitting and Underfitting
"With four parameters I can fit an elephant, and with five I can make him wiggle his trunk." – John von Neumann

47 Training and Testing
[Figure: data set split into training and test sets]

48 Training and Testing – Linear Relationship
[Figure: training and testing error vs. degree of polynomial]

49 Testing Error and Training Set Size
[Figure: testing error vs. degree of polynomial for training set sizes from 10 to 3000, for low-variance and high-variance data]

50 Coefficients and Training Set Size
[Figure: absolute value of the coefficients of a degree-9 polynomial fit for training set sizes 10, 100, and 1000]

51 Training and Testing – Non-linear Relationship
[Figure: training and testing error vs. degree of polynomial]

52 Testing Error and Training Set Size
[Figure: log(error) vs. degree of polynomial for training set sizes from 10 to 3000, for low-variance and high-variance data]

53 Regularization
No regularization: $\min_{\mathbf{w}} L(\mathbf{w})$
Regularization: $\min_{\mathbf{w}} L(\mathbf{w}) + \lambda g(\mathbf{w})$
Ridge regression: $\min_{\mathbf{w}} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|^2$
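A minimal NumPy sketch of ridge regression via its closed form $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$; the data and $\lambda$ are hypothetical, and for simplicity the intercept column is penalized along with the other coefficients:

import numpy as np

def ridge_fit(X, y, lam):
    # Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical usage with a degree-9 polynomial design matrix X and responses y:
# w_ridge = ridge_fit(X, y, lam=0.1)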

54 Regularization: Coefficients and Training Set Size
[Figure: coefficients of a regularized degree-9 polynomial fit for training set sizes 10, 100, and 1000]

55 Nearest Neighbor Regression – Fixed Distance

56 Nearest Neighbor Regression – Fixed Number
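A minimal NumPy sketch of nearest neighbor regression with a fixed number of neighbors, predicting each query point as the mean target of its k nearest training points in one dimension; the data and k are hypothetical:

import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    preds = []
    for xq in np.atleast_1d(x_query):
        dist = np.abs(x_train - xq)            # distances to all training points
        nearest = np.argsort(dist)[:k]         # indices of the k nearest neighbors
        preds.append(y_train[nearest].mean())  # average their targets
    return np.array(preds)

# Hypothetical usage:
# y_hat = knn_regress(x_train, y_train, np.linspace(0, 1, 100), k=10)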

57 Nearest Neighbor Regression
[Figure: error vs. number of neighbors for linear and non-linear data, for training set sizes from 10 to 3000]

58 Nearest Neighbor Regression
[Figure: error vs. number of neighbors for linear and non-linear data, for training set sizes from 10 to 3000]

59 Validation: Choosing Hyperparameters
[Figure: data set split into training, validation, and test sets]
Examples of hyperparameters:
Learning rate schedule
Regularization parameter
Number of nearest neighbors
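A minimal sketch of choosing one such hyperparameter (here the ridge regularization parameter, reusing the ridge_fit sketch above) by its error on a held-out validation split; the split fraction and candidate values are arbitrary assumptions:

import numpy as np

def choose_lambda(X, y, candidates, val_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_val = int(val_fraction * len(y))
    val, train = idx[:n_val], idx[n_val:]
    best_lam, best_err = None, np.inf
    for lam in candidates:
        w = ridge_fit(X[train], y[train], lam)       # fit on the training split
        err = np.mean((X[val] @ w - y[val]) ** 2)    # mean squared error on the validation split
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam

# Hypothetical usage:
# lam = choose_lambda(X, y, candidates=[0.001, 0.01, 0.1, 1.0, 10.0])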

60 Multiple Linear Regression
$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
Data: $(y_j, x_{1j}, x_{2j}, \ldots, x_{kj})$ for $j = 1, \ldots, n$
$\mathbf{y} = (y_1, y_2, \ldots, y_n)^T$
$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{k1} \\ 1 & x_{12} & \cdots & x_{k2} \\ \vdots & \vdots & & \vdots \\ 1 & x_{1n} & \cdots & x_{kn} \end{pmatrix}$
$\mathbf{w} = (w_0, w_1, w_2, w_3, \ldots, w_k)^T$

61 Multiple Linear Regression
Correlation of the independent variables increases the uncertainty in the parameter estimates.
[Figure: standard deviation of the parameter estimates vs. number of data points, for uncorrelated variables and for variables correlated with R² = 0.9]

62 Homework
Implement gradient descent for linear regression by extending the scripts provided (or write it from scratch). Explore the effect of using different learning rates on data of different sizes and variances.

63

