2
Announcements: HW4 due today (11:59pm). HW5 out today (due 11/17 11:59pm).
3
Today's learning goals
At the end of today, you should be able to: describe gradient descent for learning model parameters; explain the difference between logistic regression and linear regression; tell if a 2-D dataset is linearly separable; explain the structure of a neural network.
4
Least squares with a non-linear function
Consider data from a nonlinear distribution; here, assume it is sinusoidal. We now want the sine wave of best fit, $y = \sin(w_1 x + w_0)$, considering only two of the four possible parameters for convenience.
5
Least squares with a non-linear function
With observed data $(x_1, y_1) \ldots (x_N, y_N)$, we want $y = \sin(w_1 x + w_0)$.
Least squares: minimize the L2 loss (sum of squared errors)
$L(\mathbf{w}; x, y) = \sum_{i=1}^{N} \left( y_i - \sin(w_1 x_i + w_0) \right)^2$
$\mathbf{w}^* = \operatorname{argmin}_{\mathbf{w}} L(\mathbf{w}; x, y)$
6
Least squares with a non-linear function
Using the L2 loss
$L(\mathbf{w}; x, y) = \sum_{i=1}^{N} \left( y_i - \sin(w_1 x_i + w_0) \right)^2$
again calculate the partial derivatives w.r.t. $w_0$ and $w_1$:
$\frac{\partial L}{\partial w_1}(\mathbf{w}; x, y) = \sum_i 2 x_i \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
$\frac{\partial L}{\partial w_0}(\mathbf{w}; x, y) = \sum_i 2 \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
7
Least squares with a non-linear function
$\frac{\partial L}{\partial w_1}(\mathbf{w}; x, y) = \sum_i 2 x_i \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
$\frac{\partial L}{\partial w_0}(\mathbf{w}; x, y) = \sum_i 2 \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
But there's no unique solution for these! In many cases, there won't even be a closed-form solution.
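To make these formulas concrete, here is a minimal sketch in plain Python of the L2 loss and its two partial derivatives for the sine model (the function names `loss`, `dL_dw1`, and `dL_dw0` are mine, not from the slides):

```python
import math

def loss(w0, w1, xs, ys):
    """L2 loss: sum of squared errors of the model y = sin(w1*x + w0)."""
    return sum((y - math.sin(w1 * x + w0)) ** 2 for x, y in zip(xs, ys))

def dL_dw1(w0, w1, xs, ys):
    """Partial derivative of the L2 loss with respect to w1."""
    return sum(2 * x * math.cos(w1 * x + w0) * (math.sin(w1 * x + w0) - y)
               for x, y in zip(xs, ys))

def dL_dw0(w0, w1, xs, ys):
    """Partial derivative of the L2 loss with respect to w0."""
    return sum(2 * math.cos(w1 * x + w0) * (math.sin(w1 * x + w0) - y)
               for x, y in zip(xs, ys))
```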
8
Least squares with a non-linear function
Here's the loss function over $(w_0, w_1)$. Very much non-convex! Lots of local minima. Instead of solving exactly, we use an iterative solution: gradient descent.
9
Gradient descent algorithm
$\mathbf{w}^{(0)} \leftarrow$ a random point in $(w_0, w_1)$ space
loop until convergence do
    for each $w_j$ in $\mathbf{w}^{(t)}$ do
        $w_j \leftarrow w_j - \alpha \frac{\partial L}{\partial w_j}(\mathbf{w}; x, y)$
where $\alpha$ is the learning rate.
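A minimal sketch of this loop, assuming the derivative functions `dL_dw0` and `dL_dw1` sketched above, and using a fixed iteration count as a stand-in for a real convergence test:

```python
import random

def gradient_descent(xs, ys, alpha=0.0001, iterations=1000):
    # Start from a random point in (w0, w1) space.
    w0, w1 = random.uniform(0, 1), random.uniform(0, 1)
    for _ in range(iterations):           # stand-in for "loop until convergence"
        # Compute every partial derivative at the current weights,
        # then apply all of the updates at once.
        g0 = dL_dw0(w0, w1, xs, ys)
        g1 = dL_dw1(w0, w1, xs, ys)
        w0 = w0 - alpha * g0
        w1 = w1 - alpha * g1
    return w0, w1
```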
10
Gradient descent [figure: the loss $L(\mathbf{w}; x, y)$ plotted over a single weight $w_j$]
Good! We escaped a local minimum for a better solution.
11
Gradient descent [figure: the loss $L(\mathbf{w}; x, y)$ plotted over a single weight $w_j$]
Good! We escaped a local minimum for a better solution.
12
Let's run it! A simpler example. [Figures: the data and the corresponding loss surface.]
13
Let's run it! We have our partial derivatives:
$\frac{\partial L}{\partial w_1}(\mathbf{w}; x, y) = \sum_i 2 x_i \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
$\frac{\partial L}{\partial w_0}(\mathbf{w}; x, y) = \sum_i 2 \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
And we have our data: $2.04, \ldots, -0.33, \ldots$
14
Gradient descent example
Start with random $w_0 = 0.4$, $w_1 = 0.4$.
15
Gradient descent example
Gradient descent update, with $\alpha = 0.0001$: $w_j \leftarrow w_j - \alpha \frac{\partial L}{\partial w_j}(\mathbf{w}; x, y)$
$\frac{\partial L}{\partial w_1}(\mathbf{w}; x, y) = \sum_i 2 x_i \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
$= 2(2.04)\cos(0.4(2.04) + 0.4)\left( \sin(0.4(2.04) + 0.4) - \ldots \right) + 2(6.15)\cos(0.4(6.15) + 0.4)\left( \sin(\ldots) - \ldots \right) + \ldots = -189$
$w_1 \leftarrow 0.4 - 0.0001(-189) \approx 0.42$
16
Gradient descent example
Gradient descent update, with $\alpha = 0.0001$: $w_j \leftarrow w_j - \alpha \frac{\partial L}{\partial w_j}(\mathbf{w}; x, y)$
$\frac{\partial L}{\partial w_0}(\mathbf{w}; x, y) = \sum_i 2 \cos(w_1 x_i + w_0) \left( \sin(w_1 x_i + w_0) - y_i \right)$
$= 2\cos(0.4(2.04) + 0.4)\left( \sin(0.4(2.04) + 0.4) - \ldots \right) + 2\cos(0.4(6.15) + 0.4)\left( \sin(\ldots) - \ldots \right) + \ldots = -20.5$
$w_0 \leftarrow 0.4 - 0.0001(-20.5) \approx 0.402$
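The same two update steps as code, using the derivative functions sketched earlier. The slides list only fragments of the dataset, so the y-values below are hypothetical placeholders and the computed gradients will not exactly reproduce $-189$ and $-20.5$:

```python
# x-values 2.04 and 6.15 appear on the slides; the y-values here are
# hypothetical placeholders, so the gradients will not match -189 / -20.5.
xs = [2.04, 6.15]
ys = [0.9, -0.33]

w0, w1, alpha = 0.4, 0.4, 0.0001
g1 = dL_dw1(w0, w1, xs, ys)   # gradient w.r.t. w1 at the current weights
g0 = dL_dw0(w0, w1, xs, ys)   # gradient w.r.t. w0 at the current weights
w1 = w1 - alpha * g1          # one gradient descent step for each weight
w0 = w0 - alpha * g0
print(w0, w1)
```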
17
Gradient descent example
After 1 iteration, we have $w_0 = 0.402$, $w_1 = 0.42$.
18
Gradient descent example
After 2 iterations, we have $w_0 = 0.404$, $w_1 = 0.44$.
19
Gradient descent example
After 3 iterations, we have $w_0 = 0.405$, $w_1 = 0.45$.
20
Gradient descent example
After 4 iterations, we have $w_0 = 0.407$, $w_1 = 0.47$.
21
Gradient descent example
After 5 iterations, we still have $w_0 = 0.407$, $w_1 = 0.47$.
22
Gradient descent example
By 13 iterations, we've pretty well converged around $w_0 = 0.409$, $w_1 = 0.49$.
23
What about the complicated example?
Gradient descent doesn't always behave well with complicated data. It can overfit or oscillate.
24
Gradient descent example
Start with random $w_0 = 3.1$, $w_1 = 0.2$, and $\alpha = 0.01$.
25
Gradient descent example
After 1 iteration
26
Gradient descent example
After 2 iterations
27
Gradient descent example
After 3 iterations
28
Gradient descent example
After 4 iterations
29
Gradient descent example
After 5 iterations
30
Gradient descent example
After 6 iterations
31
Gradient descent example
After 7 iterations
32
Gradient descent example
After 8 iterations
33
Gradient descent example
After 9 iterations
34
Gradient descent example
After 10 iterations
35
Linear classifiers: we've been talking about fitting a line. But what about this linear classification example? Remember that "linear" in AI means constant slope; other functions may be polynomial, trigonometric, etc.
36
Threshold classifier: the line separating the two regions is a decision boundary. The easiest is a hard threshold [figure: a step function $f(x)$ over $x$]:
$f(z) = \begin{cases} 1, & \text{if } z \ge 0 \\ 0, & \text{else} \end{cases}$
37
Linear classifiers Here, our binary classifier would be
$f(X = (x_1, x_2)) = \begin{cases} 1, & \text{if } (x_2 + x_1 - 2.7) \ge 0 \\ 0, & \text{else} \end{cases}$
In general, for any line:
$f(X; \mathbf{w}) = \begin{cases} 1, & \text{if } (w_2 x_2 + w_1 x_1 + w_0) \ge 0 \\ 0, & \text{else} \end{cases}$
38
Perceptron (Neuron)
We can think of this as a composition of two functions:
$f(X; \mathbf{w}) = \begin{cases} 1, & \text{if } g(X; \mathbf{w}) \ge 0 \\ 0, & \text{else} \end{cases}$
$g(X; \mathbf{w}) = w_2 x_2 + w_1 x_1 + w_0$
We can represent this composition graphically. [Diagram: inputs $x_1$ and $x_2$ feed into $g(X; \mathbf{w})$ with weights $w_1$ and $w_2$ plus a bias weight $w_0$.]
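A minimal sketch of this composition in plain Python for the two-input case (the function names are mine):

```python
def g(x1, x2, w0, w1, w2):
    # Linear part: g(X; w) = w2*x2 + w1*x1 + w0
    return w2 * x2 + w1 * x1 + w0

def f(x1, x2, w0, w1, w2):
    # Hard threshold applied to the linear part
    return 1 if g(x1, x2, w0, w1, w2) >= 0 else 0

# Example with the earlier boundary x2 + x1 - 2.7 = 0:
print(f(1.0, 1.0, -2.7, 1.0, 1.0))   # 0: the point is below the line
print(f(2.0, 1.5, -2.7, 1.0, 1.0))   # 1: the point is on/above the line
```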
39
Perceptron learning rule
We can train a perceptron with a simple update:
$w_j \leftarrow w_j + \alpha \left( y - f(X; \mathbf{w}) \right) x_j$
where $y - f(X; \mathbf{w})$ is the error on $X$ with model $\mathbf{w}$. This is called the perceptron learning rule: iterative updates to the weight vector. Calculate the update to each weight and apply them all at once! It will converge to the optimal solution if the data are linearly separable.
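A minimal sketch of the perceptron learning rule, assuming the hard-threshold `f` above; training examples are (x1, x2, y) triples, and each example's weight updates are computed from the current weights before any of them is applied:

```python
def perceptron_train(examples, alpha=0.1, epochs=50):
    """examples: list of (x1, x2, y) with y in {0, 1}."""
    w0, w1, w2 = 0.0, 0.0, 0.0
    for _ in range(epochs):
        for x1, x2, y in examples:
            err = y - f(x1, x2, w0, w1, w2)        # the error on x with model w
            # Compute each weight's update from the current weights,
            # then apply them all at once.
            d0, d1, d2 = alpha * err * 1.0, alpha * err * x1, alpha * err * x2
            w0, w1, w2 = w0 + d0, w1 + d1, w2 + d2
    return w0, w1, w2
```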
40
Linear separability Can you draw a line that perfectly separates the classes?
41
The problem with hard thresholding
Perceptron updates won't converge if the data aren't separable! So let's try gradient descent:
$f(X; \mathbf{w}) = \begin{cases} 1, & \text{if } (w_2 x_2 + w_1 x_1 + w_0) \ge 0 \\ 0, & \text{else} \end{cases}$
Minimizing the L2 loss w.r.t. the true labels $Y$:
$L(\mathbf{w}, X, Y) = \sum_{i=1}^{N} \left( y_i - f(x_i; \mathbf{w}) \right)^2$
What breaks this minimization?
42
Switching to Logistic Regression
We need a differentiable classifier function. Use the logistic function (aka the sigmoid function):
$f(X; \mathbf{w}) = \frac{1}{1 + e^{-X \cdot \mathbf{w}}}$
Using this, it's now called logistic regression.
43
Modified neuron [diagram: inputs $x_1$ and $x_2$ with weights $w_1$ and $w_2$ plus a bias weight $w_0$, feeding the sigmoid output $f(X; \mathbf{w})$]
44
Gradient descent for logistic regression
Now we have a differentiable loss function!
$f(X; \mathbf{w}) = \frac{1}{1 + e^{-g(X; \mathbf{w})}}$, with $g(X; \mathbf{w}) = w_2 x_2 + w_1 x_1 + w_0$
L2 loss w.r.t. the true labels $Y$:
$L(\mathbf{w}, X, Y) = \sum_{i=1}^{N} \left( y_i - f(x_i; \mathbf{w}) \right)^2$
45
Gradient descent for logistic regression
Partial differentiation gives
$\frac{\partial L}{\partial w_j}(\mathbf{w}) = \sum_i -2 \left( y_i - f(x_i; \mathbf{w}) \right) \times f(x_i; \mathbf{w}) \left( 1 - f(x_i; \mathbf{w}) \right) \times x_{i,j}$
So now our gradient-based update for each $w_j$ looks like:
$w_j \leftarrow w_j - \alpha \sum_i -2 \left( y_i - f(x_i; \mathbf{w}) \right) \times f(x_i; \mathbf{w}) \left( 1 - f(x_i; \mathbf{w}) \right) \times x_{i,j}$
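A minimal sketch of batch gradient descent for logistic regression with the L2 loss, following the update above (data are (x1, x2, y) triples; the function names are mine):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x1, x2, w0, w1, w2):
    # f(X; w) = sigmoid(g(X; w)) with g(X; w) = w2*x2 + w1*x1 + w0
    return sigmoid(w2 * x2 + w1 * x1 + w0)

def logistic_gd(data, alpha=0.1, iterations=1000):
    w0, w1, w2 = 0.0, 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = g2 = 0.0
        # dL/dw_j = sum_i -2 * (y_i - f_i) * f_i * (1 - f_i) * x_{i,j}
        for x1, x2, y in data:
            p = predict(x1, x2, w0, w1, w2)
            common = -2.0 * (y - p) * p * (1.0 - p)
            g0 += common * 1.0        # the bias "input" is a constant 1
            g1 += common * x1
            g2 += common * x2
        # Gradient descent: w_j <- w_j - alpha * dL/dw_j
        w0, w1, w2 = w0 - alpha * g0, w1 - alpha * g1, w2 - alpha * g2
    return w0, w1, w2
```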
46
Gradient descent for logistic regression
47
Gradient descent for logistic regression
48
Gradient descent for logistic regression
49
Gradient descent for logistic regression
50
Gradient descent for logistic regression
51
Gradient descent for logistic regression
52
Gradient descent for logistic regression
53
The XOR problem: logistic regression is great! But it requires tolerating error, and sometimes there's just too much error.
54
The XOR problem: logistic regression is great! But it requires tolerating error, and sometimes there's just too much error. [Figure: XOR-patterned data over $x_1$ and $x_2$.] A linear model can never classify all of this data correctly!
55
Neural Networks: we can model nonlinear decision boundaries by stacking up neurons. [Diagram: a network with inputs $x_1$, $x_2$, $x_3$ feeding a layer of neurons.]
56
XOR neural network: XOR has two components, OR and ¬AND. Each of these is linearly separable. [Diagram: an OR unit over $x_1$ and $x_2$ computing $x_1 \lor x_2$, with its bias and weights still marked '?'.]
57
XOR neural network: XOR has two components, OR and ¬AND. Each of these is linearly separable. [Diagram: the OR unit, with bias $-0.5$ and weights $1, 1$ on $x_1$ and $x_2$, computes $x_1 \lor x_2$.]
58
XOR neural network: XOR has two components, OR and ¬AND. Each of these is linearly separable. [Diagram: an AND unit over $x_1$ and $x_2$ computing $x_1 \land x_2$, with its bias and weights still marked '?'.]
59
XOR neural network: XOR has two components, OR and ¬AND. Each of these is linearly separable. [Diagram: the AND unit, with bias $-1.5$ and weights $1, 1$ on $x_1$ and $x_2$, computes $x_1 \land x_2$.]
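A quick check, assuming hard-threshold units, that these biases and weights really implement OR and AND:

```python
def unit(x1, x2, bias, v1, v2):
    # Hard-threshold unit: fires iff v1*x1 + v2*x2 + bias >= 0
    return 1 if v1 * x1 + v2 * x2 + bias >= 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        or_out = unit(x1, x2, -0.5, 1, 1)    # OR:  bias -0.5, weights 1, 1
        and_out = unit(x1, x2, -1.5, 1, 1)   # AND: bias -1.5, weights 1, 1
        print((x1, x2), "OR:", or_out, "AND:", and_out)
```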
60
XOR neural network: $\mathrm{XOR}(x_1, x_2) = \mathrm{OR}(x_1, x_2) \land \lnot \mathrm{AND}(x_1, x_2)$. [Diagram: a hidden OR unit (bias $-0.5$, weights $1, 1$) and a hidden AND unit (bias $-1.5$, weights $1, 1$) feed an output unit whose bias and weights are still marked '?'.]
61
XOR neural network: $\mathrm{XOR}(x_1, x_2) = \mathrm{OR}(x_1, x_2) \land \lnot \mathrm{AND}(x_1, x_2)$. [Diagram: the output unit has bias $-0.1$, weight $1$ from the OR unit (bias $-0.5$), and weight $-1$ from the AND unit (bias $-1.5$); both hidden units have weights $1, 1$ on $x_1$ and $x_2$.]
62
XOR neural network: let's see what's going on. $\mathrm{XOR}(x_1, x_2) = \mathrm{OR}(x_1, x_2) \land \lnot \mathrm{AND}(x_1, x_2)$. [Diagram: the hidden OR unit computes $g(x_1, x_2)$ (bias $-0.5$, weights $1, 1$) and the hidden AND unit computes $h(x_1, x_2)$ (bias $-1.5$, weights $1, 1$); the output unit combines them with bias $-0.1$, weight $1$ on $g$, and weight $-1$ on $h$.]
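Putting the pieces together as a tiny two-layer network of hard-threshold units, using the weights from the slide (a sketch, not the lecture's own code):

```python
def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    g = step(1 * x1 + 1 * x2 - 0.5)    # hidden OR unit  (bias -0.5)
    h = step(1 * x1 + 1 * x2 - 1.5)    # hidden AND unit (bias -1.5)
    out = step(1 * g - 1 * h - 0.1)    # output unit: +1 on g, -1 on h, bias -0.1
    return out, (g, h)

for x1 in (0, 1):
    for x2 in (0, 1):
        out, hidden = xor_net(x1, x2)
        print((x1, x2), "-> hidden", hidden, "-> XOR", out)
```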
63
Nonlinear mapping in middle layer
[Figure: the data plotted in $(x_1, x_2)$ space and in the hidden-layer space $(g(x_1, x_2), h(x_1, x_2))$, with the OR and AND boundaries shown.]
64
Nonlinear mapping in middle layer
[Figure: the data plotted in $(x_1, x_2)$ space and in the hidden-layer space $(g(x_1, x_2), h(x_1, x_2))$, with the OR and AND boundaries shown.]
65
Nonlinear mapping in middle layer
[Figure: the data plotted in $(x_1, x_2)$ space and in the hidden-layer space $(g(x_1, x_2), h(x_1, x_2))$, with the OR and AND boundaries shown.]
66
Nonlinear mapping in middle layer
[Figure: the data plotted in $(x_1, x_2)$ space and in the hidden-layer space $(g(x_1, x_2), h(x_1, x_2))$, with the OR and AND boundaries shown.]
67
Nonlinear mapping in middle layer
[Figure: the data plotted in the hidden-layer space $(g(x_1, x_2), h(x_1, x_2))$, with the OR and AND boundaries shown.] Now it's linearly separable!
68
Backpropagation
$f(X; W) = \mathrm{sig}\!\left( \mathbf{w}_o \cdot \mathbf{h}(X; \mathbf{w}_h) \right)$
This is just another composition of functions: $\mathrm{XOR}(x_1, x_2) = \mathrm{OR}(x_1, x_2) \land \lnot \mathrm{AND}(x_1, x_2)$.
Generally, let $h_1 \ldots h_k$ be the intermediate functions (called the hidden layer), let $\mathbf{w}_h$ and $\mathbf{w}_o$ be the weight vectors for input->hidden and hidden->output, and let $\mathrm{sig}(z)$ denote the sigmoid function over $z$. Then
$f(X; W) = \mathrm{sig}\!\left( \mathbf{w}_o \cdot \mathbf{h}(X; \mathbf{w}_h) \right)$
69
Backpropagation: to learn with gradient descent (example: L2 loss w.r.t. $Y$)
$L(W, X, Y) = \sum_{i=1}^{N} \left( y_i - f(x_i; W) \right)^2$, with $f(X; W) = \mathrm{sig}\!\left( \mathbf{w}_o \cdot \mathbf{h}(X; \mathbf{w}_h) \right)$
Apply the Chain Rule (for differentiation this time) to differentiate the composed functions, and get the partial derivatives of the overall error w.r.t. each parameter in the network. For a hidden node $h_j$, get the derivative w.r.t. the output of $h_j$, then differentiate that w.r.t. $w_j$.
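A minimal sketch of these chain-rule updates for a network with one hidden layer of sigmoid units and the L2 loss, one example at a time (plain Python; all variable names are mine, and the learning rate and architecture are illustrative assumptions, not the lecture's):

```python
import math
import random

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, y, Wh, wo, alpha=0.5):
    """One gradient-descent step on a single example (x, y).
    Wh[j] = [bias, weight on x[0], weight on x[1]] for hidden unit j;
    wo    = [bias, weight on h_1, ..., weight on h_k] for the output unit."""
    # Forward pass: hidden activations h_j, then the output.
    h = [sig(Wh[j][0] + Wh[j][1] * x[0] + Wh[j][2] * x[1]) for j in range(len(Wh))]
    out = sig(wo[0] + sum(wo[j + 1] * h[j] for j in range(len(h))))

    # Output unit: derivative of L = (y - out)^2 w.r.t. its pre-sigmoid input.
    d_out = -2.0 * (y - out) * out * (1.0 - out)

    # Hidden units: chain rule back through the output weight and h_j's sigmoid.
    d_hid = [d_out * wo[j + 1] * h[j] * (1.0 - h[j]) for j in range(len(h))]

    # Gradient-descent updates, computed together and applied together.
    new_wo = [wo[0] - alpha * d_out] + \
             [wo[j + 1] - alpha * d_out * h[j] for j in range(len(h))]
    new_Wh = [[Wh[j][0] - alpha * d_hid[j],
               Wh[j][1] - alpha * d_hid[j] * x[0],
               Wh[j][2] - alpha * d_hid[j] * x[1]] for j in range(len(Wh))]
    return new_Wh, new_wo

# Usage sketch: two hidden units, random initial weights, trained on XOR.
# (This small network can get stuck in a local minimum; restarts may be needed.)
Wh = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
wo = [random.uniform(-1, 1) for _ in range(3)]
for _ in range(5000):
    for x, y in [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]:
        Wh, wo = backprop_step(x, y, Wh, wo)
```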
70
"Deep" learning: deep models use more than one hidden layer (i.e., more than one set of nonlinear functions before the final output). [Diagram: a network with inputs $x_1$, $x_2$, $x_3$ and multiple hidden layers.]
71
Today's learning goals
At the end of today, you should be able to: describe gradient descent for learning model parameters; explain the difference between logistic regression and linear regression; tell if a 2-D dataset is linearly separable; explain the structure of a neural network.
72
Next time: AI as an empirical science; experimental design.
73
End of class recap How does gradient descent use the loss function to tell us how to update model parameters? What machine learning problem is logistic regression for? What about linear regression? Can the dataset at right be correctly classified with a logistic regression? Can it be correctly classified with a neural network? What is your current biggest question about machine learning?