Announcements HW4 due today (11:59pm) HW5 out today (due 11/17 11:59pm)
Today’s learning goals At the end of today, you should be able to Describe gradient descent for learning model parameters Explain the difference between logistic regression and linear regression Tell if a 2-D dataset is linearly separable Explain the structure of a neural network
Least squares with a non-linear function Consider data from a nonlinear distribution; here, assume it is sinusoidal. We now want the sine wave of best fit: y = sin(w_1 x + w_0), considering only two of the four possible parameters for convenience.
Least squares with a non-linear function With observed data (x_1, y_1), …, (x_N, y_N), we want y = sin(w_1 x + w_0). Least squares: minimize the L2 loss (sum of squared errors) L(w; x, y) = Σ_{j=1}^{N} (y_j − sin(w_1 x_j + w_0))^2, and take w* = argmin_w L(w; x, y).
Least squares with a non-linear function Using the L2 loss L(w; x, y) = Σ_{j=1}^{N} (y_j − sin(w_1 x_j + w_0))^2, again calculate the partial derivatives with respect to w_0 and w_1: ∂L/∂w_1 (w; x, y) = Σ_j 2 x_j cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j) and ∂L/∂w_0 (w; x, y) = Σ_j 2 cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j).
Least squares with a non-linear function ∂L/∂w_1 (w; x, y) = Σ_j 2 x_j cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j) and ∂L/∂w_0 (w; x, y) = Σ_j 2 cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j). But there's no unique solution for these! In many cases, there won't even be a closed-form solution.
Least squares with a non-linear function Here's the loss function over (w_0, w_1): very much non-convex, with lots of local minima. Instead of solving exactly, we use an iterative solution: gradient descent.
Gradient descent algorithm w^(0) ← random point in (w_0, w_1) space; loop until convergence: for each w_i in w^(t), do w_i ← w_i − α ∂L/∂w_i (w; x, y). Here α is the learning rate.
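A minimal Python sketch of this loop (not lecture code; the function name gradient_descent, the fixed iteration count, and the loss_grad callback that returns the partial derivatives are all illustrative assumptions):

    import random

    def gradient_descent(loss_grad, n_params, alpha=1e-4, n_iters=1000):
        # w^(0): start from a random point in parameter space
        w = [random.uniform(-1.0, 1.0) for _ in range(n_params)]
        for _ in range(n_iters):           # "loop until convergence" (fixed count here)
            grads = loss_grad(w)           # partial derivatives of L w.r.t. each w_i
            # update every w_i at once, stepping against the gradient
            w = [w_i - alpha * g_i for w_i, g_i in zip(w, grads)]
        return w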
Gradient descent [Plot of L(w; x, y) against w_i, showing successive gradient steps.] Good! Escaped the local solution for a better solution.
Let's run it! A simpler example. [Plots: the data and the loss.]
Let's run it! We have our partial derivatives: ∂L/∂w_1 (w; x, y) = Σ_j 2 x_j cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j) and ∂L/∂w_0 (w; x, y) = Σ_j 2 cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j). We have our data: (2.04, 0.94), (6.15, −0.33), …
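As a rough illustration (not lecture code), here is how those partial derivatives could be computed in Python. Only the two data points shown above are used, so the sums will not match the full-dataset numbers on the following slides:

    import math

    data = [(2.04, 0.94), (6.15, -0.33)]     # only the two points shown above

    def grad(w0, w1, data):
        # partial derivatives of the L2 loss for y = sin(w1*x + w0)
        dw0 = sum(2 * math.cos(w1 * x + w0) * (math.sin(w1 * x + w0) - y)
                  for x, y in data)
        dw1 = sum(2 * x * math.cos(w1 * x + w0) * (math.sin(w1 * x + w0) - y)
                  for x, y in data)
        return dw0, dw1

    w0, w1, alpha = 0.4, 0.4, 0.0001         # starting point and learning rate used below
    dw0, dw1 = grad(w0, w1, data)
    w0, w1 = w0 - alpha * dw0, w1 - alpha * dw1   # one gradient descent step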
Gradient descent example Start with random w_0 = 0.4, w_1 = 0.4.
Gradient descent example Gradient descent update w_i ← w_i − α ∂L/∂w_i (w; x, y), with α = 0.0001: ∂L/∂w_1 (w; x, y) = Σ_j 2 x_j cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j) = 2(2.04) cos(0.4(2.04) + 0.4) (sin(0.4(2.04) + 0.4) − 0.94) + 2(6.15) cos(0.4(6.15) + 0.4) (sin(0.4(6.15) + 0.4) + 0.33) + … = −189. So w_1 ← 0.4 − 0.0001(−189) ≈ 0.42.
Gradient descent example Gradient descent update w_i ← w_i − α ∂L/∂w_i (w; x, y), with α = 0.0001: ∂L/∂w_0 (w; x, y) = Σ_j 2 cos(w_1 x_j + w_0) (sin(w_1 x_j + w_0) − y_j) = 2 cos(0.4(2.04) + 0.4) (sin(0.4(2.04) + 0.4) − 0.94) + 2 cos(0.4(6.15) + 0.4) (sin(0.4(6.15) + 0.4) + 0.33) + … = −20.5. So w_0 ← 0.4 − 0.0001(−20.5) ≈ 0.402.
Gradient descent example After 1 iteration, we have w_0 = 0.402, w_1 = 0.42
Gradient descent example After 2 iterations, we have w_0 = 0.404, w_1 = 0.44
Gradient descent example After 3 iterations, we have w_0 = 0.405, w_1 = 0.45
Gradient descent example After 4 iterations, we have w_0 = 0.407, w_1 = 0.47
Gradient descent example After 5 iterations, we still have w_0 = 0.407, w_1 = 0.47
Gradient descent example By 13 iterations, we've pretty well converged around w_0 = 0.409, w_1 = 0.49
What about the complicated example? Gradient descent doesn't always behave well with complicated data: it can overfit or oscillate.
Gradient descent example Start with random w_0 = 3.1, w_1 = 0.2, and α = 0.01.
Gradient descent example After 1 iteration
Gradient descent example After 2 iterations
Gradient descent example After 3 iterations
Gradient descent example After 4 iterations
Gradient descent example After 5 iterations
Gradient descent example After 6 iterations
Gradient descent example After 7 iterations
Gradient descent example After 8 iterations
Gradient descent example After 9 iterations
Gradient descent example After 10 iterations
Linear classifiers We’ve been talking about fitting a line. But what about this linear classification example? Remember that “linear” in AI means constant slope; other functions may be polynomial, trigonometric, etc.
Threshold classifier The line separating the two regions is a decision boundary. Easiest is a hard threshold: f(z) = 1 if z ≥ 0, else 0.
Linear classifiers Here, our binary classifier would be f(x = (x_1, x_2)) = 1 if (x_2 + x_1 − 2.7) ≥ 0, else 0. In general, for any line: f(x; w) = 1 if (w_2 x_2 + w_1 x_1 + w_0) ≥ 0, else 0.
Perceptron (Neuron) We can think of this as a composition of two functions: f(x; w) = w_2 x_2 + w_1 x_1 + w_0 and g(x; w) = 1 if f(x; w) ≥ 0, else 0. We can represent this composition graphically as a single unit with inputs x_1, x_2 (and a constant 1), weights w_1, w_2, w_0, and output g(x; w).
Perceptron learning rule We can train a perceptron with a simple update: w_i ← w_i + α (y − g(x; w)) x_i, where (y − g(x; w)) is the error on x with model w. This is called the perceptron learning rule: iterative updates to the weight vector, where we calculate the update to each weight and apply them all at once. It will converge to the optimal solution if the data are linearly separable.
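A small Python sketch of this rule (the data format, a list of ([x1, x2], y) pairs with y in {0, 1}, plus the learning rate and epoch count, are assumptions, not lecture code):

    def predict(w, x):
        # hard-threshold unit: w = [w0, w1, w2], x = [x1, x2]
        return 1 if w[0] + w[1] * x[0] + w[2] * x[1] >= 0 else 0

    def perceptron_train(data, alpha=0.1, epochs=100):
        w = [0.0, 0.0, 0.0]
        for _ in range(epochs):
            for x, y in data:
                err = y - predict(w, x)                 # the error on x with model w
                # compute the update to every weight from the old w, then apply them all
                w = [wi + alpha * err * xi for wi, xi in zip(w, [1.0] + x)]
        return w

    # e.g. learning OR, which is linearly separable:
    w = perceptron_train([([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)])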
Linear separability Can you draw a line that perfectly separates the classes?
The problem with hard thresholding Perceptron updates won't converge if the data aren't separable! So let's try gradient descent with g(x; w) = 1 if (w_2 x_2 + w_1 x_1 + w_0) ≥ 0, else 0, minimizing the L2 loss with respect to the true labels Y: L(w, X, Y) = Σ_{j=1}^{N} (y_j − g(x_j; w))^2. What breaks this minimization?
Switching to Logistic Regression We need a differentiable classifier function. Use the logistic function (aka sigmoid function): f(x; w) = 1 / (1 + e^(−w·x)). Using this, it's now called logistic regression.
Modified neuron [Diagram: the same unit with inputs x_1, x_2, constant 1, and weights w_1, w_2, w_0, but with the hard threshold replaced by the sigmoid, outputting g(x; w).]
Gradient descent for logistic regression Now we have a differentiable loss function! g(x; w) = 1 / (1 + e^(−f(x; w))), with f(x; w) = w_2 x_2 + w_1 x_1 + w_0. L2 loss with respect to the true labels Y: L(w, X, Y) = Σ_{j=1}^{N} (y_j − g(x_j; w))^2.
Gradient descent for logistic regression Partial differentiation gives ∂L/∂w_i (w) = Σ_j −2 (y_j − g(x_j; w)) g(x_j; w) (1 − g(x_j; w)) x_{j,i}. So our gradient-based update for each w_i looks like: w_i ← w_i − α ∂L/∂w_i (w) = w_i + α Σ_j 2 (y_j − g(x_j; w)) g(x_j; w) (1 − g(x_j; w)) x_{j,i}.
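A sketch of this update in Python, keeping the course's L2 loss (cross-entropy is more common in practice). The data format ([x1, x2], y) and the hyperparameters are illustrative assumptions:

    import math

    def g(w, x):
        f = w[0] + w[1] * x[0] + w[2] * x[1]       # linear part f(x; w)
        return 1.0 / (1.0 + math.exp(-f))          # sigmoid

    def logistic_gd(data, alpha=0.1, n_iters=1000):
        w = [0.0, 0.0, 0.0]
        for _ in range(n_iters):
            grads = [0.0, 0.0, 0.0]
            for x, y in data:
                p = g(w, x)
                # dL/dw_i = sum_j -2 (y_j - p) p (1 - p) x_{j,i}, with x_{j,0} = 1
                for i, xi in enumerate([1.0] + x):
                    grads[i] += -2.0 * (y - p) * p * (1.0 - p) * xi
            w = [wi - alpha * gi for wi, gi in zip(w, grads)]   # step against the gradient
        return w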
Gradient descent for logistic regression [A sequence of plots showing the fit over successive iterations of gradient descent.]
The XOR problem Logistic regression is great! But it requires tolerating error, and sometimes there’s just too much error.
The XOR problem Logistic regression is great! But it requires tolerating error, and sometimes there's just too much error. [Plot of the XOR data over x_1 and x_2.] A linear model can never classify these points correctly, no matter where the line is drawn!
Neural Networks We can model nonlinear decision boundaries by stacking up neurons. [Diagram: a network with inputs x_1, x_2, x_3 feeding a layer of neurons, whose outputs feed further neurons.]
XOR neural network: OR XOR has two components: OR and ¬AND. Each of these is linearly separable. [Diagram: a threshold unit over x_1 and x_2 with unknown weights and bias (?), outputting x_1 ∨ x_2.]
XOR neural network: OR [Diagram: the OR unit with bias −0.5 and weights 1 on x_1 and 1 on x_2, outputting x_1 ∨ x_2.]
XOR neural network: AND XOR has two components: OR and ¬AND. Each of these is linearly separable. [Diagram: a threshold unit over x_1 and x_2 with unknown weights and bias (?), outputting x_1 ∧ x_2.]
XOR neural network: AND [Diagram: the AND unit with bias −1.5 and weights 1 on x_1 and 1 on x_2, outputting x_1 ∧ x_2.]
XOR neural network XOR = OR(x_1, x_2) ∧ ¬AND(x_1, x_2). [Diagram: the OR unit (bias −0.5, weights 1, 1) and the AND unit (bias −1.5, weights 1, 1) feed an output unit with unknown weights (?), producing XOR(x_1, x_2).]
XOR neural network XOR = OR(x_1, x_2) ∧ ¬AND(x_1, x_2). [Diagram: the output unit has bias −0.1, weight 1 from the OR unit, and weight −1 from the AND unit.]
XOR neural network: let's see what's going on XOR = OR(x_1, x_2) ∧ ¬AND(x_1, x_2). Call the OR unit's output f(x_1, x_2) and the AND unit's output h(x_1, x_2); the output unit thresholds 1·f(x_1, x_2) − 1·h(x_1, x_2) − 0.1.
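A quick Python check of this network (an illustration, not lecture code), using hard-threshold units with the weights above; it reproduces the XOR truth table:

    def step(z):
        return 1 if z >= 0 else 0

    def xor_net(x1, x2):
        f = step(1 * x1 + 1 * x2 - 0.5)     # hidden OR unit, f(x1, x2)
        h = step(1 * x1 + 1 * x2 - 1.5)     # hidden AND unit, h(x1, x2)
        return step(1 * f - 1 * h - 0.1)    # output unit: OR and not-AND

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, "->", xor_net(x1, x2))   # prints the XOR truth table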
Nonlinear mapping in middle layer [Plots: the four XOR points in the original (x_1, x_2) input space, and their images under the hidden OR unit f(x_1, x_2) and AND unit h(x_1, x_2).]
Nonlinear mapping in middle layer In the hidden (f, h) space defined by the OR and AND units, now it's linearly separable!
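A small sketch (assumed, not from the slides) of that mapping: compute the hidden (f, h) outputs for each of the four XOR inputs and observe that their images are linearly separable, e.g. by f − h ≥ 0.5:

    def step(z):
        return 1 if z >= 0 else 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            f = step(x1 + x2 - 0.5)    # OR unit
            h = step(x1 + x2 - 1.5)    # AND unit
            print((x1, x2), "->", (f, h))
    # (0,0)->(0,0), (0,1)->(1,0), (1,0)->(1,0), (1,1)->(1,1):
    # in (f, h) space the positive XOR class is exactly the point (1, 0),
    # so a single line such as f - h >= 0.5 now separates the classes.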
Backpropagation XOR(x_1, x_2) = OR(x_1, x_2) ∧ ¬AND(x_1, x_2) is just another composition of functions. Generally, let h_1 … h_m be the intermediate functions (called the hidden layer), let w_1 and w_2 be the weight vectors for input→hidden and hidden→output, and let sig(z) denote the sigmoid function over z. Then g(x; W) = sig(w_2 · h(x; w_1)).
Backpropagation To learn with gradient descent (example: L2 loss with respect to Y), use L(W, X, Y) = Σ_{j=1}^{N} (y_j − g(x_j; W))^2 with g(x; W) = sig(w_2 · h(x; w_1)). Apply the Chain Rule (for differentiation this time) to differentiate the composed functions and get partial derivatives of the overall error with respect to each parameter in the network. For hidden node h_i, get the derivative with respect to the output of h_i, then differentiate that with respect to each weight w_j.
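A minimal backpropagation sketch for one training example with a 2-2-1 sigmoid network and this L2 loss; the network shape, weight layout, and learning rate are illustrative assumptions, not the course's reference implementation:

    import math

    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))

    def backprop_step(w1, w2, x, y, alpha=0.5):
        # w1: two hidden units, each [bias, weight_x1, weight_x2]; w2: [bias, weight_h1, weight_h2]
        # forward pass: hidden activations h and output g(x; W)
        h = [sig(w1[i][0] + w1[i][1] * x[0] + w1[i][2] * x[1]) for i in range(2)]
        out = sig(w2[0] + w2[1] * h[0] + w2[2] * h[1])

        # chain rule at the output: dL/d(pre-activation) = -2 (y - out) * out * (1 - out)
        d_out = -2.0 * (y - out) * out * (1.0 - out)
        # chain rule at each hidden node: push d_out back through w2, then through the sigmoid
        d_h = [d_out * w2[i + 1] * h[i] * (1.0 - h[i]) for i in range(2)]

        # gradient descent update for every weight in the network
        new_w2 = [w2[0] - alpha * d_out,
                  w2[1] - alpha * d_out * h[0],
                  w2[2] - alpha * d_out * h[1]]
        new_w1 = [[w1[i][0] - alpha * d_h[i],
                   w1[i][1] - alpha * d_h[i] * x[0],
                   w1[i][2] - alpha * d_h[i] * x[1]] for i in range(2)]
        return new_w1, new_w2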
“Deep” learning Deep models use more than one hidden layer (i.e., more than one set of nonlinear functions before the final output). [Diagram: a network with inputs x_1, x_2, x_3 and several hidden layers.]
Today’s learning goals At the end of today, you should be able to Describe gradient descent for learning model parameters Explain the difference between logistic regression and linear regression Tell if a 2-D dataset is linearly separable Explain the structure of a neural network
Next time AI as an empirical science Experimental design
End of class recap How does gradient descent use the loss function to tell us how to update model parameters? What machine learning problem is logistic regression for? What about linear regression? Can the dataset at right be correctly classified with a logistic regression? Can it be correctly classified with a neural network? What is your current biggest question about machine learning?