Machine Learning 10-601: Artificial Neural Networks and Backpropagation


Machine Learning 10-601
Maria Florina Balcan, Machine Learning Department, Carnegie Mellon University, 02/10/2016
Today: artificial neural networks; backpropagation.
Reading: Mitchell, Chapter 4; Bishop, Chapter 5.
Slides courtesy of Tom Mitchell.

Artificial Neural Networks (ANNs)
Biological systems are built of very complex webs of interconnected neurons. Each neuron is highly connected to other neurons and performs computations by combining the signals it receives from them; the outputs of these computations may in turn be transmitted to one or more other neurons.
Artificial neural networks are built out of a densely interconnected set of simple units (e.g., sigmoid units). Each unit takes real-valued inputs (possibly the outputs of other units) and produces a real-valued output (which may become the input to many other units).

ALVINN [Pomerleau 1993]: a neural network that learns to steer an autonomous vehicle from camera images.

Artificial Neural Networks to Learn f: X → Y
$f_{\mathbf{w}}$ is typically a non-linear function, $f_{\mathbf{w}}: X \to Y$.
X (feature space): a (vector of) continuous and/or discrete variables.
Y (output space): a (vector of) continuous and/or discrete variables.
$f_{\mathbf{w}}$ is represented by a network of basic units.
Learning algorithm: given $\{\langle x_d, t_d \rangle\}_{d \in D}$, train the weights $\mathbf{w}$ of all units to minimize the sum of squared errors of the predicted network outputs, i.e., find parameters $\mathbf{w}$ to minimize
$$\sum_{d \in D} \left( f_{\mathbf{w}}(x_d) - t_d \right)^2.$$
Use gradient descent!

What Type of Units Should We Use?
The classifier is a multilayer network of units. Each unit takes some inputs and produces one output, and the output of one unit can be the input of another. Units are organized into an input layer, one or more hidden layers, and an output layer.

Multilayer Network of Linear Units?
Advantage: we know how to do gradient descent on linear units.
In the pictured network, the input layer is $x_0 = 1, x_1, x_2$; the hidden layer computes $v_1 = \mathbf{w}_1 \cdot \mathbf{x}$ and $v_2 = \mathbf{w}_2 \cdot \mathbf{x}$; the output layer computes $z = \mathbf{w}_3 \cdot \mathbf{v}$.
Problem: a linear function of linear functions is just linear:
$$z = w_{3,1}(\mathbf{w}_1 \cdot \mathbf{x}) + w_{3,2}(\mathbf{w}_2 \cdot \mathbf{x}) = (w_{3,1}\mathbf{w}_1 + w_{3,2}\mathbf{w}_2) \cdot \mathbf{x},$$
which is linear in $\mathbf{x}$.
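
A quick numeric check of this point (my own illustration, not part of the slides): composing two linear layers gives exactly the same map as a single linear layer whose weight matrix is the product of the two.

```python
# Sketch: stacking linear layers collapses to one linear layer.
import numpy as np

rng = np.random.default_rng(0)
W_hidden = rng.normal(size=(2, 3))   # rows play the role of w_1 and w_2 (3 inputs incl. bias)
w_out = rng.normal(size=(1, 2))      # weights w_{3,1}, w_{3,2} of the output unit

x = rng.normal(size=3)               # an arbitrary input (x_0 = 1 could be the bias slot)

two_layer = w_out @ (W_hidden @ x)   # hidden layer, then output layer
collapsed = (w_out @ W_hidden) @ x   # the single equivalent linear map

print(np.allclose(two_layer, collapsed))   # True: linear of linear is linear
```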

Multilayer Network of Perceptron Units?
Advantage: can produce highly non-linear decision boundaries!
Here the hidden units compute $v_1 = \text{step}(\mathbf{w}_1 \cdot \mathbf{x})$ and $v_2 = \text{step}(\mathbf{w}_2 \cdot \mathbf{x})$, and the output unit computes $z = \text{step}(\mathbf{w}_3 \cdot \mathbf{v})$, where the threshold (step) function is $\text{step}(x) = 1$ if $x$ is positive and $0$ if $x$ is negative.
Problem: the discontinuous threshold is not differentiable, so we can't do gradient descent.

Multilayer Network of Sigmoid Units
Advantages: can produce highly non-linear decision boundaries, and the sigmoid is differentiable, so we can use gradient descent.
Here the hidden units compute $v_1 = \sigma(\mathbf{w}_1 \cdot \mathbf{x})$ and $v_2 = \sigma(\mathbf{w}_2 \cdot \mathbf{x})$, and the output unit computes $z = \sigma(\mathbf{w}_3 \cdot \mathbf{v})$, where
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
Very useful in practice!

The Sigmoid Unit
$\sigma$ is the sigmoid function, $\sigma(x) = \frac{1}{1 + e^{-x}}$.
Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)\,(1 - \sigma(x))$.
We can derive gradient descent rules to train one sigmoid unit, and multilayer networks of sigmoid units → backpropagation.
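
As a small sanity check (my own sketch, not from the lecture), the identity $\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$ can be verified numerically against a finite-difference estimate of the derivative:

```python
# Sketch: compare the closed-form sigmoid derivative with a numeric estimate.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)            # sigma(x) * (1 - sigma(x))

xs = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)   # central difference

print(np.allclose(numeric, sigmoid_grad(xs), atol=1e-6))        # True
```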

Gradient Descent to Minimize Squared Error
Goal: given $\{\langle x_d, t_d \rangle\}_{d \in D}$, find $\mathbf{w}$ to minimize
$$E_D[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} \left( f_{\mathbf{w}}(x_d) - t_d \right)^2, \qquad \nabla E[\mathbf{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right].$$
Batch-mode gradient descent: do until satisfied: compute the gradient $\nabla E_D[\mathbf{w}]$, then update $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_D[\mathbf{w}]$.
Incremental (stochastic) gradient descent: do until satisfied: for each training example $d$ in $D$, with $E_d[\mathbf{w}] \equiv \frac{1}{2}(t_d - o_d)^2$, compute the gradient $\nabla E_d[\mathbf{w}]$ and update $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E_d[\mathbf{w}]$.
Note: incremental gradient descent can approximate batch gradient descent arbitrarily closely if $\eta$ is made small enough.
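
The sketch below (my own, not the lecture's code) contrasts the two update schedules. For concreteness it uses a simple linear unit $f_{\mathbf{w}}(x) = \mathbf{w} \cdot x$ under squared error; the slide's $f_{\mathbf{w}}$ is generic, and the data, learning rate, and iteration counts here are arbitrary choices.

```python
# Sketch: batch vs. incremental (stochastic) gradient descent on squared error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # inputs x_d
true_w = np.array([1.0, -2.0, 0.5])
t = X @ true_w + 0.01 * rng.normal(size=100)     # targets t_d

eta = 0.1

# Batch mode: one update per pass, using the gradient over all of D.
w = np.zeros(3)
for _ in range(200):
    grad = -((t - X @ w) @ X)                    # gradient of 1/2 * sum_d (t_d - o_d)^2
    w = w - eta * grad / len(X)                  # scaled by |D| for a stable step size
print("batch:", w)

# Incremental/stochastic mode: one update per training example.
w = np.zeros(3)
for _ in range(200):
    for x_d, t_d in zip(X, t):
        grad_d = -(t_d - x_d @ w) * x_d          # gradient of 1/2 * (t_d - o_d)^2
        w = w - eta * grad_d
print("stochastic:", w)
```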

Gradient Descent for the Sigmoid Unit
Given $\{\langle x_d, t_d \rangle\}_{d \in D}$, find $\mathbf{w}$ to minimize $\sum_{d \in D} (o_d - t_d)^2$, where $o_d$ is the observed unit output for $x_d$: $net_d = \sum_i w_i x_{i,d}$ and $o_d = \sigma(net_d)$.
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} \frac{\partial}{\partial w_i}(t_d - o_d)^2 = \frac{1}{2} \sum_{d \in D} 2\,(t_d - o_d)\, \frac{\partial}{\partial w_i}(t_d - o_d) = \sum_{d \in D} (t_d - o_d)\left( -\frac{\partial o_d}{\partial w_i} \right) = -\sum_{d \in D} (t_d - o_d)\, \frac{\partial o_d}{\partial net_d}\, \frac{\partial net_d}{\partial w_i}.$$
But we know:
$$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d) \quad \text{and} \quad \frac{\partial net_d}{\partial w_i} = \frac{\partial (\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_i} = x_{i,d}.$$
So:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d}.$$

Gradient Descent for the Sigmoid Unit
With $net_d = \sum_i w_i x_{i,d}$ and $o_d = \sigma(net_d)$ the observed unit output for $x_d$:
$$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d)\, o_d (1 - o_d)\, x_{i,d} = -\sum_{d \in D} \delta_d\, x_{i,d},$$
where $\delta_d = (t_d - o_d)\, o_d (1 - o_d)$ is the error term $(t_d - o_d)$ multiplied by the factor $o_d(1 - o_d)$ that comes from the derivative of the sigmoid function.
Update rule: $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla E[\mathbf{w}]$.
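
A minimal sketch (my own, not the lecture's code) of training a single sigmoid unit with exactly this rule. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration.

```python
# Sketch: batch gradient descent for one sigmoid unit,
# using dE/dw_i = -sum_d delta_d * x_{i,d} with delta_d = (t_d - o_d) o_d (1 - o_d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # x_0 = 1 bias input
t = (X @ np.array([-0.5, 2.0, -1.0]) > 0).astype(float)         # 0/1 targets

w = np.zeros(3)
eta = 0.5
for _ in range(1000):
    o = sigmoid(X @ w)                    # o_d for every training example
    delta = (t - o) * o * (1.0 - o)       # delta_d for every training example
    grad = -(delta @ X)                   # dE/dw_i = -sum_d delta_d * x_{i,d}
    w = w - eta * grad / len(X)           # update w <- w - eta * grad (scaled by |D|)

print("training accuracy:", np.mean((sigmoid(X @ w) > 0.5) == (t == 1)))
```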

Gradient Descent for Multilayer Networks
Given $\{\langle x_d, t_d \rangle\}_{d \in D}$, find $\mathbf{w}$ to minimize
$$E_D[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} \sum_{k \in Outputs} \left( o_{k,d} - t_{k,d} \right)^2.$$

Backpropagation Algorithm (incremental/stochastic gradient descent)
Notation: $x_{i,j}$ = the $i$th input to unit $j$; $w_{i,j}$ = the weight from $i$ to $j$; $o$ = observed unit output; $t$ = target output.
Initialize all weights to small random numbers. Until satisfied, do: for each training example $(x, t)$:
1. Input the training example to the network and compute the network outputs.
2. For each output unit $k$: $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$.
3. For each hidden unit $h$: $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k}\, \delta_k$.
4. Update each network weight $w_{i,j}$: $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\, \delta_j\, x_{i,j}$.
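
Below is a minimal sketch of this procedure (my own code, under the slide's assumptions: one hidden layer of sigmoid units, sigmoid outputs, squared error, per-example updates). Names such as `fit`, `predict`, `n_hidden`, and the XOR example are my own choices, and the hyperparameters are arbitrary.

```python
# Sketch: stochastic backpropagation for a network with one hidden sigmoid layer.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, T, n_hidden=4, eta=0.5, n_epochs=5000, seed=0):
    """Train a 1-hidden-layer sigmoid network with per-example (stochastic) updates."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # Initialize all weights to small random numbers (bias folded in as weight on x_0 = 1).
    W_h = rng.uniform(-0.3, 0.3, size=(n_hidden, n_in + 1))    # input -> hidden
    W_o = rng.uniform(-0.3, 0.3, size=(n_out, n_hidden + 1))   # hidden -> output
    for _ in range(n_epochs):
        for x, t in zip(X, T):
            # 1. Forward pass: compute the network outputs.
            x1 = np.append(1.0, x)                  # x_0 = 1
            o_h = sigmoid(W_h @ x1)                 # hidden unit outputs
            h1 = np.append(1.0, o_h)
            o_k = sigmoid(W_o @ h1)                 # output unit outputs
            # 2. For each output unit k: delta_k = o_k (1 - o_k)(t_k - o_k).
            delta_k = o_k * (1.0 - o_k) * (t - o_k)
            # 3. For each hidden unit h: delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k.
            delta_h = o_h * (1.0 - o_h) * (W_o[:, 1:].T @ delta_k)
            # 4. w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}.
            W_o += eta * np.outer(delta_k, h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

def predict(W_h, W_o, X):
    H = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ W_h.T)
    return sigmoid(np.hstack([np.ones((len(X), 1)), H]) @ W_o.T)

# Example: XOR, which no single linear threshold or sigmoid unit can represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = fit(X, T)
print(np.round(predict(W_h, W_o, X), 2))   # outputs should approach [0, 1, 1, 0]
```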

As training proceeds, the validation/generalization error first decreases, then increases: the weights become tuned to fit idiosyncrasies of the training examples that are not representative of the general distribution. Stop training when the error over the validation set is lowest, though it is not always obvious when the lowest validation error has been reached.

Dealing with Overfitting
Our learning algorithm involves a parameter n = the number of gradient descent iterations. How do we choose n to optimize future error?
Separate the available data into a training set and a validation set. Use the training set to perform gradient descent, and set n to the number of iterations that optimizes validation set error.
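
A sketch of this procedure (my own, using a single squared-error sigmoid unit as a stand-in model): run gradient descent on the training split, record validation error after every iteration, and keep the n with the lowest validation error. The data and hyperparameters are arbitrary.

```python
# Sketch: choosing n (number of gradient descent iterations) by validation error.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
t = (X[:, 1] + X[:, 2] + 0.5 * rng.normal(size=300) > 0).astype(float)  # noisy labels

# Separate available data into training and validation sets.
X_tr, t_tr = X[:200], t[:200]
X_va, t_va = X[200:], t[200:]

w = np.zeros(3)
eta, best_n, best_err = 0.5, 0, np.inf
for n in range(1, 501):
    o = sigmoid(X_tr @ w)
    delta = (t_tr - o) * o * (1 - o)
    w = w - eta * (-(delta @ X_tr)) / len(X_tr)          # one gradient descent step
    val_err = np.mean((sigmoid(X_va @ w) > 0.5) != (t_va == 1))
    if val_err < best_err:
        best_n, best_err = n, val_err                    # remember the best n so far

print("chosen n:", best_n, "validation error:", best_err)
```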

K-Fold Cross-Validation
Idea: train multiple times, leaving out a disjoint subset of the data each time for testing; average the test set accuracies.
Partition the data into K disjoint subsets.
For k = 1 to K:
  testData = the kth subset
  h ← classifier trained* on all data except testData
  accuracy(k) = accuracy of h on testData
FinalAccuracy = mean of the K recorded test set accuracies
* might withhold some of this data to choose the number of gradient descent steps
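
A generic sketch of this procedure (my own helper; the name `k_fold_accuracy` and the `train_fn` interface are assumptions, not from the slides):

```python
# Sketch: K-fold cross-validation over an arbitrary training procedure.
import numpy as np

def k_fold_accuracy(X, y, train_fn, K=10, seed=0):
    """train_fn(X_train, y_train) must return a model with a .predict(X) method."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), K)    # K disjoint subsets
    accuracies = []
    for k in range(K):
        test_idx = folds[k]                               # testData = kth subset
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        h = train_fn(X[train_idx], y[train_idx])          # train on all data except testData
        accuracies.append(np.mean(h.predict(X[test_idx]) == y[test_idx]))
    return np.mean(accuracies)                            # FinalAccuracy
```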

Leave-One-Out Cross-Validation
This is just K-fold cross-validation where each subset contains exactly one example (one example is left out on each iteration).
Partition the data into K disjoint subsets, each containing one example.
For k = 1 to K:
  testData = the kth subset
  h ← classifier trained* on all data except testData
  accuracy(k) = accuracy of h on testData
FinalAccuracy = mean of the K recorded test set accuracies
* might withhold some of this data to choose the number of gradient descent steps
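
In terms of the hypothetical `k_fold_accuracy` helper sketched above (and assuming `X`, `y`, and `train_fn` are defined as there), leave-one-out is just the K = number-of-examples special case:

```python
# Sketch: leave-one-out CV as K-fold CV with one example per fold.
loo_accuracy = k_fold_accuracy(X, y, train_fn, K=len(X))
```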

Expressive Capabilities of ANNs
Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs.
Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Representing Simple Boolean Functions
Inputs $x_i \in \{0, 1\}$, using a threshold unit with a constant input $x_0 = 1$ whose weight serves as the bias.
OR function, $x_{i_1} \vee x_{i_2} \vee \ldots \vee x_{i_k}$: set $w_i = 1$ if $i$ is one of the $i_j$, $w_i = 0$ otherwise, and bias weight $-0.5$.
AND function, $x_{i_1} \wedge x_{i_2} \wedge \ldots \wedge x_{i_k}$: set $w_i = 1$ if $i$ is one of the $i_j$, $w_i = 0$ otherwise, and bias weight $0.5 - k$.
AND with negations: set $w_i = 1$ if $x_i$ appears not negated, $w_i = -1$ if $x_i$ appears negated, $w_i = 0$ otherwise, and bias weight $0.5 - t$, where $t$ = the number of non-negated literals.
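
A small sketch of these constructions (my own code, not from the slides): linear threshold units with a bias weight on the constant input, checked exhaustively on 0/1 inputs.

```python
# Sketch: OR, AND, and AND-with-negations as single linear threshold units.
from itertools import product

def threshold_unit(weights, bias):
    """Return a function computing step(w . x + bias) on 0/1 inputs."""
    return lambda x: int(sum(w * xi for w, xi in zip(weights, x)) + bias > 0)

# OR of x_1, x_2, x_3 (all three inputs relevant): w_i = 1, bias = -0.5.
or_unit = threshold_unit([1, 1, 1], -0.5)
# AND of x_1, x_2, x_3: w_i = 1, bias = 0.5 - k with k = 3.
and_unit = threshold_unit([1, 1, 1], 0.5 - 3)
# x_1 AND (NOT x_2): w = [1, -1], bias = 0.5 - t with t = 1 non-negated literal.
and_not_unit = threshold_unit([1, -1], 0.5 - 1)

assert all(or_unit(x) == int(any(x)) for x in product([0, 1], repeat=3))
assert all(and_unit(x) == int(all(x)) for x in product([0, 1], repeat=3))
assert all(and_not_unit(x) == int(x[0] == 1 and x[1] == 0)
           for x in product([0, 1], repeat=2))
print("all threshold-unit constructions check out")
```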

General Boolean Functions
Every Boolean function can be represented by a network with a single hidden layer, though this might require an exponential number of hidden units.
Any Boolean function can be written as a truth table, e.g.:
000 | +, 001 | −, 010 | −, 011 | +, 100 | −, 101 | −, 110 | +, 111 | +
View this as an OR of ANDs, with one AND for each positive entry:
$$(\neg x_1 \wedge \neg x_2 \wedge \neg x_3) \vee (\neg x_1 \wedge x_2 \wedge x_3) \vee (x_1 \wedge x_2 \wedge \neg x_3) \vee (x_1 \wedge x_2 \wedge x_3).$$
Then combine the AND and OR networks into a two-layer network.
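
A sketch of the two-layer construction (my own code): one AND-with-negations threshold unit per positive truth-table row in the hidden layer, then a single OR threshold unit over the hidden outputs, checked against the truth table above.

```python
# Sketch: a two-layer threshold network (OR of ANDs) built from a truth table.
from itertools import product

def step(z):
    return int(z > 0)

# The truth table from the slide: rows (x1, x2, x3) labeled positive.
positive_rows = [(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 1, 1)]

def two_layer_net(x):
    # Hidden layer: for row r, weights are +1 where r_i = 1 and -1 where r_i = 0,
    # bias 0.5 - (# of non-negated literals); the unit fires only when x matches r exactly.
    hidden = [
        step(sum((1 if r_i else -1) * x_i for r_i, x_i in zip(r, x)) + 0.5 - sum(r))
        for r in positive_rows
    ]
    # Output layer: OR of the hidden units (weights 1, bias -0.5).
    return step(sum(hidden) - 0.5)

for x in product([0, 1], repeat=3):
    assert two_layer_net(x) == int(x in positive_rows)
print("two-layer network matches the truth table")
```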

Artificial Neural Networks: Summary
Highly non-linear regression/classification, with vector-valued inputs and outputs and potentially millions of parameters to estimate. Actively used to model distributed computation in the brain. Hidden layers learn intermediate representations. Trained by stochastic gradient descent, with attendant local minima problems. Overfitting, and how to deal with it.