Machine Learning Chapter 4. Artificial Neural Networks

Machine Learning Chapter 4. Artificial Neural Networks Tom M. Mitchell

Artificial Neural Networks Threshold units Gradient descent Multilayer networks Backpropagation Hidden layer representations Example: Face Recognition Advanced topics

Connectionist Models (1/2) Consider humans: Neuron switching time ~ 0.001 second Number of neurons ~ 10^10 Connections per neuron ~ 10^4–10^5 Scene recognition time ~ 0.1 second 100 inference steps doesn't seem like enough → much parallel computation

Connectionist Models (2/2) Properties of artificial neural nets (ANNs): Many neuron-like threshold switching units Many weighted interconnections among units Highly parallel, distributed processing Emphasis on tuning weights automatically

When to Consider Neural Networks Input is high-dimensional discrete or real-valued (e.g., raw sensor input) Output is discrete or real-valued Output is a vector of values Possibly noisy data Form of target function is unknown Human readability of result is unimportant Examples: Speech phoneme recognition [Waibel] Image classification [Kanade, Baluja, Rowley] Financial prediction

ALVINN drives 70 mph on highways

Perceptron o(x1, …, xn) = 1 if w0 + w1x1 + ··· + wnxn > 0, and −1 otherwise Sometimes we'll use simpler vector notation: o(x) = sgn(w · x), where sgn(y) = 1 if y > 0 and −1 otherwise, and x0 = 1

Decision Surface of a Perceptron Represents some useful functions What weights represent g(x1, x2) = AND(x1, x2)? But some functions are not representable, e.g., those that are not linearly separable (such as XOR) Therefore, we'll want networks of these...
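As an illustration of the AND question above, a minimal Python sketch (the specific weight values are one possible choice, not taken from the slides):

```python
import numpy as np

def perceptron_output(w, x):
    # o(x) = sgn(w . x); x[0] is the constant 1 that pairs with the bias weight w0
    return 1 if np.dot(w, x) > 0 else -1

# One choice of weights realizing g(x1, x2) = AND(x1, x2) for inputs in {0, 1}:
# with w0 = -1.5, w1 = w2 = 1.0, the weighted sum exceeds 0 only when x1 = x2 = 1.
w = np.array([-1.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), perceptron_output(w, np.array([1.0, x1, x2])))
```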

Perceptron training rule wi ← wi + Δwi where Δwi = η (t − o) xi Where: t = c(x) is target value o is perceptron output η is small constant (e.g., 0.1) called learning rate Can prove it will converge If training data is linearly separable and η sufficiently small
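A minimal sketch of the rule (the OR data, learning rate, and epoch count are illustrative):

```python
import numpy as np

def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    n = len(examples[0][0])
    w = np.zeros(n + 1)                      # w[0] is the bias weight w0
    for _ in range(epochs):
        for x, t in examples:
            x = np.concatenate(([1.0], x))   # prepend the constant input x0 = 1
            o = 1 if np.dot(w, x) > 0 else -1
            w += eta * (t - o) * x           # no change when the example is already correct
    return w

# Linearly separable data (OR, with targets in {-1, +1}), so the rule converges.
data = [(np.array([0.0, 0.0]), -1), (np.array([0.0, 1.0]), 1),
        (np.array([1.0, 0.0]), 1), (np.array([1.0, 1.0]), 1)]
print(train_perceptron(data))
```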

Gradient Descent (1/4) To understand, consider simpler linear unit, where o = w0 + w1x1 + ··· + wnxn Let's learn wi's that minimize the squared error E[w] ≡ ½ Σd∈D (td − od)^2 where D is the set of training examples

Gradient Descent (2/4) Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn] Training rule: Δw = −η ∇E[w], i.e., Δwi = −η ∂E/∂wi

Gradient Descent (3/4) ∂E/∂wi = ∂/∂wi ½ Σd (td − od)^2 = Σd (td − od) ∂/∂wi (td − w · xd) = Σd (td − od)(−xi,d)

Gradient Descent (4/4) Initialize each wi to some small random value Until the termination condition is met, Do Initialize each Δwi to zero For each <x, t> in training_examples, Do * Input the instance x to the unit and compute the output o * For each linear unit weight wi, Do Δwi ← Δwi + η (t − o) xi For each linear unit weight wi, Do wi ← wi + Δwi
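The same procedure as a minimal Python sketch (the linear target t = 1 + 2·x1 − 3·x2 and the learning rate are illustrative):

```python
import numpy as np

def gradient_descent(training_examples, eta=0.02, iterations=1000):
    """Batch gradient descent for a linear unit o = w . x (x[0] is the constant 1)."""
    n = len(training_examples[0][0])
    w = (np.random.rand(n) - 0.5) * 0.1          # small random initial weights
    for _ in range(iterations):
        delta_w = np.zeros(n)                    # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                     # linear unit output
            delta_w += eta * (t - o) * x         # accumulate eta * (t - o) * x_i
        w += delta_w                             # apply the summed update once per pass
    return w

# Illustrative data generated from t = 1 + 2*x1 - 3*x2.
X = [np.array([1.0, x1, x2]) for x1 in (0.0, 1.0, 2.0) for x2 in (0.0, 1.0, 2.0)]
examples = [(x, 1 + 2 * x[1] - 3 * x[2]) for x in X]
print(gradient_descent(examples))                # approaches [1, 2, -3]
```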

Summary Perceptron training rule guaranteed to succeed if Training examples are linearly separable Sufficiently small learning rate η Linear unit training rule uses gradient descent Guaranteed to converge to hypothesis with minimum squared error Given sufficiently small learning rate η Even when training data contains noise Even when training data not separable by H

Incremental (Stochastic) Gradient Descent (1/2) Batch mode Gradient Descent: Do until satisfied 1. Compute the gradient ∇ED[w] 2. w ← w − η ∇ED[w] Incremental mode Gradient Descent: For each training example d in D 1. Compute the gradient ∇Ed[w] 2. w ← w − η ∇Ed[w]

Incremental (Stochastic) Gradient Descent (2/2) ED[w] ≡ ½ Σd∈D (td − od)^2 Ed[w] ≡ ½ (td − od)^2 Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
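The incremental variant as a sketch (same illustrative data convention as the batch sketch above; the weights change after every example rather than once per pass):

```python
import numpy as np

def incremental_gradient_descent(training_examples, eta=0.02, epochs=500):
    """Stochastic (incremental) gradient descent for a linear unit: update after every example."""
    n = len(training_examples[0][0])
    w = np.zeros(n)
    for _ in range(epochs):
        for x, t in training_examples:
            o = np.dot(w, x)
            w += eta * (t - o) * x               # one step along -grad E_d[w] for this example
    return w

# Same illustrative t = 1 + 2*x1 - 3*x2 data as in the batch sketch above.
X = [np.array([1.0, x1, x2]) for x1 in (0.0, 1.0, 2.0) for x2 in (0.0, 1.0, 2.0)]
print(incremental_gradient_descent([(x, 1 + 2 * x[1] - 3 * x[2]) for x in X]))
```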

Multilayer Networks of Sigmoid Units

Sigmoid Unit σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^(−x)) Nice property: dσ(x)/dx = σ(x)(1 − σ(x)) We can derive gradient descent rules to train One sigmoid unit Multilayer networks of sigmoid units → Backpropagation
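A sketch of the unit and the derivative property it relies on:

```python
import numpy as np

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # the "nice property": d sigma(x)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

def sigmoid_unit_output(w, x):
    # o = sigma(w . x), with x[0] the constant 1 paired with the bias weight w0
    return sigmoid(np.dot(w, x))
```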

Error Gradient for a Sigmoid Unit ∂E/∂wi = ∂/∂wi ½ Σd (td − od)^2 = −Σd (td − od) ∂od/∂wi But we know: ∂od/∂netd = od(1 − od) and ∂netd/∂wi = xi,d So: ∂E/∂wi = −Σd (td − od) od(1 − od) xi,d

Backpropagation Algorithm Initialize all weights to small random numbers. Until satisfied, Do For each training example, Do 1. Input the training example to the network and compute the network outputs 2. For each output unit k: δk ← ok(1 − ok)(tk − ok) 3. For each hidden unit h: δh ← oh(1 − oh) Σk∈outputs wh,k δk 4. Update each network weight wi,j: wi,j ← wi,j + Δwi,j where Δwi,j = η δj xi,j
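A minimal sketch of the algorithm for one hidden layer of sigmoid units (the matrix layout, learning rate, and the 8 × 3 × 8 identity demo at the end are illustrative choices):

```python
import numpy as np

def backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=5000, seed=0):
    """Stochastic-gradient backpropagation for a two-layer network of sigmoid units.
    Each weight matrix has an extra bias column; examples are (input, target) vector pairs."""
    rng = np.random.default_rng(seed)
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))    # input -> hidden weights
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))   # hidden -> output weights
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(epochs):
        for x, t in examples:
            # 1. Forward pass: compute hidden and output activations
            x1 = np.concatenate(([1.0], x))                  # constant input for the bias
            h = sigmoid(W_h @ x1)
            h1 = np.concatenate(([1.0], h))
            o = sigmoid(W_o @ h1)
            # 2. delta_k <- o_k (1 - o_k) (t_k - o_k) for each output unit k
            delta_o = o * (1.0 - o) * (t - o)
            # 3. delta_h <- o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit h
            delta_h = h * (1.0 - h) * (W_o[:, 1:].T @ delta_o)
            # 4. w_{i,j} <- w_{i,j} + eta * delta_j * x_{i,j}
            W_o += eta * np.outer(delta_o, h1)
            W_h += eta * np.outer(delta_h, x1)
    return W_h, W_o

# Illustrative use: the 8 x 3 x 8 identity-learning task discussed a few slides below.
identity = [(np.eye(8)[i], np.eye(8)[i]) for i in range(8)]
W_h, W_o = backprop(identity, n_in=8, n_hidden=3, n_out=8)
```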

More on Backpropagation Gradient descent over entire network weight vector Easily generalized to arbitrary directed graphs Will find a local, not necessarily global error minimum In practice, often works well (can run multiple times) Often include weight momentum α: Δwi,j(n) = η δj xi,j + α Δwi,j(n − 1) Minimizes error over training examples Will it generalize well to subsequent examples? Training can take thousands of iterations → slow! Using network after training is very fast
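The momentum term changes only the weight-update step; a sketch (α ≈ 0.9 is a common choice, not prescribed by the slide, and grad_term is an illustrative name):

```python
import numpy as np

def momentum_update(w, grad_term, prev_delta, eta=0.3, alpha=0.9):
    """Delta w(n) = eta * grad_term + alpha * Delta w(n-1); returns (new weights, Delta w(n)).
    Here grad_term stands for the delta_j * x_{i,j} quantity from the backpropagation step."""
    delta_w = eta * grad_term + alpha * prev_delta
    return w + delta_w, delta_w
```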

Learning Hidden Layer Representations (1/2) A target function: the identity mapping over eight one-hot input vectors (10000000 → 10000000, 01000000 → 01000000, …, 00000001 → 00000001), to be learned by an 8 × 3 × 8 network Can this be learned??

Learning Hidden Layer Representations (2/2) A network: 8 inputs fully connected to 3 hidden units, fully connected to 8 outputs Learned hidden layer representation: each of the eight inputs maps to a distinct, approximately binary pattern of the three hidden-unit activations, i.e., the network invents a 3-bit encoding of its inputs

Training (1/3)–(3/3): plots of the 8 × 3 × 8 network during learning (sum of squared errors for each output unit, the hidden-unit encoding of one input, and the weights into one hidden unit, each against the number of training iterations)

Convergence of Backpropagation Gradient descent to some local minimum Perhaps not global minimum... Add momentum Stochastic gradient descent Train multiple nets with different initial weights Nature of convergence Initialize weights near zero Therefore, initial networks near-linear Increasingly non-linear functions possible as training progresses

Expressive Capabilities of ANNs Boolean functions: Every Boolean function can be represented by a network with a single hidden layer, but it might require exponentially many (in the number of inputs) hidden units Continuous functions: Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989] Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Overfitting in ANNs (1/2), (2/2): plots of error versus number of weight updates, contrasting error on the training set with error on a separate validation set
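One common remedy suggested by such plots is to hold out a validation set and keep the weights from the point of lowest validation error; a minimal sketch, where train_step and validation_error are assumed helper functions (both hypothetical, not from the slides):

```python
import copy

def train_with_early_stopping(weights, train_step, validation_error, max_epochs=1000):
    """Run training but keep the weights that achieve the lowest validation-set error."""
    best_w, best_err = copy.deepcopy(weights), float("inf")
    for _ in range(max_epochs):
        weights = train_step(weights)        # e.g. one backpropagation pass over the training set
        err = validation_error(weights)      # squared error measured on the held-out validation set
        if err < best_err:
            best_w, best_err = copy.deepcopy(weights), err
    return best_w, best_err
```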

Neural Nets for Face Recognition 90% accurate at learning head pose and recognizing 1 of 20 faces

Learned Hidden Unit Weights http://www.cs.cmu.edu/tom/faces.html

Alternative Error Functions Penalize large weights: E(w) ≡ ½ Σd∈D Σk∈outputs (tkd − okd)^2 + γ Σi,j wji^2 Train on target slopes as well as values Tie together weights: e.g., in phoneme recognition network
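For the first variant (penalizing large weights), a sketch of how the weight update changes, assuming a penalty term of the form γ Σ w^2 added to the error (γ and the name grad_term are illustrative):

```python
import numpy as np

def weight_decay_update(w, grad_term, eta=0.3, gamma=1e-4):
    """Weight update when the error includes an added penalty gamma * sum_i w_i^2.
    grad_term is the usual delta_j * x_{i,j} quantity; the extra -2*eta*gamma*w term
    shrinks every weight slightly on each step, discouraging large weights."""
    return w + eta * grad_term - 2.0 * eta * gamma * w
```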

Recurrent Networks (figure: (a) feedforward network, (b) recurrent network, (c) recurrent network unfolded in time)
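A sketch of the hidden-state feedback that distinguishes (b) from (a); the weight names and shapes are illustrative:

```python
import numpy as np

def recurrent_forward(xs, W_in, W_rec, W_out):
    """Run a single recurrent layer of sigmoid units over a sequence xs of input vectors.
    The hidden activation at time t is fed back as extra input at time t + 1; unrolling
    this loop over the sequence yields the equivalent feedforward network of panel (c)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    h = np.zeros(W_rec.shape[0])                 # initial hidden state (context units)
    outputs = []
    for x in xs:
        h = sigmoid(W_in @ x + W_rec @ h)        # new state depends on input and previous state
        outputs.append(sigmoid(W_out @ h))
    return outputs
```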