CS 4700: Foundations of Artificial Intelligence

CS 4700: Foundations of Artificial Intelligence
Carla P. Gomes (gomes@cs.cornell.edu)
Module: Neural Networks, Multi-Layer Perceptron (Reading: Chapter 20.5)

Perceptron: Linearly Separable Functions
[Figure: the XOR function plotted in the (x1, x2) plane; the positive and negative points cannot be separated by a single line.]
Minsky & Papert (1969): perceptrons can only represent linearly separable functions.
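The limitation can be seen concretely in a minimal sketch (my own illustration, not from the slides): running the perceptron learning rule on the XOR truth table with a threshold unit never reaches zero training error; the learning rate and epoch count below are arbitrary choices.

```python
# Perceptron learning on XOR never converges, because XOR is not linearly separable.

def step(z):
    return 1 if z > 0 else 0

# XOR truth table; a constant 1 input simulates the threshold (bias).
data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

w = [0.0, 0.0, 0.0]
alpha = 0.1
for epoch in range(1000):
    errors = 0
    for x, y in data:
        o = step(sum(wi * xi for wi, xi in zip(w, x)))
        if o != y:
            errors += 1
            w = [wi + alpha * (y - o) * xi for wi, xi in zip(w, x)]
    if errors == 0:
        break

print(errors)   # stays above 0: no single threshold unit can represent XOR
```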

Good news: adding a hidden layer allows more target functions to be represented (addressing the limitation pointed out by Minsky & Papert, 1969).

Hidden Units
Hidden units are nodes situated between the input nodes and the output nodes. They allow a network to learn non-linear functions and to represent combinations of the input features.

Boolean Functions
A perceptron can represent the following Boolean functions: AND, OR, NOT, NAND (NOT AND), NOR (NOT OR), and any m-of-n function.
Every Boolean function can be represented by a network of interconnected units built from these primitives. Two levels (i.e., one hidden layer) is enough!
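A minimal sketch of these primitives as single threshold units, with hand-chosen (illustrative) weights and thresholds; each unit fires iff the weighted sum of its inputs exceeds its threshold.

```python
def unit(weights, threshold, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def AND(x1, x2):  return unit([1, 1], 1.5, [x1, x2])    # fires only if both inputs are 1
def OR(x1, x2):   return unit([1, 1], 0.5, [x1, x2])    # fires if at least one input is 1
def NOT(x1):      return unit([-1], -0.5, [x1])         # fires only if the input is 0
def M_OF_N(xs, m):                                      # fires if at least m of the n inputs are 1
    return unit([1] * len(xs), m - 0.5, xs)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
```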

Boolean XOR
XOR is not linearly separable, so it cannot be represented by a single-layer perceptron. Note that x1 XOR x2 = (x1 OR x2) AND NOT (x1 AND x2).
Consider a single-hidden-layer network built from threshold units, where each unit fires when w1 x1 + w2 x2 - w0 > 0:
hidden unit h1 computes x1 OR x2 (weights 1, 1; threshold 0.5);
hidden unit h2 computes x1 AND x2 (weights 1, 1; threshold 1.5);
the output unit o combines h1 and h2 with weights 1 and -1 and threshold 0.5, giving h1 AND NOT h2, i.e., x1 XOR x2.
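A minimal sketch of this construction, using exactly the weights and thresholds just described.

```python
def unit(weights, threshold, inputs):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

def xor_net(x1, x2):
    h1 = unit([1, 1], 0.5, [x1, x2])     # OR
    h2 = unit([1, 1], 1.5, [x1, x2])     # AND
    return unit([1, -1], 0.5, [h1, h2])  # h1 AND NOT h2 = XOR

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))        # output column reads 0, 1, 1, 0
```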

Expressiveness of MLP: Soft Threshold
The advantage of adding hidden layers is that it enlarges the space of hypotheses the network can represent. Example: we can think of each hidden unit as a perceptron that represents a soft threshold function in the input space, and of an output unit as a soft-thresholded linear combination of several such functions.
[Figure: the soft threshold function.]
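A minimal sketch of the soft threshold, using the standard logistic sigmoid; the sample points below are illustrative.

```python
import math

def hard_threshold(z):
    return 1 if z > 0 else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-4, -1, 0, 1, 4):
    print(z, hard_threshold(z), round(sigmoid(z), 3))
# sigmoid(z) approaches 0 for large negative z and 1 for large positive z,
# but unlike the hard threshold it is differentiable everywhere.
```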

Expressiveness of MLP
[Figure: (a) the result of combining two opposite-facing soft threshold functions to produce a ridge; (b) the result of combining two ridges to produce a bump.]
By adding bumps of various sizes and locations, any surface can be approximated: all continuous functions with 2 layers, all functions with 3 layers.
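A minimal sketch of this construction, with illustrative weights of my own choosing: two opposite-facing sigmoids form a ridge in one input, and a soft-thresholded sum of two perpendicular ridges forms a bump.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ridge(x, center=0.0, width=1.0, sharpness=10.0):
    # two opposite-facing soft thresholds: high only when x is within about +/- width of center
    return (sigmoid(sharpness * (x - (center - width)))
            - sigmoid(sharpness * (x - (center + width))))

def bump(x1, x2):
    # soft-thresholded sum of a ridge in x1 and a ridge in x2: high only near (0, 0)
    return sigmoid(10.0 * (ridge(x1) + ridge(x2) - 1.5))

print(round(bump(0.0, 0.0), 3))   # inside the bump: close to 1
print(round(bump(3.0, 0.0), 3))   # on only one ridge: close to 0
print(round(bump(3.0, 3.0), 3))   # outside both ridges: close to 0
```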

Expressiveness of MLP
With a single, sufficiently large hidden layer it is possible to represent any continuous function of the inputs with arbitrary accuracy; with two layers, even discontinuous functions can be represented. The proof is complex; the main point is that the required number of hidden units grows exponentially with the number of inputs. For example, 2^n/n hidden units are needed to encode all Boolean functions of n inputs.
Issue: for any particular network structure, it is harder to characterize exactly which functions can be represented and which ones cannot.

Multi-Layer Feedforward Networks
[Figure: a feedforward network mapping inputs x1 ... xN to outputs o1 ... oO.]
Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs.
Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with a single hidden layer [Cybenko 1989; Hornik et al. 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].
Note: the threshold can be simulated by an additional input x that is always 1.

The Restaurant Example
Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
Alternate: is there an alternative restaurant nearby?
Bar: is there a comfortable bar area to wait in?
Fri/Sat: is today Friday or Saturday?
Hungry: are we hungry?
Patrons: number of people in the restaurant (None, Some, Full)
Price: price range ($, $$, $$$)
Raining: is it raining outside?
Reservation: have we made a reservation?
Type: kind of restaurant (French, Italian, Thai, Burger)
WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)
Goal predicate: WillWait?

Multi-Layer Feedforward Neural Networks, or Multi-Layer Perceptrons (MLP)
[Figure: a multilayer neural network with one hidden layer and 10 inputs, suitable for the restaurant problem.]
Layers are often fully connected (but not always). The number of hidden units is typically chosen by hand.
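A minimal sketch of one forward pass through such a network (sigmoid units; the 10 inputs and 4 hidden units are assumed to match the restaurant figure, and the random weights and 0/1 input encoding are illustrative).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

n_inputs, n_hidden, n_outputs = 10, 4, 1

# weights[j] holds the incoming weights of unit j in the next layer;
# index 0 of each row is the bias weight, fed by a constant input of 1.
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
w_output = [[random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_outputs)]

def forward(x):
    x = [1.0] + x                                     # bias input
    h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w_hidden]
    h = [1.0] + h                                     # bias for the output layer
    return [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_output]

example = [random.choice([0.0, 1.0]) for _ in range(n_inputs)]  # a 0/1 encoding of the attributes
print(forward(example))                                         # untrained output for one example
```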

Learning Algorithm for MLP

Learning Algorithms for MLP
Goal: minimize the sum of squared errors, where Err_i = y_i - o_i for each output unit i.
The network output is a parameterized function of the inputs: the weights are the parameters of the function. The error at the output layer is clear. (Note: the threshold can be simulated by an additional input that is always 1.)
How do we compute the errors for the hidden units? We can back-propagate the error from the output layer to the hidden layers. The back-propagation process emerges directly from a derivation of the overall error gradient.
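A minimal sketch of the objective, summed over examples and output units; the sample numbers are illustrative.

```python
def sum_squared_error(targets, outputs):
    # targets, outputs: lists of output vectors, one vector per example
    return 0.5 * sum(
        (y_i - o_i) ** 2
        for y, o in zip(targets, outputs)
        for y_i, o_i in zip(y, o)
    )

print(sum_squared_error([[1.0, 0.0]], [[0.8, 0.3]]))  # 0.5 * (0.04 + 0.09) = 0.065
```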

Backpropagation Learning Algorithm for MLP
Output layer: the weight update is similar to the perceptron update. With Err_i = y_i - o_i and Δ_i = Err_i g'(in_i), the weight from hidden node j to output node i is updated as W_{j,i} ← W_{j,i} + α a_j Δ_i. (Note: the threshold can be simulated by an additional input that is always 1.)
Hidden layer: hidden node j is "responsible" for some fraction of the error Δ_i in each of the output nodes to which it connects, depending on the strength of the connection between hidden node j and output node i. The "error" for hidden node j is Δ_j = g'(in_j) Σ_i W_{j,i} Δ_i, and the weight from input k to hidden node j is updated as W_{k,j} ← W_{k,j} + α a_k Δ_j.
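A minimal sketch of these update rules, assuming sigmoid units so that g'(in) can be written in terms of the unit's own output.

```python
def output_delta(y_i, o_i):
    # Delta_i = Err_i * g'(in_i); for a sigmoid, g'(in_i) = o_i * (1 - o_i)
    return (y_i - o_i) * o_i * (1.0 - o_i)

def hidden_delta(a_j, w_j_to_outputs, output_deltas):
    # Delta_j = g'(in_j) * sum_i W_ji * Delta_i
    return a_j * (1.0 - a_j) * sum(w * d for w, d in zip(w_j_to_outputs, output_deltas))

def updated_weight(w, alpha, activation, delta):
    # W <- W + alpha * a * Delta   (same form for both layers)
    return w + alpha * activation * delta
```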

Backpropagation Training (Overview)
Optimization problem. Objective: minimize the error E (usually the sum-squared error over the training instances). Variables: the network weights w_ij. Algorithm: local search via gradient descent. Randomly initialize the weights; until performance is satisfactory, cycle through the examples (epochs), updating each weight in the direction of the negative error gradient. Practical issues: the choice of learning rate α, and how many restarts of the search are needed to find a good optimum of the objective function (gradient descent can get stuck in local optima). The backprop procedure is the most-used method.
Notes: for now, treat the thresholds as fixed rather than learned. Backprop amounts to hill-climbing through weight space: the state is the set of weights, and the evaluation function measures how well the net handles the training instances (instances for which the actual output is known). A plain hill-climbing search would change each weight individually, see which change helps most, and make that change permanent; backpropagation instead varies all of the weights simultaneously, in proportion to how much good each individual change does, and is a relatively efficient way to compute how performance improves with individual weight changes. The idea is to make a large change to a weight if the change leads to a large reduction in the errors observed at the output nodes; the error is propagated backwards to adapt the weights of the output and hidden nodes. In batch mode, present all training instances, keep track of the weight changes, and apply them only after all the input has been seen. (See derivation details in R&N, pages 745-747.)

Learning Algorithms for MLP
Similar to the perceptron learning algorithm. One minor difference is that we may have several outputs, so we have an output vector h_W(x) rather than a single value, and each example has an output vector y. The major difference is that, whereas the error y - h_W(x) at the perceptron output layer is clear, the error at the hidden layers seems mysterious because the training data does not say what values the hidden nodes should have. We can back-propagate the error from the output layer to the hidden layers; the back-propagation process emerges directly from a derivation of the overall error gradient.

Back-Propagation Learning
Perceptron-style update at the output layer, where Err_i is the i-th component of the vector y - h_W(x).
Hidden node j is "responsible" for some fraction of the error Δ_i in each of the output nodes to which it connects, depending on the strength of the connection between hidden node j and output node i.

Back-Propagation Learning (Derivation)

Back-Propagation Learning (Derivation)
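The derivation itself appears only on hidden slides; the following is the standard derivation for a single example (following R&N, pp. 745-747), consistent with the update rules above.

```latex
E = \tfrac{1}{2}\sum_i (y_i - a_i)^2 , \qquad a_i = g(in_i), \qquad in_i = \sum_j W_{j,i}\, a_j .

% Output-layer weights:
\frac{\partial E}{\partial W_{j,i}}
  = -(y_i - a_i)\, g'(in_i)\, a_j
  = -a_j \Delta_i ,
\qquad \Delta_i = (y_i - a_i)\, g'(in_i).

% Hidden-layer weights (a_j = g(in_j), \; in_j = \sum_k W_{k,j} a_k):
\frac{\partial E}{\partial W_{k,j}}
  = -\Big(\sum_i W_{j,i}\, \Delta_i\Big)\, g'(in_j)\, a_k
  = -a_k \Delta_j ,
\qquad \Delta_j = g'(in_j) \sum_i W_{j,i}\, \Delta_i .

% Gradient-descent update with learning rate \alpha:
W_{j,i} \leftarrow W_{j,i} + \alpha\, a_j\, \Delta_i ,
\qquad
W_{k,j} \leftarrow W_{k,j} + \alpha\, a_k\, \Delta_j .
```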

Back-Propagation Learning Algorithm
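A minimal sketch of the complete training loop, combining the forward pass and the delta rules above (sigmoid units, per-example updates; the network size, learning rate, and epoch count are illustrative choices, not the textbook's algorithm box).

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_in, n_out):
    # n_in + 1 weights per unit: index 0 is the bias weight (constant input 1)
    return [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_out(weights, inputs):
    inputs = [1.0] + inputs
    return [sigmoid(sum(w * v for w, v in zip(row, inputs))) for row in weights]

def train(examples, n_hidden=4, alpha=0.5, epochs=5000):
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    w_h, w_o = make_layer(n_in, n_hidden), make_layer(n_hidden, n_out)
    for _ in range(epochs):
        for x, y in examples:
            h = layer_out(w_h, x)                     # forward pass
            o = layer_out(w_o, h)
            d_o = [(y_i - o_i) * o_i * (1 - o_i) for y_i, o_i in zip(y, o)]
            d_h = [h_j * (1 - h_j) * sum(w_o[i][j + 1] * d_o[i] for i in range(n_out))
                   for j, h_j in enumerate(h)]        # back-propagate the error
            for i in range(n_out):                    # update output-layer weights
                for j, a in enumerate([1.0] + h):
                    w_o[i][j] += alpha * a * d_o[i]
            for j in range(n_hidden):                 # update hidden-layer weights
                for k, a in enumerate([1.0] + x):
                    w_h[j][k] += alpha * a * d_h[j]
    return w_h, w_o

# Example: learn XOR, which a single-layer perceptron cannot represent.
# Training can occasionally get stuck in a local minimum; rerun with a new random init if so.
xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
w_h, w_o = train(xor)
for x, _ in xor:
    print(x, layer_out(w_o, layer_out(w_h, x)))
```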

[Figure: (a) training curve showing the gradual reduction in error as the weights are modified over several epochs, for a given set of examples in the restaurant domain (network with 4 hidden units); (b) comparative learning curves showing that decision-tree learning does slightly better than back-propagation in a multilayer network.]
Typical problems: slow convergence, local minima.

Design Decisions
Network architecture: how many hidden layers, and how many hidden units per layer? Given too many hidden units, a neural net will simply memorize the input patterns (overfitting); given too few, the network may not be able to represent all of the necessary generalizations (underfitting). How should the units be connected? (Fully? Partially? Using domain knowledge?)

Learning Neural Network Structures
Fully connected networks: how many layers, and how many hidden units? Use cross-validation to choose the structure with the highest prediction accuracy on the validation sets.
Not fully connected networks: search for the right topology (a large space). Optimal brain damage: start with a fully connected network and try removing connections from it. Tiling: an algorithm for growing a network starting with a single unit.
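A minimal sketch of the cross-validation step for choosing the number of hidden units; train_fn and accuracy_fn are hypothetical callables supplied by the caller, and the candidate sizes and fold count are illustrative.

```python
def choose_hidden_units(examples, train_fn, accuracy_fn,
                        candidate_sizes=(2, 4, 8, 16), k=5):
    folds = [examples[i::k] for i in range(k)]
    best_size, best_acc = None, -1.0
    for n_hidden in candidate_sizes:
        accs = []
        for i in range(k):
            validation = folds[i]
            training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            net = train_fn(training, n_hidden)          # assumed: trains a net of this size
            accs.append(accuracy_fn(net, validation))   # assumed: accuracy on the validation fold
        mean_acc = sum(accs) / k
        if mean_acc > best_acc:
            best_size, best_acc = n_hidden, mean_acc
    return best_size    # the structure with the highest validation accuracy
```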

How long should you train the net?
The goal is to achieve a balance between correct responses for the training patterns and correct responses for new patterns (that is, a balance between memorization and generalization). If you train the net for too long, you run the risk of overfitting. Select the number of training iterations via cross-validation on a holdout set.
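A minimal sketch of selecting the number of training iterations with a holdout set (early stopping); train_one_epoch and validation_error are hypothetical callables supplied by the caller, and the patience value is an illustrative choice.

```python
def train_with_early_stopping(net, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(net)                     # assumed: one pass of weight updates on the training set
        val_err = validation_error(net)          # assumed: error on the held-out patterns
        if val_err < best_err:
            best_err, best_epoch, since_best = val_err, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:           # no improvement: stop before overfitting
                break
    return best_epoch
```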

Multi-Layer Networks: Expressiveness vs. Computational Complexity
Multi-layer networks are very expressive: they can represent general non-linear functions! But in general they are hard to train, due to the abundance of local minima and the high dimensionality of the search space; the resulting hypotheses also cannot be understood easily.

Summary
Perceptrons (one-layer networks) have limited expressive power: they can learn only linear decision boundaries in the input space. Single-layer networks have a simple and efficient learning algorithm. Multi-layer networks are sufficiently expressive (they can represent general nonlinear functions) and can be trained by gradient descent, i.e., error back-propagation; they are harder to train because of the abundance of local minima and the high dimensionality of the weight space. Many applications: speech, driving, handwriting, fraud detection, etc.