Feed-Forward Neural Networks

Slides:

Advertisements

Similar presentations

Multi-Layer Perceptron (MLP)

Advertisements

Perceptron Lecture 4.

Introduction to Artificial Neural Networks

1 Machine Learning: Lecture 4 Artificial Neural Networks (Based on Chapter 4 of Mitchell T.., Machine Learning, 1997)

Introduction to Neural Networks Computing

INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.

NEURAL NETWORKS Perceptron

Neural Network I Week 7 1. Team Homework Assignment #9 Read pp. 327 – 334 and the Week 7 slide. Design a neural network for XOR (Exclusive OR) Explore.

Artificial Neural Networks

Linear Discriminant Functions

Simple Neural Nets For Pattern Classification

INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.

Prénom Nom Document Analysis: Artificial Neural Networks Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Connectionist models. Connectionist Models Motivated by Brain rather than Mind –A large number of very simple processing elements –A large number of weighted.

20.5 Nerual Networks Thanks: Professors Frank Hoffmann and Jiawei Han, and Russell and Norvig.

Rutgers CS440, Fall 2003 Neural networks Reading: Ch. 20, Sec. 5, AIMA 2 nd Ed.

Prénom Nom Document Analysis: Artificial Neural Networks Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.

Introduction to Neural Networks John Paxton Montana State University Summer 2003.

Artificial Neural Networks

Before we start ADALINE

Data Mining with Neural Networks (HK: Chapter 7.5)

CHAPTER 11 Back-Propagation Ming-Feng Yeh.

September 28, 2010Neural Networks Lecture 7: Perceptron Modifications 1 Adaline Schematic Adjust weights i1i1i1i1 i2i2i2i2 inininin …  w 0 + w 1 i 1 +

NEURAL NETWORKS Introduction

Neural Networks. Background - Neural Networks can be : Biological - Biological models Artificial - Artificial models - Desire to produce artificial systems.

Dr. Hala Moushir Ebied Faculty of Computers & Information Sciences

1 Introduction to Artificial Neural Networks Andrew L. Nelson Visiting Research Faculty University of South Florida.

Some more Artificial Intelligence

Neural Networks. Plan Perceptron  Linear discriminant Associative memories  Hopfield networks  Chaotic networks Multilayer perceptron  Backpropagation.

Artificial Neural Networks

Computer Science and Engineering

Neural NetworksNN 11 Neural netwoks thanks to: Basics of neural network theory and practice for supervised and unsupervised.

Introduction to Neural Networks Debrup Chakraborty Pattern Recognition and Machine Learning 2006.

2101INT – Principles of Intelligent Systems Lecture 10.

Chapter 3 Neural Network Xiu-jun GONG (Ph. D) School of Computer Science and Technology, Tianjin University

1 Machine Learning The Perceptron. 2 Heuristic Search Knowledge Based Systems (KBS) Genetic Algorithms (GAs)

Lecture 3 Introduction to Neural Networks and Fuzzy Logic President UniversityErwin SitompulNNFL 3/1 Dr.-Ing. Erwin Sitompul President University

LINEAR CLASSIFICATION. Biological inspirations  Some numbers…  The human brain contains about 10 billion nerve cells ( neurons )  Each neuron is connected.

Artificial Intelligence Techniques Multilayer Perceptrons.

Artificial Neural Networks. The Brain How do brains work? How do human brains differ from that of other animals? Can we base models of artificial intelligence.

Artificial Neural Networks An Introduction. What is a Neural Network? A human Brain A porpoise brain The brain in a living creature A computer program.

Feed-Forward Neural Networks 主講人 : 虞台文. Content Introduction Single-Layer Perceptron Networks Learning Rules for Single-Layer Perceptron Networks – Perceptron.

METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Part 8: Neural Networks.

Artificial Intelligence Chapter 3 Neural Networks Artificial Intelligence Chapter 3 Neural Networks Biointelligence Lab School of Computer Sci. & Eng.

Neural Network Basics Anns are analytical systems that address problems whose solutions have not been explicitly formulated Structure in which multiple.

Back-Propagation Algorithm AN INTRODUCTION TO LEARNING INTERNAL REPRESENTATIONS BY ERROR PROPAGATION Presented by: Kunal Parmar UHID:

Chapter 2 Single Layer Feedforward Networks

SUPERVISED LEARNING NETWORK

Neural Networks Presented by M. Abbasi Course lecturer: Dr.Tohidkhah.

Neural Networks Teacher: Elena Marchiori R4.47 Assistant: Kees Jong S2.22

Supervised learning network G.Anuradha. Learning objectives The basic networks in supervised learning Perceptron networks better than Hebb rule Single.

Artificial Neural Networks Chapter 4 Perceptron Gradient Descent Multilayer Networks Backpropagation Algorithm 1.

IE 585 History of Neural Networks & Introduction to Simple Learning Rules.

Perceptrons Michael J. Watts

Artificial Intelligence CIS 342 The College of Saint Rose David Goldschmidt, Ph.D.

Chapter 6 Neural Network.

Neural NetworksNN 21 Architecture We consider the architecture: feedforward NN with one layer It is sufficient to study single layer perceptrons with.

Neural Networks. Background - Neural Networks can be : Biological - Biological models Artificial - Artificial models - Desire to produce artificial systems.

Lecture 2 Introduction to Neural Networks and Fuzzy Logic President UniversityErwin SitompulNNFL 2/1 Dr.-Ing. Erwin Sitompul President University

Data Mining: Concepts and Techniques1 Prediction Prediction vs. classification Classification predicts categorical class label Prediction predicts continuous-valued.

Pattern Recognition Lecture 20: Neural Networks 3 Dr. Richard Spillman Pacific Lutheran University.

第 3 章神经网络.

with Daniel L. Silver, Ph.D. Christian Frey, BBA April 11-12, 2017

Artificial Intelligence Chapter 3 Neural Networks

Artificial Intelligence Chapter 3 Neural Networks

Artificial Intelligence Chapter 3 Neural Networks

Artificial Intelligence Chapter 3 Neural Networks

Artificial Intelligence Chapter 3 Neural Networks

Presentation transcript:

Feed-Forward Neural Networks 主講人: 虞台文

Content Introduction Single-Layer Perceptron Networks Learning Rules for Single-Layer Perceptron Networks Perceptron Learning Rule Adaline Leaning Rule -Leaning Rule Multilayer Perceptron Back Propagation Learning algorithm

Feed-Forward Neural Networks Introduction

Historical Background 1943 McCulloch and Pitts proposed the first computational models of neuron. 1949 Hebb proposed the first learning rule. 1958 Rosenblatt’s work in perceptrons. 1969 Minsky and Papert’s exposed limitation of the theory. 1970s Decade of dormancy for neural networks. 1980-90s Neural network return (self-organization, back-propagation algorithms, etc)

Nervous Systems Human brain contains ~ 1011 neurons. Each neuron is connected ~ 104 others. Some scientists compared the brain with a “complex, nonlinear, parallel computer”. The largest modern neural networks achieve the complexity comparable to a nervous system of a fly.

Neurons The main purpose of neurons is to receive, analyze and transmit further the information in a form of signals (electric pulses). When a neuron sends the information we say that a neuron “fires”.

Neurons Acting through specialized projections known as dendrites and axons, neurons carry information throughout the neural network. This animation demonstrates the firing of a synapse between the pre-synaptic terminal of one neuron to the soma (cell body) of another neuron.

A Model of Artificial Neuron x1 x2 xm= 1 wi1 wi2 wim =i . f (.) a (.)  yi bias

A Model of Artificial Neuron x1 x2 xm= 1 wi1 wi2 wim =i . f (.) a (.)  yi bias

Feed-Forward Neural Networks Graph representation: nodes: neurons arrows: signal flow directions A neural network that does not contain cycles (feedback loops) is called a feed–forward network (or perceptron). . . . x1 x2 xm y1 y2 yn

Layered Structure Output Layer Hidden Layer(s) Input Layer x1 x2 xm y1 . . . x1 x2 xm y1 y2 yn Output Layer Hidden Layer(s) Input Layer

Knowledge and Memory x1 x2 xm y1 y2 yn The output behavior of a network is determined by the weights. Weights  the memory of an NN. Knowledge  distributed across the network. Large number of nodes increases the storage “capacity”; ensures that the knowledge is robust; fault tolerance. Store new information by changing weights. . . . x1 x2 xm y1 y2 yn

Pattern Classification output pattern y Function: x  y The NN’s output is used to distinguish between and recognize different input patterns. Different output patterns correspond to particular classes of input patterns. Networks with hidden layers can be used for solving more complex problems then just a linear pattern classification. . . . x1 x2 xm y1 y2 yn input pattern x

Training Goal: Training Set yi1 yi2 yin di1 di2 din xi1 xi2 xim . . . . . . . . . Goal: . . . . . . xi1 xi2 xim

Generalization x1 x2 xm y1 y2 yn By properly training a neural network may produce reasonable answers for input patterns not seen during training (generalization). Generalization is particularly useful for the analysis of a “noisy” data (e.g. time–series). . . . x1 x2 xm y1 y2 yn

Generalization x1 x2 xm y1 y2 yn By properly training a neural network may produce reasonable answers for input patterns not seen during training (generalization). Generalization is particularly useful for the analysis of a “noisy” data (e.g. time–series). . . . x1 x2 xm y1 y2 yn without noise with noise

Applications Pattern classification Object recognition Function approximation Data compression Time series analysis and forecast . . .

Feed-Forward Neural Networks Single-Layer Perceptron Networks

The Single-Layered Perceptron . . . x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2

Training a Single-Layered Perceptron Training Set . . . x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2 Goal: d1 d2 dn

Learning Rules . . . x1 x2 xm= 1 y1 y2 yn xm-1 d1 d2 dn Training Set Linear Threshold Units (LTUs) : Perceptron Learning Rule Linearly Graded Units (LGUs) : Widrow-Hoff learning Rule Training Set . . . x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2 Goal: d1 d2 dn

Feed-Forward Neural Networks Learning Rules for Single-Layered Perceptron Networks Perceptron Learning Rule Adline Leaning Rule -Learning Rule

 Perceptron Linear Threshold Unit x1 wi1 x2 sgn wi2 wim =i xm= 1 . +1 1 sgn

 Perceptron Linear Threshold Unit Goal: x1 wi1 x2 sgn wi2 wim =i xm= 1 wi1 wi2 wim =i .  +1 1 sgn

Example Goal: x1 x2 x3= 1 y Class 1 (+1) Class 2 (1) Class 1 2 1 g(x) = 2x1 +x2+2=0 Class 2

Augmented input vector Goal: Augmented input vector Class 1 (+1) Class 2 (1) x1 x2 x3= 1 2 1 y Class 1 (+1) Class 2 (1)

Augmented input vector Goal: Augmented input vector x1 x2 x3 (0,0,0) Class 1 (1, 2, 1) (1.5, 1, 1) (1,0, 1) x1 x2 x3= 1 2 1 y Class 2 (1, 2, 1) (2.5, 1, 1) (2,0, 1)

Augmented input vector Goal: Augmented input vector A plane passes through the origin in the augmented input space. x1 x2 x3 (0,0,0) Class 1 (1, 2, 1) (1.5, 1, 1) (1,0, 1) x1 x2 x3= 1 2 1 y Class 2 (1, 2, 1) (2.5, 1, 1) (2,0, 1)

Linearly Separable vs. Linearly Non-Separable AND OR XOR 1 1 1 Linearly Separable Linearly Separable Linearly Non-Separable

Goal Given training sets T1C1 and T2  C2 with elements in form of x=(x1, x2 , ... , xm-1 , xm) T , where x1, x2 , ... , xm-1 R and xm= 1. Assume T1 and T2 are linearly separable. Find w=(w1, w2 , ... , wm) T such that

wTx = 0 is a hyperplain passes through the origin of augmented input space. Goal Given training sets T1C1 and T2  C2 with elements in form of x=(x1, x2 , ... , xm-1 , xm) T , where x1, x2 , ... , xm-1 R and xm= 1. Assume T1 and T2 are linearly separable. Find w=(w1, w2 , ... , wm) T such that

Observation x1 x2 Which w’s correctly classify x? w1 x w2 w6 w3 w4 w5 +  d = +1 d = 1 Observation x1 x2 Which w’s correctly classify x? + w1 x w2 w6 w3 w4 w5 What trick can be used?

+  d = +1 d = 1 Observation x1 x2 Is this w ok? + w x w1x1 +w2x2 = 0

+  d = +1 d = 1 Observation x1 x2 w1x1 +w2x2 = 0 Is this w ok? + x w

Observation x2 Is this w ok? x How to adjust w? ? x1 w ? w = ? d = +1  d = +1 d = 1 Observation x1 x2 w1x1 +w2x2 = 0 Is this w ok? + x How to adjust w? ? w ? w = ?

Observation x2 Is this w ok? x How to adjust w? x1 w w = x d = +1 reasonable? <0 >0

Observation x2 Is this w ok? x How to adjust w? w = x x1 w d = +1  d = +1 d = 1 Observation x1 x2 Is this w ok? + x reasonable? How to adjust w? w = x w <0 >0

Observation w = ? +x x x2 Is this w ok? w x x1 or d = +1 d = 1 +

Perceptron Learning Rule Upon misclassification on + d = +1  d = 1 Define error +  No error

Perceptron Learning Rule Define error +  No error

Perceptron Learning Rule Input Learning Rate Error (d  y)

Summary  Perceptron Learning Rule Based on the general weight learning rule. Summary  Perceptron Learning Rule correct incorrect

Summary  Perceptron Learning Rule Converge? Summary  Perceptron Learning Rule x y .  d + 

Perceptron Convergence Theorem Exercise: Reference some papers or textbooks to prove the theorem. Perceptron Convergence Theorem If the given training set is linearly separable, the learning process will converge in a finite number of steps. x y   . d +

The Learning Scenario x2 x1 Linearly Separable. x(1) x(2) x(3) x(4) +  x(3)  x(4)

The Learning Scenario x1 x2 + x(1) + x(2) w0  x(3)  x(4)

The Learning Scenario x1 x2 + x(1) + x(2) w0 w1 w0  x(3)  x(4)

The Learning Scenario x2 x1 x(1) w2 w1 x(2) w1 w0 x(3) x(4) w0 + + 

The Learning Scenario x2 x1 w2 x(1) w3 w2 w1 x(2) w1 w0 x(3) x(4) + x(1) w3 w2 w1 + x(2) w0 w0  x(3)  x(4)

w4 = w3 The Learning Scenario x2 x1 w2 x(1) w3 w2 w1 x(2) w1 w0 + x(1) w3 w2 w1 + x(2) w0 w0  x(3)  x(4)

The Learning Scenario x1 x2 + x(1) w + x(2)  x(3)  x(4)

The Learning Scenario x2 x1 x(1) w x(2) x(3) x(4) The demonstration is in augmented space. The Learning Scenario x1 x2 + x(1) w + x(2)  x(3)  x(4) Conceptually, in augmented space, we adjust the weight vector to fit the data.

A weight in the shaded area will give correct classification for the positive example. Weight Space w1 w2 w + x

Weight Space w2 w = x x w w1 A weight in the shaded area will give correct classification for the positive example. Weight Space w1 w2 w = x + x w

A weight not in the shaded area will give correct classification for the negative example. Weight Space w1 w2  x w

Weight Space w2 w w = x x w1 A weight not in the shaded area will give correct classification for the negative example. Weight Space w1 w2 w w = x  x

The Learning Scenario in Weight Space + x(2) + x(1)  x(3)  x(4)

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w1  x(3)  x(4)

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w2 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w2 =w3 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w5 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w9 w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w10 w9 w2 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w10 w9 w2 w11 w8 + x(2) w7 + x(1) w4 w2 =w3 w5 w6 w1 w1  x(3)  x(4) w0

The Learning Scenario in Weight Space To correctly classify the training set, the weight must move into the shaded area. The Learning Scenario in Weight Space w2 w11 + x(2) + x(1) w1  x(3)  x(4) w0 Conceptually, in weight space, we move the weight into the feasible region.

Feed-Forward Neural Networks Learning Rules for Single-Layered Perceptron Networks Perceptron Learning Rule Adaline Leaning Rule -Learning Rule

Adaline (Adaptive Linear Element) Widrow [1962] x1 x2 xm wi1 wi2 wim . 

Adaline (Adaptive Linear Element) In what condition, the goal is reachable? Adaline (Adaptive Linear Element) Goal: Widrow [1962] x1 x2 xm wi1 wi2 wim . 

LMS (Least Mean Square) Minimize the cost function (error function):

Gradient Decent Algorithm Our goal is to go downhill. Gradient Decent Algorithm E(w) w1 w2 Contour Map w w2 (w1, w2) w1

Gradient Decent Algorithm Our goal is to go downhill. Gradient Decent Algorithm How to find the steepest decent direction? E(w) w1 w2 Contour Map w w2 (w1, w2) w1

Gradient Operator Let f(w) = f (w1, w2,…, wm) be a function over Rm. Define

Gradient Operator f w df : positive df : zero df : negative Go uphill Plain Go downhill

The Steepest Decent Direction To minimize f , we choose w =   f f w df : positive df : zero df : negative Go uphill Plain Go downhill

LMS (Least Mean Square) Minimize the cost function (error function):  (k)

Adaline Learning Rule Minimize the cost function (error function): Weight Modification Rule

Learning Modes Batch Learning Mode: Incremental Learning Mode:

Summary  Adaline Learning Rule LMS Algorithm Widrow-Hoff Learning Rule Summary  Adaline Learning Rule Converge? x y .   d + 

LMS Convergence Based on the independence theory (Widrow, 1976). The successive input vectors are statistically independent. At time t, the input vector x(t) is statistically independent of all previous samples of the desired response, namely d(1), d(2), …, d(t1). At time t, the desired response d(t) is dependent on x(t), but statistically independent of all previous values of the desired response. The input vector x(t) and desired response d(t) are drawn from Gaussian distributed populations.

LMS Convergence It can be shown that LMS is convergent if where max is the largest eigenvalue of the correlation matrix Rx for the inputs.

LMS Convergence It can be shown that LMS is convergent if Since max is hardly available, we commonly use LMS Convergence It can be shown that LMS is convergent if where max is the largest eigenvalue of the correlation matrix Rx for the inputs.

Comparisons Perceptron Learning Rule Adaline (Widrow-Hoff) Hebbian Assumption Gradient Decent Fundamental Converge Asymptotically Convergence In finite steps Linearly Separable Linear Independence Constraint

Feed-Forward Neural Networks Learning Rules for Single-Layered Perceptron Networks Perceptron Learning Rule Adaline Leaning Rule -Learning Rule

Adaline x1 x2 xm wi1 wi2 wim . 

Unipolar Sigmoid x1 x2 xm wi1 wi2 wim . 

Bipolar Sigmoid x1 x2 xm wi1 wi2 wim . 

Goal Minimize

Gradient Decent Algorithm Minimize

Depends on the activation function used. The Gradient Minimize Depends on the activation function used. ? ?

Weight Modification Rule Minimize Batch Learning Rule Incremental

The Learning Efficacy Minimize Exercise Sigmoid Adaline Unipolar Bipolar Exercise

Learning Rule  Unipolar Sigmoid Minimize Weight Modification Rule

Comparisons Batch Adaline Incremental Batch Sigmoid Incremental

The Learning Efficacy Sigmoid Adaline y=a(net) y=a(net) = net net net constant depends on output

The Learning Efficacy Sigmoid Adaline The learning efficacy of Adaline is constant meaning that the Adline will never get saturated. net y=a(net) = net net y=a(net) y=net a’(net) 1 constant depends on output

The Learning Efficacy Sigmoid Adaline The sigmoid will get saturated if its output value nears the two extremes. Adaline net y=a(net) = net net y=a(net) y a’(net) 1 constant depends on output

Initialization for Sigmoid Neurons Why? Before training, it weight must be sufficiently small. x1 x2 xm . wi1 wi2 wim 

Feed-Forward Neural Networks Multilayer Perceptron

Multilayer Perceptron . . . x1 x2 xm y1 y2 yn Output Layer Hidden Layer Input Layer

Multilayer Perceptron Where the knowledge from? Classification Output Analysis Learning Input

How an MLP Works? Example: XOR Not linearly separable. Is a single layer perceptron workable? XOR 1 x1 x2

How an MLP Works? 00 01 11 Example: XOR x1 x2 x1 x2 x3= 1 y1 y2 L1 1 1 XOR x1 x2 00 L2 L1 x1 x2 x3= 1 y1 y2 L2 01 11

How an MLP Works? Example: L1 1 XOR x1 x2 L3 00 L2 01 1 y1 y2 11

How an MLP Works? Example: L1 1 XOR x1 x2 L3 00 L2 01 1 y1 y2 11

How an MLP Works? Example: y1 y2 y3= 1 z y1 y2 x1 x2 x3= 1 L3 L3 1 1 y1 y2 L2 L1 x1 x2 x3= 1

Parity Problem Is the problem linearly separable? 1 000 001 010 011 1 000 001 010 011 100 101 110 111 x1 x2 x3 x1 x2 x3

Parity Problem 1 000 001 010 011 100 101 110 111 x3 x2 x1 x1 x2 x3 P1 1 000 001 010 011 100 101 110 111 x1 x2 x3 x3 P1 P2 x2 P3 x1

Parity Problem 1 000 001 010 011 100 101 110 111 x1 x2 x3 x1 x2 x3 P1 P2 P3 111 011 001 000

Parity Problem x1 x2 x3 x1 x2 x3 y1 y2 y3 111 011 001 000 P1 P2 P3 P1

Parity Problem y1 y2 y3 x1 x2 x3 P1 P2 P3 111 P4 011 001 000

Parity Problem y1 z y3 P4 y2 y1 y2 y3 P4 P1 x1 x2 x3 P2 P3

General Problem

General Problem

Hyperspace Partition L3 L2 L1

Region Encoding 001 000 L3 L2 L1 010 100 101 110 111

Hyperspace Partition & Region Encoding Layer 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 101 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 001 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 000 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 110 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 010 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 100 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Region Identification Layer 111 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Classification 1 1 1 x1 x2 x3 001 101 000 110 010 100 111 L1 L2 L3 1 1 1 001 101 000 110 010 100 111 101 L1 L2 L3 001 000 010 110 111 100 L1 x1 x2 x3 L2 L3

Feed-Forward Neural Networks Back Propagation Learning algorithm

Activation Function — Sigmoid net 1 0.5 Remember this

Supervised Learning Output Layer Hidden Layer Input Layer x1 x2 xm o1 Training Set Supervised Learning . . . x1 x2 xm o1 o2 on d1 d2 dn Output Layer Hidden Layer Input Layer

Supervised Learning Goal: Sum of Squared Errors Minimize x1 x2 xm o1 Training Set Supervised Learning . . . x1 x2 xm o1 o2 on Sum of Squared Errors d1 d2 dn Goal: Minimize

Back Propagation Learning Algorithm . . . x1 x2 xm o1 o2 on d1 d2 dn Learning on Output Neurons Learning on Hidden Neurons

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji ? ?

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji depends on the activation function

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji Using sigmoid,

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji Using sigmoid,

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji

Learning on Output Neurons . . . j i o1 oj on d1 dj dn wji How to train the weights connecting to output neurons?

Learning on Hidden Neurons . . . j k i wik wji ? ?

Learning on Hidden Neurons . . . j k i wik wji

Learning on Hidden Neurons . . . j k i wik wji ?

Learning on Hidden Neurons . . . j k i wik wji

Learning on Hidden Neurons . . . j k i wik wji

Back Propagation o1 oj on . . . j k i d1 dj dn x1 xm

Back Propagation o1 oj on . . . j k i d1 dj dn x1 xm

Back Propagation o1 oj on . . . j k i d1 dj dn x1 xm

Learning Factors Initial Weights Learning Constant () Cost Functions Momentum Update Rules Training Data and Generalization Number of Layers Number of Hidden Nodes

Reading Assignments Shi Zhong and Vladimir Cherkassky, “Factors Controlling Generalization Ability of MLP Networks.” In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630, Washington DC. July 1999. (http://www.cse.fau.edu/~zhong/pubs.htm) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986b). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. MIT Press, Cambridge (1986). (http://www.cnbc.cmu.edu/~plaut/85-419/papers/RumelhartETAL86.backprop.pdf).