Feed-forward Networks


AI - NN Lecture Notes, Chapter 8: Feed-forward Networks

§8.1 Introduction to Classification

The Classification Model
X = [x_1, x_2, …, x_n]^t -- the input pattern of the classifier.
i(X) -- the decision function.
The response of the classifier is 1 or 2 or … or R.

[Figure: block diagram of a pattern classifier that maps the input pattern X = (x_1, …, x_n) through the decision function i(X) to one of the class numbers 1, 2, …, R.]

Geometric Explanation of Classification
Pattern -- an n-dimensional vector. All n-dimensional patterns constitute an n-dimensional Euclidean space E^n, called the pattern space. If all patterns can be divided into R classes, then the region of the space containing only patterns of the r-th class is called the r-th region, r = 1, …, R. Regions are separated from each other by decision surfaces. A pattern classifier maps sets of patterns in E^n into one of the regions, denoted by the numbers i = 1, 2, …, R.

Classifiers That Use Discriminant Functions
Membership in a class is determined by comparing R discriminant functions g_i(X), i = 1, …, R, computed for the input pattern under consideration. The g_i(X) are scalar values, and the pattern X belongs to the i-th class iff g_i(X) > g_j(X) for all j = 1, …, R, j ≠ i. The decision surface between classes i and j is g_i(X) - g_j(X) = 0. Assuming that the discriminant functions are known, the block diagram of a basic pattern classifier is shown below.

[Figure: a bank of R discriminators computes g_1(X), …, g_R(X); a maximum selector outputs the winning class index i.]

For a given pattern, the i-th discriminator computes the value of the function g_i(X), called briefly the discriminant.
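As an illustration (not from the slides), here is a minimal maximum-selector classifier in Python, assuming the discriminant functions are supplied as plain callables:

```python
from typing import Callable, Sequence

def classify(x: Sequence[float],
             discriminants: Sequence[Callable[[Sequence[float]], float]]) -> int:
    """Return the 1-based class index whose discriminant g_i(x) is largest."""
    values = [g(x) for g in discriminants]          # g_1(x), ..., g_R(x)
    return max(range(len(values)), key=lambda i: values[i]) + 1

# Hypothetical linear discriminants for a 2-class (R = 2) problem.
g1 = lambda x: -2 * x[0] + x[1] + 2   # g_1(X)
g2 = lambda x: 0.0                    # g_2(X) taken as the reference 0
print(classify([0, 0], [g1, g2]))     # -> 1, since g_1 > g_2 at the origin
```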

When R = 2, the classifier, called a dichotomizer, simplifies as shown below:

[Figure: a single discriminator computes g(X); a TLU outputs +1 (Class 1) or -1 (Class 2).]

Its discriminant function is g(X) = g_1(X) - g_2(X).
If g(X) > 0, then X belongs to Class 1; if g(X) < 0, then X belongs to Class 2.

The following figure is an example in which six patterns belong to one of two classes and the decision surface is a straight line.

[Figure: the decision surface g(X) = -2x_1 + x_2 + 2 = 0 in the (x_1, x_2) plane; the patterns (0,0), (-1/2,-1), (-1,-2) lie on the side g(X) > 0, while (2,0), (3/2,-1), (1,-2) lie on the side g(X) < 0.]

An infinite number of discriminant functions may exist for the same decision surface.
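A small check of this example (my own illustration), evaluating g(X) = -2x_1 + x_2 + 2 on the six patterns:

```python
def g(x1, x2):
    """Discriminant for the example decision line g(X) = -2*x1 + x2 + 2."""
    return -2 * x1 + x2 + 2

patterns = [(0, 0), (2, 0), (-0.5, -1), (1.5, -1), (-1, -2), (1, -2)]
for x1, x2 in patterns:
    cls = 1 if g(x1, x2) > 0 else 2
    print(f"({x1}, {x2}) -> g = {g(x1, x2):+.1f} -> Class {cls}")
```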

Training and Classification Consider neural network classifiers that derive their weights during the learning cycle. The sample patterns, called the training sequence, are presented to the machine along with the correct response provided by the teacher. After each incorrect response, the classifier modifies its parameters by means of iterative, supervised learning based on comparing the targeted correct response with the actual response.

[Figure: a TLU-based trainable classifier; the weighted sum of the inputs x_1, …, x_n (plus a bias input -1 with weight w_{n+1}) is thresholded to give the output o = +1 or -1, and the weights are adjusted from the error d - o between the desired and actual responses.]

§8.2 Single Layer Perceptron

1. Linear Threshold Unit and Separability

[Figure: a threshold logic unit with inputs x_1, …, x_N, weights w_1, …, w_N and threshold T, producing output y.]

X = (x_1, …, x_N)^t,  x_n ∈ {1, -1}
W = (w_1, …, w_N)^t,  T ∈ R
y = sgn(W^t X - T) ∈ {1, -1}

Let X' = (X^t, -1)^t and W' = (W^t, T)^t; then we have

y = sgn(W'^t X') = sgn( Σ_{n=1}^{N+1} w_n x_n ),  where w_{N+1} = T and x_{N+1} = -1.

Linearly Separable Patterns
Assume that a pattern set Ω is divided into subsets Ω_1, Ω_2, …, Ω_R. If a linear machine can classify the patterns from Ω_i as belonging to class i, for i = 1, …, R, then the pattern sets are linearly separable.

When R = 2: for a given input X and output y, if there exist W and T such that y = sgn(W^t X - T) realizes the mapping f: X → {-1, 1}, then the function f is said to be linearly separable.

Example 1: NAND is a linearly separable function. Given x_1, x_2 and y as

x_1  x_2  y
-1   -1   1
-1    1   1
 1   -1   1
 1    1  -1

it can be verified that W = (-1, -1)^t and T = -3/2 is a solution:
y = sgn(W^t X - T) = sgn(-x_1 - x_2 + 3/2)

x_1  x_2   T     w_1 x_1 + w_2 x_2 - T      y
-1   -1   -3/2    1 + 1 - (-3/2) =  7/2     1
-1    1   -3/2    1 - 1 - (-3/2) =  3/2     1
 1   -1   -3/2   -1 + 1 - (-3/2) =  3/2     1
 1    1   -3/2   -1 - 1 - (-3/2) = -1/2    -1

[Figure: the decision line x_1 + x_2 = 3/2 in the (x_1, x_2) plane; y = sgn(-x_1 - x_2 + 3/2) is +1 on the side x_1 + x_2 < 3/2 and -1 on the other side.]
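The table can be reproduced with a few lines of code (an illustrative sketch, not part of the original slides):

```python
def sgn(v):
    return 1 if v >= 0 else -1

# NAND with the weights from Example 1: W = (-1, -1), T = -3/2
for x1 in (-1, 1):
    for x2 in (-1, 1):
        y = sgn(-x1 - x2 + 1.5)   # sgn(W^t X - T)
        print(x1, x2, "->", y)
```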

Example 2: XOR is not a linearly separable function. Given

x_1  x_2  y
-1   -1  -1
-1    1   1
 1   -1   1
 1    1  -1

it is impossible to find W and T satisfying y = sgn(W^t X - T): the four rows would require
-w_1 - w_2 < T,  w_1 + w_2 < T,  -w_1 + w_2 > T,  w_1 - w_2 > T.
Adding the first two inequalities gives T > 0, while adding the last two gives T < 0, a contradiction.

It is seen that linearly separable binary patterns can be classified by a suitably designed neuron, but linearly non-separable binary patterns cannot.

Perceptron Learning
Given the training set {X(t), d(t)}, t = 0, 1, 2, …, where X(t) = (x_1(t), …, x_N(t), -1), let w_{N+1} = T and x_{N+1} = -1, so that

y = sgn( Σ_{n=1}^{N+1} w_n x_n ) = +1 if Σ_{n=1}^{N+1} w_n x_n ≥ 0, and -1 otherwise.

(1) Set w_n(0) to small random values, n = 1, …, N+1.
(2) Input a sample X(t) = (x_1(t), …, x_N(t), -1) and its desired response d(t).
(3) Compute the actual output y(t).
(4) Revise the weights: w_n(t+1) = w_n(t) + η [d(t) - y(t)] x_n(t).
(5) Return to (2) until w_n(t+1) = w_n(t) for all n.
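A runnable sketch of steps (1)-(5) in Python (my illustration; the specific learning-rate value, epoch cap and initialization range are assumptions, while the NAND data come from Example 1):

```python
import random

def sgn(v):
    return 1 if v >= 0 else -1

def train_perceptron(samples, eta=0.1, max_epochs=100):
    """samples: list of (x, d) with x an N-vector and d in {+1, -1}."""
    n = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(n + 1)]  # w[-1] plays the role of T
    for _ in range(max_epochs):
        changed = False
        for x, d in samples:
            xa = list(x) + [-1]                      # augmented input, x_{N+1} = -1
            y = sgn(sum(wi * xi for wi, xi in zip(w, xa)))
            if y != d:                               # step (4): w_n += eta*(d - y)*x_n
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, xa)]
                changed = True
        if not changed:                              # step (5): weights stable
            break
    return w

# Example: learn NAND (linearly separable, so the algorithm converges).
nand = [((-1, -1), 1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]
print(train_perceptron(nand))
```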

Here 0 < η < 1 is a learning-rate coefficient.

Theorem: The perceptron learning algorithm converges if the function is linearly separable.

Gradient Descent Algorithm
The perceptron learning algorithm is restricted to linearly separable functions (hard-limiting activation function). The gradient descent algorithm can be applied in more general cases, with the only requirement that the activation function be differentiable.

Given the training set (x_n, y_n), n = 1, …, N, we try to find W* such that ŷ_n = f(W*·x_n) ≈ y_n. Let

E = Σ_{n=1}^{N} E_n = (1/2) Σ_{n=1}^{N} (y_n - ŷ_n)^2 = (1/2) Σ_{n=1}^{N} (y_n - f(W*·x_n))^2

be the error measure of learning. To minimize E, take the gradient

∂E/∂w_m = Σ_{n=1}^{N} ∂E_n/∂w_m = (1/2) Σ_{n=1}^{N} ∂(y_n - ŷ_n)^2 / ∂w_m
        = (1/2) Σ_{n=1}^{N} ∂(y_n - f(W*·x_n))^2 / ∂w_m
        = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) ∂f(W*·x_n)/∂w_m

Since ∂f(W*·x_n)/∂w_m = f'(W*·x_n) · ∂(W*·x_n)/∂w_m = f'(W*·x_n) x_mn, this gives

∂E/∂w_m = - Σ_{n=1}^{N} (y_n - f(W*·x_n)) f'(W*·x_n) x_mn

The learning (adjusting) rule is thus

w_m := w_m - η ∂E/∂w_m = w_m + η Σ_{n=1}^{N} (y_n - ŷ_n) f'(W*·x_n) x_mn,   η > 0
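A minimal batch gradient-descent sketch for a single sigmoid unit following this rule (my own illustration; the sigmoid choice of f, the learning rate and the toy AND-like data with a constant bias input are assumptions):

```python
import math, random

def f(v):                      # sigmoid activation, differentiable
    return 1.0 / (1.0 + math.exp(-v))

def f_prime(v):
    s = f(v)
    return s * (1.0 - s)

def gradient_descent(samples, eta=0.5, epochs=2000):
    """samples: list of (x, y) with x an M-vector and y a scalar target."""
    m = len(samples[0][0])
    w = [random.uniform(-0.5, 0.5) for _ in range(m)]
    for _ in range(epochs):
        grad = [0.0] * m
        for x, y in samples:
            net = sum(wi * xi for wi, xi in zip(w, x))
            err = y - f(net)
            for j in range(m):                 # dE/dw_j = -sum_n (y_n - y^_n) f'(net) x_jn
                grad[j] += -err * f_prime(net) * x[j]
        w = [wi - eta * g for wi, g in zip(w, grad)]   # w := w - eta * dE/dw
    return w

# Toy example: AND-like targets; the third input component is a constant bias of 1.
data = [((0, 0, 1), 0.0), ((0, 1, 1), 0.0), ((1, 0, 1), 0.0), ((1, 1, 1), 1.0)]
w = gradient_descent(data)
print([round(f(sum(wi * xi for wi, xi in zip(w, x))), 2) for x, _ in data])
```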

§8.3 Multi-Layer Perceptron

1. Why Multi-Layer?

XOR, which as we saw cannot be implemented by a 1-layer network, can be implemented by a 2-layer network:

x_3 = sgn(1·x_1 + 1·x_2 - 1.5)
y   = sgn(1·x_1 + 1·x_2 - 2·x_3 - 0.5)

[Figure: a 2-layer network with inputs x_1, x_2, a hidden unit x_3 with threshold 1.5, and an output unit y with threshold 0.5 that receives x_1 and x_2 with weight 1 and x_3 with weight -2.]
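A quick check of this two-layer construction (illustrative code, not from the slides):

```python
def sgn(v):
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    x3 = sgn(x1 + x2 - 1.5)               # hidden unit
    return sgn(x1 + x2 - 2 * x3 - 0.5)    # output unit

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", xor_net(x1, x2))   # reproduces the XOR table
```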

The hidden unit x_3 provides the internal representation of the inputs:

x_1  x_2  x_3   y
 1    1    1   -1
 1   -1   -1    1
-1    1   -1    1
-1   -1   -1   -1

Single-layer networks have no hidden units and thus no internal-representation ability; they can only map similar input patterns to similar output patterns. If there is a layer of hidden units, there is always an internal representation of the input patterns that can support any mapping from the input units to the output units.

This figure shows the internal-representation abilities of networks with different numbers of layers:

[Table/Figure: network structure vs. classified regions (columns: types of the classified region, classification for XOR, meshed classified regions, general form of the classified region). A single-layer network yields two regions separated by a hyperplane; a two-layer network yields open or closed convex regions; a three-layer network yields regions of arbitrary form.]

2. Back-Propagation Algorithm

More precisely, BP is an error back-propagation learning algorithm for multi-layer perceptron networks, and it is a sort of generalized gradient-descent algorithm.

Assumptions:
1) The MLP has M layers and a single output node.
2) Each node is of sigmoid type with activation function f(x) = 1 / (1 + e^{-x}).
3) The training samples are (x_n, y_n), n = 1, …, N.

4) The error measure is chosen as E = Σ_n E_n = (1/2) Σ_n (y_n - ŷ_n)^2.

Let δ_jn = ∂E_n/∂net_jn, where net_jn = Σ_i w_ij O_in.

I) If j is an output node, then O_jn = ŷ_n and

δ_jn = (∂E_n/∂ŷ_n)(∂ŷ_n/∂net_jn) = -(y_n - ŷ_n) f'(net_jn)

Thus

∂E_n/∂w_ij = (∂E_n/∂net_jn)(∂net_jn/∂w_ij) = δ_jn O_in = -(y_n - ŷ_n) f'(net_jn) O_in

II) If j is not an output node, then

δ_jn = ∂E_n/∂net_jn = (∂E_n/∂O_jn) f'(net_jn),  where

∂E_n/∂O_jn = Σ_k (∂E_n/∂net_kn)(∂net_kn/∂O_jn) = Σ_k (∂E_n/∂net_kn) ∂(Σ_i w_ik O_in)/∂O_jn = Σ_k δ_kn w_jk

(k runs over the units in the next layer that receive O_jn). Therefore

δ_jn = f'(net_jn) Σ_k δ_kn w_jk   and   ∂E_n/∂w_ij = δ_jn O_in = O_in f'(net_jn) Σ_k δ_kn w_jk

The MLP-BP algorithm can then be described as follows:

(1) Set the initial weights w_ij.
(2) Until convergence (w_ij = const), repeat:
    (i) For n = 1 to N (number of samples):
        (a) Compute O_in, net_jn, ŷ_n and E_n (forward pass).
        (b) For m = M down to 2, for all units j in layer m, compute ∂E_n/∂w_ij (backward pass).
    (ii) Revise the weights: w_ij := w_ij - η ∂E_n/∂w_ij,  η > 0.
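The following is a compact, runnable sketch of this procedure under the stated assumptions (sigmoid units, single output); the single hidden layer, its size, the learning rate, the 0/1 target coding and the XOR training data are my own choices for illustration:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_bp(X, y, hidden=3, eta=0.5, epochs=5000, seed=0):
    """Back-propagation for a 1-hidden-layer, single-output sigmoid network.
    X: (N, d) inputs, y: (N,) scalar targets."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], hidden))   # input -> hidden weights
    W2 = rng.uniform(-0.5, 0.5, hidden + 1)             # hidden (+ bias) -> output weights
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            # forward pass: O_in, net_jn, y^
            h = sigmoid(x_n @ W1)
            h_aug = np.append(h, 1.0)                         # constant bias unit
            y_hat = sigmoid(h_aug @ W2)
            # backward pass: delta for the output unit, then for the hidden units
            delta_out = -(y_n - y_hat) * y_hat * (1 - y_hat)
            delta_hid = h * (1 - h) * (delta_out * W2[:hidden])
            # weight revision: w_ij := w_ij - eta * dE_n/dw_ij
            W2 -= eta * delta_out * h_aug
            W1 -= eta * np.outer(x_n, delta_hid)
    return W1, W2

# XOR in 0/1 coding, with a constant bias input appended to each pattern.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
W1, W2 = train_bp(X, y)
preds = [sigmoid(np.append(sigmoid(x @ W1), 1.0) @ W2) for x in X]
print(np.round(preds, 2))   # should approach [0, 1, 1, 0]
```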

Mapping Ability of Feedforward Networks

MLPs play the mathematical role of a mapping R^n → R^m.

Kolmogorov Theorem (1957): Let φ(x) be a bounded, monotonically increasing continuous function, K a bounded closed subset of R^n, and f(X) = f(x_1, …, x_n) a real continuous function on K. Then for any ε > 0 there exist an integer N and constants c_i, T_i and w_ij (i = 1, …, N; j = 1, …, n) such that

f^(x_1, …, x_n) = Σ_{i=1}^{N} c_i φ( Σ_{j=1}^{n} w_ij x_j - T_i )      (1)

and

max_{X∈K} | f(x_1, …, x_n) - f^(x_1, …, x_n) | < ε      (2)

That is to say, for any ε > 0 there exists a 3-layer network whose hidden-unit output function is φ(x), whose input and output units are linear, and whose total input-output relation f^(x_1, …, x_n) satisfies Eq. (2).

Significance: Any continuous mapping R^n → R^m can be approximated by a k-layer (k-2 hidden layer) network's input-output relation, k ≥ 3.
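To make Eq. (1) concrete, here is a small evaluation sketch (my illustration; the choice φ = tanh and the particular constants c_i, w_ij, T_i are arbitrary assumptions): the approximant is just a one-hidden-layer network with a linear output unit.

```python
import math

def f_hat(x, c, w, T, phi=math.tanh):
    """Evaluate Eq. (1): sum_i c_i * phi( sum_j w_ij * x_j - T_i )."""
    return sum(c_i * phi(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) - T_i)
               for c_i, w_i, T_i in zip(c, w, T))

# Hypothetical parameters for n = 2 inputs and N = 3 hidden units.
c = [0.5, -1.0, 2.0]
w = [[1.0, -1.0], [0.5, 0.5], [-2.0, 1.0]]
T = [0.0, 1.0, -0.5]
print(f_hat([0.3, 0.7], c, w, T))
```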

§8.4 Applications of Feed-forward Networks

MLPs can be successfully applied to classification and diagnosis problems whose solution is obtained via experimentation and simulation rather than via a rigorous and formal approach. An MLP can also act as an expert system: formulating the rules is one of the bottlenecks in building expert systems, but a layered network can acquire the knowledge without extracting IF-THEN rules, provided the number of training vector pairs is sufficient to suitably form all decision regions.

1) Fault Diagnosing
Automobile engine diagnosis (Marko et al., 1989)
-- employs a single-hidden-layer network
-- identifies 26 different faults
-- the training set consists of 16 data sets for each failure
-- the training time needed is 10 minutes
-- the hardware platform is a NESTOR NDS-100 computer
-- fault recognition accuracy is 100%
Switching system fault diagnosis (Wen Fang, 1994)
-- BP algorithm
-- 3-layer network
-- no misdiagnoses; much better than the existing system

2) Handwritten Digit Recognition
Postal code recognition (Wang et al., 1995)
-- 3-layer network
-- preprocessing
-- 130 features
-- rejection rate < 5%
-- misclassification rate < 0.01%
3) Other Applications Include
-- text reading
-- speech recognition
-- image recognition
-- medical diagnosis
-- approximation

-- optimization
-- coding
-- robot control, etc.
Advantages and Disadvantages
-- learning from examples
-- better performance than traditional approaches
-- long training time (much improved)
-- local minima (largely mitigated in practice)