Artificial Neural Network. Introduction Robust approach to approximating real-valued, discrete-valued, and vector-valued target functions Backpropagation.


Artificial Neural Network

Introduction
Artificial neural networks (ANNs) provide a robust approach to approximating real-valued, discrete-valued, and vector-valued target functions.
The Backpropagation algorithm has been successful in many practical problems, such as interpreting visual scenes and speech recognition.
ANN learning is robust to errors in the training data.

Biological Motivation
Biological learning systems are built of very complex webs of interconnected neurons.
An ANN is likewise a densely interconnected set of simple units, where each unit takes a number of real-valued inputs and produces a single real-valued output.
The human brain contains a densely interconnected network of approximately $10^{11}$ neurons, each connected to roughly $10^3$ other neurons.
The fastest neuron switching times are quite slow compared with computer switching speeds, yet humans make complex decisions quickly.

Biological Motivation
ANNs are not exactly the same as biological systems.
Two groups of research:
– Using ANNs to study and model biological learning processes
– Obtaining highly effective machine learning algorithms, whether or not they mirror biological processes

Neural Network Representation
Example: ALVINN uses a learned ANN to steer an autonomous vehicle driving at normal speeds on public highways.

Neural Network Representation
ALVINN is typical of many ANNs:
– Directed and cycle-free (acyclic)
Other structures are possible:
– Acyclic or cyclic
– Directed or undirected
The Backpropagation algorithm assumes that the network is a fixed structure corresponding to a directed graph, possibly containing cycles.
Learning corresponds to choosing a weight value for each edge in the graph.

Appropriate Problems for Neural Network Learning
– Instances are represented by many attribute-value pairs
– The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes
– The training examples may contain errors
– Long training times are acceptable
– Fast evaluation of the learned target function may be required
– The ability of humans to understand the learned target function is not important

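Perceptrons
One type of ANN is based on a unit called a perceptron.
A perceptron takes a vector of real-valued inputs, calculates a linear combination of these inputs, then outputs 1 if the result is greater than some threshold and -1 otherwise:
$o(x_1, \ldots, x_n) = 1$ if $w_0 + w_1 x_1 + \cdots + w_n x_n > 0$, and $-1$ otherwise.
Each $w_i$ is a real-valued constant, or weight, that determines the contribution of input $x_i$ to the perceptron output. The quantity $-w_0$ is the threshold that the weighted combination of inputs must exceed.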

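Perceptrons
To simplify notation, we imagine an additional constant input $x_0 = 1$, allowing us to write the above inequality as
$\sum_{i=0}^{n} w_i x_i > 0$
or in vector form as
$\vec{w} \cdot \vec{x} > 0$
The perceptron function is then
$o(\vec{x}) = \mathrm{sgn}(\vec{w} \cdot \vec{x})$
where $\mathrm{sgn}(y) = 1$ if $y > 0$ and $-1$ otherwise.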
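The perceptron function can be written as a few lines of code. The following is a minimal sketch, not taken from the slides: the function name `perceptron_output` and the example weights are assumptions chosen for illustration, and the constant input $x_0 = 1$ is prepended inside the function.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w . x > 0 and -1 otherwise.

    weights[0] is w_0, which is paired with the constant input x_0 = 1.
    """
    x = [1.0] + list(inputs)  # prepend the constant input x_0 = 1
    net = sum(w_i * x_i for w_i, x_i in zip(weights, x))
    return 1 if net > 0 else -1


# Hypothetical weights: the unit fires only when x_1 + x_2 exceeds 1.5.
example_weights = [-1.5, 1.0, 1.0]
print(perceptron_output(example_weights, [1, 1]))  # prints 1
print(perceptron_output(example_weights, [1, 0]))  # prints -1
```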

Perceptrons
Learning a perceptron involves choosing values for the weights $w_0, w_1, \ldots, w_n$.
Therefore, the space H of candidate hypotheses considered in perceptron learning is the set of all possible real-valued weight vectors:
$H = \{ \vec{w} \mid \vec{w} \in \mathbb{R}^{n+1} \}$

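Representational Power of Perceptrons
A perceptron represents a hyperplane decision surface in the n-dimensional space of instances.
It outputs 1 for instances on one side of the hyperplane and -1 for instances on the other side.
The decision hyperplane is $\vec{w} \cdot \vec{x} = 0$.
Sets of examples that can be separated by such a hyperplane are called linearly separable.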

Representational Power of Perceptrons
A single perceptron can be used to represent many Boolean functions.
– How can AND and OR be implemented?
Perceptrons can represent all of the primitive Boolean functions AND, OR, NAND, and NOR.
XOR cannot be represented by a single perceptron, because its training examples are not linearly separable.
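One way to answer the AND/OR question is to pick weights by hand and check the truth tables. The sketch below assumes Boolean inputs encoded as 0 and 1 and a perceptron output of 1 (true) or -1 (false); the specific weight values are one illustrative choice among many, not values given in the slides.

```python
def perceptron_output(weights, inputs):
    """Thresholded linear unit; weights[0] pairs with the constant x_0 = 1."""
    x = [1.0] + list(inputs)
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else -1


# Illustrative weight vectors [w_0, w_1, w_2] for inputs in {0, 1}.
AND_W = [-0.8, 0.5, 0.5]    # fires only when both inputs are 1
OR_W = [-0.3, 0.5, 0.5]     # fires when at least one input is 1
NAND_W = [0.8, -0.5, -0.5]  # negation of AND
NOR_W = [0.3, -0.5, -0.5]   # negation of OR

for a in (0, 1):
    for b in (0, 1):
        print(a, b,
              "AND:", perceptron_output(AND_W, [a, b]),
              "OR:", perceptron_output(OR_W, [a, b]))
```

Negating all weights of a unit flips its decision everywhere except exactly on the hyperplane, which is why NAND and NOR reuse the AND and OR weights with opposite signs.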

Representational Power of Perceptrons
Every Boolean function can be represented by some network of interconnected units based on these primitives.
In fact, every Boolean function can be represented by a network of perceptrons only two levels deep.
Networks of units can represent a rich variety of functions that single units alone cannot.
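As an illustration of the two-level claim, XOR can be built from an OR unit and a NAND unit feeding an AND unit. This is a sketch under assumptions: network inputs are encoded as 0/1, unit outputs are 1/-1, and the weights are hand-picked for this example rather than learned.

```python
def perceptron_output(weights, inputs):
    """Thresholded linear unit; weights[0] pairs with the constant x_0 = 1."""
    x = [1.0] + list(inputs)
    return 1 if sum(w * v for w, v in zip(weights, x)) > 0 else -1


def xor_network(a, b):
    """Two-level perceptron network computing XOR of inputs in {0, 1}."""
    # First level: two hidden perceptrons reading the 0/1 inputs.
    h_or = perceptron_output([-0.3, 0.5, 0.5], [a, b])
    h_nand = perceptron_output([0.8, -0.5, -0.5], [a, b])
    # Second level: AND of the hidden outputs, which are encoded as 1 / -1.
    return perceptron_output([-0.8, 0.5, 0.5], [h_or, h_nand])


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```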

The Perceptron Training Rules
The task is to learn the weights for a single perceptron: determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each training example.
Two algorithms:
– Perceptron rule
– Delta rule
The two converge to somewhat different acceptable hypotheses, under somewhat different conditions.

The Perceptron Training Rules
Perceptron rule:
– Begin with random weights
– Iteratively apply the perceptron to each training example, modifying the weights whenever it misclassifies an example
– Iterate through the training examples as many times as needed, until all of the examples have been correctly classified
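The procedure above can be sketched in code using the standard perceptron update $w_i \leftarrow w_i + \eta (t - o) x_i$, where $t$ is the target output, $o$ is the perceptron output, and $\eta$ is a small positive learning rate. The function names, the learning rate of 0.1, the initial weight range, and the epoch cap are assumptions made for this example.

```python
import random


def predict(w, x):
    """Perceptron output for input x; w[0] pairs with the constant x_0 = 1."""
    xs = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1


def train_perceptron(examples, eta=0.1, max_epochs=1000, seed=0):
    """Apply the perceptron rule until every example is classified correctly."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]  # random start
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = predict(w, x)
            if o != t:  # weights change only on misclassified examples
                mistakes += 1
                xs = [1.0] + list(x)
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
        if mistakes == 0:  # all examples correct: stop
            break
    return w


# AND is linearly separable, so the rule is expected to converge on it.
and_examples = [([0, 0], -1), ([0, 1], -1), ([1, 0], -1), ([1, 1], 1)]
learned = train_perceptron(and_examples)
print([predict(learned, x) for x, _ in and_examples])
```

The epoch cap is a safeguard for data that is not linearly separable, where the perceptron rule is not guaranteed to stop on its own.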

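Gradient Descent and the Delta Rule
The delta rule converges toward a best-fit approximation to the target concept even when the training examples are not linearly separable.
It uses gradient descent to search the hypothesis space of possible weight vectors for the one that best fits the training examples.
Gradient descent can serve as the basis for any learning algorithm that must search a hypothesis space containing many different types of continuously parameterized hypotheses.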

Gradient Descent
To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

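Gradient Descent and the Delta Rule
Consider the task of training an unthresholded perceptron, that is, a linear unit whose output is
$o(\vec{x}) = \vec{w} \cdot \vec{x}$
The training error of a hypothesis (weight vector) is
$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
where $D$ is the set of training examples, $t_d$ is the target output for training example $d$, and $o_d$ is the output of the linear unit for training example $d$.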

Visualizing the Hypothesis Space
For a linear unit with two weights, the $w_0$, $w_1$ plane represents the entire hypothesis space.
Gradient descent starts with an arbitrary initial weight vector, then repeatedly modifies it in small steps, each taken in the direction that produces the steepest descent along the error surface.

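Gradient Descent
How do we find the direction of steepest descent along the error surface?
– Compute the derivative of E with respect to each component of the weight vector $\vec{w}$. This vector of derivatives is called the gradient, written
$\nabla E(\vec{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]$
– The gradient specifies the direction that produces the steepest increase in E, so $-\nabla E(\vec{w})$ is the direction of steepest decrease.
Gradient descent rule:
$\vec{w} \leftarrow \vec{w} + \Delta \vec{w}$, where $\Delta \vec{w} = -\eta \nabla E(\vec{w})$
or, in component form,
$w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Here $\eta$ is a positive constant called the learning rate, which determines the step size.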

Derivative Rules

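Gradient Descent
The vector of derivatives that forms the gradient can be obtained by differentiating E from its definition:
$\frac{\partial E}{\partial w_i} = \sum_{d \in D} (t_d - o_d)(-x_{id})$
where $x_{id}$ is the input component $x_i$ for training example $d$. The weight update for gradient descent is therefore
$\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$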
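The intermediate steps of this differentiation can be written out as follows, assuming the squared-error definition $E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$ and a linear unit with output $o_d = \vec{w} \cdot \vec{x}_d$. The target $t_d$ does not depend on the weights, so only the $-\vec{w} \cdot \vec{x}_d$ term contributes to the last step.

```latex
\begin{align*}
\frac{\partial E}{\partial w_i}
  &= \frac{\partial}{\partial w_i} \, \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2 \\
  &= \frac{1}{2} \sum_{d \in D} 2 \, (t_d - o_d) \, \frac{\partial}{\partial w_i} (t_d - o_d) \\
  &= \sum_{d \in D} (t_d - o_d) \, \frac{\partial}{\partial w_i} \left( t_d - \vec{w} \cdot \vec{x}_d \right) \\
  &= \sum_{d \in D} (t_d - o_d) \, (-x_{id})
\end{align*}
```

Substituting this into $\Delta w_i = -\eta \, \partial E / \partial w_i$ gives $\Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}$.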

Gradient Descent Algorithm
Gradient-Descent(training_examples, $\eta$)
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate.
– Initialize each $w_i$ to some small random value
– Until the termination condition is met, Do
    – Initialize each $\Delta w_i$ to zero
    – For each $\langle \vec{x}, t \rangle$ in training_examples, Do
        – Input the instance $\vec{x}$ to the unit and compute the output $o$
        – For each linear unit weight $w_i$, Do
            $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
    – For each linear unit weight $w_i$, Do
        $w_i \leftarrow w_i + \Delta w_i$

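Gradient Descent Algorithm
Gradient-Descent in summary:
– Pick an initial random weight vector
– Apply the linear unit to all training examples and compute $\Delta w_i$ for each weight
– Update each weight $w_i$ by adding $\Delta w_i$, then repeat
Because the error surface of a linear unit contains only a single global minimum, this algorithm converges to a weight vector with minimum error, provided a sufficiently small learning rate $\eta$ is used.
Determining $\eta$: if $\eta$ is too large, the search may overstep the minimum of the error surface. One common modification is to gradually reduce $\eta$ as the number of gradient descent steps grows.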
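A runnable sketch of the batch algorithm for a single linear unit follows. Several details are assumptions made for the example rather than part of the slides: the termination condition is a fixed number of epochs, the initial weights are drawn from $[-0.05, 0.05]$, and the toy data set follows $t = 1 + 2x$ exactly.

```python
import random


def gradient_descent(training_examples, eta, epochs, seed=0):
    """Batch gradient descent for a linear unit o = w . x (with x_0 = 1)."""
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):  # assumed termination condition: fixed epochs
        delta_w = [0.0] * (n + 1)  # initialize each delta_w_i to zero
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))  # linear unit output
            for i, xi in enumerate(xs):
                delta_w[i] += eta * (t - o) * xi  # accumulate over examples
        # Weights change only after all examples have been seen.
        w = [wi + dwi for wi, dwi in zip(w, delta_w)]
    return w


# Toy data generated from t = 1 + 2 x (an assumption for illustration).
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w = gradient_descent(data, eta=0.05, epochs=500)
print(w)  # expected to be close to [1.0, 2.0]
```

With this data a noticeably larger learning rate (for example 0.2) makes the updates diverge, which illustrates why $\eta$ has to be chosen small enough.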

Stochastic Approximation to Gradient Descent
Gradient descent is a general strategy for searching through a large or infinite hypothesis space. It can be applied whenever
– The hypothesis space contains continuously parameterized hypotheses
– The error can be differentiated with respect to these hypothesis parameters
Major difficulties of gradient descent:
– Converging to a local minimum can sometimes be quite slow
– If the error surface has multiple local minima, there is no guarantee of finding the global minimum

Stochastic Gradient Descent Algorithm
Gradient-Descent(training_examples, $\eta$), stochastic version
Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values and $t$ is the target output value. $\eta$ is the learning rate.
– Initialize each $w_i$ to some small random value
– Until the termination condition is met, Do
    – For each $\langle \vec{x}, t \rangle$ in training_examples, Do
        – Input the instance $\vec{x}$ to the unit and compute the output $o$
        – For each linear unit weight $w_i$, Do
            $w_i \leftarrow w_i + \eta (t - o) x_i$
Unlike the batch version, no accumulator $\Delta w_i$ is needed, because each weight is updated immediately after each example.

Stochastic Approximation to Gradient Descent
Stochastic gradient descent:
– Approximates the gradient descent search by updating weights incrementally, following the calculation of the error for each individual example:
$\Delta w_i = \eta (t - o) x_i$
– Can be viewed as using a distinct error function for each individual training example $d$:
$E_d(\vec{w}) = \frac{1}{2} (t_d - o_d)^2$
– Provides a reasonable approximation to descending the gradient with respect to the original error function $E(\vec{w})$
– By making $\eta$ sufficiently small, it can be made to approximate true gradient descent arbitrarily closely
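The incremental update can be sketched as follows for a single linear unit. As with any such sketch, the specifics are assumptions: a fixed number of epochs stands in for the termination condition, examples are visited in a fixed order, and the toy data follows $t = 1 + 2x$ exactly.

```python
import random


def stochastic_gradient_descent(training_examples, eta, epochs, seed=0):
    """Incremental (stochastic) gradient descent for a linear unit."""
    rng = random.Random(seed)
    n = len(training_examples[0][0])
    w = [rng.uniform(-0.05, 0.05) for _ in range(n + 1)]
    for _ in range(epochs):  # assumed termination condition: fixed epochs
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            # Update immediately, using only the error on this one example.
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, xs)]
    return w


# Toy data generated from t = 1 + 2 x (an assumption for illustration).
data = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]
w_sgd = stochastic_gradient_descent(data, eta=0.05, epochs=1000)
print(w_sgd)  # expected to be close to [1.0, 2.0]
```

The only structural difference from the batch version is that the weight vector changes inside the loop over examples instead of once per pass.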

Gradient Descent and Stochastic Gradient Descent
In standard gradient descent, the error is summed over all examples before updating the weights, whereas in stochastic gradient descent the weights are updated upon examining each training example.
Summing over multiple examples in standard gradient descent requires more computation per weight update step. On the other hand, because it uses the true gradient, standard gradient descent is often used with a larger step size per weight update than stochastic gradient descent.
In cases where there are multiple local minima with respect to E, stochastic gradient descent can sometimes avoid falling into these local minima, because it follows the gradients of the individual $E_d$ rather than the gradient of E.

Multilayer Network and The Backpropagation Algorithm
Single perceptrons can only express linear decision surfaces.
In contrast, the kind of multilayer networks learned by the Backpropagation algorithm are capable of expressing a rich variety of nonlinear decision surfaces.


A Differentiable Threshold Unit
What type of unit shall we use as the basis for constructing multilayer networks?
– Multiple layers of cascaded linear units still produce only linear functions
– The perceptron unit is another option; however, its discontinuous threshold makes it non-differentiable and hence unsuitable for gradient descent
Sigmoid unit:
– Its output is a nonlinear function of its inputs
– Its output is a differentiable function of its inputs

A Differentiable Threshold Unit
The sigmoid unit computes its output $o$ as
$o = \sigma(\vec{w} \cdot \vec{x})$
where
$\sigma(y) = \frac{1}{1 + e^{-y}}$

A Differentiable Threshold Unit
Sigmoid function:
– Also called the logistic function
– Its output ranges between 0 and 1
– It increases monotonically with its input
– Its derivative is easily expressed in terms of its output: $\frac{d\sigma(y)}{dy} = \sigma(y)\,(1 - \sigma(y))$
Other functions:
– Other differentiable functions with easily calculated derivatives are sometimes used in place of $\sigma$
– For example, $e^{-y}$ in the sigmoid function can be replaced by $e^{-k \cdot y}$, where $k$ is some positive constant that determines the steepness of the threshold
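The sigmoid and its derivative can be sketched directly from these properties. The function names and the default `k=1.0` are assumptions for this example; with a steepness constant $k$, the chain rule gives the derivative $k \, o \, (1 - o)$ in terms of the output $o$.

```python
import math


def sigmoid(y, k=1.0):
    """Logistic function 1 / (1 + e^(-k*y)); k controls the steepness."""
    return 1.0 / (1.0 + math.exp(-k * y))


def sigmoid_derivative_from_output(o, k=1.0):
    """Derivative with respect to y, expressed through the output o."""
    return k * o * (1.0 - o)


o = sigmoid(0.0)
print(o)                                  # prints 0.5
print(sigmoid_derivative_from_output(o))  # prints 0.25
```

Expressing the derivative through the output is what makes the backward pass cheap: the value of $o$ is already available from the forward computation.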

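Backpropagation Algorithm
The Backpropagation algorithm learns the weights for a multilayer network, given a network with a fixed set of units and interconnections.
It employs gradient descent to attempt to minimize the squared error between the network output values and the target values.
Because the network can have several output units, the errors are summed over all of the network output units:
$E(\vec{w}) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2$
where $outputs$ is the set of output units, and $t_{kd}$ and $o_{kd}$ are the target and output values of output unit $k$ for training example $d$.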

Backpropagation Algorithm
The algorithm searches a large hypothesis space defined by all possible weight values for all the units in the network.
One major difference in the case of multilayer networks is that the error surface can have multiple local minima, so gradient descent is guaranteed only to converge toward some local minimum.
Despite this, Backpropagation can still produce excellent results in many real-world applications.

Backpropagation Algorithm
Backpropagation(training_examples, $\eta$, $n_{in}$, $n_{out}$, $n_{hidden}$)
– Create a feed-forward network with $n_{in}$ inputs, $n_{hidden}$ hidden units, and $n_{out}$ output units
– Initialize all network weights to small random numbers (e.g., between -0.05 and 0.05)
– Until the termination condition is met, Do
    – For each $\langle \vec{x}, \vec{t} \rangle$ in training_examples, Do
        Propagate the input forward through the network:
        1. Input the instance $\vec{x}$ to the network and compute the output $o_u$ of every unit $u$ in the network
        Propagate the errors backward through the network:
        2. For each network output unit $k$, calculate its error term
            $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
        3. For each hidden unit $h$, calculate its error term
            $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{kh} \delta_k$
        4. Update each network weight
            $w_{ji} \leftarrow w_{ji} + \Delta w_{ji}$, where $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$
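The four steps can be sketched for a network with one hidden layer of sigmoid units as shown below. This is an illustrative sketch, not the slides' implementation: the dictionary layout of the network, the function names, the fixed epoch count used as termination condition, and the toy OR data with 0.1/0.9 targets are all assumptions.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def init_network(n_in, n_hidden, n_out, seed=0):
    """Each unit is a weight list; index 0 pairs with the constant input 1."""
    rng = random.Random(seed)

    def layer(n_units, n_inputs):
        return [[rng.uniform(-0.05, 0.05) for _ in range(n_inputs + 1)]
                for _ in range(n_units)]

    return {"hidden": layer(n_hidden, n_in), "output": layer(n_out, n_hidden)}


def forward(net, x):
    """Step 1: compute the output of every unit in the network."""
    xs = [1.0] + list(x)
    hidden = [sigmoid(sum(w * v for w, v in zip(u, xs))) for u in net["hidden"]]
    hs = [1.0] + hidden
    out = [sigmoid(sum(w * v for w, v in zip(u, hs))) for u in net["output"]]
    return xs, hs, out


def backprop_update(net, x, t, eta):
    xs, hs, out = forward(net, x)
    # Step 2: error term of each output unit k.
    delta_k = [o * (1 - o) * (tk - o) for o, tk in zip(out, t)]
    # Step 3: error term of each hidden unit h (hs[h + 1] is its output).
    delta_h = [hs[h + 1] * (1 - hs[h + 1])
               * sum(net["output"][k][h + 1] * delta_k[k]
                     for k in range(len(delta_k)))
               for h in range(len(net["hidden"]))]
    # Step 4: w_ji <- w_ji + eta * delta_j * x_ji for every weight.
    for k, unit in enumerate(net["output"]):
        for i in range(len(unit)):
            unit[i] += eta * delta_k[k] * hs[i]
    for h, unit in enumerate(net["hidden"]):
        for i in range(len(unit)):
            unit[i] += eta * delta_h[h] * xs[i]


def total_error(net, examples):
    return 0.5 * sum((tk - ok) ** 2
                     for x, t in examples
                     for tk, ok in zip(t, forward(net, x)[2]))


# Toy task (assumption): OR with targets 0.1 / 0.9 instead of 0 / 1.
examples = [([0, 0], [0.1]), ([0, 1], [0.9]), ([1, 0], [0.9]), ([1, 1], [0.9])]
net = init_network(n_in=2, n_hidden=2, n_out=1)
before = total_error(net, examples)
for _ in range(2000):  # assumed termination condition: fixed epochs
    for x, t in examples:
        backprop_update(net, x, t, eta=0.3)
after = total_error(net, examples)
print(before, after)
```

The hidden error terms are computed before any weight is changed, so step 3 uses the same weights that produced the forward pass.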

Backpropagation Algorithm
The algorithm as stated applies to layered feedforward networks containing two layers of sigmoid units, with the units at each layer connected to all units from the preceding layer.
This is the incremental, or stochastic gradient descent, version of the Backpropagation algorithm.
Notation:
– An index is assigned to each node in the network, where a node is either an input to the network or the output of some unit in the network
– $x_{ji}$ denotes the input from node $i$ to unit $j$, and $w_{ji}$ denotes the corresponding weight
– $\delta_n$ denotes the error term associated with unit $n$

Backpropagation Algorithm
The algorithm begins by constructing a network with the desired number of hidden and output units and initializing all network weights to small random values.
Given this fixed network structure, the main loop of the algorithm then repeatedly iterates over the training examples.
For each training example, it applies the network to the example, calculates the error of the network output for this example, computes the gradient with respect to this error, and updates all the weights.
This gradient descent step is iterated until the network performs acceptably well.

Weight Update Rule
The weight update rule is similar to the delta rule.
Each weight is updated in proportion to the learning rate $\eta$, the input value $x_{ji}$ to which the weight is applied, and the error in the output of the unit.
The error $(t - o)$ in the delta rule is replaced by a more complex error term $\delta_j$.

Error in Backpropagation Algorithm
Error term for output unit $k$:
– $\delta_k$ is the familiar $(t_k - o_k)$ from the delta rule, multiplied by the factor $o_k (1 - o_k)$, which is the derivative of the sigmoid squashing function
Error term for hidden unit $h$:
– No target values are directly available to indicate the error of hidden units' values
– The error term for hidden unit $h$ is calculated by summing the error terms $\delta_k$ of each output unit influenced by $h$, weighting each $\delta_k$ by $w_{kh}$, the weight from hidden unit $h$ to output unit $k$

Backpropagation Algorithm
The weights are updated incrementally, following the presentation of each training example. This corresponds to a stochastic approximation to gradient descent.
To obtain the true gradient of E, one would sum the $\delta_j x_{ji}$ values over all training examples before altering the weight values.
The weight-update loop may be iterated thousands of times in a typical application.
A variety of termination conditions can be used to halt the procedure:
– Halt after a fixed number of iterations
– Halt once the error on the training examples falls below some threshold
– Halt once the error on a separate validation set of examples meets some criterion
The termination condition should be chosen with care, since too many iterations can lead to overfitting the training data.
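The three termination conditions can be combined into a single check, as in the hedged sketch below. The function name `should_stop`, its default thresholds, and the particular validation criterion (stop when the last `patience` validation errors are all worse than the best earlier one) are assumptions chosen for illustration; the slides do not specify a criterion.

```python
def should_stop(iteration, train_error, val_history,
                max_iters=5000, train_tol=0.01, patience=5):
    """Return True if any of the three termination conditions holds.

    val_history is the list of validation-set errors measured so far.
    """
    # 1. Halt after a fixed number of iterations.
    if iteration >= max_iters:
        return True
    # 2. Halt once the training error falls below a threshold.
    if train_error < train_tol:
        return True
    # 3. Halt once validation error has stopped improving (assumed criterion):
    #    the best of the last `patience` values is worse than the best before.
    if len(val_history) > patience:
        recent_best = min(val_history[-patience:])
        earlier_best = min(val_history[:-patience])
        if recent_best > earlier_best:
            return True
    return False


# Validation error bottomed out at 0.4 and has been rising since.
print(should_stop(10, 1.0, [0.5, 0.4, 0.45, 0.46, 0.47], patience=3))  # prints True
```

In a training loop, such a check would be called once per pass over the training examples, with the weights that gave the lowest validation error kept as the final hypothesis.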

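Derivation of the Backpropagation Rules
Stochastic gradient descent involves iterating through the training examples one at a time, for each training example $d$ descending the gradient of the error $E_d$ with respect to this single example:
$\Delta w_{ji} = -\eta \frac{\partial E_d}{\partial w_{ji}}$
where $E_d$ is the error on training example $d$, summed over all output units in the network:
$E_d(\vec{w}) = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$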

Subscripts and Variables
– $x_{ji}$: the $i$th input to unit $j$
– $w_{ji}$: the weight associated with the $i$th input to unit $j$
– $net_j$: the weighted sum of inputs for unit $j$, $net_j = \sum_i w_{ji} x_{ji}$
– $o_j$: the output computed by unit $j$
– $t_j$: the target output for unit $j$
– $\sigma$: the sigmoid function
– $outputs$: the set of units in the final layer of the network
– $Downstream(j)$: the set of units whose immediate inputs include the output of unit $j$

Derivation of the Backpropagation Rules
We consider two cases in turn: the case where unit $j$ is an output unit of the network, and the case where $j$ is an internal (hidden) unit.
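The output-unit case can be worked through as follows. This is a sketch of the standard derivation, assuming sigmoid units with $o_j = \sigma(net_j)$, $net_j = \sum_i w_{ji} x_{ji}$, and the per-example error $E_d = \frac{1}{2} \sum_{k \in outputs} (t_k - o_k)^2$. The first line holds for any unit, because $w_{ji}$ can influence the rest of the network only through $net_j$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial w_{ji}}
  &= \frac{\partial E_d}{\partial net_j} \, \frac{\partial net_j}{\partial w_{ji}}
   = \frac{\partial E_d}{\partial net_j} \, x_{ji} \\[4pt]
\text{Output unit } j: \quad
\frac{\partial E_d}{\partial net_j}
  &= \frac{\partial E_d}{\partial o_j} \, \frac{\partial o_j}{\partial net_j} \\
\frac{\partial E_d}{\partial o_j}
  &= \frac{\partial}{\partial o_j} \, \frac{1}{2} (t_j - o_j)^2 = -(t_j - o_j) \\
\frac{\partial o_j}{\partial net_j}
  &= \frac{\partial \sigma(net_j)}{\partial net_j} = o_j (1 - o_j) \\
\frac{\partial E_d}{\partial net_j}
  &= -(t_j - o_j) \, o_j (1 - o_j) \\[4pt]
\Delta w_{ji}
  &= -\eta \, \frac{\partial E_d}{\partial w_{ji}}
   = \eta \, (t_j - o_j) \, o_j (1 - o_j) \, x_{ji}
\end{align*}
```

Writing $\delta_j = -\partial E_d / \partial net_j$ gives $\delta_j = (t_j - o_j)\, o_j (1 - o_j)$ and $\Delta w_{ji} = \eta \, \delta_j \, x_{ji}$, which matches the error term used for output units in the algorithm.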

Hidden Unit Weight
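For a hidden unit $j$, the derivation has to account for every unit that $j$ feeds into. The sketch below assumes the same notation as the rest of the derivation: $net_j$ influences $E_d$ only through the units in $Downstream(j)$, and $\delta_k = -\partial E_d / \partial net_k$ denotes the error term of unit $k$.

```latex
\begin{align*}
\frac{\partial E_d}{\partial net_j}
  &= \sum_{k \in Downstream(j)}
     \frac{\partial E_d}{\partial net_k} \, \frac{\partial net_k}{\partial net_j} \\
  &= \sum_{k \in Downstream(j)}
     -\delta_k \, \frac{\partial net_k}{\partial o_j} \, \frac{\partial o_j}{\partial net_j} \\
  &= \sum_{k \in Downstream(j)}
     -\delta_k \, w_{kj} \, o_j (1 - o_j) \\[4pt]
\delta_j
  &= -\frac{\partial E_d}{\partial net_j}
   = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k \, w_{kj} \\[4pt]
\Delta w_{ji}
  &= \eta \, \delta_j \, x_{ji}
\end{align*}
```

In a network with a single hidden layer, $Downstream(j)$ is simply the set of output units, which recovers the hidden-unit error term of the algorithm.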

Convergence and Local Minima
Backpropagation is guaranteed only to converge toward some local minimum of E, not necessarily to the global minimum error.
Despite this, Backpropagation is a highly effective function approximation method in practice.

Overfitting

As training proceeds, some weights begin to grow in order to reduce the error over the training data, and the complexity of the learned decision surface increases.
Given enough iterations, Backpropagation will often be able to create overly complex decision surfaces that fit noise in the training data or unrepresentative characteristics of the particular training sample.

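Solutions for Overfitting
Weight decay:
– Decrease each weight by some small factor during each iteration
Validation data:
– Cross-validation: measure error on a validation set held out from the training data
– K-fold cross-validation when the data set is small
– Separate test data for the final estimate of accuracy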
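Both remedies are simple to sketch. In the example below, the function names, the decay factor, and the way folds are formed (every k-th index) are assumptions for illustration. Weight decay is shown as shrinking each weight by a factor $(1 - \text{decay})$ before adding the usual update $\Delta w$.

```python
def update_with_weight_decay(weights, delta_w, decay=0.001):
    """Shrink each weight by a small factor, then apply the normal update."""
    return [(1.0 - decay) * w + dw for w, dw in zip(weights, delta_w)]


def k_fold_splits(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]  # fold i takes every k-th index
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


# With zero gradient updates, decay alone pulls the weights toward zero.
print(update_with_weight_decay([1.0, -2.0], [0.0, 0.0], decay=0.1))

for train_idx, val_idx in k_fold_splits(6, 3):
    print(train_idx, val_idx)
```

Each example appears in exactly one validation fold, so every example is used for validation once and for training in the remaining folds.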

An Illustrative Example
Training data:
– Images of 20 different people
– Approximately 32 images per person, varying in
    – Expression (happy, sad, angry, neutral)
    – Direction in which they were looking (left, right, straight ahead, up)
    – Whether or not they were wearing sunglasses
– 624 greyscale images in total, each with a resolution of 120×128
Output: the direction in which the person is looking

Learned Hidden Representations
[Figure: network weights after 1 iteration and after 100 iterations, for 30×32 resolution input images. Output units: left, straight, right, up.]

Face Recognition
Input encoding: the ANN input must be some representation of the image. Two design options:
– Preprocess the image to extract edges, regions of uniform intensity, or other local image features. One difficulty with this option is that it would lead to a variable number of features per image, whereas the ANN has a fixed number of input units
– Encode the image as a fixed set of 30×32 pixel intensity values, with one network input per pixel. Intensity values range from 0 to 255

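Face Recognition
Output encoding: the ANN must output one of four values indicating the direction in which the person is looking. Two options:
– A single output unit, with a different output value assigned to each direction
– Four distinct output units, one per direction (1-of-n encoding), taking the highest-valued output as the prediction
Advantages of the 1-of-n encoding:
– It provides more degrees of freedom to the network for representing the target function
– The difference between the highest-valued output and the second-highest can be used as a measure of the confidence in the network prediction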

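Face Recognition
Output target values:
– Four target values are needed for each image, one per output unit. An obvious choice is 1 for the unit of the correct direction and 0 for the others
– We use 0.9 and 0.1 instead, e.g. $\langle 0.9, 0.1, 0.1, 0.1 \rangle$ for the first direction
– The reason for avoiding target values of 0 and 1 is that sigmoid units cannot produce these output values given finite weights, so gradient descent would force the weights to grow without bound
– Values of 0.1 and 0.9 are achievable using a sigmoid unit with finite weights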
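The 1-of-n encoding with 0.9/0.1 targets, and the confidence measure based on the two highest outputs, can be sketched as follows. The ordering of the four directions, the function names, and the example output vector are assumptions made for this illustration.

```python
# Assumed ordering of the four output units.
DIRECTIONS = ["left", "straight", "right", "up"]


def encode_target(direction, high=0.9, low=0.1):
    """1-of-n target vector: `high` for the correct unit, `low` elsewhere."""
    return [high if d == direction else low for d in DIRECTIONS]


def decode_output(outputs):
    """Return (predicted direction, confidence).

    Confidence is the gap between the highest and second-highest outputs.
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    best, second = ranked[0], ranked[1]
    return DIRECTIONS[best], outputs[best] - outputs[second]


print(encode_target("left"))  # prints [0.9, 0.1, 0.1, 0.1]

# Hypothetical network output for one image.
label, confidence = decode_output([0.2, 0.8, 0.3, 0.1])
print(label, round(confidence, 3))
```

A small gap between the top two outputs signals an ambiguous prediction, which a downstream system could treat as "uncertain" rather than committing to a direction.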

Face Recognition
Network graph structure:
– How many units should the network include, and how should they be interconnected?
– A layered network is used, with feedforward connections from every unit in one layer to every unit in the next (two layers of sigmoid units)
– How many hidden units?
    – 3 units: 90% accuracy with 5 minutes running time
    – 100 units: 91%-92% accuracy with 1 hour running time
– Some minimum number of hidden units is needed to learn the target function accurately; extra hidden units above this number do not dramatically affect generalization accuracy
– Increasing the number of hidden units often increases the tendency to overfit the training data

Face Recognition
Other learning algorithm parameters:
– The learning rate $\eta$ was set to 0.3 and the momentum $\alpha$ was set to 0.3
– A lower learning rate results in a longer running time
– Full gradient descent was used in all these experiments
– Weights are initialized to 0 at the beginning
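The momentum parameter refers to the common variant of the weight update in which part of the previous update is carried over: $\Delta w_{ji}(n) = \eta \, \delta_j \, x_{ji} + \alpha \, \Delta w_{ji}(n-1)$. The sketch below applies that formula to a single weight; the function name and the example gradient term are assumptions, while the defaults mirror the values 0.3 and 0.3 quoted above.

```python
def momentum_step(weight, gradient_term, previous_delta, eta=0.3, alpha=0.3):
    """One weight update with momentum.

    gradient_term stands for delta_j * x_ji; previous_delta is the update
    applied to this weight on the previous iteration.
    """
    delta = eta * gradient_term + alpha * previous_delta
    return weight + delta, delta


# Two consecutive steps with the same (hypothetical) gradient term of 0.5.
w, d = momentum_step(1.0, 0.5, previous_delta=0.0)
print(round(w, 4), round(d, 4))  # prints 1.15 0.15
w, d = momentum_step(w, 0.5, previous_delta=d)
print(round(w, 4), round(d, 4))  # prints 1.345 0.195
```

When successive gradient terms point the same way, the carried-over fraction makes the steps grow, which helps the search move faster across flat regions of the error surface.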