1 Mehran University of Engineering and Technology, Jamshoro Institute of Information Technology Third Term ME CSN & IT Neural Networks By Dr. Mukhtiar.

Presentation transcript:

1 Mehran University of Engineering and Technology, Jamshoro Institute of Information Technology Third Term ME CSN & IT Neural Networks By Dr. Mukhtiar Ali Unar

2 Statistical Nature of the Learning Process: A neural network is merely one form in which empirical knowledge about a physical phenomenon or environment of interest may be encoded through training. By empirical knowledge we mean a set of measurements that characterizes the phenomenon. To be more specific, consider a stochastic phenomenon described by a random vector $\mathbf{X}$ consisting of a set of independent variables, and a random scalar $D$ representing a dependent variable. Suppose that we have $N$ realizations of the random vector $\mathbf{X}$, denoted by $\{\mathbf{x}_i\}_{i=1}^{N}$, and a corresponding set of realizations of the random scalar $D$, denoted by $\{d_i\}_{i=1}^{N}$.

3 These realizations (measurements) constitute the training sample, denoted by

$$T = \{(\mathbf{x}_i, d_i)\}_{i=1}^{N} \tag{1}$$

Ordinarily we do not have knowledge of the exact functional relationship between $\mathbf{X}$ and $D$, so we proceed by proposing the model

$$D = f(\mathbf{X}) + \varepsilon \tag{2}$$

where $f(\cdot)$ is a deterministic function of its argument vector, and $\varepsilon$ is a random expectational error that represents our "ignorance" about the dependence of $D$ on $\mathbf{X}$. The statistical model described by equation (2) is called a regressive model, and is depicted in Fig. 1.

[Fig. 1: Regressive model. The input X is passed through f(.) and the error ε is added to produce the output d.]

4 The expectational error $\varepsilon$ is, in general, a random variable with zero mean and positive probability of occurrence. On this basis, the regressive model of Fig. 1 has two useful properties:

1. The mean value of the expectational error $\varepsilon$, given any realization of $\mathbf{X}$, is zero; that is,

$$E[\varepsilon \mid \mathbf{X}] = 0 \tag{3}$$

where $E$ is the statistical expectation operator. As a corollary to this property, the regression function $f(\mathbf{X})$ is the conditional mean of the model output $D$, given that the input $\mathbf{X} = \mathbf{x}$:

$$f(\mathbf{x}) = E[D \mid \mathbf{x}] \tag{4}$$

2. The expectational error $\varepsilon$ is uncorrelated with the regression function $f(\mathbf{X})$; that is,

$$E[\varepsilon\, f(\mathbf{X})] = 0$$

This property is the well-known principle of orthogonality, which states that all the information about $D$ available to us through the input $\mathbf{X}$ has been encoded into the regression function $f(\mathbf{X})$.

5 A neural network provides an approximation to the regressive model of Fig. 1. Let the actual response of the neural network to the input vector $\mathbf{X}$ be denoted by

$$Y = f(\mathbf{X}, \mathbf{w}) \tag{5}$$

where $f(\cdot, \mathbf{w})$ is the input-output function realized by the neural network. Given the training data $T$ of equation (1), the weight vector $\mathbf{w}$ is obtained by minimizing the cost function

$$\mathcal{E}(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}\bigl(d_i - f(\mathbf{x}_i, \mathbf{w})\bigr)^2 \tag{6}$$

where $(\mathbf{x}_i, d_i)$ is the $i$th training example. In statistical terms, the error function $\mathcal{E}(\mathbf{w})$ may be expressed as

$$\mathcal{E}(\mathbf{w}) = B(\mathbf{w}) + V(\mathbf{w}) \tag{7}$$

The term $B(\mathbf{w})$ is the bias term, which represents the inability of the neural network to accurately approximate the regression function. The term $V(\mathbf{w})$ is the variance, which represents the inadequacy of the information contained in the training sample about the regression function.

6 Committee Machines: Committee machines address a supervised learning task. The approach is based on a commonly used engineering principle: divide and conquer. According to this principle, a complex computational task is solved by dividing it into a number of computationally simple tasks and then combining the solutions to those tasks. In supervised learning, computational simplicity is achieved by distributing the learning task among a number of experts, which in turn divide the input space into a set of subspaces. The combination of experts is said to constitute a committee machine.

7 Committee machines are universal approximators. Classification of Committee Machines:

Static Structures: In this class of committee machines, the responses of several predictors (experts) are combined by means of a mechanism that does not involve the input signal, hence the designation static. This category includes the following methods:
• Ensemble averaging, where the outputs of different predictors are linearly combined to produce an overall output.
• Boosting, where a weak learning algorithm is converted into one that achieves arbitrarily high accuracy.

8 Dynamic Structures: In this second class of committee machines, the input signal is directly involved in actuating the mechanism that integrates the outputs of the individual experts into an overall output, hence the designation dynamic. There are two kinds of dynamic structures:
• Mixture of experts, in which the individual responses of the experts are nonlinearly combined by means of a single gating network.
• Hierarchical mixture of experts, in which the individual responses of the experts are nonlinearly combined by means of several gating networks arranged in a hierarchical fashion.

9 The mixture of experts and hierarchical mixture of experts may also be viewed as examples of modular networks. A formal definition of modularity is as follows: a neural network is said to be modular if the computation performed by the network can be decomposed into two or more modules (subsystems) that operate on distinct inputs without communicating with each other. The outputs of the modules are mediated by an integrating unit that is not permitted to feed information back to the modules. In particular, the integrating unit both (1) decides how the outputs of the modules should be combined to form the final output of the system, and (2) decides which modules should learn which training patterns.

10 Ensemble Averaging: Fig. 2 shows a number of differently trained neural networks (i.e. experts) that share a common input and whose individual outputs are combined to produce an overall output y. To simplify the presentation, the outputs of the experts are assumed to be scalar valued. Such a technique is known as the ensemble averaging method.

[Fig. 2: Experts 1, 2, ..., K receive the common input x[n]; a combiner merges their outputs into the overall output y.]

11 The motivation for using ensemble averaging is twofold:
• If the combination of experts in Fig. 2 were replaced by a single neural network, we would have a network with a correspondingly large number of adjustable parameters. The training time for such a large network is likely to be longer than for a set of experts trained in parallel.
• The risk of overfitting the data increases when the number of adjustable parameters is large compared to the cardinality (i.e. size) of the training data set.
In using a committee machine, the expectation is that the differently trained networks converge to different local minima on the error surface, and that the overall performance is improved by combining their outputs in some way.
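The combiner of Fig. 2 can be illustrated with a minimal Python sketch. The function name `ensemble_average`, the equal combiner weights and the three toy experts are assumptions made for illustration; the slides only state that the expert outputs are linearly combined.

```python
def ensemble_average(experts, x, weights=None):
    """Linearly combine the scalar outputs of several experts for one input x."""
    outputs = [expert(x) for expert in experts]
    if weights is None:
        # Assumption: a plain arithmetic mean, i.e. equal weights 1/K.
        weights = [1.0 / len(outputs)] * len(outputs)
    return sum(w * y for w, y in zip(weights, outputs))


# Hypothetical "differently trained" experts: each approximates f(x) = 2x
# with its own small error, standing in for networks that converged to
# different local minima.
experts = [
    lambda x: 2.0 * x + 0.3,
    lambda x: 2.0 * x - 0.1,
    lambda x: 2.0 * x - 0.2,
]

y = ensemble_average(experts, 1.5)  # individual outputs: 3.3, 2.9, 2.8
```

In this toy case the individual errors cancel in the average, which is the effect the committee is hoping for; with real networks the cancellation is only partial.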

12 Boosting: Boosting is another method that belongs to the "static" class of committee machines, but it is quite different from ensemble averaging. In a committee machine based on ensemble averaging, all the experts are trained on the same data set; they may differ from each other only in the choice of initial conditions used in network training. By contrast, in a boosting machine the experts are trained on data sets with entirely different distributions. Boosting is a general method that can be used to improve the performance of any learning algorithm.

13 Boosting can be implemented in three fundamentally different ways:
• Boosting by filtering: This approach involves filtering the training examples by different versions of a weak learning algorithm. It assumes the availability of a large (in theory, infinite) source of examples, with the examples being either discarded or kept during training. An advantage of this approach is that it allows for a small memory requirement compared to the other two approaches.

14
• Boosting by subsampling: This second approach works with a training sample of fixed size. The examples are "resampled" according to a given probability distribution during training. The error is calculated with respect to the fixed training sample.
• Boosting by reweighting: This third approach also works with a fixed training sample, but it assumes that the weak learning algorithm can receive "weighted" examples. The error is calculated with respect to the weighted examples.
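The difference between the two fixed-sample approaches can be sketched in a few lines of Python. The helper names `weighted_error` and `resample`, the toy labels and the example weights are hypothetical; the slides do not specify how the distribution or the weights are chosen.

```python
import random


def weighted_error(labels, predictions, weights):
    """Reweighting: the error is measured on the weighted examples."""
    total = sum(weights)
    wrong = sum(w for d, y, w in zip(labels, predictions, weights) if d != y)
    return wrong / total


def resample(examples, distribution, rng):
    """Subsampling: draw a same-size sample from the fixed training set."""
    return rng.choices(examples, weights=distribution, k=len(examples))


labels = [1, 0, 1, 1]
predictions = [1, 1, 1, 0]        # the examples at index 1 and 3 are misclassified
weights = [0.1, 0.4, 0.1, 0.4]    # hypothetical weights favouring the hard examples

err = weighted_error(labels, predictions, weights)
sample = resample(list(range(4)), weights, random.Random(0))
```

With uniform weights the same predictions would give an error of 0.5; the weighting raises it to 0.8 because the misclassified examples carry most of the weight.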

15 Boosting by Filtering: In boosting by filtering, the committee machine consists of three experts. The algorithm used to train them is called a boosting algorithm. The three experts are arbitrarily labeled "first", "second" and "third", and are individually trained as follows. The first expert is trained on a set consisting of $N_1$ examples. The trained first expert is then used to filter another set of examples by proceeding in the following manner:
• Flip a fair coin; this in effect simulates a random guess.
• If the result is heads, pass new patterns through the first expert, and discard correctly classified patterns until a pattern is misclassified. This misclassified pattern is added to the training set for the second expert.

16 Boosting by Filtering:
• If the result is tails, do the opposite. Specifically, pass new patterns through the first expert and discard incorrectly classified patterns until a pattern is classified correctly. That correctly classified pattern is added to the training set for the second expert.
• Continue this process until a total of $N_1$ examples have been filtered by the first expert. This set of filtered examples constitutes the training set for the second expert.
• In this way, the second expert is forced to learn a distribution different from that learned by the first expert.
• Once the second expert has been trained in the usual way, a third training set is formed for the third expert by proceeding in the following manner:

17 Boosting by Filtering:
• Pass a new pattern through both the first and second experts. If the two experts agree in their decisions, discard the pattern. If, on the other hand, they disagree, the pattern is added to the training set for the third expert.
• Continue this process until a total of $N_1$ examples have been filtered jointly by the first and second experts. This set of jointly filtered examples constitutes the training set for the third expert.
• The third expert is then trained in the usual way, and the training of the entire committee machine is thereby completed.
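The two filtering stages described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecturer's implementation: the function names, the toy one-dimensional example source and the two threshold "experts" are all hypothetical, and the experts are assumed to be already trained.

```python
import random


def filter_for_second_expert(first, stream, n1, rng):
    """Collect n1 examples for the second expert using the coin-flip rule."""
    selected = []
    while len(selected) < n1:
        want_error = rng.random() < 0.5       # heads: look for a misclassified pattern
        for x, d in stream:
            misclassified = first(x) != d
            if misclassified == want_error:
                selected.append((x, d))       # keep it, then flip the coin again
                break
            # any other pattern is discarded
        else:
            break                             # the example source ran dry
    return selected


def filter_for_third_expert(first, second, stream, n1):
    """Keep only the patterns on which the first two experts disagree."""
    selected = []
    for x, d in stream:
        if first(x) != second(x):
            selected.append((x, d))
            if len(selected) == n1:
                break
    return selected


def example_stream(rng):
    """Hypothetical unlimited source: x uniform in [0, 1), label 1 if x > 0.5."""
    while True:
        x = rng.random()
        yield x, int(x > 0.5)


rng = random.Random(0)
stream = example_stream(rng)
first = lambda x: int(x > 0.4)     # hypothetical first expert, wrong for 0.4 < x <= 0.5
second = lambda x: int(x > 0.6)    # hypothetical second expert

set2 = filter_for_second_expert(first, stream, 20, rng)
set3 = filter_for_third_expert(first, second, stream, 20)
```

Because the coin is fair, roughly half of `set2` consists of patterns the first expert gets wrong, even though it misclassifies only about 10% of the raw stream; this is what forces the second expert to learn a different distribution.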

18 Mixture of Experts (ME) Model: This configuration consists of K expert networks, or simply experts, and an integrating unit called a gating network that performs the function of a mediator among the expert networks (see the figure below). It is assumed that the different experts work best in different regions of the input space.

[Figure: ME model. The input vector x feeds experts 1, 2, ..., K and the gating network; the gating outputs g_1, g_2, ..., g_K weight the expert outputs, which are summed to give the output signal y.]

19 The neurons of the experts are usually linear. The figure below shows the block diagram of a single neuron constituting expert k. The output of expert k is the inner product of the input vector $\mathbf{x}$ and the synaptic weight vector $\mathbf{w}_k$ of this neuron:

$$y_k = \mathbf{w}_k^T \mathbf{x}, \qquad k = 1, 2, \dots, K \tag{8}$$

[Figure: a single linear neuron of expert k, with inputs x_1, x_2, ..., x_m, synaptic weights w_k1, w_k2, ..., and output y_k.]

20 The gating network consists of a single layer of K neurons, with each neuron assigned to a specific expert. Fig. (a) below shows the architectural graph of the gating network, and Fig. (b) shows the block diagram of neuron k in that network.

[Fig. (a): a single layer of softmax neurons fed by the inputs x_1, x_2, ..., x_m. Fig. (b): neuron k, with weights a_k1, a_k2, ..., a_km forming u_k, followed by the softmax to give g_k.]

21 Unlike the experts, the neurons of the gating network are nonlinear, with their activation function defined by

$$g_k = \frac{\exp(u_k)}{\sum_{j=1}^{K}\exp(u_j)}, \qquad k = 1, 2, \dots, K \tag{9}$$

where $u_k$ is the inner product of the input vector $\mathbf{x}$ and the synaptic weight vector $\mathbf{a}_k$; i.e. $u_k = \mathbf{a}_k^T\mathbf{x}$, $k = 1, 2, \dots, K$. The normalized exponential function may be viewed as a multi-input generalization of the logistic function. It preserves the rank order of its input values, and is a differentiable generalization of the "winner takes all" operation of picking the maximum value. For this reason, the activation function of equation (9) is referred to as softmax.

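22 Let $y_k$ denote the output of the kth expert in response to the input vector $\mathbf{x}$. The overall output of the ME model is

$$y = \sum_{k=1}^{K} g_k\, y_k \tag{10}$$

Example: Consider an ME model with two experts, and a gating network with two outputs denoted by $g_1$ and $g_2$. The output $g_1$ is defined by

$$g_1 = \frac{\exp(u_1)}{\exp(u_1) + \exp(u_2)} = \frac{1}{1 + \exp\bigl(-(u_1 - u_2)\bigr)} \tag{11}$$

Let $\mathbf{a}_1$ and $\mathbf{a}_2$ denote the two weight vectors of the gating network. We may then write

$$u_k = \mathbf{a}_k^T\mathbf{x}, \qquad k = 1, 2 \tag{12}$$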

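23 and therefore rewrite equation (11) as:

$$g_1 = \frac{1}{1 + \exp\bigl(-(\mathbf{a}_1 - \mathbf{a}_2)^T\mathbf{x}\bigr)} \tag{13}$$

The other output $g_2$ of the gating network is

$$g_2 = 1 - g_1 = \frac{1}{1 + \exp\bigl(-(\mathbf{a}_2 - \mathbf{a}_1)^T\mathbf{x}\bigr)} \tag{14}$$

Along the ridge defined by $\mathbf{a}_1^T\mathbf{x} = \mathbf{a}_2^T\mathbf{x}$ (which includes the case $\mathbf{a}_1 = \mathbf{a}_2$), we have $g_1 = g_2 = 1/2$ and the two experts contribute equally to the output of the ME model. Away from the ridge, one or the other of the two experts assumes the dominant role.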
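A small Python sketch of a two-expert ME model may make the roles of the experts and the gating network concrete. It uses linear experts and a softmax gate as described on the slides; the weight values, the input and the helper names are hypothetical. The last lines check numerically that, for two experts, the softmax output g_1 equals a logistic function of (a_1 - a_2)^T x.

```python
import math


def dot(w, x):
    """Inner product of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def softmax(u):
    """Normalized exponential; subtracting max(u) avoids overflow."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]


def mixture_of_experts(x, expert_weights, gate_weights):
    """Return (y, g): the ME output and the gating network outputs."""
    y_k = [dot(w, x) for w in expert_weights]        # linear experts
    g = softmax([dot(a, x) for a in gate_weights])   # gating network
    return sum(gk * yk for gk, yk in zip(g, y_k)), g


# Hypothetical weights for two experts and a 2-dimensional input.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[2.0, -1.0], [-1.0, 2.0]]
x = [0.5, 1.5]

y, g = mixture_of_experts(x, w, a)

# Two-expert special case: g_1 is a logistic function of (a_1 - a_2)^T x.
diff = [p - q for p, q in zip(a[0], a[1])]
g1_logistic = 1.0 / (1.0 + math.exp(-dot(diff, x)))

# On the ridge a_1^T x = a_2^T x both experts contribute equally.
_, g_ridge = mixture_of_experts([1.0, 1.0], w, a)
```

For the input chosen here the second expert dominates, since u_2 exceeds u_1 by 3; on the ridge input `[1.0, 1.0]` both gating outputs are 1/2.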

24 Hierarchical Mixture of Experts (HME) Model: The HME model, illustrated on the next slide, is a natural extension of the ME model. The illustration is for an HME model of four experts. It has two layers of gating networks. By continuing with the application of the principle of divide and conquer in a manner similar to that illustrated, we may construct an HME model with any number of levels of hierarchy. The architecture of the HME model is like a tree in which the gating networks sit at the various nonterminals of the tree and the experts sit at the leaves of the tree. The HME model differs from the ME model in that the input space is divided into a nested set of subspaces, with the information combined and redistributed among the experts under the control of several gating networks arranged in a hierarchical manner.

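25 [Figure: HME model with four experts. The input vector x feeds all experts and gating networks. Experts (1,1) and (2,1), with outputs y_11 and y_21, are combined by gating network 1 with outputs g_1|1 and g_2|1. Experts (1,2) and (2,2), with outputs y_12 and y_22, are combined by gating network 2 with outputs g_1|2 and g_2|2. A top-level gating network with outputs g_1 and g_2 combines the two groups into the overall output y.]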
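The way the two layers of gating networks combine the four expert outputs can be written as y = sum over k of g_k times (sum over j of g_j|k times y_jk). A minimal Python sketch follows. As a simplifying assumption, the gate pre-activations are passed in directly instead of being computed as inner products with the input vector; the function name `hme_output` and the numbers are hypothetical.

```python
import math


def softmax(u):
    """Normalized exponential; subtracting max(u) avoids overflow."""
    m = max(u)
    e = [math.exp(v - m) for v in u]
    s = sum(e)
    return [v / s for v in e]


def hme_output(y, u_top, u_low):
    """Output of a two-level HME.

    y[k][j]     : output y_jk of expert (j, k), i.e. expert j in group k
    u_top[k]    : pre-activation of the top-level gate for group k
    u_low[k][j] : pre-activation of gating network k for expert j
    """
    g_top = softmax(u_top)                      # g_k
    total = 0.0
    for k, g_k in enumerate(g_top):
        g_cond = softmax(u_low[k])              # g_{j|k}
        group = sum(g_jk * y_jk for g_jk, y_jk in zip(g_cond, y[k]))
        total += g_k * group
    return total


# Hypothetical expert outputs: group 1 holds (1.0, 3.0), group 2 holds (5.0, 7.0).
y = [[1.0, 3.0], [5.0, 7.0]]

# All gates undecided: every expert gets weight 1/4.
y_equal = hme_output(y, [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]])

# Top-level gate strongly prefers group 1: the output approaches that group's mean.
y_group1 = hme_output(y, [50.0, -50.0], [[0.0, 0.0], [0.0, 0.0]])
```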

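26 Local Model Networks: [Figure: Local Model Network. The input u feeds n local models f_1(.), f_2(.), ..., f_n(.) with outputs y_1, y_2, ..., y_n; each output is weighted by its validity function ρ_1, ρ_2, ..., ρ_n, and the weighted outputs are summed to give the network output y.]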

27 A Local Model Network (LMN) is a set of models (experts) weighted by some activation function. The same input is fed to each model and the outputs are weighted according to some variable or variables, $\phi$:

$$y(t) = \sum_{i=1}^{n} \rho_i(\phi)\, y_i(t) \tag{1}$$

where $y(t)$ is the model network output, $\rho_i(\phi)$ is the validity (i.e. activation) function of the ith model, n is the number of models, and $y_i(t)$ is the output of the ith local model $f_i(\cdot)$. The weighting or activation of each local model is calculated using an activation function which is a function of the scheduling variable. The scheduling variable could be a system state variable, an input variable or some other system parameter. It is also feasible to schedule on more than one variable and so establish a multi-dimensional LMN.

28 Although any function with a locally limited activation may be applied as an activation function, Gaussian functions are applied most widely. Usually normalized validity functions are used. The validity function $\rho_i(\phi)$ can be normalized as

$$\rho_i(\phi) = \frac{\tilde{\rho}_i(\phi)}{\sum_{j=1}^{n}\tilde{\rho}_j(\phi)}$$

where $\tilde{\rho}_i(\phi)$ denotes the unnormalized function. The individual component models $f_i$ can be of any form; they can be linear or nonlinear, have a state-space or input-output description, or be discrete or continuous time. They can also be of different character, using physical models of the system for operating conditions where they are available, and parametric models for conditions where no physical description is available. They can also be ANN models such as MLP and RBF networks.
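A minimal Python sketch of an LMN with normalized Gaussian validity functions is given below. The common width for all Gaussians, the two linear local models and the function names are assumptions made for illustration; the slides leave the shape parameters and the form of the local models open.

```python
import math


def gaussian(phi, centre, width):
    """Unnormalized Gaussian validity function centred at `centre`."""
    return math.exp(-((phi - centre) ** 2) / (2.0 * width ** 2))


def validity(phi, centres, width):
    """Normalized validity functions; they sum to one for any phi.

    Note: if phi is extremely far from every centre, all Gaussians can
    underflow to zero; this sketch does not guard against that case.
    """
    raw = [gaussian(phi, c, width) for c in centres]
    total = sum(raw)
    return [r / total for r in raw]


def lmn_output(phi, u, local_models, centres, width):
    """Weighted sum of the local model outputs, all fed the same input u."""
    rho = validity(phi, centres, width)
    return sum(r * f(u) for r, f in zip(rho, local_models))


# Hypothetical setup: two linear local models scheduled on phi.
centres = [0.0, 10.0]
local_models = [lambda u: 1.0 * u, lambda u: 3.0 * u]

y_mid = lmn_output(5.0, 2.0, local_models, centres, width=2.0)  # halfway between centres
y_low = lmn_output(0.0, 2.0, local_models, centres, width=2.0)  # at the first centre
```

Halfway between the centres both models are equally valid, so the output is the mean of the two local outputs; at the first centre the output is almost exactly that of the first local model.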

29 The individual local models are smoothly interpolated by the validity functions $\rho_i$ to produce the overall model. The learning process in LMNs can be divided into two tasks:
• Find the optimal number, position and shape of the validity functions, i.e. define the structure of the network.
• Find the optimal set of parameters for the local models, i.e. define the parameters of the network. These parameters could be a complete set of coefficients of a linear model, numerical parameters of a nonlinear model, or even switches which alter the local model structure.

30 Advantages of LMNs:
• The LMN has a transparent structure which allows a direct analysis of local model properties.
• The LMN is less sensitive to the curse of dimensionality than many other local representations such as RBF networks.
• Nonlinear models based on LMNs are able to capture nonlinear effects and provide accuracy over a wide operational range.
• The LMN framework allows the integration of a priori knowledge to define the model structure for a particular problem. This leads to more interpretable models which can be more reliably identified from a limited amount of observed data.

31 Example: Modelling of Ship Dynamics. A ship is usually represented by the following mathematical equation:

[Equation not preserved in the transcript.]

where $\psi$ is the heading of the ship and $\delta$ is the rudder angle (control signal). The parameters m, d_1 and d_2 depend upon the operating conditions, which include the speed of the vessel, the depth of water, loading conditions and environmental disturbances. The table on the next slide shows how these parameters change with the forward speed of the ship.

32 Table: Variation of ship parameters with speed. Columns: Speed (m/sec), m, d_1, d_3. [Table values not preserved in the transcript.]

33 A Local Model Network can easily be developed to incorporate the parameter variations with respect to speed. For example, four models at speeds of (say) 4 m/sec, 8 m/sec, 12 m/sec and 16 m/sec can be interpolated together as shown on slide 26. Gaussian functions with centres at 4, 8, 12 and 16 m/sec can be used as validity functions, and speed may be regarded as the scheduling variable. Some results are shown next.
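As a worked illustration of how the four local models would be blended, consider a ship speed of $v = 10$ m/sec. The common Gaussian width $\sigma = 2$ m/sec is an assumption made here for the arithmetic; the slides do not state the widths used. The unnormalized validity functions at the centres $c_i = 4, 8, 12, 16$ m/sec are

$$\tilde{\rho}_i(v) = \exp\!\left(-\frac{(v - c_i)^2}{2\sigma^2}\right): \qquad e^{-4.5} \approx 0.0111, \quad e^{-0.5} \approx 0.6065, \quad e^{-0.5} \approx 0.6065, \quad e^{-4.5} \approx 0.0111$$

Their sum is approximately $1.2353$, so the normalized validity functions are

$$\rho_1 \approx 0.0090, \qquad \rho_2 \approx 0.4910, \qquad \rho_3 \approx 0.4910, \qquad \rho_4 \approx 0.0090$$

Under this assumption the behaviour at 10 m/sec is determined almost entirely, and in equal parts, by the 8 m/sec and 12 m/sec local models.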

34 [Figure: results at ship speeds of 10 m/sec and 7 m/sec.]

[Figure: results at ship speeds of … m/sec and 8.2 m/sec.]