Mehran University of Engineering and Technology, Jamshoro
Department of Electronic, Telecommunication and Bio-Medical Engineering
Neural Networks
Committee Machines
By Dr. Mukhtiar Ali Unar

Committee Machines:
- Supervised learning task.
- The approach is based on a commonly used engineering principle: divide and conquer.

According to the principle of divide and conquer, a complex computational task is solved by dividing it into a number of computationally simple tasks and then combining the solutions to those tasks. In supervised learning, computational simplicity is achieved by distributing the learning task among a number of experts, which in turn divide the input space into a set of subspaces. The combination of experts is said to constitute a committee machine.

Committee machines are universal approximators.

Classification of Committee Machines:

1. Static Structures: In this class of committee machines, the responses of several predictors (experts) are combined by means of a mechanism that does not involve the input signal, hence the designation static. This category includes the following methods:
- Ensemble averaging, where the outputs of different predictors are linearly combined to produce an overall output.
- Boosting, where a weak learning algorithm is converted into one that achieves arbitrarily high accuracy.

2. Dynamic Structures: In this second class of committee machines, the input signal is directly involved in actuating the mechanism that integrates the outputs of the individual experts into an overall output, hence the designation dynamic. There are two kinds of dynamic structures:
- Mixture of experts, in which the individual responses of the experts are nonlinearly combined by means of a single gating network.
- Hierarchical mixture of experts, in which the individual responses of the experts are nonlinearly combined by means of several gating networks arranged in a hierarchical fashion.

The mixture of experts and hierarchical mixture of experts may also be viewed as examples of modular networks. A formal definition of modularity is as follows: A neural network is said to be modular if the computation performed by the network can be decomposed into two or more modules (subsystems) that operate on distinct inputs without communicating with each other. The outputs of the modules are mediated by an integrating unit that is not permitted to feed information back to the modules. In particular, the integrating unit both (1) decides how the outputs of the modules should be combined to form the final output of the system, and (2) decides which modules should learn which training patterns.

Ensemble Averaging: Fig. 2 shows a number of differently trained neural networks (i.e. experts), which share a common input and whose individual outputs are somehow combined to produce an overall output y. To simplify the presentation, the outputs of the experts are assumed to be scalar valued. Such a technique is known as the ensemble averaging method.

[Fig. 2: Expert 1, Expert 2, ..., Expert K share the input x[n]; their outputs feed a combiner that produces the output y.]

The motivation for using ensemble averaging:
- If the combination of experts in Fig. 2 were replaced by a single neural network, we would have a network with a correspondingly large number of adjustable parameters. The training time for such a large network is likely to be longer than for the case of a set of experts trained in parallel.
- The risk of overfitting the data increases when the number of adjustable parameters is large compared to the cardinality (i.e. size of the set) of the training data.

In using a committee machine, the expectation is that the differently trained networks converge to different local minima on the error surface, and the overall performance is improved by combining the outputs in some way.
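
The averaging idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the "experts" are hypothetical stand-ins (the target function plus a fixed random offset each, mimicking networks that settled in different local minima), and the combiner is a plain equal-weight average, which is one possible linear combination rather than the only one.

```python
import random

def ensemble_average(experts, x):
    """Combine scalar expert outputs with equal weights (simple averaging)."""
    outputs = [expert(x) for expert in experts]
    return sum(outputs) / len(outputs)

def make_noisy_expert(target, noise_std, rng):
    """Hypothetical stand-in for a trained network: the target plus a fixed error."""
    offset = rng.gauss(0.0, noise_std)  # each expert ends up at a different solution
    return lambda x: target(x) + offset

rng = random.Random(0)
target = lambda x: 2.0 * x + 1.0          # assumed "true" mapping for the demo
experts = [make_noisy_expert(target, 0.5, rng) for _ in range(10)]

x = 3.0
individual_errors = [abs(expert(x) - target(x)) for expert in experts]
committee_error = abs(ensemble_average(experts, x) - target(x))

print("mean individual error:", sum(individual_errors) / len(individual_errors))
print("committee error:      ", committee_error)
```

By the triangle inequality, the error of the averaged output can never exceed the mean of the individual absolute errors, which is the sense in which combining the outputs helps here.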

Boosting: Boosting is another method that belongs to the "static" class of committee machines. Boosting is quite different from ensemble averaging. In a committee machine based on ensemble averaging, all the experts in the machine are trained on the same data set; they may differ from each other in the choice of initial conditions used in network training. By contrast, in a boosting machine, the experts are trained on data sets with entirely different distributions; it is a general method that can be used to improve the performance of any learning algorithm.

Boosting can be implemented in three fundamentally different ways:
- Boosting by filtering: This approach involves filtering the training examples by different versions of a weak learning algorithm. It assumes the availability of a large (in theory, infinite) source of examples, with the examples being either discarded or kept during training. An advantage of this approach is that it allows for a small memory requirement compared to the other two approaches.

- Boosting by subsampling: This second approach works with a training sample of fixed size. The examples are "resampled" according to a given probability distribution during training. The error is calculated with respect to the fixed training sample.
- Boosting by reweighting: This third approach also works with a fixed training sample, but it assumes that the weak learning algorithm can receive "weighted" examples. The error is calculated with respect to the weighted examples.
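
The difference between the last two approaches can be sketched in a few lines of Python. This is a hedged illustration, not a full boosting algorithm: the function names, the toy training sample, the threshold classifier and the example distributions are all assumptions introduced for the demo.

```python
import random

def resample(training_sample, distribution, rng):
    """Boosting by subsampling: draw a same-size sample according to `distribution`."""
    return rng.choices(training_sample, weights=distribution, k=len(training_sample))

def weighted_error(classifier, training_sample, distribution):
    """Boosting by reweighting: error measured with respect to the weighted examples."""
    return sum(
        weight
        for (x, label), weight in zip(training_sample, distribution)
        if classifier(x) != label
    )

# Toy fixed-size training sample of (input, label) pairs (assumed data).
training_sample = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
classifier = lambda x: int(x > 0.3)       # hypothetical weak learner; errs on (0.4, 0)

uniform = [0.25, 0.25, 0.25, 0.25]
emphasised = [0.1, 0.7, 0.1, 0.1]         # more weight on the hard example

print(weighted_error(classifier, training_sample, uniform))
print(weighted_error(classifier, training_sample, emphasised))
print(resample(training_sample, emphasised, random.Random(0)))
```

Shifting probability mass onto the misclassified example raises the weighted error of the same classifier, which is what forces the next weak learner to concentrate on that example.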

Boosting by Filtering: In boosting by filtering, the committee machine consists of three experts. The algorithm used to train them is called a boosting algorithm. The three experts are arbitrarily labeled "first", "second" and "third". The three experts are individually trained as follows:

The first expert is trained on a set consisting of $N_1$ examples. The trained first expert is used to filter another set of examples by proceeding in the following manner:
- Flip a fair coin; this in effect simulates a random guess.
- If the result is heads, pass new patterns through the first expert, and discard correctly classified patterns until a pattern is misclassified. This misclassified pattern is added to the training set for the second expert.

Boosting by Filtering (continued):
- If the result is tails, do the opposite. Specifically, pass new patterns through the first expert and discard incorrectly classified patterns until a pattern is classified correctly. That correctly classified pattern is added to the training set for the second expert.
- Continue this process until a total of $N_1$ examples have been filtered by the first expert. This set of filtered examples constitutes the training set for the second expert.
- In this way, the second expert is forced to learn a distribution different from that learned by the first expert.
- Once the second expert has been trained in the usual way, a third training set is formed for the third expert by proceeding in the following manner:

Boosting by Filtering (continued):
- Pass a new pattern through both the first and second experts. If the two experts agree in their decisions, discard the pattern. If, on the other hand, they disagree, the pattern is added to the training set for the third expert.
- Continue this process until a total of $N_1$ examples have been filtered jointly by the first and second experts. This set of jointly filtered examples constitutes the training set for the third expert.
- The third expert is then trained in the usual way, and the training of the entire committee machine is thereby completed.
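
The filtering steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the two "trained" experts are hypothetical threshold classifiers standing in for real networks, the example source is a synthetic generator, and the function names are invented for illustration. Only the construction of the second and third training sets is shown, not the training itself.

```python
import random

def filter_for_second_expert(expert1, stream, n1, rng):
    """Build the second expert's training set by coin-flip filtering."""
    stream = iter(stream)
    selected = []
    while len(selected) < n1:
        heads = rng.random() < 0.5            # flip a fair coin
        for x, label in stream:
            misclassified = expert1(x) != label
            # heads: keep the first misclassified pattern;
            # tails: keep the first correctly classified pattern.
            if misclassified == heads:
                selected.append((x, label))
                break
        else:
            return selected                   # the source of examples ran out
    return selected

def filter_for_third_expert(expert1, expert2, stream, n1):
    """Keep only patterns on which the first two experts disagree."""
    selected = []
    for x, label in stream:
        if expert1(x) != expert2(x):
            selected.append((x, label))
            if len(selected) == n1:
                break
    return selected

def example_stream(rng):
    """Assumed (in theory infinite) source of labelled examples."""
    while True:
        x = rng.random()
        yield x, int(x > 0.5)

rng = random.Random(0)
stream = example_stream(rng)
expert1 = lambda x: int(x > 0.4)              # hypothetical trained first expert
expert2 = lambda x: int(x > 0.55)             # hypothetical trained second expert

second_set = filter_for_second_expert(expert1, stream, 20, rng)
third_set = filter_for_third_expert(expert1, expert2, stream, 20)
print(len(second_set), len(third_set))
```

Because heads and tails are equally likely, roughly half of the second expert's training set consists of patterns the first expert gets wrong, however rare such patterns are in the original source.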

Mixture of Experts (ME) Model: This configuration consists of K expert networks, or simply experts, and an integrating unit called a gating network that performs the function of a mediator among the expert networks (see the figure below). It is assumed that the different experts work best in different regions of the input space.

[Figure: block diagram of the ME model. The input vector x feeds Expert 1, Expert 2, ..., Expert K and the gating network; the gating outputs g_1, g_2, ..., g_K weight the expert outputs, which are summed to give the output signal y.]

The neurons of the experts are usually linear. The figure below shows the block diagram of a single neuron constituting expert k. The output of expert k is the inner product of the input vector $\mathbf{x}$ and the synaptic weight vector $\mathbf{w}_k$ of this neuron, as shown by

$$y_k = \mathbf{w}_k^T \mathbf{x}, \quad k = 1, 2, \ldots, K \qquad (8)$$

[Figure: single linear neuron of expert k, with inputs x_1, x_2, ..., x_m, synaptic weights w_k1, w_k2, ..., and output y_k.]

The gating network consists of a single layer of K neurons, with each neuron assigned to a specific expert. Fig. (a) below shows the architectural graph of the gating network, and Fig. (b) shows the block diagram of neuron k in that network.

[Figure: (a) gating network, a single layer of softmax neurons fed by the inputs x_1, x_2, ..., x_m; (b) neuron k, with inputs x_1, x_2, ..., x_m, weights a_k1, a_k2, ..., a_km, intermediate signal u_k and softmax output g_k.]

Unlike the experts, the neurons of the gating network are nonlinear, with their activation function defined by

$$g_k = \frac{\exp(u_k)}{\sum_{j=1}^{K} \exp(u_j)}, \quad k = 1, 2, \ldots, K \qquad (9)$$

where $u_k$ is the inner product of the input vector $\mathbf{x}$ and the synaptic weight vector $\mathbf{a}_k$; i.e.

$$u_k = \mathbf{a}_k^T \mathbf{x}, \quad k = 1, 2, \ldots, K$$

The normalized exponential function may be viewed as a multi-input generalization of the logistic function. It preserves the rank order of its input values, and is a differentiable generalization of the "winner takes all" operation of picking the maximum value. For this reason, the activation function of equation (9) is referred to as softmax.

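Let $y_k$ denote the output of the kth expert in response to the input vector $\mathbf{x}$. The overall output of the ME model is

$$y = \sum_{k=1}^{K} g_k y_k \qquad (10)$$

Example: Consider an ME model with two experts, and a gating network with two outputs denoted by $g_1$ and $g_2$. The output $g_1$ is defined by

$$g_1 = \frac{\exp(u_1)}{\exp(u_1) + \exp(u_2)} = \frac{1}{1 + \exp(-(u_1 - u_2))} \qquad (11)$$

Let $\mathbf{a}_1$ and $\mathbf{a}_2$ denote the two weight vectors of the gating network. We may then write

$$u_k = \mathbf{a}_k^T \mathbf{x}, \quad k = 1, 2 \qquad (12)$$

and therefore rewrite equation (11) as

$$g_1 = \frac{1}{1 + \exp(-(\mathbf{a}_1 - \mathbf{a}_2)^T \mathbf{x})} \qquad (13)$$

The other output $g_2$ of the gating network is

$$g_2 = 1 - g_1 = \frac{1}{1 + \exp(-(\mathbf{a}_2 - \mathbf{a}_1)^T \mathbf{x})} \qquad (14)$$

Along the ridge defined by $(\mathbf{a}_1 - \mathbf{a}_2)^T \mathbf{x} = 0$, we have $g_1 = g_2 = 1/2$ and the two experts contribute equally to the output of the ME model. Away from the ridge, one or the other of the two experts assumes the dominant role.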
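
A forward pass through an ME model with linear experts and a softmax gating network can be sketched in plain Python. The weight values and input vectors below are arbitrary assumptions chosen so that the first input lies on the ridge of the two-expert example; the function names are illustrative only.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(u):
    """Normalized exponential; subtracting max(u) only improves numerical stability."""
    m = max(u)
    exps = [math.exp(value - m) for value in u]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_of_experts(x, expert_weights, gate_weights):
    """Return the ME output y and the gating outputs g for input vector x."""
    y_k = [dot(w, x) for w in expert_weights]          # linear experts
    g = softmax([dot(a, x) for a in gate_weights])     # softmax gating network
    y = sum(gk * yk for gk, yk in zip(g, y_k))         # gated sum of expert outputs
    return y, g

expert_weights = [[1.0, 0.0], [0.0, 1.0]]   # assumed w_1, w_2
gate_weights = [[2.0, -1.0], [0.0, 0.0]]    # assumed a_1, a_2

# On the ridge: (a_1 - a_2)^T x = 2*1 - 1*2 = 0, so both experts count equally.
y_ridge, g_ridge = mixture_of_experts([1.0, 2.0], expert_weights, gate_weights)

# Off the ridge: (a_1 - a_2)^T x = 2, so expert 1 dominates.
y_off, g_off = mixture_of_experts([1.0, 0.0], expert_weights, gate_weights)

print(g_ridge, y_ridge)
print(g_off, y_off)
```

For two experts the first gating output coincides with the logistic function of the difference of the two gating activations, which is easy to verify numerically with the second input.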

Hierarchical Mixture of Experts (HME) Model: The HME model, illustrated on the next slide, is a natural extension of the ME model. The illustration is for an HME model of four experts. It has two layers of gating networks. By continuing with the application of the principle of divide and conquer in a manner similar to that illustrated, we may construct an HME model with any number of levels of hierarchy. The architecture of the HME model is like a tree in which the gating networks sit at the various nonterminals of the tree and the experts sit at the leaves of the tree. The HME model differs from the ME model in that the input space is divided into a nested set of subspaces, with the information combined and redistributed among the experts under the control of several gating networks arranged in a hierarchical manner.

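[Figure: HME model with four experts and two levels of gating networks. The input vector x feeds every expert and gating network. Expert 1,1 and Expert 2,1 (outputs y_11, y_21) are weighted by the outputs g_1|1 and g_2|1 of Gating Network 1; Expert 1,2 and Expert 2,2 (outputs y_12, y_22) are weighted by the outputs g_1|2 and g_2|2 of Gating Network 2; the top-level gating network (outputs g_1, g_2) combines the two groups into the overall output y.]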
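
The two-level structure in the figure can be sketched as nested gated sums. This sketch assumes, as for the ME model, linear experts and softmax gating networks at both levels; the indexing convention (`expert_weights[k][j]` for expert j in group k) and all numerical values are assumptions made for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(u):
    """Normalized exponential of a list of activations."""
    m = max(u)
    exps = [math.exp(value - m) for value in u]
    total = sum(exps)
    return [e / total for e in exps]

def hme_output(x, expert_weights, inner_gate_weights, top_gate_weights):
    """Two-level HME: y = sum_k g_k * sum_j g_{j|k} * y_{jk}."""
    g_top = softmax([dot(a, x) for a in top_gate_weights])
    y = 0.0
    for k, g_k in enumerate(g_top):
        g_inner = softmax([dot(a, x) for a in inner_gate_weights[k]])
        group_output = sum(
            g_jk * dot(w, x) for g_jk, w in zip(g_inner, expert_weights[k])
        )
        y += g_k * group_output
    return y

x = [1.0, 2.0]
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],   # group 1: experts 1,1 and 2,1
    [[1.0, 1.0], [2.0, 0.0]],   # group 2: experts 1,2 and 2,2
]
zero_gate = [[0.0, 0.0], [0.0, 0.0]]   # all-zero gates give equal weights
y_uniform = hme_output(x, expert_weights, [zero_gate, zero_gate], zero_gate)
print(y_uniform)
```

With all gating weights set to zero every gate outputs 1/2, so the result is the plain average of the four expert outputs; nonzero gating weights shift the emphasis between groups and between experts within a group as the input changes.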

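Local Model Networks:

[Figure: block diagram of a local model network. The input u feeds the local models f_1(·), f_2(·), ..., f_n(·); their outputs y_1, y_2, ..., y_n are weighted by the corresponding validity functions and summed to give the output y.]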

A Local Model Network (LMN) is a set of models (experts) weighted by some activation function. The same input is fed to each model, and the outputs are weighted according to some variable or variables $\phi$:

$$y(t) = \sum_{i=1}^{n} \rho_i(\phi)\, y_i(t) \qquad (1)$$

where $y(t)$ is the model network output, $\rho_i(\phi)$ is the validity (i.e. activation) function of the ith model, n is the number of models, and $y_i(t)$ is the output of the ith local model $f_i(\cdot)$. The weighting or activation of each local model is calculated using an activation function which is a function of the scheduling variable. The scheduling variable could be a system state variable, an input variable or some other system parameter. It is also feasible to schedule on more than one variable and to establish a multi-dimensional LMN.

Although any function with a locally limited activation may be applied as an activation function, Gaussian functions are applied most widely. Usually normalized validity functions are used. The validity function $\rho_i(\phi)$ can be normalized as

$$\tilde{\rho}_i(\phi) = \frac{\rho_i(\phi)}{\sum_{j=1}^{n} \rho_j(\phi)}$$

The individual component models $f_i$ can be of any form; they can be linear or nonlinear, have a state-space or input-output description, or be discrete or continuous time. They can be of different character, using physical models of the system for operating conditions where they are available, and parametric models for conditions where there is no physical description available. These can also be ANN models such as MLP and RBF networks.
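
A local model network with normalized Gaussian validity functions can be sketched as below. This is a hedged illustration: the two linear local models, the centres, the common width and the function names are assumptions, and a single scalar scheduling variable is used for simplicity.

```python
import math

def gaussian_validity(phi, centre, width):
    """Unnormalized Gaussian validity function of the scheduling variable phi."""
    return math.exp(-0.5 * ((phi - centre) / width) ** 2)

def lmn_output(u, phi, local_models, centres, width):
    """Blend the local model outputs with normalized validity functions."""
    rho = [gaussian_validity(phi, c, width) for c in centres]
    total = sum(rho)
    rho_norm = [r / total for r in rho]                  # weights sum to one
    y = sum(r * f(u) for r, f in zip(rho_norm, local_models))
    return y, rho_norm

local_models = [lambda u: 1.0 * u, lambda u: 3.0 * u]   # assumed local linear models
centres = [0.0, 10.0]                                    # assumed operating points
width = 2.0                                              # assumed common width

y_mid, weights_mid = lmn_output(1.5, 5.0, local_models, centres, width)
y_low, weights_low = lmn_output(1.5, 0.0, local_models, centres, width)
print(weights_mid, y_mid)
print(weights_low, y_low)
```

Halfway between the two centres both local models receive equal weight, while at a centre the corresponding model dominates almost completely; the width controls how sharp that transition is.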

The individual local models are smoothly interpolated by the validity functions $\rho_i$ to produce the overall model. The learning process in LMNs can be divided into two tasks:
1. Find the optimal number, position and shape of the validity functions, i.e. define the structure of the network.
2. Find the optimal set of parameters for the local models, i.e. define the parameters of the network. These parameters could be the complete set of coefficients of a linear model, numerical parameters of a nonlinear model, or even switches which alter the local model structure.

Advantages of LMNs:
- The LMN has a transparent structure which allows a direct analysis of local model properties.
- The LMN is less sensitive to the curse of dimensionality than many other local representations such as RBF networks.
- Nonlinear models based on LMNs are able to capture the nonlinear effects and provide accuracy over a wide operational range.
- The LMN framework allows the integration of a priori knowledge to define the model structure for a particular problem. This leads to more interpretable models which can be more reliably identified from a limited amount of observed data.

Example: Modelling of Ship Dynamics. A ship is usually represented by a mathematical equation relating the heading of the ship to the rudder angle (the control signal). The parameters $m$, $d_1$ and $d_3$ of this equation depend upon the operating conditions, which include the speed of the vessel, the depth of water, loading conditions and environmental disturbances. The table on the next slide shows how these parameters change with the forward speed of the ship.

Table: Variation of ship parameters with speed

Speed (m/sec)    m          d_1     d_3
 2               387.5      5       12.5
 4               96.875     2.5     1.56
 6               43.055     1.66    0.46
 8               24.22      1.25    0.19
10               15.5       1.00    0.1
12               10.76      0.84    0.05
14               7.91       0.72    0.03
16               6.05       0.63    0.02
18               4.78       0.55    0.01
20               3.87       0.5     (not given)

Only nine d_3 values are listed; they are aligned here with the first nine speeds.

A Local Model Network can easily be developed to incorporate the parameter variations with respect to speed. For example, four models at speeds of (say) 4 m/sec, 8 m/sec, 12 m/sec and 16 m/sec can be interpolated together as in the Local Model Network block diagram shown earlier. Gaussian functions centred at 4, 8, 12 and 16 m/sec can be used as validity functions, and speed may be regarded as the scheduling variable. Some results are shown next.
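
The scheduling on speed described above can be illustrated by computing the normalized Gaussian validity weights of the four local models. This is only a sketch: the centres come from the text, but the Gaussian width is an assumed value, the function name is invented, and the local ship models themselves are not reproduced here.

```python
import math

CENTRES = [4.0, 8.0, 12.0, 16.0]   # m/sec, the four operating points named in the text
WIDTH = 2.0                        # assumed Gaussian width; not given in the source

def validity_weights(speed, centres=CENTRES, width=WIDTH):
    """Normalized Gaussian validity of each local model at the given speed."""
    rho = [math.exp(-0.5 * ((speed - c) / width) ** 2) for c in centres]
    total = sum(rho)
    return [r / total for r in rho]

for speed in (7.0, 10.0):
    weights = validity_weights(speed)
    print(speed, [round(w, 3) for w in weights])
```

At a speed between two centres the two neighbouring local models share most of the weight, so the network output moves smoothly from one local model to the next as the ship speeds up or slows down.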

[Figure: results at speeds of 10 m/sec and 7 m/sec.]

[Figure: results at speeds of 4.1 m/sec and 8.2 m/sec.]

