Identification and Neural Networks
G. Horváth, Department of Measurement and Information Systems
Modular networks
Why a modular approach? Motivations: biological, learning, computational, implementation
Motivations: biological
- Biological systems are not homogeneous
- Functional specialization
- Fault tolerance
- Cooperation, competition
- Scalability, extendibility
Motivations: learning
- Complexity of learning (divide and conquer)
- Training of complex networks (many layers): layer-by-layer learning
- Speed of learning
- Catastrophic interference, incremental learning
- Mixing supervised and unsupervised learning
- Hierarchical knowledge structure
Motivations: computational
- The capacity of a network
- The size of the network
- Catastrophic interference
- Generalization capability vs. network complexity
Motivations: implementation (hardware)
- The degree of parallelism
- Number of connections
- The length of physical connections
- Fan-out
Modular networks: what are the modules?
- The modules disagree on some inputs
- Every module solves the same, whole problem, but in a different way (different modules)
- Every module solves a different task (sub-task): task decomposition (input space, output space)
Modular networks: how to combine the modules (a sketch of these combination rules follows below)
- Cooperative modules: simple average; weighted average (fixed weights); optimal linear combination (OLC) of networks
- Competitive modules: majority vote; winner takes all
- Competitive/cooperative modules: weighted average (input-dependent weights); mixture of experts (MOE)
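A minimal NumPy sketch of these combination rules, assuming each module's output is already available as an array; the function names are illustrative only, not taken from the slides:

```python
import numpy as np

def simple_average(outputs):
    # cooperative: plain average of the M module outputs, shape (M, ...)
    return np.mean(outputs, axis=0)

def weighted_average(outputs, alpha):
    # cooperative: fixed-weight linear combination, sum_k alpha_k * y_k
    return np.tensordot(alpha, outputs, axes=1)

def majority_vote(labels):
    # competitive: the class predicted by most modules wins
    values, counts = np.unique(labels, return_counts=True)
    return values[np.argmax(counts)]

def gated_combination(outputs, gates):
    # competitive/cooperative: input-dependent weights g_k(x), mixture-of-experts style
    return np.tensordot(gates, outputs, axes=1)
```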
Modular networks: constructing a modular network
- Task decomposition, subtask definition
- Training modules to solve the subtasks
- Integration of the results (cooperation and/or competition)
Modular networks: overview
- Cooperative networks: ensemble (average); optimal linear combination of networks; disjoint subtasks
- Competitive networks: ensemble (vote)
- Competitive/cooperative networks: mixture of experts
Cooperative networks: ensemble of cooperating networks (classification/regression)
- Motivation
- Heuristic explanation: different experts together can solve a problem better; complementary knowledge
- Mathematical justification: accurate and diverse modules
Ensemble of networks: mathematical justification
- Ensemble output
- Ambiguity (diversity)
- Individual error
- Ensemble error
- Constraint
Ensemble of networks: mathematical justification (cont'd)
- Weighted error
- Weighted diversity
- Ensemble error
- Averaging over the input distribution
- Solution: an ensemble of accurate and diverse networks (the relations behind these labels are reconstructed below)
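The equations on these two slides did not survive extraction; the following is a hedged reconstruction of the standard ambiguity (diversity) decomposition the labels point to, not necessarily in the slides' original notation. With ensemble output $\bar{y}(\mathbf{x}) = \sum_k \alpha_k y_k(\mathbf{x})$ under the constraint $\sum_k \alpha_k = 1,\ \alpha_k \ge 0$, individual error $e_k(\mathbf{x}) = (d(\mathbf{x}) - y_k(\mathbf{x}))^2$ and ambiguity $a_k(\mathbf{x}) = (y_k(\mathbf{x}) - \bar{y}(\mathbf{x}))^2$,

$$ (d(\mathbf{x}) - \bar{y}(\mathbf{x}))^2 = \sum_k \alpha_k e_k(\mathbf{x}) - \sum_k \alpha_k a_k(\mathbf{x}). $$

Averaging over the input distribution gives $E = \bar{E} - \bar{A}$: the ensemble error is smaller than the weighted average of the individual errors by exactly the weighted diversity, which is why accurate and diverse member networks are sought.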
Ensemble of networks: how to get accurate and diverse networks
- Different structures: more than one network type (e.g. MLP, RBF, CCN, etc.)
- Different sizes and complexities (number of hidden units, number of layers, nonlinear functions, etc.)
- Different learning strategies (BP, CG, random search, etc.); batch or sequential learning
- Different training algorithms, sample orders, learning samples
- Different training parameters
- Different starting parameter values
- Different stopping criteria
Linear combination of networks
[Block diagram: networks NN_1 ... NN_M receive the input x and produce outputs y_1 ... y_M, which are summed with weights α_1 ... α_M; a constant input y_0 = 1 with weight α_0 provides a bias term.]
Linear combination of networks: computation of the optimal coefficients
- Simple average
- α_k depends on the input: for different input domains a different network alone gives the output
- Optimal values using the constraint
- Optimal values without any constraint: Wiener-Hopf equation (reconstructed below)
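The coefficient formulas are missing from the extracted slide; a hedged reconstruction of the unconstrained optimum, using the notation of the block diagram above ($y_0 = 1$ supplying a bias term):

$$ y(\mathbf{x}) = \sum_{k=0}^{M} \alpha_k\, y_k(\mathbf{x}), \qquad \boldsymbol{\alpha}^{*} = \arg\min_{\boldsymbol{\alpha}} E\big[(d - \boldsymbol{\alpha}^{T}\mathbf{y})^{2}\big] \;\Rightarrow\; \mathbf{R}\,\boldsymbol{\alpha}^{*} = \mathbf{p}, $$

where $\mathbf{y} = [y_0, y_1, \dots, y_M]^{T}$ collects the network outputs, $\mathbf{R} = E[\mathbf{y}\mathbf{y}^{T}]$ and $\mathbf{p} = E[d\,\mathbf{y}]$; this normal equation is the Wiener-Hopf equation the slide names. With the constraint $\sum_k \alpha_k = 1$ the same minimization is carried out with a Lagrange multiplier.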
Task decomposition
- Decomposition related to learning: before learning (subtask definition); during learning (automatic task decomposition)
- Problem space decomposition: input space (input space clustering, definition of different input regions); output space (desired response)
Task decomposition: decomposition into separate subproblems
- K-class classification decomposed into K two-class problems (coarse decomposition)
- Complex two-class problems decomposed into smaller two-class problems (fine decomposition)
- Integration (module combination)
Task decomposition (illustrations)
[Figures: a 3-class problem decomposed step by step into two-class and smaller two-class subproblems.]
Task decomposition
[Block diagram: pairwise modules M_12, M_13, M_23 receive the input; MIN units combine them, using inverters (INV), to produce the class outputs C_1, C_2, C_3.]
Task decomposition: a two-class problem decomposed into subtasks
Task decomposition
[Diagrams: sub-modules M_11, M_12, M_21, M_22 receive the input and are combined with AND and OR operations; MIN units implement the AND, a MAX unit implements the OR, producing the class output C_1.]
Task decomposition: training set decomposition
- The original training set is split into a training set for each of the K two-class problems
- Each two-class problem is divided into K-1 smaller two-class problems (using an inverter module, really only (K-1)/2 of them are needed)
(a sketch of the pairwise MIN/INV combination follows below)
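A minimal sketch of the pairwise combination drawn on the preceding slides, assuming each trained module M_ij outputs a score in [0, 1] for "class i against class j"; the inverter supplies M_ji = 1 - M_ij and a MIN unit implements the AND (function and variable names are illustrative only):

```python
import numpy as np

def classify_pairwise(pairwise_scores, K):
    """pairwise_scores[(i, j)] with i < j: output of module M_ij in [0, 1].
    Returns the index of the winning class."""
    class_scores = []
    for i in range(K):
        votes = []
        for j in range(K):
            if j == i:
                continue
            if i < j:
                votes.append(pairwise_scores[(i, j)])
            else:
                votes.append(1.0 - pairwise_scores[(j, i)])  # inverter: M_ji = 1 - M_ij
        class_scores.append(min(votes))                       # MIN unit: AND over all opponents
    return int(np.argmax(class_scores))
```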
Task decomposition, a practical example: zip code recognition
[Preprocessing pipeline: the input digit is a 16 x 16 image; after normalization, edge detection with four Kirsch masks (horizontal, vertical and the two diagonals) produces four 16 x 16 feature maps, which are reduced to four 8 x 8 matrices.]
(a sketch of this feature extraction follows below)
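The slide does not give the exact mask coefficients or the 16 x 16 to 8 x 8 reduction rule; the sketch below assumes the standard Kirsch-type directional masks and a 2 x 2 block average:

```python
import numpy as np
from scipy.signal import convolve2d

# Assumed directional Kirsch masks (horizontal, vertical, two diagonals);
# the slide does not give the exact coefficients, these are the common ones.
KIRSCH = {
    "horizontal": np.array([[ 5,  5,  5], [-3,  0, -3], [-3, -3, -3]]),
    "vertical":   np.array([[ 5, -3, -3], [ 5,  0, -3], [ 5, -3, -3]]),
    "diag_ne":    np.array([[-3,  5,  5], [-3,  0,  5], [-3, -3, -3]]),
    "diag_nw":    np.array([[ 5,  5, -3], [ 5,  0, -3], [-3, -3, -3]]),
}

def extract_features(image16):
    """image16: normalized 16 x 16 grayscale digit.
    Returns four 8 x 8 feature maps, one per edge direction."""
    features = []
    for mask in KIRSCH.values():
        fmap = convolve2d(image16, mask, mode="same", boundary="symm")  # 16 x 16 edge map
        # reduce 16 x 16 -> 8 x 8 by averaging 2 x 2 blocks (assumed reduction rule)
        fmap = fmap.reshape(8, 2, 8, 2).mean(axis=(1, 3))
        features.append(fmap)
    return np.stack(features)
```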
Task decomposition: zip code recognition (handwritten character recognition), modular solution
- 45 = K(K-1)/2 pairwise modules (neurons) for K = 10 classes
- 10 AND gates (MIN operators) combine the module outputs into the class decisions
Mixture of Experts (MOE)
[Block diagram: expert networks Expert 1 ... Expert M and a gating network all receive the input x; the expert outputs μ_1 ... μ_M are weighted by the gating outputs g_1 ... g_M and summed.]
Mixture of Experts (MOE)
- The output is the weighted sum of the experts' outputs; each expert has its own parameter vector
- The output of the gating network is a "softmax" function with its own parameter vector
(a reconstruction of these formulas follows below)
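The formulas themselves were lost in extraction; a reconstruction in the usual mixture-of-experts notation (the symbols θ_i for the i-th expert's parameters and v_i for the gating parameters are assumed names):

$$ y(\mathbf{x}) = \sum_{i=1}^{M} g_i(\mathbf{x})\,\boldsymbol{\mu}_i(\mathbf{x}), \qquad \boldsymbol{\mu}_i(\mathbf{x}) = f(\mathbf{x}, \boldsymbol{\theta}_i), \qquad g_i(\mathbf{x}) = \frac{e^{\mathbf{v}_i^{T}\mathbf{x}}}{\sum_{j=1}^{M} e^{\mathbf{v}_j^{T}\mathbf{x}}}. $$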
Mixture of Experts (MOE): probabilistic interpretation
- The probabilistic model with the true parameters
- The gating outputs play the role of a priori probabilities
Mixture of Experts (MOE): training
- Training data
- Probability of generating the output from the input
- The log-likelihood function (maximum likelihood estimation)
(a reconstruction of the likelihood follows below)
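A hedged reconstruction in the same notation, assuming a Gaussian model with unit variance for each expert: for training data $\{(\mathbf{x}^{(l)}, \mathbf{d}^{(l)})\}_{l=1}^{P}$,

$$ P(\mathbf{d}\mid\mathbf{x}, \Theta) = \sum_{i=1}^{M} g_i(\mathbf{x})\, P(\mathbf{d}\mid\mathbf{x}, \boldsymbol{\theta}_i), \qquad P(\mathbf{d}\mid\mathbf{x}, \boldsymbol{\theta}_i) \propto \exp\!\Big(-\tfrac{1}{2}\|\mathbf{d} - \boldsymbol{\mu}_i(\mathbf{x})\|^{2}\Big), $$

$$ L(\Theta) = \sum_{l=1}^{P} \ln \sum_{i=1}^{M} g_i(\mathbf{x}^{(l)})\, P(\mathbf{d}^{(l)}\mid\mathbf{x}^{(l)}, \boldsymbol{\theta}_i), $$

and maximum likelihood training maximizes $L(\Theta)$ over the expert and gating parameters.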
Mixture of Experts (MOE): training (cont'd)
- Gradient method
- Update of the expert network parameters
- Update of the gating network parameters
Mixture of Experts (MOE): training (cont'd)
- A priori probability
- A posteriori probability (reconstructed below)
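The a posteriori probability the slide refers to is, in the standard derivation (a hedged reconstruction),

$$ h_i(\mathbf{x}, \mathbf{d}) = \frac{g_i(\mathbf{x})\, P(\mathbf{d}\mid\mathbf{x}, \boldsymbol{\theta}_i)}{\sum_{j=1}^{M} g_j(\mathbf{x})\, P(\mathbf{d}\mid\mathbf{x}, \boldsymbol{\theta}_j)}, $$

i.e. the probability, once the desired output $\mathbf{d}$ has been seen, that expert $i$ generated it; the gradient updates of both the expert and the gating parameters are weighted by these posteriors.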
Mixture of Experts (MOE): training (cont'd)
EM (Expectation Maximization) algorithm: a general iterative technique for maximum likelihood estimation
- Introduce hidden variables
- Define a log-likelihood function
- Two steps: expectation (of the hidden variables) and maximization (of the log-likelihood function)
EM (Expectation Maximization) algorithm: a simple example, estimating the means of k (= 2) Gaussians
[Figure: the two component densities f(y | μ_1) and f(y | μ_2) plotted over the measurements.]
EM algorithm, simple example (cont'd)
- Hidden variables for every observation: (x(l), z_l1, z_l2), where z_lj indicates which Gaussian generated x(l)
- Likelihood function and log-likelihood function
- Expected value of the hidden variables given the current parameter estimates
EM algorithm, simple example (cont'd)
- Expected log-likelihood function
- The estimate of the means
(a sketch of the resulting iteration follows below)
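A minimal NumPy sketch of the resulting iteration for two one-dimensional Gaussians with known, equal variance (all names are illustrative, not taken from the slides):

```python
import numpy as np

def em_two_gaussian_means(x, mu, sigma=1.0, n_iter=50):
    """x: 1-D array of measurements; mu: initial guesses [mu_1, mu_2].
    Returns ML estimates of the two means (mixing weights fixed at 1/2)."""
    mu = np.array(mu, dtype=float)
    for _ in range(n_iter):
        # E-step: expected hidden variables E[z_lj] = P(component j | x_l)
        resp = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate each mean as the responsibility-weighted average
        mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)
    return mu

# usage: data drawn from two Gaussians centred around -2 and +3
x = np.concatenate([np.random.normal(-2, 1, 200), np.random.normal(3, 1, 300)])
print(em_two_gaussian_means(x, mu=[0.0, 1.0]))
```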
Mixture of Experts (MOE): applications
- Simple experts: linear experts
- ECG diagnostics
- Mixture of Kalman filters
- Discussion: comparison to a non-modular architecture
Support vector machines
A new approach: it gives answers to questions not solved by the classical approach
- The size of the network
- The generalization capability
Support vector machines: optimal hyperplane, classification
[Comparison: classical neural learning vs. the Support Vector Machine.]
VC dimension
Structural error minimization
Support vector machines: linearly separable two-class problem
- Separating hyperplane
- Optimal hyperplane
Support vector machines: geometric interpretation
[Figure: the decision function d(x) and the margin around the separating hyperplane in the (x_1, x_2) plane.]
Support vector machines
- Criterion function, Lagrange function: a constrained optimization problem
- Conditions, dual problem
- Support vectors, optimal hyperplane
(a reconstruction of these formulas follows below)
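The optimization problem behind these labels is standard; a hedged reconstruction for training samples $(\mathbf{x}_l, d_l)$ with $d_l \in \{-1, +1\}$:

$$ \min_{\mathbf{w}, b}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} \quad \text{s.t.}\quad d_l(\mathbf{w}^{T}\mathbf{x}_l + b) \ge 1, \quad l = 1, \dots, P. $$

Forming the Lagrange function and eliminating $\mathbf{w}$ and $b$ gives the dual problem

$$ \max_{\boldsymbol{\alpha}}\ \sum_{l} \alpha_l - \tfrac{1}{2}\sum_{l,m} \alpha_l \alpha_m d_l d_m\, \mathbf{x}_l^{T}\mathbf{x}_m \quad \text{s.t.}\quad \alpha_l \ge 0, \quad \sum_{l} \alpha_l d_l = 0. $$

The samples with $\alpha_l > 0$ are the support vectors, and the optimal hyperplane is $\mathbf{w}^{*} = \sum_{l} \alpha_l d_l\, \mathbf{x}_l$.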
Support vector machines: linearly nonseparable case
- Separating hyperplane, criterion function
- Lagrange function
- Support vectors, optimal hyperplane
(the soft-margin form is reconstructed below)
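A hedged reconstruction of the soft-margin form: with slack variables $\xi_l \ge 0$ the criterion function becomes

$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{l} \xi_l \quad \text{s.t.}\quad d_l(\mathbf{w}^{T}\mathbf{x}_l + b) \ge 1 - \xi_l, \quad \xi_l \ge 0, $$

and the dual problem is the same as before except for the box constraint $0 \le \alpha_l \le C$; the optimal hyperplane is again built from the support vectors only.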
Support vector machines: nonlinear separation
- Separating hyperplane, decision surface
- Kernel function
- Criterion function
Support vector machines: examples of SVMs (by kernel type)
- Polynomial
- RBF
- MLP
Support vector machines, example: polynomial basis functions and the corresponding kernel function
(the typical kernels are reconstructed below)
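The kernel formulas were lost in extraction; the usual choices behind the names on the previous slide are

$$ K(\mathbf{x}, \mathbf{x}_l) = (\mathbf{x}^{T}\mathbf{x}_l + 1)^{p} \ \text{(polynomial)}, \qquad K(\mathbf{x}, \mathbf{x}_l) = \exp\!\Big(-\tfrac{\|\mathbf{x} - \mathbf{x}_l\|^{2}}{2\sigma^{2}}\Big) \ \text{(RBF)}, \qquad K(\mathbf{x}, \mathbf{x}_l) = \tanh(\kappa\,\mathbf{x}^{T}\mathbf{x}_l + \theta) \ \text{(MLP-type)}, $$

and in each case the decision surface is $d(\mathbf{x}) = \sum_{l} \alpha_l d_l\, K(\mathbf{x}, \mathbf{x}_l) + b$, built from the support vectors only.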
SVM (classification): summary
- Separable samples: minimize the criterion function subject to the margin constraints
- Not separable samples: minimize the criterion function with slack terms subject to the relaxed constraints
By minimizing the norm of the weight vector we maximize the distance between the classes, while also controlling the VC dimension.
SVR (regression)
[Figure: the cost function C(·) used for regression.]
SVR (regression): minimization criterion and constraints (reconstructed below together with the dual)
SVR (regression)
- Lagrange function, dual problem, constraints
- Support vectors, solution
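A hedged reconstruction of the ε-insensitive SVR problem these labels refer to:

$$ \min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^{*}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + C \sum_{l} (\xi_l + \xi_l^{*}) \quad \text{s.t.}\quad \begin{cases} d_l - \mathbf{w}^{T}\varphi(\mathbf{x}_l) - b \le \varepsilon + \xi_l, \\ \mathbf{w}^{T}\varphi(\mathbf{x}_l) + b - d_l \le \varepsilon + \xi_l^{*}, \\ \xi_l, \xi_l^{*} \ge 0. \end{cases} $$

The dual problem is solved for multipliers $\alpha_l, \alpha_l^{*} \in [0, C]$ with $\sum_l (\alpha_l - \alpha_l^{*}) = 0$; the samples with nonzero multipliers are the support vectors, and the solution is

$$ y(\mathbf{x}) = \sum_{l} (\alpha_l - \alpha_l^{*})\, K(\mathbf{x}, \mathbf{x}_l) + b. $$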
SVR (regression)
[Figures: regression examples illustrating SVR fits.]
Support vector machines: main advantages
- Generalization
- Size of the network
- Centre parameters for RBF
- Linear-in-the-parameters structure
- Noise immunity
Support vector machines: main disadvantages
- Computation intensive (quadratic optimization)
- Hyperparameter selection
- VC dimension (classification)
- Batch processing
Support vector machines: variants
- LS-SVM: basic criterion function
- Advantages: easier to compute, adaptivity
(a reconstruction of the LS-SVM criterion follows below)
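The LS-SVM criterion the slide names is, in its usual formulation (a hedged reconstruction),

$$ \min_{\mathbf{w}, b, \mathbf{e}}\ \tfrac{1}{2}\|\mathbf{w}\|^{2} + \tfrac{C}{2} \sum_{l} e_l^{2} \quad \text{s.t.}\quad d_l = \mathbf{w}^{T}\varphi(\mathbf{x}_l) + b + e_l, $$

i.e. the inequality constraints are replaced by equalities and the slack terms are squared, so the solution follows from a set of linear equations instead of a quadratic program, which is what makes it easier to compute and to update adaptively.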
Mixture of SVMs
- The problem of hyperparameter selection for SVMs
- Different SVMs with different hyperparameters
- Soft separation of the input space
Boosting techniques
- Boosting by filtering
- Boosting by subsampling
- Boosting by reweighting
Boosting techniques: boosting by filtering
Boosting techniques: boosting by subsampling
Boosting techniques: boosting by reweighting (a sketch follows below)
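The details of this slide were lost in extraction; as an illustration of boosting by re-weighting, a minimal AdaBoost-style sketch with decision stumps as the weak modules (all names are illustrative, not taken from the slides):

```python
import numpy as np

def fit_stump(X, y, w):
    # weak learner: pick the feature, threshold and sign with the lowest weighted error
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best[1:]

def stump_predict(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] > thr, 1, -1)

def boost_by_reweighting(X, y, n_rounds=20):
    """y in {-1, +1}. Returns the weak modules and their combination weights."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # start from uniform sample weights
    modules, alphas = [], []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)             # train a module on the re-weighted set
        pred = stump_predict(stump, X)
        err = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # module weight in the final vote
        w *= np.exp(-alpha * y * pred)         # increase the weight of misclassified samples
        w /= w.sum()
        modules.append(stump)
        alphas.append(alpha)
    return modules, alphas

def boosted_predict(modules, alphas, X):
    # weighted vote of the boosted modules
    return np.sign(sum(a * stump_predict(m, X) for m, a in zip(modules, alphas)))
```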
Other modular architectures: modular classifiers
- Decoupled modules
- Hierarchical modules
- Network ensemble (linear combination)
- Network ensemble (decision, voting)
Modular architectures