Efficient Training in high-dimensional weight space
Michael Biehl, Christoph Bunzmann, Robert Urbanczik
Theoretische Physik und Astrophysik, Computational Physics, Julius-Maximilians-Universität Würzburg, Am Hubland, D Würzburg, Germany
Wiskunde & Informatica, Intelligent Systems, Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands

Efficient training in high-dimensional weight space: outline
· Learning from examples: a model situation, layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent, delayed learning, plateau states
· Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)
· Summary, outlook: selected further topics, prospective projects

Learning from examples: the choice of adjustable parameters in adaptive information processing systems
· based on example data, e.g. input/output pairs in classification tasks, time series prediction, regression problems (supervised learning)
· parameterizes a hypothesis, e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
· results in generalization ability, e.g. the successful classification of novel data

Theory of learning processes
· description of specific applications, e.g. hand-written digit recognition: a given real-world problem, a particular training scheme, a special set of example data, ...
· typical properties of model scenarios, e.g. learning curves: network architecture, learning algorithm, statistics of data and noise; aims at the understanding/prediction of relevant phenomena and at algorithm design
· general results, e.g. performance bounds: independent of the statistical properties of the data, the specific task, details of the training procedure, ...
Trade-off: general validity vs. applicability.

A two-layered network: the soft committee machine (SCM)
· input data processed by hidden units with adaptive weights; fixed hidden-to-output weights
· sigmoidal hidden activation, e.g. g(x) = erf(a x)
· input/output relation: sum of the hidden-unit activations (sketched below)
· SCM + adaptive thresholds: universal approximator
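As a minimal sketch of the input/output relation referred to above, in notation consistent with the later slides (N-dimensional input ξ, hidden-unit weight vectors w_j, fixed unit hidden-to-output weights; the symbols are ours, not read off the slide graphics):

\[ \sigma(\boldsymbol{\xi}) \;=\; \sum_{j=1}^{K} g\!\left(\mathbf{w}_j \cdot \boldsymbol{\xi}\right), \qquad g(x) = \operatorname{erf}(a\,x). \]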

Student-teacher scenario
· the teacher network provides the (best) parameterization of the rule; the student has adaptive hidden units
· ideal situation: perfectly matching complexity
· unlearnable rule / over-sophisticated student: interesting effects, relevant cases

Training and evaluation
· training is based on the performance w.r.t. example data, e.g. (reliable) input/output pairs: examples for the unknown function or rule
· evaluation after training: the generalization error, i.e. the expected error for a novel input w.r.t. the density of inputs or a set of test inputs
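A hedged sketch of the quantity just named, assuming a quadratic error measure (the slide itself does not fix the measure), with student output σ and teacher/rule output τ:

\[ \epsilon_g \;=\; \Big\langle \tfrac{1}{2}\,\big(\sigma(\boldsymbol{\xi}) - \tau(\boldsymbol{\xi})\big)^2 \Big\rangle_{\boldsymbol{\xi}} , \]

the average being taken over the density of inputs (or, empirically, over a set of test inputs).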

Statistical Physics approach
· consider large systems in the thermodynamic limit N → ∞ (K, M « N), where N is the dimension of the input data, so that the number of adjustable parameters diverges as well
· perform averages over the stochastic training process and over randomized example data (quenched disorder); the (technically) simplest case: reliable teacher outputs and an isotropic input density with independent components of zero mean and unit variance
· evaluate typical properties, e.g. the learning curve
· description in terms of macroscopic quantities, e.g. overlap parameters as student/teacher similarity measures
Next: the generalization error ε_g

The generalization error
· the hidden-unit activations are sums of many random numbers; by the Central Limit Theorem they become correlated Gaussians for large N
· given their first and second moments, averages over the input density reduce to Gaussian integrals: the K·N microscopic weights enter only through ½(K²+K) + K·M macroscopic overlap parameters
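The macroscopic quantities meant here are, writing B_m for the teacher weight vectors (an assumed symbol, consistent with the T_mn appearing below), the mutual overlaps

\[ Q_{ij} = \mathbf{w}_i \cdot \mathbf{w}_j, \qquad R_{jm} = \mathbf{w}_j \cdot \mathbf{B}_m, \qquad T_{mn} = \mathbf{B}_m \cdot \mathbf{B}_n , \]

so that ε_g = ε_g({Q_ij, R_jm}); the symmetric Q accounts for the ½(K²+K) and R for the K·M macroscopic degrees of freedom.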

Dynamics of on-line gradient descent
· presentation of single examples: the weights after presentation of μ examples are updated using one novel, random example; the number of examples μ plays the role of a discrete learning time
· practical advantages: no explicit storage of the full example set ID is required; little computational effort per example
· mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples, leading to coupled ODEs for {R_jm, Q_ij} in the continuous time α = P/(KN)
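A minimal numerical sketch of this on-line procedure for the erf soft committee machine, assuming a quadratic per-example error ½(σ − τ)²; the sizes N, K, M, the initialization scale and the number of examples are illustrative choices, not values taken from the slides:

```python
import numpy as np
from scipy.special import erf

def g(x):                # sigmoidal hidden activation g(x) = erf(a x), with a = 1 assumed
    return erf(x)

def dg(x):               # derivative of erf(x)
    return 2.0 / np.sqrt(np.pi) * np.exp(-x ** 2)

rng = np.random.default_rng(0)
N, K, M = 100, 2, 2                       # illustrative sizes (K = M), not taken from the slides
eta, P = 1.5, 20000                       # learning rate and number of examples

B = rng.normal(size=(M, N)) / np.sqrt(N)  # teacher weights defining the rule (unknown to the student)
w = 1e-3 * rng.normal(size=(K, N))        # student weights, nearly zero initial overlaps R_ij(0)

for mu in range(P):                       # one novel random example per learning step
    xi = rng.normal(size=N)               # isotropic input: zero mean, unit variance components
    tau = g(B @ xi).sum()                 # teacher (rule) output
    x = w @ xi                            # student local fields
    sigma = g(x).sum()                    # student output
    # gradient step on the per-example error 1/2 (sigma - tau)^2, scaled by eta/N
    w += (eta / N) * (tau - sigma) * dg(x)[:, None] * xi[None, :]
```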

From recursions to ODEs
· the learning step yields recursions for the projections (overlaps) R_jm and Q_ij
· for large N, the average over the latest example becomes an average over correlated Gaussian fields; the recursions turn into coupled ODEs in the continuous training time α ~ number of examples per weight
· integrating the ODEs yields the learning curve ε_g(α)
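Schematically, and without reproducing the explicit right-hand sides (Gaussian averages over the student and teacher fields, not shown on the slides), the averaged recursions take the form

\[ \frac{dR_{jm}}{d\alpha} \;=\; \eta\, F_{jm}\big(\{R,Q\}\big), \qquad \frac{dQ_{ij}}{d\alpha} \;=\; \eta\, G_{ij}\big(\{R,Q\}\big) \;+\; \eta^{2}\, H_{ij}\big(\{R,Q\}\big), \]

where the η² contribution to Q_ij stems from the square of the learning step.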

eGeG  = P/(KN) Biehl, Riegler, Wöhler J.Phys. A (1996) 4769 perfect generalization fast initial decrease example: K = M = 2,  = 1.5, R ij (0)  0 quasi-stationary plateau states with all dominate the learning process unspecialized student weights 10 learning curve aha!

Evolution of the overlap parameters; example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0
· shown are R_11, R_22, Q_11, Q_22, R_12, R_21 and Q_12 = Q_21, reflecting the permutation symmetry of the branches in the student network

N   Q jm mean standard deviation quantity Monte Carlo simulations self-averaging 1/N 1/  N

Plateau length
· assume randomized initialization of the weight vectors; the resulting plateau length (exactly self-averaging) determines the number of examples needed for successful learning
· hidden-unit specialization requires a priori knowledge in the form of initial macroscopic overlaps
· property of the learning scenario and necessary phase of training, or artifact of the training prescription???

S.J. Hanson, in Y. Chauvin & D. Rumelhart (eds.), Backpropagation: Theory, Architectures, and Applications

Training by Principal Component Analysis
· problem: delayed specialization in the (K·N)-dimensional weight space
· idea: A) identification (approximation) of the subspace spanned by the teacher weight vectors, B) actual training within this low-dimensional space
· example: soft committee teacher (K = M), isotropic input density; the eigenvalues/eigenvectors of a modified correlation matrix split into 1 distinguished eigenvector, (K-1) further eigenvectors associated with the teacher space, and (N-K) remaining eigenvectors
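A schematic sketch of step A), assuming a label-weighted correlation matrix of the form C = (1/P) Σ_μ φ(τ^μ) ξ^μ (ξ^μ)ᵀ; the slides do not reveal which modification of the correlation matrix is actually used, so the choice φ(τ) = τ² below, like all sizes, is only an illustrative stand-in:

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
N, K = 100, 3                               # illustrative sizes, not taken from the slides
alpha = 100
P = alpha * K * N                           # P = alpha * K * N examples

B = rng.normal(size=(K, N)) / np.sqrt(N)    # soft committee teacher (K = M), unknown in practice

# step A): accumulate a label-weighted correlation matrix; memory ~ N^2, independent of P
C = np.zeros((N, N))
for _ in range(P):
    xi = rng.normal(size=N)                 # isotropic input, zero mean, unit variance
    tau = erf(B @ xi).sum()                 # teacher output (the only accessible label)
    C += tau ** 2 * np.outer(xi, xi) / P    # phi(tau) = tau^2: an illustrative stand-in only

evals, evecs = np.linalg.eigh(C)            # eigenvalues in ascending order
# the teacher space shows up as K eigenvalues split off from the (N - K)-fold degenerate bulk;
# the slides take the eigenvector of the largest plus those of the (K - 1) smallest eigenvalues,
# here we pick the K eigenvalues farthest from the bulk (median), which selects the same
# subspace when one eigenvalue lies above and K - 1 below the bulk
idx = np.argsort(np.abs(evals - np.median(evals)))[-K:]
U = evecs[:, idx]                           # (N, K) estimated orthonormal basis of the teacher space
```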

A) empirical estimate of the correlation matrix from a limited data set (# of examples P = α N K » K²); determine the eigenvector of the largest eigenvalue and the eigenvectors of the (K-1) smallest eigenvalues; note: the required memory ∝ N² does not increase with P
B) representation of the student weights in terms of these eigenvectors and specialization in the K-dimensional space they span: optimization of only K² « K·N coefficients w.r.t. E (see the sketch below)
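Continuing the sketch above (it reuses B, U, N, K and rng from there), step B) can be illustrated as stochastic gradient descent on the K² coefficients of the reduced representation; learning rate and number of steps are again illustrative:

```python
# step B): restrict the student to the estimated subspace, w_j = sum_k c[j, k] * U[:, k],
# and adjust only the K*K coefficients c by stochastic gradient descent on 1/2 (sigma - tau)^2
eta, steps = 0.5, 5000                       # illustrative choices
c = 1e-2 * rng.normal(size=(K, K))
for _ in range(steps):
    xi = rng.normal(size=N)
    tau = erf(B @ xi).sum()                  # teacher output
    u = U.T @ xi                             # K-dimensional projection of the input
    x = c @ u                                # student local fields in the reduced space
    sigma = erf(x).sum()
    grad = (sigma - tau) * (2.0 / np.sqrt(np.pi)) * np.exp(-x ** 2)
    c -= eta * np.outer(grad, u)             # update K^2 << K*N parameters only
```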

Typical properties, given a random set of P = α N K examples:
A) formal partition sum, quenched free energy via the replica trick, saddle-point integration in the limit N → ∞; the typical overlap with the teacher weights measures the success of the teacher-space identification
B) given the outcome of A), determine the optimal ε_g achievable by a linear combination of the identified eigenvectors

Results: Statistical Physics theory and Monte Carlo simulations for K = 3, N = 400 and N = 1600; P = α K N examples
· the optimal ε_g drops from the unspecialized to the specialized branch at a critical number of examples: α_c(K=2) = 4.49, α_c(K=3) = 8.70; large-K theory: α_c(K) ≈ 2.94 K (N-independent!)
· specialization is achieved without a priori knowledge (α_c independent of N)
Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)

Spectrum of the matrix C_P for a teacher with M = 7 hidden units: the K - 1 = 6 smallest eigenvalues are clearly identifiable
· the algorithm requires no prior knowledge of M: the PCA spectrum hints at the required model complexity
· potential application: model selection

Summary
· model situation, supervised learning: the soft committee machine, student-teacher scenario, randomized training data
· statistical physics inspired approach: large systems, thermal (training) and disorder (data) averages, typical macroscopic properties
· dynamics of on-line gradient descent: delayed learning due to the symmetry breaking necessary in specialization processes
· efficient training: the PCA-based learning algorithm reduces the dimensionality of the problem; specialization without a priori knowledge

Further topics
· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· specialization processes: discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design: variational method, optimal algorithms, construction algorithms
· non-trivial statistics of data: learning from noisy data, time-dependent rules

Selected prospective projects
· unsupervised learning: density estimation, feature detection, clustering; (Learning) Vector Quantization, compression, self-organizing maps
· application-relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
· model selection: estimate the complexity of a rule or mixture density
· algorithm design: variational optimization, e.g. an alternative correlation matrix