Efficient Training in high-dimensional weight space

Michael Biehl, Christoph Bunzmann, Robert Urbanczik

Theoretische Physik und Astrophysik / Computational Physics, Julius-Maximilians-Universität Würzburg, Am Hubland, D Würzburg, Germany
Wiskunde & Informatica / Intelligent Systems, Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands
Outline

· Learning from examples: a model situation, layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)
· Summary, Outlook: selected further topics, prospective projects
Learning from examples

Choice of adjustable parameters in adaptive information processing systems:
· based on example data, e.g. input/output pairs (supervised learning): classification tasks, time series prediction, regression problems
· parameterizes a hypothesis, e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
· results in generalization ability, e.g. the successful classification of novel data
Theory of learning processes

· description of specific applications, e.g. hand-written digit recognition
  - given real-world problem
  - special set of example data
  - particular training scheme ...

· typical properties of model scenarios, e.g. learning curves
  - network architecture
  - statistics of data, noise
  - learning algorithm
  → understanding/prediction of relevant phenomena, algorithm design

· general results, e.g. performance bounds, independent of
  - statistical properties of data
  - specific task
  - details of training procedure ...

trade-off: general validity vs. applicability
A two-layered network: the soft committee machine

· input data ξ ∈ R^N; K hidden units with adaptive weights w_j ∈ R^N
· sigmoidal hidden activation, e.g. g(x) = erf(a x)
· fixed hidden-to-output weights; input/output relation σ(ξ) = Σ_{j=1}^{K} g(w_j · ξ)
· SCM + adaptive thresholds: universal approximator
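For concreteness, a minimal numerical sketch of student and teacher soft committee machines follows; the array shapes, the choice a = 1/√2 in g(x) = erf(a x) and the random initialization are illustrative assumptions, not part of the original formulation.

```python
import numpy as np
from scipy.special import erf

A = 1.0 / np.sqrt(2.0)                      # slope a in g(x) = erf(a x), assumed here

def scm_output(W, xi):
    """Soft committee machine: sum of sigmoidal hidden units with
    fixed hidden-to-output weights (all equal to one).
    W  : (K, N) array of hidden-unit weight vectors
    xi : (N,)  input vector with i.i.d. zero-mean, unit-variance components
    """
    return np.sum(erf(A * (W @ xi)))

# student (K adaptive hidden units) and teacher (M fixed hidden units)
N, K, M = 100, 3, 3
rng = np.random.default_rng(0)
W = rng.normal(size=(K, N)) / np.sqrt(N)    # adaptive student weights, |w_j| ~ 1
B = rng.normal(size=(M, N)) / np.sqrt(N)    # teacher weights defining the rule

xi = rng.normal(size=N)                     # isotropic input
sigma, tau = scm_output(W, xi), scm_output(B, xi)
```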
Student-teacher scenario

· teacher: M hidden units, (best) parameterization of the rule
· adaptive student: K hidden units
· ideal situation: perfectly matching complexity (K = M)
· K < M: unlearnable rule; K > M: over-sophisticated student
· interesting effects, relevant cases
· training: based on the performance w.r.t. example data, e.g. input/output pairs as examples for the unknown function or rule
· (reliable) evaluation after training: generalization error ε_g = expected error for a novel input w.r.t. the density of inputs / a set of test inputs
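Written out in the notation used above, a common choice (assumed here, including the conventional factor 1/2) is the quadratic deviation between student and teacher output:

```latex
\varepsilon_g \;=\; \Bigl\langle \tfrac{1}{2}\,\bigl(\sigma(\xi)-\tau(\xi)\bigr)^{2} \Bigr\rangle_{\xi},
\qquad
\sigma(\xi)=\sum_{j=1}^{K} g(\mathbf{w}_j\cdot\boldsymbol{\xi}),\qquad
\tau(\xi)=\sum_{m=1}^{M} g(\mathbf{B}_m\cdot\boldsymbol{\xi}),
```

where ⟨·⟩_ξ denotes the average over the input density.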
Statistical Physics approach

· consider large systems in the thermodynamic limit N → ∞, with N the dimension of the input data and K·N ∝ N the number of adjustable parameters (K, M « N)
· perform averages over the stochastic training process and over the randomized example data (quenched disorder); technically simplest case: reliable teacher outputs, isotropic input density with independent components of zero mean and unit variance
· evaluate typical properties, e.g. the learning curve
· description in terms of macroscopic quantities, e.g. overlap parameters as student/teacher similarity measures
The generalization error

· the hidden-unit fields x_j = w_j · ξ and y_m = B_m · ξ are sums of many random numbers
· Central Limit Theorem: for large N they become correlated Gaussian variables
· their first and second moments are given by the overlap parameters R_jm = w_j · B_m, Q_ij = w_i · w_j, T_mn = B_m · B_n
· averages over the K·N microscopic degrees of freedom (the student weights) reduce to integrals over the ½(K² + K) + K·M macroscopic quantities {Q_ij, R_jm}
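For the frequently used activation g(x) = erf(x/√2), i.e. the case a = 1/√2 assumed here, the Gaussian averages can be carried out in closed form; the resulting macroscopic expression depends only on the overlaps:

```latex
\varepsilon_g = \frac{1}{\pi}\Biggl[
  \sum_{i,j=1}^{K}\arcsin\frac{Q_{ij}}{\sqrt{(1+Q_{ii})(1+Q_{jj})}}
 +\sum_{m,n=1}^{M}\arcsin\frac{T_{mn}}{\sqrt{(1+T_{mm})(1+T_{nn})}}
 -2\sum_{j=1}^{K}\sum_{m=1}^{M}\arcsin\frac{R_{jm}}{\sqrt{(1+Q_{jj})(1+T_{mm})}}
\Biggr].
```

This illustrates the reduction from K·N microscopic weights to the ½(K² + K) + K·M macroscopic parameters counted above.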
Dynamics of on-line gradient descent

On-line learning: presentation of single examples in a sequence; weights after presentation of μ examples: w_j^μ; each step uses a novel, random example; μ = number of examples = discrete learning time

practical advantages:
· no explicit storage of all example data required
· little computational effort per example

mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples → coupled ODEs for {R_jm, Q_ij} in continuous time α = P/(KN)
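A sketch of a single on-line step, assuming the quadratic error on the latest example and a learning rate η scaling the O(1/N) weight change (both standard choices, stated here as assumptions):

```python
import numpy as np
from scipy.special import erf

A = 1.0 / np.sqrt(2.0)                                              # assumed slope in g(x) = erf(a x)
g  = lambda x: erf(A * x)
dg = lambda x: (2.0 * A / np.sqrt(np.pi)) * np.exp(-(A * x) ** 2)   # g'(x)

def online_step(W, xi, tau, eta):
    """One on-line gradient step of the student weights W (shape K x N)
    on the single example (xi, tau), for the error e = (sigma - tau)^2 / 2."""
    N = xi.size
    x = W @ xi                                 # student hidden-unit fields
    sigma = np.sum(g(x))                       # student output
    delta = (tau - sigma) * dg(x)              # per-hidden-unit error signals
    W += (eta / N) * np.outer(delta, xi)       # weight change of order 1/N
    return W
```

Iterating this over a random sequence of P examples and monitoring R_jm = w_j · B_m and Q_ij = w_i · w_j as functions of α = P/(KN) yields the learning curves discussed next.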
· projections: hidden-unit fields x_j = w_j · ξ, y_m = B_m · ξ of the latest example
· the learning step yields recursions for the overlaps, e.g. for R_jm and Q_ij
· large N: average over the latest example (Gaussian fields); the mean recursions become coupled ODEs in the continuous training time α = P/(KN) ~ examples per weight
· → learning curve ε_g(α)
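Schematically, and up to constant factors fixed by the rescaled time α = P/(KN) used here, the averaged recursions take the form

```latex
\frac{dR_{jm}}{d\alpha} \;\propto\; \eta\,\langle \delta_j\, y_m\rangle, \qquad
\frac{dQ_{ij}}{d\alpha} \;\propto\; \eta\,\langle \delta_i\, x_j + \delta_j\, x_i\rangle
  \;+\; \eta^{2}\,\langle \delta_i\, \delta_j\rangle,
\qquad \delta_j = (\tau-\sigma)\, g'(x_j),
```

where the averages are over the correlated Gaussian fields x_i, y_m of the latest example and are functions of {R_jm, Q_ij, T_mn} only.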
Learning curve: ε_g vs. α = P/(KN)  [Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769]

example: K = M = 2, learning rate η = 1.5, R_ij(0) ≈ 0
· fast initial decrease of ε_g
· quasi-stationary plateau states with all R_ij ≈ R (unspecialized student weights) dominate the learning process
· eventually: perfect generalization, ε_g → 0
Evolution of overlap parameters

example: K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0
· shown: R_11, R_22; R_12, R_21; Q_11, Q_22; Q_12 = Q_21
· permutation symmetry of the branches in the student network
Monte Carlo simulations: self-averaging

The macroscopic quantities (mean and standard deviation of e.g. Q_jm shown vs. N) are self-averaging: the mean approaches the theoretical prediction, the standard deviation across runs decreases like 1/√N (variance ~ 1/N).
Plateau length

· the plateau length diverges if all R_jm(0) → 0; with randomized initialization of the weight vectors the initial overlaps are only of order 1/√N and vanish in the exactly self-averaging limit N → ∞
· → the number of examples needed for successful learning diverges!
· hidden-unit specialization requires a priori knowledge (initial macroscopic overlaps)
· a property of the learning scenario (necessary phase of training) or an artifact of the training prescription???
S.J. Hanson, in Y. Chauvin & D. Rumelhart (Eds.), Backpropagation: Theory, Architectures, and Applications
Training by Principal Component Analysis

problem: delayed specialization in the (K·N)-dimensional weight space

idea:
A) identification (approximation) of the subspace spanned by the teacher weight vectors
B) actual training within this low-dimensional space

example: soft committee teacher (K = M), isotropic input density
modified correlation matrix; its eigenvalues and eigenvectors split into 1 eigenvector, (K−1) eigenvectors, and (N−K) eigenvectors
A) empirical estimate C_P of the modified correlation matrix from a limited data set; determine the eigenvector with the largest eigenvalue and the (K−1) eigenvectors with the smallest eigenvalues (together they approximate the teacher space); note: the required memory ~ N² does not increase with P

B) representation of the student weights as linear combinations of these K eigenvectors; specialization in the K-dimensional space: optimization of the K² « K·N coefficients w.r.t. E (# of examples P = α N K » K²)

(a schematic implementation is sketched below)
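The following is a rough sketch of the two-stage procedure under explicit assumptions: the weighting that defines the modified correlation matrix is not specified on these slides, so a simple output-weighted second moment is used as a stand-in (with it, the split of the spectrum into the groups described above is not guaranteed), and stage B is implemented as an ordinary batch gradient descent on the K² coefficients.

```python
import numpy as np
from scipy.special import erf

A = 1.0 / np.sqrt(2.0)
g  = lambda x: erf(A * x)
dg = lambda x: (2.0 * A / np.sqrt(np.pi)) * np.exp(-(A * x) ** 2)

def modified_correlation_matrix(Xi, tau):
    """Empirical N x N estimate C_P from P examples (Xi: (P, N), tau: (P,)).
    The weighting tau**2 is a stand-in for the (unspecified) weighting of the
    original algorithm; the memory requirement ~ N^2 is independent of P."""
    return (Xi * tau[:, None] ** 2).T @ Xi / len(tau)

def teacher_space_basis(C, K):
    """Stage A: keep the eigenvector of the largest eigenvalue and the
    (K - 1) eigenvectors of the smallest eigenvalues, as prescribed above."""
    vals, vecs = np.linalg.eigh(C)                            # ascending eigenvalues
    basis = np.column_stack([vecs[:, -1], vecs[:, :K - 1]])   # N x K
    return basis, vals

def train_coefficients(basis, Xi, tau, K, eta=0.5, epochs=500, seed=1):
    """Stage B: student weights are linear combinations of the K basis vectors;
    only the K x K coefficient matrix is adapted (K^2 << K*N parameters)."""
    rng = np.random.default_rng(seed)
    coeff = 0.1 * rng.normal(size=(K, K))
    Z = Xi @ basis                                     # project inputs once, (P, K)
    for _ in range(epochs):
        X = Z @ coeff.T                                # hidden fields, (P, K)
        delta = (tau - g(X).sum(axis=1))[:, None] * dg(X)
        coeff += (eta / len(tau)) * delta.T @ Z        # batch gradient step on K^2 numbers
    return coeff @ basis.T                             # student weights, (K, N)

# usage sketch: soft committee teacher with K = M hidden units
N, K, P = 200, 3, 6000
rng = np.random.default_rng(0)
B = rng.normal(size=(K, N)); B /= np.linalg.norm(B, axis=1, keepdims=True)
Xi = rng.normal(size=(P, N))
tau = g(Xi @ B.T).sum(axis=1)
basis, spectrum = teacher_space_basis(modified_correlation_matrix(Xi, tau), K)
W = train_coefficients(basis, Xi, tau, K)
```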
Typical properties, given a random set of P = α N K examples: formal partition sum → quenched free energy (replica trick, saddle point integration, limit N → ∞)

A) the typical overlap with the teacher weights measures the success of teacher-space identification
B) given the identified subspace, determine the optimal ε_g achievable by a linear combination of its basis vectors
Results: K = 3, statistical physics theory and Monte Carlo simulations, N = 400 and N = 1600; P = α K N examples
[Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)]

A) typical overlap with the teacher space vs. α
B) optimal ε_g achievable by a linear combination of the identified eigenvectors vs. α: transition from unspecialized to specialized students at α_c

· α_c(K=2) = 4.49, α_c(K=3) = 8.70
· large-K theory: α_c(K) ≈ 2.94 K (N-independent!)
· specialization without a priori knowledge (α_c independent of N)
Spectrum of the matrix C_P for a teacher with M = 7 hidden units: the M − 1 = 6 smallest eigenvalues split off from the bulk.
· the algorithm requires no prior knowledge of M
· the PCA spectrum hints at the required model complexity
· potential application: model selection (see the heuristic sketch below)
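As a hypothetical illustration of the model-selection idea, building on the stand-in correlation matrix sketched above, one could simply count the eigenvalues that split off from the nearly degenerate bulk; the gap criterion below is a heuristic stand-in, not a prescription taken from the original work.

```python
import numpy as np

def estimate_hidden_units(spectrum, factor=10.0):
    """Heuristic model selection from the spectrum of the modified correlation
    matrix: the bulk of (N - M) eigenvalues is nearly degenerate, while M
    eigenvalues (one above, M - 1 below in the scenario discussed here)
    split off.  Counting the outliers therefore estimates M."""
    vals = np.asarray(spectrum)
    bulk = np.median(vals)                                    # centre of the degenerate bulk
    scale = np.median(np.abs(vals - bulk)) + 1e-12            # typical bulk fluctuation
    return int(np.sum(np.abs(vals - bulk) > factor * scale))  # number of split-off eigenvalues

# e.g. applied to the spectrum returned by teacher_space_basis(...) above:
# M_estimate = estimate_hidden_units(spectrum)
```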
Summary

· model situation, supervised learning
  - the soft committee machine
  - student-teacher scenario
  - randomized training data
· dynamics of on-line gradient descent
  - delayed learning due to necessary symmetry-breaking specialization processes
· statistical physics inspired approach
  - large systems
  - thermal (training) and disorder (data) averages
  - typical, macroscopic properties
· efficient training
  - PCA-based learning algorithm reduces the dimensionality of the problem
  - specialization without a priori knowledge
Further topics

· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· specialization processes: discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design: variational method, optimal algorithms; construction algorithms
· non-trivial statistics of data: learning from noisy data, time-dependent rules
Selected prospective projects

· unsupervised learning: density estimation, feature detection, clustering; (Learning) Vector Quantization, compression, self-organizing maps
· application-relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
· model selection: estimate the complexity of a rule or mixture density
· algorithm design: variational optimization, e.g. of an alternative correlation matrix