Efficient Training in High-Dimensional Weight Space
Michael Biehl, Christoph Bunzmann, Robert Urbanczik
Theoretische Physik und Astrophysik, Computational Physics, Julius-Maximilians-Universität Würzburg, Am Hubland, D-97074 Würzburg, Germany, http://theorie.physik.uni-wuerzburg.de/~biehl
Wiskunde & Informatica, Intelligent Systems, Rijksuniversiteit Groningen, Postbus 800, NL-9718 DD Groningen, The Netherlands, biehl@cs.rug.nl, www.cs.rug.nl/~biehl
Efficient training in high-dimensional weight space: outline
· Learning from examples
· A model situation: layered neural networks, student-teacher scenario
· The dynamics of on-line learning: on-line gradient descent; delayed learning, plateau states
· Efficient training of multilayer networks: learning by Principal Component Analysis (idea, analysis, results)
· Summary, Outlook: selected further topics, prospective projects
Learning from examples
Supervised learning: the choice of adjustable parameters in adaptive information-processing systems
· based on example data, e.g. input/output pairs in classification tasks, time series prediction, regression problems
· parameterizes a hypothesis, e.g. for an unknown classification or regression task
· guided by the optimization of an appropriate objective or cost function, e.g. the performance with respect to the example data
· results in generalization ability, e.g. the successful classification of novel data
Theory of learning processes
· description of specific applications, e.g. hand-written digit recognition: a given real-world problem, a particular training scheme, a special set of example data...
· typical properties of model scenarios, e.g. learning curves: network architecture, statistics of data and noise, learning algorithm; goal: understanding/prediction of relevant phenomena, algorithm design
· general results, e.g. performance bounds: independent of the specific task, the statistical properties of the data, the details of the training procedure...
trade-off: general validity vs. applicability
A two-layered network: the soft committee machine
· input data ξ, adaptive weights w_j connecting the inputs to K hidden units, fixed hidden-to-output weights
· sigmoidal hidden activation, e.g. g(x) = erf(a x)
· input/output relation with fixed hidden-to-output weights (see the sketch below)
· the SCM with adaptive thresholds is a universal approximator
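The input/output relation referred to above can be written out explicitly; this is a minimal sketch, with the 1/√N normalization of the hidden-unit fields taken as an assumption (conventions differ in the literature):

\[
\sigma(\xi) \;=\; \sum_{j=1}^{K} g(x_j), \qquad x_j \;=\; \frac{\mathbf{w}_j \cdot \xi}{\sqrt{N}}, \qquad g(x) = \mathrm{erf}(a\,x).
\]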
Student-teacher scenario
· the teacher network provides the (best) parameterization of the rule; the student is adaptive, with K hidden units facing a teacher with M hidden units
· relevant cases with interesting effects: K < M, an unlearnable rule; K = M, the ideal situation of perfectly matching complexity; K > M, an over-sophisticated student
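For reference, the teacher rule of the matching scenario can be sketched in the same form as the student; the symbol B_m for the teacher weight vectors is an assumption, not fixed by the slide:

\[
\tau(\xi) \;=\; \sum_{m=1}^{M} g(y_m), \qquad y_m \;=\; \frac{\mathbf{B}_m \cdot \xi}{\sqrt{N}}.
\]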
Training and evaluation
· training is based on the performance with respect to example data, e.g. input/output pairs serving as (reliable) examples for the unknown function or rule
· evaluation after training: the generalization error, i.e. the expected error for a novel input with respect to the density of inputs or a set of test inputs
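A common quadratic definition of the generalization error in this setting, given here as a sketch (the prefactor 1/2 is a convention, not taken from the slide):

\[
\epsilon_g \;=\; \frac{1}{2}\,\Big\langle \big[\sigma(\xi) - \tau(\xi)\big]^2 \Big\rangle_{\xi}.
\]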
Statistical Physics approach
· consider large systems in the thermodynamic limit N → ∞ (with K, M « N), where N is the dimension of the input data and sets the number of adjustable parameters
· perform averages over the stochastic training process and over the randomized example data (quenched disorder); the technically simplest case: reliable teacher outputs and an isotropic input density, i.e. independent components with zero mean and unit variance
· evaluate typical properties, e.g. the learning curve
· description in terms of macroscopic quantities, e.g. overlap parameters as a student/teacher similarity measure
next: the generalization error ε_g
The generalization error
The hidden-unit fields are sums of many random numbers; by the Central Limit Theorem they become correlated Gaussians for large N, fully specified by their first and second moments. Averages over the K·N microscopic degrees of freedom thus reduce to integrals over ½(K² + K) + K·M macroscopic quantities.
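A sketch of the macroscopic order parameters counted above, assuming the field and weight normalizations introduced earlier: the K·M student-teacher overlaps R_jm and the ½K(K+1) student-student overlaps Q_ij fix all first and second moments of the Gaussian fields,

\[
R_{jm} \;=\; \frac{\mathbf{w}_j \cdot \mathbf{B}_m}{N} \;=\; \langle x_j\, y_m \rangle, \qquad
Q_{ij} \;=\; \frac{\mathbf{w}_i \cdot \mathbf{w}_j}{N} \;=\; \langle x_i\, x_j \rangle,
\]

so that ε_g becomes a function of {R_jm, Q_ij} (and the fixed teacher overlaps T_mn = B_m · B_n / N) only.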
Dynamics of on-line gradient descent
On-line learning step: at discrete learning time μ (the number of examples presented so far), a novel, random example is presented and the weights are updated, giving w_j^{μ+1} from w_j^μ (see the sketch below).
Practical advantages:
· no explicit storage of all examples is required
· little computational effort per example
Mathematical ease: the typical dynamics of learning can be evaluated on average over a randomized sequence of examples, yielding coupled ODEs for {R_jm, Q_ij} in the rescaled time α = P/(KN).
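A sketch of the on-line learning step, with the per-example quadratic error and the 1/N scaling of the learning rate η taken as assumptions consistent with the soft committee machine literature:

\[
\mathbf{w}_j^{\mu+1} \;=\; \mathbf{w}_j^{\mu} \;-\; \frac{\eta}{N}\, \nabla_{\mathbf{w}_j} e^{\mu}, \qquad
e^{\mu} \;=\; \frac{1}{2}\big[\sigma(\xi^{\mu}) - \tau(\xi^{\mu})\big]^2 .
\]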
From recursions to ODEs
The learning step induces recursions for the projections (overlap parameters), e.g. for R_jm and Q_ij. For large N, the average over the latest example becomes a Gaussian mean, and the recursions turn into coupled ODEs in the continuous training time α ~ examples per weight; their solution yields the learning curve ε_g(α).
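The generic structure of the resulting equations of motion, sketched under the same assumptions (the prefactors depend on the chosen scaling of α and η):

\[
\frac{dR_{jm}}{d\alpha} \;\propto\; \eta\,\big\langle \delta_j\, y_m \big\rangle, \qquad
\frac{dQ_{ij}}{d\alpha} \;\propto\; \eta\,\big\langle \delta_i x_j + \delta_j x_i \big\rangle \;+\; \eta^2\,\big\langle \delta_i\, \delta_j \big\rangle, \qquad
\delta_j \;=\; \big[\tau(\xi) - \sigma(\xi)\big]\, g'(x_j),
\]

where ⟨·⟩ denotes the Gaussian average over the fields x, y.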
Learning curve: ε_g versus α = P/(KN) for K = M = 2, η = 1.5, R_ij(0) ≈ 0 (plot; ε_g ranges from 0 to about 0.05 for α up to 300). After a fast initial decrease, quasi-stationary plateau states, in which all student weights remain unspecialized, dominate the learning process before the system finally reaches perfect generalization. [Biehl, Riegler, Wöhler, J. Phys. A (1996) 4769]
Evolution of the overlap parameters for K = M = 2, T_mn = δ_mn, η = 1, R_ij(0) ≈ 0 (plot of R_11, R_22, Q_11, Q_22, R_12, R_21 and Q_12 = Q_21 versus α, values between 0 and 1). On the plateau the curves reflect the permutation symmetry of the branches of the student network.
Monte Carlo simulations: self-averaging. For a quantity such as Q_jm, the mean agrees with the theoretical prediction as N grows, while the fluctuations vanish, with a standard deviation of order 1/√N (variance of order 1/N).
Plateau length
If all initial overlaps vanish exactly, the (exactly self-averaging) description never leaves the plateau. Assuming a randomized initialization of the weight vectors, the initial specialization is only weak, and an excessive number of examples is needed for successful learning: hidden-unit specialization effectively requires a priori knowledge in the form of initial macroscopic overlaps. Is this a property of the learning scenario, a necessary phase of training, or an artifact of the training prescription?
S.J. Hanson, in Y. Chauvin & D. Rumelhart (Eds.), Backpropagation: Theory, Architectures, and Applications
Training by Principal Component Analysis
Problem: delayed specialization in the (K·N)-dimensional weight space.
Idea: A) identification (approximation) of the relevant subspace; B) actual training within this low-dimensional space.
Example: soft committee teacher (K = M), isotropic input density. A modified correlation matrix of the data has a characteristic spectrum: 1 eigenvector with the largest eigenvalue, (K − 1) eigenvectors with the smallest eigenvalues (together spanning the space of the teacher vectors), and (N − K) bulk eigenvectors orthogonal to it.
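One way to realize such a modified correlation matrix is a weighted second-moment matrix of the inputs; the following form and the choice of weighting function are illustrative assumptions, not the specific construction of the original work:

\[
C_P \;=\; \frac{1}{P} \sum_{\mu=1}^{P} \Phi\!\left(\tau^{\mu}\right)\, \xi^{\mu} \left(\xi^{\mu}\right)^{\!\top},
\]

with a suitable even scalar function Φ of the observed output, e.g. Φ(τ) = τ² (an odd weighting would average out for a symmetric input density). For an isotropic input density, eigenvalues that split off from the bulk correspond to eigenvectors lying in the space spanned by the teacher vectors.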
A) Empirical estimate of the correlation matrix from a limited data set: determine the eigenvector of the largest eigenvalue and the eigenvectors of the (K − 1) smallest eigenvalues. Note: the required memory is of order N² and does not increase with P.
B) Specialization within the K-dimensional space spanned by these eigenvectors: representation of the student weights as linear combinations of them, and optimization of the coefficients with respect to the training error E (K² « K·N coefficients; number of examples P ~ N·K » K²).
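A minimal numerical sketch of the two-stage procedure. The weighting of the correlation matrix by the squared teacher output, the tanh activation (standing in for erf), and the plain batch gradient optimizer are illustrative assumptions, not the construction of the original paper:

```python
import numpy as np

def pca_based_training(xi, tau, K, eta=0.05, steps=2000):
    """Two-stage training sketch.
    A) estimate the relevant K-dimensional subspace from a weighted
       correlation matrix of the data (memory ~ N^2, independent of P),
    B) represent the student weights in that subspace and fit the K*K
       coefficients by gradient descent on the quadratic training error.
    xi: (P, N) array of inputs, tau: (P,) teacher outputs."""
    P, N = xi.shape

    # --- A) weighted empirical correlation matrix and its eigenvectors ---
    C = (xi * (tau ** 2)[:, None]).T @ xi / P      # illustrative weighting by tau^2
    eigval, eigvec = np.linalg.eigh(C)             # eigenvalues in ascending order
    # eigenvector of the largest plus the (K-1) smallest eigenvalues
    basis = np.hstack([eigvec[:, -1:], eigvec[:, :K - 1]])   # shape (N, K)

    # --- B) optimize K*K coefficients in the identified subspace ---
    z = xi @ basis                          # (P, K) projections of the inputs
    A = 0.1 * np.random.randn(K, K)         # coefficients: w_j = basis @ A[:, j]
    for _ in range(steps):
        fields = z @ A                      # (P, K) hidden-unit fields
        out = np.tanh(fields).sum(axis=1)   # soft-committee output
        delta = out - tau
        # gradient of 0.5*mean(delta^2) w.r.t. A, using tanh' = 1 - tanh^2
        grad = z.T @ (delta[:, None] * (1.0 - np.tanh(fields) ** 2)) / P
        A -= eta * grad
    return basis @ A                        # (N, K): one student weight vector per column
```

In such a setup, xi would be drawn from an isotropic density and tau generated by the teacher network; the point of the construction is that only K² coefficients are trained instead of K·N weights.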
Typical properties: given a random set of P = α·K·N examples, the analysis proceeds via a formal partition sum and the quenched free energy, evaluated with the replica trick and saddle-point integration in the limit N → ∞.
A) The typical overlap of the estimated eigenvectors with the teacher weights measures the success of the teacher-space identification.
B) Given this overlap, determine the optimal ε_g achievable by a linear combination of the identified eigenvectors.
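The standard quenched-average identities presumably underlying this step, stated generically (the precise partition sum Z for this model is not specified on the slide): with f the free energy per degree of freedom,

\[
-\,\beta f \;=\; \lim_{N\to\infty} \frac{1}{N}\,\big\langle \ln Z \big\rangle_{\{\xi^{\mu},\,\tau^{\mu}\}}, \qquad
\big\langle \ln Z \big\rangle \;=\; \lim_{n\to 0} \frac{\big\langle Z^{n} \big\rangle - 1}{n},
\]

evaluated by saddle-point integration over the order parameters.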
K = 3, Statistical Physics theory and simulations for N = 400 and N = 1600; P = α·K·N examples. Panels A) and B) show the quality of the teacher-space identification and the achievable ε_g versus α. Critical values: α_c(K = 2) = 4.49, α_c(K = 3) = 8.70; large-K theory: α_c(K) ≈ 2.94 K (N-independent!).
K = 3, theory and Monte Carlo simulations, N = 400 and N = 1600: at α_c the student changes from unspecialized to specialized, i.e. specialization is achieved without a priori knowledge, with α_c independent of N. [Bunzmann, Biehl, Urbanczik, Phys. Rev. Lett. 86, 2166 (2001)]
Spectrum of the matrix C_P for a teacher with M = 7 hidden units: the K − 1 = 6 smallest eigenvalues are clearly separated. The algorithm requires no prior knowledge of M; the PCA spectrum hints at the required model complexity. Potential application: model selection.
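A rough sketch of how such a spectrum could be used for model selection, based on the same illustrative correlation matrix as above (the outlier criterion and threshold are assumptions, not the original method):

```python
import numpy as np

def estimate_complexity(xi, tau, n_mad=8.0):
    """Count the eigenvalues of the weighted correlation matrix that split off
    from the bulk of the spectrum. For a teacher with M hidden units one expects
    roughly M outliers (1 large, M-1 small); n_mad sets the threshold in units
    of the median absolute deviation of the spectrum."""
    P, N = xi.shape
    C = (xi * (tau ** 2)[:, None]).T @ xi / P
    lam = np.linalg.eigvalsh(C)
    center = np.median(lam)
    mad = np.median(np.abs(lam - center)) + 1e-12   # robust bulk width
    return int(np.sum(np.abs(lam - center) > n_mad * mad))
```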
Summary
· model situation, supervised learning: the soft committee machine, student-teacher scenario, randomized training data
· statistical physics inspired approach: large systems, thermal (training) and disorder (data) averages, typical macroscopic properties
· dynamics of on-line gradient descent: delayed learning due to the symmetry breaking required by the necessary specialization processes
· efficient training: the PCA-based learning algorithm reduces the dimensionality of the problem; specialization without a priori knowledge
Further topics
· perceptron training (single layer): optimal stability classification, dynamics of learning
· unsupervised learning: principal component analysis; competitive learning, clustered data
· specialization processes: discontinuous learning curves; delayed learning, plateau states
· dynamics of on-line training: perceptron, unsupervised learning, two-layered feed-forward networks
· algorithm design: variational method, optimal algorithms, construction algorithms
· non-trivial statistics of data: learning from noisy data, time-dependent rules
Selected prospective projects
· unsupervised learning: density estimation, feature detection, clustering, (Learning) Vector Quantization, compression, self-organizing maps
· application-relevant architectures and algorithms: Local Linear Model Trees, Learning Vector Quantization, Support Vector Machines
· model selection: estimate the complexity of a rule or of a mixture density
· algorithm design: variational optimization, e.g. of an alternative correlation matrix