1
Herding: The Nonlinear Dynamics of Learning. Max Welling, SCIVI LAB, UC Irvine
2
Yes, All Models Are Wrong, but… from a CS/ML perspective this is not necessarily any less of a problem. Training: we want to gain an optimal amount of predictive accuracy per unit time. Testing: we want to engage the model that yields optimal accuracy within the time allowed to make a decision. Computer scientists are mostly interested in prediction. Example: ML researchers do not care about identifiability (as long as the model predicts well). Computer scientists care a lot about computation. Example: ML researchers are willing to trade off estimation bias for computation, if this means we can handle bigger datasets – e.g. variational inference vs. MCMC. (Fight or flight.)
3
Not Bayesian Nor Frequentist But Mandelbrotist… Is there a deep connection between learning, computation and chaos theory?
4
Perspective. [Diagram: one path goes model/inductive bias → learning → inference → integration → prediction; the other goes model/inductive bias → herding → pseudo-samples → integration → prediction.]
5
Herding: a nonlinear dynamical system that generates pseudo-samples "S". [Diagram labels: prediction, consistency.]
6
Herding weights do not converge, but the Monte Carlo sums do. The maximization does not have to be perfect (see PCT theorem). Deterministic. No step-size. Only very simple operations (no exponentiation, logarithms, etc.).
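To make the recipe on these slides concrete, here is a minimal sketch of a herding loop for a small discrete model. The names (`herd`, `phi`, `states`, `target_moments`) are mine, not from the talk, and the code is an illustration of the update rather than the authors' implementation.

```python
import numpy as np

def herd(phi, states, target_moments, T):
    """Generate T pseudo-samples whose feature averages track `target_moments`.

    phi(s)         : feature vector of state s (numpy array)
    states         : list of candidate states (small discrete state space)
    target_moments : data averages of the features we want to match
    """
    target = np.asarray(target_moments, dtype=float)
    w = target.copy()          # weights oscillate forever; they never converge
    samples = []
    for _ in range(T):
        # (Approximately) maximize <w, phi(s)>; by the PCT this need not be exact.
        s = max(states, key=lambda s: float(np.dot(w, phi(s))))
        samples.append(s)
        # Deterministic update: no step size, no exponentiation or logarithms.
        w += target - phi(s)
    return samples
```

The running feature average (1/T) Σ_t φ(s_t) then approaches `target_moments`, which is the "Monte Carlo sums do converge" half of the slide.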
7
Ising/Hopfield Model Network. A neuron fires if its input exceeds its threshold. A synapse depresses if the pre- & postsynaptic neurons both fire. The threshold depresses after the neuron fires.
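The neural reading above maps onto a herding step for a fully visible Ising/Hopfield network. Below is a hedged sketch of one such step; the ±1 coding, the fixed number of asynchronous sweeps, and the names `W`, `b`, `data_pair`, `data_mean` are my assumptions, not details given on the slide.

```python
import numpy as np

def ising_herding_step(W, b, s, data_pair, data_mean, n_sweeps=5):
    """One illustrative herding step on a fully visible Ising/Hopfield network.

    W, b       : couplings (zero diagonal assumed) and thresholds/biases
    s          : current +/-1 state vector (modified in place)
    data_pair  : empirical average of s_i*s_j over the data
    data_mean  : empirical average of s_i over the data
    """
    n = len(s)
    # 1) Dynamics: a neuron fires (s_i = +1) iff its input exceeds its threshold.
    #    A few asynchronous sweeps approximately maximize s'Ws + b's
    #    (per the PCT, the maximization need not be exact).
    for _ in range(n_sweeps):
        for i in range(n):
            s[i] = 1 if W[i] @ s + b[i] > 0 else -1
    # 2) "Synapse depresses if pre- & postsynaptic neurons fire":
    #    couplings are reduced by s_i*s_j and restored toward the data statistics.
    W += data_pair - np.outer(s, s)
    # 3) "Threshold depresses after neuron fires":
    #    biases are adjusted by -s_i and restored toward the data mean activity.
    b += data_mean - s
    return s
```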
8
Pseudo-Samples From Critical Ising Model
9
Herding as a Dynamical System. [Diagram labels: s, w, data, constant.] Piecewise constant function of w. Markov process in w. Infinite-memory process in s.
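The update equations for this dynamical system were not captured in the transcript; the following is a reconstruction from the herding literature (φ is the feature map, φ̄ the data average), offered as a sketch rather than the slide's exact notation.

```latex
s_t = \arg\max_{s}\; \langle w_{t-1}, \phi(s) \rangle ,
\qquad
w_t = w_{t-1} + \bar{\phi} - \phi(s_t),
\qquad
\bar{\phi} = \frac{1}{N}\sum_{n=1}^{N} \phi(x_n).
```

This makes the slide's labels concrete: s_t is a piecewise constant function of w (it depends only on which linear function ⟨w, φ(s)⟩ is largest), the evolution of w is a Markov process, and the emitted sequence s_1, s_2, … has infinite memory.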
10
Example in 2-D. [Figure: the weight plane is partitioned into regions labeled s=1 … s=6; the itinerary of regions visited by w gives the sample sequence, e.g. s=[1,1,2,5,2...]
12
Convergence. Translation: choose s_t such that the condition reconstructed below holds; then the sample averages converge. Equivalent to the "Perceptron Cycling Theorem" (Minsky '68). [Figure: 2-D itinerary as on the previous slide.]
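The condition and conclusion missing from this slide can be filled in, hedged, from the Perceptron Cycling Theorem argument used for herding: it suffices that each pseudo-sample scores at least as well as the data average against the current weights.

```latex
\langle w_{t-1},\, \bar{\phi} - \phi(s_t) \rangle \le 0
\;\;\Longrightarrow\;\;
\|w_t\| \text{ remains bounded, and, since }
w_T = w_0 + \sum_{t=1}^{T} \bigl(\bar{\phi} - \phi(s_t)\bigr),
\quad
\Bigl\| \tfrac{1}{T}\sum_{t=1}^{T} \phi(s_t) - \bar{\phi} \Bigr\|
= \frac{\|w_0 - w_T\|}{T} = O(1/T).
```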
13
Period Doubling. As we change R (T), the number of fixed points changes. T=0: herding, the "edge of chaos".
14
Applications Classification Compression Modeling Default Swaps Monte Carlo Integration Image Segmentation Natural Language Processing Social Networks
15
Example: a classifier from local image features + a classifier from boundary detection, combined with herding. Herding will generate samples such that the local probabilities are respected as much as possible (projection onto the marginal polytope).
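One hedged way to read "project on the marginal polytope": when the supplied local probabilities p are mutually inconsistent, so that p lies outside the marginal polytope M of the features φ, the empirical moments of the herding samples settle on a nearby point of M. The specific (Euclidean) projection written below is my guess at the intended statement, not something shown on the slide.

```latex
\frac{1}{T}\sum_{t=1}^{T} \phi(s_t)
\;\longrightarrow\;
\operatorname*{arg\,min}_{\mu \in \mathcal{M}} \|\mu - p\|_2
\qquad (T \to \infty).
```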
16
Topological Entropy. Theorem [Goetz00]: let W(T) be the number of possible subsequences of length T (e.g. s = 1,3,2); this gives the topological entropy for herding. However, we are interested in the sub-extensive entropy [Nemenman et al.]: a theorem and a conjecture relate it to K, the number of parameters, for typical herding systems.
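The formulas on this slide did not survive the transcript. A hedged reconstruction of the first: with W(T) the subsequence count, the topological entropy is the extensive growth rate of log W(T), and the point (following Goetz's result for piecewise isometries, of which herding is an instance) is that this rate vanishes, so only the sub-extensive growth of log W(T) is informative.

```latex
h_{\mathrm{top}} = \lim_{T\to\infty} \frac{1}{T}\,\log W(T) = 0
\qquad \text{for herding [Goetz00]}.
```

The theorem/conjecture pair about the sub-extensive entropy presumably ties the growth of log W(T) to K, the number of parameters, but its exact form is not reconstructed here.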
17
Learning Systems. Herding is not random and not IID, due to negative auto-correlations; the information in its sequence is sub-extensive (previous slide). We can therefore represent the original (random) data sample by a much smaller subset without loss of information content (N instead of N² samples). These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums. [Figure label: information we learn from the random IID data.]
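The N-versus-N² claim follows from the two convergence rates quoted earlier in the talk; a quick back-of-the-envelope version (my phrasing, with the caveat that the fast herding rate applies to averages of the matched features):

```latex
\Bigl|\frac{1}{T}\sum_{t=1}^{T} f(s_t) - \mathbb{E}[f]\Bigr| =
\begin{cases}
O(1/\sqrt{T}) & \text{i.i.d. random samples},\\
O(1/T)        & \text{herding pseudo-samples},
\end{cases}
\qquad\Longrightarrow\qquad
T_{\mathrm{iid}} \sim \varepsilon^{-2},
\quad
T_{\mathrm{herding}} \sim \varepsilon^{-1},
```

so N herded pseudo-samples reach roughly the accuracy of N² i.i.d. samples.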
18
Conclusions. Herding is an efficient alternative for learning in MRFs. Edge-of-chaos dynamics provides more efficient information processing than random sampling. Is this a general principle underlying information processing in the brain? We advocate exploring potentially interesting connections between computation, learning, and the theory of nonlinear dynamical systems and chaos. What can we learn from viewing learning as a nonlinear dynamical process?