Herding Dynamical Weights Max Welling Bren School of Information and Computer Science UC Irvine
Motivation Xi=1 means that pin i will fall during a Bowling round. Xi=0 means that pin i will still stand. You are given pairwise probabilities P(Xi,Xj). Task: predict the distribution Q(n), n=0,.., 10 of the total number of pins that will fall. Stock market: Xi=1 means that company i defaults. You are interested in the probability of n companies defaulting in your portfolio.
Sneak Preview Newsgroups-small (collected by S. Roweis) 100 binary features, 16,242 instances (300 shown) (Note: herding is a deterministic algorithm, no noise was added) Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments. Quiz: which is which [top/bottom]? -data in random order. -herding sequence in order received.
Traditional Approach: Hopfield Nets & Boltzmann Machines weight state value (say 0/1) Energy: Probability of a joint state: Coordinate descent on energy:
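The formulas on this slide did not survive in the text. As a minimal sketch, assuming the standard textbook Boltzmann machine with symmetric weights `W`, biases `a`, and 0/1 states (the names `energy`, `joint_probability`, and `coordinate_descent` are illustrative, and the numbers are made up), the three quantities named above could look like this:

```python
import itertools
import math

def energy(s, W, a):
    """E(s) = -sum_{i<j} W[i][j]*s_i*s_j - sum_i a[i]*s_i for a binary state s."""
    n = len(s)
    pair = sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))
    return -pair - sum(a[i] * s[i] for i in range(n))

def joint_probability(s, W, a):
    """P(s) = exp(-E(s)) / Z, with Z summed by brute force over all 2^n states."""
    n = len(s)
    Z = sum(math.exp(-energy(t, W, a)) for t in itertools.product((0, 1), repeat=n))
    return math.exp(-energy(s, W, a)) / Z

def coordinate_descent(s, W, a, sweeps=10):
    """Set each unit in turn to the value that lowers the energy, others held fixed."""
    s = list(s)
    n = len(s)
    for _ in range(sweeps):
        for i in range(n):
            field = a[i] + sum(W[i][j] * s[j] for j in range(n) if j != i)
            s[i] = 1 if field > 0 else 0
    return s

# Made-up symmetric weights and biases for three units.
W = [[0.0, 1.0, -2.0], [1.0, 0.0, 0.5], [-2.0, 0.5, 0.0]]
a = [0.1, -0.2, 0.3]
start = (1, 0, 1)
final = coordinate_descent(start, W, a)
```

The brute-force normaliser `Z` is only feasible for a handful of units, which is the intractability the next slides refer to.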
Traditional Learning Approach Use contrastive divergence (CD) instead!
What’s Wrong With This? E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent). Slow convergence & local minima (only w/ hidden vars) Sampling can get stuck in local modes (slow mixing).
Solution in a Nutshell Nonlinear Dynamical System (sidestep learning + sampling)
Herding Dynamics no stepsize no random numbers no exponentiation no point estimates
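The update equations on this slide are missing from the text. The sketch below assumes the usual herding form: pick the state that maximises the weighted sum of features, then add (observed moment minus feature value) to each weight. The function name `herd_pairwise`, the toy data, and the choice to start the weights at the moments are all assumptions made for illustration.

```python
import itertools

def herd_pairwise(mean1, mean2, steps):
    """Run herding on binary states with single and pairwise features.

    mean1[i]    : observed average of x_i
    mean2[i][j] : observed average of x_i * x_j (only i < j is used)
    """
    n = len(mean1)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    states = list(itertools.product((0, 1), repeat=n))
    # Assumption: weights start at the observed moments.
    w1 = list(mean1)
    w2 = {(i, j): mean2[i][j] for (i, j) in pairs}

    def score(s):
        return (sum(w1[i] * s[i] for i in range(n))
                + sum(w2[p] * s[p[0]] * s[p[1]] for p in pairs))

    samples = []
    for _ in range(steps):
        s = max(states, key=score)          # full maximisation; small n only
        for i in range(n):
            w1[i] += mean1[i] - s[i]        # no stepsize, no random numbers
        for (i, j) in pairs:
            w2[(i, j)] += mean2[i][j] - s[i] * s[j]
        samples.append(s)
    return samples

data = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]   # toy data, made up
n = 3
mean1 = [sum(x[i] for x in data) / len(data) for i in range(n)]
mean2 = [[sum(x[i] * x[j] for x in data) / len(data) for j in range(n)]
         for i in range(n)]
samples = herd_pairwise(mean1, mean2, steps=2000)
herded1 = [sum(s[i] for s in samples) / len(samples) for i in range(n)]
```

Comparing `herded1` with `mean1` shows the sample averages tracking the input moments. For larger `n`, the brute-force `max` would have to be replaced by a local search such as coordinate ascent.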
Piston Analogy weights=pistons Pistons move up at a constant rate (proportional to observed correlations). When they get too high, the “fuel” will combust and the piston will be pushed down (depression). “Engine driven by observed correlations”
Herding Dynamics with General Features no stepsize no random numbers no exponentiation no point estimates
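With general features the same loop applies to any feature map. A minimal sketch follows, in which the names `herd`, `phi`, and `phi_bar` and the one-hot example are hypothetical, and the update rule is assumed to be the usual "add target moments, subtract features of the chosen state":

```python
def herd(states, phi, phi_bar, steps):
    """Herding with an arbitrary feature map phi(state) -> list of floats."""
    w = list(phi_bar)   # assumption: start the weights at the target moments
    feats = {s: phi(s) for s in states}
    seq = []
    for _ in range(steps):
        # State with the largest weighted feature sum (first one wins ties).
        s = max(states, key=lambda t: sum(wk * fk for wk, fk in zip(w, feats[t])))
        w = [wk + pk - fk for wk, pk, fk in zip(w, phi_bar, feats[s])]
        seq.append(s)
    return seq, w

states = [0, 1, 2]
one_hot = lambda s: [1.0 if s == k else 0.0 for k in range(3)]
phi_bar = [0.5, 0.3, 0.2]          # made-up target frequencies
seq, w = herd(states, one_hot, phi_bar, steps=1000)
freq = [seq.count(k) / len(seq) for k in states]
```

With one-hot features the target moments are simply state frequencies, so `freq` can be compared directly with `phi_bar`.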
Features as New Coordinates If then period is infinite thanks to Romain Thibaux
Example: weights initialized in a grid; red ball tracks 1 weight; convergence on a fractal attractor set with Hausdorff dim. 1.5
The Tipi Function gradient descent on G(w) with stepsize 1. This function is: Concave, Piecewise linear, Non-positive, Scale free. Coordinate ascent replaced with full maximization. The scale-free property implies that the stepsize will not affect the state sequence S.
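The definition of G(w) is missing from the text. The sketch below assumes the form G(w) = w·φ̄ − max_s w·φ(s), where φ̄ denotes the observed feature averages, and checks two of the listed properties numerically on made-up toy data. The helper names (`tipi`, `herd_states`) are illustrative. The stepsize comparison scales the initial weights by the same factor as the step, which is what the scale-free argument requires.

```python
import itertools
import random

STATES = list(itertools.product((0, 1), repeat=3))

def phi(s):
    """Single and pairwise features of a binary state (illustrative choice)."""
    return [s[0], s[1], s[2], s[0] * s[1], s[0] * s[2], s[1] * s[2]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def tipi(w, phi_bar):
    """Assumed form: G(w) = w . phi_bar - max_s w . phi(s)."""
    return dot(w, phi_bar) - max(dot(w, phi(s)) for s in STATES)

def herd_states(w0, phi_bar, step, n_steps):
    """State sequence when the weight update is multiplied by `step`."""
    w, seq = list(w0), []
    for _ in range(n_steps):
        s = max(STATES, key=lambda t: dot(w, phi(t)))
        w = [wk + step * (pk - fk) for wk, pk, fk in zip(w, phi_bar, phi(s))]
        seq.append(s)
    return seq

data = [(1, 1, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]   # toy data, made up
phi_bar = [sum(col) / len(data) for col in zip(*(phi(x) for x in data))]

rng = random.Random(0)
ws = [[rng.uniform(-3, 3) for _ in range(6)] for _ in range(50)]
max_G = max(tipi(w, phi_bar) for w in ws)             # non-positive
scale_gap = max(abs(tipi([4 * x for x in w], phi_bar) - 4 * tipi(w, phi_bar))
                for w in ws)                          # G(4w) = 4 G(w)

w0 = [0.25, -0.5, 0.75, 0.5, -0.25, 1.0]
same = (herd_states(w0, phi_bar, 1.0, 200)
        == herd_states([0.5 * x for x in w0], phi_bar, 0.5, 200))
```

Here `same` compares the state sequence at stepsize 1 with the one at stepsize 0.5.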
Recurrence Thm: If we can find the optimal state S, then the weights will stay within a compact region. Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.
Ergodicity [Figure labels: s=1, s=2, s=3, s=4, s=5, s=6; state sequence s=[1,1,2,5,2...] Thm: If the 2-norm of the weights grows slower than linearly, then feature averages over trajectories converge to data averages.
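A short derivation makes the theorem plausible. It assumes the herding update w_t = w_{t-1} + φ̄ − φ(s_t), which is not recoverable from this text; φ̄ denotes the data averages of the features and T the number of steps.

```latex
w_T = w_0 + T\,\bar{\phi} - \sum_{t=1}^{T} \phi(s_t)
\;\Longrightarrow\;
\frac{1}{T}\sum_{t=1}^{T} \phi(s_t) - \bar{\phi} = \frac{w_0 - w_T}{T},
\qquad
\left\| \frac{1}{T}\sum_{t=1}^{T} \phi(s_t) - \bar{\phi} \right\|_2
\le \frac{\|w_0\|_2 + \|w_T\|_2}{T}.
```

If the norm of w_T grows slower than linearly in T, the right-hand side goes to zero. If the weights stay bounded, the error shrinks like 1/T.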
Relation to Maximum Entropy Dual: Tipi function: Herding dynamics satisfies the moment constraints but does not maximize entropy.
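The two formulas on this slide are missing from the text. One common way to write the relation, given here as an assumption rather than as the slide's own notation, is to introduce a temperature τ into the maximum entropy dual and let it go to zero:

```latex
\ell(w) = w^\top \bar{\phi} - \log \sum_{s} \exp\!\big(w^\top \phi(s)\big)
\qquad \text{(maximum entropy dual)}

\ell_\tau(w) = w^\top \bar{\phi} - \tau \log \sum_{s} \exp\!\big(w^\top \phi(s)/\tau\big)

G(w) = \lim_{\tau \to 0} \ell_\tau(w) = w^\top \bar{\phi} - \max_{s}\, w^\top \phi(s)
\qquad \text{(Tipi function)}
```

In this reading, the log-sum-exp term collapses to a maximum as τ goes to zero, which is where the piecewise linear shape comes from.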
Advantages / Disadvantages Learning & Inference have merged into one dynamical system. Fully tractable – although one should monitor whether local maximization is enough to keep weights finite. Very fast: no exponentiation, no random number generation. No fudge factors (learning rates, momentum, weight decay..). Very efficient mixing over all “modes” (attractor set). Moments preserved, but what is our “inductive bias”? (i.e. what happens to remaining degrees of freedom?).
Back to Bowling Data collected by P. Cotton. 10 pins, 298 bowling runs. X=1 means a pin has fallen in two subsequent bowls. H.XX uses all pairwise probabilities H.XXX uses all triplet probabilities P(total nr. pins falling)
More Results Datasets: Bowling (n=298, d=10, k=2, Ntrain=150, Ntest = 148) Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest = 2177) Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest = 6242) 8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest =600) Task: given only pairwise probabilities, compute the probability of the total nr. of 1’s in a data-vector Q(n). Solution: apply herding and compute Q(n) through sample averages. Error : KL[Pdata||Pest] Task: given only pairwise probabilities, compute the classifier P(Y|X). Solution: train logistic regression (LR) classifier on herding sequence. Error : fraction of misclassified test cases. LR is too simple, PL on herding sequence also gives In higher dimensions herding loses its advantage in accuracy
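The first task can be sketched as follows: count how often each total occurs in a sequence of binary vectors, then compare the two distributions with KL divergence. The function names, the `eps` guard against empty bins, and the tiny stand-in sequences are assumptions for illustration, not the experimental setup.

```python
import math

def count_distribution(samples, d):
    """Q(n): fraction of samples whose entries sum to n, for n = 0..d."""
    q = [0.0] * (d + 1)
    for s in samples:
        q[sum(s)] += 1.0 / len(samples)
    return q

def kl_divergence(p, q, eps=1e-12):
    """KL[p || q] in nats; eps (an assumption) guards against q[n] == 0."""
    return sum(pn * math.log(pn / max(qn, eps)) for pn, qn in zip(p, q) if pn > 0)

test_data = [(1, 0), (1, 1), (0, 0), (1, 1)]   # stand-in for held-out data
herded = [(1, 1), (0, 1), (1, 1), (0, 0)]      # stand-in for a herding sequence
p_data = count_distribution(test_data, d=2)
q_est = count_distribution(herded, d=2)
error = kl_divergence(p_data, q_est)
```

In practice `herded` would be the sequence produced by a herding run on the training moments, and `test_data` the held-out vectors.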
Conclusions Herding replaces point estimates with trajectories over attractor sets (which is not the Bayesian posterior) in a tractable manner. Model for “neural computation” – similar to dynamical synapses – Quasi-random sampling of state space (chaotic?) – Local updates – Efficient (no random numbers, exponentiation)