Herding Dynamical Weights Max Welling Bren School of Information and Computer Science UC Irvine

Motivation. Xi=1 means that pin i falls during a bowling round; Xi=0 means that pin i remains standing. You are given the pairwise probabilities P(Xi, Xj). Task: predict the distribution Q(n), n=0,...,10, of the total number of pins that fall. Stock market: Xi=1 means that company i defaults; you are interested in the probability of n companies in your portfolio defaulting.

Sneak Preview. Newsgroups-small (collected by S. Roweis): 100 binary features, 16,242 instances (300 shown). (Note: herding is a deterministic algorithm; no noise was added.) Herding is a deterministic dynamical system that turns “moments” (average feature statistics) into “samples” which share the same moments. Quiz: which is which [top/bottom]? - data in random order. - herding sequence in the order received.

Traditional Approach: Hopfield Nets & Boltzmann Machines. Weights w_ij connect binary state values s_i (say 0/1). Energy; probability of a joint state; coordinate descent on the energy (equations below).
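The equations on this slide were images and did not survive transcription; for a fully visible Boltzmann machine with pairwise weights w_ij and biases alpha_i (the standard setting, assumed here), they take the usual form:

E(s) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i \alpha_i s_i
P(s) = \exp(-E(s)) / Z, \qquad Z = \sum_{s'} \exp(-E(s'))
s_i \leftarrow \arg\min_{s_i \in \{0,1\}} E(s_i, s_{-i}) \quad \text{(coordinate descent on the energy)}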

Traditional Learning Approach: maximum likelihood gradient ascent (gradient below). Use CD instead!
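The learning rule itself was an image on the slide; in the standard maximum-likelihood form for the model above it is

\Delta w_{ij} \propto E_{data}[X_i X_j] - E_{model}[X_i X_j], \qquad \Delta \alpha_i \propto E_{data}[X_i] - E_{model}[X_i],

and contrastive divergence (CD) replaces the intractable model expectations with estimates from a few Gibbs-sampling steps started at the data.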

What’s Wrong With This?
- E[Xi] and E[XiXj] are intractable to compute (and you need them at every iteration of gradient descent).
- Slow convergence & local minima (only with hidden variables).
- Sampling can get stuck in local modes (slow mixing).

Solution in a Nutshell Nonlinear Dynamical System (sidestep learning + sampling)

Herding Dynamics
- no stepsize
- no random numbers
- no exponentiation
- no point estimates
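The update equations of herding were shown as images on this slide. The sketch below (plain Python with hypothetical names such as herd_pairwise; a minimal illustration, not Welling's original code) shows the two alternating steps as described in the herding papers: maximize the weighted feature sum over states, then move the weights toward the observed moments.

import itertools
import numpy as np

def herd_pairwise(mean_x, mean_xx, num_steps):
    """Herding for a fully visible pairwise binary model.

    mean_x  : length-d vector of observed means E_data[X_i]
    mean_xx : d x d matrix of observed pairwise moments E_data[X_i X_j]
    Returns a (num_steps x d) array of herded 0/1 'samples'.
    """
    mean_x = np.asarray(mean_x, dtype=float)
    mean_xx = np.asarray(mean_xx, dtype=float)
    d = mean_x.shape[0]
    w = np.zeros(d)        # weights on single-node features
    W = np.zeros((d, d))   # weights on pairwise features
    samples = np.zeros((num_steps, d), dtype=int)

    # Enumerate all 2^d binary states; exhaustive maximization is feasible for
    # small d (the bowling example has d = 10). For larger d, local
    # (coordinate-wise) maximization would be used instead.
    states = np.array(list(itertools.product([0, 1], repeat=d)))

    for t in range(num_steps):
        # Step 1: pick the state maximizing the current weighted feature sum
        # (no random numbers, no exponentiation).
        scores = states @ w + np.einsum('ij,ni,nj->n', W, states, states)
        s = states[np.argmax(scores)]
        # Step 2: push the weights toward the observed moments and away from
        # the features of the chosen state (fixed stepsize of 1).
        w += mean_x - s
        W += mean_xx - np.outer(s, s)
        samples[t] = s
    return samples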

Piston Analogy: weights = pistons. Pistons move up at a constant rate (proportional to the observed correlations). When a piston gets too high, the “fuel” combusts and the piston is pushed down (depression). “An engine driven by observed correlations.”

Herding Dynamics with General Features
- no stepsize
- no random numbers
- no exponentiation
- no point estimates
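The general-feature updates were again shown as images; in the notation of the herding papers, with \bar\phi_\alpha denoting the observed average of feature \phi_\alpha, they read

s_t = \arg\max_s \sum_\alpha w_{\alpha,t-1}\, \phi_\alpha(s)
w_{\alpha,t} = w_{\alpha,t-1} + \bar\phi_\alpha - \phi_\alpha(s_t),

i.e. the pairwise case above with \phi ranging over arbitrary feature functions.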

Features as New Coordinates. If …, then the period is infinite. (Thanks to Romain Thibaux.)

Example: weights initialized in a grid; the red ball tracks one weight; convergence on a fractal attractor set with Hausdorff dimension 1.5.

The Tipi Function. Gradient descent on G(w) with stepsize 1. This function is:
- concave
- piecewise linear
- non-positive
- scale-free
Coordinate ascent is replaced with full maximization. The scale-free property implies that the stepsize does not affect the state sequence S.
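The formula for G(w) was an image on the slide; a common way to write it (the sign convention here is an assumption, following the herding papers) is

G(w) = w^T \bar\phi - \max_s w^T \phi(s) = \min_s \, w^T (\bar\phi - \phi(s)),

which is concave and piecewise linear (a pointwise minimum of linear functions), non-positive (since \bar\phi is an average of feature vectors, w^T \bar\phi \le \max_s w^T \phi(s)), and scale-free (G(\beta w) = \beta G(w) for \beta > 0). The herding update w \leftarrow w + \bar\phi - \phi(s^*), with s^* = \arg\max_s w^T \phi(s), is a stepsize-1 (sub)gradient step on G; scale-freeness means any other stepsize would only rescale the weights and leave the state sequence unchanged.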

Recurrence Thm: If we can find the optimal state S, then the weights will stay within a compact region. Empirical evidence: coordinate ascent is sufficient to guarantee recurrence.

Ergodicity. [Figure: states s = 1,...,6 and an example herding sequence s = [1, 1, 2, 5, 2, ...].] Theorem: if the 2-norm of the weights grows slower than linearly, then feature averages over trajectories converge to data averages.
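The theorem follows from summing the weight update over time (a one-line argument not spelled out on the slide):

w_T = w_0 + \sum_{t=1}^{T} (\bar\phi - \phi(s_t))
\;\;\Rightarrow\;\; \frac{1}{T} \sum_{t=1}^{T} \phi(s_t) - \bar\phi = \frac{w_0 - w_T}{T},

so if \|w_T\|_2 grows slower than linearly in T, the right-hand side vanishes, and the feature averages along the trajectory converge to the data averages (at rate O(1/T) when the weights stay bounded).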

Relation to Maximum Entropy. Dual and Tipi function: see below. Herding dynamics satisfies the moment constraints but not maximal entropy.
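The primal/dual pair referred to here was displayed as images; in the standard maximum-entropy formulation (an assumption consistent with the herding papers) it reads

primal:  \max_P H(P) \quad \text{subject to} \quad E_P[\phi_\alpha(X)] = \bar\phi_\alpha \;\; \text{for all } \alpha
dual:    \max_w \; \ell(w) = \sum_\alpha w_\alpha \bar\phi_\alpha - \log \sum_s \exp\Big(\sum_\alpha w_\alpha \phi_\alpha(s)\Big).

Replacing the log-sum-exp by \max_s \sum_\alpha w_\alpha \phi_\alpha(s) (the zero-temperature limit) gives the tipi function G(w) of the previous slide. Herding runs fixed-stepsize gradient dynamics on G, so trajectory averages match the moments (ergodicity), but nothing drives the remaining degrees of freedom toward the maximum-entropy distribution.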

Advantages / Disadvantages
- Learning and inference have merged into one dynamical system.
- Fully tractable, although one should monitor whether local maximization is enough to keep the weights finite.
- Very fast: no exponentiation, no random number generation.
- No fudge factors (learning rates, momentum, weight decay, ...).
- Very efficient mixing over all “modes” (attractor set).
- Moments are preserved, but what is our “inductive bias”? (I.e., what happens to the remaining degrees of freedom?)

Back to Bowling. Data collected by P. Cotton: 10 pins, 298 bowling runs. Xi=1 means that pin i has fallen in two subsequent bowls. H.XX uses all pairwise probabilities; H.XXX uses all triplet probabilities. [Plot: P(total nr. of pins falling).]

More Results
Datasets:
- Bowling (n=298, d=10, k=2, Ntrain=150, Ntest=148)
- Abalone (n=4177, d=8, k=2, Ntrain=2000, Ntest=2177)
- Newsgroup-small (n=16,242, d=100, k=2, Ntrain=10,000, Ntest=6242)
- 8x8 Digits (n=2200 [3’s and 5’s], d=64, k=2, Ntrain=1600, Ntest=600)
Task 1: given only pairwise probabilities, compute the probability Q(n) of the total number of 1’s in a data vector. Solution: apply herding and compute Q(n) through sample averages. Error: KL[Pdata||Pest].
Task 2: given only pairwise probabilities, compute the classifier P(Y|X). Solution: train a logistic regression (LR) classifier on the herding sequence. Error: fraction of misclassified test cases.
Notes: LR is too simple; PL on the herding sequence also gives … In higher dimensions herding loses its advantage in accuracy.
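As an illustration of the first task (a minimal sketch, assuming herded samples are available as a 0/1 matrix, e.g. from the hypothetical herd_pairwise sketch above), Q(n) is simply the empirical distribution of the row sums:

import numpy as np

def estimate_Q(samples):
    """Estimate Q(n) = P(number of 1's in a data vector = n) from 0/1 samples."""
    d = samples.shape[1]
    counts = np.bincount(samples.sum(axis=1), minlength=d + 1)
    return counts / counts.sum()

def kl_divergence(p_data, p_est, eps=1e-12):
    """KL[Pdata || Pest]; a small epsilon guards against zero estimated bins."""
    p_data = np.asarray(p_data, dtype=float)
    p_est = np.asarray(p_est, dtype=float) + eps
    mask = p_data > 0
    return float(np.sum(p_data[mask] * np.log(p_data[mask] / p_est[mask])))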

Conclusions
- Herding replaces point estimates with trajectories over attractor sets (which are not the Bayesian posterior) in a tractable manner.
- A model for “neural computation”:
  - similar to dynamical synapses
  - quasi-random sampling of the state space (chaotic?)
  - local updates
  - efficient (no random numbers, no exponentiation)