Herding: The Nonlinear Dynamics of Learning. Max Welling, SCIVI LAB, UC Irvine.

Yes, All Models Are Wrong, but… from a CS/ML perspective this may not necessarily be that big a problem. Training: we want to gain an optimal amount of predictive accuracy per unit time. Testing: we want to use the model that gives optimal accuracy within the time allowed to make a decision. Computer scientists are mostly interested in prediction; for example, ML researchers do not care about identifiability as long as the model predicts well. Computer scientists also care a lot about computation; for example, ML researchers are willing to trade off estimation bias for computation if this means we can handle bigger datasets (e.g. variational inference vs. MCMC). [Image: fight or flight]

Not Bayesian Nor Frequentist But Mandelbrotist… Is there a deep connection between learning, computation and chaos theory?

Perspective. [Diagram: two routes from a model/inductive bias to a prediction. The standard route: learning, then inference, then integration, then prediction. The herding route: herding generates pseudo-samples directly, then integration, then prediction.]

Herding is a nonlinear dynamical system that generates pseudo-samples s. Each iteration alternates two steps: a prediction step that generates a pseudo-sample, and a consistency step that updates the weights so the pseudo-samples stay consistent with the data (see the equations below).
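
In symbols, the two steps are the standard herding updates (as in Welling, 2009); here \phi denotes the feature vector of the model and \bar{\phi} the empirical feature averages (moments) of the data:

```latex
% Prediction: generate a pseudo-sample by maximizing the current score
s_t = \arg\max_{s}\; \langle w_{t-1}, \phi(s) \rangle

% Consistency: move the weights so the sampled moments track the data moments
w_t = w_{t-1} + \bar{\phi} - \phi(s_t)
```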

Properties of herding: the weights do not converge, but the Monte Carlo sums over the pseudo-samples do. The maximization does not have to be perfect (see the Perceptron Cycling Theorem, PCT, below). The dynamics are deterministic, there is no step size to tune, and only very simple operations are used (no exponentiation, logarithms, etc.). A toy numerical sketch follows.
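
A minimal, self-contained sketch (my own toy example, not code from the talk) of herding five independent binary features: the weights keep oscillating, yet the running averages of the pseudo-samples converge to the data moments.

```python
import numpy as np

# Toy data: 5 independent binary features with different firing probabilities.
rng = np.random.default_rng(0)
X = (rng.random((1000, 5)) < [0.1, 0.3, 0.5, 0.7, 0.9]).astype(float)
phi_bar = X.mean(axis=0)              # empirical moments we want to match

w = np.zeros(5)                       # herding weights
running_sum = np.zeros(5)             # sum of generated pseudo-samples
T = 10_000
for t in range(T):
    s = (w > 0).astype(float)         # argmax_s <w, s> for independent binary features
    w += phi_bar - s                  # deterministic update, no step size
    running_sum += s

print("data moments     :", phi_bar)
print("herding averages :", running_sum / T)   # match phi_bar at rate O(1/T)
print("final weights    :", w)                 # bounded but not converged
```

With independent binary features the maximization step is exact; in a general MRF it would be replaced by an (approximate) MAP search.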

Ising/Hopfield Model Network. Herding an Ising/Hopfield network can be read as three simple neural rules: a neuron fires if its input exceeds its threshold; a synapse depresses if the pre- and postsynaptic neurons both fire; and the threshold is depressed after the neuron fires (one possible formalization is sketched below).
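
One way to write these rules as the herding updates for a fully visible Ising/Hopfield model (the {0,1} state coding and the sign conventions are my assumptions; they are not spelled out on the slide):

```latex
% State update (approximate maximization, e.g. asynchronous coordinate ascent):
% neuron i fires when its summed input exceeds its threshold
s_i \leftarrow \mathbb{1}\Big[\sum_{j} w_{ij}\, s_j + w_i > 0\Big]

% Weight updates, driven by the (constant) data moments:
w_{ij} \leftarrow w_{ij} + \overline{x_i x_j} - s_i s_j
  \quad\text{(synapse depresses when pre- and postsynaptic neurons both fire)}
w_{i} \leftarrow w_{i} + \bar{x}_i - s_i
  \quad\text{(the bias/threshold term adapts after neuron } i \text{ fires)}
```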

Pseudo-Samples From Critical Ising Model

Herding as a Dynamical System. The data enter only through the constant moment vector, and the selected pseudo-sample s is a piecewise constant function of the weights w. The dynamics are therefore a Markov process in w, but an infinite-memory process in s.

Example in 2-D. [Figure: the two-dimensional weight space is partitioned into regions labeled s=1, s=2, s=3, s=4, s=5, s=6; the trajectory of the weights visits these regions, and its itinerary reads off the pseudo-sample sequence, e.g. s=[1,1,2,5,2...]

Convergence. Translation: choose s_t such that ⟨w_{t-1}, φ(s_t)⟩ ≥ ⟨w_{t-1}, φ̄⟩, i.e. the chosen state scores at least as high under the current weights as the data average. Then the weights remain bounded and the sampled moments converge to the data moments. This is equivalent to the “Perceptron Cycling Theorem” (Minsky ’68). [Figure: same 2-D itinerary as on the previous slide, s=[1,1,2,5,2...]
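
A one-line derivation of the O(1/T) moment matching, filling in a step the slide leaves implicit: summing the update w_t = w_{t-1} + φ̄ − φ(s_t) over t = 1,…,T and rearranging gives

```latex
\frac{1}{T}\sum_{t=1}^{T}\phi(s_t) - \bar{\phi} = \frac{w_0 - w_T}{T},
\qquad\text{hence}\qquad
\Big\|\frac{1}{T}\sum_{t=1}^{T}\phi(s_t) - \bar{\phi}\Big\|
\le \frac{\|w_0\| + \|w_T\|}{T} = O(1/T)
```

whenever the weights stay bounded, which is exactly what the PCT-style condition above guarantees.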

Period Doubling. As we change R(T) (a temperature-dependent parameter), the number of fixed points changes through a sequence of period doublings. At T = 0 we recover herding, which sits at the “edge of chaos”.

Applications: classification, compression, modeling default swaps, Monte Carlo integration, image segmentation, natural language processing, social networks.

Example. Combine a classifier based on local image features with a classifier based on boundary detection by herding them together. Herding then generates samples such that the local probabilities are respected as much as possible (a projection onto the marginal polytope).

Topological Entropy. Theorem [Goetz00]: let W(T) be the number of possible pseudo-sample subsequences of length T (e.g. s = 1, 3, 2 is a subsequence of length 3); the topological entropy of herding is then zero. However, we are interested in the sub-extensive entropy [Nemenman et al.]. Conjecture: log W(T) grows only as K log T, where K is the number of parameters (for typical herding systems). See the formulas below.
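
In symbols (my reconstruction of the quantities the slide refers to; the K log T growth is the conjecture stated above, not a proven rate):

```latex
% Topological entropy of the herding pseudo-sample sequence [Goetz00, as cited on the slide]
h_{\mathrm{top}} = \lim_{T\to\infty} \frac{1}{T}\,\log W(T) = 0

% Sub-extensive entropy growth, conjectured for typical herding systems
\log W(T) \sim K \log T, \qquad K = \text{number of parameters}
```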

Learning Systems. Herding is not random and not IID, due to negative auto-correlations; the information in its sequence grows only sub-extensively (of the order K log T, as on the previous slide). We can therefore represent the original (random) data sample by a much smaller set of pseudo-samples without loss of information content (N instead of N² samples), matching the information we learn from the random IID data. These shorter herding sequences can be used to efficiently approximate averages by Monte Carlo sums (see the rate comparison below).
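
One standard way to see the “N instead of N²” bookkeeping is via Monte Carlo error rates, for test functions f in the span of the features φ (the rates are standard; connecting them this way is my gloss, the slide's own argument is information-theoretic):

```latex
\Big|\frac{1}{N}\sum_{n=1}^{N} f(x_n) - \mathbb{E}[f]\Big| = O\!\big(N^{-1/2}\big)
\quad\text{(IID sampling)},
\qquad
\Big|\frac{1}{T}\sum_{t=1}^{T} f(s_t) - \mathbb{E}[f]\Big| = O\!\big(T^{-1}\big)
\quad\text{(herding)}
```

so T = √N pseudo-samples match the accuracy of N IID samples; equivalently, N pseudo-samples do the work of roughly N² IID samples.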

Conclusions. Herding is an efficient alternative for learning in MRFs. Edge-of-chaos dynamics provides more efficient information processing than random sampling. Is this a general principle underlying information processing in the brain? We advocate exploring the potentially interesting connections between computation, learning, and the theory of nonlinear dynamical systems and chaos. What can we learn from viewing learning as a nonlinear dynamical process?