Learning with Blocks: Composite Likelihood and Contrastive Divergence
Arthur Asuncion¹, Qiang Liu¹, Alexander Ihler, Padhraic Smyth
Department of Computer Science, University of California, Irvine
¹ Both authors contributed equally.

1. Motivation: Efficient Parameter Estimation
Assume an exponential family p(x | θ) with parameter vector θ, partition function Z(θ), and sufficient statistics φ(x), and suppose we have independent observations x^(1), …, x^(N). Our task is to estimate θ. Maximum likelihood estimation (MLE) is the standard approach; the likelihood gradient is the difference between the expectation of φ(x) with respect to the empirical data distribution and its expectation with respect to the model. MLE has nice theoretical properties: asymptotic consistency and normality, and statistical efficiency. Difficulty: the partition function and its gradient (the model expectation) are intractable for many models. Our approach: composite likelihood + contrastive divergence.

2. Pseudolikelihood and Composite Likelihood
Pseudolikelihood (i.e., MPLE) approximates the log-likelihood by a sum of conditional log-probabilities, one per variable given all the others; each conditional's partition function is easy to calculate. Properties: asymptotically consistent; computationally fast; not as statistically efficient as MLE; underestimates the dependency structure of the model.
Composite likelihood (i.e., MCLE) fills the gap between MLE and MPLE. We focus on conditional composite likelihoods, which condition blocks of variables A_c on the remaining variables. Properties: asymptotically consistent; computational cost greater than MPLE and less than MLE (exponential in the size of the largest block A_c); statistical efficiency greater than MPLE and less than MLE; generally provides more accurate solutions than MPLE.

3. Contrastive Divergence
Contrastive divergence (CD) approximates the second term in the likelihood gradient using MCMC, for efficiency: the model expectation is replaced by an expectation over samples obtained from the n-th step of Gibbs sampling, initialized at the empirical data distribution. CD-1 corresponds to MPLE [Hyvärinen, 2006]. CD-∞ (i.e., the chain has reached equilibrium) corresponds to MLE. CD-n is an algorithmic variant between CD-1 and CD-∞. We propose blocked contrastive divergence (BCD). Spectrum of algorithms: MPLE ↔ CD-1, MCLE ↔ BCD (our contribution), MLE ↔ "CD-∞".

4. Blocked Contrastive Divergence
The gradient of the composite likelihood is again a difference of expectations of the sufficient statistics, now taken block by block. The second term of the gradient can be approximated using a random-scan blocked Gibbs sampler (RSBG):
1. Randomly select a data point i (from the empirical data distribution).
2. Randomly select a block c (with probability 1/C).
3. Update the sample by performing one blocked Gibbs step on block c, conditioned on the remaining variables of data point i.
Blocked contrastive divergence (BCD) is a stochastic version of MCLE (see the paper for the derivation). The connection between CD and composite likelihoods allows for cross-fertilization between machine learning and statistics. A reconstruction of the key equations and a minimal code sketch of one BCD update follow below.
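The poster's equation images did not survive the transcript. The block below restates the objectives referred to in Sections 1–4 in standard textbook form; it is a best-effort reconstruction, and the notation (φ, Z(θ), A_c, the superscripts) is ours rather than necessarily the poster's.

```latex
% Section 1: exponential family and log-likelihood gradient
\[
p(x \mid \theta) = \frac{1}{Z(\theta)} \exp\!\big(\theta^\top \phi(x)\big), \qquad
\nabla_\theta \ell(\theta) = \mathbb{E}_{\text{data}}\!\left[\phi(x)\right] - \mathbb{E}_{p(x \mid \theta)}\!\left[\phi(x)\right]
\]
% Section 2: pseudolikelihood and conditional composite likelihood
\[
\ell_{\text{PL}}(\theta) = \sum_{i=1}^{N} \sum_{j} \log p\big(x_j^{(i)} \mid x_{-j}^{(i)}; \theta\big), \qquad
\ell_{\text{CL}}(\theta) = \sum_{i=1}^{N} \sum_{c=1}^{C} \log p\big(x_{A_c}^{(i)} \mid x_{-A_c}^{(i)}; \theta\big)
\]
% Section 3: CD-n approximation of the intractable model expectation
\[
\mathbb{E}_{p(x \mid \theta)}\!\left[\phi(x)\right] \approx \frac{1}{N} \sum_{i=1}^{N} \phi\big(\tilde{x}^{(i)}\big),
\qquad \tilde{x}^{(i)} = \text{state after } n \text{ Gibbs steps started at } x^{(i)}
\]
```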
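To make the RSBG procedure in Section 4 concrete, here is a minimal, self-contained Python sketch of a BCD-style update on a toy binary MRF. It is an illustration under stated assumptions, not the authors' implementation: the model, block partition, learning rate, and helper names (suff_stats, blocked_gibbs_step, bcd_update) are all hypothetical, and the blocked Gibbs step simply enumerates each small block, as the cost discussion in Sections 2 and 5 assumes.

```python
import itertools
import numpy as np

# Illustrative setup (not the authors' code): a small fully connected binary MRF
# with one parameter per edge, p(x | theta) proportional to exp(sum_e theta_e * x_i * x_j).
rng = np.random.default_rng(0)
n_vars = 6
edges = [(i, j) for i in range(n_vars) for j in range(i + 1, n_vars)]
blocks = [[0, 1, 2], [3, 4, 5]]  # assumed partition of the variables into blocks

def suff_stats(x):
    """Sufficient statistics phi(x): one product x_i * x_j per edge."""
    return np.array([x[i] * x[j] for (i, j) in edges], dtype=float)

def blocked_gibbs_step(x, block, theta):
    """Resample the variables in `block` jointly, conditioned on the rest, by
    enumerating all 2^|block| configurations (feasible only for small blocks)."""
    x = x.copy()
    configs = list(itertools.product([0, 1], repeat=len(block)))
    log_w = []
    for cfg in configs:
        x[block] = cfg
        log_w.append(theta @ suff_stats(x))
    log_w = np.array(log_w)
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    x[block] = configs[rng.choice(len(configs), p=probs)]
    return x

def bcd_update(theta, data, lr=0.05):
    """One BCD-style update: positive statistics from the data, negative statistics
    from data points whose randomly chosen block was resampled by one blocked Gibbs step."""
    pos = np.mean([suff_stats(x) for x in data], axis=0)
    neg = np.mean(
        [suff_stats(blocked_gibbs_step(x, blocks[rng.integers(len(blocks))], theta))
         for x in data],
        axis=0)
    return theta + lr * (pos - neg)

# Toy usage with random binary "observations".
data = [rng.integers(0, 2, size=n_vars) for _ in range(50)]
theta = np.zeros(len(edges))
for _ in range(100):
    theta = bcd_update(theta, data)
print("learned edge parameters:", np.round(theta, 2))
```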
5. Tree Structured Blocks
The cost of BCD (and CL) scales exponentially with block size, so large block sizes (>15) are too computationally expensive in practice. We can instead use tree structured blocks: forward-backward sampling can then be used to obtain a blocked sample, with time complexity linear in the block size (a minimal sketch of such a sampler follows at the end of this transcript). [Figure: example of tree structured blocks on a 2D lattice.]

6. Experimental Analysis
Experiments were run on three model classes:
- Visible Boltzmann machine with higher-order potentials.
- Conditional random field (CRF).
- Exponential random graph model (ERGM), whose sufficient statistics are network statistics, e.g., edge, 2-star, and triangle counts; we ran BCD on the Lazega social network data.
[Figures: each dot is a model with random parameters; performance is plotted as a function of the coupling strength.]

7. Conclusions
Blocked contrastive divergence (which combines CL and CD) is computationally efficient and accurate, especially when there are strong dependencies between blocks of variables. Composite likelihoods allow one to trade off computation for accuracy. Tree structured blocks allow for enhanced efficiency. Come to ICML 2010 to see our paper on CD + particle filtering!
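As referenced in Section 5, here is a minimal Python sketch of exact forward-backward sampling for a chain-structured block (a chain being the simplest tree). It is an illustration under assumptions, not the authors' code: the potentials, array shapes, and function name are hypothetical, and conditioning on variables outside the block would be absorbed into the unary potentials.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain_block(unary, pair, rng):
    """Draw one exact joint sample of a chain of K binary variables.

    unary: (K, 2) array of unnormalized potentials psi_t(x_t); conditioning on
           variables outside the block would be folded into these terms.
    pair:  (K-1, 2, 2) array of pairwise potentials psi_t(x_t, x_{t+1}).
    Cost is linear in K (backward message pass + forward sampling pass).
    """
    K = unary.shape[0]
    # Backward pass: beta[t, a] sums over all completions of x_{t+1:K} given x_t = a.
    beta = np.ones((K, 2))
    for t in range(K - 2, -1, -1):
        beta[t] = pair[t] @ (unary[t + 1] * beta[t + 1])
    # Forward sampling pass.
    x = np.zeros(K, dtype=int)
    p = unary[0] * beta[0]
    x[0] = rng.choice(2, p=p / p.sum())
    for t in range(1, K):
        p = pair[t - 1][x[t - 1]] * unary[t] * beta[t]
        x[t] = rng.choice(2, p=p / p.sum())
    return x

# Toy usage: a 10-variable chain block with attractive pairwise couplings.
K = 10
unary = np.ones((K, 2))
pair = np.tile(np.array([[2.0, 1.0], [1.0, 2.0]]), (K - 1, 1, 1))
print(sample_chain_block(unary, pair, rng))
```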