Fast search for Dirichlet process mixture models

Presentation transcript:

Fast search for Dirichlet process mixture models
Hal Daumé III, School of Computing, University of Utah
me@hal3.name

Dirichlet Process Mixture Models
Non-parametric Bayesian density estimation.
Frequently used to solve the clustering problem without having to choose "K" in advance.
Applications: vision, data mining, computational biology.
Personal observation: samplers are slow on huge data sets (10k+ elements) and very sensitive to initialization.

Chinese Restaurant Process
Customers enter a restaurant sequentially.
The Mth customer chooses a table as follows:
sit at a table that already has N customers with probability N/(α+M−1);
sit at an unoccupied table with probability α/(α+M−1).
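To make the seating rule concrete, here is a minimal simulation sketch; the function and parameter names (sample_crp, num_customers, alpha) are illustrative, not from the talk.

import random

def sample_crp(num_customers, alpha):
    table_counts = []        # N_k: customers currently at each occupied table
    assignments = []         # table index chosen by each customer
    for m in range(num_customers):
        # Mth customer (M = m + 1): existing table k gets weight N_k, a new
        # table gets weight alpha; the denominator alpha + M - 1 cancels.
        weights = table_counts + [alpha]
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_counts):
            table_counts.append(0)   # open a new table
        table_counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(10, alpha=1.0))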

Dirichlet Process Mixture Models
Data point = customer; cluster = table; each table k gets a parameter θ_k.
Data points are generated according to a likelihood F:
p(X ∣ c) = ∫ dθ_{1:K} ∏_k G_0(θ_k) ∏_n F(x_n ∣ θ_{c_n})
where c_n is the table of the nth customer.
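To connect the restaurant metaphor to data, here is a hedged generative sketch that reuses sample_crp from the previous sketch; the choices G_0 = N(0, 1) and F = N(θ, 0.1²) are an assumed toy setup, not the models used in the experiments.

import random

def sample_dpmm(num_points, alpha):
    c = sample_crp(num_points, alpha)                       # c_n = table of customer n
    thetas = {k: random.gauss(0.0, 1.0) for k in set(c)}    # theta_k ~ G_0 = N(0, 1)
    xs = [random.gauss(thetas[k], 0.1) for k in c]          # x_n ~ F = N(theta_{c_n}, 0.1^2)
    return xs, c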

Inference Summary
Run an MCMC sampler for many iterations, using different initializations.
From the set of samples, choose the one with the highest posterior probability.
If all we want is the highest-probability assignment, why not try to find it directly?
(If you really want to be Bayesian, use this assignment to initialize sampling.)

Ordered Search
Input: data, beam size, scoring function. Output: clustering.
initialize Q, a max-queue of partial clusterings
while Q is not empty:
    remove a partial clustering c from Q
    if c covers all the data, return it
    try extending c by a single data point
    put all K+1 options into Q with their scores
    if |Q| > beam size, drop the lowest-scoring elements
Optimal if the beam size is infinite and the scoring function overestimates the true best probability.
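Below is a rough Python sketch of this beam search under stated assumptions: a clustering is a tuple of clusters (each a tuple of data indices), data points are placed in a fixed order, and score(clustering, data) is a user-supplied log-space scoring function such as the ones on the following slides; names like ordered_search and beam_size are illustrative.

import heapq

def ordered_search(data, beam_size, score):
    # Queue entries are (-score, clustering); negation turns heapq into a max-queue.
    queue = [(-score((), data), ())]
    while queue:
        neg_s, clustering = heapq.heappop(queue)
        covered = sum(len(c) for c in clustering)
        if covered == len(data):
            return clustering                    # first complete clustering popped
        n = covered                              # index of the next data point to place
        # Extend by placing data[n] into each of the K existing clusters ...
        candidates = [
            clustering[:k] + (clustering[k] + (n,),) + clustering[k + 1:]
            for k in range(len(clustering))
        ]
        # ... or into a new (K+1)-th cluster.
        candidates.append(clustering + ((n,),))
        for c in candidates:
            heapq.heappush(queue, (-score(c, data), c))
        # Keep only the beam_size highest-scoring partial clusterings.
        if len(queue) > beam_size:
            queue = heapq.nsmallest(beam_size, queue)
            heapq.heapify(queue)
    return None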

Ordered Search in Pictures
[Figure: the search tree over partial clusterings of x1 ... x5, each extension placing the next data point into an existing cluster or a new one.]

Trivial Scoring Function
Only account for already-clustered data; p_max(c) can be computed exactly.
g_Triv(c, x) = p_max(c) ∏_{k∈c} H(x_{c=k}),  where  H(X) = ∫ dθ G_0(θ) ∏_{x∈X} F(x ∣ θ).
Example: for x1, x2, x3 with x1 and x2 in one cluster and x3 in another, the score is p_max(c) H(x1, x2) H(x3).
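As an illustration, here is a minimal log-space sketch of H and g_Triv for a 1-D Gaussian likelihood F with known variance and a conjugate Gaussian prior G_0 on the cluster mean; the parameters mu0, tau0_sq, sigma_sq are assumed toy values, and log_p_max is taken as an input because the slides do not spell out how p_max(c) is computed.

import math

def log_H(xs, mu0=0.0, tau0_sq=1.0, sigma_sq=1.0):
    # Chain rule: H(X) = prod_i p(x_i | x_1..x_{i-1}); each factor is a Gaussian
    # predictive under the current posterior mean/variance of the cluster mean.
    mu, tau_sq = mu0, tau0_sq
    log_h = 0.0
    for x in xs:
        pred_var = tau_sq + sigma_sq
        log_h += -0.5 * (math.log(2 * math.pi * pred_var) + (x - mu) ** 2 / pred_var)
        # Conjugate posterior update of the cluster mean.
        prec = 1.0 / tau_sq + 1.0 / sigma_sq
        mu = (mu / tau_sq + x / sigma_sq) / prec
        tau_sq = 1.0 / prec
    return log_h

def log_g_triv(clusters, log_p_max):
    # g_Triv(c, x) = p_max(c) * prod_k H(x_{c=k}), here in log space;
    # "clusters" is a list of lists of data values.
    return log_p_max + sum(log_H(xs) for xs in clusters)

For instance, log_g_triv([[1.0, 1.2], [5.3]], log_p_max=0.0) scores the two-cluster example above, with a placeholder value for log p_max(c).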

Tighter Scoring Function
Use the trivial score for already-clustered data.
Approximate the optimal score for future data: for each data point, put it in an existing or a new cluster; then, conditioned on that choice, cluster the remaining points, assuming each remaining point is optimally placed.

An Inadmissible Scoring Function
Just use marginals for the unclustered points:
g_Inad(c, x) = g_Triv(c, x) ∏_{n=∣c∣+1}^{N} H(x_n)
Inadmissible because H is not monotonic in conditioning (even for exponential families).
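A corresponding sketch of the inadmissible score, reusing log_H and log_g_triv from the previous sketch: each unclustered point simply contributes its individual marginal H(x_n).

def log_g_inad(clusters, remaining_points, log_p_max):
    # Assumes log_H and log_g_triv from the sketch above are in scope.
    return log_g_triv(clusters, log_p_max) + sum(log_H([x]) for x in remaining_points)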

Artificial Data: Likelihood Ratios
Gaussian data drawn from a DP; varying data set sizes; results averaged over ten random data sets; the setup is made as favorable to sampling as possible.
Every method that should be optimal is optimal. Sampling is surprisingly far from optimal. The inadmissible heuristic turns out to find the optimum.

Artificial Data: Speed (Seconds)
The admissible heuristics are slow; sampling is even slower; the inadmissible heuristic is very fast.

Artificial Data: Number of Enqueued Points
The admissible scoring functions enqueue a lot of hypotheses; the inadmissible one enqueues almost nothing that is not required.

Real Data: MNIST
Handwritten digits 0-9, 28x28 pixels, preprocessed with PCA down to 50 dimensions.
Run on 3,000, 12,000, and 60,000 images; search uses the inadmissible heuristic with a (large) beam of 100.

          time (3k / 12k / 60k)       negative log likelihood (3k / 12k / 60k)
Search    11s     105s    15m         2.04e5    8.02e5    3.96e6
Gibbs     40s/i   18m/i   7h/i        2.09e5    8.34e5    4.2e6
S-M       85s/i   35m/i   12h/i       2.05e5    8.15e5    4.1e6

(/i = per iteration; lower is better)

Real Data: NIPS Papers
1740 documents, vocabulary of 13k words; drop the top 10 words and retain the remaining top 1k.
Conjugate Dirichlet/Multinomial DP; examples ordered by increasing marginal likelihood.

Search    2.441e6   (32s)
Gibbs     3.2e6     (1h)
S-M       3.0e6     (1.5h)

Search with the reverse ordering: 2.474e6; with a random ordering: 2.449e6.

Discussion
Sampling often fails to find the MAP; search can do much better.
Limitations: restricted to conjugate distributions; cannot re-estimate hyperparameters.
Can cluster 270 images per second in Matlab; further acceleration is possible with clever data structures.
Thanks! Questions?
Code at http://hal3.name/DPsearch

Inference I – Gibbs Sampling
Collapsed Gibbs sampler:
Initialize the clusters.
For a number of iterations, assign each data point x_n to an existing cluster c_k with probability proportional to
(N_k / (α+N−1)) H(x_n ∣ x_{c=k})
or to a new cluster with probability proportional to
(α / (α+N−1)) ∫ dθ G_0(θ) F(x_n ∣ θ).
Here H(x_n ∣ x_{c=k}) is the posterior predictive probability of x_n conditioned on the points already in the proposed cluster:
H(x_n ∣ x_{c=k}) = ∫ dθ G_0(θ) F(x_n ∣ θ) ∏_{m∈c_k} F(x_m ∣ θ)  /  ∫ dθ G_0(θ) ∏_{m∈c_k} F(x_m ∣ θ).
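For concreteness, here is a sketch of one collapsed Gibbs sweep for the 1-D conjugate Gaussian toy model of the earlier log_H sketch (which it reuses); the data structures (a flat assignments list) and parameter names are illustrative, not the talk's implementation.

import math, random

def gibbs_sweep(data, assignments, alpha):
    for n, x in enumerate(data):
        assignments[n] = None                      # remove x_n from its cluster
        clusters = {}
        for m, k in enumerate(assignments):
            if k is not None:
                clusters.setdefault(k, []).append(data[m])
        labels = sorted(clusters)
        # Existing cluster k: weight N_k * H(x_n | x_{c=k}); new cluster: alpha * H(x_n).
        # The shared normalizer (alpha + N - 1) cancels, so it is dropped.
        log_w = [math.log(len(clusters[k])) + log_H(clusters[k] + [x]) - log_H(clusters[k])
                 for k in labels]
        log_w.append(math.log(alpha) + log_H([x]))
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]
        pick = random.choices(range(len(weights)), weights=weights)[0]
        if pick < len(labels):
            assignments[n] = labels[pick]
        else:
            assignments[n] = max(labels, default=-1) + 1   # open a new cluster
    return assignments

A run would repeat gibbs_sweep many times from some initialization, e.g. gibbs_sweep(data, [0] * len(data), alpha=1.0).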

Inference II – Metropolis-Hastings
Collapsed split-merge sampler:
Initialize the clusters.
For a number of iterations: choose two data points x_n and x_m at random; if c_n = c_m, split this cluster with a Gibbs pass, otherwise merge the two clusters; then perform a collapsed Gibbs pass.

Versus Variational