Fast search for Dirichlet process mixture models

Presentation transcript:

Fast search for Dirichlet process mixture models
Hal Daumé III, School of Computing, University of Utah
me@hal3.name

Dirichlet Process Mixture Models
Non-parametric Bayesian density estimation.
Frequently used to solve the clustering problem without having to choose "K" in advance.
Applications: vision, data mining, computational biology.
Personal observation: samplers are slow on huge data sets (10k+ elements) and very sensitive to initialization.

Chinese Restaurant Process
Customers enter a restaurant sequentially.
The Mth customer chooses a table as follows:
sit at a table that already has N customers with probability N/(α+M−1);
sit at an unoccupied table with probability α/(α+M−1).
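To make the seating rule concrete, here is a minimal simulation sketch; the function and parameter names (sample_crp, num_customers, alpha) are illustrative, not from the talk.

import random

def sample_crp(num_customers, alpha):
    table_counts = []        # N_k: customers currently at each occupied table
    assignments = []         # table index chosen by each customer
    for m in range(num_customers):
        # Mth customer (M = m + 1): existing table k gets weight N_k, a new
        # table gets weight alpha; the denominator alpha + M - 1 cancels.
        weights = table_counts + [alpha]
        k = random.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_counts):
            table_counts.append(0)   # open a new table
        table_counts[k] += 1
        assignments.append(k)
    return assignments

print(sample_crp(10, alpha=1.0))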

Dirichlet Process Mixture Models
Data point = customer; cluster = table; each table k gets a parameter θ_k.
Data points are generated according to a likelihood F:
p(X ∣ c) = ∫ dθ_{1:K} ∏_k G_0(θ_k) ∏_n F(x_n ∣ θ_{c_n})
where c_n is the table of the nth customer.
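To connect the restaurant metaphor to data, here is a hedged generative sketch that reuses sample_crp from the previous sketch; the choices G_0 = N(0, 1) and F = N(θ, 0.1²) are an assumed toy setup, not the models used in the experiments.

import random

def sample_dpmm(num_points, alpha):
    c = sample_crp(num_points, alpha)                       # c_n = table of customer n
    thetas = {k: random.gauss(0.0, 1.0) for k in set(c)}    # theta_k ~ G_0 = N(0, 1)
    xs = [random.gauss(thetas[k], 0.1) for k in c]          # x_n ~ F = N(theta_{c_n}, 0.1^2)
    return xs, c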

Inference Summary
Run an MCMC sampler for many iterations, using different initializations.
From the set of samples, choose the one with the highest posterior probability.
If all we want is the highest-probability assignment, why not try to find it directly?
(If you really want to be Bayesian, use this assignment to initialize sampling.)

Ordered Search
Input: data, beam size, scoring function. Output: clustering.
initialize Q, a max-queue of partial clusterings
while Q is not empty:
    remove a partial clustering c from Q
    if c covers all the data, return it
    try extending c by a single data point
    put all K+1 options into Q with their scores
    if |Q| > beam size, drop the lowest-scoring elements
Optimal if the beam size is infinite and the scoring function overestimates the true best probability.
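Below is a rough Python sketch of this beam search under stated assumptions: a clustering is a tuple of clusters (each a tuple of data indices), data points are placed in a fixed order, and score(clustering, data) is a user-supplied log-space scoring function such as the ones on the following slides; names like ordered_search and beam_size are illustrative.

import heapq

def ordered_search(data, beam_size, score):
    # Queue entries are (-score, clustering); negation turns heapq into a max-queue.
    queue = [(-score((), data), ())]
    while queue:
        neg_s, clustering = heapq.heappop(queue)
        covered = sum(len(c) for c in clustering)
        if covered == len(data):
            return clustering                    # first complete clustering popped
        n = covered                              # index of the next data point to place
        # Extend by placing data[n] into each of the K existing clusters ...
        candidates = [
            clustering[:k] + (clustering[k] + (n,),) + clustering[k + 1:]
            for k in range(len(clustering))
        ]
        # ... or into a new (K+1)-th cluster.
        candidates.append(clustering + ((n,),))
        for c in candidates:
            heapq.heappush(queue, (-score(c, data), c))
        # Keep only the beam_size highest-scoring partial clusterings.
        if len(queue) > beam_size:
            queue = heapq.nsmallest(beam_size, queue)
            heapq.heapify(queue)
    return None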

Ordered Search in Pictures
[Figure: the search tree over partial clusterings of x1 ... x5, each extension placing the next data point into an existing cluster or a new one.]

Trivial Scoring Function
Only account for already-clustered data; p_max(c) can be computed exactly.
g_Triv(c, x) = p_max(c) ∏_{k∈c} H(x_{c=k}),  where  H(X) = ∫ dθ G_0(θ) ∏_{x∈X} F(x ∣ θ).
Example: for x1, x2, x3 with x1 and x2 in one cluster and x3 in another, the score is p_max(c) H(x1, x2) H(x3).
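As an illustration, here is a minimal log-space sketch of H and g_Triv for a 1-D Gaussian likelihood F with known variance and a conjugate Gaussian prior G_0 on the cluster mean; the parameters mu0, tau0_sq, sigma_sq are assumed toy values, and log_p_max is taken as an input because the slides do not spell out how p_max(c) is computed.

import math

def log_H(xs, mu0=0.0, tau0_sq=1.0, sigma_sq=1.0):
    # Chain rule: H(X) = prod_i p(x_i | x_1..x_{i-1}); each factor is a Gaussian
    # predictive under the current posterior mean/variance of the cluster mean.
    mu, tau_sq = mu0, tau0_sq
    log_h = 0.0
    for x in xs:
        pred_var = tau_sq + sigma_sq
        log_h += -0.5 * (math.log(2 * math.pi * pred_var) + (x - mu) ** 2 / pred_var)
        # Conjugate posterior update of the cluster mean.
        prec = 1.0 / tau_sq + 1.0 / sigma_sq
        mu = (mu / tau_sq + x / sigma_sq) / prec
        tau_sq = 1.0 / prec
    return log_h

def log_g_triv(clusters, log_p_max):
    # g_Triv(c, x) = p_max(c) * prod_k H(x_{c=k}), here in log space;
    # "clusters" is a list of lists of data values.
    return log_p_max + sum(log_H(xs) for xs in clusters)

For instance, log_g_triv([[1.0, 1.2], [5.3]], log_p_max=0.0) scores the two-cluster example above, with a placeholder value for log p_max(c).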

Tighter Scoring Function
Use the trivial score for already-clustered data.
Approximate the optimal score for future data: for each data point, put it in an existing or a new cluster; then, conditioned on that choice, cluster the remaining points, assuming each remaining point is optimally placed.

An Inadmissible Scoring Function
Just use marginals for the unclustered points:
g_Inad(c, x) = g_Triv(c, x) ∏_{n=∣c∣+1}^{N} H(x_n)
Inadmissible because H is not monotonic in conditioning (even for exponential families).
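A corresponding sketch of the inadmissible score, reusing log_H and log_g_triv from the previous sketch: each unclustered point simply contributes its individual marginal H(x_n).

def log_g_inad(clusters, remaining_points, log_p_max):
    # Assumes log_H and log_g_triv from the sketch above are in scope.
    return log_g_triv(clusters, log_p_max) + sum(log_H([x]) for x in remaining_points)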

Artificial Data: Likelihood Ratios
Gaussian data drawn from a DP; varying data set sizes; results averaged over ten random data sets; the setup is made as favorable to sampling as possible.
Every method that should be optimal is optimal. Sampling is surprisingly far from optimal. The inadmissible heuristic turns out to find the optimum.

Artificial Data: Speed (Seconds)
The admissible heuristics are slow; sampling is even slower; the inadmissible heuristic is very fast.

Artificial Data: Number of Enqueued Points
The admissible scoring functions enqueue a lot of hypotheses; the inadmissible one enqueues almost nothing that is not required.

Real Data: MNIST
Handwritten digits 0-9, 28x28 pixels, preprocessed with PCA down to 50 dimensions.
Run on 3,000, 12,000, and 60,000 images; search uses the inadmissible heuristic with a (large) beam of 100.

          time (3k / 12k / 60k)       negative log likelihood (3k / 12k / 60k)
Search    11s     105s    15m         2.04e5    8.02e5    3.96e6
Gibbs     40s/i   18m/i   7h/i        2.09e5    8.34e5    4.2e6
S-M       85s/i   35m/i   12h/i       2.05e5    8.15e5    4.1e6

(/i = per iteration; lower is better)

Real Data: NIPS Papers
1740 documents, vocabulary of 13k words; drop the top 10 words and retain the remaining top 1k.
Conjugate Dirichlet/Multinomial DP; examples ordered by increasing marginal likelihood.

Search    2.441e6   (32s)
Gibbs     3.2e6     (1h)
S-M       3.0e6     (1.5h)

Search with the reverse ordering: 2.474e6; with a random ordering: 2.449e6.

Discussion
Sampling often fails to find the MAP; search can do much better.
Limitations: restricted to conjugate distributions; cannot re-estimate hyperparameters.
Can cluster 270 images per second in Matlab; further acceleration is possible with clever data structures.
Thanks! Questions?
Code at http://hal3.name/DPsearch

Inference I – Gibbs Sampling
Collapsed Gibbs sampler:
Initialize the clusters.
For a number of iterations, assign each data point x_n to an existing cluster c_k with probability proportional to
(N_k / (α+N−1)) H(x_n ∣ x_{c=k})
or to a new cluster with probability proportional to
(α / (α+N−1)) ∫ dθ G_0(θ) F(x_n ∣ θ).
Here H(x_n ∣ x_{c=k}) is the posterior predictive probability of x_n conditioned on the points already in the proposed cluster:
H(x_n ∣ x_{c=k}) = ∫ dθ G_0(θ) F(x_n ∣ θ) ∏_{m∈c_k} F(x_m ∣ θ)  /  ∫ dθ G_0(θ) ∏_{m∈c_k} F(x_m ∣ θ).
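For concreteness, here is a sketch of one collapsed Gibbs sweep for the 1-D conjugate Gaussian toy model of the earlier log_H sketch (which it reuses); the data structures (a flat assignments list) and parameter names are illustrative, not the talk's implementation.

import math, random

def gibbs_sweep(data, assignments, alpha):
    for n, x in enumerate(data):
        assignments[n] = None                      # remove x_n from its cluster
        clusters = {}
        for m, k in enumerate(assignments):
            if k is not None:
                clusters.setdefault(k, []).append(data[m])
        labels = sorted(clusters)
        # Existing cluster k: weight N_k * H(x_n | x_{c=k}); new cluster: alpha * H(x_n).
        # The shared normalizer (alpha + N - 1) cancels, so it is dropped.
        log_w = [math.log(len(clusters[k])) + log_H(clusters[k] + [x]) - log_H(clusters[k])
                 for k in labels]
        log_w.append(math.log(alpha) + log_H([x]))
        top = max(log_w)
        weights = [math.exp(v - top) for v in log_w]
        pick = random.choices(range(len(weights)), weights=weights)[0]
        if pick < len(labels):
            assignments[n] = labels[pick]
        else:
            assignments[n] = max(labels, default=-1) + 1   # open a new cluster
    return assignments

A run would repeat gibbs_sweep many times from some initialization, e.g. gibbs_sweep(data, [0] * len(data), alpha=1.0).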

Inference II – Metropolis-Hastings
Collapsed split-merge sampler:
Initialize the clusters.
For a number of iterations: choose two data points x_n and x_m at random; if c_n = c_m, split this cluster with a Gibbs pass, otherwise merge the two clusters; then perform a collapsed Gibbs pass.

Versus Variational