DEPARTMENT OF ENGINEERING SCIENCE
Information, Control, and Vision Engineering

Bayesian Nonparametrics via Probabilistic Programming
Frank Wood
MLSS 2014, May 2014, Reykjavik

Excellent tutorial dedicated to Bayesian nonparametrics:

Bayesian Nonparametrics

- What is a Bayesian nonparametric model?
  - A Bayesian model defined on an infinite-dimensional parameter space
- What is a nonparametric model?
  - A model with an infinite-dimensional parameter space
  - A parametric model whose number of parameters grows with the data
- Why are probabilistic programming languages natural for representing Bayesian nonparametric models?
  - Lazy constructions often exist for infinite-dimensional objects
  - Only the parts that are needed are generated

Nonparametric Models Are Parametric

- Nonparametric means "cannot be described using a fixed set of parameters"
- Nonparametric models have infinite parameter cardinality
- Regularization is still present
  - Structure
  - Prior
- Programs with memoized thunks that wrap stochastic procedures are nonparametric (a minimal sketch follows)
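As an illustration of that last point (a hypothetical sketch, not from the original slides), a memoized thunk wrapping a stochastic procedure in a Church-style language lazily defines an unbounded collection of random variables; only the draws actually requested are ever sampled:

; mem caches a stochastic procedure by its arguments, so each index k names
; a lazily instantiated random variable: the first call with a given k
; samples a value, and every later call with that k returns the cached value
[assume theta (mem (lambda (k) (normal 0 1)))]
[predict (theta 1)]  ; sampled on first use, then fixed
[predict (theta 1)]  ; identical to the previous value (cache hit)
[predict (theta 42)] ; a different, independently sampled draw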

Dirichlet Process

- A Bayesian nonparametric model building block
- Appears in the infinite limit of finite mixture models
- Formally defined as a distribution over measures
- Today
  - One probabilistic programming representation
  - Stick breaking
  - Generalization of mem
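For reference (the defining property is standard but not reproduced in the transcript), a draw $G \sim \mathrm{DP}(\alpha, H)$ with concentration $\alpha > 0$ and base measure $H$ is a random probability measure satisfying

$$
\bigl(G(A_1), \ldots, G(A_K)\bigr) \sim \mathrm{Dirichlet}\bigl(\alpha H(A_1), \ldots, \alpha H(A_K)\bigr)
$$

for every finite measurable partition $A_1, \ldots, A_K$ of the sample space.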

Review: Finite Mixture Model

The Dirichlet process mixture model arises as the infinite-class-cardinality limit of the finite mixture model.

Uses:
- Clustering
- Density estimation
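The slide's equations are not preserved in the transcript; as a hedged sketch, the finite mixture model being reviewed is conventionally written as

$$
\pi \sim \mathrm{Dirichlet}\!\left(\tfrac{\alpha}{K}, \ldots, \tfrac{\alpha}{K}\right), \qquad
\theta_k \sim H, \quad k = 1, \ldots, K,
$$
$$
z_i \mid \pi \sim \mathrm{Discrete}(\pi), \qquad
y_i \mid z_i \sim F(\theta_{z_i}), \quad i = 1, \ldots, N,
$$

and taking $K \to \infty$ under this symmetric $\alpha/K$ Dirichlet prior yields the Dirichlet process mixture model.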

Review: Dirichlet Process Mixture
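Again as a sketch (the slide's graphical model and equations are lost in the transcript), a standard statement of the Dirichlet process mixture, consistent with the limit above, is

$$
G \sim \mathrm{DP}(\alpha, H), \qquad
\theta_i \mid G \sim G, \qquad
y_i \mid \theta_i \sim F(\theta_i), \quad i = 1, \ldots, N.
$$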

Review: Stick-Breaking Construction [Sethuraman 1994]
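The construction itself (shown on the slide but not retained in the transcript) can be written as

$$
V_k \sim \mathrm{Beta}(1, \alpha), \qquad
\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad
\theta_k \sim H, \qquad
G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k},
$$

and the resulting $G$ is distributed as $\mathrm{DP}(\alpha, H)$.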

Stick-Breaking Is a Lazy Construction

; sethuraman-stick-picking-procedure returns a procedure that picks
; a stick each time it's called from the set of sticks lazily constructed
; via the closed-over one-parameter stick-breaking rule
[assume make-sethuraman-stick-picking-procedure
  (lambda (concentration)
    (begin
      (define V (mem (lambda (x) (beta 1.0 concentration))))
      (lambda () (sample-stick-index V 1))))]

; sample-stick-index is a procedure that samples an index from
; a potentially infinite-dimensional discrete distribution
; lazily constructed by a stick-breaking rule
[assume sample-stick-index
  (lambda (breaking-rule index)
    (if (flip (breaking-rule index))
        index
        (sample-stick-index breaking-rule (+ index 1))))]
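A minimal usage sketch (hypothetical, not part of the original slide): build one stick-picking procedure and draw a few indices from the lazily constructed, potentially infinite discrete distribution it represents.

; hypothetical usage of the procedures above
[assume pick-stick (make-sethuraman-stick-picking-procedure 1.0)]
[predict (pick-stick)] ; a positive integer index; small indices are most probable
[predict (pick-stick)] ; repeated calls reuse the same lazily grown set of sticks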

DP Is a Generalization of mem

; DPmem is a procedure that takes two arguments -- the concentration
; parameter of a Dirichlet process and a base sampling procedure.
; DPmem returns a procedure.
[assume DPmem
  (lambda (concentration base)
    (begin
      (define get-value-from-cache-or-sample
        (mem (lambda (args stick-index) (apply base args))))
      (define get-stick-picking-procedure-from-cache
        (mem (lambda (args)
               (make-sethuraman-stick-picking-procedure concentration))))
      (lambda varargs
        ; when the returned procedure is called, the first thing it does is get
        ; the cached stick-picking procedure for the passed-in arguments
        ; and _call_ it to get an index
        (begin
          (define index ((get-stick-picking-procedure-from-cache varargs)))
          ; if, for the given set of arguments and the just-sampled index,
          ; a return value has already been computed, get it from the cache
          ; and return it; otherwise sample a new value
          (get-value-from-cache-or-sample varargs index)))))]

Church [Goodman, Mansinghka, et al., 2008/2012]

Consequence

- Using DPmem, coding DP mixtures and other DP-related Bayesian nonparametric models is straightforward

; base distribution
[assume H (lambda ()
  (begin
    (define v (/ 1.0 (gamma 1 10)))
    (list (normal 0 (sqrt (* 10 v))) (sqrt v))))]

; lazy DP representation
[assume gaussian-mixture-model-parameters (DPmem 1.72 H)]

; data
[observe-csv "…" (apply normal (gaussian-mixture-model-parameters)) $2]

; density estimate
[predict (apply normal (gaussian-mixture-model-parameters))]

Hierarchical Dirichlet Process

[assume H (lambda () …)]
[assume G0 (DPmem alpha H)]
[assume G1 (DPmem alpha G0)]
[assume G2 (DPmem alpha G0)]

[observe (apply F (G1)) x11]
[observe (apply F (G1)) x12]
…
[observe (apply F (G2)) x21]
…

[predict (apply F (G1))]
[predict (apply F (G2))]

[Teh et al. 2006]
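In the more familiar mathematical notation (a sketch for reference; the program above shares a single concentration alpha across levels, whereas Teh et al. allow a separate one per level), the hierarchical Dirichlet process is

$$
G_0 \sim \mathrm{DP}(\alpha, H), \qquad
G_j \mid G_0 \sim \mathrm{DP}(\alpha, G_0), \qquad
\theta_{ji} \mid G_j \sim G_j, \qquad
x_{ji} \mid \theta_{ji} \sim F(\theta_{ji}).
$$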

Stick-Breaking Process Generalizations

- Two-parameter stick breaking
  - Corresponds to the Pitman-Yor process
  - Induces a power-law distribution on the number of classes per number of observations

[Ishwaran and James, 2001] Gibbs Sampling Methods for Stick-Breaking Priors
[Pitman and Yor, 1997] The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator
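As a sketch of the generalization (standard, though the formula is not in the transcript), the two-parameter stick-breaking construction with discount $d \in [0, 1)$ and concentration $\alpha > -d$ draws

$$
V_k \sim \mathrm{Beta}(1 - d, \; \alpha + k d), \qquad
\pi_k = V_k \prod_{j=1}^{k-1} (1 - V_j), \qquad
G = \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k}, \quad \theta_k \sim H,
$$

which reduces to the Dirichlet process when $d = 0$ and gives the Pitman-Yor process otherwise.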

Open Universe vs. Bayesian Nonparametrics

In probabilistic programming systems we can write

[import 'core]
[assume K (poisson 10)]
[assume J (map (lambda (x) (/ x K)) (repeat K 1))]
[assume alpha 2]
[assume pi (dirichlet (map (lambda (x) (* x alpha)) J))]

What is the consequential difference?

Take Home

- Probabilistic programming languages are expressive
  - Represent Bayesian nonparametric models compactly
- Inference speed: compare
  - Writing the program in a slow probabilistic programming system and waiting for the answer
  - Deriving fast custom inference and then getting the answer quickly
- Flexibility
  - Non-trivial modifications to models are straightforward

Chinese Restaurant Process
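The transcript retains only the slide title; for reference, a hedged sketch of the Chinese restaurant process seating rule, the predictive distribution over partitions induced by the Dirichlet process constructions above, is

$$
P(z_{n+1} = k \mid z_1, \ldots, z_n) =
\begin{cases}
\dfrac{n_k}{n + \alpha}, & \text{existing table } k \text{ with } n_k \text{ customers,} \\[1.5ex]
\dfrac{\alpha}{n + \alpha}, & \text{new table.}
\end{cases}
$$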

DP Mixture Code

DP Mixture Inference