Bayesian Conditional Random Fields using Power EP
Tom Minka
Joint work with Yuan Qi and Martin Szummer

Why should you care?
New way to train Conditional Random Fields
– Significant improvement on small training sets
Demonstration of Bayesian methods
New computational scheme for Bayesian inference: Power EP
– Benefits of Bayes at little computational cost

The task
Want to label structured data:
– Lines of text
– Hyperlinked documents
– Blocks of an image
– Fragments of an ink diagram
[Slide figures: a FAQ text file with each line labeled Q or A, and an example ink diagram]

Independent classification
Classify each site independently, based on its features and those of its neighbors
Problems:
– Resulting labels may not make sense jointly
– Requires lots of features (self + neighbors)
– Performs redundant work in examining self + neighbors
Want classifiers which are local but linked

Conditional Random Field (CRF)
A linked set of classifiers
Object x, possible labels t:
  p(t | x, w) ∝ ∏_{(i,j) ∈ E} g(t_i, t_j, x; w)
g measures the three-way compatibility of the labels with the features and each other
w is the parameter vector, E is the linkage structure
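To make the definition concrete, here is a minimal sketch (added for illustration, not code from the talk) that evaluates such a distribution by brute force on a tiny fully-linked object with 3 sites and binary labels. The choice of compatibility function g (an Ising-style exponential of a bilinear score), the number of features, and all variable names are illustrative assumptions.

```python
# Minimal sketch: brute-force evaluation of a tiny CRF p(t | x, w).
# The compatibility function g and the feature setup are illustrative
# assumptions, not the exact model used in the talk.
import itertools
import numpy as np

rng = np.random.default_rng(0)

sites = [0, 1, 2]
edges = [(0, 1), (1, 2), (0, 2)]            # fully linked, as in the synthetic experiment
labels = [-1, +1]                           # binary labels
x = {e: rng.normal(size=4) for e in edges}  # one small feature vector per edge
w = rng.normal(size=4)                      # parameter vector

def g(ti, tj, xe, w):
    # Three-way compatibility of the two labels with the edge features.
    return np.exp(ti * tj * (xe @ w))

def unnormalized(t):
    return np.prod([g(t[i], t[j], x[(i, j)], w) for (i, j) in edges])

# Partition function Z(x, w): sum over all joint labelings (2^3 = 8 here).
Z = sum(unnormalized(t) for t in itertools.product(labels, repeat=len(sites)))

t = (+1, -1, +1)
print("p(t | x, w) =", unnormalized(t) / Z)
```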

Training CRFs
Given labeled data (x, t) we get a posterior over the parameters:
  p(w | t, x) ∝ p(w) p(t | x, w)
Old way: assume w = most probable value
– Easily overfits
New (Bayesian) way: weight each possible w by its posterior, average the results over all w’s during testing
– No overfitting (no fitting at all)
Can this be done efficiently? Yes! (but approximately)
– Use Power EP
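As a toy illustration of why averaging over parameters helps on small training sets (an analogy added here, not the CRF model itself): estimating a coin's bias from three observed heads. The point estimate commits to an extreme value, while the posterior-weighted prediction stays conservative.

```python
# Toy illustration (analogy only, not the CRF case): point estimate vs
# posterior averaging. Coin with unknown bias theta, uniform Beta(1,1) prior,
# observed data = 3 heads, 0 tails.
heads, tails = 3, 0
a, b = 1 + heads, 1 + tails              # Beta posterior parameters

# "Old way": commit to the single most probable theta (here MAP = ML).
theta_map = (a - 1) / (a + b - 2)        # = 1.0 -> predicts heads with certainty

# "New (Bayesian) way": weight every theta by its posterior and average the
# predictions: P(next = heads) = E[theta] = a / (a + b).
p_heads_bayes = a / (a + b)              # = 0.8, a more cautious prediction

print("MAP prediction:     ", theta_map)
print("Bayesian prediction:", p_heads_bayes)
```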

Bayesian procedure
Training: approximate the posterior of w
Testing: approximate the posterior of the labels t
Both approximations are computed with Power EP

Expectation Propagation (EP)
A method to approximate a product of factors
  p(w) ∝ ∏_a f_a(w)
by a product of simpler approximate factors
  q(w) ∝ ∏_a f̃_a(w)   (e.g. each f̃_a a Gaussian)

EP iteration
Start from an initial guess for q(w), then repeatedly refine each approximate factor:
– Delete: remove that factor’s current approximation from q
– Include: put the exact factor back in its place
– Approximate: project the result back into the simple family and update q
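A minimal sketch of this Delete / Include / Approximate cycle (my own illustration, not code from the talk): 1-D EP with a Gaussian approximation, where the Approximate step is done by numerical moment matching on a grid. The clutter-style factors, the prior, and the data values are arbitrary choices to make the exact factors non-Gaussian.

```python
# Minimal 1-D EP sketch: approximate p(w) ∝ prior * prod_i f_i(w) by a Gaussian q(w).
# Each approximate factor is stored as Gaussian natural parameters (tau, nu).
# Moment matching is done numerically on a grid; all modelling choices here
# (clutter-style factors, prior, data values) are illustrative assumptions.
import numpy as np

grid = np.linspace(-15, 15, 6001)
dw = grid[1] - grid[0]

def gauss(w, mean, var):
    return np.exp(-0.5 * (w - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

data = np.array([1.2, 0.8, 2.1, -4.0])             # includes one "clutter" outlier
def factor(i, w):
    # Non-Gaussian likelihood: mixture of signal and broad clutter.
    return 0.5 * gauss(w, data[i], 1.0) + 0.5 * gauss(w, 0.0, 10.0)

# Prior N(0, 10) handled exactly; start approximate factors at "no information".
prior_tau, prior_nu = 1.0 / 10.0, 0.0
tau = np.zeros(len(data))   # precisions of the approximate factors
nu = np.zeros(len(data))    # precision * mean

for sweep in range(20):
    for i in range(len(data)):
        q_tau, q_nu = prior_tau + tau.sum(), prior_nu + nu.sum()
        # Delete: remove factor i's current approximation from q (the cavity).
        cav_tau, cav_nu = q_tau - tau[i], q_nu - nu[i]
        if cav_tau <= 0:
            continue                                # skip an invalid cavity
        # Include: multiply the cavity by the exact factor (the "tilted" distribution).
        tilted = factor(i, grid) * gauss(grid, cav_nu / cav_tau, 1.0 / cav_tau)
        tilted /= tilted.sum() * dw
        # Approximate: moment-match a Gaussian to the tilted distribution.
        mean = (grid * tilted).sum() * dw
        var = ((grid - mean) ** 2 * tilted).sum() * dw
        new_q_tau, new_q_nu = 1.0 / var, mean / var
        # The new approximate factor is whatever turns the cavity into the new q.
        tau[i], nu[i] = new_q_tau - cav_tau, new_q_nu - cav_nu

q_tau, q_nu = prior_tau + tau.sum(), prior_nu + nu.sum()
print("EP posterior: mean %.3f, variance %.3f" % (q_nu / q_tau, 1.0 / q_tau))
```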

Approximating a factor
Requires computing the average of the exact factor, E_q[ f_a(w) ], under the current approximation q(w)
Easy cases: factors whose expectations under a Gaussian are tractable (e.g. Gaussian or probit-like terms)
Hard cases: factors involving reciprocals, such as the 1/Z(x, w) term
Variational methods require E_q[ log f_a(w) ] instead
– Minimizes a different error measure
– “Exclusive” KL(q||p) vs. “Inclusive” KL(p||q)
– Doesn’t simplify the above cases
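To illustrate the “exclusive” vs. “inclusive” distinction, here is a small numeric example of my own (not from the talk): fit a single Gaussian to a bimodal distribution either by moment matching, which is what minimizing the inclusive KL(p||q) favors, or by locking onto one mode, which is the typical behavior of minimizing the exclusive KL(q||p). Both candidate fits are hand-picked for illustration.

```python
# Numeric illustration of exclusive KL(q||p) vs inclusive KL(p||q).
# p is a bimodal mixture; q is a single Gaussian. The two candidate q's
# (moment-matched vs single-mode) are illustrative choices.
import numpy as np

w = np.linspace(-12, 12, 6001)
dw = w[1] - w[0]

def gauss(w, mean, var):
    return np.exp(-0.5 * (w - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def kl(a, b):
    # KL(a || b) on the grid, ignoring points where a is ~0.
    mask = a > 1e-12
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dw

p = 0.5 * gauss(w, -3.0, 1.0) + 0.5 * gauss(w, 3.0, 1.0)

# Inclusive-style fit: match the overall mean and variance of p (mass-covering).
mean_p = np.sum(w * p) * dw
var_p = np.sum((w - mean_p) ** 2 * p) * dw
q_inclusive = gauss(w, mean_p, var_p)

# Exclusive-style fit: sit on one mode and ignore the other (mode-seeking).
q_exclusive = gauss(w, 3.0, 1.0)

for name, q in [("moment-matched q", q_inclusive), ("single-mode q", q_exclusive)]:
    print("%-18s  KL(p||q) = %6.3f   KL(q||p) = %6.3f"
          % (name, kl(p, q), kl(q, p)))
```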

Power EP
Instead of minimizing KL, minimize an alpha-divergence D_α(p||q), which has both KL directions as limiting cases
Only requires averages of powers of the factor, E_q[ f_a(w)^β ]
Choose β to make the integrals tractable, e.g. β = −1 for a reciprocal factor such as 1/Z(x, w)
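For concreteness, a small numeric check of my own (using the alpha-divergence parameterization from Minka's divergence-measures note, which I am assuming matches the slide): compute D_α(p||q) on a grid for a few values of α and confirm that it approaches KL(q||p) as α → 0 and KL(p||q) as α → 1.

```python
# Numeric check of the alpha-divergence D_alpha(p || q) on a grid.
# Parameterization assumed from Minka, "Divergence measures and message passing":
#   D_alpha(p||q) = ( ∫ alpha*p + (1-alpha)*q - p^alpha * q^(1-alpha) dw ) / (alpha*(1-alpha))
# Limits: alpha -> 0 gives KL(q||p) ("exclusive"), alpha -> 1 gives KL(p||q) ("inclusive").
import numpy as np

w = np.linspace(-12, 12, 6001)
dw = w[1] - w[0]

def gauss(w, mean, var):
    return np.exp(-0.5 * (w - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

p = 0.5 * gauss(w, -3.0, 1.0) + 0.5 * gauss(w, 3.0, 1.0)   # bimodal target
q = gauss(w, 0.0, 10.0)                                     # Gaussian approximation

def kl(a, b):
    return np.sum(a * np.log(a / b)) * dw

def alpha_div(p, q, alpha):
    integrand = alpha * p + (1 - alpha) * q - p ** alpha * q ** (1 - alpha)
    return np.sum(integrand) * dw / (alpha * (1 - alpha))

print("KL(q||p) =", round(kl(q, p), 4), "   KL(p||q) =", round(kl(p, q), 4))
for alpha in [0.01, 0.5, 0.99]:
    print("alpha =", alpha, " D_alpha(p||q) =", round(alpha_div(p, q, alpha), 4))
```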

Power EP for CRFs
Want to approximate the posterior over w, which factors as (prior)(partition fcn)(interactions):
  p(w | t, x) ∝ p(w) × 1/Z(x, w) × ∏_{(i,j) ∈ E} g(t_i, t_j, x; w)
Process the partition fcn term using a negative power, so that 1/Z(x, w) is handled through Z(x, w)
Z(x, w) is approximated by regular EP, where t is also a random variable

Algorithm structure
Competing EP processes for the numerator and denominator terms
– In the numerator, t is known
– In the denominator, t is inferred
Helps to interleave the updates for each process, keeping them balanced
– Otherwise may spiral out of control
For testing, only one EP is required since the denominator doesn’t come into play
– t is random in the numerator terms

Synthetic experiment
Each object has 3 sites, fully linked, 24 random features per edge
Labels generated from a random CRF
Same model trained via MAP or Bayes

Test error:
Algorithm   10 training objects   30 training   100 training
MAP         16%                   12%           10%
Bayes       11%                   10%            9%

FAQ labeling
Want to label each line of a FAQ list as being part of a question or an answer
19 training files, 500 lines on average, 24 binary features per line (contains question mark, indented, etc.)
Lines linked in a chain
MAP/Bayes used the same model and same priors

Algorithm   Test error
MAP         1.4%
Bayes       0.5%

Ink labeling
Want to label ink as part of container or connector
14 training diagrams
Linkage by spatial proximity, probit-type interactions (except for MAP-Exp)

Algorithm   Test error
MAP         6.0%
MAP-Exp     5.2%
Bayes       4.4%

Conclusions
Bayesian methods provide significant improvement on small training sets
Using power EP, the additional cost is minimal
Power EP also provides the model evidence, for hyperparameter selection (e.g. type of interactions)
Other places reciprocals arise: multiclass BPMs, Markov Random Fields
– Can use power EP to train them