Constrained Approximate Maximum Entropy Learning (CAMEL)
Varun Ganapathi, David Vickrey, John Duchi, Daphne Koller
Stanford University



Undirected Graphical Models
- Undirected graphical model over a random vector (X_1, X_2, ..., X_N)
- Graph G = (V, E) with N vertices
- θ: model parameters
- Inference is intractable when the graph is densely connected; approximate inference (e.g., BP) can work well
- How to learn θ given data?
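To ground the notation (a standard form, not text from the slide), such a model defines a Gibbs distribution over the cliques of G:

    P_\theta(x_1, \ldots, x_N) \;=\; \frac{1}{Z(\theta)} \prod_{c} \phi_c(x_c; \theta),
    \qquad
    Z(\theta) \;=\; \sum_{x} \prod_{c} \phi_c(x_c; \theta).

The partition function Z(θ) couples all the variables, which is why exact inference (and hence exact likelihood computation) is intractable in densely connected graphs.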

Maximizing Likelihood with BP
- MRF likelihood is convex: optimize with CG/L-BFGS, estimating the gradient with BP*
- But BP is finding a fixed point of a non-convex problem: multiple local minima, convergence problems
- Result: an unstable double-loop learning algorithm
[Diagram: outer loop (L-BFGS) updates θ; inner loop (inference) returns the log-likelihood L(θ) and gradient ∇_θ L(θ)]
* Shental et al., 2003; Taskar et al., 2002; Sutton & McCallum, 2005
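For context, the gradient estimated in the inner loop is the usual log-linear likelihood gradient (a standard identity, reconstructed here rather than copied from the slide):

    \nabla_\theta L(\theta) \;=\; \sum_{m=1}^{M} f\big(x^{(m)}\big) \;-\; M \,\mathbb{E}_{P_\theta}\!\big[f(X)\big].

The model expectation requires marginals of P_θ; BP supplies only approximate, possibly non-converged marginals, which is what makes the double loop unstable.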

Multiclass Image Segmentation (Gould et al., Multi-Class Segmentation with Relative Location Prior, IJCV 2008)
Simplified example
- Goal: image segmentation and labeling
- Model: conditional random field
  - Nodes: superpixel class labels
  - Edges: dependency relations
- Dense network with tight loops
- Potentials ⇒ BP converges anyway
- However, BP in the inner loop of learning almost never converges

Our Solution
- Unified variational objective for parameter learning
- Can be applied to any entropy approximation
- Convergent algorithm for non-convex entropies
- Accommodates parameter sharing, regularization, and conditional training
- Extends several existing objectives/methods:
  - Piecewise training (Sutton and McCallum, 2005)
  - Unified propagation and scaling (Teh and Welling, 2002)
  - Pseudo-moment matching (Wainwright et al., 2003)
  - Estimating the wrong graphical model (Wainwright, 2006)

Log-Linear Pairwise MRFs
- All results apply to general MRFs
[Equations on slide define the node potentials, edge potentials, cliques, and (pseudo-)marginals]
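The slide's equations did not survive extraction; a log-linear pairwise parameterization consistent with the labels above would be (my reconstruction, with assumed feature names f_i and f_ij):

    P_\theta(x) \;\propto\; \exp\!\Big( \sum_{i \in V} \theta^\top f_i(x_i)
        \;+\; \sum_{(i,j) \in E} \theta^\top f_{ij}(x_i, x_j) \Big),

with one (pseudo-)marginal π_c per clique c (nodes and edges).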

Maximum Entropy
- Equivalent to maximum likelihood
- Regularization and conditional training can be handled easily (see paper)
- Q is exponential in the number of variables
[Program on slide: maximize the entropy of Q subject to moment matching, normalization, and non-negativity]
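Written out, the program behind these labels is the standard maximum-entropy formulation (a reconstruction, not the slide's exact notation):

    \max_{Q} \; H(Q)
    \quad \text{s.t.} \quad
    \mathbb{E}_{Q}\big[f_c(X_c)\big] = \hat{\mathbb{E}}\big[f_c(X_c)\big] \;\;\forall c,
    \qquad \sum_{x} Q(x) = 1,
    \qquad Q(x) \ge 0.

Its dual recovers maximum-likelihood estimation in the log-linear family, but Q itself ranges over a space exponential in the number of variables.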

Maximum Entropy → Approximate Maximum Entropy
- Exact problem: entropy objective over the full distribution, with moment matching, normalization, and non-negativity constraints
- Relaxation: approximate entropy over clique (pseudo-)marginals, with moment matching, local consistency, normalization, and non-negativity constraints

CAMEL
- Objective: approximate entropy, subject to moment matching, local consistency, normalization, and non-negativity constraints
- Concavity depends on the counting numbers n_c
- Bethe (non-concave): singletons n_c = 1 - deg(x_i); edge cliques n_c = 1
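Combining the labels, the CAMEL program over clique pseudo-marginals π can be written as (my reconstruction of the slide's equations):

    \max_{\pi} \;\; \sum_{c} n_c\, H(\pi_c)
    \quad \text{s.t.} \quad
    \mathbb{E}_{\pi_c}\big[f_c\big] = \hat{\mathbb{E}}\big[f_c\big] \;\;\forall c
    \;\;\text{(moment matching)},

    \sum_{x_c \setminus x_i} \pi_c(x_c) = \pi_i(x_i)
    \;\;\text{(local consistency)},
    \qquad
    \sum_{x_c} \pi_c(x_c) = 1,
    \qquad
    \pi_c \ge 0.

With the Bethe counting numbers the entropy term is non-concave; other choices of n_c make the whole program concave.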

Simple CAMEL
- Same constraints: moment matching, local consistency, normalization, non-negativity
- Simple concave objective: n_c = 1 for all cliques c

Piecewise Training*
- Simply drop the marginal consistency constraints
- The dual objective is then the sum of local likelihood terms over cliques
* Sutton & McCallum, 2005
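For reference, dropping those constraints yields (in the dual) the piecewise objective of Sutton & McCallum: a sum of independently normalized per-clique log-likelihoods. Schematically,

    \ell_{\mathrm{PW}}(\theta) \;=\; \sum_{c} \sum_{m=1}^{M}
    \Big( \theta^\top f_c\big(x_c^{(m)}\big)
    \;-\; \log \sum_{x'_c} \exp\big(\theta^\top f_c(x'_c)\big) \Big).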

Convex-Concave Procedure
- Objective: Convex(x) + Concave(x) (used by Yuille, 2003)
- Approximate objective: g^T x + Concave(x)
- Repeat: maximize the approximate objective, then choose a new approximation
- Guaranteed to converge to a fixed point (a sketch of the loop follows below)
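A minimal Python sketch of this loop, assuming the caller supplies the gradient of the convex part and an exact solver for the concave surrogate; the callback interface and the toy example are illustrative assumptions, not the paper's implementation (which solves the CAMEL dual with L-BFGS at each step):

    import numpy as np

    def cccp(x0, grad_convex, solve_concave_plus_linear, max_iters=50, tol=1e-8):
        """Maximize f(x) = convex(x) + concave(x) by repeated linearization.

        Each outer iteration replaces the convex part by its linearization
        g^T x at the current point and maximizes the concave surrogate
        g^T x + concave(x) exactly.
        """
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iters):
            g = grad_convex(x)                    # linearize convex part at x
            x_new = solve_concave_plus_linear(g)  # argmax_y g^T y + concave(y)
            if np.linalg.norm(x_new - x) < tol:   # reached a fixed point
                return x_new
            x = x_new
        return x

    # Toy check: f(x) = 0.5 x^2 + (-(x - 2)^2) has its maximum at x = 4.
    x_star = cccp(
        x0=[0.0],
        grad_convex=lambda x: x,                            # gradient of 0.5 x^2
        solve_concave_plus_linear=lambda g: 2.0 + g / 2.0,  # argmax_y g*y - (y-2)^2
    )
    print(x_star)  # ~[4.0]

Because the linearization lower-bounds the convex part and is tight at the current point, each surrogate maximization cannot decrease the true objective, which is why only a handful of outer iterations are typically needed.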

Algorithm
- Repeat:
  - Choose g to linearize the objective about the current point
  - Solve the resulting unconstrained dual problem
[Primal shown on slide: approximate entropy, subject to moment matching, local consistency, normalization, and non-negativity]

Dual Problem
- Sum of local likelihood terms, similar to multiclass logistic regression
- g acts as a bias term for each cluster
- Local consistency constraints reduce to another feature
- Lagrange multipliers correspond to weights and messages
- Simultaneous inference and learning avoids the problem of setting a convergence threshold

Experiments
Algorithms compared:
- Double loop with BP in the inner loop
  - Residual belief propagation (Elidan et al., 2006)
  - Messages saved between calls; reset during line search
  - 10 restarts with random messages
- CAMEL + Bethe
- Simple CAMEL
- Piecewise (Simple CAMEL without local consistency)
All used L-BFGS (Zhu et al., 1997); BP at test time

Segmentation
- One variable per superpixel
- 7 classes: Rhino, Polar Bear, Water, Snow, Vegetation, Sky, Ground
- 84 parameters
- Lots of loops; densely connected

Named Entity Recognition
- One variable per word
- 4 classes: Person, Location, Organization, Misc.
- Skip-chain CRF (Sutton and McCallum, 2004): words connected in a chain, with long-range dependencies between repeated words
- ~400k features, ~3 million weights
[Figure: skip-chain CRF over X_0 ... X_102 for the text "Speaker John Smith ... Professor Smith will", with a skip edge linking the two occurrences of "Smith"]

Results
- Small number of relinearizations (< 10)

Discussion
- Local consistency constraints add a good bias
- NER has millions of moment-matching constraints: moment matching ⇒ learned distribution ≈ empirical ⇒ local consistency is naturally satisfied
- Segmentation has only 84 parameters ⇒ local consistency is rarely satisfied
[Diagram: which constraints bind, moment matching vs. local consistency, for NER and Segmentation]

Conclusions
- The CAMEL algorithm unifies learning and inference
- Optimizes the Bethe approximation to the entropy
- Repeated convex optimization with a simple form
- Only a few iterations required (can stop early, too!)
- Convergent, stable
- Our results suggest that constraints on the probability distribution are more important to learning than the entropy approximation

Future Work
- For inference, evaluate the relative benefit of approximations to the entropy and to the constraints
- Learn with tighter outer bounds on the marginal polytope
- New optimization methods that exploit the structure of the constraints

Related Work
- Unified Propagation and Scaling (Teh & Welling, 2002)
  - Similar idea: uses Bethe entropy and local constraints for learning
  - No parameter sharing, conditional training, or regularization
  - Its optimization procedure (coordinate-wise updates) does not work well with a large amount of parameter sharing
- Pseudo-moment matching (Wainwright et al., 2003)
  - No parameter sharing, conditional training, or regularization
  - Falls out of our formulation: it corresponds to the case where the moment-matching constraints have only one feasible point

Running Time
- NER dataset: piecewise is about twice as fast
- Segmentation dataset: we pay a large cost because there are many more dual parameters (several per edge), but we get an improvement

Bethe Free Energy (LBP as Optimization)
Constraints on the pseudo-marginals:
- Pairwise consistency: Σ_{x_i} π_ij(x_i, x_j) = π_j(x_j)
- Local normalization: Σ_{x_i} π_i(x_i) = 1
- Non-negativity: π_i(x_i) ≥ 0
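Filling in the formula the title refers to, for a pairwise model with node potentials φ_i and edge potentials ψ_ij (the standard Yedidia et al. form, assumed here rather than copied from the slide):

    F_{\mathrm{Bethe}}(\pi) \;=\;
    -\sum_{(i,j)\in E}\sum_{x_i,x_j} \pi_{ij}(x_i,x_j)\,\log \psi_{ij}(x_i,x_j)
    \;-\; \sum_{i\in V}\sum_{x_i} \pi_i(x_i)\,\log \phi_i(x_i)
    \;-\; H_{\mathrm{Bethe}}(\pi),

    H_{\mathrm{Bethe}}(\pi) \;=\; \sum_{(i,j)\in E} H(\pi_{ij})
    \;-\; \sum_{i\in V} \big(\deg(i)-1\big)\, H(\pi_i).

Fixed points of loopy BP correspond to stationary points of F_Bethe over the constraint set above, which is why running LBP and optimizing the Bethe free energy can be viewed as the same computation.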

Optimizing Bethe CAMEL
- Relinearize: g ← ∇_π ( −Σ_i deg(i) H(π_i) ), evaluated at the current point π*
- Solve the resulting concave problem
- A similar concept is used in the CCCP algorithm (Yuille et al., 2002)

Maximizing Likelihood with BP
- Goal: maximize the likelihood of the data
- Optimization is difficult: inference doesn't converge, inference has multiple local minima, and CG/L-BFGS fail!
[Flowchart: initialize θ; until done, run loopy BP to obtain L(θ) and ∇_θ L(θ), then let CG/L-BFGS update θ]
- Loopy BP searches for a fixed point of a non-convex problem (Yedidia et al., Generalized Belief Propagation, 2002)