Variational Infinite Hidden Conditional Random Fields with Coupled Dirichlet Process Mixtures K. Bousmalis, S. Zafeiriou, L.-P. Morency, M. Pantic, Z. Ghahramani


Hidden Conditional Random Field

[Figure: a chain-structured HCRF. Per-frame observations X_1, ..., X_5 (gestures such as head nod, head shake, shoulder shrug, hand wag, hand scissor; audio features F0 and energy) connect to hidden states s_1, ..., s_5, each taking one of the hidden-state values h_a, h_b, h_c, which in turn connect to the sequence label y. The model is asked P(y = 'Agreement' | X) = ? and P(y = 'Disagreement' | X) = ?]
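These questions are answered by the standard HCRF conditional likelihood (supplied here for reference rather than recovered from the slide), which marginalizes over hidden-state sequences s:

P(y | X; θ) = ∑_s exp{Ψ(y, s, X; θ)} / ∑_{y'} ∑_s exp{Ψ(y', s, X; θ)}

where Ψ is the potential function detailed on the next slide.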

Learned HCRF Model

Weights and equivalent potentials for each relationship:
– hidden states and labels: θ_y, with potential exp{θ_y}
– features and hidden states: θ_x, with potential exp{∑_t f_t θ_x}
– transitions among hidden states and labels: θ_e, with potential exp{θ_e}

[Figure: the chain HCRF from the previous slide, with hidden-state values h_a, h_b, h_c.]
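Summed over a sequence, these terms give the usual HCRF potential (a standard reconstruction from the bullets above, not copied from the slide):

Ψ(y, s, X; θ) = ∑_t f_t θ_x(s_t) + ∑_t θ_y(s_t, y) + ∑_t θ_e(s_{t-1}, s_t, y)

with the edge term running over consecutive frames.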

HCRF Problems

– The number of hidden states is not intuitive for behavior problems.
– Cross-validating the number of hidden states is computationally expensive.
– Solution: allow for a potentially infinite number of hidden states.

[Figure: the chain HCRF from the previous slides.]

Motivation and Novelty

Previous work introduced infinite-state HCRFs with an efficient MCMC sampling approach (IHCRF-MCMC). This work proposes a model that is a generalization of:
– finite HCRFs, in terms of its ability to automatically determine its hidden structure without cross-validation;
– IHCRF-MCMC, in terms of its ability to handle continuous input gracefully.

We present a novel variational inference method for learning:
– a deterministic alternative to MCMC
– a precise stopping criterion for learning

Our Framework

– No a priori bound on the number of hidden states, achieved by introducing a set of random variables (the π-variables)...
– ...which are drawn from distinct processes that allow the number of hidden states to grow with the data...
– ...and are incorporated into our potential: [equation shown on slide]

The HCRF-DPM Model

In our model, the π-variables are driven by coupled DPs. According to the stick-breaking properties,

π_{ω_μ, k} = v_{ω_μ, k} ∏_{j=1}^{k-1} (1 − v_{ω_μ, j}),   v_{ω_μ, k} ~ Beta(1, α),

where ω_μ = {h_k, y}.
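As a concrete illustration of this construction, here is a minimal sketch of a generic truncated stick-breaking sampler (illustrative Python, not the authors' code; alpha and num_sticks are arbitrary choices):

import numpy as np

def stick_breaking(alpha, num_sticks, seed=None):
    """Draw stick-breaking weights pi_1, pi_2, ...

    Each pi_k = v_k * prod_{j<k} (1 - v_j), with v_k ~ Beta(1, alpha).
    Truncated at num_sticks for illustration.
    """
    rng = np.random.default_rng(seed)
    v = rng.beta(1.0, alpha, size=num_sticks)            # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                                 # weights pi_k

# Small alpha concentrates mass on few sticks (few effective hidden states);
# larger alpha spreads it out, letting the number of states grow with the data.
pi = stick_breaking(alpha=1.0, num_sticks=20, seed=0)
print(pi, pi.sum())  # sums to < 1; the remainder lies beyond the truncation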

The HCRF-DPM Model

[Figure: graphical model of the HCRF-DPM. Concentration hyperparameters α_x, α_y, α_e govern the infinite collections of stick-breaking weights π_x, π_y, π_e, which together with the observations X generate the hidden-state chain s_1, ..., s_T and the label y. The slide gives the actual joint distribution of the model.]

Variational Approximation

We approximate all π-variables (with variational parameters τ) using a truncated stick-breaking representation, which approximates the infinite number of hidden states with a finite truncation level L.

[Figure: illustration of a stick truncated at L = 5.]

In practice, L is large enough that the stick mass left beyond the truncation is really small!
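A sketch of how such a truncation is used in practice, in the style of standard variational treatments of DPs (illustrative Python, not the authors' implementation; Beta(τ1, τ2) variational factors on the stick proportions are assumed):

import numpy as np
from scipy.special import digamma

def expected_log_pi(tau1, tau2):
    """E_q[log pi_k] under q(v_k) = Beta(tau1[k], tau2[k]), truncated at L = len(tau1).

    log pi_k = log v_k + sum_{j<k} log(1 - v_j); under a Beta(a, b) factor,
    E[log v] = digamma(a) - digamma(a + b) and E[log(1 - v)] = digamma(b) - digamma(a + b).
    """
    e_log_v = digamma(tau1) - digamma(tau1 + tau2)
    e_log_1mv = digamma(tau2) - digamma(tau1 + tau2)
    # cumulative sum of E[log(1 - v_j)] over j < k
    prefix = np.concatenate([[0.0], np.cumsum(e_log_1mv)[:-1]])
    return e_log_v + prefix

# With L = 5 sticks, as in the slide's example, any mass past the 5th stick is ignored.
tau1 = np.ones(5)
tau2 = np.full(5, 2.0)
print(expected_log_pi(tau1, tau2))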

Model Training

Objective: find the parameters {θ, τ} that minimize KL[q||p].

We alternate, until convergence, between
– a coordinate descent method for finding τ
– an HCRF-like gradient ascent method for finding θ

Experiments: Human Behavior

Classification performance (F1) on:
1. Agreement vs. Disagreement (ADA2)
2. Agreement vs. Disagreement vs. Neutral (ADA3)
3. Extreme Pain vs. No Pain (PAIN2)
4. Extreme vs. Moderate vs. No Pain (PAIN3)

Agreement and Disagreement

Canal 9 Dataset of Political Debates
– Ground truth based ONLY on verbal content
– 11 debates, 28 distinct individuals
– 53 episodes of agreement, 94 episodes of disagreement, 130 neutral episodes

Binary visual features: presence per frame of 8 gestures
Continuous auditory features: F0, energy

UNBC Dataset of Pain

– Different levels of elicited shoulder pain in 200 sequences from 25 subjects
– Annotations of 12 pain-related facial action units (AUs)
– 2 classification problems: extreme pain vs. minimal pain, and additionally including moderate pain

Classification Performance

– 10 different random initializations
– HCRFs cross-validated over 2, 3, 4, and 5 hidden states and regularization factors of 1, 10, and 100
– HCRF-DPM with truncation L = 10

[Figure: F1 scores for each model on each task.]

No Overfitting

[Figure: HCRF-DPM performance on the Canal 9 validation set.]

Sparsity

[Figure: node-feature weights across hidden states for an HCRF-DPM with L = 50 (left) and a finite HCRF with K = 50 hidden states (right).]

Future Avenues

– More datasets
– Using HDPs and Pitman-Yor processes
– Infinite latent dynamic CRFs

Thank you! Poster Stand #46 for more details

Dirichlet Process Mixture

A DPM model is a hierarchical Bayesian model that uses a DP as a nonparametric prior.
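In standard notation (the textbook DPM formulation, added here for reference), a draw G from a DP with concentration α and base measure G_0 serves as the prior over mixture-component parameters:

G ~ DP(α, G_0)
φ_i | G ~ G             (component parameters, one per observation)
x_i | φ_i ~ F(φ_i)      (observation model)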

HCRF-DPM π-sticks

Although the π-variables are drawn from distinct processes, they are coupled together by a common latent variable assignment.

[Figure: stick-breaking weights for each feature (Feature 1, Feature 2), each label (Label 1, Label 2), and each previous state (Prev. State 1, 2, 3).]

Variational Approximation

[Slide contrasts the actual joint distribution with the approximate (variational) joint distribution; both equations are shown on the slide.]

Model Training for Variational HCRF-DPM

Initialize α_x, α_y, α_e, θ, τ
Initialize nbItrs, nbVarItrs
itr = 0
converged = FALSE
while (not converged) and (itr < nbItrs) do
    varItr = 0
    varConverged = FALSE
    while (not varConverged) and (varItr < nbVarItrs) do
        Compute q(s_t = h_k | i), q(s_t = h_k | y), q(s_t = h_k, y, s_{t-1} = h_a),
            i.e. the approximate marginals, by using the forward-backward algorithm
        Hyperparameter posterior sampling for α_x, α_y, α_e
        Calculate KL[q||p](varItr)
        Update τ
        varItr = varItr + 1
    end while
    Gradient ascent to find θ(itr) by using a quasi-Newton method and an Armijo
        backtracking line search with projected gradients to keep θ non-negative
    itr = itr + 1
end while
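The θ-step names a quasi-Newton method with an Armijo backtracking line search and projected gradients; below is a minimal, self-contained sketch of that last ingredient on a toy concave objective (plain projected gradient ascent for simplicity; everything here is illustrative rather than the authors' code):

import numpy as np

def project(theta):
    """Projection onto the non-negative orthant, as required for theta."""
    return np.maximum(theta, 0.0)

def armijo_projected_ascent(f, grad, theta0, step0=1.0, beta=0.5,
                            sigma=1e-4, max_itrs=200, tol=1e-8):
    """Projected gradient ascent with Armijo backtracking on the step size."""
    theta = project(theta0)
    for _ in range(max_itrs):
        g = grad(theta)
        step = step0
        # shrink the step until the Armijo sufficient-increase condition holds
        while True:
            candidate = project(theta + step * g)
            if f(candidate) >= f(theta) + sigma * g @ (candidate - theta):
                break
            step *= beta
            if step < 1e-12:
                return theta  # no further ascent possible
        if np.linalg.norm(candidate - theta) < tol:
            return candidate
        theta = candidate
    return theta

# Toy concave objective with unconstrained maximizer at (1, -2); the projection
# enforces theta >= 0, so the constrained maximizer is (1, 0).
f = lambda t: -np.sum((t - np.array([1.0, -2.0])) ** 2)
grad = lambda t: -2.0 * (t - np.array([1.0, -2.0]))
print(armijo_projected_ascent(f, grad, np.array([5.0, 5.0])))  # ~ [1., 0.]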

Performance Evaluation

Classification of agreement and disagreement:
– leave-2-debates-out for testing (5 folds)
– optimal parameter choice based on 3 debates

Classification of pain levels:
– leave-1-subject-out for testing (25 folds)
– optimal parameter choice based on 7 subjects

Synthetic Dataset

[Figure shown on slide.]