Fast and Accurate Prediction via Evidence-Specific MRF Structure
Veselin Stoyanov and Jason Eisner, Johns Hopkins University

Goal
We want to learn evidence-specific structures for MRFs; that is, different examples may rely on different factors to predict the outputs.

Motivation
1. Data may be missing heterogeneously, as is often the case with relational data. Our models can learn to rely more on the evidence and less on other inferred variables.
2. We want our models to perform fast test-time prediction, and we are willing to trade off some accuracy for speed. We therefore optimize an interpolation of the loss function and the speed, loss + λ · runtime (a sketch of this objective follows the Approach section).

Approach
1. Gates. The formalism of gates [Minka and Winn, 2008] allows us to capture contextual (in)dependencies.
2. ERMA (Empirical Risk Minimization under Approximations). We use the ERMA algorithm [Stoyanov, Ropson and Eisner, 2011] to jointly learn gate and MRF parameters by performing empirical risk minimization.
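To make the objective concrete, here is a minimal sketch, not the authors' implementation: it assumes a real-valued score per candidate factor from the gate classifiers, uses a sigmoid as the soft pruning threshold, and counts the expected number of active factors as a proxy for runtime. All names are illustrative.

```python
import numpy as np

def soft_gates(gate_scores, sharpness=10.0):
    # Soft pruning threshold: instead of the hard rule "keep the factor
    # iff its score exceeds 0", squash scores through a sigmoid so the
    # objective stays differentiable in the gate parameters.
    return 1.0 / (1.0 + np.exp(-sharpness * gate_scores))

def combined_objective(loss, gate_scores, lam=0.1):
    # loss:        task loss of the (approximate) prediction on one example
    # gate_scores: one real-valued score per candidate factor
    # lam:         interpolation weight trading accuracy for speed
    g = soft_gates(gate_scores)
    runtime_proxy = g.sum()  # expected number of active factors
    return loss + lam * runtime_proxy

# Example: three candidate factors, task loss 0.7.
obj = combined_objective(0.7, np.array([2.0, -1.0, 0.3]))
```

Because both terms are differentiable, the gradients of error and runtime with respect to the parameters can be computed jointly, which is what the soft threshold in the training procedure below enables.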
Gates
Gates [Minka and Winn, 2008] can capture contextual (in)dependencies. A gate is a random variable that turns a factor on or off: the gated factor ψ(x) enters the model as ψ(x)^{1[g = on]}, so it applies when the gate g is on and disappears when g is off. In our model, gates are classifiers that decide whether a factor should be on or off based on the observation pattern (see the results section below). Gates can also be partially on: we use the posterior marginal probability of a gate to damp the messages sent through the corresponding factor.

A two-step test-time procedure (sketched at the end of this transcript):
1. Compute the gate values from the observation pattern.
2. Turn off (or damp) factors and run inference. If a gate's marginal probability is low, we prune the factor, which improves speed at a slight cost to accuracy.

[Figure: a factor graph over variables A-F with gates G1-G7 controlling its factors, shown alongside the sparser graph that remains after pruning the factors whose gates are off.]

ERMA
Empirical Risk Minimization under Approximations [Stoyanov, Ropson and Eisner, 2011] is an algorithm for learning in probabilistic graphical models that matches test-time conditions and performs empirical risk minimization. ERMA uses back-propagation of error to compute gradients of the output loss (a toy illustration closes this transcript). Empirical evidence shows that ERMA performs well on real-world problems that require approximations [Stoyanov and Eisner, 2012]. It is easy to extend ERMA so that it optimizes gate parameters and MRF parameters jointly. ERMA implementation at:

Preliminary Results (Synthetic Data)
We generate a random 4-ary MRF and sample training and test data. We then discard the structure and start learning with a fully connected graph restricted to binary (pairwise) factors. For each data point we randomly designate each r.v. as input, hidden, or output.

Training procedure: Our training objective is loss + λ · runtime. We train by replacing the hard pruning threshold with a soft one, so that we can compute the gradients of error and runtime w.r.t. the parameters.

Gate features: The gate G_AB that controls the factor between r.v.s A and B is conditioned on the values of A and B. Each r.v. (A or B) can be in {0, 1, hidden, output}. Gate features are the conjunction of the factor id and the two variable values. When both A and B are observed, we can simply turn the factor off, so the total number of features is 4×4 − 2×2 = 12 (enumerated in the sketch below).

We compare the evidence-specific MRF to an L1-regularized model.
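The feature count above can be checked mechanically. Below is a minimal sketch, not the authors' code, that enumerates the gate features for one pairwise factor; the state names and factor id are illustrative.

```python
from itertools import product

# Each endpoint of a pairwise factor is in one of four states.
STATES = ["0", "1", "hidden", "output"]
OBSERVED = {"0", "1"}

def gate_features(factor_id):
    # A gate feature is the conjunction of the factor id and the states
    # of the two endpoint variables. When both endpoints are observed,
    # the factor can simply be turned off, so no feature is needed.
    return [
        (factor_id, a, b)
        for a, b in product(STATES, STATES)
        if not (a in OBSERVED and b in OBSERVED)
    ]

assert len(gate_features("F_AB")) == 4 * 4 - 2 * 2  # 12 features
```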
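The two-step test-time procedure from the Gates section might look as follows. This is a sketch under stated assumptions: the gate is a log-linear classifier over features like those above, and a partially-on gate damps a factor by raising its potential to the gate's marginal probability (so probability 1 keeps the factor intact and probability 0 makes it uniform, i.e., off). The pruning cutoff and all names are illustrative.

```python
import numpy as np

def gate_marginal(weights, features):
    # Step 1: probability that the gate is on, computed from the
    # observation pattern via a log-linear gate classifier.
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + np.exp(-score))

def damp_factor(potential, p_on, prune_below=0.05):
    # Step 2: prune factors whose gates are almost surely off; otherwise
    # damp the factor by exponentiating its potential by the gate
    # marginal, which damps the messages it sends.
    if p_on < prune_below:
        return None  # factor removed entirely: faster inference
    return potential ** p_on

# Example: a pairwise potential over two binary variables.
psi = np.array([[2.0, 0.5],
                [0.5, 2.0]])
p = gate_marginal({("F_AB", "0", "hidden"): 1.3}, [("F_AB", "0", "hidden")])
psi_eff = damp_factor(psi, p)  # run inference on the damped/pruned graph
```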
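Finally, a toy illustration of the ERMA idea referenced above: the empirical risk is measured on the output of the approximate inference procedure itself, matching test-time conditions, and its gradient is taken through that procedure. The stand-in inference step and the finite-difference gradient below are purely illustrative; ERMA instead obtains this gradient efficiently by back-propagating the error through the inference computation.

```python
import numpy as np

def approx_inference(theta, evidence):
    # Stand-in for approximate inference: a differentiable map from
    # parameters plus evidence to output marginals (here a softmax).
    scores = theta @ evidence
    e = np.exp(scores - scores.max())
    return e / e.sum()

def risk(theta, evidence, target):
    # Loss measured on the approximate marginals, as at test time.
    return -np.log(approx_inference(theta, evidence)[target])

def num_grad(f, theta, eps=1e-6):
    # Central finite differences, for illustration only.
    g = np.zeros_like(theta)
    it = np.nditer(theta, flags=["multi_index"])
    for _ in it:
        i = it.multi_index
        hi, lo = theta.copy(), theta.copy()
        hi[i] += eps
        lo[i] -= eps
        g[i] = (f(hi) - f(lo)) / (2 * eps)
    return g

theta = np.zeros((2, 3))
grad = num_grad(lambda t: risk(t, np.array([1.0, 0.5, -0.2]), 1), theta)
```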