Convex Point Estimation using Undirected Bayesian Transfer Hierarchies. Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller. Computer Science Dept., Stanford University.

Presentation transcript:

Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Computer Science Dept., Stanford University
UAI 2008
Presented by Haojun Chen, August 1st, 2008

Outline
– Background and motivation
– Undirected transfer hierarchies
– Experiments
– Degree of transfer coefficients
– Experiments
– Summary

Background (1/2)
Transfer learning: data from "similar" tasks/distributions is used to compensate for the sparsity of training data in the primary class or task.
Example: use rhinos to help learn elephants' shape.

Background (2/2)
Hierarchical Bayes (HB) framework: a principled approach for transfer learning.
Example of a hierarchical Bayes parameterization, where C is a set of related learning tasks/classes, D is the observed data, and θ_c are the task/class parameters.
The joint distribution over the observed data and all class parameters is:
P(D, θ) = P(θ_root) ∏_{c ∈ C} P(θ_c | θ_pa(c)) P(D_c | θ_c),
where pa(c) denotes the parent of class c in the hierarchy.
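Taking logs of this joint gives the point-estimation (MAP) objective that the Motivation slide refers to. The notation below (classes c, data D_c, parent pa(c)) follows the reconstruction above and is a standard HB formulation, not a quote from the slides:

```latex
\max_{\theta}\ \log P(\mathcal{D}, \theta)
  \;=\; \max_{\theta}\ \sum_{c \in \mathcal{C}}
      \Big[ \log P(\mathcal{D}_c \mid \theta_c) + \log P(\theta_c \mid \theta_{pa(c)}) \Big]
    \;+\; \log P(\theta_{root})
```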

Motivation
In practice, MAP point estimation is often preferable, because full Bayesian computations can be difficult and computationally demanding.
In many standard hierarchical Bayes models, however, efficient point estimation is out of reach, because common conjugate priors such as the Dirichlet or the normal-inverse-Wishart are not log-concave jointly in the child and parent parameters, so the MAP problem is not convex.
This paper proposes an undirected hierarchical Bayes (HB) reformulation that allows efficient convex point estimation.
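As one hedged illustration of the non-convexity claim (this specific prior is a common choice, not necessarily the exact one used in the paper): if a child's multinomial parameters get a Dirichlet prior re-centered at the parent, P(θ_c | θ_pa) = Dir(θ_c ; α θ_pa) with strength α, then its log contains a term that is bilinear in the parent parameters and the log of the child parameters:

```latex
\log P(\theta_c \mid \theta_{pa})
  = \sum_i \big(\alpha\,\theta_{pa,i} - 1\big)\,\log \theta_{c,i}
    \;+\; \log\Gamma(\alpha) \;-\; \sum_i \log\Gamma\!\big(\alpha\,\theta_{pa,i}\big)
```

The bilinear term θ_pa,i · log θ_c,i is not jointly concave in (θ_pa, θ_c), so the joint MAP problem over all levels of the hierarchy is not a convex optimization problem, even though each conditional is well behaved on its own.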

Undirected HB Reformulation
The hierarchy is re-expressed as an objective: a data-dependent term F_data(D; θ) minus, for every child-parent pair, a weighted divergence Div(θ_c, θ_pa(c)).
F_data: data-dependent objective
Div: divergence function over child and parent parameters
Divergence weight → 0: encourages parameters to explain the data
Divergence weight → ∞: encourages parameters to be similar to their parents

Purpose of Reformulation
Easy to specify:
– F_data can be a likelihood, classification, or other objective
– Divergence can be the L1-norm, L2-norm, ε-insensitive loss, KL divergence, etc.
– No conjugacy or proper-prior restrictions
Easy to optimize:
– Convex over θ if F_data is concave and the Divergence is convex

Experiment: Text categorization (Newsgroup20 dataset)
Bag-of-words model
F_data: multinomial log-likelihood (regularized)
θ_i: frequency of word i
Divergence: L2 norm
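A minimal sketch (not the authors' code) of how such a convex point estimate could be computed for two sibling classes that share one parent node: the data term is the multinomial log-likelihood of word counts with a small pseudo-count as the "regularized" part, the divergence is the squared L2 distance between child and parent word-frequency vectors, and scipy is used as a generic convex solver. The toy counts, the trade-off weight `beta`, the pseudo-count `eps`, and all variable names are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy word counts for two child classes (e.g., two newsgroups) over a 4-word vocabulary.
counts = np.array([[8.0, 1.0, 1.0, 0.0],   # class A
                   [1.0, 7.0, 0.0, 2.0]])  # class B
V = counts.shape[1]   # vocabulary size
beta = 5.0            # divergence weight (assumed): trades data fit vs. similarity to parent
eps = 1e-3            # pseudo-count so the log-likelihood stays finite ("regularized")

def unpack(x):
    # x stacks [theta_parent, theta_A, theta_B], each a point on the simplex.
    return x[:V], x[V:2 * V], x[2 * V:]

def neg_objective(x):
    parent, *children = unpack(x)
    # F_data: multinomial log-likelihood of each child's counts under its own parameters.
    f_data = sum((counts[k] + eps) @ np.log(ch) for k, ch in enumerate(children))
    # Divergence: squared L2 distance between each child and the shared parent.
    divergence = sum(np.sum((ch - parent) ** 2) for ch in children)
    return -(f_data - beta * divergence)  # minimize the negative of the concave objective

# Each of the three parameter vectors must sum to 1 (multinomial constraint).
constraints = [{"type": "eq", "fun": (lambda x, i=i: x[i * V:(i + 1) * V].sum() - 1.0)}
               for i in range(3)]
x0 = np.full(3 * V, 1.0 / V)
res = minimize(neg_objective, x0, bounds=[(1e-6, 1.0)] * (3 * V),
               constraints=constraints, method="SLSQP")

theta_parent, theta_A, theta_B = unpack(res.x)
print("parent :", np.round(theta_parent, 3))
print("class A:", np.round(theta_A, 3))
print("class B:", np.round(theta_B, 3))
```

Because the multinomial log-likelihood is concave in the word-frequency parameters and the squared L2 divergence is convex, the function being minimized is convex, so any optimum the solver finds is global, matching the "easy to optimize" claim above.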

Text Categorization Result
[Figure: classification rate vs. total number of training instances for Newsgroup topic classification, comparing Max Likelihood (no regularization), Regularized Max Likelihood, Shrinkage, and Undirected HB.]
Baselines:
– Maximum likelihood at each node (no hierarchy)
– Cross-validated regularization (no hierarchy)
– Shrinkage (McCallum et al. '98, with hierarchy)

Experiment: Shape Modeling (density estimation – test likelihood)
Mammals dataset (Fink, '05)
Instances represented by 60 x-y coordinates of landmarks on the outline
Parameters: mean landmark location, covariance over landmarks, regularization
Divergence: L2 norm over mean and variance
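Here each class's shape is summarized by a Gaussian over the stacked landmark coordinates, so a hedged way to write the two terms of the objective for a child class c with parent pa(c) is the following (the exact parameterization and weighting in the paper may differ; x^(m) denotes the landmark-coordinate vector of training instance m):

```latex
F_{data} = \sum_{m} \log \mathcal{N}\!\big(x^{(m)} \mid \mu_c, \Sigma_c\big),
\qquad
\mathrm{Div}\big(\theta_c, \theta_{pa(c)}\big)
  = \big\lVert \mu_c - \mu_{pa(c)} \big\rVert_2^2
  + \big\lVert \Sigma_c - \Sigma_{pa(c)} \big\rVert_F^2
```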

Undirected HB Shape Modeling Result
[Figure: delta log-loss per instance vs. total number of training instances for each mammal pair (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino), comparing Undirected HB with Regularized Max Likelihood.]

Problem in Transfer
Not all parameters deserve equal sharing.

Degrees of Transfer (DOT)
The parameter vector θ is split into subcomponents, each with its own weight δ, so different transfer strengths are allowed for different subcomponents and child-parent pairs.
δ → 0: forces parameters to agree with the parent
δ → ∞: allows parameters to be flexible (fit the data)
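A hedged way to write the resulting objective, with the parameters split into subcomponents indexed by i and one degree-of-transfer coefficient per subcomponent and child-parent pair (the symbol δ and the exact placement of the coefficient are notational assumptions):

```latex
\max_{\theta}\ F_{data}(\mathcal{D};\theta)
  \;-\; \sum_{c \in \mathcal{C}} \sum_{i}
        \frac{1}{\delta_{c,i}}\,
        \mathrm{Div}\!\big(\theta_{c}[i],\ \theta_{pa(c)}[i]\big)
```

With this placement, δ_{c,i} → 0 makes the penalty unbounded and forces the subcomponent to agree with its parent, while δ_{c,i} → ∞ removes the penalty and lets it fit the data freely, matching the two limits on the slide.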

Estimation of DOT Parameters: Hyper-prior approach
Bayesian idea: put a prior on the DOT coefficients δ and add them to the optimization as parameters alongside θ.
Concretely: an inverse-Gamma prior on the degree of transfer (which forces δ to be positive).
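Under the hyper-prior approach, a sketch of the augmented objective is to add the log of an inverse-Gamma prior on each δ (constants dropped) and optimize jointly over θ and δ; the shape and scale symbols a and b are assumptions, not taken from the slides:

```latex
\max_{\theta,\ \delta > 0}\
  F_{data}(\mathcal{D};\theta)
  \;-\; \sum_{c,i} \frac{1}{\delta_{c,i}}\,
        \mathrm{Div}\!\big(\theta_c[i],\ \theta_{pa(c)}[i]\big)
  \;+\; \sum_{c,i} \Big[ -(a+1)\log \delta_{c,i} \;-\; \frac{b}{\delta_{c,i}} \Big]
```

For any fixed δ, the subproblem over θ remains convex whenever F_data is concave and Div is convex.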

DOT Shape Modeling Result
[Figure: delta log-loss per instance vs. total number of training instances for each mammal pair (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino), comparing the Hyperprior (DOT) approach with Regularized Max Likelihood.]

Distribution of DOT Coefficients
[Figure: distribution of the DOT coefficients learned with the hyperprior approach for the divergences to θ_root, annotated with which end of the 1/δ axis corresponds to stronger vs. weaker transfer.]

Summary
An undirected reformulation of the hierarchical Bayes framework is proposed for efficient convex point estimation.
Different degrees of transfer for different parameters are introduced, so that some parts of the distribution can be transferred to a greater extent than others.