Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Computer Science Dept., Stanford University
UAI 2008
Presented by Haojun Chen, August 1st, 2008
Outline
- Background and motivation
- Undirected transfer hierarchies
- Experiments
- Degree-of-transfer coefficients
- Experiments
- Summary
Background (1/2)
- Transfer learning: data from “similar” tasks/distributions are used to compensate for the sparsity of training data in the primary class or task
- Example: use rhinos to help learn elephants’ shape
Background (2/2)
- Hierarchical Bayes (HB) framework: a principled approach to transfer learning
- Example of a hierarchical Bayes parameterization, where
  - C: a set of related learning tasks/classes
  - D: observed data
  - θ_c: task/class parameters
- Joint distribution over the observed data and all class parameters:
  P(D, θ) = P(θ_root) ∏_{c ∈ C} P(θ_c | θ_pa(c)) P(D_c | θ_c)
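To make the factorization concrete, here is a minimal sketch of evaluating such a joint log-density for a toy two-level hierarchy; the Gaussian child-parent priors, unit variances, and all names here are illustrative assumptions, not the paper's model.

```python
import numpy as np
from scipy.stats import norm

# Toy HB model: theta_root ~ N(0, tau0^2), theta_c ~ N(theta_root, tau^2),
# and each instance in D_c ~ N(theta_c, sigma^2).
def log_joint(theta_root, thetas, data, tau0=10.0, tau=1.0, sigma=1.0):
    """log P(D, theta) = log P(theta_root) + sum_c [log P(theta_c | theta_root) + log P(D_c | theta_c)]."""
    lp = norm.logpdf(theta_root, 0.0, tau0)            # prior on the root
    for theta_c, D_c in zip(thetas, data):
        lp += norm.logpdf(theta_c, theta_root, tau)    # child-parent prior
        lp += norm.logpdf(D_c, theta_c, sigma).sum()   # class-conditional likelihood
    return lp
```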
Motivation
- In practice, MAP point estimation is often desirable, since full Bayesian computation can be difficult and computationally demanding
- Yet efficient point estimation is out of reach in many standard hierarchical Bayes models: common conjugate priors such as the Dirichlet or the normal-inverse-Wishart do not, in general, yield a concave log-posterior, so the MAP problem is not convex
- This paper proposes an undirected hierarchical Bayes (HB) reformulation that allows efficient convex point estimation
Undirected HB Reformulation
  max_θ F(θ : D) = F_data(θ : D) − λ · Div(θ_children, θ_parents)
- F_data: data-dependent objective
- Div: divergence function over child and parent parameters
- λ (tradeoff coefficient) → 0: encourages parameters to explain the data
- λ → ∞: encourages parameters to be similar to their parents
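A minimal sketch of this objective, assuming a Gaussian F_data and an L2 divergence (both choices, and all names below, are illustrative): with a concave F_data and a convex Div the problem is convex, so a generic solver finds the global optimum.

```python
import numpy as np
from scipy.optimize import minimize

def neg_objective(params, data, lam, dim):
    """-(F_data(theta : D) - lam * sum_c ||theta_c - theta_parent||^2)."""
    theta_parent = params[:dim]
    thetas = params[dim:].reshape(len(data), dim)
    f_data = sum(-0.5 * np.sum((D - th) ** 2) for th, D in zip(thetas, data))  # Gaussian log-lik, unit variance
    div = sum(np.sum((th - theta_parent) ** 2) for th in thetas)               # L2 child-parent divergence
    return -(f_data - lam * div)

data = [np.random.randn(5, 2) + 1.0, np.random.randn(3, 2) - 1.0]  # two sparsely observed classes
x0 = np.zeros(2 + 2 * len(data))
res = minimize(neg_objective, x0, args=(data, 0.5, 2))  # convex problem: local optimum = global
```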
Purpose of Reformulation
- Easy to specify
  - F_data can be a likelihood, classification, or other objective
  - Div can be an L1 norm, L2 norm, ε-insensitive loss, KL divergence, etc.
  - No conjugacy or proper-prior restrictions
- Easy to optimize
  - Convex in θ if F_data is concave and Div is convex
Experiment: Text Categorization
- Newsgroup20 dataset
- Bag-of-words model
  - F_data: multinomial log-likelihood (regularized), with θ_i the frequency of word i
  - Divergence: L2 norm
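A sketch of these two pieces for the bag-of-words case; the softmax parameterization and the additive smoothing are assumptions made here to keep the example self-contained, not the paper's exact regularization.

```python
import numpy as np

def f_data(log_theta, counts, eps=1e-2):
    """Regularized multinomial log-likelihood: sum_i n_i * log(theta_i)."""
    theta = np.exp(log_theta - np.logaddexp.reduce(log_theta))  # normalize to a distribution
    return counts @ np.log(theta + eps)                         # eps: simple smoothing/regularization

def div_l2(theta_child, theta_parent):
    """L2 divergence between child and parent word-frequency parameters."""
    return np.sum((theta_child - theta_parent) ** 2)
```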
Text Categorization Results
[Figure: classification rate vs. total number of training instances on Newsgroup topic classification, comparing Max Likelihood (no regularization), Regularized Max Likelihood, Shrinkage, and Undirected HB]
Baselines:
- Maximum likelihood at each node (no hierarchy)
- Cross-validated regularization (no hierarchy)
- Shrinkage (McCallum et al. ’98, with hierarchy)
Experiment: Shape Modeling
- Density estimation, evaluated by test likelihood
- Mammals dataset (Fink, ’05); instances represented by 60 x-y landmark coordinates on the outline
- Model: mean landmark location, covariance over landmarks, regularization
- Divergence: L2 norm over mean and variance
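A sketch of a Gaussian shape model over landmark vectors with the L2 divergence applied to both mean and variance parameters; the diagonal covariance and the particular regularizer are simplifying assumptions of this example.

```python
import numpy as np

def f_data(mu, var, X, reg=1e-3):
    """Gaussian log-likelihood of landmark vectors X (instances x coordinates), lightly regularized."""
    ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
    return ll - reg * np.sum(mu ** 2)

def div(mu_c, var_c, mu_p, var_p):
    """L2 divergence over both mean and variance parameters of child and parent."""
    return np.sum((mu_c - mu_p) ** 2) + np.sum((var_c - var_p) ** 2)
```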
Undirected HB Shape Modeling Results
[Figure: delta log-loss per instance vs. total number of training instances for ten mammal pairs (Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino), comparing Undirected HB against Regularized Max Likelihood; Elephant-Rhino highlighted]
Problem in Transfer
- Not all parameters deserve equal sharing
Degrees of Transfer (DOT)
- The divergence is split into subcomponents with weights β_{c,i}, so different strengths are allowed for different subcomponents and child-parent pairs:
  Div(θ_c, θ_pa(c)) = Σ_i (1/β_{c,i}) d(θ_c[i], θ_pa(c)[i])
- β_{c,i} → 0: forces parameters to agree
- β_{c,i} → ∞: allows parameters to be flexible
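A one-function sketch of the per-component divergence; using the squared distance for d is an assumption of this example, while the β semantics follow the slide.

```python
import numpy as np

def div_dot(theta_c, theta_p, beta):
    """Per-component DOT divergence: a small beta pulls a component toward the
    parent, a large beta leaves it flexible."""
    return np.sum((theta_c - theta_p) ** 2 / beta)
```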
Estimation of DOT Parameters
- Hyper-prior approach
  - Bayesian idea: put a prior P(β) on the degrees of transfer and add β to the optimization along with θ
  - Concretely: an inverse-Gamma prior (β is forced to be positive)
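Under the squared-distance divergence above and an inverse-Gamma(α, γ) hyperprior (my notation), minimizing d/β + (α+1)·log β + γ/β over each β > 0 with θ fixed has the closed form β* = (d + γ)/(α + 1); a sketch:

```python
import numpy as np

def update_beta(theta_c, theta_p, alpha=1.0, gamma=1.0):
    """Closed-form coordinate update for the DOT coefficients under an
    inverse-Gamma(alpha, gamma) hyperprior and squared divergences."""
    d = (theta_c - theta_p) ** 2        # per-component squared divergence
    return (d + gamma) / (alpha + 1.0)  # elementwise argmin; always positive
```

Alternating such β updates with updates of θ is one natural way to optimize the joint objective.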
DOT Shape Modeling Results
[Figure: delta log-loss per instance vs. total number of training instances for the ten mammal pairs, comparing the Hyperprior DOT approach against Regularized Max Likelihood; Elephant-Rhino highlighted]
Distribution of DOT Coefficients
[Figure: distribution of the DOT coefficients 1/β learned with the hyperprior approach, ranging from stronger transfer (large 1/β) to weaker transfer (small 1/β), shown up to the root of the hierarchy]
Summary
- An undirected reformulation of the hierarchical Bayes framework is proposed for efficient convex point estimation
- Different degrees of transfer for different parameters are introduced, so that some parts of the distribution can be transferred to a greater extent than others