Convex Point Estimation using Undirected Bayesian Transfer Hierarchies
Gal Elidan, Ben Packer, Geremy Heitz, Daphne Koller
Stanford AI Lab
Motivation
Task: Shape modeling
Problem: With few instances, learned models aren't robust
[Figure: mean shape and principal components at +1/-1 standard deviations, learned from few instances]
Transfer Learning
Can we use rhinos to help elephants?
Shape is stabilized, but doesn't look like an elephant
[Figure: mean shape and principal components at +1/-1 standard deviations]
Hierarchical Bayes
[Diagram: hierarchy with a root node over Elephant and Rhino nodes; P(root) at the top, P(Elephant | root) and P(Rhino | root) on the edges, and P(Data_Elephant | Elephant), P(Data_Rhino | Rhino) at the leaves]
Goals
Transfer between related classes
Range of settings and tasks
Probabilistic motivation
Multilevel, complex hierarchies
Simple, efficient computation
Automatically learn what to transfer
Hierarchical Bayes
Compute the full posterior P(Θ | D)
P(Θ_c | Θ_root) must be conjugate to P(D | Θ_c)
Problem: Often can't perform full Bayesian computations
[Diagram: root over Elephant and Rhino nodes]
Approximation: Point Estimation
Best parameters are good enough; we don't need the full distribution
Empirical Bayes with point estimation
Other approximations: posterior approximated as a normal, sampling, etc.
[Diagram: root over Elephant and Rhino nodes]
More Issues: Multiple Levels
Conjugate priors usually can't be extended to multiple levels (e.g., Dirichlet, inverse-Wishart)
Exception: Thibaux and Jordan ('05)
More Issues: Restrictive Priors
Example: the Normal-inverse-Wishart prior over Gaussian parameters (N = # samples, d = dimension)
Pseudocount restriction: the prior pseudocounts must be >= d
If d is large and N is small, the signal from the prior overwhelms the data
We show experiments with N = 3, d = 20
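To see the restriction concretely, standard inverse-Wishart algebra (assuming a fixed mean for simplicity; Ψ is the prior scale matrix and S the data scatter from N samples):

\[
\Sigma \sim \mathcal{IW}(\Psi, \nu) \;\Rightarrow\; \Sigma \mid D \sim \mathcal{IW}(\Psi + S,\; \nu + N), \qquad \mathbb{E}[\Sigma \mid D] \;=\; \frac{\Psi + S}{\nu + N - d - 1}
\]

With ν ≥ d = 20 pseudocounts against N = 3 samples, the prior term Ψ dominates the data term S.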
Alternative: Shrinkage (McCallum et al. '98)
1. Compute the maximum likelihood estimate at each node
2. "Shrink" each node toward its parent: a linear combination of the node's ML estimate and its parent's, with the mixing weight chosen by cross-validation (a sketch follows)
Pros: simple to compute; handles multiple levels
Cons: naive heuristic for transfer; averaging not always appropriate
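A minimal sketch of this style of shrinkage (the grid search and function names are illustrative assumptions, not McCallum et al.'s exact procedure):

```python
import numpy as np

def shrink(theta_ml, theta_parent, lambdas, heldout_loglik):
    """Form theta = lam * theta_ml + (1 - lam) * theta_parent, choosing the
    mixing weight lam by held-out likelihood (the cross-validation step)."""
    best_lam, best_score = None, -np.inf
    for lam in lambdas:
        theta = lam * theta_ml + (1.0 - lam) * theta_parent
        score = heldout_loglik(theta)  # evaluate candidate on held-out data
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam * theta_ml + (1.0 - best_lam) * theta_parent
```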
Undirected HB Reformulation
Defines an undirected Markov random field model over the parameters Θ and data D
Related: Probabilistic Abstraction Hierarchies (Segal et al. '01)
Undirected Probabilistic Model
F_data: encourages parameters to explain the data (high is good)
Divergence: encourages parameters to be similar to their parents (low is good)
[Diagram: root over Elephant and Rhino; Divergence terms on the edges, F_data terms at the leaves]
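Schematically, the model being described scores a joint parameter assignment as (notation follows the slides' F_data / Divergence decomposition):

\[
\max_{\Theta}\;\; \sum_{c} F_{\mathrm{data}}\!\left(\Theta_c;\, D_c\right) \;-\; \sum_{c} \mathrm{Div}\!\left(\Theta_c,\, \Theta_{\mathrm{parent}(c)}\right)
\]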
Purpose of Reformulation
Easy to specify:
F_data can be a likelihood, a classification objective, or another objective
Divergence can be L1 distance, L2 distance, ε-insensitive loss, KL divergence, etc.
No conjugacy or proper-prior restrictions
Easy to optimize: the optimization over Θ is convex if F_data is concave and Divergence is convex (we maximize a concave objective)
Application: Text Categorization
Task: categorize documents (Newsgroup20 dataset)
Bag-of-words model; θ_i represents the frequency of word i
F_data: multinomial log-likelihood (regularized)
Divergence: L2 norm
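A minimal sketch of one node's objective under these choices (the softmax parameterization and all names are illustrative assumptions):

```python
import numpy as np

def node_objective(theta, counts, theta_parent, lam):
    """Multinomial log-likelihood of word counts under softmax(theta),
    minus an L2 divergence pulling theta toward its parent's parameters."""
    log_p = theta - np.logaddexp.reduce(theta)       # log-softmax over vocabulary
    loglik = np.dot(counts, log_p)                   # multinomial log-likelihood
    div = lam * np.sum((theta - theta_parent) ** 2)  # L2 pull toward parent
    return loglik - div
```

Under this parameterization the log-likelihood is concave in θ and the L2 penalty is convex, so maximizing the objective is a convex problem, matching the previous slide.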
Baselines
1. Maximum likelihood at each node (no hierarchy)
2. Cross-validated regularization (no hierarchy)
3. Shrinkage (McCallum et al. '98; uses the hierarchy)
Can It Handle Multiple Levels?
[Plot: Newsgroup topic classification rate vs. total number of training instances (75-375), comparing max likelihood (no regularization), regularized max likelihood, shrinkage, and undirected HB]
Application: Shape Modeling
Task: learn shape (density estimation, evaluated by test likelihood) on the Mammals Dataset (Fink, '05)
Instances represented by 60 x-y coordinates of landmarks on the outline
Model: mean landmark location and covariance over landmarks, with regularization
Divergence: L2 norm over the mean and variance parameters
Does Hierarchy Help?
[Plot: delta log-loss per instance vs. total number of training instances (6-30), relative to regularized max likelihood, for ten mammal pairs: Bison-Rhino, Elephant-Bison, Elephant-Rhino, Giraffe-Bison, Giraffe-Elephant, Giraffe-Rhino, Llama-Bison, Llama-Elephant, Llama-Giraffe, Llama-Rhino]
Unregularized max likelihood and shrinkage: much worse, not shown
Transfer
Not all parameters deserve equal sharing
Degrees of Transfer
Split the parameters into subcomponents θ_i, and give each divergence term its own weight λ_i
Allows different strengths of transfer for different subcomponents and child-parent pairs (see the formula below)
How do we estimate all these weight parameters?
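Schematically, the weighted divergence being described is (indexing is illustrative):

\[
\mathrm{Div}\!\left(\Theta_c,\, \Theta_p\right) \;=\; \sum_i \lambda_{c,i}\, \mathrm{div}\!\left(\theta_{c,i},\, \theta_{p,i}\right)
\]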
Learning Degrees of Transfer
Two approaches: bootstrap and hyper-prior (next slides).
Bootstrap approach:
If a child subcomponent θ_c,i and its parent's θ_p,i have a consistent relationship, we want to strongly encourage them to be similar
Define the difference δ_i = θ_c,i - θ_p,i; we want to estimate the variance of δ_i across all possible datasets
Select random subsets of the data, refit, and set λ_i from the empirical variance of δ_i (inversely, so that a consistent relationship, i.e., low variance, yields a λ_i that strongly encourages similarity; a sketch follows)
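A minimal sketch of the bootstrap idea (the inverse-variance rule, subset sizes, and all names are illustrative assumptions):

```python
import numpy as np

def bootstrap_dot(data_child, data_parent, fit, n_boot=20, subset_frac=0.5, rng=None):
    """Estimate one degree-of-transfer weight per subcomponent: refit child and
    parent on random data subsets, measure the variance of their difference,
    and set lambda to the inverse variance (consistent pairs -> large lambda).
    `fit` maps a data subset to a parameter vector."""
    rng = rng or np.random.default_rng(0)
    diffs = []
    for _ in range(n_boot):
        sub_c = rng.choice(len(data_child), int(subset_frac * len(data_child)), replace=False)
        sub_p = rng.choice(len(data_parent), int(subset_frac * len(data_parent)), replace=False)
        diffs.append(fit(data_child[sub_c]) - fit(data_parent[sub_p]))
    var = np.var(np.stack(diffs), axis=0)
    return 1.0 / (var + 1e-8)  # small constant guards against zero variance
```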
Degrees of Transfer: Bootstrap Approach
If we use an L2 norm for the Div terms, each weighted term λ_i ||θ_c,i - θ_p,i||^2 resembles a Gaussian prior on the child centered at its parent, with the degree of transfer acting as a precision (see the identity below)
If we were to fix the value of λ this way, this amounts to empirical Bayes: "undirected empirical Bayes estimation"
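Up to normalization, this is the standard Gaussian identity (λ plays the role of a precision):

\[
\exp\!\left(-\lambda \left\|\theta_c - \theta_p\right\|^2\right) \;\propto\; \mathcal{N}\!\left(\theta_c \,;\; \theta_p,\; \tfrac{1}{2\lambda} I\right)
\]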
Learning Degrees of Transfer: Hyper-prior Approach
Bayesian idea: put a prior on λ
Add λ as a parameter to the optimization along with Θ
Concretely: an inverse-Gamma prior on each degree of transfer (forces λ to be positive; its log-density is given below)
If the likelihood is concave, the entire optimization remains convex!
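For reference, the inverse-Gamma log-density that this adds to the objective for each λ (α and β are user-chosen hyperparameters):

\[
\lambda \sim \mathrm{InvGamma}(\alpha, \beta): \qquad \log P(\lambda) \;=\; -(\alpha+1)\log\lambda \;-\; \frac{\beta}{\lambda} \;+\; \mathrm{const}, \qquad \lambda > 0
\]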
Do Degrees of Transfer Help?
[Plot: delta log-loss per instance vs. total number of training instances (6-30) for the ten mammal pairs using the hyperprior approach, relative to regularized max likelihood; Bison-Rhino highlighted]
Do Degrees of Transfer Help?
Const: no degrees of transfer; Hyperprior and Bootstrap: use degrees of transfer
[Plot: Bison-Rhino; delta log-loss per instance vs. total number of training instances (6-30) for RegML, Const, Bootstrap, and Hyperprior]
Do Degrees of Transfer Help?
[Plot: Llama-Rhino; delta log-loss per instance vs. total number of training instances (6-30) for RegML, Bootstrap, and Hyperprior]
Const: >100 bits worse
Degrees of Transfer
[Histogram: distribution of DOT coefficients 1/λ learned with the hyperprior; smaller values indicate stronger transfer, larger values weaker transfer]
Summary
Transfer between related classes
Range of settings and tasks
Probabilistic motivation
Multilevel, complex hierarchies
Simple, efficient computation
Refined transfer of components
Future Work
Non-tree hierarchies (multiple inheritance), e.g., the Gene Ontology (GO) network or the WordNet hierarchy
Block degrees of transfer
Structure learning: the general undirected model doesn't require a tree structure
Part discovery