1
Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel
2
Learning with Hidden Variables
Input: DATA over X_1 ... X_N, with the values of a hidden variable T unobserved. Output: a model P(X,T).
Problem: no closed-form solution for ML estimation → use Expectation Maximization (EM).
Problem: EM gets stuck in inferior local maxima → random restarts, deterministic / simulated annealing.
This talk: EM + information regularization for learning parameters.
[Figures: likelihood as a function of the parameters; model structure T → X_1, X_2, X_3]
3
Learning Parameters (no hidden variables)
Input: DATA over X_1 ... X_N. Output: a model P(X).
The data defines the empirical distribution Q(X); the ML parametrization of P copies its conditionals:
P(X_1) = Q(X_1), P(X_2|X_1) = Q(X_2|X_1), P(X_3|X_1) = Q(X_3|X_1)
[Figure: model structure X_1 → X_2, X_1 → X_3]
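To make this concrete, here is a minimal numpy sketch (my illustration, not code from the talk) that reads the empirical conditionals off a fully observed data table and uses them directly as the ML parameters for the structure X_1 → X_2, X_1 → X_3; the binary-variable setting and all names are illustrative.

```python
import numpy as np

def empirical_cpds(data):
    """ML parameters for X1 -> X2, X1 -> X3 with binary variables.

    data: integer array of shape (M, 3); one row per instance, columns X1, X2, X3.
    Returns P(X1), P(X2|X1), P(X3|X1), read off the empirical distribution Q.
    Assumes both values of X1 appear in the data.
    """
    M = data.shape[0]
    p_x1 = np.bincount(data[:, 0], minlength=2) / M            # Q(X1)
    p_x2_given_x1 = np.zeros((2, 2))
    p_x3_given_x1 = np.zeros((2, 2))
    for x1 in (0, 1):
        rows = data[data[:, 0] == x1]                          # instances with X1 = x1
        p_x2_given_x1[x1] = np.bincount(rows[:, 1], minlength=2) / len(rows)  # Q(X2|X1=x1)
        p_x3_given_x1[x1] = np.bincount(rows[:, 2], minlength=2) / len(rows)  # Q(X3|X1=x1)
    return p_x1, p_x2_given_x1, p_x3_given_x1
```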
4
Learning with Hidden Variables
Input: DATA over X_1 ... X_N with instance IDs Y = 1 ... M; the values of T are unobserved, so the empirical distribution Q(X,T) = ?
Desired: for each instance ID, a (soft) guess of the value of T, giving the completed empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y), from which a parametrization for P follows.
EM iterations alternate between these two steps.
[Figure: data table with hidden column T; desired model structure T → X_1, X_2, X_3]
5
The EM Algorithm
E-step: generate the empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y) using the current P.
M-step: maximize the expected log-likelihood E_Q[log P(X,T)] over the parameters of P.
EM Functional [Neal and Hinton, 1998]: EM is equivalent to optimizing a single functional of Q and P,
F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y),
and each E- and M-step increases its value.
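A minimal numpy sketch of this view (my illustration, not code from the talk): given the soft completion Q(T|Y) and the complete-data log-likelihoods under the current P, it evaluates F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), the functional that both the E-step and the M-step increase.

```python
import numpy as np

def em_functional(q_t_given_y, log_p_xt):
    """Neal-Hinton EM functional F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y).

    q_t_given_y: array (M, K); soft completion Q(T|Y) for M instances and K states of T.
    log_p_xt:    array (M, K); log P(x_y, t) under the current model P.
    Instances are weighted uniformly, i.e. Q(Y = y) = 1/M.
    """
    M = q_t_given_y.shape[0]
    expected_ll = np.sum(q_t_given_y * log_p_xt) / M                    # E_Q[log P(X,T)]
    entropy = -np.sum(q_t_given_y * np.log(q_t_given_y + 1e-12)) / M    # H_Q(T|Y)
    return expected_ll + entropy
```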
6
Information Bottleneck EM
Target:  L_IB-EM = γ·F_EM[Q,P] - (1-γ)·I_Q(T;Y)
i.e., the EM target traded off against the information between the hidden variable T and the instance ID Y.
In the rest of the talk:
Understanding this objective.
How to use it to learn better models.
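The sketch below (my illustration of the reconstructed target above; uniform Q(Y) = 1/M and all names are assumptions) computes the regularizer I_Q(T;Y) from the soft completion and combines it with the EM functional. At γ = 1 it reduces to the plain EM functional; at γ = 0 only the compression term remains.

```python
import numpy as np

def information_t_y(q_t_given_y):
    """I_Q(T;Y) for uniform Q(Y): the average KL divergence between Q(T|y) and Q(T)."""
    M = q_t_given_y.shape[0]
    q_t = q_t_given_y.mean(axis=0)                                      # marginal Q(T)
    ratio = np.log(q_t_given_y + 1e-12) - np.log(q_t + 1e-12)
    return np.sum(q_t_given_y * ratio) / M

def ib_em_target(q_t_given_y, log_p_xt, gamma):
    """L_IB-EM = gamma * F_EM[Q,P] - (1 - gamma) * I_Q(T;Y)."""
    M = q_t_given_y.shape[0]
    expected_ll = np.sum(q_t_given_y * log_p_xt) / M                    # E_Q[log P(X,T)]
    entropy = -np.sum(q_t_given_y * np.log(q_t_given_y + 1e-12)) / M    # H_Q(T|Y)
    f_em = expected_ll + entropy
    return gamma * f_em - (1.0 - gamma) * information_t_y(q_t_given_y)
```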
7
Information Regularization
Motivating idea:
Fitting the training data: set T to be the instance ID, which "predicts" X perfectly.
Generalizing: "forget" the ID and keep only the essence of X.
Objective: trade off a (lower bound of the) likelihood of P against compression of the instance ID; this is a parameter-free regularization of Q [Tishby et al., 1999].
8
Clustering example: total compression. All instances are collapsed into a single clustering, so the compression measure is 0; the EM target is correspondingly low. [Figure: instances 1-11 merged into one cluster, with bars for the EM target and the compression measure]
9
Clustering example: total preservation. T simply copies the instance ID, so every instance keeps its own cluster and the compression measure is 1; the training-data fit, and hence the EM target, is maximal. [Figure: instances 1-11 each in their own cluster; T = ID; bars for the EM target and the compression measure]
10
Clustering example: the desired solution with |T| = 2 groups the instances into two coherent clusters, {1,3,5,7,9,11} and {2,4,6,8,10}. Where do its EM target and compression measure fall (= ?)? [Figure: instances 1-11 split into the two clusters, with bars for the EM target and the compression measure]
11
Information Bottleneck EM: formal equivalence with the Information Bottleneck.
At γ = 1, L_IB-EM reduces to the EM functional, so EM and the Information Bottleneck coincide [generalizing the result of Slonim and Weiss for the univariate case].
12
Information Bottleneck EM: formal equivalence with the Information Bottleneck.
For a fixed P and γ, the maximum over Q(T|Y) is obtained at the fixed point
Q(t|y) = Q(t)^(1-γ) · P(t|x_y)^γ / Z(y,γ)
where P(t|x_y) is the prediction of T using P, Q(t) is the marginal of T in Q, and Z(y,γ) is the normalization.
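A sketch of the resulting E-step update (my reading of the fixed-point equation above, not the authors' code): for each instance, blend the model's prediction of T with the current marginal of T in Q, with exponents γ and 1-γ, and normalize; the self-consistent equations are iterated because the marginal Q(T) itself depends on Q(T|Y).

```python
import numpy as np

def ib_e_step(q_t_given_y, log_p_t_given_x, gamma, n_iter=20):
    """Fixed-point update Q(t|y) proportional to Q(t)^(1-gamma) * P(t|x_y)^gamma.

    q_t_given_y:     array (M, K); current soft completion, used as the starting point.
    log_p_t_given_x: array (M, K); log P(t | x_y) under the current model P.
    """
    q = q_t_given_y.copy()
    for _ in range(n_iter):
        q_t = q.mean(axis=0)                                    # marginal of T in Q
        log_q = (1.0 - gamma) * np.log(q_t + 1e-12) + gamma * log_p_t_given_x
        log_q -= log_q.max(axis=1, keepdims=True)               # for numerical stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)                       # normalization Z(y, gamma)
    return q
```

At γ = 1 this is exactly the standard E-step, Q(t|y) = P(t|x_y), which is the coincidence with EM noted above.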
13
The IB-EM Algorithm (for a fixed γ)
Iterate until convergence:
E-step: maximize L_IB-EM by optimizing Q.
M-step: maximize L_IB-EM by optimizing P (same as the standard M-step).
Each step improves L_IB-EM, so the procedure is guaranteed to converge.
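Schematically, the fixed-γ loop just alternates the two maximizations until L_IB-EM stops improving. A hedged sketch of that structure (illustrative only; e_step, m_step and objective stand in for the model-specific updates, e.g. the fixed-point E-step above and the standard M-step):

```python
def ib_em_fixed_gamma(q, params, gamma, e_step, m_step, objective,
                      tol=1e-6, max_iter=200):
    """Iterate E- and M-steps for a fixed gamma until L_IB-EM converges.

    e_step(q, params, gamma)    -> new Q(T|Y)  (maximizes L_IB-EM over Q)
    m_step(q)                   -> new params  (maximizes L_IB-EM over P; standard M-step)
    objective(q, params, gamma) -> value of L_IB-EM
    """
    prev = objective(q, params, gamma)
    for _ in range(max_iter):
        q = e_step(q, params, gamma)          # E-step: optimize Q
        params = m_step(q)                    # M-step: optimize P
        curr = objective(q, params, gamma)
        if curr - prev < tol:                 # each step improves L_IB-EM
            break
        prev = curr
    return q, params
```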
14
Information Bottleneck EM
Target:  L_IB-EM = γ·F_EM[Q,P] - (1-γ)·I_Q(T;Y)
the EM target traded off against the information between the hidden variable T and the instance ID Y.
In the rest of the talk:
Understanding this objective.
How to use it to learn better models.
15
Continuation
The problem is easy at γ = 0 and hard at γ = 1: follow the ridge of L_IB-EM from the optimum at γ = 0.
[Figure: the L_IB-EM surface over (Q, γ), with the ridge of local maxima traced from γ = 0 to γ = 1]
16
Continuation
Recall: if Q is a local maximum of L_IB-EM, then the fixed-point equation for Q(t|y) holds for all t and y.
We want to follow a path in (Q, γ) space along which this condition keeps holding, i.e., a path of local maxima for all γ.
[Figure: the path of local maxima in the (Q, γ) plane between γ = 0 and γ = 1]
17
Continuation Step
1. Start at a point (Q, γ) where the fixed-point condition holds.
2. Compute the gradient of this condition with respect to (Q, γ).
3. Take the direction tangent to the path (orthogonal to the gradient).
4. Take a step in the desired direction.
[Figure: a step along the path of local maxima in the (Q, γ) plane, starting from the solution at γ = 0]
18
Staying on the Ridge
Potential problem: the direction is only tangent to the path, so a step can miss the optimum.
Solution: use EM steps (at the new γ) to regain the path.
[Figure: a tangent step drifting off the path in the (Q, γ) plane and being pulled back by EM steps]
19
The IB-EM Algorithm
Set γ = 0 (start at the easy solution).
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-step: maximize L_IB-EM by optimizing Q.
    M-step: maximize L_IB-EM by optimizing P.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Take the step by changing γ and Q.
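A simplified predictor-corrector sketch of this outer loop (illustrative only, reusing ib_em_fixed_gamma from the sketch above): it replaces the gradient-based direction with a plain step in γ and relies on the E/M iterations to pull (Q, P) back onto the ridge, as described on the previous slide.

```python
def ib_em_continuation(q, params, e_step, m_step, objective, step=0.05):
    """Anneal gamma from 0 (easy solution) to 1 (EM solution), tracking the ridge.

    Simplification: instead of computing the tangent direction explicitly,
    take a small step in gamma and let the E/M iterations (the corrector)
    regain the path of local maxima.
    """
    gamma = 0.0
    q, params = ib_em_fixed_gamma(q, params, gamma, e_step, m_step, objective)
    while gamma < 1.0:
        gamma = min(1.0, gamma + step)                       # follow the ridge: step in gamma
        q, params = ib_em_fixed_gamma(q, params, gamma,      # stay on the ridge: E/M steps
                                      e_step, m_step, objective)
    return q, params
```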
20
Calibrating the Step Size
Potential problem:
Step size too small → too slow.
Step size too large → overshoot the target → inferior solution.
[Figure: an overly large step in the (Q, γ) plane overshooting the path]
21
Calibrating the Step Size: use the change in I(T;Y).
Recall that I(T;Y) measures how much of the instance ID is preserved; when I(T;Y) rises, more of the data is captured.
Non-parametric: involves only Q.
Can be bounded: I(T;Y) ≤ log2 |T|.
[Figure: I(T;Y) as a function of γ; a naive uniform grid of γ values is too sparse in the "interesting" area where I(T;Y) changes rapidly]
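One simple way to realize this calibration (a sketch of the idea only, not the exact procedure from the talk): shrink the step in γ whenever the resulting change in I(T;Y) exceeds a threshold, and grow it when the change is negligible. Because I(T;Y) involves only Q and is bounded by log2 |T|, fixed thresholds are meaningful across models; the threshold values below are arbitrary.

```python
import numpy as np

def information_bits(q_t_given_y):
    """I_Q(T;Y) in bits, for uniform Q(Y); bounded above by log2 |T|."""
    M = q_t_given_y.shape[0]
    q_t = q_t_given_y.mean(axis=0)
    ratio = np.log2(q_t_given_y + 1e-12) - np.log2(q_t + 1e-12)
    return np.sum(q_t_given_y * ratio) / M

def calibrate_step(step, q_before, q_after, max_delta=0.05, min_delta=0.005):
    """Adapt the gamma step size from the observed change in I(T;Y)."""
    delta = abs(information_bits(q_after) - information_bits(q_before))
    if delta > max_delta:        # I(T;Y) jumped: we are in the "interesting" area, refine
        return step * 0.5
    if delta < min_delta:        # nothing changed: safe to move faster
        return step * 2.0
    return step
```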
22
The IB-EM Algorithm
Set γ = 0.
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-step: maximize L_IB-EM by optimizing Q.
    M-step: maximize L_IB-EM by optimizing P.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Calibrate the step size using I(T;Y).
    Take the step by changing γ and Q.
23
The Stock Dataset [Boyen et al., 1999]
Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 training and 303 test instances.
IB-EM outperforms the best of the EM solutions.
I(T;Y) follows the changes of the likelihood, and the continuation roughly follows the region of change.
[Figure: I(T;Y) and train likelihood as functions of γ for IB-EM vs. the best of EM; marks show the evaluated γ values]
24
Multiple Hidden Variables
We want to learn a model with many hidden variables (T_1, ..., T_K).
Naive approach: an exact Q over all hiddens is potentially exponential in the number of hiddens.
Variational approximation: use a factorized form for Q (Mean Field) [Friedman et al., 2002], giving
L_IB-EM = γ·(Variational EM functional) - (1-γ)·Regularization
[Figure: the model P and the factorized approximation Q(T|Y)]
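With a factorized (Mean Field) completion Q(T_1,...,T_K | Y) = ∏_k Q(T_k|Y), the quantities above decompose per hidden variable. The sketch below is my illustration of that factorized representation; taking the regularization term to be the sum of per-variable informations Σ_k I(T_k;Y) is an assumption on my part, not a formula from the talk.

```python
import numpy as np

def mean_field_regularization(q_list):
    """Sum of I(T_k; Y) over hidden variables, for a factorized Mean Field
    completion Q(T_1..T_K | Y) = prod_k Q(T_k | Y), with uniform Q(Y).

    q_list: list of arrays, one per hidden variable, each of shape (M, K_k).
    NOTE: using the sum of per-variable terms as the regularizer is an assumption here.
    """
    total = 0.0
    for q in q_list:
        M = q.shape[0]
        q_t = q.mean(axis=0)                                          # marginal Q(T_k)
        total += np.sum(q * (np.log(q + 1e-12) - np.log(q_t + 1e-12))) / M
    return total
```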
25
The USPS Digits Dataset
400 samples, 21 hidden variables.
A single IB-EM run (27 min) is superior to all Mean Field EM runs (1 min/run) and takes about the time of a single exact EM run (25 min/run).
Only 3/50 exact EM runs do as well, and EM needs x17 the time for similar results: IB-EM offers good value for your time!
[Figure: test log-loss per instance vs. percentage of random runs, comparing Mean Field EM, exact EM, and the single IB-EM run]
26
Yeast Stress Response
173 experiments (variables), 6152 genes (samples), 25 hidden variables, each connected to 5-24 experiments.
IB-EM (~6 hours) is superior to all Mean Field EM runs (~0.5 hours each) and an order of magnitude faster than exact EM (>60 hours).
Effective when the exact solution becomes intractable!
[Figure: test log-loss per instance vs. percentage of random Mean Field EM runs]
27
Summary
New framework for learning hidden variables.
Formal relation between the Information Bottleneck and EM.
Continuation for bypassing local maxima.
Flexible: arbitrary structure / variational approximations.

Future Work
Learn the optimal γ ≤ 1 for better generalization.
Explore other approximations of Q(T|Y).
Model selection: learning cardinality and enriching the structure.
28
Relation to Weight Annealing [Elidan et al., 2002]
Weight Annealing: initialize the temperature to hot; iterate until cold: perturb the instance weights w according to the temperature, optimize using the reweighted empirical distribution Q_W, and cool down.
Similarities: both change the empirical Q and morph towards the EM solution.
Differences: IB-EM uses information regularization and continuation; WA requires a cooling policy, but is applicable to a wider range of problems.
29
Relation to Deterministic Annealing
Deterministic Annealing: initialize the temperature to hot; iterate until cold: "insert" entropy proportional to the temperature into the model, optimize the noisy model, and cool down.
Similarities: both use an information measure and morph towards the EM solution.
Differences: DA is parameterization dependent and requires a cooling policy, while IB-EM uses continuation; DA is applicable to a wider range of problems.