1
Information Bottleneck EM
Gal Elidan and Nir Friedman
School of Engineering & Computer Science, The Hebrew University, Jerusalem, Israel
2
Learning with Hidden Variables
Input: DATA over X_1 ... X_N, with the values of a hidden variable T unobserved. Output: a model P(X,T).
Problem: no closed-form solution for ML estimation → use Expectation Maximization (EM).
Problem: EM gets stuck in inferior local maxima → random restarts, deterministic / simulated annealing.
This talk: EM + information regularization for learning parameters.
[Figures: likelihood as a function of the parameters; model structure T → X_1, X_2, X_3]
3
Learning Parameters (no hidden variables)
Input: DATA over X_1 ... X_N. Output: a model P(X).
The data defines the empirical distribution Q(X); the ML parametrization of P copies its conditionals:
P(X_1) = Q(X_1), P(X_2|X_1) = Q(X_2|X_1), P(X_3|X_1) = Q(X_3|X_1)
[Figure: model structure X_1 → X_2, X_1 → X_3]
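To make this concrete, here is a minimal numpy sketch (my illustration, not code from the talk) that reads the empirical conditionals off a fully observed data table and uses them directly as the ML parameters for the structure X_1 → X_2, X_1 → X_3; the binary-variable setting and all names are illustrative.

```python
import numpy as np

def empirical_cpds(data):
    """ML parameters for X1 -> X2, X1 -> X3 with binary variables.

    data: integer array of shape (M, 3); one row per instance, columns X1, X2, X3.
    Returns P(X1), P(X2|X1), P(X3|X1), read off the empirical distribution Q.
    Assumes both values of X1 appear in the data.
    """
    M = data.shape[0]
    p_x1 = np.bincount(data[:, 0], minlength=2) / M            # Q(X1)
    p_x2_given_x1 = np.zeros((2, 2))
    p_x3_given_x1 = np.zeros((2, 2))
    for x1 in (0, 1):
        rows = data[data[:, 0] == x1]                          # instances with X1 = x1
        p_x2_given_x1[x1] = np.bincount(rows[:, 1], minlength=2) / len(rows)  # Q(X2|X1=x1)
        p_x3_given_x1[x1] = np.bincount(rows[:, 2], minlength=2) / len(rows)  # Q(X3|X1=x1)
    return p_x1, p_x2_given_x1, p_x3_given_x1
```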
4
Learning with Hidden Variables
Input: DATA over X_1 ... X_N with instance IDs Y = 1 ... M; the values of T are unobserved, so the empirical distribution Q(X,T) = ?
Desired: for each instance ID, a (soft) guess of the value of T, giving the completed empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y), from which a parametrization for P follows.
EM iterations alternate between these two steps.
[Figure: data table with hidden column T; desired model structure T → X_1, X_2, X_3]
5
The EM Algorithm
E-step: generate the empirical distribution Q(X,T,Y) = Q(X,Y)·Q(T|Y) using the current P.
M-step: maximize the expected log-likelihood E_Q[log P(X,T)] over the parameters of P.
EM Functional [Neal and Hinton, 1998]: EM is equivalent to optimizing a single functional of Q and P,
F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y),
and each E- and M-step increases its value.
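A minimal numpy sketch of this view (my illustration, not code from the talk): given the soft completion Q(T|Y) and the complete-data log-likelihoods under the current P, it evaluates F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y), the functional that both the E-step and the M-step increase.

```python
import numpy as np

def em_functional(q_t_given_y, log_p_xt):
    """Neal-Hinton EM functional F_EM[Q,P] = E_Q[log P(X,T)] + H_Q(T|Y).

    q_t_given_y: array (M, K); soft completion Q(T|Y) for M instances and K states of T.
    log_p_xt:    array (M, K); log P(x_y, t) under the current model P.
    Instances are weighted uniformly, i.e. Q(Y = y) = 1/M.
    """
    M = q_t_given_y.shape[0]
    expected_ll = np.sum(q_t_given_y * log_p_xt) / M                    # E_Q[log P(X,T)]
    entropy = -np.sum(q_t_given_y * np.log(q_t_given_y + 1e-12)) / M    # H_Q(T|Y)
    return expected_ll + entropy
```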
6
Information Bottleneck EM
Target:  L_IB-EM = γ·F_EM[Q,P] - (1-γ)·I_Q(T;Y)
i.e., the EM target traded off against the information between the hidden variable T and the instance ID Y.
In the rest of the talk:
Understanding this objective.
How to use it to learn better models.
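The sketch below (my illustration of the reconstructed target above; uniform Q(Y) = 1/M and all names are assumptions) computes the regularizer I_Q(T;Y) from the soft completion and combines it with the EM functional. At γ = 1 it reduces to the plain EM functional; at γ = 0 only the compression term remains.

```python
import numpy as np

def information_t_y(q_t_given_y):
    """I_Q(T;Y) for uniform Q(Y): the average KL divergence between Q(T|y) and Q(T)."""
    M = q_t_given_y.shape[0]
    q_t = q_t_given_y.mean(axis=0)                                      # marginal Q(T)
    ratio = np.log(q_t_given_y + 1e-12) - np.log(q_t + 1e-12)
    return np.sum(q_t_given_y * ratio) / M

def ib_em_target(q_t_given_y, log_p_xt, gamma):
    """L_IB-EM = gamma * F_EM[Q,P] - (1 - gamma) * I_Q(T;Y)."""
    M = q_t_given_y.shape[0]
    expected_ll = np.sum(q_t_given_y * log_p_xt) / M                    # E_Q[log P(X,T)]
    entropy = -np.sum(q_t_given_y * np.log(q_t_given_y + 1e-12)) / M    # H_Q(T|Y)
    f_em = expected_ll + entropy
    return gamma * f_em - (1.0 - gamma) * information_t_y(q_t_given_y)
```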
7
Information Regularization
Motivating idea:
Fitting the training data: set T to be the instance ID, which "predicts" X perfectly.
Generalizing: "forget" the ID and keep only the essence of X.
Objective: trade off a (lower bound of the) likelihood of P against compression of the instance ID; this is a parameter-free regularization of Q [Tishby et al., 1999].
8
Clustering example: total compression. All instances are collapsed into a single clustering, so the compression measure is 0; the EM target is correspondingly low. [Figure: instances 1-11 merged into one cluster, with bars for the EM target and the compression measure]
9
Clustering example: total preservation. T simply copies the instance ID, so every instance keeps its own cluster and the compression measure is 1; the training-data fit, and hence the EM target, is maximal. [Figure: instances 1-11 each in their own cluster; T = ID; bars for the EM target and the compression measure]
10
Clustering example: the desired solution with |T| = 2 groups the instances into two coherent clusters, {1,3,5,7,9,11} and {2,4,6,8,10}. Where do its EM target and compression measure fall (= ?)? [Figure: instances 1-11 split into the two clusters, with bars for the EM target and the compression measure]
11
Information Bottleneck EM: formal equivalence with the Information Bottleneck.
At γ = 1, L_IB-EM reduces to the EM functional, so EM and the Information Bottleneck coincide [generalizing the result of Slonim and Weiss for the univariate case].
12
Information Bottleneck EM: formal equivalence with the Information Bottleneck.
For a fixed P and γ, the maximum over Q(T|Y) is obtained at the fixed point
Q(t|y) = Q(t)^(1-γ) · P(t|x_y)^γ / Z(y,γ)
where P(t|x_y) is the prediction of T using P, Q(t) is the marginal of T in Q, and Z(y,γ) is the normalization.
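A sketch of the resulting E-step update (my reading of the fixed-point equation above, not the authors' code): for each instance, blend the model's prediction of T with the current marginal of T in Q, with exponents γ and 1-γ, and normalize; the self-consistent equations are iterated because the marginal Q(T) itself depends on Q(T|Y).

```python
import numpy as np

def ib_e_step(q_t_given_y, log_p_t_given_x, gamma, n_iter=20):
    """Fixed-point update Q(t|y) proportional to Q(t)^(1-gamma) * P(t|x_y)^gamma.

    q_t_given_y:     array (M, K); current soft completion, used as the starting point.
    log_p_t_given_x: array (M, K); log P(t | x_y) under the current model P.
    """
    q = q_t_given_y.copy()
    for _ in range(n_iter):
        q_t = q.mean(axis=0)                                    # marginal of T in Q
        log_q = (1.0 - gamma) * np.log(q_t + 1e-12) + gamma * log_p_t_given_x
        log_q -= log_q.max(axis=1, keepdims=True)               # for numerical stability
        q = np.exp(log_q)
        q /= q.sum(axis=1, keepdims=True)                       # normalization Z(y, gamma)
    return q
```

At γ = 1 this is exactly the standard E-step, Q(t|y) = P(t|x_y), which is the coincidence with EM noted above.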
13
The IB-EM Algorithm (for a fixed γ)
Iterate until convergence:
E-step: maximize L_IB-EM by optimizing Q.
M-step: maximize L_IB-EM by optimizing P (same as the standard M-step).
Each step improves L_IB-EM, so the procedure is guaranteed to converge.
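Schematically, the fixed-γ loop just alternates the two maximizations until L_IB-EM stops improving. A hedged sketch of that structure (illustrative only; e_step, m_step and objective stand in for the model-specific updates, e.g. the fixed-point E-step above and the standard M-step):

```python
def ib_em_fixed_gamma(q, params, gamma, e_step, m_step, objective,
                      tol=1e-6, max_iter=200):
    """Iterate E- and M-steps for a fixed gamma until L_IB-EM converges.

    e_step(q, params, gamma)    -> new Q(T|Y)  (maximizes L_IB-EM over Q)
    m_step(q)                   -> new params  (maximizes L_IB-EM over P; standard M-step)
    objective(q, params, gamma) -> value of L_IB-EM
    """
    prev = objective(q, params, gamma)
    for _ in range(max_iter):
        q = e_step(q, params, gamma)          # E-step: optimize Q
        params = m_step(q)                    # M-step: optimize P
        curr = objective(q, params, gamma)
        if curr - prev < tol:                 # each step improves L_IB-EM
            break
        prev = curr
    return q, params
```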
14
Information Bottleneck EM
Target:  L_IB-EM = γ·F_EM[Q,P] - (1-γ)·I_Q(T;Y)
the EM target traded off against the information between the hidden variable T and the instance ID Y.
In the rest of the talk:
Understanding this objective.
How to use it to learn better models.
15
Continuation
The problem is easy at γ = 0 and hard at γ = 1: follow the ridge of L_IB-EM from the optimum at γ = 0.
[Figure: the L_IB-EM surface over (Q, γ), with the ridge of local maxima traced from γ = 0 to γ = 1]
16
Continuation
Recall: if Q is a local maximum of L_IB-EM, then the fixed-point equation for Q(t|y) holds for all t and y.
We want to follow a path in (Q, γ) space along which this condition keeps holding, i.e., a path of local maxima for all γ.
[Figure: the path of local maxima in the (Q, γ) plane between γ = 0 and γ = 1]
17
Continuation Step
1. Start at a point (Q, γ) where the fixed-point condition holds.
2. Compute the gradient of this condition with respect to (Q, γ).
3. Take the direction tangent to the path (orthogonal to the gradient).
4. Take a step in the desired direction.
[Figure: a step along the path of local maxima in the (Q, γ) plane, starting from the solution at γ = 0]
18
Staying on the Ridge
Potential problem: the direction is only tangent to the path, so a step can miss the optimum.
Solution: use EM steps (at the new γ) to regain the path.
[Figure: a tangent step drifting off the path in the (Q, γ) plane and being pulled back by EM steps]
19
The IB-EM Algorithm
Set γ = 0 (start at the easy solution).
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-step: maximize L_IB-EM by optimizing Q.
    M-step: maximize L_IB-EM by optimizing P.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Take the step by changing γ and Q.
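A simplified predictor-corrector sketch of this outer loop (illustrative only, reusing ib_em_fixed_gamma from the sketch above): it replaces the gradient-based direction with a plain step in γ and relies on the E/M iterations to pull (Q, P) back onto the ridge, as described on the previous slide.

```python
def ib_em_continuation(q, params, e_step, m_step, objective, step=0.05):
    """Anneal gamma from 0 (easy solution) to 1 (EM solution), tracking the ridge.

    Simplification: instead of computing the tangent direction explicitly,
    take a small step in gamma and let the E/M iterations (the corrector)
    regain the path of local maxima.
    """
    gamma = 0.0
    q, params = ib_em_fixed_gamma(q, params, gamma, e_step, m_step, objective)
    while gamma < 1.0:
        gamma = min(1.0, gamma + step)                       # follow the ridge: step in gamma
        q, params = ib_em_fixed_gamma(q, params, gamma,      # stay on the ridge: E/M steps
                                      e_step, m_step, objective)
    return q, params
```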
20
Calibrating the Step Size
Potential problem:
Step size too small → too slow.
Step size too large → overshoot the target → inferior solution.
[Figure: an overly large step in the (Q, γ) plane overshooting the path]
21
Calibrating the Step Size: use the change in I(T;Y).
Recall that I(T;Y) measures how much of the instance ID is preserved; when I(T;Y) rises, more of the data is captured.
Non-parametric: involves only Q.
Can be bounded: I(T;Y) ≤ log2 |T|.
[Figure: I(T;Y) as a function of γ; a naive uniform grid of γ values is too sparse in the "interesting" area where I(T;Y) changes rapidly]
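One simple way to realize this calibration (a sketch of the idea only, not the exact procedure from the talk): shrink the step in γ whenever the resulting change in I(T;Y) exceeds a threshold, and grow it when the change is negligible. Because I(T;Y) involves only Q and is bounded by log2 |T|, fixed thresholds are meaningful across models; the threshold values below are arbitrary.

```python
import numpy as np

def information_bits(q_t_given_y):
    """I_Q(T;Y) in bits, for uniform Q(Y); bounded above by log2 |T|."""
    M = q_t_given_y.shape[0]
    q_t = q_t_given_y.mean(axis=0)
    ratio = np.log2(q_t_given_y + 1e-12) - np.log2(q_t + 1e-12)
    return np.sum(q_t_given_y * ratio) / M

def calibrate_step(step, q_before, q_after, max_delta=0.05, min_delta=0.005):
    """Adapt the gamma step size from the observed change in I(T;Y)."""
    delta = abs(information_bits(q_after) - information_bits(q_before))
    if delta > max_delta:        # I(T;Y) jumped: we are in the "interesting" area, refine
        return step * 0.5
    if delta < min_delta:        # nothing changed: safe to move faster
        return step * 2.0
    return step
```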
22
The IB-EM Algorithm
Set γ = 0.
Iterate until γ = 1 (the EM solution is reached):
  Iterate (stay on the ridge):
    E-step: maximize L_IB-EM by optimizing Q.
    M-step: maximize L_IB-EM by optimizing P.
  Step (follow the ridge):
    Compute the gradient and the step direction.
    Calibrate the step size using I(T;Y).
    Take the step by changing γ and Q.
23
The Stock Dataset [Boyen et al., 1999]
Naive Bayes model; daily changes of 20 NASDAQ stocks; 1213 training and 303 test instances.
IB-EM outperforms the best of the EM solutions.
I(T;Y) follows the changes of the likelihood, and the continuation roughly follows the region of change.
[Figure: I(T;Y) and train likelihood as functions of γ for IB-EM vs. the best of EM; marks show the evaluated γ values]
24
Multiple Hidden Variables
We want to learn a model with many hidden variables (T_1, ..., T_K).
Naive approach: an exact Q over all hiddens is potentially exponential in the number of hiddens.
Variational approximation: use a factorized form for Q (Mean Field) [Friedman et al., 2002], giving
L_IB-EM = γ·(Variational EM functional) - (1-γ)·Regularization
[Figure: the model P and the factorized approximation Q(T|Y)]
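With a factorized (Mean Field) completion Q(T_1,...,T_K | Y) = ∏_k Q(T_k|Y), the quantities above decompose per hidden variable. The sketch below is my illustration of that factorized representation; taking the regularization term to be the sum of per-variable informations Σ_k I(T_k;Y) is an assumption on my part, not a formula from the talk.

```python
import numpy as np

def mean_field_regularization(q_list):
    """Sum of I(T_k; Y) over hidden variables, for a factorized Mean Field
    completion Q(T_1..T_K | Y) = prod_k Q(T_k | Y), with uniform Q(Y).

    q_list: list of arrays, one per hidden variable, each of shape (M, K_k).
    NOTE: using the sum of per-variable terms as the regularizer is an assumption here.
    """
    total = 0.0
    for q in q_list:
        M = q.shape[0]
        q_t = q.mean(axis=0)                                          # marginal Q(T_k)
        total += np.sum(q * (np.log(q + 1e-12) - np.log(q_t + 1e-12))) / M
    return total
```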
25
The USPS Digits Dataset
400 samples, 21 hidden variables.
A single IB-EM run (27 min) is superior to all Mean Field EM runs (1 min/run) and takes about the time of a single exact EM run (25 min/run).
Only 3/50 exact EM runs do as well, and EM needs x17 the time for similar results: IB-EM offers good value for your time!
[Figure: test log-loss per instance vs. percentage of random runs, comparing Mean Field EM, exact EM, and the single IB-EM run]
26
Yeast Stress Response
173 experiments (variables), 6152 genes (samples), 25 hidden variables, each connected to 5-24 experiments.
IB-EM (~6 hours) is superior to all Mean Field EM runs (~0.5 hours each) and an order of magnitude faster than exact EM (>60 hours).
Effective when the exact solution becomes intractable!
[Figure: test log-loss per instance vs. percentage of random Mean Field EM runs]
27
Summary
New framework for learning hidden variables.
Formal relation between the Information Bottleneck and EM.
Continuation for bypassing local maxima.
Flexible: arbitrary structure / variational approximations.

Future Work
Learn the optimal γ ≤ 1 for better generalization.
Explore other approximations of Q(T|Y).
Model selection: learning cardinality and enriching the structure.
28
Relation to Weight Annealing [Elidan et al., 2002]
Weight Annealing: initialize the temperature to hot; iterate until cold: perturb the instance weights w according to the temperature, optimize using the reweighted empirical distribution Q_W, and cool down.
Similarities: both change the empirical Q and morph towards the EM solution.
Differences: IB-EM uses information regularization and continuation; WA requires a cooling policy, but is applicable to a wider range of problems.
29
Relation to Deterministic Annealing
Deterministic Annealing: initialize the temperature to hot; iterate until cold: "insert" entropy proportional to the temperature into the model, optimize the noisy model, and cool down.
Similarities: both use an information measure and morph towards the EM solution.
Differences: DA is parameterization dependent and requires a cooling policy, while IB-EM uses continuation; DA is applicable to a wider range of problems.