Weight Annealing: Data Perturbation for Escaping Local Maxima in Learning
Gal Elidan, Matan Ninio, Nir Friedman (Hebrew University) {galel,ninio,nir}@cs.huji.ac.il
Dale Schuurmans (University of Waterloo) dale@cs.uwaterloo.ca
The Learning Problem
Learning task: search for a hypothesis that maximizes a score over the data. Optimization is hard!
We typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM.
Examples: density estimation; classification, e.g. logistic regression.
Escaping local maxima
Local methods converge to (one of many) local optima. Common remedies: TABU search, random restarts, simulated annealing.
[Figure: a score landscape over hypotheses h; the search gets stuck at a local maximum.]
These methods all work by perturbing the steps taken during the local search.
Weight Perturbation
Our idea: perturb the instance weights.
Repeat until convergence:
- Perturb the instance weights
- Run the optimizer as a black box on the reweighted data
To maximize the original goal, diminish the magnitude of the perturbation over time.
Benefits:
- Generality: applies in a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes to the hypothesis
Weight Perturbation
[Figure: the weight-annealing loop — DATA with weights W → LOCAL SEARCH → hypothesis h and its Score → REWEIGHT → new W.]
Weight Perturbation
Perturbing the instance weights puts stronger emphasis on a subset of the instances, which allows the learning procedure to escape local maxima.
[Figure: weights W over DATA before and after a perturbation.]
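Concretely, the reweighted objective can be written as a weighted sum of per-instance scores (a standard weighted-likelihood formulation; the notation here is illustrative rather than taken from the slides):

$$\mathrm{Score}(h : \mathcal{D}, w) \;=\; \sum_{i=1}^{N} w_i \,\mathrm{score}(x_i \mid h),$$

with uniform weights $w_i = 1$ recovering the original problem. Perturbing $w$ reshapes the score landscape over hypotheses while leaving the search procedure untouched.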
Iterative Procedure
[Figure: the loop iterated — each REWEIGHT produces new weights W that drive the next LOCAL SEARCH run.]
Iterative Procedure
Two methods for reweighting:
- Random: sample random weights
- Adversarial: directed reweighting
To maximize the original goal, slowly diminish the magnitude of the perturbations; a sketch of the overall loop follows below.
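A minimal sketch of the loop, assuming a black-box local optimizer `optimize(data, weights, init)` and a pluggable `reweight` step; the function names and the geometric cooling schedule are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def weight_annealing(data, optimize, reweight, n_iters=40,
                     temp0=1.0, cooling=0.9, seed=0):
    """Generic weight-annealing loop (sketch).

    optimize(data, weights, init) -> hypothesis   # unchanged local search
    reweight(data, hypothesis, weights, temp, rng) -> new weights
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    weights = np.ones(n)                 # original (uniform) instance weights
    hypothesis = optimize(data, weights, init=None)
    temp = temp0
    for _ in range(n_iters):
        # Perturb the instance weights; magnitude shrinks as temp cools.
        weights = reweight(data, hypothesis, weights, temp, rng)
        # Re-run the unchanged optimizer on the reweighted data,
        # starting from the current hypothesis.
        hypothesis = optimize(data, weights, init=hypothesis)
        temp *= cooling                  # anneal: diminish the perturbation
    # Final fine-tuning on the original, unperturbed weights.
    return optimize(data, np.ones(n), init=hypothesis)
```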
Random Reweighting
Sample weights from a distribution whose mean is the original weight and whose variance scales with the temperature.
[Figure: P(W) at hot vs. cold temperatures; successive samples W_t, W_{t+1}, W_{t+2} fall at decreasing distance from the original weights W*.]
When hot, the model can "go" almost anywhere and local maxima are bypassed. When cold, the search fine-tunes to find an optimum with respect to the original data. A sampler sketch follows below.
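One concrete sampler consistent with this slide (an assumption — the exact sampling distribution used in the experiments may differ): re-center each weight on its original value and add temperature-scaled Gaussian noise:

```python
import numpy as np

def random_reweight(data, hypothesis, weights, temp, rng):
    """Random reweighting: mean is the original weight, spread grows with temp."""
    n = len(weights)
    w_star = np.ones(n)                   # original (uniform) weights W*
    # Hot: large variance, weights can land almost anywhere.
    # Cold: samples concentrate around W*, recovering the original problem.
    w = w_star + temp * rng.standard_normal(n)
    w = np.clip(w, 1e-6, None)            # keep weights positive
    return w * (n / w.sum())              # renormalize so the weights sum to n
```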
Adversarial Reweighting
Idea: challenge the model by increasing the weight of "bad" (low-scoring) instances, i.e., choose W to minimize the score. This is a min-max game between the re-weighting and the optimizer.
Converge towards the original distribution by constraining the distance from W*, using an exponential update (Kivinen & Warmuth, 1997).
[Figure: successive weights W_t, W_{t+1} drawn back towards W* as the temperature drops from hot to cold.]
Learning Bayesian Networks
A Bayesian network (BN) is a compact representation of a joint distribution. Learning a BN is a density estimation problem.
[Figure: the Alarm network, a 37-variable benchmark BN, together with the weighted DATA.]
Learning task: find the structure + parameters that maximize the score.
Structure Search Results
The combinatorial search space is super-exponential. The search uses local operators (add/remove/reverse edge) and optimizes the Bayesian Dirichlet score (BDe); a sketch of these moves follows below.
[Figure: test log-loss per instance vs. iteration, hot to cold, for Baseline, Random annealing, and Adversary, with the true structure as reference.]
With similar running time: Random is superior to random re-starts, and a single Adversary run competes with Random.
(Alarm network: 37 variables, 1,000 samples.)
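A sketch of the local structure moves, assuming the DAG is stored as a frozenset of directed edges and `score` is the learner-supplied (weighted) BDe score; this is an illustration of add/remove/reverse-edge hill-climbing, not the paper's implementation:

```python
from itertools import permutations

def is_acyclic(edges, nodes):
    """Kahn's algorithm: True iff the directed graph has no cycle."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def neighbors(edges, nodes):
    """All structures one local move away: add, remove, or reverse an edge."""
    for u, v in permutations(nodes, 2):
        if (u, v) not in edges and (v, u) not in edges:
            yield edges | {(u, v)}            # add edge u -> v
    for e in edges:
        u, v = e
        yield edges - {e}                     # remove edge
        yield (edges - {e}) | {(v, u)}        # reverse edge

def hill_climb(nodes, data, weights, score, init=frozenset()):
    """Greedy hill-climbing over structures with a weighted score (e.g. BDe)."""
    current, best = init, score(init, data, weights)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current, nodes):
            if not is_acyclic(cand, nodes):
                continue
            s = score(cand, data, weights)
            if s > best:
                current, best, improved = cand, s, True
                break                         # rescan from the improved structure
    return current
```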
Search with Missing Values
Missing values introduce many local maxima. Structural EM (SEM) combines structure search with parameter estimation.
[Figure: percent of runs at least this good vs. test log-loss per instance, with Baseline, Random, Adversary, and the generating model marked.]
With similar running time: over 90% of Random runs are better than standard SEM; the Adversary run is best; the distance to the true generating model is halved!
(Alarm network: 37 variables, 1,000 samples.)
Real-Life Datasets
6 real-life examples, with and without missing values:

Dataset    Variables  Samples
Stock      20         1512
Soybean    36         446
Rosetta    30         300
Audio      70         200
Soy-M      36         546
Promoter   13         100

[Figure: test log-loss per instance relative to Baseline; Adversary vs. the 20-80% range of Random runs.]
With similar running time: Adversary is efficient and preferable; Random takes longer for inferior results.
Learning Sequence Motifs
DNA promoter sequences share short recurring binding-site motifs.
[Figure: a DNA promoter sequence with multiple occurrences of a common motif highlighted.]
Represent the motif using a Position Specific Scoring Matrix (PSSM): one distribution over {A, C, G, T} per motif position, with near-deterministic columns such as A 0.97 or C 0.99 (Segal et al., RECOMB 2002).
The score is highly non-linear, so optimization is hard! A scoring sketch follows below.
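To make the objective concrete, a minimal sketch of scoring with a PSSM (log-probability of a window under the per-position distributions, maximized over candidate motif positions; the actual likelihood model in the paper is richer than this illustration):

```python
import numpy as np

ALPHABET = {"A": 0, "C": 1, "G": 2, "T": 3}

def pssm_log_score(pssm, window):
    """Log-probability of one sequence window under a PSSM.

    pssm: array of shape (motif_length, 4); each row is a distribution
          over {A, C, G, T} for one motif position.
    """
    return sum(np.log(pssm[pos, ALPHABET[c]]) for pos, c in enumerate(window))

def best_window_score(pssm, sequence):
    """Best-matching placement: maximize over all candidate motif positions.

    The max over hidden positions is what makes the overall score
    highly non-linear in the PSSM entries.
    """
    k = pssm.shape[0]
    return max(pssm_log_score(pssm, sequence[i:i + k])
               for i in range(len(sequence) - k + 1))
```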
Real-Life Motifs Results
Construct a PSSM: find the matrix entries that maximize the score. Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6.
[Figure: test log-loss relative to Baseline per motif; Adversary vs. the 20-80% range of Random runs.]
With similar running time: both methods are better than standard ascent; Adversary is efficient and best in 6/9 cases.
(PSSM: 4 letters × 20 positions, 550 samples.)
Simulated annealing
Simulated annealing allows "bad" moves with some probability, P(move) = f(temp, Δscore); see the acceptance-rule sketch below.
It suffers from a wasteful propose, evaluate, reject cycle and needs a long time to escape local maxima.
It was WORSE than the baseline on Bayesian networks!
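For contrast, the standard Metropolis-style acceptance rule behind "allow bad moves with some probability" (a generic sketch of f(temp, Δscore), not the exact comparison runs):

```python
import math
import random

def accept_move(delta_score, temp, rng=random):
    """Metropolis acceptance: always take improvements; take a 'bad' move
    with probability exp(delta_score / temp), which shrinks as temp cools."""
    if delta_score >= 0:
        return True
    return rng.random() < math.exp(delta_score / temp)
```

Every rejected proposal still costs a full score evaluation, which is exactly the wasteful propose, evaluate, reject cycle noted above.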
Summary and Future Work
- A general method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP…
- Promising empirical results that approach the "achievable" maximum
- The BIG challenge: THEORETICAL INSIGHTS
Adversary ≠ Boosting

           Adversary                            Boosting
Output:    Single hypothesis                    An ensemble
Weights:   Converge to original distribution    Diverge from original distribution
Learning:  h_{t+1} depends on h_t               h_{t+1} depends only on w_{t+1}

The same comparison holds for Random vs. Bagging/Bootstrap.
Other annealing methods
Simulated annealing: allow "bad" moves with some probability, P(move) = f(temp, Δscore). Not good on Bayesian networks!
Deterministic annealing: change the scenery by changing the family of hypotheses h, moving from simple to complex hypotheses as the temperature cools. Not naturally applicable here!
[Figure: score landscapes over h for a simple vs. a complex hypothesis family.]
Intuition for the Adversary
What happens before and after re-weighting?
[Figure: score curves before and after re-weighting, hot vs. cold; a run that starts at a local maximum escapes and finishes elsewhere, while a run that starts at the global maximum stays stuck there.]
Escaping a local maximum is easy! Escaping the global maximum is hard!
Escaping local maxima (recap)
Local methods converge to (one of many) local optima; TABU search, random restarts, and simulated annealing all perturb the search itself.
Our idea: anneal with perturbation of the instance weights instead of search perturbation.
Standard annealing: allow "bad" moves, P(move) = f(temp, Δscore). Weight annealing: a change in scenery.
[Figure: the score landscape over W deforms as the temperature drops; the search starts on the perturbed landscape.]
Adversarial Update Equation
The adversarial weights are defined by an IMPLICIT EQUATION; we approximate them with an exponential update in the right direction (Kivinen & Warmuth, 1997), of the form

$$w_i^{t+1} \;\propto\; w_i^{t}\,\exp\!\big(-\eta\,\mathrm{score}(x_i \mid h_t)\big),$$

where $\eta$ is the learning rate. A good example (high score) gets low weight; a bad example (low score) gets high weight.
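A minimal sketch of this update as a `reweight` step compatible with the loop sketched earlier; tying the learning rate η to the temperature and the `per_instance_score` helper are illustrative assumptions:

```python
import numpy as np

def make_adversarial_reweight(per_instance_score):
    """Build a reweight(...) step from a task-specific scorer.

    per_instance_score(data, hypothesis) -> one score per instance
    (a hypothetical helper supplied by the learning task).
    """
    def reweight(data, hypothesis, weights, temp, rng):
        scores = per_instance_score(data, hypothesis)
        eta = temp                            # learning rate ~ temperature (assumed)
        w = weights * np.exp(-eta * scores)   # bad (low-score) instances gain weight
        return w * (len(w) / w.sum())         # renormalize so the weights sum to n
    return reweight
```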