Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il.


Weight Annealing Data Perturbation for Escaping Local Maxima in Learning Gal Elidan, Matan Ninio, Nir Friedman Hebrew University {galel,ninio,nir}@cs.huji.ac.il Dale Schuurmans University of Waterloo dale@cs.uwaterloo.ca

The Learning Problem
Learning task: given DATA (with instance weights), search for a hypothesis that maximizes a score. Optimization is hard!
We typically resort to local optimization methods: gradient ascent, greedy hill-climbing, EM.
Example tasks: density estimation, classification, logistic regression.

Escaping Local Maxima
Local methods converge to (one of many) local optima: the search gets "stuck" on the score landscape.
Standard remedies: TABU search, random restarts, simulated annealing.
All of these work by perturbing steps during the local search itself.
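As an illustration of one of these baselines, random restarts can be sketched in a few lines. The `optimize` and `sample_start` callbacks are hypothetical stand-ins, not code from the talk:

```python
def random_restarts(optimize, sample_start, n_restarts=10):
    """Random-restart wrapper (a sketch of one baseline from this
    slide): run the local optimizer from several initial hypotheses
    and keep the best local optimum found."""
    best = None
    for _ in range(n_restarts):
        h, score = optimize(sample_start())  # optimizer returns (hypothesis, score)
        if best is None or score > best[1]:
            best = (h, score)
    return best
```

Each restart pays the full cost of a local search, which is why the talk looks for a cheaper way to escape.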

Weight Perturbation
Our idea: perturb the instance weights instead. Do until convergence:
- Perturb the instance weights
- Use the optimizer as a black box
To maximize the original goal, diminish the magnitude of the perturbation over time.
Benefits:
- Generality: applies to a wide variety of learning scenarios
- Modularity: the search procedure is unchanged
- Effectiveness: allows global changes

Weight Perturbation
[diagram: DATA with weights W → LOCAL SEARCH → Hypothesis → REWEIGHT, looping back to the data]

Weight Perturbation
Our idea: perturbation of instance weights.
- Puts stronger emphasis on a subset of the instances
- Allows the learning procedure to escape local maxima
[diagram: weights W over DATA, before and after a perturbation]

Iterative Procedure
[diagram: DATA → LOCAL SEARCH → Hypothesis; REWEIGHT feeds perturbed weights W back to the data]
Benefits:
- Generality: a wide variety of learning scenarios
- Modularity: search is unchanged
- Effectiveness: allows global changes

Iterative Procedure
Two methods for reweighting:
- Random: sampling random weights
- Adversarial: directed reweighting
To maximize the original goal, slowly diminish the magnitude of the perturbations.
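The iterative procedure can be sketched as a generic outer loop. Here `optimize`, `score`, and `reweight` are hypothetical callbacks standing in for the black-box learner, the original objective, and the chosen perturbation scheme; this is an illustration, not the paper's code:

```python
def weight_anneal(data, optimize, score, reweight, t0=1.0, cooling=0.8, n_iters=15):
    """Weight-annealing outer loop (a sketch): perturb the instance
    weights, call the unchanged local optimizer as a black box, and
    slowly cool the perturbation back to the original weights."""
    weights = [1.0] * len(data)            # original: uniform instance weights
    best_h, temp = None, t0
    for _ in range(n_iters):
        h = optimize(data, weights)        # the search itself is unchanged
        if best_h is None or score(data, h) > score(data, best_h):
            best_h = h                     # judge candidates on the original data
        weights = reweight(weights, temp)  # perturb the instance weights
        temp *= cooling                    # diminish the perturbation magnitude
    return best_h
```

Because the optimizer is only ever called through its normal interface, the same loop wraps structure search, EM, or motif learning without modification.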

Random Reweighting
Each weight is sampled with mean equal to the original weight and variance proportional to the temperature.
- When hot, the model can "go" almost anywhere and local maxima are bypassed.
- When cold, the search fine-tunes to find an optimum with respect to the original data.
[figure: P(W) vs. distance from the original weights W*, narrowing from hot (Wt) to cold (Wt+1, Wt+2)]
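A minimal version of this random reweighting scheme might look as follows. It is a sketch: the positivity clipping and the renormalization are my assumptions, and the paper's exact sampling scheme may differ:

```python
import random

def random_reweight(orig_weights, temp):
    """Random reweighting (a sketch): sample each weight from a
    Gaussian whose mean is the ORIGINAL weight and whose spread grows
    with the temperature, clip to stay positive, and renormalize so
    the total weight is preserved."""
    w = [max(1e-6, random.gauss(w0, temp)) for w0 in orig_weights]
    target = sum(orig_weights)
    total = sum(w)
    return [wi * target / total for wi in w]
```

At temperature 0 this returns the original weights unchanged, which is exactly the "cold" limit the slide describes.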

Adversarial Reweighting
Idea: challenge the model by increasing the weight of "bad" (low-scoring) instances, i.e., minimize the score with respect to W.
- A min-max game between the reweighting and the optimizer.
- Converge toward the original distribution by constraining the distance from the original weights W* (Kivinen & Warmuth).
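In the spirit of the exponential update cited here, one adversarial reweighting step might be sketched as below. The linear interpolation back toward the original weights is my simplification of the distance constraint, and `temp` is assumed to lie in [0, 1]:

```python
import math

def adversarial_reweight(orig_weights, scores, temp, eta=1.0):
    """Adversarial reweighting (a sketch): exponentially up-weight
    badly explained (low-scoring) instances, Kivinen & Warmuth style,
    then pull back toward the original weights as the temperature
    cools, so the procedure converges to the original objective."""
    # the adversary's move: low score -> large weight
    w = [w0 * math.exp(-eta * s) for w0, s in zip(orig_weights, scores)]
    target = sum(orig_weights)
    total = sum(w)
    w = [wi * target / total for wi in w]  # keep the total weight fixed
    # anneal: interpolate back toward the original weights as temp -> 0
    return [temp * wi + (1.0 - temp) * w0 for wi, w0 in zip(w, orig_weights)]
```

Unlike the random scheme, this update is deterministic given the current hypothesis, which is why a single adversary run can compete with many random runs in the experiments.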

Learning Bayesian Networks
A Bayesian network (BN) is a compact representation of a joint distribution.
Learning a BN is a density estimation problem.
Learning task: find the structure + parameters that maximize the score over the weighted data.
[figure: the Alarm network, a 37-variable medical-monitoring BN]
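To make the "score over weighted data" concrete, here is a sketch of a weighted log-likelihood for a discrete Bayesian network. This is a simplification: the talk actually optimizes the BDe score, which adds a Dirichlet prior, and the data format and function name here are illustrative:

```python
import math
from collections import defaultdict

def weighted_loglik(data, weights, parents):
    """Weighted log-likelihood of a discrete BN (a sketch): fit
    maximum-likelihood CPT parameters from WEIGHTED counts, then
    score the weighted data.  `data` is a list of {var: value} dicts;
    `parents` maps each variable to a tuple of its parents."""
    counts = defaultdict(float)   # weighted counts N(x, parents(x))
    pcounts = defaultdict(float)  # weighted counts N(parents(x))
    for inst, w in zip(data, weights):
        for var, pa in parents.items():
            pa_vals = tuple(inst[p] for p in pa)
            counts[(var, inst[var], pa_vals)] += w
            pcounts[(var, pa_vals)] += w
    ll = 0.0
    for inst, w in zip(data, weights):
        for var, pa in parents.items():
            pa_vals = tuple(inst[p] for p in pa)
            theta = counts[(var, inst[var], pa_vals)] / pcounts[(var, pa_vals)]
            ll += w * math.log(theta)
    return ll
```

Perturbing the weights changes the counts, and hence which structure scores best, which is how reweighting reshapes the search landscape.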

Structure Search Results
- Super-exponential combinatorial search space.
- Search uses local ops: add/remove/reverse edge.
- Optimize the Bayesian Dirichlet score (BDe).
Alarm network: 37 variables, 1000 samples. With similar running time:
- Random annealing is superior to random restarts.
- A single Adversary run competes with Random.
[figure: test-set log-loss per instance vs. iterations for the baseline, random annealing, and the adversary, compared against the true structure]
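The local ops can be sketched as a neighbor generator over edge sets. This is an illustration, not the paper's code; acyclicity checking is omitted for brevity, so a real search would prune cyclic candidates:

```python
def structure_neighbors(edges, variables):
    """All structures reachable from `edges` by one local op:
    add, remove, or reverse a single directed edge."""
    edges = set(edges)
    nbrs = []
    for u, v in edges:
        nbrs.append(edges - {(u, v)})               # remove edge
        nbrs.append((edges - {(u, v)}) | {(v, u)})  # reverse edge
    for u in variables:
        for v in variables:
            if u != v and (u, v) not in edges:
                nbrs.append(edges | {(u, v)})       # add edge
    return nbrs
```

Greedy hill-climbing scores every neighbor and moves to the best one, which is exactly the local procedure the weight perturbation wraps.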

Search with Missing Values
- Missing values introduce many local maxima.
- Structural EM (SEM) combines structure search and parameter estimation.
Alarm network: 37 variables, 1000 samples. With similar running time:
- Over 90% of Random runs are better than standard SEM.
- The Adversary run is best; the distance to the true generating model is halved!
[figure: percent of runs at least this good vs. test-set log-loss per instance, against the baseline and the generating model]

Real-life Datasets
Six real-life examples, with and without missing values. With similar running time:
- Adversary is efficient and preferable.
- Random takes longer for inferior results.
[figure: test-set log-loss per instance relative to the baseline, for Adversary and the 20-80% range of Random runs]

Dataset    Variables  Samples
Stock      20         1512
Soybean    36         446
Rosetta    30         300
Audio      70         200
Soy-M      36         546
Promoter   13         100

Learning Sequence Motifs
DNA promoter sequences, e.g.:
ATCTAGCTGAGAATGCACACTGATCGAGCCCCACCATATTCTTCGGACTGCGCTATATAGACTGCAACTAGTAGAGCTCTGCTAGAAACATTACTAAGCTCTATGACTGCCGATTGCGCCGTTTGGGCGTCTGAGCTCTTTGCTCTTGACTTCCGCTTATTGATATTATCTCTCTTGCTCGTGACTGCTTTATTGTGGGGGGGACTGCTGATTATGCTGCTCATAGGAGAGACTGCGAGAGTCGTCGTAGGACTGCGTCGTCGTGATGATGCTGCTGATCGATCGGACTGCCTAGCTAGTAGATCGATGTGACTGCAGAAGAGAGAGGGTTTTTTCGCGCCGCCCCGCGCGACTGCTCGAGAGGAAGTATATATGACTGCGCGCGCCGCGCGCCGGACTGCTTTATCCAGCTGATGCATGCATGCTAGTAGACTGCCTAGTCAGCTGCGATCGACTCGTAGCATGCATCGACTGCAGTCGATCGATGCTAGTTATTGGATGCGACTGAACTCGTAGCTGTAGTTATT
Represent the motif using a Position-Specific Scoring Matrix (PSSM): for each position, a probability for each letter (e.g., A: 0.97 at one position, C: 0.99 at another, T: 0.98 at a third). Segal et al., RECOMB 2002.
The score is highly non-linear, so optimization is hard!
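Applying a learned PSSM is straightforward; a sketch follows, where the letter probabilities and the test sequence in the usage below are made up for illustration:

```python
import math

def pssm_score(pssm, window):
    """Log-probability of a sequence window under a PSSM: the PSSM is
    a list of per-position dicts mapping each letter in {A, C, G, T}
    to its probability at that position."""
    return sum(math.log(pos[ch]) for pos, ch in zip(pssm, window))

def best_window(pssm, sequence):
    """Slide the PSSM along the sequence and return the best-scoring
    window and its score (how a learned motif model is applied)."""
    k = len(pssm)
    return max(
        ((sequence[i:i + k], pssm_score(pssm, sequence[i:i + k]))
         for i in range(len(sequence) - k + 1)),
        key=lambda pair: pair[1],
    )
```

Learning the PSSM means choosing these per-position probabilities to maximize the score over all sequences, which is the non-linear problem the slide refers to.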

Real-life Motifs Results
- Construct a PSSM: find the parameters that maximize the score.
- Experiments on 9 transcription factors (motifs): ACE2, FKH1, FKH2, MBP1, MCM1, NDD1, SWI4, SWI5, SWI6.
- PSSM: 4 letters x 20 positions, 550 samples.
With similar running time:
- Both methods are better than standard ascent.
- Adversary is efficient and best in 6 of 9 cases.
[figure: test-set log-loss relative to the baseline for Adversary and the 20-80% range of Random runs, per motif]

Simulated Annealing
Simulated annealing: allow "bad" moves with some probability P(move) ∝ f(temp, Δscore).
- A wasteful propose, evaluate, reject cycle.
- Needs a long time to escape local maxima.
- Worse than the baseline on Bayesian networks!
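For contrast, the acceptance rule being criticized here is the standard Metropolis-style one (a textbook form, with `delta` the score change of the proposed move):

```python
import math
import random

def accept_move(delta, temp):
    """Simulated-annealing acceptance: always take improving moves,
    and take a worsening move with probability exp(delta / temp),
    which shrinks as the temperature cools.  Rejected proposals are
    wasted work, which is the slide's point."""
    return delta >= 0 or random.random() < math.exp(delta / temp)
```

Weight annealing avoids this propose/reject cycle entirely: every optimizer run does useful work on some reweighted version of the data.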

Summary and Future Work
- A general method applicable to a variety of learning scenarios: decision trees, clustering, phylogenetic trees, TSP, ...
- Promising empirical results that approach the "achievable" maximum.
- The BIG challenge: theoretical insights.

Adversary ≠ Boosting

            Adversary                           Boosting
Output:     Single hypothesis                   An ensemble
Weights:    Converge to original distribution   Diverge from original distribution
Learning:   ht+1 depends on ht                  ht+1 depends only on wt+1

The same comparison holds for Random vs. Bagging/Bootstrap.

Other Annealing Methods
- Simulated annealing: allow "bad" moves with some probability P(move) ∝ f(temp, Δscore). Not good on Bayesian networks!
- Deterministic annealing: change the scenery by changing the hypothesis family, from simple to complex. Not naturally applicable here!

Intuition for the Adversary
What happens before and after reweighting?
[figure: two score landscapes, hot vs. cold; a search that starts at a local maximum escapes when the hot landscape flattens it and finishes at a better optimum once cold, while a search already at the global maximum stays put]
Escaping a local maximum is easy; escaping the global maximum is hard!

Escaping Local Maxima
Local methods converge to (one of many) local optima; TABU search, random restarts, and simulated annealing all perturb the search itself.
Our idea: anneal with perturbation of the instance weights instead of perturbing the search.
- Standard annealing: allow "bad" moves with P(move) ∝ f(temp, Δscore).
- Weight annealing: a change in scenery, with magnitude ∝ temp.

Adversarial Update Equation
The update is an implicit equation; solve it with an exponential update in the right direction (Kivinen & Warmuth, 1997), where η is the learning rate.
- Good (well-explained) example: low weight.
- Bad (poorly explained) example: high weight.