First Workshop on Metalearning in 1998 at ECML

First Workshop on Metalearning in 1998 at ECML
- 8 papers (2 by organizers!)
- About 20 attendees, EU-heavy
- Then ICML 1999, ECML 2000, ECML 2001, etc.
- MLJ Special Issue on Metalearning, 2004
- First Workshops on AutoML/Metalearning in 2014 at ECAI and 2015 at ECML
- Look at today: 29 accepted papers, ~100 attendees, wide range of origins

Informing the Use of Hyper-Parameter Optimization Through Metalearning
Parker Ridd, Samantha Sanders, Christophe Giraud-Carrier

Many Metalearning Approaches
- Assume that for all base classification learning algorithms, the parameter setting is fixed to some default
- Consequently: the problem of metalearning for classifier selection is greatly simplified
- Such approaches attract at best serious criticism, and at worst plain dismissal, from the community
- WHY?

Hyper-parameter Optimization Claim
Hyper-parameter optimization (HPO) has a significant impact (for the better) on the predictive accuracy of classification learning algorithms.

Consequently…
Two possible approaches:
- Two-stage metalearning
  - Solve the restricted version of the problem (i.e., with default parameter settings)
  - Optimize hyper-parameters for the selected algorithm
- Unrestricted metalearning
  - Select both the classifier and its hyper-parameters at once
Two-stage metalearning may be sub-optimal; unrestricted metalearning is expensive.
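To make the contrast concrete, here is a minimal sketch of the two workflows at the base level, using plain cross-validation in place of a metamodel. The candidate algorithms, grids, and the use of GridSearchCV and load_digits are illustrative assumptions, not the setup from the talk.

```python
# Minimal sketch of the two strategies (illustrative only, not the talk's setup).
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Candidate classifiers with small, made-up hyper-parameter grids.
candidates = {
    SVC(): {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    DecisionTreeClassifier(random_state=0): {"max_depth": [3, 10, None]},
}

# Two-stage: pick the best algorithm under default settings, then tune only the winner.
best_default = max(candidates, key=lambda est: cross_val_score(est, X, y, cv=3).mean())
two_stage = GridSearchCV(best_default, candidates[best_default], cv=3).fit(X, y)

# Unrestricted: search over algorithm and hyper-parameters jointly.
joint = max(
    (GridSearchCV(est, grid, cv=3).fit(X, y) for est, grid in candidates.items()),
    key=lambda search: search.best_score_,
)
print(two_stage.best_score_, joint.best_score_)
```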

Is This Reasonable?
- Almost everybody believes that the claim of hyper-parameter optimization holds
- Almost nobody has attempted to verify its validity, either theoretically or empirically
- Do you see the problem?
  - What if it does not hold in general?
  - What if it only holds in specific cases?
  - Shouldn't we find out what those cases are?
- We use metalearning to address these questions

There Is One Study…
Sun and Pfahringer considered 466 datasets and reported the percentage improvement of the best AUC score among 20 classification algorithms after hyper-parameter optimization (using PSO) over the best AUC score among the same algorithms with their default parameter settings.
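A sketch of that aggregate statistic, assuming a table with one row per (dataset, classifier) pair; the column names are hypothetical.

```python
# Hedged sketch of the per-dataset improvement statistic described above.
import pandas as pd

def percent_improvement(scores: pd.DataFrame) -> pd.Series:
    """scores has columns 'dataset', 'auc_default', 'auc_tuned' (names assumed),
    one row per (dataset, classifier) pair."""
    best = scores.groupby("dataset")[["auc_default", "auc_tuned"]].max()
    # Improvement of the best tuned AUC over the best default AUC, per dataset.
    return 100 * (best["auc_tuned"] - best["auc_default"]) / best["auc_default"]
```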

Impact of HPO by Increasing Value of Improvement
- Impact is highly variable across datasets
- For 19% of the datasets, HPO offers no improvement at all
- For 80% of the datasets, the improvement does not exceed 5%
- Linear relationship (slope = 12) from 0% to 5%

Why This Matters
- HPO does not improve performance uniformly across datasets
  - For many datasets, very little improvement, and thus computational overkill
  - For some datasets, a significant difference, and thus beneficial
- Important because HPO is expensive
  - Auto-WEKA took on average 13 hours per dataset in the work of Thornton et al.
  - Much of this is wasted if significant improvement is only observed in a small fraction of cases (possibly none, depending on the datasets)

Metalearning Impact of HPO – Take 1
Data preparation:
- Remove small datasets (≤ 100 instances)
- Extract meta-features for each dataset (Reif's tool)
- Meta-feature selection (manual and CFS)
- Label datasets (0 for no advantage, 1 for likely performance advantage)
Select metalearner – comprehensibility and preliminary results → J48
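A rough sketch of this meta-level pipeline, assuming a meta-dataset with one row per base dataset (numeric meta-features plus the measured improvement). The column names, the number of selected features, the use of scikit-learn's DecisionTreeClassifier as a stand-in for J48, and a univariate filter as a stand-in for CFS are all assumptions.

```python
# Sketch of the Take 1 setup (stand-ins: DecisionTreeClassifier for J48,
# a univariate filter for CFS); meta-dataset column names are assumed.
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

def evaluate_metamodel(meta: pd.DataFrame, threshold: float = 1.5):
    meta = meta[meta["n_instances"] > 100]                 # drop small datasets
    y = (meta["improvement"] >= threshold).astype(int)     # 1 = likely advantage
    X = meta.drop(columns=["improvement", "n_instances"])  # numeric meta-features
    model = make_pipeline(SelectKBest(mutual_info_classif, k=3),
                          DecisionTreeClassifier(random_state=0))
    return cross_validate(model, X, y, cv=10,
                          scoring=["accuracy", "precision", "recall"])
```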

Threshold = 1.5
- Default: A=62%, P=62%, R=100% (accuracy, precision, recall)
- Feature selection + J48 (selected meta-features: joint entropy, NB, NN1): A=79%, P=81%, R=86%
- NN1 is the root of the induced tree
- Good recall on test data
- Metamodel claims 83% of the total performance improvement available at the base level

Threshold = 2.5
- Default: A=50%, P=50%, R=100%
- Feature selection + J48 (selected meta-features: kurtosis prep, normalized attribute entropy, joint entropy, NB, NN1): A=75%, P=71%, R=84%
- NN1 is the root of the induced tree
- Good recall on test data
- Metamodel claims 74% of the total performance improvement available at the base level

An Unexpected Finding
- In virtually all metamodels, NN1 is the root of the induced tree
- Linear regression of the actual percentage of improvement on the meta-features also assigns a large weight to NN1
- Confirms Reif et al.'s results that landmarkers are generally better than other meta-features, especially NN1 and NB
- NN1 is very efficient, hence the determination could be very fast
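For reference, NN1 and NB are landmarkers: the cross-validated score of a cheap learner on the dataset itself, used as a meta-feature. A minimal sketch (the fold count is an assumption):

```python
# Minimal landmarker computation (a sketch, not the exact tool used).
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def landmarkers(X, y, cv=10):
    return {
        "NN1": cross_val_score(KNeighborsClassifier(n_neighbors=1), X, y, cv=cv).mean(),
        "NB": cross_val_score(GaussianNB(), X, y, cv=cv).mean(),
    }
```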

Study 1 Suggests…
A two-stage approach may indeed be adequate, but the correct one consists of:
- Determining whether HPO is likely to yield improvement
- If so, solving the unrestricted version of the problem
Limitations:
- Aggregate analysis: best default vs. best HPO
- All datasets binarized
- No test of significance
- Cost of HPO ignored

Metalearning Impact of HPO – Take 2
Repeat Study 1 on a per-algorithm basis:
- When HPO improves algorithm A
- How long HPO takes to exceed default performance
- How much HPO improves A within budget T

Experiment Components
- Optimization method: genetic algorithm (24 hours to optimize)
- Performance measure: multi-class AUC (MAUC)
- Datasets: 229 (raw) datasets from OpenML
- Algorithms: MLP, SVM, Decision Tree (scikit-learn)
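As a point of reference for the performance measure, one way to compute a multi-class AUC with scikit-learn is the pairwise ("ovo") macro-averaged ROC AUC; whether this exactly matches the MAUC variant used in the study is an assumption, and the iris data and MLP settings below are purely illustrative. This score is what the genetic algorithm maximizes over its 24-hour budget.

```python
# Illustrative multi-class AUC scoring (not necessarily the exact MAUC used).
from sklearn.datasets import load_iris
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = MLPClassifier(max_iter=2000, random_state=0).fit(X_tr, y_tr)
mauc = roc_auc_score(y_te, clf.predict_proba(X_te),
                     multi_class="ovo", average="macro")
print(mauc)
```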

Hyper-parameters
- Decision Tree (8): split criterion, max depth, max leaves, etc.
- MLP (18): hidden layers, learning rate, momentum, etc.
- SVM (5): C, kernel, gamma, etc.
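The slide only names a few parameters per algorithm, so the search spaces below are assumptions that match those examples, written for scikit-learn. As a stand-in for the genetic algorithm, dictionaries like these could be fed to RandomizedSearchCV.

```python
# Assumed (illustrative) search spaces for the three scikit-learn algorithms.
from scipy.stats import loguniform

search_spaces = {
    "decision_tree": {                       # 8 hyper-parameters in the study
        "criterion": ["gini", "entropy"],
        "max_depth": [None, 3, 5, 10, 20],
        "max_leaf_nodes": [None, 10, 100, 1000],
    },
    "mlp": {                                 # 18 hyper-parameters in the study
        "hidden_layer_sizes": [(50,), (100,), (100, 100)],
        "learning_rate_init": loguniform(1e-4, 1e-1),
        "momentum": [0.0, 0.5, 0.9],
    },
    "svm": {                                 # 5 hyper-parameters in the study
        "C": loguniform(1e-2, 1e3),
        "kernel": ["rbf", "poly", "sigmoid"],
        "gamma": loguniform(1e-4, 1e1),
    },
}
```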

Data Collection
For default and HPO experiments:
- 30 starts × 3 algorithms × 229 datasets = 20,610 data points
- Stop: perfect MAUC or 24 hrs
- Label: 95% confidence intervals for default and HPO; no overlap -> optimize, otherwise no
- Extract meta-features, metalearn

Example optimization trace:
Time (sec.)   Generation   Parameters          Fitness (MAUC)
13.37         1            split=entropy ...   0.8684
34.45         2            split=gini ...      0.8743
63.79         3                                0.8748
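A sketch of the labeling rule, assuming normal-approximation 95% confidence intervals over the 30 starts (the exact interval construction is not given on the slide):

```python
# Label a dataset by comparing 95% CIs of default vs. HPO MAUC over 30 starts.
import numpy as np

def ci95(scores):
    scores = np.asarray(scores, dtype=float)
    half = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean() - half, scores.mean() + half

def label_dataset(default_maucs, hpo_maucs):
    d_lo, d_hi = ci95(default_maucs)
    h_lo, h_hi = ci95(hpo_maucs)
    overlap = (d_lo <= h_hi) and (h_lo <= d_hi)
    return "no" if overlap else "optimize"
```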

When HPO Helps
TL;DR: almost always

HPO Improvement over Default

MAUC Confidence Interval Gap
(chart; for 7, 2, and 2 datasets the default already yields a perfect MAUC)

Since Improvement Is Most Likely: How Much Does HPO Help?

Experiment Components
- Metalearner: MLP
- Performance metrics: correlation coefficient, MAE, RMSE
- Dataset: meta-feature dataset (PCA)
- To predict: improvement in MAUC for DT, SVM, MLP
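A hedged sketch of this meta-regression step; the meta-feature matrix and improvement targets are assumed inputs, and the PCA variance threshold and MLP settings are illustrative.

```python
# Predict MAUC improvement from PCA-compressed meta-features (illustrative).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def meta_regression(meta_X, improvement):
    improvement = np.asarray(improvement, dtype=float)
    model = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                          MLPRegressor(max_iter=2000, random_state=0))
    pred = cross_val_predict(model, meta_X, improvement, cv=10)
    corr = np.corrcoef(improvement, pred)[0, 1]
    mae = np.mean(np.abs(improvement - pred))
    rmse = np.sqrt(np.mean((improvement - pred) ** 2))
    return corr, mae, rmse
```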

Results
Meta-feature Dataset   Correlation Coefficient   Mean Absolute Error   Root Mean Squared Error
Decision Tree          0.56                      0.06                  0.11
SVM                    0.54                      0.17                  0.26
MLP                    0.47                      0.15                  0.23

Predicting How Long It Will Take to Beat Default Settings

Experiment Components
- Metalearner: Decision Tree
- Performance metrics: accuracy, precision, recall, ROC area
- Dataset: meta-feature dataset
- To predict: will it take more or less than X minutes to exceed the performance of the default hyper-parameters?
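A sketch of the per-budget metamodel (variable names and the 10-fold evaluation are assumptions): each dataset is labeled by whether HPO beat the default-setting MAUC within the cutoff, and a decision tree is fit on the meta-features.

```python
# Classify datasets by whether HPO beats the default within a time budget.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

def budget_metamodel(meta_X, seconds_to_beat_default, cutoff_seconds=1800):
    y = (np.asarray(seconds_to_beat_default) <= cutoff_seconds).astype(int)
    return cross_validate(DecisionTreeClassifier(random_state=0), meta_X, y, cv=10,
                          scoring=["accuracy", "precision", "recall", "roc_auc"])
```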

MLP Dataset Results
Runtime Cutoff   Baseline Acc. (prediction)   Accuracy   Precision   Recall   ROC Area
30 Min           56%                          90%        0.91        0.87
1 Hour           57%                                     0.92
3 Hours          68%                          88%        0.93        0.90

Precision: instances labeled as needing less than T to reach the default hyper-parameter benchmark actually need less than T.
Recall: instances needing less than T to reach the default hyper-parameter benchmark are actually labeled as such.

SVM Dataset Results
Runtime Cutoff   Baseline Acc. (prediction)   Accuracy   Precision   Recall   ROC Area
30 Min           58%                          88%        0.86        0.84
1 Hour           50%                          81%        0.82        0.80     0.78
3 Hours          77%                          73%        0.81        0.85     0.58

Decision Tree Dataset Results
Runtime Cutoff   Baseline Acc. (prediction)   Accuracy   Precision   Recall   ROC Area
10 Sec           52%                          89%        0.90        0.87     0.91
60 Sec           68%                          91%        0.94        0.92     0.88
30 Min           97%                                     0.98                 0.68

Our (Not So) Unexpected Friend Again
- For all metamodels predicting improvement within budget, NN1_time is the root of the induced tree
- Confirms the earlier results
- Warrants further investigation into NN1 and other landmarkers

Study 2 Suggests…
- Statistically significant improvement of HPO over default can be achieved in almost all cases
  - Why the difference from Study 1? Perhaps more datasets had perfect performance with default settings (466 vs. 229 datasets), binarization, etc.
- It seems we can:
  - Predict, with some degree of accuracy, how much improvement can be expected
  - Predict whether default hyper-parameter performance can be improved upon within a given time budget

Parting Thoughts
Directly related:
- Repeat with more learning algorithms
- Metamodels vs. joint optimization
More broadly:
- Need continued work on metalearning (workshops, etc.) – thankless!
- MLJ Special Issue on Metalearning, Jan 2018
- DARPA D3M Project