ECML-2001: Estimating the predictive accuracy of a classifier. Hilan Bensusan, Alexandros Kalousis.


Why do we need to estimate classifier performance?  To perform model selection without a previously established pool of classifiers.  To make meta-learning more automatic and less dependent on human experts.  To gain insight into the areas of expertise of different classifiers.

Meta-learning  Meta-learning is the endeavour to learn something about the expected performance of a classifier from previous applications.  It depends heavily on the way datasets are characterised.  It has concentrated on predicting the suitability of a classifier and on classifier selection from a pool.

Regression to predict performance  In this paper we examine an approach that estimates performance directly through regression.  The work is somewhat related to zooming for ranking, but there no knowledge about the classifiers is gained.

Previous work includes  João Gama and Pavel Brazdil, in work related to StatLog (only one dataset characterisation, with poor results reported in NMSE).  So Young Sohn (StatLog datasets with boosting, with better results).  A recent paper by Christian Koepf (good results, but with few classifiers and artificial datasets only).

Our approach  Broaden the research by comparing different dataset characterisation strategies and different regression methods.  A metadataset for each classifier is composed by a set of dataset characterisation attributes and the performance of the classifier in each dataset.
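The metadataset described above can be sketched concretely. The following is a minimal illustration, not the paper's implementation: the characterisation attributes, their values and the accuracies are all hypothetical, and a simple Nadaraya-Watson kernel regression stands in for the kernel method the paper uses.

```python
import math

# Hypothetical metadataset for one classifier: each row characterises a
# dataset (e.g. number of attributes, number of classes, class entropy);
# the target is that classifier's cross-validated accuracy on the dataset.
meta_X = [[10, 3, 0.42], [25, 2, 0.15], [4, 5, 0.88], [18, 2, 0.33]]
meta_y = [0.81, 0.93, 0.64, 0.87]

def kernel_estimate(x, X, y, bandwidth=5.0):
    """Nadaraya-Watson kernel regression: a weighted average of the
    observed accuracies, weighted by similarity of characterisations."""
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi))
                        / (2 * bandwidth ** 2)) for xi in X]
    return sum(w * t for w, t in zip(weights, y)) / sum(weights)

# Estimate the classifier's accuracy on a previously unseen dataset.
est = kernel_estimate([12, 3, 0.40], meta_X, meta_y)
```

The estimate is a convex combination of the observed accuracies, so it always falls inside their range; nearby datasets (in characterisation space) dominate the average.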

We concentrate on 8 classifiers:  two decision tree classifiers (C5.0tree and Ltree),  Naive Bayes,  linear discriminant,  two rule methods (C5.0rules and ripper),  nearest neighbour,  and a combination method (C5.0boost).

Strategies of dataset characterization:  A set of information-theoretical and statistical features of the datasets developed after StatLog (dct).  A finer grained development of the StatLog characteristics, where histograms are used to describe the distributions of features computed for each attribute of a dataset (histo).  Landmarking (land).

Landmarking  A characterisation technique where the performance of simple, bare-bone learners on the dataset is used to characterise it.  In this paper we use seven landmarkers: Decision node, Worst node, Randomly chosen node, Naive Bayes, 1-Nearest Neighbour, Elite 1-Nearest Neighbour and Linear Discriminant.
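Two of the landmarkers above can be sketched in a few lines. This is a toy illustration on hypothetical data, not the paper's code: a single best-split "decision node" and a leave-one-out 1-nearest-neighbour landmarker, each returning an accuracy that becomes one characterisation attribute of the dataset.

```python
# Hypothetical toy dataset: (feature vector, class label) pairs.
data = [([1.0, 0.2], 0), ([1.1, 0.3], 0), ([3.0, 2.2], 1), ([2.9, 2.0], 1)]

def decision_node_landmark(data):
    """Accuracy of the best single-attribute threshold split."""
    best, n = 0.0, len(data)
    for attr in range(len(data[0][0])):
        for x, _ in data:
            t = x[attr]
            acc = sum((xi[attr] <= t) == (yi == 0) for xi, yi in data) / n
            best = max(best, acc, 1 - acc)
    return best

def one_nn_landmark(data):
    """Leave-one-out accuracy of 1-nearest-neighbour."""
    correct = 0
    for i, (x, y) in enumerate(data):
        rest = [d for j, d in enumerate(data) if j != i]
        _, ny = min(rest, key=lambda d: sum((a - b) ** 2
                                            for a, b in zip(x, d[0])))
        correct += (ny == y)
    return correct / len(data)

landmarks = [decision_node_landmark(data), one_nn_landmark(data)]
```

On this trivially separable toy dataset both landmarkers score 1.0; on real datasets their relative scores describe how "axis-splittable" or "locally smooth" the problem is.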

Regression on accuracies  The quality of an estimate depends on its closeness to the actual accuracy achieved by the classifier, measured by the Mean Absolute Deviation (MAD) using 10-fold cross-validation.  MAD is defined as the sum of the absolute differences between real and predicted values, divided by the number of test items.  dMAD, the MAD obtained by always predicting the mean, is used as a reference.
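The two quantities defined above are straightforward to compute. The accuracies below are hypothetical values for illustration only:

```python
# MAD between predicted and true accuracies, and dMAD, the baseline
# obtained by always predicting the mean of the true accuracies.
true_acc = [0.81, 0.93, 0.64, 0.87]   # hypothetical test-set accuracies
pred_acc = [0.78, 0.90, 0.70, 0.85]   # hypothetical regression estimates

mad = sum(abs(t - p) for t, p in zip(true_acc, pred_acc)) / len(true_acc)

mean = sum(true_acc) / len(true_acc)
dmad = sum(abs(t - mean) for t in true_acc) / len(true_acc)

# A regression estimator is only useful when its MAD is well below dMAD.
```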

Regression methods and datasets  We used a kernel method and Cubist for regression.  65 datasets from the UCI repository and the METAL project were used.  Classifier performance is the mean accuracy over the 10 cross-validation folds.

Estimating with kernel  Table: MAD of the kernel estimates for each classifier (C5.0boost, C5.0rules, C5.0tree, lindisc, ltree, Near.Nei, NaiBayes, ripper) under each characterisation (dct, histo, land), compared against the dMAD reference.

Estimating with Cubist  Table: MAD of the Cubist estimates for the same eight classifiers under dct, histo and land, compared against the dMAD reference.

Using estimates to rank  Rankings are compared for similarity using Spearman's rank correlation.  Zooming cannot be applied to land, since we should not use a classifier to rank itself (we use land-).  We compare the ranking estimates with the true ranking.  The default ranking is computed over all datasets (from best downwards: C5.0boost, C5.0rules, C5.0tree, Ltree, ripper, nearest neighbour, Naive Bayes, linear discriminant).
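For rankings without ties, Spearman's rank correlation reduces to a closed form in the squared rank differences. A minimal sketch, with hypothetical ranks for the eight classifiers:

```python
# Spearman's rank correlation for two rankings without ties:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the rank
# difference of item i.
def spearman(rank_a, rank_b):
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

true_rank = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical true ranking
pred_rank = [2, 1, 3, 4, 6, 5, 7, 8]   # hypothetical estimated ranking
rho = spearman(true_rank, pred_rank)
```

Identical rankings give rho = 1, a fully reversed ranking gives rho = -1; swapping two adjacent pairs, as above, leaves rho close to 1.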

Average Spearman's correlation coefficients with the true ranking  Table: rows are the rankings (Default, dct, histo, land, land-); columns are the methods (Kernel, Cubist, Zooming).

Gaining insight about classifiers  Example, a land rule (34 cases, mean error 0.218):

IF   Rand_Node <= 0.57
     Elite_Node > …
THEN mlcnb = … + … Rand_Node + … Worst_Node + … Elite_Node

Conclusions  Regression can be used to estimate performances.  Meta-learning needs good dataset characterisation.  Landmarking is the best dataset characterisation strategy for performance estimation, but not the best one for ranking.  Future work includes further exploration of dataset characterisation strategies and of combining them (as well as explaining the still puzzling result of landmarking in ranking).