© ELCA - 01 - 2004 CGC VT 1.0 There is no Free Lunch, but you’d be surprised how far a few well invested dollars can stretch… Provo, UT, 15 January 2004.

1 A Frog’s Leaps: 1989, 1994, 2001

2 Inductive Machine Learning
• Design and implement algorithms that exhibit inductive capabilities
• Induction “involves intellectual leaps from the particular to the general”
• From observations:
  - Construct a model that explains the (labelled) observations and generalises to new, previously unobserved situations, or
  - Extract rules/critical features that characterize the observations

3 Learning Tasks
• Prediction: learn a function that associates a data item with the value of a prediction/response variable (example: credit worthiness)
• Clustering / Segmentation: identify a set of (meaningful) categories or clusters to describe the data (example: customer DB)
• Dependency Modeling: find a model that describes significant dependencies, associations or affinity among variables (example: market baskets)
• Change / Anomaly Detection: discover the most significant changes in the data from previously measured or normative values (example: fraud)

4 Classification Learning
• Prediction where the response variable is discrete
  - Wide range of applications
  - Most used DM technique (?)
• Learning algorithms
  - Decision trees: ID3, C5.0, OC1, etc.
  - Neural networks: FF/BP, RBF, ASOCS, etc.
  - Rule induction: CN2, ILP, etc.
  - Instance-based learning: kNN, NGE, etc.
  - Etc.

5 The Challenge
• I am faced with a specific classification task (e.g., am I looking at a rock or at a mine?)
• I want a model with high accuracy
• Which classification algorithm should I use?
  - Any of them?
  - All of them?
  - One in particular?

6 Assumptions
• Attribute-value language
  - Objects are represented by vectors of attribute-value (A-V) pairs, where each attribute takes on only a finite number of values
  - There is a finite set of $m$ possible attribute vectors
• Binary classification into {0,1}
  - The relationship between attribute vectors and classes may be specified by an $m$-component class probability vector $C$, where $C_i$ is the probability that an object with attribute vector $A_i$ is of class 1
• Data generation
  - Attribute vectors are sampled with replacement according to an arbitrary distribution $D$, and a class is assigned to each object using $C$ (i.e., class 1 with probability $C_i$ and class 0 with probability $1 - C_i$ for an object with attribute vector $A_i$)
  - Data is generated the same way for both training and testing

7 Definitions
• A learning situation $S$ is a triple $(D, C, n)$, where $D$ and $C$ specify how data will be generated and $n$ is the size of the training sample
• Generalization accuracy is the expected prediction performance on objects with attribute vectors not presented in the training sample
• The generalization accuracy of a random guesser is 0.5 for every $D$ and $C$
• Generalization performance is equal to generalization accuracy minus 0.5
• $GP_L(S)$ denotes the generalization performance of a learner $L$ in learning situation $S$
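The data-generation assumptions and the definition of generalization accuracy can be made concrete with a minimal Python sketch. The function names (`sample_dataset`, `off_training_set_accuracy`) are hypothetical, and normalising by the probability mass of the unseen attribute vectors is one reading of "expected prediction performance on objects with attribute vectors not presented in the training sample", not something the slides spell out.

```python
import random


def sample_dataset(vectors, D, C, n, rng):
    """Draw n labelled objects for a learning situation (D, C, n).

    Attribute vectors are sampled with replacement according to D;
    the object with vector A_i gets class 1 with probability C[i].
    """
    indices = rng.choices(range(len(vectors)), weights=D, k=n)
    return [(vectors[i], 1 if rng.random() < C[i] else 0) for i in indices]


def off_training_set_accuracy(predict, vectors, D, C, train):
    """Expected accuracy of `predict` on attribute vectors absent from `train`.

    Assumption: the expectation is taken under D restricted (and
    renormalised) to the unseen vectors. Returns None if every vector was seen.
    """
    seen = {x for x, _ in train}
    unseen = [i for i, x in enumerate(vectors) if x not in seen]
    mass = sum(D[i] for i in unseen)
    if mass == 0:
        return None
    expected = sum(
        D[i] * (C[i] if predict(vectors[i]) == 1 else 1 - C[i]) for i in unseen
    )
    return expected / mass
```

Subtracting 0.5 from the returned accuracy gives the generalization performance as defined above.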

8 Schaffer’s Law of Conservation
For any learner $L$,
$\sum_{S} GP_L(S) = 0$
for every $D$ and $n$

9 Verification of the Law (I)
• The theorem holds for any arbitrary choice of $D$ and $n$, but these are fixed when the sum is computed. Hence, summing over $S$ is equivalent to summing over $C$
• If we exclude the possibility of noise, the components of $C$ are drawn from {0,1}
• There are $2^m$ class probability vectors and the theorem involves a sum, as written, of the $2^m$ corresponding terms

10 Verification of the Law (II)
• Restricting to binary attributes, we have $m = 2^a$ (where $a$ is the number of attributes)
• Generate all $2^m = 2^{2^a}$ noise-free binary classification tasks
• Compute each task's generalization performance using a leave-one-out procedure
• Sum the generalization performances
• Using ID3 as the learner, we performed the experiment for both $a = 3$ and $a = 4$
• The results were … as expected!
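The verification procedure can be reproduced in a few lines. The sketch below is not the original experiment: it swaps ID3 for two simple deterministic stand-in learners (a majority-class predictor and a Hamming-distance 1-nearest-neighbour), and all function names are hypothetical. It enumerates every noise-free labelling of the $2^a$ Boolean attribute vectors, scores each with leave-one-out, and sums the resulting generalization performances using exact fractions.

```python
from fractions import Fraction
from itertools import product


def majority_learner(train):
    """Stand-in learner: always predict the majority training class (ties -> 1)."""
    ones = sum(label for _, label in train)
    guess = 1 if 2 * ones >= len(train) else 0
    return lambda x: guess


def nearest_neighbour_learner(train):
    """Stand-in learner: 1-NN under Hamming distance (first match wins ties)."""
    def predict(x):
        _, label = min(train, key=lambda ex: sum(a != b for a, b in zip(ex[0], x)))
        return label
    return predict


def loo_generalization_performance(learner, vectors, labels):
    """Leave-one-out accuracy minus 0.5 for one noise-free task."""
    correct = 0
    for i, x in enumerate(vectors):
        train = [(v, c) for j, (v, c) in enumerate(zip(vectors, labels)) if j != i]
        correct += int(learner(train)(x) == labels[i])
    return Fraction(correct, len(vectors)) - Fraction(1, 2)


def conservation_sum(learner, a):
    """Sum GP over all 2^(2^a) noise-free tasks on `a` binary attributes."""
    vectors = list(product((0, 1), repeat=a))
    return sum(
        loo_generalization_performance(learner, vectors, labels)
        for labels in product((0, 1), repeat=len(vectors))
    )
```

The sum vanishes for any deterministic learner because, for each held-out vector and each labelling of the remaining vectors, the prediction is fixed while the held-out label is 0 in exactly one task and 1 in exactly one other.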

11 There is No Free Lunch
• Some learners are impossible. In particular, there is no universal learner
• Every demonstration that the generalization performance of an algorithm is better than chance on a set of learning tasks is an implied demonstration that it is worse than chance on an alternative set of tasks
• Every demonstration that the generalization performance of an algorithm is better than that of another on a set of learning tasks is an implied demonstration that it is worse on an alternative set of tasks
• Averaged over all learning tasks, the generalization performance of any two algorithms is exactly identical

12 Practical Implication
• Either:
  - One shows that all « real-world » learning tasks come from some subset of the universe on which some (or all) learning algorithm(s) perform well
• OR
  - One must have some way of determining which algorithm(s) will perform well on one’s learning task
• Since it is difficult to know a priori all « real-world » learning tasks, we focus on the second alternative
• Note: « overfitting » UCI assumes the first option

13 Going back to the Challenge
Which learner should I use?
• Any of them?
  - Too hazardous – you may select the wrong one
• All of them?
  - Too onerous – it will take too long
• One in particular?
  - Yes – but, how do I know which one is best?
Note: ML/KDD practitioners often narrow in on a subset of algorithms, based on experience

14 Finding a Mapping…
Classification tasks → ? → Classification algorithms

15 …through Meta-learning
• Basic idea: learn a selection or ranking function for learning tasks
• Prerequisite: a description language for tasks
• Several approaches have been proposed
  - The most popular one relies on extracting statistical and information-theoretic measures from a data set
  - We have developed an alternative approach to task description, called landmarking

16 Conjecture
THE PERFORMANCE OF A LEARNER ON A TASK UNCOVERS INFORMATION ABOUT THE NATURE OF THAT TASK

17 Landmarking the Expertise Space
• Each learner has an area of expertise, i.e., the class of tasks on which it performs particularly well, under some reasonable measure of performance
• A task can be described by the collection of areas of expertise to which it belongs
• A landmarker is a learning mechanism whose performance is used to describe tasks
• Landmarking is the use of landmarkers to locate tasks in the expertise space, i.e., the space of all areas of expertise

18 Illustration
• Labelled areas are areas of expertise of learners
• Assume the landmarkers are i1, i2 and i3
• Possible inference: problems on which both i1 and i3 perform well, but on which i2 performs poorly, are likely to be in i4's area of expertise

19 Meta-learning with Landmarking
• Landmarking concentrates on « cartographic » considerations: learners are used to signpost learners
• In principle, every learner's performance can signpost the location of a problem with respect to other learners' expertise
• The landmarkers' performance values are used as task descriptors or meta-attributes for meta-learning
• Exploring the meta-learning potential of landmarking amounts to investigating how well landmarkers' performances hint at the location of learning tasks in the expertise space
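To illustrate how landmarker performances can serve as meta-attributes, here is a minimal sketch of a meta-learner. Using 1-nearest-neighbour at the meta-level is an assumption made for brevity, `recommend_learner` is a hypothetical name, and any descriptor values passed to it in practice would come from running the landmarkers on real tasks.

```python
def recommend_learner(meta_examples, task_descriptor):
    """Return the learner label of the most similar previously seen task.

    meta_examples: list of (descriptor, best_learner) pairs, where a
        descriptor is a tuple of landmarker performance values.
    task_descriptor: the landmarker performance values of the new task.
    """
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    _, best_learner = min(
        meta_examples, key=lambda ex: squared_distance(ex[0], task_descriptor)
    )
    return best_learner
```

Each meta-example corresponds to one task whose best target learner is already known; a new task is located in the expertise space purely through its landmarker values.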

20 Selecting Landmarkers
Two main considerations:
• Computational complexity
  - Statistical tests are expensive (up to $O(n^3)$ for some)
    » Poor scalability
    » CPU time could have been allotted to a sophisticated learner of equal or better complexity
  - Limit ourselves to $O(n \log n)$ and in any case do not exceed the time needed to run the target learners
• Bias
  - To adequately chart the learning space, we need landmarkers to measure different properties, at least implicitly
  - (the set of target learners may guide the choice of bias)

21 Target Learners
The set of target learners considered consists of the following 10 popular learners:
• C5.0
  - Decision tree
  - Decision rules
  - Decision tree with boosting
• Naive Bayes (MLC++ implementation)
• Instance-Based Learning (MLC++ implementation)
• Clementine's Multi-Layer Perceptron
• Clementine's Radial Basis Function Network
• RIPPER
• Linear Discriminant
• LTREE

22 Landmarkers
Typical landmarkers include:
• Decision node
  - A single, most-informative decision node (based on information gain ratio)
  - Aims to establish closeness to linear separability
• Randomly chosen node
  - A single decision node chosen at random
  - Informs, together with the next one, about irrelevant attributes
• Worst node
  - A single, least informative decision node
  - Further informs on linear separability (if neither the best nor the worst attribute produces a single well-performing separation, it is likely that linear separation is not an adequate learning strategy)
• Elite 1-Nearest Neighbor
  - Standard 1-NN with nearest neighbor computed based on an elite subset of attributes (based on information gain ratio)
  - Attempts to establish whether the task is relational, that is, if it involves parity-like relationships between the attributes. In relational tasks, no single attribute is considerably more informative than all others.
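The three single-node landmarkers can be sketched as decision stumps ranked by information gain ratio. This is an illustrative sketch only: the function names are hypothetical, and accuracy is measured on the training data itself for brevity, whereas a real landmarker would presumably be evaluated on held-out data.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """Information gain ratio of splitting on attribute index `attr`."""
    n = len(rows)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    gain = entropy(labels) - remainder
    return gain / split_info if split_info > 0 else 0.0


def stump_accuracy(rows, labels, attr):
    """Resubstitution accuracy of a one-node tree splitting on `attr`."""
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], Counter())[y] += 1
    return sum(max(c.values()) for c in groups.values()) / len(rows)


def node_landmarkers(rows, labels, rng):
    """Performance of the best, a random, and the worst decision node."""
    attrs = list(range(len(rows[0])))
    ranked = sorted(attrs, key=lambda a: gain_ratio(rows, labels, a))
    return {
        "best_node": stump_accuracy(rows, labels, ranked[-1]),
        "random_node": stump_accuracy(rows, labels, rng.choice(attrs)),
        "worst_node": stump_accuracy(rows, labels, ranked[0]),
    }
```

A large gap between `best_node` and `worst_node` suggests that a few attributes carry most of the signal; similar values for all three are consistent with irrelevant attributes or a parity-like task.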

23 Statistical Meta-attributes
Typical (simple) statistical meta-attributes include:
• Class entropy
• Average entropy of the attributes
• Mutual information
• Joint entropy
• Equivalent number of attributes
• Signal-to-noise ratio
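Several of these measures can be computed directly from a dataset with discrete attributes. The sketch below is an illustration under assumptions: the function and key names are hypothetical, and "equivalent number of attributes" is taken to be class entropy divided by the mean attribute-class mutual information, a common definition that the slides do not state explicitly. The signal-to-noise ratio is left out.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def statistical_meta_attributes(rows, labels):
    """Information-theoretic descriptors of a discrete-attribute dataset."""
    columns = list(zip(*rows))
    class_entropy = entropy(labels)
    attr_entropies = [entropy(col) for col in columns]
    # Joint entropy H(X_j, Y) of each attribute with the class.
    joint_entropies = [entropy(list(zip(col, labels))) for col in columns]
    # Mutual information I(X_j; Y) = H(X_j) + H(Y) - H(X_j, Y).
    mutual_infos = [
        h_attr + class_entropy - h_joint
        for h_attr, h_joint in zip(attr_entropies, joint_entropies)
    ]
    k = len(columns)
    mean_mi = sum(mutual_infos) / k
    return {
        "class_entropy": class_entropy,
        "mean_attribute_entropy": sum(attr_entropies) / k,
        "mean_mutual_information": mean_mi,
        "mean_joint_entropy": sum(joint_entropies) / k,
        # Assumed definition: H(Y) / mean I(X_j; Y); undefined when mean MI is 0.
        "equivalent_number_of_attributes": class_entropy / mean_mi if mean_mi > 0 else None,
    }
```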

24 Training Set I
Artificial Datasets
• 320 randomly generated Boolean datasets with between 5 and 12 attributes
• Generalization performance, $GP_{L_i}$, of each of the target learners is computed using 10-fold stratified cross-validation
• Each dataset is labeled as:
  - Learner $L_k$, if $GP_{L_k} = \max_i \{GP_{L_i}\}$
  - Tie, if $\max_i \{GP_{L_i}\} - \min_i \{GP_{L_i}\} < 0.1$
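The labelling rule for the meta-level training set can be written as a small function. This is a sketch with a hypothetical name (`label_dataset`); checking the tie condition before picking the best learner is an assumption, since the slide does not say which rule takes precedence.

```python
def label_dataset(gp_by_learner, tie_margin=0.1):
    """Label a dataset with its best learner, or "Tie" if all learners are close.

    gp_by_learner: mapping from learner name to its cross-validated
        generalization performance on the dataset.
    tie_margin: spread below which the learners are considered tied (0.1 on the slide).
    """
    scores = gp_by_learner.values()
    if max(scores) - min(scores) < tie_margin:
        return "Tie"
    return max(gp_by_learner, key=gp_by_learner.get)
```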

25 Landmarking vs Standard DC
Meta-learner   Landmarking   Standard DC   Combined
Majority       0.460
C5.0Tree       0.242         0.342         0.314
C5.0Rules      0.239         0.333         0.301
MLP            0.301         0.317         0.320
RBFN           0.289         0.323         0.304
LinDiscr       0.335         0.311         0.301
LTree          0.270         0.317         0.286
IB1            0.329         0.366         0.342
NB             0.429         0.407         0.363
Ripper         0.292         0.314         0.295
Average        0.303         0.337         0.314

26 Training Set II
Artificial Datasets
• 222 randomly generated Boolean datasets with 20 attributes
• Generalization performance, $GP_{L_i}$, of each of the target learners is computed using 10-fold stratified cross-validation
• Three model classes: NN = {MLP, RBFN}, R = {RIPPER, C5.0 Rules}, DT = {C5.0, C5.0 Boosting, LTREE}
• Each dataset is labeled as:
  - Model class $M_k$: $L_i \in M_k$, $GP_{L_i} > 1.1 \cdot \mathrm{avg}\{GP_{L_j}\}$
  - NoDiff, otherwise
18 UCI Datasets

27 Predicting Model Classes
Meta-learner   NN      R       DT
Majority       0.440   0.370   0.470
C5.0Tree       0.358   0.233   0.371
C5.0Rules      0.367   0.229   0.371
MLP            0.413   0.392   0.454
RBFN           0.333   0.225   0.375
LinDiscr       0.371   0.379   0.467
LTree          0.396   0.221   0.346
IB1            0.388   0.258   0.354
NB             0.433   0.421
Ripper         0.363   0.221   0.363

28 Measuring Performance Level
• Experiments show that landmarking meta-learns
• They do not, however, reflect the overall performance of a system whose end result is the accuracy of the selected learning model
• Estimate it as follows:
  - Train on artificial datasets
  - Test on UCI datasets
  - Report average error difference between actual best choice and meta-learner selected choice
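The reported quantity, the error difference between the actual best choice and the meta-learner's choice, can be computed as in the following sketch. The function name and input layout are hypothetical; the maximum is returned alongside the average because a worst-case figure is a natural companion to the mean.

```python
def selection_loss(errors_by_dataset, selections):
    """Average and maximum loss incurred by following the meta-learner.

    errors_by_dataset: mapping dataset name -> {candidate: error rate}.
    selections: mapping dataset name -> candidate chosen by the meta-learner.
    The loss on a dataset is the chosen candidate's error minus the lowest
    error achieved by any candidate on that dataset.
    """
    losses = [
        errors[selections[name]] - min(errors.values())
        for name, errors in errors_by_dataset.items()
    ]
    return sum(losses) / len(losses), max(losses)
```

A loss of zero on a dataset means the meta-learner picked the best available candidate there.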

29 Performance Level
Model Class   Loss    Max Loss
NN            0.031   0.081
R             0.036   0.088
DT            0.021   0.096

30 METAL Home Page

31 Uploading Data

38 Conclusions
• There is no universal learner
• Work on meta-learning holds promise
• Landmarking shows learner preferences may be learned
• Open questions
  - Training data generation (incremental…)
  - Choice of meta-features (landmarking, structure, …)
  - Choice of meta-learner (higher-order, …)
• MLJ Special Issue on Meta-learning (2004), to appear

40 Process View
Process steps (with a running example):
• Business Problem Formulation: determine credit worthiness
• Domain & Data Understanding: learn about loans, repayments, etc.; collect data about past performance
• Data Pre-processing: aggregate individual incomes into household income
• Model Building: build a decision tree
• Interpretation & Evaluation: check against hold-out set
• Dissemination & Deployment
Data artefacts along the way: Raw Data, Selected Data, Pre-processed Data, Patterns / Models

41 DM Tools
A plethora of DM tools has emerged…
Three types:
• Research tools
  - Generally open source, no GUI, expert-driven
• Dedicated tools
  - Commercial-strength, restricted to one type of tasks/algorithms, limited process support
• Packaged tools
  - Commercial-strength, rich GUI, support for different types of tasks, rich collection of algorithms, DM process support

42 Tool Selection
The situation:
• Increasing interest
• Many tools with same basis but different content
• Research talk vs business talk
• Assist business users in selecting a DM package based on high-level business criteria
Journal of Intelligent Data Analysis, Vol. 7, No. 3 (2003)

44 Comprehensive Survey
• 59 of the most popular tools
  - Commercial: AnswerTree, CART / MARS, Clementine, Enterprise Miner, GainSmarts, GhostMiner, Insightful Miner, Intelligent Miner, KnowledgeSTUDIO, KXEN, MATLAB Neural Network Toolbox, NeuralWorks Predict, NeuroShell, Oracle Data Mining Suite, PolyAnalyst, See5 / Cubist / Magnum Opus, SPAD, SQL Server 2000, STATISTICA Data Miner, Teradata Warehouse Data Mining, etc.
  - Freeware: WEKA, Orange, YALE, SwissAnalyst
• Dynamic DB: updated regularly

46 Comments
• The above dimensions characterize Data Mining tools, and NOT Data Mining algorithms
• With a standard schema and corresponding database, users are able to select a DM software package with respect to its ability to meet high-level business objectives
• Automatic advice strategies such as METAL's Data Mining Advisor (see http://www.metal-kdd.org ) or IDEA (see http://www.ifi.unizh.ch/ddis/Research/idea.htm ) can then be used to assist users further in the selection of the most appropriate algorithms / models for their specific tasks.

47 How About Combining Learners?
• An obvious solution: combine learners, as in boosting, bagging, stacked generalisation, etc.
• However, no matter how elaborate the method, any algorithm that implements a fixed mapping from training sets to prediction models is subject to the limitations imposed by the law of conservation
