© ELCA CGC VT 1.0 There is no Free Lunch, but you’d be surprised how far a few well-invested dollars can stretch… CS Graduate Colloquium Series, Provo, UT, 15 January 2004
1 © ELCA CGC A Frog’s Leaps
2 © ELCA CGC Inductive Machine Learning Design and implement algorithms that exhibit inductive capabilities. Induction “involves intellectual leaps from the particular to the general”. From observations, either construct a model that explains the (labelled) observations and generalises to new, previously unobserved situations, or extract rules/critical features that characterize the observations.
3 © ELCA CGC Learning Tasks Prediction: learn a function that associates a data item with the value of a prediction/response variable (e.g., credit worthiness). Clustering / Segmentation: identify a set of (meaningful) categories or clusters to describe the data (e.g., a customer DB). Dependency Modeling: find a model that describes significant dependencies, associations or affinity among variables (e.g., market baskets). Change / Anomaly Detection: discover the most significant changes in the data from previously measured or normative values (e.g., fraud).
4 © ELCA CGC Classification Learning Prediction where the response variable is discrete. Wide range of applications; arguably the most used DM technique. Learning algorithms include decision trees (ID3, C5.0, OC1, etc.), neural networks (FF/BP, RBF, ASOCS, etc.), rule induction (CN2, ILP, etc.), instance-based learning (kNN, NGE, etc.), and more.
5 © ELCA CGC The Challenge I am faced with a specific classification task (e.g., am I looking at a rock or at a mine?). I want a model with high accuracy. Which classification algorithm should I use? Any of them? All of them? One in particular?
6 © ELCA CGC Assumptions Attribute-value language: objects are represented by vectors of attribute-value pairs, where each attribute takes on only a finite number of values; there is a finite set of m possible attribute vectors. Binary classification into {0,1}: the relationship between attribute vectors and classes may be specified by an m-component class probability vector C, where C_i is the probability that an object with attribute vector A_i is of class 1. Data generation: attribute vectors are sampled with replacement according to an arbitrary distribution D, and a class is assigned to each object using C (i.e., class 1 with probability C_i and class 0 with probability 1 − C_i for an object with attribute vector A_i); data is generated the same way for both training and testing.
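As a concrete illustration, here is a minimal Python sketch of this data-generation process (function and variable names are illustrative, not from the talk): an attribute-vector index is drawn with replacement according to D, and the class is then drawn according to the corresponding entry of C.

import itertools, random

def generate_sample(num_attrs, D, C, n, seed=0):
    """Draw n labelled objects: pick an attribute vector index according to D,
    then assign class 1 with probability C[i] and class 0 otherwise."""
    rng = random.Random(seed)
    vectors = list(itertools.product([0, 1], repeat=num_attrs))  # the m = 2^a possible vectors
    sample = []
    for _ in range(n):
        i = rng.choices(range(len(vectors)), weights=D, k=1)[0]  # index i sampled from D, with replacement
        label = 1 if rng.random() < C[i] else 0                  # class 1 with probability C_i
        sample.append((vectors[i], label))
    return sample

# Example: 2 binary attributes (m = 4), uniform D, noise-free C drawn from {0,1}
train = generate_sample(2, D=[0.25] * 4, C=[0, 1, 1, 0], n=10)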
7 © ELCA CGC Definitions A learning situation S is a triple (D, C, n), where D and C specify how data will be generated and n is the size of the training sample. Generalization accuracy is the expected prediction performance on objects whose attribute vectors were not presented in the training sample; the generalization accuracy of a random guesser is 0.5 for every D and C. Generalization performance is generalization accuracy measured relative to chance, i.e., generalization accuracy minus 0.5. GP_L(S) denotes the generalization performance of a learner L in learning situation S.
8 © ELCA CGC Schaffer’s Law of Conservation For any learner L, and for every D and n, generalization performance summed over all class probability vectors C is exactly zero (stated symbolically below).
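Stated in symbols (with GP_L measured relative to the 0.5 chance baseline, as defined on the previous slide), the conservation law reads:

\[
\forall D,\; \forall n: \qquad \sum_{C} GP_L(D, C, n) \;=\; 0
\]

where the sum runs over all class probability vectors C; in the noise-free case, C ranges over the 2^m vectors in {0,1}^m.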
9 © ELCA CGC Verification of the Law (I) The theorem holds for any arbitrary choice of D and n, but these are fixed when the sum is computed; hence, summing over S is equivalent to summing over C. If we exclude the possibility of noise, the components of C are drawn from {0,1}. There are then 2^m class probability vectors, and the theorem involves a sum, as written, of the 2^m corresponding terms.
10 © ELCA CGC Verification of the Law (II) Restricting to binary attributes, we have m = 2^a (where a is the number of attributes). Generate all 2^m = 2^(2^a) noise-free binary classification tasks, compute each task's generalization performance using a leave-one-out procedure, and sum the generalization performances. Using ID3 as the learner, we performed the experiment for both a = 3 and a = 4. The results were… as expected!
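The verification is easy to reproduce. Below is a small self-contained Python sketch that uses a 1-nearest-neighbour learner (on Hamming distance) as a stand-in for ID3, since the conservation result holds for any fixed learner; names are illustrative.

import itertools

def one_nn(train, x):
    """Predict the label of x from the nearest training vector in Hamming distance."""
    nearest = min(train, key=lambda tv: sum(a != b for a, b in zip(tv[0], x)))
    return nearest[1]

def conservation_sum(a):
    """Sum leave-one-out generalization performance (accuracy minus 0.5) over
    all 2^(2^a) noise-free binary classification tasks on a binary attributes."""
    vectors = list(itertools.product([0, 1], repeat=a))            # the m = 2^a attribute vectors
    total = 0.0
    for labels in itertools.product([0, 1], repeat=len(vectors)):  # each labelling is one task C
        task = list(zip(vectors, labels))
        correct = sum(
            one_nn(task[:i] + task[i + 1:], x) == y                # predict the held-out vector
            for i, (x, y) in enumerate(task)
        )
        total += correct / len(task) - 0.5                         # this task's GP, relative to chance
    return total

print(conservation_sum(3))  # prints 0.0: the generalization performances cancel out exactly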
11 © ELCA CGC There is No Free Lunch Some learners are impossible; in particular, there is no universal learner. Every demonstration that the generalization performance of an algorithm is better than chance on a set of learning tasks is an implied demonstration that it is worse than chance on an alternative set of tasks. Every demonstration that the generalization performance of an algorithm is better than that of another on a set of learning tasks is an implied demonstration that it is worse on an alternative set of tasks. Averaged over all learning tasks, the generalization performance of any two algorithms is exactly identical.
12 © ELCA CGC Practical Implication Either one shows that all “real-world” learning tasks come from some subset of the universe on which some (or all) learning algorithm(s) perform well, OR one must have some way of determining which algorithm(s) will perform well on one’s learning task. Since it is difficult to know a priori all “real-world” learning tasks, we focus on the second alternative. Note: “overfitting” the UCI repository assumes the first option.
13 © ELCA CGC Going back to the Challenge Which learner should I use? Any of them? Too hazardous – you may select the wrong one. All of them? Too onerous – it will take too long. One in particular? Yes – but how do I know which one is best? Note: ML/KDD practitioners often narrow in on a subset of algorithms, based on experience.
14 © ELCA CGC Finding a Mapping… (figure: classification tasks, classification algorithms, and an unknown mapping between them)
15 © ELCA CGC …through Meta-learning Basic idea: learn a selection or ranking function for learning tasks. Prerequisite: a description language for tasks. Several approaches have been proposed; the most popular one relies on extracting statistical and information-theoretic measures from a data set. We have developed an alternative approach to task description, called landmarking.
16 © ELCA CGC Conjecture: The performance of a learner on a task uncovers information about the nature of that task.
17 © ELCA CGC Landmarking the Expertise Space Each learner has an area of expertise, i.e., the class of tasks on which it performs particularly well under some reasonable measure of performance. A task can be described by the collection of areas of expertise to which it belongs. A landmarker is a learning mechanism whose performance is used to describe tasks. Landmarking is the use of landmarkers to locate tasks in the expertise space, i.e., the space of all areas of expertise.
18 © ELCA CGC Illustration Labelled areas are areas of expertise of learners. Assume the landmarkers are i1, i2 and i3. Possible inference: problems on which both i1 and i3 perform well, but on which i2 performs poorly, are likely to be in i4's area of expertise.
19 © ELCA CGC Meta-learning with Landmarking Landmarking concentrates on “cartographic” considerations: learners are used to signpost learners. In principle, every learner's performance can signpost the location of a problem with respect to other learners' expertise. The landmarkers' performance values are used as task descriptors or meta-attributes for meta-learning. Exploring the meta-learning potential of landmarking amounts to investigating how well landmarkers' performances hint at the location of learning tasks in the expertise space.
20 © ELCA CGC Selecting Landmarkers Two main considerations. Computational complexity: statistical tests are expensive (up to O(n^3) for some), which scales poorly, and the CPU time could instead have been allotted to a sophisticated learner of equal or better complexity; we limit ourselves to O(n log n) and, in any case, do not exceed the time needed to run the target learners. Bias: to adequately chart the learning space, we need landmarkers that measure different properties, at least implicitly (the set of target learners may guide the choice of bias).
21 © ELCA CGC Target Learners The set of target learners consists of the following 10 popular learners: C5.0 decision tree, C5.0 decision rules, C5.0 decision tree with boosting, Naive Bayes (MLC++ implementation), Instance-Based Learning (MLC++ implementation), Clementine's Multi-Layer Perceptron, Clementine's Radial Basis Function Network, RIPPER, Linear Discriminant, and LTREE.
22 © ELCA CGC Landmarkers Typical landmarkers include the following. Decision node: a single, most informative decision node (based on information gain ratio); aims to establish closeness to linear separability. Randomly chosen node: a single decision node chosen at random; informs, together with the next one, about irrelevant attributes. Worst node: a single, least informative decision node; further informs on linear separability (if neither the best nor the worst attribute produces a single well-performing separation, it is likely that linear separation is not an adequate learning strategy). Elite 1-Nearest Neighbor: standard 1-NN with the nearest neighbor computed on an elite subset of attributes (selected by information gain ratio); attempts to establish whether the task is relational, that is, whether it involves parity-like relationships between the attributes (in relational tasks, no single attribute is considerably more informative than all the others).
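A rough sketch of how such landmarkers could be computed as meta-attributes for a single dataset is shown below, using scikit-learn components as stand-ins (a depth-1 decision tree restricted to one attribute for the node landmarkers, and plain information gain rather than gain ratio to rank attributes); the actual implementations behind the reported experiments may differ. X is assumed to be a 2-D numpy array of discrete attribute values and y the class labels.

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def landmark_features(X, y, cv=10, seed=0):
    """Landmarker accuracies for one dataset, to be used as meta-attributes."""
    gain = mutual_info_classif(X, y, discrete_features=True, random_state=seed)
    best, worst = int(np.argmax(gain)), int(np.argmin(gain))
    rand = int(np.random.default_rng(seed).integers(X.shape[1]))
    elite = list(np.argsort(gain)[-max(1, X.shape[1] // 4):])   # top quarter of attributes by gain

    def acc(model, cols):
        return cross_val_score(model, X[:, cols], y, cv=cv).mean()

    return {
        "best_node":   acc(DecisionTreeClassifier(max_depth=1), [best]),   # most informative split
        "random_node": acc(DecisionTreeClassifier(max_depth=1), [rand]),   # randomly chosen attribute
        "worst_node":  acc(DecisionTreeClassifier(max_depth=1), [worst]),  # least informative split
        "elite_1nn":   acc(KNeighborsClassifier(n_neighbors=1), elite),    # 1-NN on high-gain attributes
    }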
23 © ELCA CGC Statistical Meta-attributes Typical (simple) statistical meta-attributes include: class entropy, average entropy of the attributes, mutual information, joint entropy, equivalent number of attributes, and signal-to-noise ratio.
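For discrete attributes these measures are straightforward to compute. A minimal sketch follows (the exact definitions used by data-characterization tools vary slightly, and the sketch ignores degenerate cases such as zero average mutual information):

import numpy as np
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def dc_features(X, y):
    """Simple statistical / information-theoretic meta-attributes for a dataset
    with discrete attributes X (2-D array) and class labels y."""
    h_class = entropy(y)
    h_attr = [entropy(X[:, j]) for j in range(X.shape[1])]
    h_joint = [entropy(list(zip(X[:, j], y))) for j in range(X.shape[1])]
    mi = [ha + h_class - hj for ha, hj in zip(h_attr, h_joint)]   # I(A_j ; class)
    avg_mi = float(np.mean(mi))
    return {
        "class_entropy": h_class,
        "avg_attribute_entropy": float(np.mean(h_attr)),
        "avg_mutual_information": avg_mi,
        "avg_joint_entropy": float(np.mean(h_joint)),
        "equiv_nr_attributes": h_class / avg_mi,                  # attributes needed to encode the class
        "signal_to_noise": avg_mi / (float(np.mean(h_attr)) - avg_mi),
    }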
24 © ELCA CGC Training Set I Artificial datasets: 320 randomly generated Boolean datasets with between 5 and 12 attributes. The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation. Each dataset is labeled with learner L_k if GP_Lk = max_i {GP_Li}, or as a Tie if max_i {GP_Li} − min_i {GP_Li} < 0.1 (see the sketch below).
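In code, the labelling rule might look as follows (gp is assumed to be a dict mapping each target learner's name to its cross-validated generalization performance on one dataset); each meta-example then pairs a dataset's meta-attributes, e.g. its landmarker accuracies, with this label.

def label_best_learner(gp, tie_margin=0.1):
    """Label a dataset with its best target learner, or 'Tie' when no learner
    clearly dominates (max GP minus min GP below the margin)."""
    if max(gp.values()) - min(gp.values()) < tie_margin:
        return "Tie"
    return max(gp, key=gp.get)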
25 © ELCA CGC Landmarking vs Standard DC (table comparing meta-learner accuracies with Landmarking, Standard DC, and Combined meta-attributes; rows: Majority (0.460), C5.0 Tree, C5.0 Rules, MLP, RBFN, LinDiscr, LTree, IB, NB, Ripper, Average)
26 © ELCA CGC Training Set II Artificial datasets: 222 randomly generated Boolean datasets with 20 attributes. The generalization performance GP_Li of each of the target learners is computed using 10-fold stratified cross-validation. Three model classes: NN = {MLP, RBFN}, R = {RIPPER, C5.0 Rules}, DT = {C5.0, C5.0 Boosting, LTREE}. Each dataset is labeled with model class M_k if there is a learner L_i in M_k such that GP_Li > 1.1 · avg_j {GP_Lj}, and as NoDiff otherwise (see the sketch below). Also: 18 UCI datasets.
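The corresponding labelling rule for model classes, again as an illustrative sketch using the same hypothetical gp dictionary of per-learner generalization performances:

MODEL_CLASSES = {
    "NN": ["MLP", "RBFN"],
    "R":  ["RIPPER", "C5.0 Rules"],
    "DT": ["C5.0", "C5.0 Boosting", "LTREE"],
}

def label_model_class(gp, model_class, factor=1.1):
    """Label a dataset with model_class if one of its member learners beats the
    average GP of all target learners by more than 10%; 'NoDiff' otherwise."""
    avg = sum(gp.values()) / len(gp)
    if any(gp[learner] > factor * avg for learner in MODEL_CLASSES[model_class]):
        return model_class
    return "NoDiff"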
27 © ELCA CGC Predicting Model Classes (table of meta-learner accuracies for predicting each of the NN, R, and DT model classes; rows: Majority, C5.0 Tree, C5.0 Rules, MLP, RBFN, LinDiscr, LTree, IB, NB, Ripper)
28 © ELCA CGC Measuring Performance Level The experiments show that landmarking meta-learns. They do not, however, reflect the overall performance of a system whose end result is the accuracy of the selected learning model. We estimate it as follows: train on the artificial datasets, test on the UCI datasets, and report the average error difference between the actual best choice and the choice selected by the meta-learner.
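The reported quantity reduces to a one-line computation, sketched here; best_err and selected_err are hypothetical per-dataset error rates of the best available model and of the model picked by the meta-learner on the UCI test datasets.

def average_loss(best_err, selected_err):
    """Average extra error incurred by following the meta-learner's selection
    instead of the best choice, over the test datasets."""
    return sum(s - b for b, s in zip(best_err, selected_err)) / len(best_err)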
29 © ELCA CGC Performance Level (table of Loss and Maximum Loss for each of the NN, R, and DT model classes)
30 © ELCA CGC METAL Home Page
31 © ELCA CGC Uploading Data
32 © ELCA CGC Characterising Data
33 © ELCA CGC Ranking
34 © ELCA CGC Parameter Setting
35 © ELCA CGC Results
36 © ELCA CGC Running Algorithms
37 © ELCA CGC Site Statistics
38 © ELCA CGC Conclusions There is no universal learner. Work on meta-learning holds promise; landmarking shows that learner preferences may be learned. Open questions: training data generation (incremental…), choice of meta-features (landmarking, structure, …), choice of meta-learner (higher-order, …). MLJ Special Issue on Meta-learning (2004), to appear.
© ELCA CGC VT 1.0 If There is Time Left…
40 © ELCA CGC Process View (figure: the DM process, from Business Problem Formulation through Domain & Data Understanding, Data Pre-processing (Raw Data, Selected Data, Pre-processed Data), Model Building (Patterns, Models), and Interpretation & Evaluation, to Dissemination & Deployment; illustrated on a credit-worthiness example: determine credit worthiness; learn about loans, repayments, etc. and collect data about past performance; aggregate individual incomes into household income; build a decision tree; check against a hold-out set)
41 © ELCA CGC DM Tools A plethora of DM tools has emerged, of three types. Research tools: generally open source, no GUI, expert-driven. Dedicated tools: commercial-strength, restricted to one type of task/algorithm, limited process support. Packaged tools: commercial-strength, rich GUI, support for different types of tasks, rich collection of algorithms, DM process support.
42 © ELCA CGC Tool Selection The situation: increasing interest, many tools with the same basis but different content, and research talk vs. business talk. Goal: assist business users in selecting a DM package based on high-level business criteria. Journal of Intelligent Data Analysis, Vol. 7, No. 3 (2003)
43 © ELCA CGC Schema Definition
44 © ELCA CGC Comprehensive Survey 59 of the most popular tools Commercial: AnswerTree, CART / MARS, Clementine, Enterprise Miner, GainSmarts, GhostMiner, Insightful Miner, Intelligent Miner, KnowledgeSTUDIO, KXEN, MATLAB Neural Network Toolbox, NeuralWorks Predict, NeuroShell, Oracle Data Mining Suite, PolyAnalyst, See5 / Cubist / Magnum Opus, SPAD, SQL Server 2000, STATISTICA Data Miner, Teradata Warehouse Data Mining, etc. Freeware: WEKA, Orange, YALE, SwissAnalyst. Dynamic DB: updated regularly
45 © ELCA CGC Recipients
46 © ELCA CGC Comments The above dimensions characterize Data Mining tools, and NOT Data Mining algorithms With a standard schema and corresponding database, users are able to select a DM software package with respect to its ability to meet high-level business objectives Automatic advice strategies such as METAL's Data Mining Advisor (see ) or IDEA (see ) can then be used to assist users further in the selection of the most appropriate algorithms / models for their specific tasks.
47 © ELCA CGC How About Combining Learners? An obvious solution: Combine learners, as in boosting, bagging, stacked generalisation, etc. However, no matter how elaborate the method, any algorithm that implements a fixed mapping from training sets to prediction models is subject to the limitations imposed by the law of conservation