There is no Free Lunch, but you'd be surprised how far a few well invested dollars can stretch…
CS Graduate Colloquium Series, Provo, UT, 15 January 2004

Slide 1: A Frog's Leaps

Slide 2: Inductive Machine Learning
Design and implement algorithms that exhibit inductive capabilities.
Induction "involves intellectual leaps from the particular to the general".
From observations:
- Construct a model that explains the (labelled) observations and generalises to new, previously unobserved situations, or
- Extract rules/critical features that characterize the observations

Slide 3: Learning Tasks
- Prediction: learn a function that associates a data item with the value of a prediction/response variable (e.g., credit worthiness)
- Clustering / Segmentation: identify a set of (meaningful) categories or clusters to describe the data (e.g., customer DB)
- Dependency Modeling: find a model that describes significant dependencies, associations or affinity among variables (e.g., market baskets)
- Change / Anomaly Detection: discover the most significant changes in the data from previously measured or normative values (e.g., fraud)

Slide 4: Classification Learning
Prediction where the response variable is discrete
- Wide range of applications
- Most used DM technique (?)
Learning algorithms:
- Decision trees: ID3, C5.0, OC1, etc.
- Neural networks: FF/BP, RBF, ASOCS, etc.
- Rule induction: CN2, ILP, etc.
- Instance-based learning: kNN, NGE, etc.
- Etc.

Slide 5: The Challenge
I am faced with a specific classification task (e.g., am I looking at a rock or at a mine?)
I want a model with high accuracy.
Which classification algorithm should I use?
- Any of them?
- All of them?
- One in particular?

Slide 6: Assumptions
Attribute-value language
- Objects are represented by vectors of attribute-value pairs, where each attribute takes on only a finite number of values
- There is a finite set of m possible attribute vectors
Binary classification into {0,1}
- The relationship between attribute vectors and classes may be specified by an m-component class probability vector C, where C_i is the probability that an object with attribute vector A_i is of class 1
Data generation
- Attribute vectors are sampled with replacement according to an arbitrary distribution D and a class is assigned to each object using C (i.e., class 1 with probability C_i and class 0 with probability 1 - C_i for an object with attribute vector A_i)
- Data is generated the same way for both training and testing
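The generative model on this slide can be made concrete in a few lines. The following is a minimal sketch, not from the talk: binary attributes are assumed, D is an arbitrary Dirichlet draw and C a uniform random vector, both purely illustrative.

```python
# Minimal sketch (assumed, not from the talk) of the slide's data-generation model:
# attribute vectors are drawn with replacement from a distribution D over the m
# possible vectors, and each draw is labelled class 1 with probability C_i.
import numpy as np

rng = np.random.default_rng(0)

a = 3                                    # number of binary attributes (assumed)
vectors = np.array([[(i >> k) & 1 for k in range(a)] for i in range(2 ** a)])
m = len(vectors)                         # m = 2^a possible attribute vectors

D = rng.dirichlet(np.ones(m))            # an arbitrary sampling distribution D
C = rng.random(m)                        # class probability vector C

def generate(n):
    """Draw n labelled examples according to (D, C)."""
    idx = rng.choice(m, size=n, p=D)                  # sample vectors with replacement
    labels = (rng.random(n) < C[idx]).astype(int)     # class 1 with probability C_i
    return vectors[idx], labels

X_train, y_train = generate(100)         # training and test data are generated alike
X_test, y_test = generate(100)
```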

Slide 7: Definitions
- A learning situation S is a triple (D, C, n), where D and C specify how data will be generated and n is the size of the training sample
- Generalization accuracy is the expected prediction performance on objects whose attribute vectors are not present in the training sample
- The generalization accuracy of a random guesser is 0.5 for every D and C
- Generalization performance is generalization accuracy measured relative to this 0.5 baseline (i.e., accuracy minus 0.5)
- GP_L(S) denotes the generalization performance of a learner L in learning situation S

Slide 8: Schaffer's Law of Conservation
For any learner L, and for every D and n:
Σ over C of GP_L((D, C, n)) = 0
(i.e., summed over all learning situations S = (D, C, n) that share D and n, the generalization performances of L cancel out exactly)

Slide 9: Verification of the Law (I)
- The theorem holds for any arbitrary choice of D and n, but these are fixed when the sum is computed. Hence, summing over S is equivalent to summing over C
- If we exclude the possibility of noise, the components of C are drawn from {0,1}
- There are 2^m class probability vectors and the theorem involves a sum, as written, of the 2^m corresponding terms

Slide 10: Verification of the Law (II)
- Restricting to binary attributes, we have m = 2^a (where a is the number of attributes)
- Generate all 2^m = 2^(2^a) noise-free binary classification tasks
- Compute each task's generalization performance using a leave-one-out procedure
- Sum the generalization performances
- Using ID3 as the learner, we performed the experiment for both a=3 and a=4
- The results were … as expected!
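The verification described on this slide can be reproduced in miniature. The sketch below is a hedged illustration rather than the original experiment: it uses a deterministic 1-nearest-neighbour rule in place of ID3 and the tiny case a = 2, then sums the leave-one-out generalization performances (accuracy minus the 0.5 baseline) over all 2^m noise-free tasks. Because any deterministic learner's prediction for a held-out vector is the same across the two tasks that differ only in that vector's label, the sum comes out to exactly zero.

```python
# Minimal sketch of the conservation-law verification: a = 2 binary attributes,
# so m = 2^a = 4 attribute vectors and 2^m = 16 noise-free labelings (tasks).
# A deterministic 1-NN on Hamming distance stands in for ID3 used in the talk.
from itertools import product

a = 2
vectors = list(product([0, 1], repeat=a))            # all m = 4 attribute vectors
m = len(vectors)

def hamming(u, v):
    return sum(x != y for x, y in zip(u, v))

def one_nn_predict(train, query):
    # Deterministic 1-NN: nearest training vector by Hamming distance,
    # ties broken by training order.
    best = min(train, key=lambda xy: hamming(xy[0], query))
    return best[1]

total_gp = 0.0
for labels in product([0, 1], repeat=m):             # every noise-free task C
    correct = 0
    for i in range(m):                                # leave-one-out over attribute vectors
        train = [(vectors[j], labels[j]) for j in range(m) if j != i]
        correct += (one_nn_predict(train, vectors[i]) == labels[i])
    accuracy = correct / m
    total_gp += accuracy - 0.5                        # performance relative to chance

print(total_gp)   # 0.0, as the conservation law predicts
```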

Slide 11: There is No Free Lunch
- Some learners are impossible. In particular, there is no universal learner
- Every demonstration that the generalization performance of an algorithm is better than chance on a set of learning tasks is an implied demonstration that it is worse than chance on an alternative set of tasks
- Every demonstration that the generalization performance of an algorithm is better than that of another on a set of learning tasks is an implied demonstration that it is worse on an alternative set of tasks
- Averaged over all learning tasks, the generalization performance of any two algorithms is exactly identical

Slide 12: Practical Implication
Either:
- One shows that all « real-world » learning tasks come from some subset of the universe on which some (or all) learning algorithm(s) perform well
OR
- One must have some way of determining which algorithm(s) will perform well on one's learning task
Since it is difficult to know a priori all « real-world » learning tasks, we focus on the second alternative.
Note: « overfitting » UCI assumes the first option

Slide 13: Going back to the Challenge
Which learner should I use?
- Any of them? Too hazardous – you may select the wrong one
- All of them? Too onerous – it will take too long
- One in particular? Yes – but how do I know which one is best?
Note: ML/KDD practitioners often narrow in on a subset of algorithms, based on experience

Slide 14: Finding a Mapping…
[Diagram: an unknown mapping ("?") from classification tasks to classification algorithms]

Slide 15: …through Meta-learning
Basic idea: learn a selection or ranking function for learning tasks.
Prerequisite: a description language for tasks.
Several approaches have been proposed:
- The most popular one relies on extracting statistical and information-theoretic measures from a data set
- We have developed an alternative approach to task description, called landmarking

Slide 16: Conjecture
THE PERFORMANCE OF A LEARNER ON A TASK UNCOVERS INFORMATION ABOUT THE NATURE OF THAT TASK

Slide 17: Landmarking the Expertise Space
- Each learner has an area of expertise, i.e., the class of tasks on which it performs particularly well, under some reasonable measure of performance
- A task can be described by the collection of areas of expertise to which it belongs
- A landmarker is a learning mechanism whose performance is used to describe tasks
- Landmarking is the use of landmarkers to locate tasks in the expertise space, i.e., the space of all areas of expertise

Slide 18: Illustration
- Labelled areas are areas of expertise of learners
- Assume the landmarkers are i1, i2 and i3
- Possible inference: problems on which both i1 and i3 perform well, but on which i2 performs poorly, are likely to be in i4's area of expertise

Slide 19: Meta-learning with Landmarking
- Landmarking concentrates on « cartographic » considerations: learners are used to signpost learners
- In principle, every learner's performance can signpost the location of a problem with respect to other learners' expertise
- The landmarkers' performance values are used as task descriptors or meta-attributes for meta-learning
- Exploring the meta-learning potential of landmarking amounts to investigating how well landmarkers' performances hint at the location of learning tasks in the expertise space
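A hedged sketch of that overall loop follows, using scikit-learn classifiers as stand-ins for the landmarkers and target learners of the talk. The variable datasets (a list of (X, y) classification problems), the particular landmarker set, and the target-learner set are all assumptions for illustration, not the METAL configuration.

```python
# Hedged sketch of meta-learning with landmarking: each dataset is described by the
# cross-validated accuracies of a few cheap landmarkers, and the meta-label is the
# target learner that scored best on that dataset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

# Cheap landmarkers (illustrative stand-ins for those in the talk).
landmarkers = {
    "decision_node": DecisionTreeClassifier(max_depth=1, criterion="entropy"),
    "one_nn": KNeighborsClassifier(n_neighbors=1),
    "naive_bayes": GaussianNB(),
}

# Target learners whose best member we want to predict (illustrative subset).
targets = {
    "tree": DecisionTreeClassifier(),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
}

def landmark_features(X, y):
    """Describe a dataset by its landmarkers' 10-fold CV accuracies."""
    return [cross_val_score(clf, X, y, cv=10).mean() for clf in landmarkers.values()]

def best_target(X, y):
    """Meta-label: the target learner with the highest 10-fold CV accuracy."""
    scores = {name: cross_val_score(clf, X, y, cv=10).mean() for name, clf in targets.items()}
    return max(scores, key=scores.get)

def build_meta_dataset(datasets):
    """datasets: assumed list of (X, y) classification problems."""
    meta_X = np.array([landmark_features(X, y) for X, y in datasets])
    meta_y = np.array([best_target(X, y) for X, y in datasets])
    return meta_X, meta_y

# The meta-learner itself is any classifier over the landmark descriptions, e.g.:
# meta_X, meta_y = build_meta_dataset(datasets)
# meta_learner = DecisionTreeClassifier().fit(meta_X, meta_y)
```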

Slide 20: Selecting Landmarkers
Two main considerations:
Computational complexity
- Statistical tests are expensive (up to O(n^3) for some)
  - Poor scalability
  - CPU time could have been allotted to a sophisticated learner of equal or better complexity
- Limit ourselves to O(n log n) and, in any case, do not exceed the time needed to run the target learners
Bias
- To adequately chart the learning space, we need landmarkers to measure different properties, at least implicitly
- (The set of target learners may guide the choice of bias)

Slide 21: Target Learners
The set of target learners considered consists of the following 10 popular learners:
- C5.0
  - Decision tree
  - Decision rules
  - Decision tree with boosting
- Naive Bayes (MLC++ implementation)
- Instance-Based Learning (MLC++ implementation)
- Clementine's Multi-Layer Perceptron
- Clementine's Radial Basis Function Network
- RIPPER
- Linear Discriminant
- LTREE

Slide 22: Landmarkers
Typical landmarkers include:
- Decision node
  - A single, most-informative decision node (based on information gain ratio)
  - Aims to establish closeness to linear separability
- Randomly chosen node
  - A single decision node chosen at random
  - Informs, together with the next one, about irrelevant attributes
- Worst node
  - A single, least-informative decision node
  - Further informs on linear separability (if neither the best nor the worst attribute produces a single well-performing separation, it is likely that linear separation is not an adequate learning strategy)
- Elite 1-Nearest Neighbor
  - Standard 1-NN with the nearest neighbor computed on an elite subset of attributes (based on information gain ratio)
  - Attempts to establish whether the task is relational, that is, whether it involves parity-like relationships between the attributes. In relational tasks, no single attribute is considerably more informative than all others.
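The four landmarkers above could be approximated as follows. This is a sketch under stated assumptions, not the original implementation: scikit-learn exposes estimated mutual information rather than information gain ratio, so that is used to rank attributes, and the "elite" subset is arbitrarily taken to be the top quarter of attributes.

```python
# Hedged approximation of the four landmarkers on this slide, for a dataset with
# discrete attributes. Mutual information stands in for information gain ratio.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import mutual_info_classif

def stump_score(X, y, attribute):
    """CV accuracy of a single decision node built on one attribute."""
    stump = DecisionTreeClassifier(max_depth=1, criterion="entropy")
    return cross_val_score(stump, X[:, [attribute]], y, cv=10).mean()

def landmarkers(X, y, rng=np.random.default_rng(0)):
    # Rank attributes by estimated mutual information (stand-in for gain ratio).
    gain = mutual_info_classif(X, y, discrete_features=True, random_state=0)
    best, worst = int(np.argmax(gain)), int(np.argmin(gain))
    random_attr = int(rng.integers(X.shape[1]))
    elite = np.argsort(gain)[-max(1, X.shape[1] // 4):]   # top quarter (arbitrary choice)

    return {
        "decision_node": stump_score(X, y, best),        # closeness to linear separability
        "random_node": stump_score(X, y, random_attr),   # hints at irrelevant attributes
        "worst_node": stump_score(X, y, worst),          # further informs on separability
        "elite_1nn": cross_val_score(                    # parity-like / relational tasks
            KNeighborsClassifier(n_neighbors=1), X[:, elite], y, cv=10).mean(),
    }
```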

Slide 23: Statistical Meta-attributes
Typical (simple) statistical meta-attributes include:
- Class entropy
- Average entropy of the attributes
- Mutual information
- Joint entropy
- Equivalent number of attributes
- Signal-to-noise ratio
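A few of these measures can be computed directly for a dataset with integer-coded discrete attributes; the sketch below is an assumption-laden illustration, and the two derived measures follow the usual StatLog-style definitions rather than anything quoted from the talk.

```python
# Hedged sketch of some statistical meta-attributes (discrete, integer-coded data
# assumed; derived measures use StatLog-style definitions, assumed here).
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score

def statistical_meta_attributes(X, y):
    n_attrs = X.shape[1]
    class_entropy = entropy(np.bincount(y) / len(y), base=2)        # class entropy (bits)
    attr_entropies = [entropy(np.unique(X[:, j], return_counts=True)[1] / len(y), base=2)
                      for j in range(n_attrs)]
    # mutual_info_score returns nats; divide by ln 2 to convert to bits.
    mutual_infos = [mutual_info_score(X[:, j], y) / np.log(2) for j in range(n_attrs)]

    avg_attr_entropy = float(np.mean(attr_entropies))
    avg_mutual_info = float(np.mean(mutual_infos))   # assumed > 0 for the ratios below
    return {
        "class_entropy": class_entropy,
        "avg_attribute_entropy": avg_attr_entropy,
        "avg_mutual_information": avg_mutual_info,
        "equiv_nr_attributes": class_entropy / avg_mutual_info,
        "signal_to_noise": avg_mutual_info / (avg_attr_entropy - avg_mutual_info),
    }
```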

Slide 24: Training Set I
Artificial datasets
- 320 randomly generated Boolean datasets with between 5 and 12 attributes
- Generalization performance, GP_Li, of each of the target learners is computed using 10-fold stratified cross-validation
- Each dataset is labeled as:
  - Learner Lk, if GP_Lk = max {GP_Li}
  - Tie, if max {GP_Li} – min {GP_Li} < 0.1
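One plausible reading of this labelling rule, as a small helper; checking for a tie before picking the single best learner is an assumption about how the two cases interact.

```python
def meta_label(gp):
    """gp: dict mapping learner name -> 10-fold CV generalization performance.
    Assumed reading: 'Tie' when all learners lie within 0.1 of one another,
    otherwise the name of the single best learner."""
    if max(gp.values()) - min(gp.values()) < 0.1:
        return "Tie"
    return max(gp, key=gp.get)

# Example: meta_label({"C5.0": 0.91, "MLP": 0.78, "NB": 0.80}) -> "C5.0"
```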

Slide 25: Landmarking vs Standard DC
[Table: meta-learner accuracies under the Landmarking, Standard DC and Combined data characterizations, for Majority (baseline 0.460), C5.0 Tree, C5.0 Rules, MLP, RBFN, LinDiscr, LTree, IB, NB, Ripper, and their Average]

Slide 26: Training Set II
Artificial datasets
- 222 randomly generated Boolean datasets with 20 attributes
- Generalization performance, GP_Li, of each of the target learners is computed using 10-fold stratified cross-validation
- Three model classes: NN = {MLP, RBFN}, R = {RIPPER, C5.0 Rules}, DT = {C5.0, C5.0 Boosting, LTREE}
- Each dataset is labeled as:
  - Model class Mk, if for all Li in Mk, GP_Li > 1.1 · avg {GP_Lj}
  - NoDiff, otherwise
18 UCI datasets
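Similarly, a hedged reading of the model-class labelling rule; returning the first qualifying class is an assumption about how ties between classes would be handled.

```python
def model_class_label(gp, classes):
    """gp: dict mapping learner -> GP; classes: dict mapping class name -> list of learners.
    Assumed reading: a dataset is labelled with a model class when every learner in that
    class exceeds 1.1 times the average GP of all target learners; otherwise 'NoDiff'.
    If several classes qualify, the first match is returned (assumption)."""
    avg = sum(gp.values()) / len(gp)
    for name, members in classes.items():
        if all(gp[m] > 1.1 * avg for m in members):
            return name
    return "NoDiff"
```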

Slide 27: Predicting Model Classes
[Table: accuracy of each meta-learner (Majority, C5.0 Tree, C5.0 Rules, MLP, RBFN, LinDiscr, LTree, IB, NB, Ripper) at predicting the NN, R and DT model classes]

Slide 28: Measuring Performance Level
- Experiments show that landmarking meta-learns
- They do not, however, reflect the overall performance of a system whose end result is the accuracy of the selected learning model
- Estimate it as follows:
  - Train on artificial datasets
  - Test on UCI datasets
  - Report the average error difference between the actual best choice and the meta-learner's selected choice
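The reported quantity can be expressed as a small helper; this is an assumed reading of "average error difference between actual best choice and meta-learner selected choice", matching the Loss and Max Loss columns of the next slide.

```python
# Hedged sketch of the performance-level measure: per dataset, the loss is the
# accuracy of the actual best learner minus the accuracy of the selected learner.
import numpy as np

def selection_loss(gp_per_dataset, selected):
    """gp_per_dataset: list of dicts (learner -> accuracy); selected: list of chosen names."""
    losses = [max(gp.values()) - gp[sel] for gp, sel in zip(gp_per_dataset, selected)]
    return float(np.mean(losses)), float(np.max(losses))   # average loss, max loss
```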

Slide 29: Performance Level
[Table: Loss and Max Loss of the meta-learner's selection for each model class (NN, R, DT)]

Slide 30: METAL Home Page

Slide 31: Uploading Data

Slide 32: Characterising Data

Slide 33: Ranking

Slide 34: Parameter Setting

Slide 35: Results

Slide 36: Running Algorithms

Slide 37: Site Statistics

Slide 38: Conclusions
- There is no universal learner
- Work on meta-learning holds promise
- Landmarking shows learner preferences may be learned
Open questions:
- Training data generation (incremental, …)
- Choice of meta-features (landmarking, structure, …)
- Choice of meta-learner (higher-order, …)
MLJ Special Issue on Meta-learning (2004), to appear

Slide 39: If There is Time Left…

Slide 40: Process View
[Diagram: the data mining process, from Business Problem Formulation through Domain & Data Understanding, Data Pre-processing, Model Building, and Interpretation & Evaluation to Dissemination & Deployment, with the data flowing from Raw Data to Selected Data, Pre-processed Data, and finally Patterns / Models]
Example (credit scoring):
- Business problem formulation: determine credit worthiness
- Domain & data understanding: learn about loans, repayments, etc.; collect data about past performance
- Data pre-processing: aggregate individual incomes into household income
- Model building: build a decision tree
- Interpretation & evaluation: check against hold-out set
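For the credit-worthiness example on this slide, the pre-processing, model-building and hold-out steps might look like the sketch below; the file name and column names are purely illustrative assumptions, not from the talk.

```python
# Hedged sketch of the process instantiated on the slide: aggregate a household-income
# feature during pre-processing, build a decision tree, and check it on a hold-out set.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

loans = pd.read_csv("loan_history.csv")   # assumed file of past loan performance

# Data pre-processing: aggregate individual incomes into household income.
loans["household_income"] = loans["applicant_income"] + loans["coapplicant_income"]

X = loans[["household_income", "loan_amount", "loan_term"]]   # illustrative features
y = loans["repaid"]                                           # credit-worthiness target

# Model building and evaluation against a hold-out set.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_hold, tree.predict(X_hold)))
```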

Slide 41: DM Tools
A plethora of DM tools has emerged… Three types:
- Research tools: generally open source, no GUI, expert-driven
- Dedicated tools: commercial-strength, restricted to one type of task/algorithm, limited process support
- Packaged tools: commercial-strength, rich GUI, support for different types of tasks, rich collection of algorithms, DM process support

Slide 42: Tool Selection
The situation:
- Increasing interest
- Many tools with the same basis but different content
- Research talk vs business talk
Goal: assist business users in selecting a DM package based on high-level business criteria
Journal of Intelligent Data Analysis, Vol. 7, No. 3 (2003)

Slide 43: Schema Definition

Slide 44: Comprehensive Survey
59 of the most popular tools
- Commercial: AnswerTree, CART / MARS, Clementine, Enterprise Miner, GainSmarts, GhostMiner, Insightful Miner, Intelligent Miner, KnowledgeSTUDIO, KXEN, MATLAB Neural Network Toolbox, NeuralWorks Predict, NeuroShell, Oracle Data Mining Suite, PolyAnalyst, See5 / Cubist / Magnum Opus, SPAD, SQL Server 2000, STATISTICA Data Miner, Teradata Warehouse Data Mining, etc.
- Freeware: WEKA, Orange, YALE, SwissAnalyst
- Dynamic DB: updated regularly

Slide 45: Recipients

Slide 46: Comments
- The above dimensions characterize Data Mining tools, NOT Data Mining algorithms
- With a standard schema and corresponding database, users are able to select a DM software package with respect to its ability to meet high-level business objectives
- Automatic advice strategies, such as METAL's Data Mining Advisor or IDEA, can then be used to assist users further in the selection of the most appropriate algorithms/models for their specific tasks

Slide 47: How About Combining Learners?
An obvious solution: combine learners, as in boosting, bagging, stacked generalisation, etc.
However, no matter how elaborate the method, any algorithm that implements a fixed mapping from training sets to prediction models is subject to the limitations imposed by the law of conservation.
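For concreteness, here is a brief, hedged illustration of two of the named combination methods using scikit-learn (not the talk's setup). The point of the slide still stands: the fitted ensemble is a fixed mapping from training sets to models, so the conservation-law argument applies to it unchanged.

```python
# Illustrative combinations of learners: bagging and stacked generalisation.
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Bagging: many trees trained on bootstrap resamples, predictions aggregated by vote.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25)

# Stacked generalisation: a level-1 model learns to combine the base models' outputs.
stacked = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),
)
# Either ensemble is used like any classifier: fit(X_train, y_train), then predict(X_test).
```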