A Genetic Algorithm-Based Approach for Building Accurate Decision Trees by Z. Fu, Fannie Mae; Bruce Golden, University of Maryland; S. Lele, University of Maryland; S. Raghavan, University of Maryland; Edward Wasil, American University


A Genetic Algorithm-Based Approach for Building Accurate Decision Trees by Z. Fu, Fannie Mae Bruce Golden, University of Maryland S. Lele, University of Maryland S. Raghavan, University of Maryland Edward Wasil, American University Presented at INFORMS National Meeting Pittsburgh, November 2006

2 A Definition of Data Mining  Exploration and analysis of large quantities of data  By automatic or semi-automatic means  To discover meaningful patterns and rules  These patterns allow a company to  Better understand its customers  Improve its marketing, sales, and customer support operations

3 Data Mining Activities  Discussed in this session  Classification: decision trees  Visualization: discrete models, Sammon maps, multidimensional scaling

4 Classification  Given a number of prespecified classes  Examine a new object, record, or individual and assign it, based on a model, to one of these classes  Examples  Which credit applicants are low, medium, high risk?  Which hotel customers are likely, unlikely to return?  Which residents are likely, unlikely to vote?

5 Background  A decision tree is a popular method for discovering meaningful patterns and classification rules in a data set  With very large data sets, it might be impractical, impossible, or too time-consuming to construct a decision tree using all of the data points  Idea: combine C4.5, statistical sampling, and genetic algorithms to generate decision trees of high quality  We call our approach GAIT: a Genetic Algorithm for Intelligent Trees

6 Flow Chart of GAIT

7 GAIT Terminology  The initial population of trees is generated by C4.5 on the subsets from the partitioned random sample  Trees are randomly selected for the crossover and mutation operations, proportional to tree fitness  Fitness = the percentage of correctly classified observations in the scoring data set  Crossover and mutation are illustrated next
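The fitness measure above (percentage of the scoring set classified correctly) can be sketched in a few lines. A minimal illustration, assuming a classifier is any callable from a record to a label and the scoring set is a list of (record, label) pairs; the names are ours, not from the talk:

```python
def fitness(classify, scoring_set):
    """Fraction of scoring-set records the classifier labels correctly."""
    correct = sum(1 for record, label in scoring_set if classify(record) == label)
    return correct / len(scoring_set)

# Toy usage: a stub classifier that always predicts "repeat"
scoring = [({"trips": 3}, "repeat"), ({"trips": 0}, "no"), ({"trips": 5}, "repeat")]
always_repeat = lambda record: "repeat"
print(round(fitness(always_repeat, scoring), 3))  # 2 of 3 correct -> 0.667
```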

8 Crossover Operations  Subtree-to-subtree crossover  Subtree-to-leaf crossover (the diagrams show two parents producing two children in each case)

9 Mutation Operations  Subtree-to-subtree mutation  Subtree-to-leaf mutation (diagrams)
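The two operator families above can be illustrated on a toy tree structure. This is our own simplification, not the paper's implementation: the real operators act on C4.5 trees, and the following slides note that a feasibility check and pruning are also applied.

```python
import copy
import random

class Node:
    """Minimal binary decision-tree node: internal nodes carry a test,
    leaves carry a class label."""
    def __init__(self, test=None, label=None, left=None, right=None):
        self.test, self.label = test, label
        self.left, self.right = left, right

def all_nodes(root):
    # Every node is a potential crossover/mutation point.
    out, stack = [], [root]
    while stack:
        n = stack.pop()
        out.append(n)
        for child in (n.left, n.right):
            if child is not None:
                stack.append(child)
    return out

def subtree_crossover(p1, p2, rng):
    # Subtree-to-subtree crossover: copy both parents, pick one node in
    # each copy, and swap the subtrees rooted at those nodes.
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    a, b = rng.choice(all_nodes(c1)), rng.choice(all_nodes(c2))
    a.__dict__, b.__dict__ = b.__dict__, a.__dict__
    return c1, c2

def subtree_to_leaf_mutation(tree, new_label, rng):
    # Subtree-to-leaf mutation: replace a randomly chosen subtree
    # with a single leaf.
    t = copy.deepcopy(tree)
    n = rng.choice(all_nodes(t))
    n.test, n.left, n.right, n.label = None, None, None, new_label
    return t

# Usage: two tiny parent trees (feature tests are made up)
p1 = Node(test="age<30", left=Node(label="yes"), right=Node(label="no"))
p2 = Node(test="trips>2", left=Node(label="no"), right=Node(label="yes"))
rng = random.Random(0)
c1, c2 = subtree_crossover(p1, p2, rng)
m = subtree_to_leaf_mutation(p1, "yes", rng)
```

Because the operators work on deep copies, the parents survive unchanged, which is what lets the GA keep good trees across generations.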

10 Feasibility Check

11 Pruning

12 Computational Experiments  Real-life marketing data set from the transportation services industry  Approx. 440,000 customers with demographic and usage information  Key question: do customers reuse the firm's services following a marketing promotion?  The goal, then, is to predict repeat customers  In the data set, 21.3% are repeat customers

13 Computational Experiments (continued)  We focused on 11 variables  We seek to identify customers who will respond positively to the promotion  The primary performance measure is overall classification accuracy  We coded GAIT in Microsoft Visual C++  Experiments were run on a Windows 95 PC with a 400 MHz Pentium II processor and 128 MB of RAM

14 Experimental Design  Training set of 3,000 points  Scoring set of 1,500 points  Test set of 1,000 points  The test set is not used in the development of any classifier  The combined size of the training and scoring sets is approx. 1% of the original data  We designed three different experiments

15 Experiment 1  Partition the 3,000 points in the training set into 50 subsets of 60 points each  From each subset, obtain a decision tree using C4.5  The GA runs for 10 generations  Save the best 50 trees at the end of each generation  The GAIT tree is the one with the highest fitness at the end  Finally, we compute the accuracy of the best GAIT tree on the test set
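The Experiment 1 loop can be sketched end to end on synthetic data. In this toy version, which is ours and not the paper's code, a one-dimensional threshold rule stands in for a C4.5 tree, and crossover/mutation are simplified to averaging and jittering thresholds; what survives intact is the structure of the loop: train an initial population on small subsets, evolve for 10 generations, keep the fittest, and return the tree with the highest scoring-set fitness.

```python
import random

def accuracy(threshold, data):
    # Fitness: fraction of (x, y) pairs where (x >= threshold) == y.
    return sum((x >= threshold) == y for x, y in data) / len(data)

def train_stub(subset):
    # Stand-in for C4.5: the best single threshold on this subset.
    return max((x for x, _ in subset), key=lambda t: accuracy(t, subset))

def gait_toy(subsets, scoring, generations=10, rng=None):
    rng = rng or random.Random(0)
    pop = [train_stub(s) for s in subsets]               # initial population
    for _ in range(generations):
        pop.sort(key=lambda t: accuracy(t, scoring), reverse=True)
        parents = pop[: max(2, len(pop) // 2)]           # fittest survive
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2 + rng.gauss(0, 0.05)     # "crossover" + "mutation"
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda t: accuracy(t, scoring))

# Usage on synthetic data with a true cutoff at 0.6
data_rng = random.Random(1)
labeled = [(x, x >= 0.6) for x in (data_rng.random() for _ in range(110))]
subsets = [labeled[i * 12:(i + 1) * 12] for i in range(5)]  # 5 subsets of 12
scoring = labeled[60:]                                      # 50 scoring points
best = gait_toy(subsets, scoring)
```

Because the fittest trees are carried over each generation, the final tree's scoring accuracy can never fall below that of the best initial tree.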

16 Experiments 2 and 3  Experiment 2  Fewer initial trees  10 subsets of 60 points each ⇒ 10 initial trees  In the end, we compute the accuracy of the best generated decision tree on the test set  Experiment 3  Low-quality initial trees  Take the 10 lowest scoring trees (of 50) from the first generation of Experiment 1  In the end, we compute the accuracy of the best generated decision tree on the test set

17 Details of Experiments  Two performance measures  Classification accuracy on the test set  Computing time for training and scoring  Five different classifiers  Logistic Regression (SAS)  Whole-Training  Best-Initial  Aggregate-Initial  GAIT

18 Decision Tree Classifiers  The Whole-Training tree is the C4.5 tree generated from the entire training set  The Best-Initial tree is the best tree (w.r.t. scoring set) from the first generation of trees  The Aggregate-Initial classifier uses a majority voting rule from the first generation of trees  The GAIT tree is the result of our genetic algorithm
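The Aggregate-Initial majority-voting rule can be sketched in a few lines. A minimal illustration, assuming each first-generation tree is a callable from a record to a label (the stub trees are ours, for demonstration only):

```python
from collections import Counter

def majority_vote(trees, record):
    """Each tree votes on the record; the most common label wins."""
    votes = Counter(tree(record) for tree in trees)
    return votes.most_common(1)[0][0]

# Usage: three stub "trees" voting on one record
trees = [lambda r: "repeat", lambda r: "no", lambda r: "repeat"]
print(majority_vote(trees, {"trips": 2}))  # -> "repeat" (2 votes to 1)
```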

19 Classification Accuracy on the Test Set  Table: for each scoring-set size (left-most column), the average classification accuracy over ten replications of Logistic Regression, Whole-Training, Best-Initial, Aggregate-Initial, and GAIT in Experiments 1, 2, and 3 (the numeric entries are garbled in this transcript)

20 Computing Time (in Seconds) for Training and Scoring  Table: for each scoring-set size (left-most column), the average computing time over ten replications of Logistic Regression, Whole-Training, Best-Initial, Aggregate-Initial, and GAIT in Experiments 1, 2, and 3 (the numeric entries are garbled in this transcript)

21 Computational Results  In general, GAIT outperforms Aggregate-Initial, which outperforms Whole-Training, which outperforms Logistic Regression, which outperforms Best-Initial  The improvement of GAIT over the non-GAIT procedures is statistically significant in all three experiments  Regardless of where you start, GAIT produces highly accurate decision trees  We experimented with a second data set with approx. 50,000 observations and 14 demographic variables, and the results were the same

22 Scalability  We increase the size of the training and scoring sets while the size of the test set remains the same  Six combinations of training-set and scoring-set sizes (with the corresponding percentage of the full data set) are used from the marketing data set (the numeric entries of the size table are garbled in this transcript)

23 Classification Accuracy vs. Training/Scoring Set Size

24 Computing Time for Training and Scoring

25 Computational Results  GAIT generates more accurate decision trees than Logistic Regression, Whole-Training, Aggregate-Initial, and Best-Initial  GAIT scales up reasonably well  GAIT (using only 3% of the data) outperforms Logistic Regression, Best-Initial, and Whole-Training (using 99% of the data) and takes less computing time

26 Conclusions  GAIT generates high-quality decision trees  GAIT can be used effectively on very large data sets  The key to the success of GAIT seems to be the combined use of sampling, genetic algorithms, and C4.5 (a very fast decision-tree package from the machine-learning community)  References  INFORMS Journal on Computing, 15(1), 2003  Operations Research, 51(6), 2003  Computers & Operations Research, 33(11), 2006