A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees Wei Fan, IBM T.J.Watson Joe McCloskey, US Department.

Presentation transcript:

A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees Wei Fan, IBM T.J.Watson Joe McCloskey, US Department of Defense Philip Yu, IBM T.J.Watson

Three DM Problems
- Classification: the label comes from a given set of labels in the training data.
- Probability estimation: similar to the above setting, but estimate the probability that x is an example of class y. Difference: no truth is given, i.e., the true probability is never observed.
- Regression: the target value is continuous.

Model Approximation
- True model (or correct model): generates y for each x with probability P(y|x). Normally never known in reality.
- Perfect model: never makes mistakes, or has the same prediction as the true model.
- A perfect model is not always possible due to the stochastic nature of the problem, noise in the training data, and insufficient data.

Optimal Model
- A loss function L(t,y) evaluates performance.
- The optimal decision y* is the label that minimizes the expected loss when x is sampled repeatedly.
- Examples:
  - 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
  - Cost-sensitive loss: y* is the label that minimizes the empirical risk. If P(fraud|x) * $1000 > $90, i.e., P(fraud|x) > 0.09, predict fraud.
  - MSE (mean squared error): predict the average.
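The three example decision rules above can be sketched in a few lines. This is a minimal illustration, not code from the presentation; the function names and the "normal" label are assumptions, and the $1000 and $90 figures are the ones quoted on the slide.

```python
def predict_zero_one(p_fraud):
    """0-1 loss: predict the most probable label."""
    return "fraud" if p_fraud > 0.5 else "normal"


def predict_cost_sensitive(p_fraud, amount, overhead):
    """Cost-sensitive loss: flag fraud when the expected recovery
    p_fraud * amount exceeds the overhead of investigating."""
    return "fraud" if p_fraud * amount > overhead else "normal"


def predict_mse(targets):
    """Squared-error loss: the average minimizes the expected squared error."""
    return sum(targets) / len(targets)


# With a $1000 transaction and a $90 overhead, the break-even point is
# P(fraud|x) = 0.09, far below the 0.5 threshold of 0-1 loss.
print(predict_zero_one(0.10))                    # below 0.5
print(predict_cost_sensitive(0.10, 1000, 90))    # expected recovery vs. overhead
print(predict_mse([100.0, 150.0]))
```

The same estimated probability can therefore lead to different decisions, depending only on which loss function is applied.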

How do we look for optimal models?
- Don't impose exact forms: decision trees, classification based on association rules, production rules. The learner estimates the structure as well as the parameters. NP-hard for most model representations.
- Impose exact forms: logistic regression functions, linear regression models, etc. The learner estimates the parameters ONLY. The structure is fixed in advance. Inductive bias.
- The decision tree is a rather flexible, efficient, yet powerful representation.

Consider Decision Tree
- A compromise between accuracy and model complexity.
- We think that the simplest-structured hypothesis that fits the data is the best.
- We employ all kinds of heuristics to look for it: info gain, Gini index, Kearns-Mansour, etc.; pruning: MDL pruning, reduced-error pruning, cost-based pruning.
- Reality: tractable, but still pretty expensive.
- Truth: none of the purity check functions guarantees accuracy on the testing data.

Random Decision Tree - classification, regression, probability estimation
Key characteristics:
- The structure is randomly picked. Statistics are summarized from the training data.
- At each node, an unused feature is chosen randomly.
- A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
- A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

Continued
We stop when one of the following happens:
- A node becomes too small.
- The total height of the tree exceeds some limit, such as the total number of features.

Node Statistics
- Classification and probability estimation: each node of the tree keeps the number of examples belonging to each class.
- Regression: each node of the tree keeps the mean value of the examples sorted into the node.
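The construction rules and node statistics described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary-based node layout, the feature specification format, and the function names are hypothetical. The structure is grown without looking at the data, so only the height limit is applied here; the "node becomes too small" stop is left out.

```python
import random


def build_random_tree(features, rng, max_depth=None, depth=0, used=frozenset()):
    """Grow a tree whose structure is chosen at random, ignoring the labels.

    features maps a name to ("discrete", [values]) or ("continuous", (lo, hi)).
    """
    if max_depth is None:
        max_depth = len(features)  # height limit suggested on the slide
    candidates = [name for name in features if name not in used]
    node = {"counts": {}, "children": None}
    if depth >= max_depth or not candidates:
        return node  # leaf
    name = rng.choice(candidates)
    kind, spec = features[name]
    node["feature"], node["kind"] = name, kind
    if kind == "discrete":
        # A discrete feature is used at most once per root-to-leaf path.
        node["children"] = {
            value: build_random_tree(features, rng, max_depth, depth + 1, used | {name})
            for value in spec
        }
    else:
        # A continuous feature stays available; each reuse draws a new
        # threshold from the sub-range that is still reachable.
        lo, hi = spec
        threshold = rng.uniform(lo, hi)
        node["threshold"] = threshold
        node["children"] = {
            True: build_random_tree({**features, name: (kind, (lo, threshold))},
                                    rng, max_depth, depth + 1, used),
            False: build_random_tree({**features, name: (kind, (threshold, hi))},
                                     rng, max_depth, depth + 1, used),
        }
    return node


def path(tree, x):
    """Yield every node that example x visits, from the root to its leaf."""
    node = tree
    while True:
        yield node
        if node["children"] is None:
            return
        value = x[node["feature"]]
        key = value if node["kind"] == "discrete" else value < node["threshold"]
        node = node["children"][key]


def populate(tree, examples):
    """Summarize the training data: per-class counts at every visited node."""
    for x, label in examples:
        for node in path(tree, x):
            node["counts"][label] = node["counts"].get(label, 0) + 1
```

A tree would be built once with `build_random_tree(features, random.Random(seed))` and then filled with a single pass of `populate` over the training examples. For regression, the per-class counts would be replaced by a running sum and count so that each node can report a mean.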

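Classification/Probability Estimation
During classification, each tree outputs a posterior probability.
(Figure: an example tree with splits on B1 < 0.5, B2 > 0.7, and B1 > 0.3; one leaf holds P1: 200, P2: 10 and another holds P1: 30, P2: 70. For an example x that reaches the second leaf, P(P1|x) = 30 / (30 + 70) = 0.3.)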
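The posterior in the example above is the class count at the leaf divided by the total count at that leaf. A small sketch, with a hypothetical function name:

```python
def posterior(counts):
    """Turn the per-class counts stored at a node into class probabilities."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Leaf counts taken from the slide's example: P1: 30, P2: 70.
print(posterior({"P1": 30, "P2": 70}))
```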

Regression
During regression, each tree outputs the average value of the training examples that fall within the node reached.
(Figure: an example tree with splits on Age > 30, Capt > 70%, and Edu = PhD; leaves store averages such as Avg AGI = 100K and Avg AGI = 150K.)

Classification
- The predictions from multiple random trees are averaged as the final output.
- Classification: a loss function is needed to turn the averaged probability into a label.
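Averaging across trees might look like the sketch below. It assumes each tree has already produced either a dictionary of class probabilities or a single leaf mean for the example in question; the function names are illustrative only.

```python
def average_posteriors(per_tree):
    """Average class probabilities over trees; a class missing from a
    tree's output counts as probability 0 for that tree."""
    labels = set().union(*per_tree)
    n = len(per_tree)
    return {label: sum(p.get(label, 0.0) for p in per_tree) / n for label in labels}


def average_regression(per_tree_means):
    """Average the leaf means reported by each tree."""
    return sum(per_tree_means) / len(per_tree_means)


avg = average_posteriors([{"P1": 0.3, "P2": 0.7}, {"P1": 0.5, "P2": 0.5}])
print(avg)
print(average_regression([100.0, 150.0]))
```

For classification, the averaged probabilities are then passed to whichever decision rule the loss function implies.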

A few words about some of its advantages
- Training can be very efficient, particularly for very large datasets.
- Natural multi-class probability.
- Natural multi-label classification and probability estimation.
- Imposes very little on the structure of the model.

Number of trees
- Sampling theory: the random decision trees can be thought of as a sample from a large (infinite when continuous features exist) population of trees.
- Unless the data is highly skewed, 30 to 50 trees give a pretty good estimate with reasonably small variance. In most cases, 10 are enough.
- Worst scenario: only one feature is relevant and all the rest are noise.
- Probability and variance reduction: [formulas not preserved in transcript]
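As a rough illustration of the sampling argument, and not a reconstruction of the slide's own formulas, the sketch below uses the standard result that the mean of N independent estimates with standard deviation s has standard error s / sqrt(N). Treating the trees as independent draws is an assumption; trees that share the same training data are only approximately independent.

```python
import math


def standard_error(per_tree_std, n_trees):
    """Standard error of the mean of n_trees independent estimates."""
    return per_tree_std / math.sqrt(n_trees)


# Going from 10 to 50 trees shrinks the standard error by sqrt(5),
# so the gain from each additional tree diminishes quickly.
for n in (10, 30, 50):
    print(n, round(standard_error(1.0, n), 3))
```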

Donation Dataset - classification and probability estimation
- Decide to whom to send a charity solicitation letter.
- It costs $0.68 to send a letter.
- Loss function: [formula not preserved in transcript]
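One common way to frame this decision, assumed here because the slide's own loss function is not available, is to send a letter only when the expected donation exceeds the mailing cost. The names `should_solicit`, `p_donate`, and `expected_amount` are hypothetical.

```python
LETTER_COST = 0.68  # dollars per letter, from the slide


def should_solicit(p_donate, expected_amount, cost=LETTER_COST):
    """Send a letter only if the expected donation exceeds the mailing cost.

    Assumed rule: p_donate * expected_amount > cost.
    """
    return p_donate * expected_amount > cost


print(should_solicit(0.05, 20.0))  # expected donation of $1.00 against $0.68
print(should_solicit(0.01, 20.0))  # expected donation of $0.20 against $0.68
```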

Result

Credit Card Fraud - classification and probability estimation
- Detect whether a transaction is a fraud.
- There is an overhead to detect a fraud, one of {$60, $70, $80, $90}.
- Loss function: [formula not preserved in transcript]

Result

Comparing with Boosting
- Does not handle multi-class problems naturally; needs ECOC.
- Does not output probabilities.
- Choosing the number of boosting rounds is tricky. Sometimes, more rounds can lead to overfitting.
- Inefficient.
- Implementation needs careful numerical manipulation.

Comparing with Bagging
- Can be very inefficient, particularly for very large datasets, since bootstrap sampling needs a linear scan of the data.
- Does not output reliable probabilities.

Probability Estimation

Overfitting

Non-overfitting of RDT

Selectivity

Tolerance to data insufficiency

GUIDE
(Figure: an example tree with splits on Age > 30, Capt > 70%, and Edu = PhD, with an MLR model at each leaf.)
MLR: y = a + a1*x1 + a2*x2 + … + ak*xk

Regression: single independent variable

RDT

Depend on combination of 5 independent variables

RDT

It grows like …

Comparing with GUIDE
- Need to decide the grouping variables and the independent variables, a non-trivial task.
- If all variables are categorical, GUIDE becomes a single CART regression tree.
- Strong assumptions and greedy search. Sometimes this can lead to very unexpected results, like the one given earlier.

Conclusion
- Imposing a particular form on the model is not a good way to train highly accurate models. It may not even be efficient for some forms of models.
- RDT has been shown to solve all three major problems in data mining (classification, probability estimation, and regression) simply, efficiently, and accurately.

Selected Bibliography of RDT
- ICDM 03: Is random model better? On its accuracy and efficiency (Fan, Wang, Yu, and Ma)
- AAAI 04: On the Optimality of Posterior Probability Estimation by Random Decision Tree (Fan)
- ICDM 05: Effective Estimation of Posterior Probabilities: Explaining the Accuracy of Randomized Decision Tree Approaches (Fan, Greengrass, McCloskey, Yu, and Drummey)
- ICDM 05: Learning through Changes: An Empirical Study of Dynamic Behaviors of Probability Estimation Trees (Zhang, Buckles, Peng, and Xu)
- Master's thesis by Tony Liu, supervised by Kai Ming Ting: The Utility of Randomness in Decision Tree Construction, Monash University, 2005
- KDD 06: A General Framework for Fast and Accurate Regression by Data Summarization in Random Decision Trees