On the Optimality of Probability Estimation by Random Decision Trees. Wei Fan, IBM T.J. Watson.

Presentation transcript:

On the Optimality of Probability Estimation by Random Decision Trees. Wei Fan, IBM T.J. Watson.

Some important facts about inductive learning
- Given a set of labeled data items, such as (amt, merchant category, outstanding balance, date/time, ...), where the label says whether the transaction is a fraud or a non-fraud.
- Inductive model: predict whether a transaction is a fraud or a non-fraud.
- Perfect model: never makes mistakes. Not always possible due to:
  - the stochastic nature of the problem
  - noise in the training data
  - insufficient data

Optimal Model
- A loss function L(t, y) evaluates performance; t is the true label and y is the prediction.
- The optimal decision y* is the label that minimizes the expected loss when x is sampled many times:
  - 0-1 loss: y* is the label that appears most often, i.e., if P(fraud|x) > 0.5, predict fraud.
  - cost-sensitive loss: y* is the label that minimizes the empirical risk, i.e., if P(fraud|x) * $1000 > $90, or equivalently P(fraud|x) > 0.09, predict fraud.
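A minimal sketch of these two decision rules in Python, assuming a probability estimate p_fraud for a single example is already available; the function names are hypothetical, and the $1000 transaction amount and $90 overhead are the illustrative figures from the slide, not fixed parameters:

```python
def predict_0_1_loss(p_fraud):
    """0-1 loss: predict the most likely label."""
    return "fraud" if p_fraud > 0.5 else "non-fraud"

def predict_cost_sensitive(p_fraud, amount=1000.0, overhead=90.0):
    """Cost-sensitive loss: predict fraud when the expected recovery
    p(fraud|x) * amount exceeds the investigation overhead."""
    return "fraud" if p_fraud * amount > overhead else "non-fraud"

# With amount = $1000 and overhead = $90, the threshold is 90 / 1000 = 0.09,
# so any probability estimate above 0.09 yields the same "fraud" decision.
print(predict_0_1_loss(0.2))        # non-fraud under 0-1 loss
print(predict_cost_sensitive(0.2))  # fraud under cost-sensitive loss
```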

How do we look for optimal models?
- NP-hard for most model representations.
- We assume that the simplest hypothesis that fits the data is the best.
- We employ all kinds of heuristics to look for it:
  - info gain, gini index, etc.
  - pruning: MDL pruning, reduced-error pruning, cost-based pruning.
- Reality: tractable, but still very expensive.

How many optimal models are out there?
- 0-1 loss, binary problem:
  - Truth: if P(positive|x) > 0.5, we predict x to be positive.
  - P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference in the final prediction!
- Cost-sensitive problems:
  - Truth: if P(fraud|x) * $1000 > $90, we predict x to be fraud. Rewritten: P(fraud|x) > 0.09.
  - P(fraud|x) = 1.0 and any other estimate above 0.09 make no difference.
- There are really many, many optimal models out there.

Random Decision Tree: Outline
- Train multiple trees. Details to follow.
- Each tree outputs a posterior probability when classifying an example x.
- The probability outputs of many trees are averaged as the final probability estimate.
- The loss function and the probability are used to make the best prediction.

Training
- At each node, an unused feature is chosen randomly.
- A discrete feature is unused if it has never been chosen previously on the decision path from the root to the current node.
- A continuous feature can be chosen multiple times on the same decision path, but each time a different threshold value is chosen.

Example (tree diagram on the original slide): the root tests Gender? (M/F); one branch then tests Age > 30 (y/n); deeper nodes on the same path may test the continuous Age feature again with a different threshold, e.g., Age > 25. Each node keeps class counts, e.g., P: 100, N: 150 at one node and P: 1, N: 9 at another.

Training: Continued
- We stop when one of the following happens:
  - a node becomes empty, or
  - the total height of the tree exceeds a threshold, currently set to the total number of features.
- Each node of the tree keeps the number of examples belonging to each class.
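The two Training slides above can be read as the following Python sketch. The Node class, helper names, and the in-memory dataset layout are assumptions for illustration, not the authors' implementation:

```python
import random
from collections import Counter

class Node:
    def __init__(self, class_counts):
        self.class_counts = class_counts   # per-class example counts kept at every node
        self.feature = None
        self.threshold = None              # set only when a continuous feature is tested
        self.children = {}                 # branch value -> child Node

def build_random_tree(examples, labels, features, used_discrete=frozenset(),
                      depth=0, max_depth=None):
    """examples: list of dicts mapping feature -> value.
    features: dict mapping feature -> 'discrete' or 'continuous'."""
    if max_depth is None:
        max_depth = len(features)          # height threshold = total number of features
    node = Node(Counter(labels))
    # Stop when the node is empty or the height threshold is exceeded.
    if not examples or depth >= max_depth:
        return node
    # Candidates: all continuous features, plus discrete features unused on this path.
    candidates = [f for f, kind in features.items()
                  if kind == 'continuous' or f not in used_discrete]
    if not candidates:
        return node
    f = random.choice(candidates)          # chosen at random -- no purity function
    node.feature = f
    if features[f] == 'continuous':
        node.threshold = random.choice([ex[f] for ex in examples])  # fresh random threshold
        branch_of = lambda ex: ex[f] <= node.threshold
        branches = [True, False]
        next_used = used_discrete
    else:
        branch_of = lambda ex: ex[f]
        branches = {ex[f] for ex in examples}
        next_used = used_discrete | {f}    # a discrete feature is used at most once per path
    for b in branches:
        sub = [(ex, y) for ex, y in zip(examples, labels) if branch_of(ex) == b]
        node.children[b] = build_random_tree([ex for ex, _ in sub], [y for _, y in sub],
                                             features, next_used, depth + 1, max_depth)
    return node

# Multiple random trees are trained on the same data, as the outline slide describes:
# trees = [build_random_tree(examples, labels, features) for _ in range(10)]
```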

Classification
- Each tree outputs a membership probability, e.g., p(fraud|x) = n_fraud / (n_fraud + n_normal).
- If a leaf node is empty (very likely when a discrete feature is tested near the end of a path): use the parent node's probability estimate, but do not output 0 or NaN.
- The membership probabilities from the multiple random trees are averaged to form the final output.
- A loss function is required to make a decision:
  - 0-1 loss: if p(fraud|x) > 0.5, predict fraud.
  - cost-sensitive loss: if p(fraud|x) * $1000 > $90, predict fraud.
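A matching classification sketch, reusing the hypothetical Node layout from the training sketch above; it assumes trained trees whose roots are non-empty, falls back to the parent's counts for empty or unseen branches, and averages the per-tree probabilities before the loss function is applied:

```python
def tree_probability(node, example, positive_class, parent_counts=None):
    """p(positive|x) from one tree; fall back to the parent's counts for empty nodes."""
    counts = node.class_counts if sum(node.class_counts.values()) > 0 else parent_counts
    if node.feature is None:                      # leaf: read the stored class counts
        return counts[positive_class] / sum(counts.values())
    value = example[node.feature]
    branch = (value <= node.threshold) if node.threshold is not None else value
    child = node.children.get(branch)
    if child is None:                             # unseen branch value: treat like an empty leaf
        return counts[positive_class] / sum(counts.values())
    return tree_probability(child, example, positive_class, parent_counts=counts)

def rdt_probability(trees, example, positive_class='fraud'):
    """Average the membership probabilities of all random trees (the final output)."""
    probs = [tree_probability(t, example, positive_class) for t in trees]
    return sum(probs) / len(probs)

# Decision making with a loss function, as on the slide:
#   0-1 loss:            predict fraud if rdt_probability(trees, x) > 0.5
#   cost-sensitive loss: predict fraud if rdt_probability(trees, x) * 1000 > 90
```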

Credit Card Fraud
- Task: detect whether a transaction is a fraud.
- There is an overhead to investigate each predicted fraud, one of {$60, $70, $80, $90}.
- Loss function (given on the original slide).

Result

Donation Dataset
- Decide to whom to send a charity solicitation letter.
- About 5% of the examples are positive.
- It costs $0.68 to send a letter.
- Loss function (given on the original slide).

Result

Independent studies and implementations of random decision trees
- Kai Ming Ting and Tony Liu from Monash University, Australia, on UCI datasets.
- Edward Greengrass from the DOD on their own data sets:
  - 100 to 300 features, both categorical and continuous; some features have many distinct values.
  - data sets of up to 3,000 examples.
  - both binary and multi-class problems (16 and 25 classes).

Why does the random decision tree work?
- Original explanation: the error-tolerance property.
  - Truth: if P(positive|x) > 0.5, we predict x to be positive.
  - P(positive|x) = 0.6 and P(positive|x) = 0.9 make no difference in the final prediction!
- New discovery: the averaged posterior probability, such as P(positive|x), is a better estimate than that of the single best tree.

Credit Card Fraud

Adult Dataset

Donation

Overfitting

Non-overfitting

Selectivity

Tolerance to data insufficiency

Other related applications of random decision trees
- n-fold cross-validation
- stream mining
- multi-class probability estimation

Implementation issues
- When there is not an astronomical number of features and feature values, we can build some empty tree structures up front and feed the data in one simple scan to finalize their construction.
- Otherwise, build the tree iteratively, just like traditional tree construction but WITHOUT any expensive purity function check.
- Both ways are very efficient, since we never evaluate an expensive purity function.
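One possible reading of the "empty structures plus one scan" idea, as a hedged sketch: it assumes all features are discrete with small, known value sets (the "not astronomical" case the slide mentions), and the dict-based node layout and helper names are illustrative only, not the authors' implementation:

```python
import random
from collections import Counter

def grow_skeleton(feature_values, used=frozenset()):
    """Random tree skeleton built with NO data; feature_values maps each discrete
    feature to its list of possible values (assumed small and known in advance)."""
    node = {'counts': Counter(), 'feature': None, 'children': {}}
    unused = [f for f in feature_values if f not in used]
    if not unused:
        return node
    f = random.choice(unused)                 # again, no purity function is consulted
    node['feature'] = f
    for v in feature_values[f]:
        node['children'][v] = grow_skeleton(feature_values, used | {f})
    return node

def feed_one_scan(trees, examples, labels):
    """Single pass over the data: route each example down every tree, updating class counts."""
    for ex, y in zip(examples, labels):
        for root in trees:
            node = root
            while node is not None:
                node['counts'][y] += 1
                if node['feature'] is None:
                    break
                node = node['children'].get(ex[node['feature']])
    return trees
```

Because the skeletons never consult the data and the counts are updated incrementally, the same single-scan routine also fits the stream-mining application listed two slides earlier.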

On the other hand
- Occam's Razor's interpretation: given two hypotheses with the same loss, we should prefer the simpler one.
- Yet very complicated hypotheses can be highly accurate:
  - meta-learning
  - boosting (weighted voting)
  - bagging (sampling with replacement)
- None of the purity functions really obeys Occam's razor.
- Their philosophy is: simpler is better, and we hope simpler also brings high accuracy. That is not true!