Presentation on theme: "IDSL, Intelligent Database System Lab" — Presentation transcript:

1 IDSL, Intelligent Database System Lab
Learning and Making Decisions When Costs and Probabilities are Both Unknown
Authors: Bianca Zadrozny, Charles Elkan
Advisor: Dr. Hsu
Graduate: Yu-Wei Su

2 IDSL, Intelligent Database System Lab
Outline
Motivation
Objective
Introduction
MetaCost vs. direct cost-sensitive decision-making
A testbed: the KDD'98 charitable donations dataset
Probability estimation methods
Estimating donation amounts
Experimental results
Conclusion
Opinion

3 IDSL, Intelligent Database System Lab
Motivation
Misclassification costs are different for different examples, in the same way that probabilities are. Real-world datasets also present the problem of class imbalance.

4 IDSL, Intelligent Database System Lab
Objective
To make optimal decisions given costs and probabilities. To address sample selection bias using the correction due to Nobel prize-winning economist James Heckman.

5 IDSL, Intelligent Database System Lab
Introduction
Most supervised learning algorithms assume that all errors (incorrect predictions) are equally costly, which is not true in practice. Cost-sensitive learning aims for the decision with the lowest expected cost, whereas non-cost-sensitive learning only aims to classify accurately. The paper presents an alternative method called direct cost-sensitive decision-making.

6 MetaCost vs. direct cost-sensitive decision-making
Each example x is associated with a cost C(i,j,x) of predicting class i for x when the true class of x is j. The optimal decision concerning x is the class i that leads to the lowest expected cost.
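The decision rule is not written out on this slide; as a hedged reconstruction of the standard expected-cost criterion it refers to, the optimal prediction for x is

```latex
i^{*}(x) = \arg\min_{i} \sum_{j} P(j \mid x)\, C(i, j, x)
```

i.e., the costs C(i,j,x) are weighted by the estimated class membership probabilities P(j|x).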

7 MetaCost vs. direct cost-sensitive decision-making (cont.)
Direct cost-sensitive decision-making shares the same central idea, but differs from MetaCost in two ways: MetaCost assumes that costs are known in advance and are the same for all examples, and MetaCost estimates probabilities using bagging, whereas here simpler methods based on a single decision tree are used.

8 A testbed: the KDD'98 charitable donations dataset
The training set consists of records with known classes; the test set consists of records without known classes. The overall percentage of donors in the population is about 5%. The donation amount for persons who respond varies from $1 to $200.

9 A testbed: the KDD'98 charitable donations dataset (cont.)
In the donation domain it is easier to talk consistently about benefit than about cost. The optimal predicted label for example x is the class i that maximizes the expected benefit (j=1 means the person does donate; j=0 means the person does not).
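The benefit formula itself is not transcribed from the slide; as a hedged reconstruction of the expected-benefit criterion the text describes, with B(i,j,x) denoting the benefit of predicting class i for x when the true class is j:

```latex
i^{*}(x) = \arg\max_{i} \sum_{j} P(j \mid x)\, B(i, j, x)
```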

10 A testbed: the KDD'98 charitable donations dataset (cont.)
The optimal policy
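The policy appears only as an image on the original slide. As a hedged reconstruction, assuming the usual KDD'98 setup in which each solicitation costs $0.68 and y(x) denotes the estimated donation amount of x if x donates, the optimal policy is to solicit person x exactly when the expected donation exceeds the mailing cost:

```latex
P(j = 1 \mid x)\; y(x) > 0.68
```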

11 Probability estimation methods
Deficiencies of decision tree methods
Smoothing
Curtailment
Calibrating naive Bayes classifier scores
Averaging probability estimates

12 Deficiencies of decision tree methods
Standard decision tree methods assign to each leaf the raw training frequency p = k/n by default. These are not accurate conditional probability estimates, for at least two reasons: high bias and high variance. Pruning can alleviate the problem, but it is not suitable for unbalanced datasets.

13 Deficiencies of decision tree methods (cont.)
The solution: use C4.5 without pruning and without collapsing, to obtain raw scores that can then be transformed into accurate class membership probabilities.
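A minimal sketch of obtaining such raw scores, using scikit-learn's CART tree as a stand-in for C4.5 and a synthetic unbalanced dataset as a stand-in for KDD'98 (both are assumptions, not the paper's setup):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with roughly 5% positives, mimicking the donor rate.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grow the tree fully: no pruning and no collapsing, analogous to running
# C4.5 with pruning and collapsing disabled.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.0, random_state=0)
tree.fit(X_train, y_train)

# predict_proba returns the raw training frequency k/n of the leaf each test
# example falls into; smoothing or curtailment will later calibrate these.
raw_scores = tree.predict_proba(X_test)[:, 1]
```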

14 IDSL, Intelligent Database System Lab
Smoothing
The Laplace correction method: for a two-class problem, it replaces the conditional probability estimate p = k/n by p' = (k+1)/(n+2), which adjusts probability estimates to be closer to 1/2.
For the donation domain, the probability p = k/n is instead replaced by p' = (k + bm)/(n + m), where b is the base rate of the positive class and m is a smoothing parameter.
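A minimal sketch of this smoothing rule (the function name is hypothetical, not from the paper):

```python
def smooth_score(k, n, b=0.05, m=200):
    """m-estimate smoothing of a raw leaf frequency k/n.

    b is the base rate of the positive class and m controls how strongly
    small leaves are pulled toward b; b=0.05 and m=200 are the donation-
    domain values used in these slides.
    """
    return (k + b * m) / (n + m)

# Laplace correction is the special case b=1/2, m=2:
print(smooth_score(1, 4, b=0.5, m=2))  # (1+1)/(4+2) = 0.333...
print(smooth_score(1, 4))              # (1+10)/(4+200) ≈ 0.054
```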

15 IDSL, Intelligent Database System Lab
Smoothing (cont.)
For example, if a leaf contains four examples, one of which is positive, the raw C4.5 score of this leaf is 0.25. The smoothed score with m = 200 and b = 0.05 is p' = (1 + 0.05·200)/(4 + 200) = 11/204 ≈ 0.054.

16 IDSL, Intelligent Database System Lab
Smoothing (cont.)

17 IDSL, Intelligent Database System Lab
Curtailment
Curtailment is introduced to overcome the problem of overfitting. Curtailment is not equivalent to any type of pruning.
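The slide does not transcribe how curtailment works; the following is an illustrative sketch under the assumption that curtailment means: when scoring a test example, stop descending the tree as soon as the next node would contain fewer than v training examples, and use the frequency of the last sufficiently large node instead (the function name and the threshold v=100 are assumptions, and the positive class is assumed to be class 1):

```python
import numpy as np

def curtailed_scores(tree, X, v=100):
    """Return positive-class frequencies from a fitted sklearn decision tree,
    curtailing the descent at nodes with fewer than v training examples."""
    t = tree.tree_
    X = np.asarray(X)
    scores = np.empty(len(X))
    for i, x in enumerate(X):
        node = 0
        while t.children_left[node] != -1:  # while not at a leaf
            go_left = x[t.feature[node]] <= t.threshold[node]
            child = t.children_left[node] if go_left else t.children_right[node]
            if t.n_node_samples[child] < v:  # child too small: stop here
                break
            node = child
        counts = t.value[node][0]
        scores[i] = counts[1] / counts.sum()  # positive-class frequency
    return scores
```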

18 IDSL, Intelligent Database System Lab
Curtailment (cont.)

19 IDSL, Intelligent Database System Lab
Curtailment (cont.)

20 Calibrating naive Bayes classifier scores
A histogram method is used to obtain calibrated probability estimates from a naive Bayesian classifier. Sort the training examples according to their scores and divide the sorted set into b bins of equal size. Given a test example x, place it in a bin according to its score n(x), and estimate the corrected probability as the fraction of positive training examples in that bin.
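A minimal sketch of this binning calibration (the function and parameter names are hypothetical; labels are assumed to be 0/1):

```python
import numpy as np

def histogram_calibrate(train_scores, train_labels, test_scores, b=10):
    """Sort training examples by score, split them into b equal-size bins,
    and map each test score to the fraction of positives in its bin."""
    order = np.argsort(train_scores)
    scores = np.asarray(train_scores)[order]
    labels = np.asarray(train_labels)[order]

    bins = np.array_split(np.arange(len(scores)), b)
    uppers = np.array([scores[idx[-1]] for idx in bins])        # bin upper bounds
    fractions = np.array([labels[idx].mean() for idx in bins])  # positives per bin

    # Place each test score into the first bin whose upper bound covers it.
    positions = np.clip(np.searchsorted(uppers, test_scores), 0, b - 1)
    return fractions[positions]
```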

21 Averaging probability estimates
Combining the probability estimates given by different classifiers through averaging can reduce the variance of the probability estimates [Tumer and Ghosh, 1995]. In the slide's formula, σ² is the variance of each original classifier, N is the number of classifiers, and ρ is the correlation factor among all classifiers.
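The formula appears only as an image on the original slide; as a hedged reconstruction of the Tumer and Ghosh result it cites, the variance of the averaged estimate is

```latex
\sigma^{2}_{\text{avg}} = \frac{1 + \rho\,(N - 1)}{N}\,\sigma^{2}
```

so averaging N uncorrelated classifiers (ρ = 0) divides the variance by N, while perfectly correlated classifiers (ρ = 1) give no reduction at all.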

22 Estimating donation amounts
One option is to impute a donation amount of zero for non-donors in the training set, since their actual donation amount is zero; but then the amount estimate becomes entangled with the donation probability, which is estimated separately. It is also wrong to use the same donation estimate for all test examples, because then the decision about whom to solicit is based only on the probability.

23 Estimating donation amounts (cont.)
These costs or benefits must be estimated for each example. Least-squares multiple linear regression (MLR) is used to estimate the donation amount from two attributes: Lastgift, the dollar amount of the most recent gift, and Ampergift, the average gift amount in responses to the last 22 promotions.
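A minimal sketch of this regression step, with synthetic data standing in for the KDD'98 fields (the field values, coefficients, and donor rate below are invented for illustration only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
lastgift = rng.uniform(1, 200, size=n)    # dollar amount of most recent gift
ampergift = rng.uniform(1, 200, size=n)   # average gift over last 22 promotions
donated = rng.random(n) < 0.05            # ~5% donors
amount = np.where(donated, 0.5 * lastgift + 0.4 * ampergift, 0.0)

# Fit the regression on donors only, then predict an amount for everyone.
X = np.column_stack([lastgift, ampergift])
reg = LinearRegression().fit(X[donated], amount[donated])
y_hat = reg.predict(X)
```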

24 Estimating donation amounts (cont.)
The problem of sample selection bias: donation amounts estimated by the regression equation tend to be too low for test examples that have a low probability of donation.

25 Estimating donation amounts (cont.)
Heckman correction:
1. Learn a probit linear model to estimate the conditional probabilities P(j=1|x).
2. Estimate y(x) by linear regression, using only the training examples x for which j(x)=1 but including the value of P(j=1|x).
In this paper, the probability estimates used in the second step of Heckman's procedure are obtained from a decision tree or a naive Bayes classifier.
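A minimal two-step sketch in this spirit (simplified; the helper name is hypothetical, the inputs are assumed to be numpy arrays with j in {0,1}, and naive Bayes stands in for the first-step model as the slide allows):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression

def heckman_style_amounts(X, j, amount, X_test):
    """Step 1: estimate P(j=1|x) with a classifier.
    Step 2: regress the donation amount on the features plus P(j=1|x),
    using donors only, then apply the regression to all test examples."""
    clf = GaussianNB().fit(X, j)
    p_train = clf.predict_proba(X)[:, 1]
    p_test = clf.predict_proba(X_test)[:, 1]

    donors = j == 1
    reg = LinearRegression().fit(
        np.column_stack([X, p_train])[donors], amount[donors])
    return reg.predict(np.column_stack([X_test, p_test]))
```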

26 IDSL, Intelligent Database System Lab
Experimental results

27 IDSL, Intelligent Database System Lab
Conclusion
A method of cost-sensitive learning that performs systematically better than MetaCost in experiments.
It provides a solution to the fundamental problem of costs being different for different examples.
It identifies and solves the problem of sample selection bias.

28 IDSL, Intelligent Database System Lab
Opinion
Frequency is not the only metric.
Positive and negative classes are not simply 1 and 0.
Questions?

