1
What are the real challenges in data mining?
Charles Elkan, University of California, San Diego
August 21, 2003
2
Bogosity about learning with unbalanced data
1. "The goal is yes/no classification."
   No: the goal is ranking, or probability estimation. Often P(c = minority | x) < 0.5 for all examples x.
2. "Decision trees and C4.5 are well-suited."
   No: model each class separately, then use Bayes' rule:
   P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)]
   No: avoid small disjuncts. With naive Bayes, P(x|c) = Π_i P(x_i | c).
3. "Under/over-sampling are appropriate."
   No: do cost-based example-specific sampling, then bagging (see the sketch below).
4. "ROC curves and AUC are important."
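The cost-based, example-specific sampling followed by bagging in point 3 corresponds to the cost-proportionate rejection sampling of Zadrozny, Langford, and Abe (ICDM'03, cited on the final slide). A minimal sketch, assuming each training example carries a known cost and using a decision tree purely as an illustrative base classifier:

```python
# Sketch: cost-proportionate rejection sampling + bagging ("costing").
# Assumes per-example costs are known; the base classifier is illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rejection_sample(X, y, costs, rng):
    """Keep each example with probability cost / max(cost)."""
    keep = rng.random(len(y)) < costs / costs.max()
    return X[keep], y[keep]

def train_costing(X, y, costs, n_models=10, seed=0):
    """Bag classifiers, each trained on its own rejection sample."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        Xs, ys = rejection_sample(X, y, costs, rng)
        models.append(DecisionTreeClassifier().fit(Xs, ys))
    return models

def predict_costing(models, X):
    """Average the votes of the bagged models."""
    votes = np.mean([m.predict(X) for m in models], axis=0)
    return (votes >= 0.5).astype(int)
```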
3
Learning to predict contact maps
3D protein → distance map → binary contact map
(Source: Paolo Frasconi et al.)
4
Issues in contact map prediction
1. An ML researcher sees O(n^2) non-contacts and O(n) contacts.
2. But to a biologist, the concept "an example of a non-contact" is far from natural.
3. Moreover, there is no natural probability distribution defining the population of "all" proteins.
4. A statistician sees simply O(n^2) distance measures, but finds that least-squares regression is useless!
5
For the rooftop detection task ...
We used [...] BUDDS, to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,289 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept "rooftop."
(Source: Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown, Marcus Maloof, this workshop.)
6
How to detect faces in real-time?
Viola and Jones, CVPR '01:
- Slide a window over the image
- 45,396 features per window
- Learn a boosted decision-stump classifier
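The weak learners are single-feature threshold tests (decision stumps). A minimal sketch of discrete AdaBoost over such stumps, assuming dense feature vectors and labels in {-1, +1}; this is a simplified stand-in, not the Viola-Jones attentional cascade or its integral-image features:

```python
# Sketch: discrete AdaBoost with decision stumps (simplified).
import numpy as np

def fit_stump(X, y, w):
    """Find the single-feature threshold/polarity with lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)                 # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(s * (X[:, j] - t) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost(X, y, rounds=50):
    """Return a weighted ensemble of stumps; y must be in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = fit_stump(X, y, w)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(s * (X[:, j] - t) >= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)          # up-weight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, j, t, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(s * (X[:, j] - t) >= 0, 1, -1)
                for a, j, t, s in ensemble)
    return np.sign(score)
```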
7
UCI datasets are small and not highly unbalanced

DATA SET    SIZE     FEATURES   MINORITY FRACTION
PIMA        768      8          0.35
PHONEME     5484     5          0.29
SATIMAGE    6435     36         0.10
MAMMOG.     11183    6          0.02
KRKOPT      28056    6          0.01

(Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)
9
Features of the DMEF and similar datasets
1. At least 10^5 examples and 10^2.5 features.
2. No single well-defined target class.
3. Interesting cases have frequency < 0.01.
4. Much information on costs and benefits, but no overall model of profit/loss.
5. Different cost matrices for different examples.
6. Most cost matrix entries are unknown.
10
Example-dependent costs and benefits

                        actual
predicted        legitimate      fraudulent
legitimate       +0.01x          -x
fraudulent       -20             -10

Observations:
1. Loss or profit depends on the transaction size x.
2. Figuring out the full profit/loss model is hard.
3. Opportunity costs are confusing.
4. Creative management transforms costs into benefits.
5. How do we account for long-term costs and benefits?
11
Correct decisions require correct probabilities

                        actual
predicted        legitimate      fraudulent
legitimate       +0.01x          -x
fraudulent       -20             -10

Let p = P(legitimate). The optimal decision is "approve" iff
  0.01xp - (1-p)x > (-20)p + (-10)(1-p).
This calculation requires well-calibrated estimates of p.
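A direct transcription of this rule; p is assumed to be a calibrated probability of legitimacy, x is the transaction amount, and the benefit values come from the matrix above:

```python
def approve(p, x):
    """Approve iff the expected benefit of approving exceeds that of denying.

    p: calibrated estimate of P(legitimate); x: transaction amount.
    Benefits are taken from the example cost matrix above.
    """
    expected_approve = 0.01 * x * p - x * (1 - p)
    expected_deny = -20 * p - 10 * (1 - p)
    return expected_approve > expected_deny

# e.g. a $1000 transaction: approve(0.99, 1000) -> True, approve(0.95, 1000) -> False
```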
12
ROC curves considered harmful
1. "AUC can give a general idea of the quality of the probabilistic estimates produced by the model."
   No, AUC only evaluates the ranking produced.
2. "Cost curves are equivalent to ROC curves."
   No, a single point on the ROC curve is optimal only if costs are the same for all examples.
Advice: use $ profit to compare methods.
Issue: when is a $ difference statistically significant? (See the sketch below.)
(Source: Medical College of Georgia.)
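On the significance question, one simple option (a generic sketch, not a procedure from the talk) is a paired bootstrap over the per-example profits of two methods evaluated on the same test set:

```python
# Sketch: paired bootstrap test for the difference in total dollar profit.
import numpy as np

def profit_diff_bootstrap(profit_a, profit_b, n_boot=10000, seed=0):
    """profit_a[i], profit_b[i]: profit of each method on test example i."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(profit_a) - np.asarray(profit_b)
    n = len(diffs)
    # resample test examples with replacement and re-total the difference
    boot = np.array([diffs[rng.integers(0, n, n)].sum() for _ in range(n_boot)])
    # two-sided: fraction of resampled totals falling on the other side of zero
    p_value = 2 * min((boot <= 0).mean(), (boot >= 0).mean())
    return diffs.sum(), p_value
```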
13
Usually we must learn a model to estimate costs

Cost matrix for soliciting donors to a charity:

                        actual
predicted        donor            non-donor
solicit          x - $0.68        -$0.68
ignore           0                0

The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
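With this matrix, the expected profit of soliciting is p(x - 0.68) + (1 - p)(-0.68) = px - 0.68, so the optimal rule is to solicit exactly when the estimated donation probability times the estimated donation amount exceeds $0.68. A minimal sketch of that rule (the argument names are placeholders):

```python
def solicit(p_donate, x_hat, cost=0.68):
    """Solicit iff expected donation p * x_hat exceeds the $0.68 mailing cost.

    p_donate: calibrated estimate of P(donor); x_hat: predicted donation amount.
    """
    return p_donate * x_hat > cost
```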
14
So, we learn a model to estimate costs ...

                        actual
predicted        donor            non-donor
solicit          x - $0.68        -$0.68
ignore           0                0

Issue: the subset of the training set with x > 0 is a skewed sample for learning a model to estimate x.
Reason: the donation amount x and the probability of donation p are inversely correlated. Hence, the training set contains too few examples of large donations, compared to small ones.
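One standard way to counteract this kind of sample bias, offered here as an illustration rather than as the talk's own correction, is to weight each donor in the regression training set by the inverse of its estimated donation probability, so that under-represented large donors count more. A minimal sketch, assuming a propensity model for P(donate) has already been fit:

```python
# Sketch: inverse-probability weighting of the donor-only regression sample.
# Illustrative only; not necessarily the correction used in the talk.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_amount_model(X_donors, amounts, p_donate_donors):
    """Weight each donor by 1 / P(donate) so rare large donors count more."""
    weights = 1.0 / np.clip(p_donate_donors, 1e-3, 1.0)
    return LinearRegression().fit(X_donors, amounts, sample_weight=weights)
```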
16
The “reject inference” problem 1. Let humans make credit grant/deny decisions. 2. Collect data about repay/write-off, but only for people to whom credit is granted. 3. Learn a model from this training data. 4. Apply the model to all future applicants. Issue: “All future applicants” is a sample from a different population than “people to whom credit is granted.”
18
Selection bias makes training labels incorrect In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1)! Historical actions are not optimal, but they are not chosen randomly either. (Source: William H. Wolberg, M.D.)
19
Sequences of training sets
1. Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population.
2. Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002.
3. And so on...
Each time a new model is learned, its training set has been created using a different selection bias.
20
Let's use the word "unbalanced" in the future
Google: searched the web for "imbalanced" ... about 53,800 results.
Google: searched the web for "unbalanced" ... about 465,000 results.
21
C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978.
B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213.
B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616.
N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02.
B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.