Download presentation
Presentation is loading. Please wait.
Published byMelissa Chandler Modified over 9 years ago
1
Data Mining: A Closer Look Chapter 2 1
2
2.1 Data Mining Strategies 2
3
3
4
Supervised learning Output attributes: dependent variables Input attributes: independent variables 4
5
Classification: Characteristics Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior. 5
6
Classification: examples Individuals who suffered a heart attack or not A profile of a successful person Credit card fraud purchase Car loan applicant 6
7
Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior. 7
8
examples A thunderstorm future location The salary of a sports car owner The likelihood that a credit card has been stolen Can be transformed to classification Using probability 8
9
Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric. 9
10
examples The total number of touchdowns an NFL running back will score Take advantage of a special offer made available with their credit card billing Stock price Telephone subscribers are likely to change providers 10
11
The Cardiology Patient Dataset 11
12
12
13
13
14
14
15
A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55% 15 Production Rules Rule accuracy is a between-class measure. Rule coverage is a within-class measure.
16
A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17% 16
17
Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers. 17
18
U.C. on the Cardiology Patient Data Flag Concept Class as unused Examine the output of U.C. to determine if the instances from Concept Class naturally cluster together If no, repeatedly apply U.C. with alternative attribute choices 18
19
Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms. 19
20
2.2 Supervised Data Mining Techniques 20
21
21
22
A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer. 22
23
A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67% 23 IF Sex = Male & Income Range = 40-50K THEN Life Insurance Promotion = No Rule Accuracy: 100.00% Rule Coverage: 50.00%
24
Neural Networks 24
25
25
26
Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727 26
27
27
28
2.3 Association Rules 28
29
29
30
An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes 30 IF Sex = Male & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = No IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes
31
Association rule advantages Have one or several output attributes An output attribute for one rule can be an input attribute for another rule 31
32
2.4 Clustering Techniques 32
33
33
34
34
35
Rule for the 3 rd cluster IF Sex=Female & 43 >=Age>=35 & Credit Card Insurance = No THEN Class = 3 Rule Accuracy: 100% Rule Coverage: 66.67% 35
36
2.5 Evaluating Performance 36
37
Evaluating Supervised Learner Models 37
38
General questions Will the benefits received from a data mining project more than offset the cost of the data mining process? ◦ Require the business model knowledge How do we interpret the results of a data mining session? Can we use the results of a data mining process with confidence? 38
39
Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors. 39
40
40
41
Two-Class Error Analysis 41
42
42
43
43 Which is better? Given that credit card purchases are unsecured, choose Model B. 寧缺勿濫
44
Evaluating Numeric Output Mean absolute error Mean squared error Root mean squared error = SD, standard deviation 44 When the output attribute is numeric
45
Comparing Models by Measuring Lift 45
46
46
47
Computing Lift 47
48
48
49
49 Which is better? 540/24000 = 450/20000 4000 fewer mailings vs. 90 fewer sales
50
Unsupervised Model Evaluation 50 Perform an unsupervised clustering. Assign each cluster an arbitrary name. ex. C1, C2, and C3 Choose a random sample of instances from each of the classes formed as a result of the instance clustering. Build a supervised learner model
51
Basic Data Mining Techniques Chapter 3 51
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.