Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2.

Data Mining: A Closer Look Chapter 2 1

2.1 Data Mining Strategies 2

Supervised learning Output attributes: dependent variables Input attributes: independent variables 4

Classification: Characteristics Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior. 5

Classification: examples Individuals who suffered a heart attack or not A profile of a successful person Credit card fraud purchase Car loan applicant 6

Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior. 7

examples A thunderstorm future location The salary of a sports car owner The likelihood that a credit card has been stolen Can be transformed to classification Using probability 8

Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric. 9

examples The total number of touchdowns an NFL running back will score Take advantage of a special offer made available with their credit card billing Stock price Telephone subscribers are likely to change providers 10

The Cardiology Patient Dataset 11

A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55% 15 Production Rules Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17% 16

Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers. 17

U.C. on the Cardiology Patient Data Flag Concept Class as unused Examine the output of U.C. to determine if the instances from Concept Class naturally cluster together If no, repeatedly apply U.C. with alternative attribute choices 18

Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms. 19

2.2 Supervised Data Mining Techniques 20

A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer. 22

A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67% 23 IF Sex = Male & Income Range = 40-50K THEN Life Insurance Promotion = No Rule Accuracy: 100.00% Rule Coverage: 50.00%

Neural Networks 24

Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727 26

2.3 Association Rules 28

An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes 30 IF Sex = Male & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = No IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

Association rule advantages Have one or several output attributes An output attribute for one rule can be an input attribute for another rule 31

2.4 Clustering Techniques 32

Rule for the 3 rd cluster IF Sex=Female & 43 >=Age>=35 & Credit Card Insurance = No THEN Class = 3 Rule Accuracy: 100% Rule Coverage: 66.67% 35

2.5 Evaluating Performance 36

Evaluating Supervised Learner Models 37

General questions Will the benefits received from a data mining project more than offset the cost of the data mining process? ◦ Require the business model knowledge How do we interpret the results of a data mining session? Can we use the results of a data mining process with confidence? 38

Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors. 39

Two-Class Error Analysis 41

43 Which is better? Given that credit card purchases are unsecured, choose Model B. 寧缺勿濫

Evaluating Numeric Output Mean absolute error Mean squared error Root mean squared error = SD, standard deviation 44 When the output attribute is numeric

Comparing Models by Measuring Lift 45

Computing Lift 47

49 Which is better? 540/24000 = 450/20000 4000 fewer mailings vs. 90 fewer sales

Unsupervised Model Evaluation 50 Perform an unsupervised clustering. Assign each cluster an arbitrary name. ex. C1, C2, and C3 Choose a random sample of instances from each of the classes formed as a result of the instance clustering. Build a supervised learner model

Basic Data Mining Techniques Chapter 3 51

Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2.

Similar presentations

Presentation on theme: "Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2.

Similar presentations

Presentation on theme: "Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2."— Presentation transcript:

Similar presentations

About project

Feedback