Presentation is loading. Please wait.

Presentation is loading. Please wait.

Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2.

Similar presentations


Presentation on theme: "Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2."— Presentation transcript:

1 Data Mining: A Closer Look Chapter 2 1

2 2.1 Data Mining Strategies 2

3 3

4 Supervised learning Output attributes: dependent variables Input attributes: independent variables 4

5 Classification: Characteristics Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior. 5

6 Classification: examples Individuals who suffered a heart attack or not A profile of a successful person Credit card fraud purchase Car loan applicant 6

7 Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior. 7

8 examples A thunderstorm future location The salary of a sports car owner The likelihood that a credit card has been stolen Can be transformed to classification Using probability 8

9 Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric. 9

10 examples The total number of touchdowns an NFL running back will score Take advantage of a special offer made available with their credit card billing Stock price Telephone subscribers are likely to change providers 10

11 The Cardiology Patient Dataset 11

12 12

13 13

14 14

15 A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55% 15 Production Rules Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

16 A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17% 16

17 Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers. 17

18 U.C. on the Cardiology Patient Data Flag Concept Class as unused Examine the output of U.C. to determine if the instances from Concept Class naturally cluster together If no, repeatedly apply U.C. with alternative attribute choices 18

19 Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms. 19

20 2.2 Supervised Data Mining Techniques 20

21 21

22 A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer. 22

23 A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67% 23 IF Sex = Male & Income Range = 40-50K THEN Life Insurance Promotion = No Rule Accuracy: 100.00% Rule Coverage: 50.00%

24 Neural Networks 24

25 25

26 Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727 26

27 27

28 2.3 Association Rules 28

29 29

30 An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes 30 IF Sex = Male & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = No IF Sex = Female & Age = over40 THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

31 Association rule advantages Have one or several output attributes An output attribute for one rule can be an input attribute for another rule 31

32 2.4 Clustering Techniques 32

33 33

34 34

35 Rule for the 3 rd cluster IF Sex=Female & 43 >=Age>=35 & Credit Card Insurance = No THEN Class = 3 Rule Accuracy: 100% Rule Coverage: 66.67% 35

36 2.5 Evaluating Performance 36

37 Evaluating Supervised Learner Models 37

38 General questions Will the benefits received from a data mining project more than offset the cost of the data mining process? ◦ Require the business model knowledge How do we interpret the results of a data mining session? Can we use the results of a data mining process with confidence? 38

39 Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors. 39

40 40

41 Two-Class Error Analysis 41

42 42

43 43 Which is better? Given that credit card purchases are unsecured, choose Model B. 寧缺勿濫

44 Evaluating Numeric Output Mean absolute error Mean squared error Root mean squared error = SD, standard deviation 44 When the output attribute is numeric

45 Comparing Models by Measuring Lift 45

46 46

47 Computing Lift 47

48 48

49 49 Which is better? 540/24000 = 450/20000 4000 fewer mailings vs. 90 fewer sales

50 Unsupervised Model Evaluation 50 Perform an unsupervised clustering. Assign each cluster an arbitrary name. ex. C1, C2, and C3 Choose a random sample of instances from each of the classes formed as a result of the instance clustering. Build a supervised learner model

51 Basic Data Mining Techniques Chapter 3 51


Download ppt "Data Mining: A Closer Look Chapter 2 1. 2.1 Data Mining Strategies 2."

Similar presentations


Ads by Google