Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99223 chen@jepson.gonzaga.edu

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 2.1 Data Mining Strategies

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 3 Figure 2.1 - A hierarchy of data mining strategies Data Mining Strategies Unsupervised Clustering Supervised Learning Market Basket Analysis Classification Estimation Prediction Categorical/discrete (current behavior) Numeric Future outcome (categorical/numeric) No output attributes

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Classification Learning is supervised. The dependent variable is categorical. Well-defined classes. Current rather than future behavior. Estimation Learning is supervised. The dependent variable is numeric. Well-defined classes. Current rather than future behavior. Prediction The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 5 The Cardiology Patient Dataset

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 6 Table 2.1 Cardiology Patient Data

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 7 Table 2.2 Most and Least Typical Instances from the Cardiology Domain

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining A Healthy Class Rule for the Cardiology Patient Dataset IF 169 <= Maximum Heart Rate <=202 THEN Concept Class = Healthy Rule accuracy: 85.07% Rule coverage: 34.55% A Sick Class Rule for the Cardiology Patient Dataset IF Thal = Rev & Chest Pain Type = Asymptomatic THEN Concept Class = Sick Rule accuracy: 91.14% Rule coverage: 52.17%

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Unsupervised Clustering Determine if concepts can be found in the data. Evaluate the likely performance of a supervised model. Determine a best set of input attributes for supervised learning. Detect Outliers.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 10 Market Basket Analysis Find interesting relationships among retail products. Uses association rule algorithms. The results of a market basket analysis help retailers –Design promotions, –Arrange shelf or catalog items, and –Develop cross- marketing strategies

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 11 2.2 Supervised Data Mining Techniques

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 12 The Credit Card Promotion Database

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 13 Table 2.3 The Credit Card Promotion Database

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 14 Data file: CreditCardPromotion.xls

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining A Hypothesis for the Credit Card Promotion Database A combination of one or more of the dataset attributes differentiate Acme Credit Card Company card holders who have taken advantage of the life insurance promotion and those card holders who have chosen not to participate in the promotional offer.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining A Production Rule for the Credit Card Promotion Database IF Sex = Female & 19 <=Age <= 43 THEN Life Insurance Promotion = Yes Rule Accuracy: 100.00% Rule Coverage: 66.67% Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion? Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 17 Neural Networks A neural network is a set of interconnected nodes designed to imitate the functioning of the human brain. Two phases of operations –Learning phase at the input layer until it reaches a predetermined minimum error rate –Fixing weights and recompute output values for new instances A major shortcoming of the neural network approach is a lack of explanation about what has been learned.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 18

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 19 Table 2.3 The Credit Card Promotion Database (Note that blue: input attributes, red: output attributes) Therefore, there four input nodes and one output node and chose five hidden-layer nodes.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 20 Table 2.4 - Neural Network Training: Actual and Computed Output

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 21 Table 2.4 - Neural Network Training: Actual and Computed Output

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 22 Statistical Regression Statistical regression is a supervised learning technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input attributes to a single numeric output attributes. Linear regression model is characterized by an output attribute whose value is determined by a linear sum of weighted input attribute values.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Statistical Regression Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727 Example, a female who does not have credit card insurance is a likely candidate for the life insurance promotion. Life insurance promotion = 0.5909 (0) – 0.5455 (0) + 0.7727 = 0.7727 Because the value 0.7727 is close to 1.0, we conclude that the individual is likely to take advantage of the promotional offer.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 24 2.3 Association Rules Association rule mining techniques are used to discover interesting associations between attributes contained in a database. Unlike traditional association rules, association rules can have one or several output attributes. An output attribute for one rule can be an input attribute for another rule. A popular technique for ‘market basket’ analysis. Problem: we may have several rules with little value.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining An Association Rule for the Credit Card Promotion Database IF Sex = Female & Age = over40 & Credit Card Insurance = No THEN Life Insurance Promotion = Yes

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 26 2.4 Clustering Techniques By applying unsupervised clustering to the instances of the Acme Credit Card Company database, we will find a subset of input attributes that differentiate card holders who have taken advantage of the life insurance promotion from those cardholders who have not accepted the promotion offer.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 27 IF Sex=Female & 43>=Age>=35 & Credit Card Insurance=NO THEN Class = 3 Rule Accuracy: 100% Rule Coverage: 66.67%

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 28 End here for now!

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 29 2.5 Evaluating Performance Performance evaluation is probably the most critical of all the steps in the data mining process. Three general questions: –1. Will the benefits received from a data mining project more than offset the cost of the data mining process? –2. How do we interpret the results of a data mining session? –3. Can we use the results of a data mining process with confidence?

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 30 Evaluating Supervised Learner Models Supervised learner models are designed to classify, estimate, and/or predict future outcome. Applications on classification correctness: –Develop a model to accept to reject credit card applications –Develop a model to accept or reject home mortgage applicants –Develop a model to decide whether or not to drill for oil

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Confusion Matrix A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors.  Classification correctness is best calculated by presenting previously unseen data in the form of a test to the model being evaluated.  A confusion matrix is of little use for evaluating supervised learner models offering numeric output.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 32 Table 2.5 A Three-Class Confusion Matrix Computed Decision C 1 C 2 C 3 C 1 C 11 C 12 C 13 C 2 C 21 C 22 C 23 C 3 C 31 C 32 C 33 Rule 2: for C 2, C 21, C 22, C 23 are all actually members of C 2; but C 21 and C 23 are incorrectly classified as members of another class. Rule 3: for C 2, C 12 and C 32 are instances are incorrectly classified as members of class C 2. Rule 1: Values along the main diagonal represent correct classification e.g., C 11 represents the total number of class C 1 instance correctly classified by the model Computation questions: #1

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 33 Table 2.6 A Simple Confusion Matrix Two-Class Error Analysis The more the better The less the better

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 34 Table 2.7 Two Confusion Matrices Each Showing a 10% Error Rate Which model is better?

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Evaluating Numeric Mean absolute error Mean squared error Root mean squared error

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 36 Comparing Models by Measuring Lift Marketing applications that focus on response rates from mass mailings are less concerned with test set classification error and more interested in building models able to extract bias samples from large populations. Supervised learner models designed for extracting bias samples from a general population are often evaluated by a measure that comes directly from marketing known as lift.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Computing Lift C i is the class of al zero-balance customers who, given the opportunity, will take advantage of the promotional offer.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 38 Figure 2.4 – A lift chart (Targeted vs. mass mailing)

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 39 Table 2.8 Two Confusion Matrices: No Model and an Ideal Model Table 2.9 Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 40 Unsupervised Model Evaluation Evaluating unsupervised data mining is, in general, a more difficult task than supervised evaluation. This is true because the goals of an unsupervised data mining session are frequently not as clear as the goals for supervised learning.

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining 41

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration.

Similar presentations

Presentation on theme: "Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration.

Similar presentations

Presentation on theme: "Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 2 Data Mining: A Closer Look Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration."— Presentation transcript:

Similar presentations

About project

Feedback