Data Mining: A Closer Look. Chapter 2. 2.1 Data Mining Strategies.

Presentation transcript:

Data Mining: A Closer Look
Chapter 2

2.1 Data Mining Strategies

Supervised learning
Output attributes: dependent variables
Input attributes: independent variables

Classification: Characteristics
Learning is supervised.
The dependent variable is categorical.
Well-defined classes.
Current rather than future behavior.

Classification: examples
Whether or not an individual has suffered a heart attack
A profile of a successful person
Whether a credit card purchase is fraudulent
A car loan applicant

Estimation
Learning is supervised.
The dependent variable is numeric.
Well-defined classes.
Current rather than future behavior.

Estimation: examples
The future location of a thunderstorm
The salary of a sports car owner
The likelihood that a credit card has been stolen
◦ Can be transformed to classification by using the probability
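
The remark that an estimation problem can be transformed to classification using probability might be sketched as follows. This is only an illustration: the 0.5 threshold, the class labels, and the sample likelihoods are hypothetical and do not come from the slides.

```python
def classify_from_probability(probability, threshold=0.5):
    """Turn an estimated probability into a categorical class label.

    The threshold is an assumption; a real application would choose it
    from the relative cost of the two kinds of error.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "stolen" if probability >= threshold else "not stolen"


# Hypothetical estimated likelihoods that a credit card has been stolen.
estimates = [0.05, 0.48, 0.50, 0.93]
labels = [classify_from_probability(p) for p in estimates]
print(labels)
```

The numeric output of the estimation model is kept; only the final step maps it onto two classes.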

Prediction
The emphasis is on predicting future rather than current outcomes.
The output attribute may be categorical or numeric.

Prediction: examples
The total number of touchdowns an NFL running back will score
Whether customers will take advantage of a special offer made available with their credit card billing
A stock price
Which telephone subscribers are likely to change providers

The Cardiology Patient Dataset

Production Rules
A Healthy Class Rule for the Cardiology Patient Dataset
IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%
Rule accuracy is a between-class measure.
Rule coverage is a within-class measure.
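
A minimal sketch of how rule accuracy and rule coverage could be computed, assuming the usual reading of the two measures: accuracy is the share of instances satisfying the IF part that belong to the rule's class (it looks across classes), and coverage is the share of the class's own instances that satisfy the IF part (it looks within the class). The six patient records are invented for illustration and are not the Cardiology Patient Dataset, so the resulting percentages differ from those on the slide.

```python
def rule_accuracy_and_coverage(instances, antecedent, target_class):
    """Return (accuracy, coverage) of the rule IF antecedent THEN target_class."""
    covered = [x for x in instances if antecedent(x)]
    in_class = [x for x in instances if x["class"] == target_class]
    correct = [x for x in covered if x["class"] == target_class]
    accuracy = len(correct) / len(covered) if covered else 0.0
    coverage = len(correct) / len(in_class) if in_class else 0.0
    return accuracy, coverage


# Hypothetical records; only the attribute used by the rule is included.
patients = [
    {"max_heart_rate": 172, "class": "Healthy"},
    {"max_heart_rate": 180, "class": "Healthy"},
    {"max_heart_rate": 150, "class": "Healthy"},
    {"max_heart_rate": 140, "class": "Healthy"},
    {"max_heart_rate": 175, "class": "Sick"},
    {"max_heart_rate": 120, "class": "Sick"},
]

accuracy, coverage = rule_accuracy_and_coverage(
    patients,
    antecedent=lambda p: 169 <= p["max_heart_rate"] <= 202,
    target_class="Healthy",
)
print(f"accuracy={accuracy:.2%} coverage={coverage:.2%}")
```

A rule can be highly accurate yet cover only a small part of its class, which is why the two figures are reported together.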

A Sick Class Rule for the Cardiology Patient Dataset
IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%

Unsupervised Clustering
Determine if concepts can be found in the data.
Evaluate the likely performance of a supervised model.
Determine a best set of input attributes for supervised learning.
Detect outliers.

Unsupervised Clustering (U.C.) on the Cardiology Patient Data
Flag Concept Class as unused.
Examine the output of the clustering to determine whether the instances from each concept class naturally cluster together.
If not, repeatedly apply unsupervised clustering with alternative attribute choices.

Market Basket Analysis
Find interesting relationships among retail products.
Uses association rule algorithms.

2.2 Supervised Data Mining Techniques

A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

A Production Rule for the Credit Card Promotion Database
IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: %
Rule Coverage: 66.67%
IF Sex = Male & Income Range = 40-50K
THEN Life Insurance Promotion = No
Rule Accuracy: %
Rule Coverage: 50.00%

Neural Networks

Statistical Regression
Life insurance promotion = (credit card insurance) (sex)

2.3 Association Rules

An Association Rule for the Credit Card Promotion Database
IF Sex = Female & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = Yes
IF Sex = Male & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = No
IF Sex = Female & Age = over40
THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

Association rule advantages
A rule can have one or several output attributes.
An output attribute for one rule can be an input attribute for another rule.

2.4 Clustering Techniques

Rule for the 3rd cluster
IF Sex = Female & 43 >= Age >= 35 & Credit Card Insurance = No
THEN Class = 3
Rule Accuracy: 100%
Rule Coverage: 66.67%

2.5 Evaluating Performance

Evaluating Supervised Learner Models

General questions
Will the benefits received from a data mining project more than offset the cost of the data mining process?
◦ Answering this requires knowledge of the business model.
How do we interpret the results of a data mining session?
Can we use the results of a data mining process with confidence?

Confusion Matrix
A matrix used to summarize the results of a supervised classification.
Entries along the main diagonal are correct classifications.
Entries other than those on the main diagonal are classification errors.
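
The definition above can be sketched in a few lines. Rows are taken to be the actual classes and columns the computed classes, which is a layout assumption; the eight labelled instances are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted, classes):
    """Build a matrix whose rows are actual classes and columns are computed classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]


def accuracy_from_matrix(matrix):
    """Correct classifications lie on the main diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical test-set results for a two-class problem.
classes = ["accept", "reject"]
actual = ["accept", "accept", "accept", "reject",
          "reject", "reject", "reject", "accept"]
predicted = ["accept", "reject", "accept", "reject",
             "reject", "accept", "reject", "accept"]

matrix = confusion_matrix(actual, predicted, classes)
model_accuracy = accuracy_from_matrix(matrix)
print(matrix, model_accuracy)
```

Everything off the diagonal is an error, so the error rate is simply one minus the accuracy computed here.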

Two-Class Error Analysis

Which is better?
Given that credit card purchases are unsecured, choose Model B (better to go without than to accept something substandard).

Evaluating Numeric Output
When the output attribute is numeric:
Mean absolute error
Mean squared error
Root mean squared error (comparable to a standard deviation, SD, of the prediction errors)
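
The three error measures for numeric output can be computed as in the sketch below. The actual and predicted values are hypothetical.

```python
import math


def error_measures(actual, predicted):
    """Return (mean absolute error, mean squared error, root mean squared error)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)


# Hypothetical numeric outputs for four test instances.
actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

mae, mse, rmse = error_measures(actual, predicted)
print(mae, mse, rmse)
```

Squaring penalizes large errors more heavily than small ones, and taking the root brings the figure back to the units of the output attribute.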

Comparing Models by Measuring Lift

Computing Lift

Which is better?
540/24000 = 450/20000
4000 fewer mailings vs. 90 fewer sales
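
A short sketch of a lift calculation, assuming the standard definition: the response rate within the sample selected by the model divided by the response rate in the whole population. The 540 responders among 24,000 mailings come from the slide; the population figures (1,000 responders among 100,000 customers) are hypothetical and chosen only to make the arithmetic concrete.

```python
def lift(sample_hits, sample_size, population_hits, population_size):
    """Lift = P(class | model-selected sample) / P(class | population)."""
    sample_rate = sample_hits / sample_size
    population_rate = population_hits / population_size
    return sample_rate / population_rate


# 540 of 24,000 is from the slide; the population numbers are assumptions.
model_lift = lift(540, 24_000, 1_000, 100_000)
print(round(model_lift, 2))
```

Two models with the same ratio of responders to mailings have the same lift, so the choice between them rests on the trade-off between mailing cost and lost sales.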

Unsupervised Model Evaluation
Perform an unsupervised clustering.
Assign each cluster an arbitrary name, e.g., C1, C2, and C3.
Choose a random sample of instances from each of the classes formed as a result of the instance clustering.
Build a supervised learner model
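
The steps above might be sketched as follows. Several pieces are assumptions rather than the presenter's method: a tiny one-dimensional k-means stands in for the clustering algorithm, a nearest-mean classifier stands in for the supervised learner, the data values are invented, and the final step (testing the supervised model on the instances not sampled) is added here because the slide text stops at building the model.

```python
import random


def kmeans_1d(values, k, iterations=20):
    """Small 1-D k-means (k >= 2); initial centers are spread evenly over the range."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            groups[nearest].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]


# Hypothetical data with three well-separated groups.
values = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1, 9.0, 9.2, 8.8, 9.1]

# Steps 1 and 2: cluster, then give each cluster an arbitrary name.
named = [f"C{label + 1}" for label in kmeans_1d(values, 3)]

# Step 3: random sample of instances from each named cluster.
rng = random.Random(0)
train_idx = set()
for name in sorted(set(named)):
    members = [i for i, n in enumerate(named) if n == name]
    train_idx.update(rng.sample(members, 2))

# Step 4: build a supervised model (nearest class mean) from the sample.
means = {}
for name in set(named):
    points = [values[i] for i in train_idx if named[i] == name]
    means[name] = sum(points) / len(points)


def predict(value):
    return min(means, key=lambda name: abs(value - means[name]))


# Assumed final step: test on the instances that were not sampled.
test_idx = [i for i in range(len(values)) if i not in train_idx]
test_accuracy = sum(predict(values[i]) == named[i] for i in test_idx) / len(test_idx)
print(test_accuracy)
```

Under this reading, high test accuracy suggests the clusters are well separated, while low accuracy suggests the clustering did not find distinct concepts.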

Basic Data Mining Techniques
Chapter 3