Dr. Chen, Data Mining  A/W & Dr. Chen, Data Mining Chapter 4 An Excel-based Data Mining Tool (iData Analyzer) Jason C. H. Chen, Ph.D. Professor of MIS.

Presentation transcript:

Chapter 4: An Excel-based Data Mining Tool (iData Analyzer)
Jason C. H. Chen, Ph.D.
Professor of MIS, School of Business Administration
Gonzaga University, Spokane, WA

Objectives
This chapter introduces the iData Analyzer (iDA) and shows how to use two of the learner models contained in the iDA data mining software. Section 4.1 gives an overview of the iDA model for knowledge discovery. Section 4.2 introduces ESX, an exemplar-based data mining tool capable of both supervised learning and unsupervised clustering. The chapter also covers how datasets are represented and how to use ESX to perform unsupervised clustering and build supervised learning models.

The iData Analyzer
iDA provides support for the business or technical analyst by offering a visual learning environment, an integrated tool set, and data mining process support. iDA consists of the following components:
– Preprocessor
– Heuristic agent (for large datasets)
– ESX
– Neural Network
– Rule Maker
– Report Generator
See p.107 and Appendix A-2 for installation instructions.

Limitations
The commercial version of iDA is bounded by the size of a single MS Excel spreadsheet, i.e., up to 65,536 rows and 256 columns.
The iDA input format uses the first three rows of a spreadsheet to house information about individual attributes.
– Up to 65,533 data instances in attribute-value format can be mined.
– The student version allows a maximum of 7,000 data instances (i.e., 7,003 rows).
After completing the installation, if the security setting is High, change it to Medium and click OK.
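
The row limits above follow from simple arithmetic: three header rows are subtracted from the sheet size. A minimal sketch of that arithmetic, assuming the figures quoted on the slide; the function names `max_instances` and `fits` are hypothetical and not part of iDA.

```python
# Row arithmetic from the Limitations slide (names here are illustrative only).
EXCEL_MAX_ROWS = 65_536        # rows in a single spreadsheet, per the slide
EXCEL_MAX_COLS = 256           # columns in a single spreadsheet, per the slide
HEADER_ROWS = 3                # iDA reserves the first three rows for attribute info
STUDENT_MAX_INSTANCES = 7_000  # student-version limit, per the slide


def max_instances(total_rows: int = EXCEL_MAX_ROWS) -> int:
    """Data instances that remain after the three header rows."""
    return total_rows - HEADER_ROWS


def fits(n_instances: int, n_attributes: int, student: bool = False) -> bool:
    """Check a dataset against the stated row and column limits."""
    limit = STUDENT_MAX_INSTANCES if student else max_instances()
    return n_instances <= limit and n_attributes <= EXCEL_MAX_COLS


print(max_instances())                      # commercial version: instances per sheet
print(STUDENT_MAX_INSTANCES + HEADER_ROWS)  # student version: total rows used
```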

Figure 4.1 – The iDA system architecture

4.2 ESX: A Multipurpose Tool for Data Mining
ESX can help create target data, find irregularities in data, perform data mining, and offer insight about the practical value of discovered knowledge. Features of the ESX learner model:
– It supports both supervised learning and unsupervised clustering.
– It supports an automated method for dealing with missing attribute values.
– It does not make statistical assumptions about the nature of the data to be processed.
– It can point out inconsistencies and unusual values in data.

Figure 4.3 – An ESX concept hierarchy

iDAV Format for Data Mining
Second row: C = categorical; R = real-valued
Third row: attribute usage (see Table 4.2)
Table 4.2 – Values for Attribute Usage
Character   Usage
I           The attribute is used as an input attribute.
U           The attribute is not used (categorical attributes with several unique values are of little predictive value).
D           The attribute is not used for classification or clustering, but attribute value summary information is displayed in all output reports.
O           The attribute is used as an output attribute. For supervised learning with ESX, exactly one categorical attribute is selected as the output attribute.
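
The header codes above can be checked before a mining session. The sketch below is a hypothetical validator, not part of iDA: it assumes the first row holds attribute names (the slide only describes the second and third rows), and the attribute names in the example are illustrative.

```python
VALID_TYPES = {"C", "R"}            # second row: categorical, real-valued
VALID_USAGE = {"I", "U", "D", "O"}  # third row: input, unused, display-only, output


def check_idav_header(names, types, usage, supervised=False):
    """Return a list of problems found in the three header rows (empty if none)."""
    if not (len(names) == len(types) == len(usage)):
        return ["the three header rows have different lengths"]
    problems = []
    for name, t, u in zip(names, types, usage):
        if t not in VALID_TYPES:
            problems.append(f"{name}: unknown type code {t!r}")
        if u not in VALID_USAGE:
            problems.append(f"{name}: unknown usage code {u!r}")
    if supervised:
        # Supervised learning with ESX needs exactly one categorical output attribute.
        outputs = [(n, t) for n, t, u in zip(names, types, usage) if u == "O"]
        if len(outputs) != 1:
            problems.append(f"expected exactly one output attribute, found {len(outputs)}")
        elif outputs[0][1] != "C":
            problems.append(f"{outputs[0][0]}: output attribute must be categorical")
    return problems


# Illustrative header rows (assumed layout, not copied from Table 4.1).
names = ["Life Ins Promo", "Credit Card Ins", "Sex", "Age"]
types = ["C", "C", "C", "R"]
usage = ["O", "I", "I", "I"]
print(check_idav_header(names, types, usage, supervised=True))
```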

Table 4.1 – Credit Card Promotion Database: iDAV Format

4.4 A Five-step Approach for Unsupervised Clustering
Step 1: Enter the Data to be Mined
Step 2: Perform a Data Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Individual Class Results
Step 5: Visualize Individual Class Rules

Step 1: Enter the Data to Be Mined

Step 2: Perform a Data Mining Session

Figure 4.5 – Unsupervised settings for ESX (#4, p.116)
Value for instance similarity:
– A value closer to 100 encourages the formation of new clusters.
– A value closer to 0 favors placing new instances in existing clusters.
The real-valued tolerance setting helps determine the similarity criteria for real-valued attributes. A setting of 1.0 is usually appropriate.
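
To see why a higher similarity value tends to produce more clusters, consider the toy sketch below. It is NOT the ESX algorithm: it is a simple incremental clustering scheme, assumed here only for illustration, in which an instance joins the best-matching cluster when its average similarity to that cluster reaches the threshold and otherwise starts a new cluster. The data rows are invented.

```python
def similarity(a, b):
    """Percent of attribute positions on which two instances agree (0-100)."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def incremental_cluster(instances, threshold):
    """Toy clustering: join the most similar cluster if it meets the threshold."""
    clusters = []
    for inst in instances:
        best, best_score = None, -1.0
        for cluster in clusters:
            # Average similarity of the new instance to the cluster's members.
            score = sum(similarity(inst, m) for m in cluster) / len(cluster)
            if score > best_score:
                best, best_score = cluster, score
        if best is not None and best_score >= threshold:
            best.append(inst)
        else:
            clusters.append([inst])  # nothing similar enough: start a new cluster
    return clusters


# Invented categorical instances, for illustration only.
data = [
    ("F", "Yes", "No"),
    ("F", "Yes", "Yes"),
    ("M", "No", "No"),
    ("M", "No", "Yes"),
    ("F", "No", "No"),
]

for threshold in (10, 55, 90):
    print(threshold, len(incremental_cluster(data, threshold)))
```

On this toy data the number of clusters grows as the threshold rises, which mirrors the behavior described for the instance similarity setting.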

#6 A message box indicates that eight clusters were formed. This tells us the data has been successfully mined.

#6, #7 (p.116) As a general rule, an unsupervised clustering of more than five or six clusters is likely to be less than optimal.

#8 and #9: Repeat steps 1-4. For step 5, set the similarity value to 55.

Re-rule feature
– Minimum rule correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
– Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
– Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
– Covering set rules: RuleMaker will generate a set of best-defining rules for each class.
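
The two numeric thresholds above act as filters on candidate rules. The sketch below shows that filtering under stated assumptions: the `Rule` class, the `filter_rules` function, and the candidate rules with their percentages are all hypothetical, and RuleMaker's actual internals are not described in the slides.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    text: str
    accuracy: float  # percent of covered instances that belong to the class
    coverage: float  # percent of class instances the rule covers


def filter_rules(rules, min_correctness=80.0, min_coverage=10.0):
    """Keep rules whose error rate is at most (100 - min_correctness) percent
    and that cover at least min_coverage percent of their class."""
    return [
        r for r in rules
        if r.accuracy >= min_correctness and r.coverage >= min_coverage
    ]


# Hypothetical candidate rules with invented accuracy and coverage values.
candidates = [
    Rule("rule A", accuracy=95.0, coverage=40.0),
    Rule("rule B", accuracy=75.0, coverage=60.0),  # error rate 25%: too high at 80
    Rule("rule C", accuracy=100.0, coverage=5.0),  # covers too few instances at 10
]

for rule in filter_rules(candidates, min_correctness=80, min_coverage=10):
    print(rule.text)
```

Lowering the minimum correctness to 70 would also admit rule B, which illustrates the usual trade-off between rule accuracy and the number of rules produced.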

#10 (p.117) Set the minimum rule coverage. The rule settings are:
– Minimum rule correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
– Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
– Attribute significance: values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.

A Production Rule for the Credit Card Promotion Database
IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: %
Rule Coverage: 66.67%
Question: Can we assume that two-thirds of all females in the specified age range will take advantage of the promotion?
Rule accuracy is a between-class measure. Rule coverage is a within-class measure.
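
The distinction between the two measures is easier to see when both are computed. In the sketch below, accuracy is the share of instances matching the rule's condition that fall in the target class, and coverage is the share of the target class that matches the condition. The eight rows are invented for illustration and are not the credit card promotion data, so the resulting percentages differ from those on the slide; `rule_stats` is a hypothetical helper.

```python
def rule_stats(instances, antecedent, target_attr, target_value):
    """Return (accuracy, coverage) in percent for an IF-THEN rule."""
    matched = [r for r in instances if antecedent(r)]
    in_class = [r for r in instances if r[target_attr] == target_value]
    both = [r for r in matched if r[target_attr] == target_value]
    # Accuracy looks across classes: of everything the rule fires on, how much is right?
    accuracy = 100.0 * len(both) / len(matched) if matched else 0.0
    # Coverage looks within the class: how much of the class does the rule reach?
    coverage = 100.0 * len(both) / len(in_class) if in_class else 0.0
    return accuracy, coverage


# Invented rows, for illustration only.
rows = [
    {"sex": "F", "age": 35, "promo": "Yes"},
    {"sex": "F", "age": 40, "promo": "Yes"},
    {"sex": "F", "age": 27, "promo": "Yes"},
    {"sex": "F", "age": 30, "promo": "No"},
    {"sex": "M", "age": 30, "promo": "Yes"},
    {"sex": "M", "age": 45, "promo": "Yes"},
    {"sex": "M", "age": 22, "promo": "No"},
    {"sex": "F", "age": 50, "promo": "No"},
]


def is_female_19_to_43(r):
    return r["sex"] == "F" and 19 <= r["age"] <= 43


accuracy, coverage = rule_stats(rows, is_female_19_to_43, "promo", "Yes")
print(f"accuracy {accuracy:.2f}%  coverage {coverage:.2f}%")
```

Read this way, a coverage figure describes the fraction of promotion takers who satisfy the rule's condition, not the fraction of people satisfying the condition who take the promotion; the latter is what accuracy measures.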

Output Reports: Unsupervised Clustering
– RES SUM: This sheet contains summary statistics about attribute values and offers several heuristics to help us determine the quality of a data mining session.
– RES CLS: This sheet has information about the clusters formed as a result of an unsupervised mining session.
– RUL TYP: Instances are listed by their cluster number. The typicality of instance i is the average similarity of i to the other members of its cluster.
– RES RUL: This sheet contains the production rules generated for each cluster.

#10 (p.117) Set minimum rule coverage at 30

Figure 4.7 – Rules for the credit card promotion database

Step 3: Read and Interpret Summary Results (p.117) (Sheet1 RES SUM)
– Class Resemblance Scores (RES)
– Domain Resemblance Score
– Domain Predictability

Step 3: Read and Interpret Summary Results (p.119)
Similarity value (within the class): in general, the within-class RES scores should be higher than the domain RES. This should be true for most of the classes. Instances of Class 1 have the best within-class fit.
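
The heuristic above amounts to comparing each class score with the domain score. A minimal sketch, assuming the scores have already been read off the RES SUM sheet; the numeric values and the helper name `classes_not_above_domain` are hypothetical.

```python
def classes_not_above_domain(class_res, domain_res):
    """Return the classes whose resemblance score fails to exceed the domain score."""
    return [name for name, score in class_res.items() if score <= domain_res]


# Invented scores for illustration; real values come from the RES SUM sheet.
class_res = {"Class 1": 0.78, "Class 2": 0.61, "Class 3": 0.49}
domain_res = 0.52

weak = classes_not_above_domain(class_res, domain_res)
print(weak)  # classes that deserve a closer look
print(len(weak) < len(class_res) / 2)  # True when most classes beat the domain score
```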

Figure Step 3: Read and Interpret Summary Results (cont.)

Step 3: Read and Interpret Summary Results (cont.)
Figure: Statistics for numerical attributes and common categorical attribute values

Step 4: Read and Interpret Individual Class Results (p.121) (Sheet1 RES CLS)
Typicality
– The average similarity of an instance to all other members of its cluster or class.
Class Predictability (a within-class measure)
– The percent of class instances having a particular value for a categorical attribute.
Class Predictiveness (a between-class measure)
– The probability that an instance resides in a specified class, given that the instance has the value for the chosen attribute.
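
Class predictability and class predictiveness are two conditional frequencies taken in opposite directions. The sketch below computes both on a small invented table; the function names, the `class` and `cc_ins` keys, and the rows are assumptions made for illustration.

```python
def class_predictability(instances, cls, attr, value, class_attr="class"):
    """Within-class: fraction of the class's instances that have attr == value."""
    members = [r for r in instances if r[class_attr] == cls]
    if not members:
        return 0.0
    return sum(r[attr] == value for r in members) / len(members)


def class_predictiveness(instances, cls, attr, value, class_attr="class"):
    """Between-class: fraction of instances with attr == value that are in the class."""
    having = [r for r in instances if r[attr] == value]
    if not having:
        return 0.0
    return sum(r[class_attr] == cls for r in having) / len(having)


# Invented rows: five instances in class "Yes", four in class "No".
rows = (
    [{"class": "Yes", "cc_ins": "Yes"}] * 2
    + [{"class": "Yes", "cc_ins": "No"}] * 3
    + [{"class": "No", "cc_ins": "Yes"}] * 1
    + [{"class": "No", "cc_ins": "No"}] * 3
)

print(class_predictability(rows, "Yes", "cc_ins", "Yes"))  # share of class Yes with cc_ins = Yes
print(class_predictiveness(rows, "Yes", "cc_ins", "Yes"))  # share of cc_ins = Yes rows in class Yes
```

A predictiveness of 1.0 would mean the attribute value is sufficient for class membership, while a predictability of 1.0 would mean it is necessary.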

Figure 4.10 – Class 3 Summary Results (annotations: within-class, between-class)

Figure 4.11 – Necessary and sufficient attribute values for Class 3

Step 5: Visualize Individual Class Rules
IF life ins Promo = Yes
THEN Class = 3
:rule accuracy 77.78%
:rule coverage %

A Six-Step Approach for Supervised Learning
Step 1: Choose an Output Attribute
– Launch a fresh life insurance promotion
Step 2: Perform the Mining Session
Step 3: Read and Interpret Summary Results
Step 4: Read and Interpret Test Set Results
Step 5: Read and Interpret Class Results
Step 6: Visualize and Interpret Class Rules

Step 2: Perform the Mining Session
O: Output; D: Display-Only
Filename: CreditCardPromotion-supervised.xls

Step 2 (#4): Select the number of instances for training and a real-valued tolerance setting (p.127)

Step 3 – Read and Interpret Summary Results
Domain statistics for categorical attributes tell us that 80% of the training instances represent individuals without credit card insurance.

Step 3 – Read and Interpret Summary Results (cont.)

Step 4 – Read and Interpret Test Set Results

Step 5 – Read and Interpret Results for Individual Classes (p.130)

Sheet1 RUL TYP
In class Yes (Life Ins. Promo), 40% (2/5) of the instances have Credit Card Ins = Yes.

Step 6 – Visualize and Interpret Class Rules (p.130)
Re-rule feature
– Minimum rule correctness (50-100): if set to 80, the rules generated must have an error rate less than or equal to 20%.
– Minimum rule coverage (10-100): if set to 10, RuleMaker will generate rules that cover 10% or more of the instances in each class.
– Attribute significance (start with 80-90): values close to 100 allow RuleMaker to consider only those attribute values most highly predictive of class membership for rule generation.
– Covering set rules: RuleMaker will generate a set of best-defining rules for each class.

4.6 Techniques for Generating Rules
1. Define the scope of the rules.
2. Choose the instances.
3. Set the minimum rule correctness.
4. Define the minimum rule coverage.
5. Choose an attribute significance value.

4.7 Instance Typicality
Typicality scores:
– Identify prototypical and outlier instances.
– Select a best set of training instances.
– Are used to compute individual instance classification confidence scores.
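
Ranking instances by typicality is one way to surface prototypes and outliers. The sketch below uses the definition given earlier (average similarity of an instance to the other members of its cluster) with an assumed similarity measure, the fraction of matching categorical attribute values; ESX's actual similarity function is not specified in the slides, and the cluster rows are invented.

```python
def similarity(a, b):
    """Assumed measure: fraction of attribute positions on which a and b agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def typicality(index, cluster):
    """Average similarity of cluster[index] to every other member of the cluster."""
    others = [m for j, m in enumerate(cluster) if j != index]
    if not others:
        return 1.0  # a lone instance is trivially typical of its own cluster
    return sum(similarity(cluster[index], m) for m in others) / len(others)


# Invented cluster of categorical instances.
cluster = [
    ("F", "Yes", "No"),
    ("F", "Yes", "No"),
    ("F", "Yes", "Yes"),
    ("F", "No", "No"),
    ("M", "No", "Yes"),
]

# Sort from most typical (prototype candidates) to least typical (outlier candidates).
ranking = sorted(
    ((inst, typicality(i, cluster)) for i, inst in enumerate(cluster)),
    key=lambda pair: pair[1],
    reverse=True,
)
for inst, score in ranking:
    print(inst, round(score, 3))
```

In this toy cluster the last-ranked instance shares few values with the rest, which is the kind of instance a typicality score is meant to flag.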

Figure 4.13 – Instance Typicality

Special Considerations and Features
Avoid Mining Delays
The Quick Mine Feature
– Supervised learning with more than 2000 training set instances: the user is asked whether to apply the “quick mine” feature.
– Unsupervised clustering with more than 2000 data instances: ESX is given a random selection of 500 instances.
Erroneous and Missing Data
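
The quick mine behavior described above can be pictured as random subsampling once a dataset passes a size threshold. The sketch below is an illustration only: the function name `quick_mine_sample` and its parameters are hypothetical, and how iDA actually draws its sample is not stated in the slides.

```python
import random

QUICK_MINE_THRESHOLD = 2000  # instance count above which quick mine applies, per the slide
QUICK_MINE_SAMPLE = 500      # size of the random selection, per the slide


def quick_mine_sample(instances, use_quick_mine=True, seed=None):
    """Return all instances, or a random selection of 500 when quick mine applies."""
    instances = list(instances)
    if not use_quick_mine or len(instances) <= QUICK_MINE_THRESHOLD:
        return instances
    rng = random.Random(seed)  # a seed makes the selection repeatable
    return rng.sample(instances, QUICK_MINE_SAMPLE)


large = range(3000)
small = range(1500)
print(len(quick_mine_sample(large, seed=0)))  # sampled down
print(len(quick_mine_sample(small)))          # left unchanged
```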

Homework
Use ESX (and iDA) to perform a supervised data mining session using the CardiologyCategorical.xls data file. Save the output file as CardiologyCategorical-supervised.xls.
Lab #4 (p.141)
Turn in:
– 1. Spreadsheet file (CardiologyCategorical-supervised.xls) that contains the outcome of the data mining session
– 2. Word file that includes (and explains) answers to all questions (a. through n.)