Introduction to Data Mining with XLMiner


Introduction to Data Mining with XLMiner Business Intelligence

Where Data Mining fits in

Getting started with Data Mining
- Preparing for Data Mining
- Types of DM methods
  - Descriptive
  - Predictive
  - Prescriptive

Data Mining Methods in XLMiner
- Data Preparation and Exploration
  - Data Preparation
  - Data Visualization
  - Dimension Reduction
- Prediction
  - Linear Regression
  - K-nearest Neighbors
  - Regression Trees
  - Neural Nets
- Time Series Forecasting
  - Smoothing Methods
  - Regression Based
- Classification
  - Naïve Bayes
  - K-nearest Neighbors
  - Classification Trees
  - Logistic Regression
  - Discriminant Analysis
- Affinity Analysis
  - Association Rules
- Segmentation
  - Cluster Analysis

First Part of DM: Data
- Data Selection
  - Operational data
  - External sources
- Preprocessing
  - Remove duplicates
  - Remove common errors or noisy data
  - Domain consistency
  - Coding, etc.
- ETL and data mining tools can help

Details on Data Preprocessing
- Enrichment
  - Adding additional data
  - Sometimes only to a subset of the data
  - Need to organize the data to enable adding the additional values
- Coding
  - Modify the data to make it work more effectively without losing the meaning
  - Take out some detail while retaining relative value
  - Categorical data to dummy variables
  - Categorize
    - Address to region
    - Birth date to age
  - Scale appropriately
    - Divide by 1000 for $ values where appropriate
  - Binary attributes
    - Yes-No to 0-1
  - Time series
    - Convert date to month numbers starting from a fixed point (1900)
- Dimension Reduction
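
The coding steps on this slide can be sketched in a few lines of Python. This is only an illustration, not XLMiner's own procedure: the function names and the sample record are hypothetical, and the month numbering (January 1900 counted as month 1) is an assumed convention, since the slide only says to start from a fixed point.

```python
from datetime import date


def age_in_years(birth: date, as_of: date) -> int:
    """Birth date to age: whole years completed by the as_of date."""
    years = as_of.year - birth.year
    # Subtract one year if the birthday has not yet occurred in as_of's year.
    if (as_of.month, as_of.day) < (birth.month, birth.day):
        years -= 1
    return years


def month_number(d: date, base_year: int = 1900) -> int:
    """Date to month number; January of base_year is month 1 (assumed convention)."""
    return (d.year - base_year) * 12 + d.month


def yes_no_to_binary(value: str) -> int:
    """Binary attribute: 'Yes' becomes 1, anything else becomes 0."""
    return 1 if value.strip().lower() == "yes" else 0


def to_thousands(amount: float) -> float:
    """Scale a dollar amount to units of $1000."""
    return amount / 1000


# Hypothetical record, used only to show the transformations together.
record = {"birth": date(1980, 6, 15), "income": 84000,
          "online": "Yes", "opened": date(2001, 3, 1)}
coded = {
    "age": age_in_years(record["birth"], date(2020, 6, 14)),
    "income_k": to_thousands(record["income"]),
    "online": yes_no_to_binary(record["online"]),
    "opened_month": month_number(record["opened"]),
}
```

Each transformation drops some detail (exact birthday, exact dollars) while keeping the relative value, which is the point of the coding step.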

Dealing with Data: Excel and XLMiner
- Transformation
  - Flat file (CSV, delimited, etc.)
  - Various functions in Excel (Text, Date, etc.)
- Missing data identification
  - Excel functions: COUNT, COUNTIF, and COUNTBLANK
- Absurd data
  - Date of birth as 01-01-01
  - Phone number as 123-456-7890, etc.
  - Needs review and sometimes domain knowledge
  - Pivot tables and conditional formatting
- Outlier identification
  - XLMiner: box plots
  - Excel: descriptive statistics, histograms
- Normalization
  - Needed when you have mixed data and the actual scale does not matter (income need not be reported to the exact dollar; units of $1000 work just as well)
  - Essentially the creation of a z-score: (X - mean) / sigma
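
As a rough illustration of two items on this slide, the sketch below counts blank cells in the spirit of Excel's COUNTBLANK and computes z-scores as (X - mean) / sigma. It assumes sigma is the population standard deviation; the slide does not say which one is meant, so swap in statistics.stdev if the sample version is wanted. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def count_blank(cells):
    """Count cells that are missing (None) or empty strings."""
    return sum(1 for c in cells if c is None or c == "")


def z_scores(values):
    """Normalize values to z-scores: (x - mean) / sigma.

    Assumption: sigma is the population standard deviation (pstdev).
    """
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        raise ValueError("all values are identical; z-score is undefined")
    return [(v - m) / s for v in values]


# Hypothetical column with two missing entries.
raw_income = [40, None, 50, "", 60]
n_missing = count_blank(raw_income)

# Normalize only the cells that actually hold numbers.
present = [v for v in raw_income if v not in (None, "")]
normalized = z_scores(present)
```

After normalization the values have mean 0 and standard deviation 1, so variables measured on very different scales can be compared directly.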

Changing the Categorical Variables
- Create dummy variables
  - Depending on the categories, you may have to create a lot of dummy variables
  - Rule: the number of dummy variables is at most the number of nominal categories. Usually the last dummy is not needed, as it is determined by the absence of the other categorical values
  - Example: Color = Red, Blue, and Green. Start with three dummies, XR, XB, and XG
    - For Red the values will be 1, 0, 0
    - For Blue the values will be 0, 1, 0
    - For Green the values will be 0, 0, 1
  - The dummy for Green is redundant: values of 0, 0 for the first two variables already indicate Green (assuming only three colors), so XR and XB are enough
- For ordinal categorical variables, use one variable and give it values based on the order of the categories
  - Example: Grades can be F, D, C, B, and A; the possible values of XGrade are 0, 1, 2, 3, and 4
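
The color and grade examples can be written out as a short Python sketch. The function names are hypothetical; the encoding follows the slide, with the last nominal category (Green) left as the all-zeros baseline and the grades mapped to their position in the order.

```python
COLORS = ["Red", "Blue", "Green"]     # nominal: order carries no meaning
GRADES = ["F", "D", "C", "B", "A"]    # ordinal: listed from lowest to highest


def dummy_code(value, categories):
    """Return k-1 dummies for a nominal value; the last category is all zeros."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories[:-1]]


def ordinal_code(value, ordered_categories):
    """Return a single integer giving the value's position in the order."""
    return ordered_categories.index(value)


# XR and XB for each color; Green needs no dummy of its own.
color_table = {c: dummy_code(c, COLORS) for c in COLORS}

# XGrade for each grade.
grade_table = {g: ordinal_code(g, GRADES) for g in GRADES}
```

Here color_table maps Red to [1, 0], Blue to [0, 1], and Green to [0, 0], while grade_table runs from F = 0 to A = 4.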

Examples of Creating Dummy Variables
- Experience values: 1, 2, 3, or 4
  - Dummies are: 000 = 1, 100 = 2, 010 = 3, 001 = 4
- Education values: Bachelors, Masters, and PhD
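
The Experience coding above treats the first value as the baseline (all zeros) and gives each later value its own dummy. A minimal sketch of that pattern follows; the function name is hypothetical, and applying the same pattern to Education is an assumption, since the slide lists the Education values without giving their codes.

```python
def baseline_first_dummies(categories):
    """Map each category to a string of k-1 dummy digits.

    The first category is the baseline (all zeros); category i > 0
    gets a 1 in position i - 1.
    """
    k = len(categories)
    table = {}
    for i, category in enumerate(categories):
        bits = ["0"] * (k - 1)
        if i > 0:
            bits[i - 1] = "1"
        table[category] = "".join(bits)
    return table


# Reproduces the Experience coding from the slide.
experience = baseline_first_dummies([1, 2, 3, 4])

# Same pattern applied to Education (assumed, not stated on the slide).
education = baseline_first_dummies(["Bachelors", "Masters", "PhD"])
```

Under this assumption, Education needs two dummies: Bachelors = 00, Masters = 10, PhD = 01.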

XLMiner Data Preparation
- Dummy variables
  - XLMiner -> Data Utilities -> Transform Categorical Data
  - Try it with the Universal Bank data, using Education and Family
- Outliers
  - Histogram (needs binning)
  - Box plot (under charting Data)
  - Can also be done in Excel
  - Rule: find the outright outliers first, then cut off the values that are asymmetric on either side
  - The box plot and histogram show the skewness of the distribution
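
One common way to turn a box plot into an outlier rule is to flag values more than 1.5 interquartile ranges beyond the quartiles. The sketch below uses that convention as an assumption; the slides do not state which cutoff XLMiner's box plot applies, and the function name and sample values are hypothetical.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying beyond the box-plot whiskers.

    Assumption: whiskers extend `whisker` interquartile ranges
    below the first quartile and above the third quartile.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low = q1 - whisker * iqr
    high = q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical sample with one obvious outlier.
sample = [10, 12, 13, 14, 15, 16, 18, 95]
flagged = boxplot_outliers(sample)
```

Flagged values still need review before removal; as with absurd data, domain knowledge decides whether a point is an error or a genuine extreme.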

Data Mining Data Groups
- Given a collection of records (training set), a model is created for a particular prediction or classification
- Goal: previously unseen records should be predicted or classified as accurately as possible
- A validation set is used to determine the accuracy of the model
- Usually, the given data set is divided into training and validation sets, with the training set used to build the model and the validation set used to validate it
- Finally, a previously unseen data group is used to measure final model performance (called the Test set in XLMiner)

XLMiner: Opening and Partitioning Datasets
- The demo version of the program only allows 600 rows. We will use the educational version to go beyond that.
- Partitioning of the data
  - Training: used for building the model
  - Validation: used for validating the quality of the model
  - Test: used for testing the model, especially for algorithms that may use both training and validation data recursively to build the model
- Oversampling
  - Employed for sparse record sets
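
The three-way partition can be sketched as a random split. The 50/30/20 proportions, the seed, and the function name below are illustrative assumptions, not XLMiner defaults taken from the slides.

```python
import random


def partition(rows, train=0.5, validation=0.3, seed=12345):
    """Randomly split rows into training, validation, and test lists.

    Whatever is left after the training and validation shares
    becomes the test set.
    """
    if train <= 0 or validation < 0 or train + validation > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * validation)
    training = shuffled[:n_train]
    validating = shuffled[n_train:n_train + n_valid]
    testing = shuffled[n_train + n_valid:]
    return training, validating, testing


# Hypothetical data set of 100 record IDs.
records = list(range(100))
training, validating, testing = partition(records)
```

Every record lands in exactly one partition, so the test set stays unseen while the model is built and tuned.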

XLMiner Outputs
- Module dependent
- Typically has a summary report
- Has the option of full reports
- All outputs are Excel worksheets, so they can be used for further calculations
- Includes Lift Charts and Decile Lift Charts
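
A decile lift chart is commonly built by ranking records by predicted score, cutting the ranking into ten equal groups, and dividing each group's average response by the overall average. The sketch below follows that common definition; the slides do not show XLMiner's exact calculation, so treat this as an assumption. The function name and data are hypothetical.

```python
def decile_lift(scores, actuals, n_bins=10):
    """Return the lift for each bin, highest-scored bin first.

    Lift = (mean actual response in the bin) / (mean actual response overall).
    Assumes at least n_bins records and a non-zero overall response.
    """
    if len(scores) != len(actuals):
        raise ValueError("scores and actuals must have the same length")
    n = len(scores)
    if n < n_bins:
        raise ValueError("need at least one record per bin")
    # Rank the actual outcomes by predicted score, best first.
    pairs = sorted(zip(scores, actuals), key=lambda p: p[0], reverse=True)
    ranked = [actual for _score, actual in pairs]
    overall = sum(ranked) / n
    if overall == 0:
        raise ValueError("overall response rate is zero; lift is undefined")
    lifts = []
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        lifts.append((sum(chunk) / len(chunk)) / overall)
    return lifts


# Hypothetical example: 20 records, the 4 highest scores are the responders.
scores = list(range(20))
actuals = [1 if s >= 16 else 0 for s in scores]
lifts = decile_lift(scores, actuals)
```

A lift above 1 in the first deciles means the model concentrates responders at the top of its ranking; a lift of 1 everywhere would be no better than random selection.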