Download presentation
1
Introduction to Data Mining with XLMiner
Business Intelligence
2
Where Data Mining fits in
3
Getting started with Data Mining
Preparing for Data Mining Types of DM methods Descriptive Predictive Prescriptive
4
Data Mining Methods in XLMiner
Data Preparation and Exploration Data Preparation Data Visualization Dimension Reduction Prediction Linear Regression K-nearest Neighbors Regression Trees Neural Nets Time Series Forecasting Smoothing Methods Regression Based Classification Naïve Bayes K-nearest Neigbors Classification Trees Logistic Regression Discriminant Analysis Affinity Analysis Association Rules Segmentation Cluster Analysis
5
First Part of DM: Data Data Selection Preprocessing Operational Data
External sources Preprocessing Remove duplicates Remove common errors or noisy data Domain Consistency Coding etc. ETL and Data Mining Tools can help
6
Details on Data Preprocessing
Enrichment Adding additional data Sometimes only to a subset of the data Need to organize the data to enable adding the additional values Coding Modify the data to make it work more effectively Without losing the meaning Take out some detail while retaining relative value Categorical data to dummy variables Categorize Address to region Birth date to age Scale appropriately Divide by 1000 for $ values where appropriate Binary Attributes Yes-No to 0-1 Time Series Convert date to month numbers starting from a fixed point (1900) Dimension Reduction
7
Dealing with Data: Excel and XLMiner
Transformation Flat file (CSV, Delimited, etc) Various Functions in Excel (Text, Date, etc.) Missing Data identification Excel functions Count and Countif Countblank Absurd Data Date of Birth as Phone number as , etc. Needs review and sometime domain knowledge Pivot Tables and conditional formatting Outliers Identification XLMiner Boxplots Excel Descriptive Statistics Histograms Normalization Needed when you have mixed data and the actual scale does not matter (you do not have to have the income reported to the exact dollars, it is as good when reported in units of $1000) Essentially creation of z-score [ (X-mean)/sigma]
8
Changing the Categorical Variables
Create Dummy Variables Depending on the categories, you may have to create a lot of dummy variables Rule: Total number of dummy variables should be equal to the nominal categories. Usually the last dummy is not needed as it is determined by the absence of other categorical values Example: Color = Red, Blue and Green. We will need three dummies, XR, XB, and XG For Red the values will be 1, 0, 0 For Blue the values will be 0, 1, 0 And for Green the values will be 0, 0, 1 As you can see, we could have done without the dummy for the green as the value of 0, 0 for the first two variables will automatically indicate green (assuming only three colors), so XR and XB would have been enough. For ordinal categorical variables, use one dummy and give it various values based on the order of the categories. Example: Grades can be F, D, C, B and A Possible values of XGrade will be 0, 1, 2, 3, and 4.
9
Examples of Creating Dummy
Experience values: 1, 2, 3 or 4 Dummies are: 000 = 1, 100 = 2, = 3, 001 = 4 Education Values are: Bachelors, Masters and PhD
10
XLMiner Data Preparation
Dummy Variables XLMiner-> Data Utilities->Transform Categorical Data Try it with Universal Bank Data with Education and Family Outliers Histogram (needs Binning) Box Plot (under charting Data) Can also be done in Excel Rule is to find the outright outliers and then cutting off the values which are asymmetric on either side Box Plot and Histogram tell you the skewness of the distribution
11
Data Mining Data Groups
Given a collection of records (training set ) A model is created for a particular prediction or classification Goal: previously unseen records should be predicted or classified as accurately as possible. A validation/test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Finally, a previously unseen data group is used for final model performance (called Test set in XLMiner)
12
XLMiner: Opening and Partitioning Datasets
The demo version of the program only allows 600 rows. We will use the educational version to go beyond that. Partitioning of the data Training Used for building the model Validation Used for validating the quality of the model Test Used for testing the model, specially for the algorithms that may use both training and validation data recursively to build the model Oversampling Employed for spare record sets.
13
XL Miner Outputs Module dependent Typically has a summary report
Has the option of full reports All Excel worksheets, so can be used for further calculations Includes Lift Charts and Decile Lift Charts
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.