Introduction to Data Mining with XLMiner Business Intelligence
Where Data Mining fits in
Getting started with Data Mining Preparing for Data Mining Types of DM methods Descriptive Predictive Prescriptive
Data Mining Methods in XLMiner Data Preparation and Exploration Data Preparation Data Visualization Dimension Reduction Prediction Linear Regression K-nearest Neighbors Regression Trees Neural Nets Time Series Forecasting Smoothing Methods Regression Based Classification Naïve Bayes K-nearest Neigbors Classification Trees Logistic Regression Discriminant Analysis Affinity Analysis Association Rules Segmentation Cluster Analysis
First Part of DM: Data Data Selection Preprocessing Operational Data External sources Preprocessing Remove duplicates Remove common errors or noisy data Domain Consistency Coding etc. ETL and Data Mining Tools can help
Details on Data Preprocessing Enrichment Adding additional data Sometimes only to a subset of the data Need to organize the data to enable adding the additional values Coding Modify the data to make it work more effectively Without losing the meaning Take out some detail while retaining relative value Categorical data to dummy variables Categorize Address to region Birth date to age Scale appropriately Divide by 1000 for $ values where appropriate Binary Attributes Yes-No to 0-1 Time Series Convert date to month numbers starting from a fixed point (1900) Dimension Reduction
Dealing with Data: Excel and XLMiner Transformation Flat file (CSV, Delimited, etc) Various Functions in Excel (Text, Date, etc.) Missing Data identification Excel functions Count and Countif Countblank Absurd Data Date of Birth as 01-01-01 Phone number as 123-456-7890, etc. Needs review and sometime domain knowledge Pivot Tables and conditional formatting Outliers Identification XLMiner Boxplots Excel Descriptive Statistics Histograms Normalization Needed when you have mixed data and the actual scale does not matter (you do not have to have the income reported to the exact dollars, it is as good when reported in units of $1000) Essentially creation of z-score [ (X-mean)/sigma]
Changing the Categorical Variables Create Dummy Variables Depending on the categories, you may have to create a lot of dummy variables Rule: Total number of dummy variables should be equal to the nominal categories. Usually the last dummy is not needed as it is determined by the absence of other categorical values Example: Color = Red, Blue and Green. We will need three dummies, XR, XB, and XG For Red the values will be 1, 0, 0 For Blue the values will be 0, 1, 0 And for Green the values will be 0, 0, 1 As you can see, we could have done without the dummy for the green as the value of 0, 0 for the first two variables will automatically indicate green (assuming only three colors), so XR and XB would have been enough. For ordinal categorical variables, use one dummy and give it various values based on the order of the categories. Example: Grades can be F, D, C, B and A Possible values of XGrade will be 0, 1, 2, 3, and 4.
Examples of Creating Dummy Experience values: 1, 2, 3 or 4 Dummies are: 000 = 1, 100 = 2, 010 = 3, 001 = 4 Education Values are: Bachelors, Masters and PhD
XLMiner Data Preparation Dummy Variables XLMiner-> Data Utilities->Transform Categorical Data Try it with Universal Bank Data with Education and Family Outliers Histogram (needs Binning) Box Plot (under charting Data) Can also be done in Excel Rule is to find the outright outliers and then cutting off the values which are asymmetric on either side Box Plot and Histogram tell you the skewness of the distribution
Data Mining Data Groups Given a collection of records (training set ) A model is created for a particular prediction or classification Goal: previously unseen records should be predicted or classified as accurately as possible. A validation/test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it. Finally, a previously unseen data group is used for final model performance (called Test set in XLMiner)
XLMiner: Opening and Partitioning Datasets The demo version of the program only allows 600 rows. We will use the educational version to go beyond that. Partitioning of the data Training Used for building the model Validation Used for validating the quality of the model Test Used for testing the model, specially for the algorithms that may use both training and validation data recursively to build the model Oversampling Employed for spare record sets.
XL Miner Outputs Module dependent Typically has a summary report Has the option of full reports All Excel worksheets, so can be used for further calculations Includes Lift Charts and Decile Lift Charts