Advanced Analytics Using Enterprise Miner


Advanced Analytics Using Enterprise Miner: A Primer for the Predictive Modeler Certification

Importing Data: PVA97NK
Metadata Advisor Options: choose "Advanced → Customize".
Set Class Levels Count Threshold to 2.
Set Reject Levels Count Threshold to 100.
Change the role of Target_D to Rejected.

Exploring Data
Right-click the dataset in the left side panel and choose Explore. See the options for Sampling Method.
Choose Plot, then Histogram. Set the Role of DemAge to X.
Right-click the graph and open Graph Properties. Change the number of bins to 87 and observe the histogram.
Then include a bin for missing values and resize the window to be smaller.
Create a pie chart with Target_B (set its Role to Category).
See how you can interactively select parts of the population from either chart.

See Histograms for All Variables
Drag the dataset to the diagram.
Right-click the dataset, highlight all variables, and click Explore.
Look further into the DemMedIncome variable: add more bins to the histogram.
Note the values at 0. We'll want to change these 0 income values to missing.
In the Explore window you can also retrieve some basic descriptive statistics, including percent missing.
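
Outside the Explore window, a similar first look can be sketched in SAS code. This is a minimal sketch, not what Enterprise Miner runs internally. It assumes the PVA97NK table is available as a data set named WORK.PVA97NK (a hypothetical name), and the bin count of 50 is arbitrary.

```sas
/* Hypothetical input: WORK.PVA97NK holds the imported PVA97NK data. */

/* Basic descriptive statistics, including the count of missing values. */
proc means data=work.pva97nk n nmiss mean median min max;
  var DemAge DemMedIncome;
run;

/* Histogram of DemMedIncome with more bins, to make the values at 0 visible. */
proc sgplot data=work.pva97nk;
  histogram DemMedIncome / nbins=50;
run;
```

Dividing NMISS by the total row count (N + NMISS) gives the percent missing for each variable.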

MODIFY Tab
This tab has nodes that modify the columns of a dataset.

MODIFY: Replacement Node
Can be used to replace certain values of a variable (usually extreme values) with a specified replacement value.
The income variable is interval, so focus on that portion of the Properties Panel.
Change Default Limits Method to ‘None’ (so nothing else gets changed).
Change Replacement Values to ‘Missing’.
Click the ellipsis next to Replacement Editor.
Change the method to ‘User-Specified’ for DemMedIncome.
For Replacement Lower Limit, enter the number 1.
A new variable is created: all values of DemMedIncome that fall below 1 are set to missing, and all other values do not change.
Run the node, then in the Properties Panel click the Exported Data ellipsis and explore the histogram for the generated variable. Include a missing bin in the histogram.
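
The effect of this replacement rule can be sketched as a plain DATA step. This is an illustration only: the data set names WORK.PVA97NK and WORK.PVA_REPLACED are hypothetical, and the REP_ prefix on the new variable is an assumed naming convention rather than something stated in the slides.

```sas
/* Hypothetical data set names. REP_ prefix is an assumption. */
data work.pva_replaced;
  set work.pva97nk;
  /* Copy the original value, then blank out anything below the lower limit of 1. */
  REP_DemMedIncome = DemMedIncome;
  if DemMedIncome < 1 then REP_DemMedIncome = .;
run;
```

The original DemMedIncome column is left untouched, so the before and after histograms can be compared side by side.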

Regression Modelling
Let’s build a regression model to predict a binary target.
We’ll split our data into training and validation first.
Then we’ll need to take care of missing values using the Impute node.

SAMPLE Tab
This tab has nodes that modify the rows of a dataset.

SAMPLE: Data Partition Node
Splits data into specified proportions of Training/Validation/Test data.
Connect the Data Partition node to the Replacement node and specify:
65% Training
35% Validation
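
For comparison, a 65/35 split can be sketched in code with PROC SURVEYSELECT. This sketch uses simple random sampling; the node's own partitioning method may differ (for example, it may stratify on the target). The data set names and the seed value are hypothetical.

```sas
/* Hypothetical input WORK.PVA_REPLACED; seed chosen arbitrarily for repeatability. */
proc surveyselect data=work.pva_replaced out=work.pva_part
                  method=srs samprate=0.65 seed=12345 outall;
run;

/* OUTALL keeps every row and adds Selected = 1 for sampled rows, 0 otherwise. */
data work.train work.valid;
  set work.pva_part;
  if Selected = 1 then output work.train;   /* about 65% of rows */
  else output work.valid;                   /* remaining 35% */
run;
```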

MODIFY: Impute Node
After the Data Partition node, connect the Impute node.
Take a look at the Properties Panel and explore the defaults. The panel is split into Class Variables and Interval Variables, and input and target variables are specified separately.
Use median imputation for interval (numeric) variables and tree imputation for class variables.
Under the Score section, create binary indicator variables that show where a value has been imputed.
Unique indicators are created for each variable.
Single indicators are created for each observation (1 if anything was imputed).
Set their Role to Input to include them in the modeling process.
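
Median imputation with per-variable indicators can be sketched as follows. This covers only the interval-variable part on the training data; tree imputation for class variables is not reproduced. The data set names, the M_ indicator names, and the REP_DemMedIncome variable are assumptions made for illustration.

```sas
/* Hypothetical input WORK.TRAIN. Flag missing values before they are filled in. */
data work.train_flag;
  set work.train;
  M_DemAge           = missing(DemAge);             /* 1 if DemAge will be imputed */
  M_REP_DemMedIncome = missing(REP_DemMedIncome);   /* 1 if income will be imputed */
  M_Any              = max(M_DemAge, M_REP_DemMedIncome);  /* single indicator per row */
run;

/* REPONLY replaces missing values only; METHOD=MEDIAN uses each variable's median. */
proc stdize data=work.train_flag out=work.train_imp method=median reponly;
  var DemAge REP_DemMedIncome;
run;
```

In practice the validation data should be filled with the medians computed on the training data, so that no information leaks from validation into the model.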

MODEL: Regression Node
Selection Model: Stepwise.
Change the Selection Criterion for the model to Validation Misclassification (this is how EM will optimize the complexity of the model).
Notice the panel options for including all interactions and all quadratic terms.
You can also specify certain interactions by setting User Terms to ‘Yes’ and using the Term Editor:
Choose the first variable to be entered into the interaction and click the right arrow.
Choose the second variable and click the right arrow.
Click Save to save that interaction term.
The same steps can be used to create multivariable interactions.
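
A rough code counterpart is a stepwise logistic regression in PROC LOGISTIC. Note the difference: PROC LOGISTIC enters and removes terms by significance level, whereas the node setting above picks the step with the lowest validation misclassification. The input data set, the choice of inputs, and the 0.05 thresholds are all assumptions for this sketch.

```sas
/* Hypothetical input WORK.TRAIN_IMP with imputed inputs and M_ indicators. */
proc logistic data=work.train_imp;
  /* Model the probability that Target_B equals 1. */
  model Target_B(event='1') =
        DemAge REP_DemMedIncome M_DemAge M_REP_DemMedIncome
        DemAge*REP_DemMedIncome          /* one user-specified interaction */
        / selection=stepwise slentry=0.05 slstay=0.05;
run;
```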

MODIFY: Transform Variables Node
Let’s see if we can get a better model by transforming our numeric inputs to be more normal.
Drag a Transform Variables node to the diagram and connect it to the Impute node.
For Interval Inputs, choose ‘Maximum Normal’.
Look at the results of this node to see what type of transformation was applied to each interval variable to make its distribution more normal.

ASSESS: Model Comparison Node
Which regression model worked best?
Change the selection criterion to Validation Average Squared Error (change both the Selection Statistic and the Selection Table). Do you choose the same model?
Which model has the best lift at a depth of 10%?
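
For reference, the two statistics mentioned here are usually defined as below. These are the standard textbook definitions for a binary target, assumed to match what the node reports. The symbols are introduced for this illustration: y_i is the 0/1 target, \hat{p}_i is the predicted event probability, and N is the number of validation cases.

```latex
% Average squared error on the validation data
\mathrm{ASE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{p}_i \right)^2

% Lift at depth d (for example d = 0.10): event rate among the top d fraction
% of cases ranked by \hat{p}_i, relative to the overall event rate
\mathrm{Lift}(d) = \frac{\text{event rate in the top } d \text{ fraction of cases}}
                        {\text{overall event rate}}
```

A lift of 2 at a depth of 10% means the top-scored tenth of cases contains events at twice the overall rate.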

EXPLORE: StatExplore Node
The StatExplore node gives basic univariate descriptive statistics and also statistics on the relationships of variables with the target.
Connect the StatExplore node after the Data Partition node.
See the "worth" of each interval variable and the Chi-Square score for each class variable.
In this case, try to find the Chi-Square value relating DemCluster to the target.
Hint: search the output for DemCluster.
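
A chi-square statistic for DemCluster against the target can also be sketched directly with PROC FREQ. The value may not match the StatExplore output exactly, since the node may use a different sample or binning. WORK.TRAIN is a hypothetical data set name.

```sas
/* Hypothetical input WORK.TRAIN. CHISQ requests the Pearson chi-square test. */
proc freq data=work.train;
  tables DemCluster*Target_B / chisq;
run;
```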

Variable Selection: Two Methods
EXPLORE: Variable Selection node.
MODEL: Decision Tree node. For this example, change the Subtree Method to Largest (this results in an unpruned tree).
Use both of these methods to filter variables by simply connecting each one to a model on the other side.
Connect a Neural Network to each (two Neural Network nodes). Use the default Neural Network options (1 hidden layer, 3 hidden units).
Connect all 4 models (2 Regressions, 2 Neural Networks) to a Model Comparison node.

Correcting for Prior Probabilities
Suppose the current data was oversampled to account for a rare event.
We can enter the true population proportion of events in the Decisions ellipsis on the data set properties. This is the same place we entered decision weights for profit/loss.
Set the prior of the event (Target = 1) to 0.05 and the nonevent to 0.95.
How does this affect our models?
The regressions were chosen to minimize validation misclassification. With these priors, that is achieved by calling everything a nonevent!
If we change that specification, we get better regression models.
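
The reason everything gets called a nonevent can be seen from the usual prior-correction formula. This is a standard adjustment, shown here as an assumption rather than as the exact computation Enterprise Miner performs. Here \hat{p} is the probability from the model fit on the oversampled data, \rho_1 and \rho_0 are the event and nonevent proportions in that sample, and \pi_1 = 0.05, \pi_0 = 0.95 are the true priors.

```latex
\hat{p}^{\,*} =
  \frac{\hat{p}\,\dfrac{\pi_1}{\rho_1}}
       {\hat{p}\,\dfrac{\pi_1}{\rho_1} + (1-\hat{p})\,\dfrac{\pi_0}{\rho_0}}

% Worked example, assuming for illustration a balanced sample (\rho_1 = \rho_0 = 0.5)
% and an unadjusted prediction \hat{p} = 0.7:
\hat{p}^{\,*} = \frac{0.7 \times 0.1}{0.7 \times 0.1 + 0.3 \times 1.9}
              = \frac{0.07}{0.64} \approx 0.109
```

Under these assumed sample proportions, even a case scored at 0.7 falls far below a 0.5 cutoff after adjustment, so a misclassification-based criterion ends up rewarding models that predict nonevent for every case.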

Scoring a Data Set
Let’s score a new set of observations, contained in the dataset SCOREPVA97NK.
Import the data using the same customized advanced metadata options as for the previous data (Slide 2).
Set the data Role to Score (if you miss this on import, it’s on the Properties Panel).
Drag the data to the diagram.
Drag a Score node from the ASSESS tab.
Connect the score data and the model to the Score node and run the Score node.
Examine (Browse) the Exported Data (the score table) to find predicted probabilities, etc.

SAS Code Node
You must use the macro names provided in the SAS Code node to refer to datasets.
Run PROC UNIVARIATE on the predicted probabilities:
/* &EM_IMPORT_SCORE is the macro name for the score data imported by this node. */
proc univariate data=&EM_IMPORT_SCORE;
  var P_Target_B1;
run;

SAS Code Node
Click Code Editor in the Properties Panel.
Add a new column to the scored data: a variable called name, equal to "Shaina" for all observations.
data &EM_EXPORT_TRAIN;
  set &EM_IMPORT_SCORE;
  name = "Shaina";  /* same value on every row */
run;