 Propensity Model: statistical models that predict a customer's "willingness" to perform an action, such as accepting an offer.


Introduction
 Propensity Model refers to statistical models that predict the "willingness" to perform an action, such as accepting an offer.
 The most common applications are New Customer Acquisition and Customer Development (cross-sell or up-sell).
 The model finds the probability of "willingness" to act, which enables focused marketing, i.e. "skimming the cream".
 Retail chains, hotels, airlines, banks, etc. may seek to find the customers most likely to respond to an offer.
 We will develop such a model with logistic regression in R.

Introduction: Example inputs and output
Inputs: Profession, Locality, Credit Score; Loyalty Membership, Time since acquired; Age, Income, Marital Status, Gender
Output: Propensity
Practical cases may have dozens of inputs and very large datasets, and they need significant groundwork in terms of data quality and screening of inputs.

Introduction: Log in to open RStudio.

Introduction: The RStudio layout. The command prompt (Console) is where we give instructions; objects created during the R session appear in the Environment pane, where we can view them by clicking on them; graphical results are displayed in the Plots pane.

Getting Started

Loading sample data on the response to a promotional campaign:

> data <- read.csv("Sample_data.csv")
> data <- data.frame(data)

1. read.csv is used to read CSV files in R.
2. Likewise, we have read.fwf (fixed-width files), read.table, scan, readLines, etc. to read data in different formats.
3. Data can also be read from databases, from URLs, or from HDFS (Hadoop clusters).

Creating training and validation sets:

> training <- data.frame(data[1:150, ])
> validation <- data.frame(data[151:200, ])

1. The training set is built on the first 150 records in the file and will be used to develop the model.
2. The validation set is built on the next 50 records and will be used to assess the accuracy of the model.

Data quality (missing values) check:

> sapply(training, function(x) sum(is.na(x)))

1. sapply is a standard R function, which in this case applies a function (counting missing values) to each column of the training data.
2. is.na(x) checks whether a cell has no value.
3. The function can also be defined outside sapply and called by its name.
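Since Sample_data.csv is not available here, the steps above can be sketched end-to-end on a small simulated data frame. The column names and the planted missing value are illustrative assumptions, not the course data.

```r
# Simulated stand-in for Sample_data.csv (names and values are made up)
set.seed(1)
n <- 200
data <- data.frame(
  Age      = sample(25:60, n, replace = TRUE),
  Income   = round(rnorm(n, mean = 50, sd = 15), 1),
  Response = factor(ifelse(runif(n) < 0.2, "Yes", "No"))
)
data$Income[5] <- NA  # plant one missing value so the check has something to find

# Split: first 150 rows develop the model, next 50 assess its accuracy
training   <- data.frame(data[1:150, ])
validation <- data.frame(data[151:200, ])

# Count missing values per column, as on the slide
na_counts <- sapply(training, function(x) sum(is.na(x)))
print(na_counts)  # Income shows 1; the other columns show 0
```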

The Data

200 data points in total, with one outcome (Response) and several probable inputs.

> nrow(subset(data, data$Response == "Yes"))

The result is 39, hence roughly 20% of customers responded to this campaign.

> sapply(training, function(x) sum(is.na(x)))

This yields 0 for all fields, hence there are no missing values. In case of missing values, a field is usually populated with the mean of the remaining values.

Developing the Model: Logistic Regression

> model <- glm(Response ~ ., family = binomial(link = 'logit'), data = training)

 All inputs are included in this model.
 Car shows "NA" coefficients: the Car and House columns are replicas of each other, so one of them is a redundant input and either Car or House has to be discarded.
 ID is merely an identifier, hence it should also be discarded.
 The level "Yes" is reported in the coefficient table because the level "No" is taken as the reference value.
 For categorical variables like these, one level is taken as the reference and dummy variables are created for the other levels.
 It is useful to check the reference levels, as it eases interpretation of the results: either > contrasts(data$House) or > contrasts(data$Car) would show the reference level.
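The NA behaviour for a duplicated input, and the use of contrasts() to see the reference level, can be reproduced on simulated data. The column names here are illustrative stand-ins for the course data.

```r
# Two factor columns carrying identical information: glm can estimate a
# coefficient for only one of them and reports NA for the duplicate
set.seed(2)
n <- 100
House    <- factor(sample(c("Yes", "No"), n, replace = TRUE))
Car      <- House  # exact replica of House, hence redundant
Response <- factor(ifelse(runif(n) < 0.3, "Yes", "No"))
training <- data.frame(Response, House, Car)

model <- glm(Response ~ House + Car,
             family = binomial(link = "logit"), data = training)
print(coef(model))  # the CarYes coefficient comes back NA

# "No" is coded 0, i.e. it is the reference level for House
print(contrasts(training$House))
```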

Developing the Model (continued)

> model <- glm(Response ~ Age + Income + Asset + Liability + Dependents + House + Spending, family = binomial(link = 'logit'), data = training)

 The model is now developed with the screened variables only.
 Income and Liability seem to be strong predictors of the "propensity" to accept the offer, based on Pr(>|z|), which should be less than 0.05.
 A large difference between the null deviance and the residual deviance is also desired.
 The null deviance corresponds to a model with no variables (intercept only).
 The residual deviance corresponds to the inclusion of the screened variables in the model.
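The null-versus-residual-deviance comparison can be illustrated on simulated data in which one input genuinely drives the response. The variable names and coefficients are assumptions, not the course data.

```r
# Simulate a response that truly depends on Income
set.seed(3)
n <- 200
Income   <- rnorm(n, mean = 50, sd = 15)
p        <- plogis(-5 + 0.1 * Income)  # true propensity as a function of Income
Response <- factor(ifelse(runif(n) < p, "Yes", "No"))
training <- data.frame(Response, Income)

model <- glm(Response ~ Income,
             family = binomial(link = "logit"), data = training)

# Null deviance: intercept-only model; residual deviance: with Income included
print(model$null.deviance)
print(model$deviance)
print(model$null.deviance - model$deviance)  # a large drop is what we want
```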

Developing the Model: Variable Significance

Apart from the p-value, an ANOVA test can be used to check variable importance.

 Variables with a large deviance indicate a strong relationship with the outcome.
 Income and Liability come out as strong predictors based on the chi-square test too.
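In R this check is anova(model, test = "Chisq"): each row reports the drop in deviance as that variable enters the model, with a chi-square p-value. A minimal sketch with one informative and one noise input (simulated, not the course data):

```r
set.seed(4)
n <- 200
Income   <- rnorm(n, mean = 50, sd = 15)     # informative input
Age      <- sample(25:60, n, replace = TRUE) # pure noise input
Response <- factor(ifelse(runif(n) < plogis(-5 + 0.1 * Income), "Yes", "No"))
training <- data.frame(Response, Income, Age)

model <- glm(Response ~ Income + Age,
             family = binomial(link = "logit"), data = training)

# Sequential analysis of deviance with chi-square tests
tab <- anova(model, test = "Chisq")
print(tab)  # Income should show a large Deviance and a tiny Pr(>Chi)
```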

Validating the Model

We modify the validation data to keep only the screened variables, excluding ID and Car as discussed earlier:

> validation <- data.frame(validation[, c(2:7, 9:10)])

Predicting the Response for the validation data:

> predicted <- predict(model, newdata = validation, type = 'response')

The default prediction type is the link ("logit") value, i.e. the log of odds, from which the probability would need to be derived; type = 'response' returns the probability directly. Since our data has "Yes" and "No" values, we convert the probabilities with a 0.5 cut-off:

> predicted <- ifelse(predicted > 0.5, "Yes", "No")

Error is measured as the proportion of misclassified cases, where the prediction is not the same as the actual Response:

> error <- mean(predicted != validation$Response)
> print(error)

It is 0.08 in this case, hence 92% of records are classified correctly.
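Put together on simulated data (the same hypothetical columns as in the earlier sketches), the validation steps look like this:

```r
set.seed(5)
n <- 200
Income   <- rnorm(n, mean = 50, sd = 15)
Response <- factor(ifelse(runif(n) < plogis(-5 + 0.1 * Income), "Yes", "No"))
data <- data.frame(Response, Income)
training   <- data[1:150, ]
validation <- data[151:200, ]

model <- glm(Response ~ Income,
             family = binomial(link = "logit"), data = training)

# type = 'response' returns probabilities; 0.5 is the class cut-off
predicted <- predict(model, newdata = validation, type = "response")
predicted <- ifelse(predicted > 0.5, "Yes", "No")

error <- mean(predicted != validation$Response)
print(1 - error)  # proportion of validation records classified correctly
```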

Validating the Model (continued)

Model accuracy can be calculated more explicitly as follows:

> predicted <- predict(model, newdata = validation, type = 'response')
> predicted <- ifelse(predicted > 0.5, "Yes", "No")
> predicted <- data.frame(predicted)
> compare <- data.frame(cbind(validation, predicted))
> error <- nrow(subset(compare, Response != predicted))
> error <- table(compare$Response != compare$predicted)

Things to watch out for:

 Recall that we sought the proportion of respondents to non-respondents earlier (slide #6). If this ratio is very skewed, say 5% or less, then logistic regression may not be a good choice; we may have to use "oversampling for rare cases" or other models.
 In case of substantial missing values, we need to impute them with some reasonable value, such as the mean or the median.
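The side-by-side comparison above amounts to a confusion matrix. A compact sketch, with hypothetical vectors standing in for the validation results (the 4/50 = 0.08 error rate is chosen to mirror the slide's figure):

```r
# Stand-ins for actual and predicted Response on 50 validation records
set.seed(6)
actual    <- sample(c("Yes", "No"), 50, replace = TRUE, prob = c(0.2, 0.8))
predicted <- actual
flip <- sample(50, 4)  # flip 4 predictions to simulate misclassification
predicted[flip] <- ifelse(actual[flip] == "Yes", "No", "Yes")

compare <- data.frame(actual, predicted)
print(table(compare$actual, compare$predicted))  # rows: actual, cols: predicted

error_count <- nrow(subset(compare, actual != predicted))
print(error_count)  # 4 misclassified cases, i.e. error = 4/50 = 0.08
```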

Practical Scenario: Propensity Scoring

 The dataset would be much larger, probably several thousand or even millions of records.
 Variable screening would be much more rigorous.
 The logit or probability value is used to score each record.
 Records are sorted by their probability values.
 Based on the cut-off used, the top 20% or 30% of customers are targeted.
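Sorting by score and taking the top slice can be sketched as follows; CustomerID and the uniform scores are stand-ins for real predictions from the fitted model.

```r
set.seed(7)
scores <- data.frame(
  CustomerID = 1:1000,
  Propensity = runif(1000)  # stand-in for predict(model, type = "response")
)

# Sort highest propensity first and keep the top 20% for targeting
scores <- scores[order(-scores$Propensity), ]
target <- head(scores, 0.20 * nrow(scores))
print(nrow(target))  # 200 customers make the contact list
```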

Keep watching this space for more examples in R. Please do not try to read/write any files or install packages; you may occasionally encounter a "memory problem". In case of any questions or concerns, please do not hesitate to send an email.

Clean Up

Clean up RStudio before exiting, for others to use:
 Ctrl + L (lower-case L) to clear the console
 Clear objects (data) from the workspace
 Clear plots from the plot window
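The slide names the clean-up tasks but not every command; these are the standard base-R/RStudio equivalents (my assumption, not taken from the slide):

```r
rm(list = ls())                        # clear all objects (data) from the workspace
if (length(dev.list()) > 0) dev.off()  # close the current plot device, if any
cat("\014")                            # same effect as Ctrl+L: clears the console
```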