Introduction: Propensity Model
- A propensity model is a statistical model that predicts the "willingness" of a customer to perform an action, such as accepting an offer.
- The most common applications are in new customer acquisition and customer development (cross-sell or up-sell).
- By estimating the probability of "willingness" to act, it enables focused marketing, i.e. "skimming the cream".
- Retail chains, hotels, airlines, banks etc. may use it to find the customers most likely to respond to an offer.
- We will develop such a model with logistic regression in R.
Introduction: Example Inputs and Output
- Inputs: Age, Income, Marital Status, Gender; Profession, Locality, Credit Score; Loyalty Membership, Time since acquired
- Output: Propensity (to respond)
- Practical cases may have dozens of inputs and very large datasets, and need significant groundwork in terms of data quality and input screening.
Log in to open RStudio.
Introduction: The RStudio Layout
- Console: the command prompt where we give instructions.
- Environment pane: objects created during the R session appear here; we can view them by clicking on them.
- Plots pane: graphical results are displayed here.
Getting Started
Loading sample data on response to a promotional campaign:
> data <- read.csv("Sample_data.csv")
> data <- data.frame(data)
Creating training and validation sets:
> training <- data.frame(data[1:150, ])
> validation <- data.frame(data[151:200, ])
Data quality (missing values) checks:
> sapply(training, function(x) sum(is.na(x)))
Notes:
1. read.csv is used to read CSV files in R. Likewise, read.fwf (fixed-width files), read.table, scan, readLines etc. read data in other formats. Data can also be read from databases, from URLs, or from HDFS (Hadoop clusters).
2. The training data is built on the first 150 records in the file and will be used to develop the model; the validation data is built on the next 50 records and will be used to assess the model's accuracy.
3. sapply is a standard R function which, in this case, applies a function (counting missing values) to each column of the training data. is.na(x) checks whether a cell has no value. The function can also be defined separately and called by name in sapply, as shown in the sketch below.
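As noted in point 3, the counting function can be defined separately and passed to sapply by name; a minimal sketch (the name count_missing is illustrative, not from the original material):
> count_missing <- function(x) sum(is.na(x))   # counts missing values in a vector
> sapply(training, count_missing)              # applies it to every column of the training data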
Getting Started: The Data
- 200 data points in total, containing the outcome (Response) and the probable inputs.
- > nrow(subset(data, data$Response == "Yes"))
  The result is 39, hence roughly 20% of customers responded to this campaign.
- > sapply(training, function(x) sum(is.na(x)))
  Yields 0 for all fields, hence no missing values.
- In case of missing values, they are usually populated with the mean of the remaining values, as in the sketch below.
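Our sample has no missing values, but if a numeric input did, a simple mean imputation could look like the following sketch (the Income column is used purely for illustration):
> # replace missing Income values with the mean of the non-missing ones
> training$Income[is.na(training$Income)] <- mean(training$Income, na.rm = TRUE)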
Developing the Model: Logistic Regression
> model <- glm(Response ~ ., family = binomial(link = 'logit'), data = training)
- All inputs are included in this model.
- Car shows "NA" coefficients: the Car and House columns are replicas of each other, so one of them is a redundant input and either Car or House has to be discarded.
- ID is merely an identifier, hence it should also be discarded (one way to drop both columns is sketched below).
- Level "Yes" is reported in the coefficient table, as level "No" is taken as the reference value. For categorical variables like these, one level is taken as the reference and dummy variables are created for the other levels.
- It is useful to check reference levels, as this eases interpretation of results: > contrasts(data$House) or > contrasts(data$Car) would show the coding.
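As an alternative to listing the screened variables explicitly (as in the next step), the redundant columns could also be dropped from the training data before refitting; a minimal sketch, assuming the column names match the sample file:
> # drop the identifier and the redundant Car column, then refit on the rest
> training <- training[ , !(names(training) %in% c("ID", "Car"))]
> model <- glm(Response ~ ., family = binomial(link = 'logit'), data = training)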
Developing the Model: Screened Variables
> model <- glm(Response ~ Age + Income + Asset + Liability + Dependents + House + Spending, family = binomial(link = 'logit'), data = training)
- The model is now developed with the screened variables.
- Income and Liability appear to be strong predictors of the "propensity" to accept the offer, based on Pr(>|z|) values below 0.05.
- A large difference between the null and residual deviance is also desired: the null deviance corresponds to a model with no variables (intercept only), while the residual deviance corresponds to the model including the screened variables. This difference can be tested formally, as sketched below.
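The gap between the null and residual deviance can be checked with a chi-square (likelihood-ratio style) test; a minimal sketch using the fitted model object from above:
> # p-value for the improvement of the fitted model over the intercept-only model
> with(model, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE))
A small p-value suggests the screened variables jointly improve on the intercept-only model.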
Developing the Model: Variable Significance
- Apart from the p-values, an analysis-of-deviance (ANOVA) test can be used to check variable importance.
- Variables with a large deviance contribution indicate a strong relationship with the outcome.
- Income and Liability come out as strong predictors based on the chi-square test too. The standard call is sketched below.
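In R this analysis-of-deviance table is typically produced as follows (using the fitted model from above; terms are added sequentially, so ordering matters):
> # deviance contribution of each variable, with a chi-square test
> anova(model, test = "Chisq")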
Validating the Model
- We first modify the validation data as per the screened variables, excluding ID and Car as discussed earlier:
> validation <- data.frame(validation[, c(2:7, 9:10)])
- Predicting the "Response" for the validation data:
> predicted <- predict(model, newdata = validation, type = 'response')
- The default prediction type is on the link scale, i.e. the log of odds for a logit link, from which the probability would need to be derived; type = 'response' gives the predicted probability directly.
- Since our data has "Yes" and "No" values, we convert the probabilities to classes using a 0.5 cut-off:
> predicted <- ifelse(predicted > 0.5, "Yes", "No")
- Error is measured as the proportion of misclassified cases, where the prediction is not the same as the actual Response:
> error <- mean(predicted != validation$Response)
> print(error)
- It is 0.08 in this case, hence 92% of records are classified correctly.
Validating the Model (continued)
Model accuracy can be calculated more explicitly as follows:
> predicted <- predict(model, newdata = validation, type = 'response')
> predicted <- ifelse(predicted > 0.5, "Yes", "No")
> predicted <- data.frame(predicted)
> compare <- data.frame(cbind(validation, predicted))
> error <- nrow(subset(compare, Response != predicted))
> table(compare$Response != compare$predicted)
A confusion-matrix view of the same comparison is sketched after this slide.
Things to watch out for:
- Recall that we found the proportion of respondents to non-respondents earlier (in the Data step). If this ratio is very skewed, say 5% or less, then logistic regression may not be a good choice; we may have to use oversampling for rare cases or other models.
- In case of substantial missing values, we need to impute them with some reasonable value, such as the mean or median.
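Beyond the raw misclassification count, a confusion matrix of actual versus predicted responses gives a fuller view; a minimal sketch using the compare data frame built above (the Actual/Predicted labels are just display names):
> # rows: actual Response, columns: predicted Response
> table(Actual = compare$Response, Predicted = compare$predicted)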
Practical Scenario: Propensity Scoring
- The dataset would be much larger, probably several thousand or even millions of records.
- Variable screening would be much more rigorous.
- The logit or "probability" value is used: records are sorted by their predicted probabilities, and based on the cut-off chosen, the top 20% or 30% of customers are targeted, as in the sketch below.
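A hypothetical scoring sketch along these lines, assuming a large data frame customers (not part of the sample file) with the same screened input columns:
> # score every customer, rank by propensity, and keep the top 20%
> scores <- predict(model, newdata = customers, type = 'response')
> ranked <- customers[order(scores, decreasing = TRUE), ]
> target <- ranked[1:ceiling(0.20 * nrow(ranked)), ]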
- Keep watching this space for more examples in R.
- Please do not try to read/write any files or install packages.
- You may occasionally encounter memory problems.
- In case of any questions or concerns, please do not hesitate to send an email.
Clean Up
- Clean up RStudio before exiting, for others to use.
- Ctrl + L (lower case l) clears the console.
- Also clear the data and other objects from the workspace, and clear the plots from the Plots window (standard commands for this are sketched below).
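The workspace and plot cleanup can also be done from the console with standard base-R commands; a minimal sketch (Ctrl + L still clears the console itself):
> rm(list = ls())   # removes all objects, including data, from the workspace
> graphics.off()    # closes all open graphics devices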