
Presentation on theme: "Propensity Model: statistical models predicting the willingness to perform an action, such as accepting an offer". Presentation transcript:

1 Introduction

 Propensity Model refers to statistical models predicting the "willingness" to perform an action, such as accepting an offer.
 The most common applications are New Customer Acquisition and Customer Development (cross-sell or up-sell).
 The model estimates the probability of "willingness" to act, which leads to focused marketing, i.e. "skimming the cream".
 Retail chains, hotels, airlines, banks, etc. may seek to find the customers most likely to respond to an offer.
 We will develop such a model with logistic regression in R.

2 Introduction: Example

Inputs: Age, Income, Marital Status, Gender; Profession, Locality, Credit Score; Loyalty Membership, Time since acquired
Output: Propensity

 Practical cases may have dozens of inputs
 Very large datasets
 Needs significant groundwork in terms of data quality screening

3 Introduction: Log in to open RStudio

4 Introduction: The RStudio layout

 Console: the command prompt, where we give instructions
 Environment pane: objects created during the R session appear here; we view them by clicking on them
 Plots pane: graphical results are displayed here

5 Getting Started

Loading the sample data (data on responses to a promotional campaign):
> data <- read.csv("Sample_data.csv")
> data <- data.frame(data)

Creating training and validation sets:
> training <- data.frame(data[1:150, ])
> validation <- data.frame(data[151:200, ])

Data quality (missing values) checks:
> sapply(training, function(x) sum(is.na(x)))

Notes:
1. read.csv is used to read CSV files in R.
2. Likewise, we have read.fwf (fixed-width files), read.table, scan, readLines, etc. to read data in other formats.
3. Data can also be read from databases, from URLs, or from HDFS (Hadoop clusters).
4. The training set is built on the first 150 records in the file and will be used to develop the model.
5. The validation set is built on the next 50 records and will be used to assess the accuracy of the model.
6. sapply is a standard R function, which in this case applies a function (counting the number of missing values) to each column of the training data.
7. is.na(x) checks whether a cell has no value.
8. The function can also be defined outside sapply and called by name.
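For reference, a minimal sketch of the alternative readers mentioned in the notes; the file names, separator, and column widths here are hypothetical:
> df1 <- read.table("Sample_data.txt", header = TRUE, sep = "\t")   # delimited text file
> df2 <- read.fwf("Sample_data.fwf", widths = c(5, 10, 8))          # fixed-width file (hypothetical widths)
> txt <- readLines("Sample_data.txt")                               # raw lines, one string per line
> df3 <- read.csv("https://example.com/Sample_data.csv")            # reading directly from a URL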

6 Data

 200 data points in total
 The outcome is the Response field; the remaining fields are the probable inputs

> nrow(subset(data, data$Response == "Yes"))
The result is 39, hence about 20% of customers responded to this campaign.

> sapply(training, function(x) sum(is.na(x)))
This yields 0 for all fields, hence there are no missing values. In case of missing values, a field is usually populated with the mean of the remaining values (a sketch follows below).
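A minimal sketch of the mean-imputation approach mentioned above; it is not needed for this dataset, since there are no missing values, and impute_mean is a helper name chosen here for illustration:
> impute_mean <- function(x) {
+   if (is.numeric(x)) x[is.na(x)] <- mean(x, na.rm = TRUE)   # replace NAs with the column mean
+   x
+ }
> training[] <- lapply(training, impute_mean)   # apply to every column, keeping the data frame shape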

7 Developing the Model

Logistic regression model with all inputs:
> model <- glm(Response ~ ., family = binomial(link = 'logit'), data = training)

 All inputs are included in the model
 Car shows "NA" coefficient values
 The Car and House columns are replicas of each other, so one of them is a redundant input
 Either Car or House has to be discarded
 ID is merely an identifier, hence it should also be discarded (see the sketch below)
 The level "Yes" is reported in the coefficient table, as the level "No" is taken as the reference value
 For categorical variables like these, one level is taken as the reference level and dummy variables are created for the other levels
 It is useful to check reference levels, as this eases interpretation of the results:
> contrasts(data$House) or > contrasts(data$Car) would show the reference level
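One way to discard the redundant inputs before refitting; the column names ID and Car are taken from the slide's description of the data:
> training <- training[, !(names(training) %in% c("ID", "Car"))]   # drop the identifier and the duplicate column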

8 Developing the Model

Logistic regression model with the screened variables:
> model <- glm(Response ~ Age + Income + Asset + Liability + Dependents + House + Spending, family = binomial(link = 'logit'), data = training)

 Income and Liability seem to be strong predictors of the "propensity" to accept the offer, based on Pr(>|z|), which should be less than 0.05 (read from the model summary, shown below)
 A large difference between the Null and Residual deviance is also desired
 The Null deviance corresponds to a model with no variables (intercept only)
 The Residual deviance corresponds to the inclusion of the screened variables in the model
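The Pr(>|z|) values and the two deviances referred to above are read from the model summary:
> summary(model)   # coefficient table with Pr(>|z|), plus Null and Residual deviance at the bottom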

9 Developing the Model: Variable Significance

 Apart from the p-value, an analysis-of-deviance (anova) test can be used to check variable importance (see below)
 Variables with a large deviance indicate a strong relationship with the outcome
 Income and Liability come out as strong predictors based on the Chi-square test too
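A minimal sketch of the test; anova on a glm object adds each term sequentially and reports the resulting drop in residual deviance:
> anova(model, test = "Chisq")   # Chi-square test for the deviance reduction of each term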

10 Validating the Model

We modify the "validation" data as per the screened variables, excluding ID and Car as discussed earlier:
> validation <- data.frame(validation[, c(2:7, 9:10)])

Predicting the "Response" for the validation data:
> predicted <- predict(model, newdata = validation, type = 'response')

The default type is the "link" value, i.e. the log of the odds, from which the probability would have to be derived. type = 'response' instead returns the probability directly, on a 0-1 scale (see the sketch below). Since our data has "Yes" and "No" values, we convert the probabilities using a 0.5 cut-off:
> predicted <- ifelse(predicted > 0.5, "Yes", "No")

Error is measured as the proportion of misclassified cases, where the prediction is not the same as the actual Response:
> error <- mean(predicted != validation$Response)
> print(error)
It is 0.08 in this case, hence 92% of records are classified correctly.
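To make the two prediction scales concrete, a short sketch; plogis is the logistic function, which maps log-odds to probabilities:
> log_odds <- predict(model, newdata = validation)                     # default type = 'link': log-odds
> probs    <- predict(model, newdata = validation, type = 'response')  # probabilities on a 0-1 scale
> all.equal(probs, plogis(log_odds))                                   # TRUE: same information, different scale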

11 Validating the Model

Model accuracy can be calculated more explicitly as follows (the last two lines show two alternative ways to count the misclassified cases):
> predicted <- predict(model, newdata = validation, type = 'response')
> predicted <- ifelse(predicted > 0.5, "Yes", "No")
> predicted <- data.frame(predicted)
> compare <- data.frame(cbind(validation, predicted))
> error <- nrow(subset(compare, Response != predicted))
> error <- table(compare$Response != compare$predicted)

Things to watch out for:
 Please recall, we sought to find the proportion of respondents to non-respondents earlier (slide 6). If this ratio is very skewed, say 5% or less, then logistic regression may not be a good choice; we may have to use "oversampling for rare cases" or other models.
 In case of substantial missing values, we need to impute them with some reasonable value, such as the mean or median.
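Beyond the single error number, a confusion matrix shows which kinds of mistakes the model makes; a short sketch using the compare data frame built above:
> table(Actual = compare$Response, Predicted = compare$predicted)   # cross-tabulate actual vs predicted classes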

12 Practical Scenario: Propensity Scoring

 The dataset would be much larger, probably several thousand or even millions of records
 Variable screening would be much more rigorous
 The logit or "probability" value is used to rank customers
 Records are sorted based on the probability values
 Based on the cut-off value used, the top 20% or 30% of customers are targeted (see the sketch below)
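A minimal sketch of this scoring step; new_customers is a hypothetical data frame holding the screened input columns for the full customer base:
> scores <- predict(model, newdata = new_customers, type = 'response')   # propensity score per customer
> ranked <- new_customers[order(scores, decreasing = TRUE), ]            # sort by score, highest first
> target <- head(ranked, ceiling(0.20 * nrow(ranked)))                   # take the top 20% for the campaign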

13 Keep watching this space for more examples in R

 Please do not try to read/write any files or install packages
 You may occasionally encounter a "memory problem"
 In case of any questions or concerns, please do not hesitate to send an email to info@anallyz.com

14 Clean up

Clean up RStudio before exiting, for others to use:
 To clear the console: CTRL + L (lower-case L)
 To clear objects and data from the workspace (see the commands below)
 To clear plots from the window
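The corresponding console commands for the workspace and plots steps:
> rm(list = ls())   # remove all objects (data included) from the workspace
> graphics.off()    # close all graphics devices, clearing the Plots window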

