
1 Regression and Classification Analysis for Improved Life Insurance Underwriting – reducing information requirements to improve enrollment
Liz Goff, Mercedez Hernandez, David Smith
BIT 5534 Semester Project Presentation
Script: Hello, and thank you for taking time to review the results of our project. Our group members are Liz Goff, Mercedez Hernandez, and David Smith. The topic we are focused on is "Regression and Classification Analysis for Improved Life Insurance Underwriting," with the high-level goal of reducing information requirements to ultimately improve insurance policy enrollment. Let's take a closer look at the business problem.

2 Business Problem: Insurance Risk Classification
Background
Life insurance is a billion-dollar industry, so even small improvements in risk classification can result in significant savings for a life insurance company
A greater number of policyholders results in better allocation of risk across the pool of policies
Data Analysis Objectives
Reduce information requirements for new subscribers
Develop data models to predict the insurance risk level of new subscribers
Provide data model deployment recommendations
Script: Life insurance is a billion-dollar industry that relies on the proper assignment of risk to its customers. Companies like Prudential have already identified two areas of focus. First, improvements in risk classification will provide better cost management across various policy types and subscriber risk levels. And second, an increase in total policyholders will help spread risk and dampen the effects of risk level misclassifications. Our data analysis objectives support these focus areas. More potential insurance subscribers are expected to complete the application process if there are fewer questions to answer. As a result, we focused on reducing the number of input variables to the data models. Our goal for the data models was to predict insurance subscriber risk level. This includes both the risk level assignment and a probability of that assignment being correct. The risk level prediction could then be deployed for use with new insurance policy customers. To complete these objectives, data is needed. We discuss that next…

3 Exploring the Data
Data Source
Prudential Life Insurance – existing customer data (source: kaggle.com)
Risk Level – known, dependent; ordinal values 1–8 (multi-class); supervised learning
Mostly categorical attributes; pre-normalized and anonymized, making the data difficult to understand
Attribute Details
126 independent, 1 dependent; 59,381 observations/rows
Most continuous attributes have missing values
Attribute Groups
Product info; applicant personal info; employment info; insured info; insurance, medical, and family history; medical keywords (dummy variables)
Script: Our data is provided by Prudential Life Insurance and includes 59,381 rows (or observations) of existing customer data. For each observation, the insurance risk level is known, and this attribute represents our dependent variable for data modeling. Because the risk level was known in advance, supervised learning was used. Risk level is ordinal and takes on values that range from 1 (low risk) to 8 (high risk). This adds some complexity to data modeling since the risk level value is multi-class, meaning there are more than 2 possible values to choose from. Additional challenges were that most of the independent attributes were categorical and that all attributes were pre-normalized and anonymized. This made it difficult to understand the meaning of the data and the questions being asked of an insurance subscriber. In total, there were 127 attributes within the data set, and they were broken up into groups. Groups include product info; personal information about the applicant such as weight, height, and age; employment info; insured info; insurance history; medical history; and family history. Finally, a set of dummy variables indicates the presence of certain medical keywords applicable to the customer. The data set needed to be prepared for modeling, and that is what we focused on next…
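To make this concrete, here is a minimal sketch of loading and inspecting the data — an assumption on our part, since the project itself used JMP; "train.csv" is the name of the Kaggle competition download:

```python
# Load the Prudential customer data and confirm the properties above.
import pandas as pd

df = pd.read_csv("train.csv")

print(df.shape)                       # 59,381 rows; 126 predictors + Response (plus an Id column)
print(df["Response"].value_counts())  # ordinal risk levels 1-8 (multi-class target)

# Percentage of missing values per column, worst first
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.head(10))
```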

4 Preparing For Modeling
Missing Values
Variable Reduction
Training/Validation Split (2/3, 1/3): Training – 39,588; Validation – 19,794
Modeling: Bootstrap Forest, Neural Network
Script: A few things needed to be addressed with the data set prior to modeling. Missing values were a concern, as most of the continuous numeric variables included some degree of missingness. Variable reduction was also explored, as it supports the business objectives. Both topics are described later in more detail. The data set was also split into a training set and a validation set using random selection. The split was 2/3 of the observations for training and 1/3 for validation, which ultimately provided 39,588 training observations and 19,794 validation observations. This holdout data is used to evaluate models for overfitting and to compare model performance for final model selection. The data preparation is described next, followed by an overview of our bootstrap forest and neural network models.
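A sketch of the split, using scikit-learn rather than JMP's random selection (random_state is an arbitrary choice for reproducibility); it assumes the df loaded in the previous sketch:

```python
# 2/3 training, 1/3 validation holdout for overfitting checks and
# final model comparison.
from sklearn.model_selection import train_test_split

X = df.drop(columns=["Response"])
y = df["Response"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=1/3, random_state=42)
print(len(X_train), len(X_valid))  # approximately 39,588 and 19,794
```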

5 Missing Values and Imputation
Uncovered relationship between Medical_History_1 and _2: when Medical_History_1 is missing, the value of Medical_History_2 is 162
These missing values were replaced with the column average, and a new variable, MedHist1_Missing, was created to identify these substitutions and monitor their impact on the model
Excluded the following variables due to >80% missingness: Medical_History_10, 15, 24, 34
Script: Missing values needed to be addressed within the dataset. Extreme missingness, where greater than 80% of the observations had missing values, was found in four of the 'medical history' continuous attributes. Due to the anonymous nature of this dataset, these attributes were excluded from data mining. Other attributes with missing values produced interesting findings. One example is the relationship between Medical_History_1 and 2. The mosaic plot shown is slightly difficult to read, but the important aspect of this plot is the exclusively blue middle section that indicates a missing value for the Medical_History_1 attribute. This large section of blue occurs when the value of Medical_History_2 is 162. In other words, when Medical_History_2 has a value of 162, it is guaranteed that the value of Medical_History_1 will be missing. For model training, we filled the missing values with the column average and created an indicator variable that stores whether the original value was missing (1 for missing, 0 if not). Without compromising the anonymity of the dataset, it could be beneficial to consult with data experts to determine why the value of 162 is significant and potentially improve the data models in future projects.
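In pandas, the fill-plus-indicator step might look like the following — a sketch, not the JMP workflow; for brevity it runs on the full table, though in practice the mean would be computed on training data only:

```python
# Flag and mean-fill Medical_History_1, then drop the four
# high-missingness columns named on the slide.
df["MedHist1_Missing"] = df["Medical_History_1"].isna().astype(int)
df["Medical_History_1"] = df["Medical_History_1"].fillna(
    df["Medical_History_1"].mean())

df = df.drop(columns=["Medical_History_10", "Medical_History_15",
                      "Medical_History_24", "Medical_History_34"])
```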

6 Missing Values and Imputation
Mutually exclusive variables: Family_Hist_2 and Family_Hist_3; Family_Hist_4 and Family_Hist_5
Merged each pair and created a nominal indicator column where 0 = both missing, 2/4 = value from Family_Hist_2 or _4, 3/5 = value from Family_Hist_3 or _5
Other missing values imputed with column means
Script: Another interesting set of missing data occurs within the 'Family History' attribute group. There is mutually exclusive missingness between 'Family_Hist_2' and '3', and also between 'Family_Hist_4' and '5'. This is shown visually in the cell plot, where red indicates a non-missing value and blue indicates missing. Here you see there is no overlap of red between attributes 2 and 3, or 4 and 5. We merged each pair into a single column, replaced remaining missing values with the column mean, and generated an indicator column that stores the number of the original column the data came from. For other attributes with less interesting missingness, the values were imputed with the column means. Now that missingness has been addressed, we turn to variable reduction.
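A sketch of the merge for the first pair (the Family_Hist_4/_5 pair is handled identically); the merged and indicator column names are our own invention:

```python
# Merge the mutually exclusive pair into one column plus a nominal
# source indicator: 0 = both missing, 2 = from _2, 3 = from _3.
import numpy as np

df["Family_Hist_2_3"] = df["Family_Hist_2"].combine_first(df["Family_Hist_3"])
df["FamHist_2_3_Source"] = np.select(
    [df["Family_Hist_2"].notna(), df["Family_Hist_3"].notna()],
    [2, 3], default=0)

# Rows where both were missing get the column mean, per the slide
df["Family_Hist_2_3"] = df["Family_Hist_2_3"].fillna(df["Family_Hist_2_3"].mean())
df = df.drop(columns=["Family_Hist_2", "Family_Hist_3"])
```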

7 Predictor Screening
[Slide figures: "Variables by Group" (left), "All Variables (Showing Top Ranked)" (top right), "ROC Curve Shows Prediction Power Differences" (bottom right)]
Script: Variable reduction is a key business objective for this data mining project. A reduced set of variables translates into a reduced set of questions that a potential insurance customer must answer when filling out an application. Classification tree boosting was used as the primary method of variable reduction. Within the JMP software program, there is a "predictor screening" tool that ranks a set of independent variables by their contribution to a boosted tree model. Two approaches were taken when using this tool. The predictor attributes are grouped into different types of data collected about an insurance applicant. The first approach, shown on the left, was to rank attributes by comparing them only to other attributes within the same group; here, we use the "Insured Info" set of attributes. The second approach, shown top right, was to compare all available attributes against each other. Notice how "InsuredInfo_6" performs extremely well within its smaller grouped set of attributes, yet performs weakly when compared against all attributes. Visually, a receiver operating characteristic (ROC) curve is used on the bottom right to show weak vs. strong predictor attributes. The upward-sloping diagonal reference line of an ROC plot indicates a purely random prediction; curves above the diagonal represent predictive power that is better than random. These plots were created by performing a logistic regression of one attribute to predict the ordinal target attribute that represents insurance risk level. Here, 'Employment_Info_4' is not a good predictor since its curve follows the diagonal. Notice how the ROC curve on the right shows an increase in prediction performance for the 'BMI' attribute. This is expected, since 'BMI' is one of the best predictors, as seen in the upper right image. The top predictors of each attribute group and the overall top predictors were then used to build models.
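As a rough analogue of this screening step — an assumption on our part, since JMP's predictor screening is proprietary and scikit-learn's gradient boosting stands in for its boosted trees — the ranking idea looks like this; it assumes X_train and y_train from the earlier split sketch:

```python
# Rank predictors by their contribution to a boosted classification-tree
# model -- a stand-in for JMP's "predictor screening" tool.
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

X_num = X_train.select_dtypes("number")          # screen the numeric predictors
screen = GradientBoostingClassifier(n_estimators=100, random_state=42)
screen.fit(X_num.fillna(X_num.mean()), y_train)  # simple mean-fill for NaNs

ranking = pd.Series(screen.feature_importances_, index=X_num.columns)
print(ranking.sort_values(ascending=False).head(24))  # overall top predictors
```

Screening within a single attribute group amounts to running the same fit on just that group's columns.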

8 Bootstrap Forest Model
24 Top Attributes Across All Available Attributes
3 model variations explored
41 Trees in Forest
R-Square: Training = .701; Validation = .626
Misclassification Rate: Training = 44.8%; Validation = 47.9%
Script: Our data mining problem is supervised, with a known response variable that indicates an insurance applicant's assigned risk level. The risk level dependent variable is multi-class, meaning that there is only one assigned value but it is selected from one of eight possible values. In addition, the risk level is ordinal, such that risk is greater as you move from values of 1 up to 8. The first model type selected for analysis was a bootstrap forest. Three different models were generated using different combinations of variables. The variable sets had been previously determined during the predictor screening process. The bootstrap forest model that performed best utilized the top 24 attributes across all available attributes. A total of 41 classification trees were used to make final predictions for the insurance risk level. Model fit is represented by R-square values, misclassification rates, and ROC curves. Here, the training R-square is .701 with a misclassification rate of 44.8% across all predicted values of insurance risk. This means that slightly more than half of all predictions are correct. The ROC curves show the predictive performance across each value of the insurance risk level. One interesting observation is that the best performing value in the training data set (risk level 3, in dark blue) is only the 4th best performing value in the validation data set. This could be an indication of model instability.
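As a sketch of this model type — assuming scikit-learn's RandomForestClassifier as a stand-in for JMP's bootstrap forest, and top24 built from the ranking in the screening sketch rather than JMP's exact list:

```python
# Bootstrap forest stand-in: 41 bagged classification trees over the
# top 24 screened attributes, scored on the holdout set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

top24 = ranking.sort_values(ascending=False).head(24).index

forest = RandomForestClassifier(n_estimators=41, random_state=42)
forest.fit(X_num[top24].fillna(X_num.mean()), y_train)

pred = forest.predict(X_valid[top24].fillna(X_num.mean()))  # fill with training means
print("validation misclassification:", round(1 - accuracy_score(y_valid, pred), 3))
```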

9 Neural Network Model
29 input attributes; best from each attribute group
11 model variations explored, varying hidden layers, input attributes, and penalty methods
2 Hidden Layers, 16 Nodes each
R-Square: Training = .653; Validation = .619
Misclassification Rate: Training = 46.9%; Validation = 48.8%
Script: Neural network models were also explored. 11 different models were built using variations of hidden layer node counts, predictor variables, and penalty methods. The best performing model used 29 input attributes made up of the best attributes from each of the attribute groups (examples being Family History, Medical History, etc.). This differs from the bootstrap forest, which used the top attributes overall, meaning the neural network model has more diversity among its input variables since it uses some from each group. The final model included 2 hidden layers, each containing 16 nodes with a hyperbolic tangent activation function. The final hidden layer then sends output to 8 logistic functions that calculate the probability of an observation being assigned to the corresponding insurance risk level. The highest probability is then used as the predicted value. It's important to note that bootstrap forests also create similar probabilities for each insurance risk level value. Model performance is again shown using R-square, misclassification rates, and ROC curves. Like the bootstrap forest model, greater than 50% of the predicted values are correct, with misclassification rates on the training and validation data of 46.9% and 48.8%, respectively. So which model is preferred?
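A sketch of this architecture with scikit-learn's MLPClassifier follows. Assumptions: best29 is a placeholder for the 29 per-group attributes (the exact list isn't in the slides, so the top 29 overall stand in), alpha approximates the penalty method, and the output layer is a softmax rather than JMP's 8 separate logistic nodes:

```python
# Two hidden layers of 16 tanh nodes; predict_proba returns one
# probability per insurance risk level (8 classes).
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

best29 = ranking.sort_values(ascending=False).head(29).index  # placeholder list

nn = make_pipeline(
    StandardScaler(),  # scale inputs before the network
    MLPClassifier(hidden_layer_sizes=(16, 16), activation="tanh",
                  alpha=1e-3, max_iter=500, random_state=42),
)
nn.fit(X_num[best29].fillna(X_num.mean()), y_train)

proba = nn.predict_proba(X_valid[best29].fillna(X_num.mean()))
print(proba.shape)  # (19794, 8): one column per risk level
```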

10 Final Model Selection: Confusion Matrix Comparison (Bootstrap vs. Neural)
Using validation data set: 19,794 rows
Similar misclassification rates: Bootstrap – 47.9%, Neural – 48.8%
Neural Network selected as best:
Difference in R-square, train vs. validation: Neural Δ R-Sq = -.034; Bootstrap Δ R-Sq = -.075
Neural network performs across all output values; bootstrap ignores output values with lower frequencies (i.e., risk levels 3 and 4)
Neural model uses attributes from all attribute groups; more diverse
Script: Final model selection was evaluated using confusion matrices, misclassification rates, and R-square deltas across training and validation data sets. The confusion matrices shown are specific to the validation data used for modeling, which included 19,794 observations. The highlighted diagonal on each matrix represents the correct predictions across each insurance risk level. The correct vs. incorrect predictions are used to calculate the misclassification rates, which are very similar between the neural and bootstrap forest models, differing by only .9%. Ultimately, the neural network model was selected as the preferred prediction model. The differences in R-square values across training and validation data were calculated, and the neural network model is ideal with a smaller delta of .034. The bootstrap forest model has an R-square delta of .075, meaning that it does not generalize to new data observations as well as the neural network model. This lack of generalization can be seen when we focus on the predicted risk levels of 3 and 4. Insurance risk levels 3 and 4 are the least frequent in the training and validation data sets, at 1.7 and 2.4% of all observations, respectively. The bootstrap model nearly ignores these values when making predictions, while the neural network does not. The highlighted areas in the confusion matrices for values 3 and 4 show this, with the bootstrap model having nearly all 0's for predicted counts. This diversity in prediction capability is further evidenced by the predictor attributes used. The neural model uses attributes from each attribute group (Family History, Medical History, etc.), while the bootstrap model only focuses on the best attributes overall. Based on these results, the neural network model was selected for deployment.
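Continuing the same sketch (the fitted forest and nn from the two model sketches above), the side-by-side comparison could be reproduced like this:

```python
# Confusion matrices and misclassification rates on the validation set.
from sklearn.metrics import confusion_matrix, accuracy_score

for name, model, cols in [("bootstrap forest", forest, top24),
                          ("neural network", nn, best29)]:
    pred = model.predict(X_valid[cols].fillna(X_num.mean()))
    print(name, "misclassification:", round(1 - accuracy_score(y_valid, pred), 3))
    print(confusion_matrix(y_valid, pred, labels=list(range(1, 9))))  # rows = actual 1-8
```

Rows 3 and 4 of the bootstrap forest's matrix would show the near-zero predicted counts described above.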

11 Deployment
The model is deployed with the following commitments to Prudential:
1. Customer expectations for the software must be managed.
2. A complete delivery package is assembled and tested.
3. A support system is established before software is delivered.
4. Instructional materials are provided to the end users.
5. Software bugs are fixed first and delivered later.
Script: We fully expect and understand that questions will be raised when changes occur to a consumer-facing process like underwriting. We also recognize that predictive modeling is a new and growing trend in life insurance, and the industry culture and regulations may evolve in ways that impact how data and predictive models are used. The new model will be delivered with the following commitments: 1. Customer expectations for the software must be managed: this ensures the development team and Prudential are on the same page about the model's capabilities and prediction abilities. 2. A complete delivery package is assembled and tested: all executable software, support data files, and other relevant information is assembled and thoroughly tested with actual users before implementation. 3. A support system is established before software is delivered: the development team provides responsive and accurate information about how the predictive model works and about its results, and will be available before, during, and after implementation to ensure an accurate deployment. 4. Instructional materials are provided to the end users: Prudential is provided with accurate and thorough support documents, along with training and troubleshooting guidelines. 5. Software bugs are fixed first and delivered later: the deployed model is held to rigorous standards, and we will not deploy a "buggy" system if issues are known. Prudential can choose to implement the prediction model with an acceptable risk score interval. A threshold could be set on the probabilities such that a minimum probability is required to be confident enough in the predicted risk level. For example, the risk score probability must be at least .65 (65%); any prediction below that probability would refer the applicant to an insurance sales agent for further review.

12 Final Model
Allows Prudential to set a threshold for risk level probability
Hard to interpret given the dataset was pre-normalized
Neural network models are hard to understand, so leveraging the model beyond its initial purpose (e.g., for targeted marketing) is limited
Script: The data model generates formulas that calculate probabilities for an observation belonging to each of 8 different insurance risk levels. An insurance application could be made available via a web or mobile device service where a potential customer is presented with the 29 questions. This represents approximately a 77% reduction in questions from the original set of 126. Based on the responses and calculated probabilities, an insurance quote would be instantly generated. However, thresholds could be set on the probabilities such that any application whose top predicted probability falls below, say, .65 (65%) is referred to an insurance sales agent for further review.
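A minimal sketch of that deployment rule, assuming the fitted nn pipeline and best29 list from the earlier sketches and a hypothetical THRESHOLD constant for the .65 example:

```python
# Quote instantly when the top class probability clears the threshold;
# otherwise refer the applicant to a sales agent.
import numpy as np
import pandas as pd

THRESHOLD = 0.65  # example cutoff from the presentation

proba = nn.predict_proba(X_valid[best29].fillna(X_num.mean()))
risk_level = nn.classes_[proba.argmax(axis=1)]  # most probable of the 8 levels
confidence = proba.max(axis=1)

decision = np.where(confidence >= THRESHOLD, "instant quote", "refer to agent")
print(pd.Series(decision).value_counts())
```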

