Data Analysis Case Study – Auto Claim Assignment
  Ming Sun, American Family Insurance
About Myself
  Timeline: 1999 - 2005, 2005 - 2014, 2014 - present
  Application Development: J2EE Web App, Java Batch Processing
  Solution Architecture: Big Data Analytics, Mobile App, Application Integrations, Data Warehouse Integrations
  Data Science Engineering: Repeatable Data Science Pipelines, Exploratory Data Analysis, Data Lake Design, Technology Incubation
Analytical Solution Life-cycle
  Problem Definition (start here): Current State, Bottom Line CBA, Top Line Benefits, Data Sources
  Data Preparation: Data Domains, Data Quality, Data Design, Data Blend, Data Pipelines
  Model Development: Model Techniques, Model Performance, Model Pipelines
  Solution Deployment: Containerization, CI/CD, Monitor Pipelines, Model Registry
Problem Definition
  Scope – Determine if a damaged vehicle should be totaled or repaired at the early stage of auto claims
  Current State – Point Based Model, accuracy < 80%
  Bottom Line CBA – Annual savings amount, 10% lift ≈ $500k-$2M
  Top Line Benefits – Impact to customer satisfaction
Problem Definition – Data Sources
  3rd normal form DB
  Claim System – Old (DB2)
  Partial Data
  Claims Data Warehouse (DB2)
  Claim System – New (Oracle)
  Partial Data
  No Data
  3rd Party Data (daily files)
Data Preparation – Data Domains
  Handling Assignment (6 - 8 tables)
  3rd Party Loss Estimates (5 files)
  Initial Claim (7 - 10 tables)
  Customer Satisfaction (2 files)
  Code Description (10+ tables)
  Total Loss Workflow (2 - 4 tables)
  Salvage Info (2 tables)
Data Preparation – Grain/Quality/Blend
  The grain of the blended dataset - Vehicle
  Current snapshot of all closed auto collision claims
  Identify keys to blend claims, 3rd party estimates, and customer satisfaction
  Profile the blended dataset: record counts, missing values, column value distribution, correlation, etc.
  This is where 60% of the project time is spent
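The blend and profile steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's pipeline: the field names (claim_id, vehicle_id, outcome, estimate_amount, satisfaction_score), the join keys, and the sample records are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
claims = [
    {"claim_id": "C1", "vehicle_id": "V1", "outcome": "repairable"},
    {"claim_id": "C1", "vehicle_id": "V2", "outcome": "total_loss"},
    {"claim_id": "C2", "vehicle_id": "V1", "outcome": "repairable"},
]
estimates = [
    {"claim_id": "C1", "vehicle_id": "V1", "estimate_amount": 2400.0},
    {"claim_id": "C1", "vehicle_id": "V2", "estimate_amount": 11800.0},
]
satisfaction = [
    {"claim_id": "C1", "satisfaction_score": 4},
]


def index_by(rows, keys):
    """Index rows by a tuple of key fields (assumes the key is unique per row)."""
    return {tuple(row[k] for k in keys): row for row in rows}


def blend(claims, estimates, satisfaction):
    """Left-join estimates and satisfaction onto claims, one row per vehicle."""
    est_by_key = index_by(estimates, ("claim_id", "vehicle_id"))
    sat_by_key = index_by(satisfaction, ("claim_id",))
    blended = []
    for claim in claims:
        est = est_by_key.get((claim["claim_id"], claim["vehicle_id"]), {})
        sat = sat_by_key.get((claim["claim_id"],), {})
        row = dict(claim)
        # Unmatched rows keep None so the profile can count them as missing.
        row["estimate_amount"] = est.get("estimate_amount")
        row["satisfaction_score"] = sat.get("satisfaction_score")
        blended.append(row)
    return blended


def profile(rows):
    """Return record count, missing-value counts, and value distributions."""
    columns = sorted({col for row in rows for col in row})
    missing = {c: sum(1 for r in rows if r.get(c) is None) for c in columns}
    distribution = {c: Counter(r.get(c) for r in rows) for c in columns}
    return {"record_count": len(rows), "missing": missing,
            "distribution": distribution}


blended = blend(claims, estimates, satisfaction)
report = profile(blended)
```

The left join keeps every vehicle row even when no estimate or survey matches, so gaps in the source data show up in the missing-value counts instead of silently shrinking the dataset. Correlation is left out of this sketch.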
Problem Definition Analysis
  Current Process: Vehicle Questionnaire
    Number of questions: 17 12 Questions not answered > 80%
    Assignment Accuracy ≈ 80%
      Assigned Repairable, actual Total Loss ≈ 2x%
      Assigned Total Loss, actual Repairable ≈ x%
  Mis-assigned Claim Costs
    Assigned Repairable, actual Total Loss ≈ $3y per claim
    Assigned Total Loss, actual Repairable ≈ $y per claim
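The rates and per-claim costs above combine into a total mis-assignment cost. The sketch below works through that arithmetic, assuming both percentages are shares of all assigned claims (the slide does not state the base). The claim count and the values used for x and y are hypothetical placeholders, not figures from the case study.

```python
def misassignment_cost(n_claims, x_pct, y_dollars):
    """Cost of mis-assigned claims under the 2x% / x% and $3y / $y ratios.

    Assumes both percentages are shares of all assigned claims.
    """
    # Assigned Repairable, actual Total Loss: 2x% of claims at $3y each.
    repairable_then_total = n_claims * (2 * x_pct / 100) * (3 * y_dollars)
    # Assigned Total Loss, actual Repairable: x% of claims at $y each.
    total_then_repairable = n_claims * (x_pct / 100) * y_dollars
    return repairable_then_total, total_then_repairable


# Hypothetical placeholder inputs: 10,000 claims, x = 5, y = 100.
cost_a, cost_b = misassignment_cost(n_claims=10_000, x_pct=5.0, y_dollars=100.0)
share_a = cost_a / (cost_a + cost_b)
```

Under that assumption the split does not depend on x or y: the first error type costs 6xy per 100 claims against 1xy for the second, so vehicles assigned Repairable that turn out to be Total Loss account for six sevenths of the mis-assignment cost.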
Customer Satisfaction Impact Analysis
  5 satisfaction score buckets, with 5 being the most satisfied
  False Positives have the worst impact, followed by False Negatives
  Customers are happy with True Negatives
Model Development
  Winner – Logistic Regression

  Models                  Misclassification Rate  ROC
  Random Forest           0.136                   0.90
  Logistic Regression     0.145                   0.89

  Comparison Category     Which Model is Better
  Technical Performance   Random Forest
  Implementation Cost     Logistic Regression (200 vs 1000 hours)
  Annual Saving Forecast  Tie

  The Random Forest model outperforms the Logistic Regression (points) model from a model performance standpoint.
  The forecasted annual savings from these two models are very similar.
  The time to integrate the Logistic Regression (points) model into the claim system is much shorter than for the Random Forest model.
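The two metrics in the comparison table can be computed from labels and model scores as sketched below. This assumes the ROC column reports the area under the ROC curve and that the misclassification rate uses a 0.5 threshold; neither is stated on the slide. The labels and scores are hypothetical.

```python
def misclassification_rate(labels, scores, threshold=0.5):
    """Share of cases whose thresholded score disagrees with the label."""
    wrong = sum(1 for y, s in zip(labels, scores)
                if (s >= threshold) != bool(y))
    return wrong / len(labels)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive case scores higher than
    a random negative case, counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = total loss, 0 = repairable) and model scores.
labels = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

error = misclassification_rate(labels, scores)
auc = roc_auc(labels, scores)
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; a library routine would be the practical choice on a full claims dataset.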
Model Development – Cont'd
  Scores, low to high: Repairable, Cutoff Point, Manual Review, Cutoff Point, Total Loss
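The two cutoff points split the score range into three zones. A minimal sketch of that routing follows. The cutoff values, the function name, and the handling of scores that land exactly on a cutoff are assumptions for illustration.

```python
def route_claim(score, repairable_cutoff, total_loss_cutoff):
    """Route a vehicle to one of three zones based on its model score.

    Assumes a score exactly on a cutoff belongs to the higher zone.
    """
    if repairable_cutoff > total_loss_cutoff:
        raise ValueError("repairable_cutoff must not exceed total_loss_cutoff")
    if score < repairable_cutoff:
        return "repairable"
    if score >= total_loss_cutoff:
        return "total_loss"
    # Scores between the two cutoffs are too uncertain to auto-assign.
    return "manual_review"


# Hypothetical cutoffs on a 0-1 score scale.
REPAIRABLE_CUTOFF = 0.3
TOTAL_LOSS_CUTOFF = 0.7

decisions = [route_claim(s, REPAIRABLE_CUTOFF, TOTAL_LOSS_CUTOFF)
             for s in (0.1, 0.5, 0.9)]
```

Moving the cutoffs apart sends more vehicles to manual review and fewer to automatic assignment, so the gap between them trades reviewer workload against mis-assignment cost.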
Solution Deployment
  Simplified Vehicle Questionnaire
    Questions: 12 → 8
    Answers: Y/N → List of Choices
    Got rid of the questions that cannot be answered easily.
  Logistic Regression Points Assignment
  Claim System - New UI
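One common way to turn a logistic regression into a points assignment is to scale each coefficient to an integer number of points, so that the claim system only has to add up points per answer. The sketch below shows that idea. The slides do not say how the points were derived, so the scaling rule, the questions, and every coefficient here are hypothetical.

```python
import math

# Hypothetical coefficients for illustration; not the fitted model.
INTERCEPT = -2.0
COEFFICIENTS = {
    ("airbags_deployed", "yes"): 1.6,
    ("vehicle_drivable", "no"): 1.1,
    ("fluid_leaking", "yes"): 0.7,
}
POINTS_PER_UNIT = 10  # assumed scale: 10 points per unit of log-odds


def to_points(coefficients, points_per_unit=POINTS_PER_UNIT):
    """Scale each log-odds coefficient to a rounded integer point value."""
    return {key: round(coef * points_per_unit)
            for key, coef in coefficients.items()}


def score_answers(answers, points_table):
    """Sum the points for each (question, answer) pair; unlisted pairs score 0."""
    return sum(points_table.get((q, a), 0) for q, a in answers.items())


def probability_from_points(points, intercept=INTERCEPT,
                            points_per_unit=POINTS_PER_UNIT):
    """Map a point total back to an approximate total-loss probability."""
    log_odds = intercept + points / points_per_unit
    return 1.0 / (1.0 + math.exp(-log_odds))


points_table = to_points(COEFFICIENTS)
answers = {"airbags_deployed": "yes", "vehicle_drivable": "no",
           "fluid_leaking": "no"}
total_points = score_answers(answers, points_table)
probability = probability_from_points(total_points)
```

Rounding to integers loses a little precision, but a lookup table of points is far simpler to integrate into a claim system than a model runtime, which fits the implementation-cost argument made for Logistic Regression.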
Takeaways
  Data analysis is critical throughout
  Keep the data scope reasonable
  Deep knowledge of business process and data
  Ease of implementation over model techniques
  Be conservative when estimating savings
  Pilot the solution first for 3-6 months to test
  It is a team effort (analysts, engineers, scientists)
Parting Thought – Data Preparation
  Most time-consuming work
  Tedious and not glamorous
  Foundational work – Data Lake
  Vulnerable to being the scapegoat