An Automatic Intelligent Model Building System with Business Applications Dr. Morgan C. Wang Department of Statistics University of Central Florida Morgan C. Wang 11/28/2018
Outline
  Introduction
  System Description
  Case Study I
  Case Study II
  Questions
Introduction
Accenture and GE Report
  84% of companies believe that big data analytics will affect their industry in 2017.
  89% of companies believe that, without adequate big data analytics capability, their competitive advantage will erode and their market share will shrink.
Accenture and GE Report
  Only 16% of all companies use big data analytics to perform predictive analytics.
  Only 13% of all companies use big data analytics to optimize their working processes.
Accenture and GE Report
  Analytics: summarizes facts, such as “Customer A's claim cost was $5,000 in 2016”.
  Predictive analytics: accurately predicts what will happen in the future, such as “Customer A will file three claims costing us $6,000 in total in 2017”.
Accenture and GE Report
  By 2020, the internet market will grow to 350 billion, 20% of which will be data analytics.
  The analytics market will move from basic analytics to advanced analytics.
  In China alone, the market size will be 70 billion.
Shortage of Data Scientists
  Despite the activity around Big Data, there is still a significant shortage of skilled professionals who can truly be called Data Scientists: people who can evaluate business needs and impact, write the algorithms, and program platforms such as Hadoop.
System Description
Automatic-Intelligent Model Building System
  YiMing Data has developed an automatic-intelligent model building system.
  The system has five components:
    Data exploration
    Data preparation
    Model building/validation/selection
    Automatic result generation/data scoring
    Model understanding
  All components reside within the data warehouse and can be used by company personnel without extensive model building training.
Case Study I: Insurance Rate Making
Data
Data Available for This Study
  Training data: about 1,370,000 insurance policies from all 16 units in a large metropolitan area.
  Validation and testing data: about 880,000 policies from the same metropolitan area in the following year.
  26 variables that this company has used for many years.
Study Goal
Study Goal
  Build a more precise pricing model that helps this company price its customers more fairly.
  Use price elasticity to attract more low-risk customers with lower premiums.
  Use price to discourage high-risk customers with higher premiums and, consequently, increase the profit margin.
  Learn model building and data preparation techniques from Dr. Wang through
Challenges
  Data Quality
  Modeling Tool Selection
Challenges – Data Quality
  Data quality issues:
    Large number of missing values
    Significant number of categorical variables with high cardinality
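Data Preparation – High Cardinality
  Taking the vehicle model variable as an example: for vehicle models with few insured units (fewer than 1,000 vehicles), y was predicted from x before and after smoothing, and the results were compared on 10-segment lift curves.
  After smoothing, the overfitting at both ends of the curve is reduced.
  Note: the Gini in the table is computed by ranking only the insured units of vehicle models with fewer than 1,000 vehicles. On the full test set, the Gini is 15.71% before smoothing and 15.73% after. Vehicle models with fewer than 1,000 insured units account for 14.5% of the total.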
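The slide does not say which smoothing method was used. A minimal sketch of one common choice, shrinking each category's mean target toward the global mean so that low-volume categories are not taken at face value, might look like the following. The function name and the `prior_weight` constant are assumptions made for illustration, not part of the original system.

```python
from collections import defaultdict


def smoothed_category_means(categories, targets, prior_weight=1000.0):
    """Shrink each category's mean target toward the global mean.

    prior_weight is a hypothetical tuning constant: a category with n
    observations puts weight n / (n + prior_weight) on its own mean and
    the remaining weight on the global mean.
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal length")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1

    encoded = {}
    for category, n in counts.items():
        weight = n / (n + prior_weight)
        encoded[category] = weight * (sums[category] / n) + (1.0 - weight) * global_mean
    return encoded, global_mean


# Toy data: category "B" has a single observation, so it is pulled
# strongly toward the global mean.
cats = ["A", "A", "A", "A", "B"]
ys = [1.0, 1.0, 1.0, 1.0, 0.0]
encoded, overall = smoothed_category_means(cats, ys, prior_weight=4.0)
```

With a larger `prior_weight`, rare vehicle models borrow more from the overall average, which is the behavior that reduces overfitting on sparse categories.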
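Data Preparation – Missing Values
  The eight variables covering the coverage combination and the prior-year claim records have 2^8 = 256 possible missing-value combinations; analyzing these combinations allows finer risk segmentation.
  To assess the predictive value of missingness, an MI variable indicating whether the value is missing was constructed for each of the eight variables, and an MVP variable was derived from their combination. Performance is compared before and after adding MVP.
  Performance
    Without MVP              74.04%
    With MVP smoothing       74.98%
    Improvement (relative)   1.27%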
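As a rough illustration of the idea, the sketch below builds one missing indicator per variable and packs the eight indicators into a single pattern code, which is why there are 2^8 = 256 possible values. The column names are hypothetical (the slides do not list the eight variables), and encoding the pattern as an integer is an assumption about how the combined variable could be represented.

```python
# Hypothetical names for the eight variables; the slides do not list them.
COLUMNS = ["cov_1", "cov_2", "cov_3", "cov_4",
           "claim_1", "claim_2", "claim_3", "claim_4"]


def missing_indicators(record, columns=COLUMNS):
    """Return one 0/1 flag per column: 1 if the value is None or absent."""
    return [1 if record.get(col) is None else 0 for col in columns]


def pattern_code(indicators):
    """Pack the flags into one integer; 8 flags give codes 0..255."""
    code = 0
    for flag in indicators:
        code = (code << 1) | flag
    return code


# Example record: cov_1 is explicitly missing and claim_4 is absent.
record = {"cov_1": None, "cov_2": "A", "cov_3": "B", "cov_4": "C",
          "claim_1": 0, "claim_2": 1, "claim_3": 0}
flags = missing_indicators(record)
code = pattern_code(flags)
n_patterns = 2 ** len(COLUMNS)
```

Because the pattern code is itself a categorical variable with up to 256 levels, it is plausible that the smoothing mentioned in the table is applied to it in the same way as to other high-cardinality variables, though the slide does not confirm this.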
Modeling – GLM
  Before: without adequate data preparation
  After: with adequate data preparation
  Model         Gini      C statistic
  Before        18.25%    86.50%
  After         20.28%    90.46%
  Improvement   11.12%    4.58%
  Conclusion: With adequate data preparation, model performance improved.
Modeling – Neural Network
  Before: without adequate data preparation
  After: with adequate data preparation
  Model         Count     Average Claim   Combined
  Before        75.10%    74.04%          86.50%
  After         75.86%    74.26%          92.18%
  Improvement   1.01%     0.30%           6.57%
  Conclusion: Using a neural network improved model performance.
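The Improvement figures in the GLM and neural network results are relative changes, (after - before) / before, rather than differences in percentage points. A short check, using only the numbers reported on the slides (the helper name is an assumption):

```python
def relative_improvement(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Values copied from the GLM and neural network slides (in percent).
glm_gini = round(relative_improvement(18.25, 20.28), 2)
glm_c = round(relative_improvement(86.50, 90.46), 2)
nn_count = round(relative_improvement(75.10, 75.86), 2)
nn_avg_claim = round(relative_improvement(74.04, 74.26), 2)
nn_combined = round(relative_improvement(86.50, 92.18), 2)
```

For example, the GLM Gini moves from 18.25% to 20.28%, a gain of 2.03 percentage points, which is 11.12% of the starting value.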
Pricing with Models
  A good model has a higher C statistic.
  Raise the premium for all customers whose predicted risk is higher than their current premium.
  Lower the premium for all customers whose predicted risk is lower than their current premium.
  The Gini index is higher after the new pricing.
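The slide gives the direction of each price change but not its size. A minimal sketch of the rule, assuming the predicted risk is expressed as an expected cost in the same units as the premium and using a hypothetical fixed adjustment step, could be:

```python
def reprice(current_premium, predicted_cost, step=0.05):
    """Move the premium up or down by a fixed fraction.

    step is a hypothetical adjustment size; the slides only state the
    direction of the change, not its magnitude.
    """
    if predicted_cost > current_premium:
        return current_premium * (1.0 + step)  # under-priced: raise
    if predicted_cost < current_premium:
        return current_premium * (1.0 - step)  # over-priced: lower
    return current_premium


raised = reprice(1000.0, 1200.0)
lowered = reprice(1000.0, 800.0)
unchanged = reprice(1000.0, 1000.0)
```

In practice the step would depend on price elasticity and regulatory limits, which are outside what the slides describe.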
Case Study II: Customer Risk Score
Data
Data Available for This Study
  Data came from several different sources.
  There are 4,800 default customers and 100,000 normal customers.
Study Goal
Study Goal
  Utilize the hidden value in the data to:
    Estimate the loan default risk of potential customers
    Assess the appropriate loan amount for potential customers
    Identify the low-risk, high-value new customer cluster so that the marketing and sales departments can expand
Challenges
  Data Quality
  Modeling Tool Selection
  Integration with Existing System
Challenges – Data Quality
  Data quality issues:
    Large number of missing values
    Data came from multiple sources at different times
    Some out-of-range values
    Significant number of categorical variables with high cardinality
    Nonlinear relationships between the target variable “FLAG” and many numerical predictors
Consequence – Missing Values (cont.)
  A large amount of information is lost.
  Predictive power is reduced.
  Prediction results are biased.
Consequence – Multiple Sources
  Ignoring the time dimension is the major reason that model performance is significantly lower on future scoring data than on the original modeling data.
Consequence – Out of Range Values
  Reduce the credibility of the study results.
  Bias the model parameter estimates.
  Decrease the scoring accuracy for future cases.
Consequence – High Cardinality
  Low-frequency categories produce unreliable results.
  High cardinality increases model dimensionality.
  Ignoring important high-cardinality variables can significantly reduce model reliability.
Data Quality – Non-linearity (cont.)
Consequence – Non-linearity
  Most statistical procedures, including both multiple regression and generalized linear models, cannot fit the data adequately because of the nonlinearity.
Modeling Tool Selection
  Statistical modeling tools (Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Generalized Linear Model, Least Angle Regression) cannot handle nonlinearity well.
  Tree-based modeling tools (Decision Trees, Random Forest, Gradient Boosting) cannot handle linearity effectively.
  Neural networks cannot handle high-cardinality categorical variables well, especially when some categories have very low frequency.
Integration with Existing System
  Modeling results must integrate seamlessly with our current system, since I want to:
    Integrate the model with our IT system seamlessly
    Allow sales personnel to use model results to improve their sales performance
    Allow the marketing department to use model results to identify new marketing opportunities and make sound marketing decisions
    Allow the internet sales department to approve loan applications in a timely manner
  I want to build models in a timely manner and spend more time using them to help my company.
Model Performance
Model Performance
  Model 1: uses all attributes provided by Company A.
  Model 2: uses all attributes except Online Consumption Data.
  Model 3: uses all attributes except Online Consumption Data and Other Consumption Data.
  Note: all date attributes were converted to a “numerical” length before modeling.
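The note on converting date attributes to a numerical length can be read as replacing each date with the elapsed time to a reference date. A minimal sketch under that assumption follows; the reference date, the unit (days), and the function name are all hypothetical choices, since the slides do not specify them.

```python
from datetime import date


def date_to_length(value, reference):
    """Days elapsed from value to reference; None stays None (missing)."""
    if value is None:
        return None
    return (reference - value).days


# Hypothetical reference date: the date of the modeling snapshot.
snapshot = date(2018, 11, 28)
account_age = date_to_length(date(2018, 1, 1), snapshot)
recent_event = date_to_length(date(2018, 11, 1), snapshot)
missing_event = date_to_length(None, snapshot)
```

Keeping missing dates as `None` rather than a sentinel number leaves the missing-value handling to the data preparation step.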
Model Performance
  Model performance based on c-statistics (AUC)
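For reference, the c-statistic is the probability that a randomly chosen default customer receives a higher score than a randomly chosen normal customer, with ties counted as one half. A direct sketch of that definition is below; it assumes labels coded 1 for default and 0 for normal, and the function name is illustrative.

```python
def c_statistic(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = c_statistic([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.5])
```

The pairwise loop is quadratic, so for data on the scale of this study (4,800 defaults against 100,000 normal customers) a rank-based computation would be the more practical choice.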
Model Performance
  Model performance based on catch rate
Model Performance
  Model performance based on lift
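The slides do not define catch rate or lift. Under the usual definitions, the catch rate at the top k% is the share of all default customers found among the k% highest-scored customers, and lift is the default rate in that group divided by the overall default rate. A minimal sketch assuming those definitions (the function name and the `top_fraction` parameter are illustrative):

```python
def catch_rate_and_lift(labels, scores, top_fraction=0.1):
    """Catch rate and lift for the highest-scored top_fraction of cases.

    labels are 1 for default and 0 for normal; higher scores mean higher risk.
    """
    n = len(labels)
    total_defaults = sum(labels)
    if n == 0 or total_defaults == 0:
        raise ValueError("need at least one case and one default")

    k = max(1, int(round(n * top_fraction)))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    caught = sum(labels[i] for i in ranked[:k])

    catch_rate = caught / total_defaults
    lift = (caught / k) / (total_defaults / n)
    return catch_rate, lift


# Toy example: 10 customers, 2 defaults, one of them in the top 20% by score.
toy_labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
catch, lift = catch_rate_and_lift(toy_labels, toy_scores, top_fraction=0.2)
```

The two measures are linked: at a given cutoff, lift equals the catch rate divided by the fraction of customers selected.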
Model Performance
  All models perform excellently.
  Online Consumption Data plays a very limited role in improving model performance.
  Other Consumption Data plays an important role in improving model performance.
Questions and Next Step