Dr. Morgan C. Wang Department of Statistics

1 An Automatic Intelligent Model Building System with Business Applications
Dr. Morgan C. Wang Department of Statistics University of Central Florida Morgan C. Wang 11/28/2018

2 OUTLINE
Introduction; System Description; Case Study I; Case Study II; Questions

3 Introduction

4 Accenture and GE Report
84% of companies believe that big data analytics will affect their industry in 2017. 89% of companies believe that, without adequate big data analytics capability, their competitive advantage will erode and their market share will shrink.

5 Accenture and GE Report
Only 16% of all companies use big data analytics to perform predictive analytics. Only 13% of all companies use big data analytics to optimize their work processes.

6 Accenture and GE Report
Analytics: summarizes facts, such as "Customer A's claim cost was $5,000 in 2016." Predictive analytics: accurately predicts what will happen in the future, such as "Customer A will file three claims in 2017, costing us a total of $6,000."

7 Accenture and GE Report
By 2020, the internet market will grow to 350 billion, and 20% of it will be data analytics. The analytics market will move from basic analytics to advanced analytics. In China alone, the market size will be 70 billion.

8 Shortage of Data Scientists
Despite the activity around Big Data, there is still a significant shortage of skilled professionals who can truly be called Data Scientists: people who can evaluate business needs and impact, write the algorithms, and program platforms such as Hadoop.

9 System Description

10 Automatic-Intelligent Model Building System
YiMing Data has developed an automatic-intelligent model building system. The system has five components: a data exploration component, a data preparation component, a model building/validation/selection component, an automatic result generation/data scoring component, and a model understanding component. All components reside within the data warehouse and can be used by company personnel without extensive model building training.

11 Case Study I Insurance Rate Making

12 Data

13 Data available for this study
Training data: about 1,370,000 insurance policies from all 16 units in a large metropolitan area. Validation and testing data: about 880,000 policies from the same metropolitan area in the following year. There are 26 variables, which this company has used for many years.

14 Study Goal

15 Study Goal
Build a more precise pricing model that helps this company target its customers more fairly. Use price elasticity to bring in more low-risk customers through lower premiums. Use price to discourage high-risk customers through higher premiums and, consequently, increase the profit margin. Learn model building and data preparation techniques from Dr. Wang through

16 Challenges
Data Quality; Modeling Tool Selection

17 Challenges – Data Quality
Data quality issues: a large amount of missing values; a significant number of categorical variables with high cardinality.

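18 Data Preparation – High Cardinality
Taking the vehicle model variable as an example: for vehicle models with few insured units (fewer than 1,000 vehicles), y was predicted from x before and after smoothing, and the 10-segment lift curves were compared. After smoothing, the overfitting seen in the segments at both ends is reduced.
Note: the Gini in the table was calculated by ranking only the insured units belonging to vehicle models with fewer than 1,000 insured vehicles. On the full test set, the Gini is 15.71% before smoothing and 15.73% after smoothing. Vehicle models with fewer than 1,000 insured vehicles account for 14.5% of the total.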
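The slide does not say which smoothing method was applied to the low-volume vehicle models. As a minimal sketch, assuming a common shrinkage approach in which each category mean is pulled toward the global mean, the idea could look like this. The function name and the prior weight `m` are hypothetical and are not taken from the presentation.

```python
from collections import defaultdict

def smoothed_category_means(categories, targets, m=1000.0):
    """Shrink each category's mean target toward the global mean.

    m is a hypothetical prior weight: a category with n rows is encoded as
    (n * category_mean + m * global_mean) / (n + m), so rare categories
    stay close to the global mean and frequent ones keep their own mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}

# Toy data: model "A" is common, model "B" is rare and has one extreme value.
models = ["A"] * 8 + ["B"] * 2
claims = [1.0] * 8 + [0.0, 10.0]
encoded = smoothed_category_means(models, claims, m=2.0)
```

In this toy data the raw mean for the rare model "B" is 5.0, while its smoothed value is pulled toward the global mean of 1.8, which is the kind of damping of low-frequency categories the slide describes.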

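19 Data Preparation – Missing Values
The 8 variables covering the coverage combination and the previous year's claim records have 2^8 = 256 possible combinations of missing values, and analyzing these combinations allows finer risk segmentation. To analyze the predictive effect of missingness, an MI variable indicating whether the value is missing was constructed for each of the 8 variables; in addition, MVP variables were derived from combinations of these indicators. The performance before and after adding MVP is compared below.
Performance
Without MVP            74.04%
With MVP smoothing     74.98%
Improvement            1.27%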
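A minimal sketch of the two constructions described on this slide, assuming that MI is a per-variable 0/1 missing flag and that MVP encodes the combination of those flags as a single pattern id. The column names `v1` to `v8` are hypothetical, since the slide does not list the 8 variables.

```python
def missing_indicators(row, columns):
    """Return one 0/1 missing-indicator (MI) flag per column."""
    return [1 if row.get(col) is None else 0 for col in columns]

def missing_pattern(row, columns):
    """Encode the MI flags as a single pattern id in [0, 2**len(columns))."""
    pattern = 0
    for flag in missing_indicators(row, columns):
        pattern = (pattern << 1) | flag
    return pattern

# Hypothetical column names standing in for the 8 variables on the slide.
COLUMNS = [f"v{i}" for i in range(1, 9)]
row = {"v1": 3, "v2": None, "v3": 1, "v4": 0,
       "v5": 2, "v6": 5, "v7": 4, "v8": None}

flags = missing_indicators(row, COLUMNS)   # one flag per variable
pattern_id = missing_pattern(row, COLUMNS) # one id per missing combination
n_patterns = 2 ** len(COLUMNS)             # 8 binary flags give 256 patterns
```

The pattern id is itself a high-cardinality categorical variable, which may be why the results table refers to "MVP smoothing".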

20 Modeling – GLM
Before: without adequate data preparation. After: with adequate data preparation.
Conclusion: with adequate data preparation, model performance improved.
Model          Gini      C statistic
Before         18.25%    86.50%
After          20.28%    90.46%
Improvement    11.12%    4.58%
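The Improvement figures are consistent with a relative change, (after - before) / before, rather than a difference in percentage points. A short check of that reading, using the Gini and C statistic values from this slide (the function name is illustrative):

```python
def relative_improvement(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

# Values copied from the GLM slide, in percent.
gini_gain = round(100 * relative_improvement(18.25, 20.28), 2)
c_stat_gain = round(100 * relative_improvement(86.50, 90.46), 2)
```

Both results match the slide: 11.12 for Gini and 4.58 for the C statistic.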

21 Modeling – Neural Network
Before: without adequate data preparation. After: with adequate data preparation.
Conclusion: with a neural network, model performance improved.
               Count     Average Claim   Combined
Before         75.10%    74.04%          86.50%
After          75.86%    74.26%          92.18%
Improvement    1.01%     0.30%           6.57%

22 Pricing with Models
A good model has a higher C statistic. Raise the premium for all customers whose predicted risk is higher than their current premium. Lower the premium for all customers whose predicted risk is lower than their current premium. The Gini index is higher after the new pricing.
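The pricing rule on this slide can be sketched as follows. This is an illustration only: it assumes the predicted risk is expressed as an expected claim cost in the same units as the premium, and the fixed adjustment fraction `step` is a hypothetical parameter that does not come from the presentation.

```python
def adjust_premium(current_premium, predicted_cost, step=0.10):
    """Move the premium toward the predicted risk by a fixed fraction.

    `step` is a hypothetical adjustment size, not a value from the slides.
    """
    if predicted_cost > current_premium:
        return current_premium * (1 + step)   # under-priced: raise premium
    if predicted_cost < current_premium:
        return current_premium * (1 - step)   # over-priced: lower premium
    return current_premium

raised = adjust_premium(1000.0, predicted_cost=1200.0)
lowered = adjust_premium(1000.0, predicted_cost=800.0)
unchanged = adjust_premium(1000.0, predicted_cost=1000.0)
```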

23 Case Study II Customer Risk Score

24 Data

25 Data available for this study
The data came from several different sources. There are 4,800 default customers and 100,000 normal customers.

26 Study Goal

27 Study Goal
Utilize the hidden value in the data to: estimate the loan default risk of potential customers; assess an adequate loan amount for potential customers; identify the low-risk, high-value new customer cluster so that the marketing and sales departments can expand.

28 Challenges
Data Quality; Modeling Tool Selection

29 Challenges
Data Quality; Modeling Tool Selection; Integration with Existing System

30 Challenges – Data Quality
Data quality issues: a large amount of missing values; data that came from multiple sources at different times; some out-of-range values; a significant number of categorical variables with high cardinality; the existence of non-linear relationships between the target variable "FLAG" and many numerical predictors.

31 Consequence – Missing Values (cont.)
A large amount of information will be lost. Predictive power will be reduced. Prediction results will be biased.

32 Consequence – Multiple Sources
Ignoring the time dimension is the major reason that model performance is significantly lower on future scoring data than on the original modeling data.
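One common way to respect the time dimension is an out-of-time split, in which the model is validated only on records observed after the training period (Case Study I did something similar by validating on the following year's policies). A minimal sketch, assuming each record carries a hypothetical `obs_date` field:

```python
from datetime import date

def out_of_time_split(records, cutoff):
    """Train on records observed before `cutoff`, validate on the rest."""
    train = [r for r in records if r["obs_date"] < cutoff]
    valid = [r for r in records if r["obs_date"] >= cutoff]
    return train, valid

# Toy records; `id` and `obs_date` are hypothetical field names.
records = [
    {"id": 1, "obs_date": date(2016, 3, 1)},
    {"id": 2, "obs_date": date(2016, 11, 15)},
    {"id": 3, "obs_date": date(2017, 2, 1)},
]
train, valid = out_of_time_split(records, cutoff=date(2017, 1, 1))
```

Performance measured on `valid` is then a closer proxy for how the model may behave on future scoring data than a random split of the same period would be.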

33 Consequence – Out of Range Values
Out-of-range values reduce the credibility of the study results, bias the model parameter estimates, and decrease scoring accuracy for future cases.
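The presentation does not state how out-of-range values were treated. One simple option, shown here as a sketch under that assumption, is to flag values outside a plausible range and treat them as missing so they do not distort parameter estimates. The bounds and the `ages` example are hypothetical.

```python
def clean_out_of_range(values, low, high):
    """Replace values outside [low, high] with None and count how many."""
    cleaned = []
    n_flagged = 0
    for v in values:
        if v is None:
            cleaned.append(None)          # already missing, keep as missing
        elif low <= v <= high:
            cleaned.append(v)             # plausible value, keep unchanged
        else:
            cleaned.append(None)          # implausible value, set to missing
            n_flagged += 1
    return cleaned, n_flagged

ages = [34, 27, -1, 199, None, 58]
cleaned, n_flagged = clean_out_of_range(ages, low=18, high=100)
```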

34 Consequence – High Cardinality
Low-frequency categories produce unreliable results. High cardinality increases model dimensionality. Ignoring important high-cardinality variables can significantly reduce model reliability.

35-40 Data Quality – Non-linearity (cont.)

41 Consequence – Non-linearity
Most statistical procedures, including both multiple regression and generalized linear models, cannot fit the data adequately because of nonlinearity.

42 Challenges
Data Quality; Modeling Tool Selection; Integration with Existing System

43 Modeling Tool Selection
Statistical modeling tools (Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Generalized Linear Model, Least Angle Regression) cannot handle nonlinearity well. Tree-based modeling tools (Decision Trees, Random Forest, Gradient Boosting) cannot handle linearity effectively. Neural networks cannot handle high-cardinality categorical variables very well, especially when some categories have very low frequency.

44 Challenges
Data Quality; Modeling Tool Selection; Integration with Existing System

45 Integration with Existing System
Integrate modeling results with our current system seamlessly, since I want to: integrate the model with our IT system seamlessly; allow sales personnel to use model results to improve their sales performance; allow the marketing department to use model results to identify new marketing opportunities and make wise marketing decisions; allow the internet sales department to approve loan applications in a timely manner. I want to build models in a timely manner and spend more time using them to help my company.

46 Model Performance

47 Model Performance
Model 1: uses all attributes provided by Company A.
Model 2: uses all attributes except Online Consumption Data.
Model 3: uses all attributes except Online Consumption Data and Other Consumption Data.
Note: all date attributes were converted to a numerical length before modeling.
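The note about converting date attributes to a numerical length can be illustrated as below. This sketch assumes the length is the number of days between the date attribute and a reference date; the slide does not specify the unit or the reference point, and the variable names are hypothetical.

```python
from datetime import date

def days_since(event_date, reference_date):
    """Number of days elapsed from `event_date` to `reference_date`."""
    return (reference_date - event_date).days

# Hypothetical attribute: account opening date, measured at a scoring date.
account_opened = date(2015, 6, 1)
scoring_date = date(2017, 6, 1)
tenure_days = days_since(account_opened, scoring_date)
```

Using a fixed reference date for every record keeps the resulting lengths comparable across customers.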

48 Model Performance
Model performance based on c-statistics (AUC)

49 Model Performance
Model performance based on Catch Rate

50 Model Performance
Model performance based on Lift
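The three measures named on slides 48 to 50 can be computed as in the sketch below. The definitions are assumptions, since the slides show only charts: the c-statistic is taken as the share of (default, normal) pairs ranked correctly, catch rate as the share of all defaults found in the top-scored fraction of cases, and lift as the default rate in that fraction divided by the overall default rate. The scores and labels are toy data.

```python
def c_statistic(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))

def _top_labels(scores, labels, top_fraction):
    """Labels of the highest-scored cases, at least one case."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return [y for _, y in ranked[:k]]

def catch_rate(scores, labels, top_fraction):
    """Share of all positives captured in the top-scored fraction."""
    return sum(_top_labels(scores, labels, top_fraction)) / sum(labels)

def lift(scores, labels, top_fraction):
    """Positive rate in the top-scored fraction over the overall rate."""
    top = _top_labels(scores, labels, top_fraction)
    return (sum(top) / len(top)) / (sum(labels) / len(labels))

# Toy data: 1 marks a default customer, 0 a normal customer.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
auc = c_statistic(scores, labels)
cr = catch_rate(scores, labels, top_fraction=0.2)
lf = lift(scores, labels, top_fraction=0.2)
```

The pairwise loop is quadratic in the number of cases, so for data of the size described in Case Study II a rank-based AUC computation would be preferable in practice.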

51 Model Performance
The performance of all models is excellent. Online Consumption Data plays a very limited role in improving model performance. Other Consumption Data plays an important role in improving model performance.

52 Questions and Next Steps

