Dr. Morgan C. Wang Department of Statistics


An Automatic Intelligent Model Building System with Business Applications Dr. Morgan C. Wang Department of Statistics University of Central Florida Morgan C. Wang 11/28/2018

OUTLINE
- Introduction
- System Description
- Case Study I
- Case Study II
- Questions

Introduction

Accenture and GE Report
- 84% of companies believe that big data analytics will affect their industry in 2017.
- 89% of companies believe that, without adequate big data analytics capability, their competitive advantage will erode and their market share will shrink.

Accenture and GE Report
- Only 16% of all companies use big data analytics to perform predictive analytics.
- Only 13% of all companies use big data analytics to optimize their work processes.

Accenture and GE Report
- Analytics: summarizes facts, such as "Customer A's claim cost was $5,000 in 2016."
- Predictive analytics: accurately predicts what will happen in the future, such as "Customer A will file three claims costing us a total of $6,000 in 2017."

Accenture and GE Report
- By 2020, the internet market will grow to 350 billion, of which 20% will be data analytics.
- The analytics market will move from basic analytics to advanced analytics.
- In China alone, the market size will be 70 billion.

Shortage of Data Scientists
Despite the activity around Big Data, there is still a significant shortage of skilled professionals who can truly be called data scientists: people who can evaluate business needs and impact, write the algorithms, and program platforms such as Hadoop.

System Description

Automatic-Intelligent Model Building System
YiMing Data has developed an automatic-intelligent model building system. The system has five components:
- Data exploration
- Data preparation
- Model building/validation/selection
- Automatic result generation/data scoring
- Model understanding
All components reside within the data warehouse and can be used by company personnel without extensive model building training.

Case Study I: Insurance Rate Making

Data

Data available for this study
- Training data: about 1,370,000 insurance policies from all 16 units in a large metropolitan area
- Validation and testing data: about 880,000 policies from the same metropolitan area in the following year
- 26 variables that this company has used for many years

Study Goal

Study Goal
- Build a more precise pricing model that helps this company price its customers more fairly
- Use price elasticity to bring in more low-risk customers through lower premiums
- Use price to discourage high-risk customers through higher premiums and, consequently, increase the profit margin
- Learn model building and data preparation techniques from Dr. Wang through

Challenges
- Data Quality
- Modeling Tool Selection

Challenges – Data Quality
Data quality issues:
- A large number of missing values
- A significant number of categorical variables with high cardinality

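Data Preparation – High Cardinality
Taking the vehicle-model variable as an example: for vehicle models with few insured units (fewer than 1,000 vehicles), y was predicted both before and after smoothing x, and the results were compared on 10-segment lift curves. After smoothing, the overfitting at both ends of the book of business is reduced.
[Figure: 10-segment lift curves before and after smoothing]
Note: the Gini in the table is computed by ranking only the exposures from vehicle models with fewer than 1,000 insured units. On the full test set, the Gini is 15.71% before smoothing and 15.73% after. Vehicle models with fewer than 1,000 insured units account for 14.5% of exposures.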
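
The slide does not say which smoothing method was applied to the vehicle-model variable. As a minimal sketch, assuming a simple shrinkage toward the global mean (one common way to smooth sparse categories), the idea could look like the following. The function name and the `prior_weight` constant are hypothetical and not taken from the presentation.

```python
from collections import defaultdict


def smoothed_category_means(categories, targets, prior_weight=1000.0):
    """Shrink each category's mean target toward the global mean.

    prior_weight is a hypothetical tuning constant: a category with exactly
    that many rows gives equal weight to its own mean and the global mean.
    """
    if len(categories) != len(targets) or not targets:
        raise ValueError("categories and targets must be non-empty and equal length")

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    # Rare categories end up close to the global mean; frequent ones keep
    # a value close to their own observed mean.
    return {
        cat: (sums[cat] + prior_weight * global_mean) / (counts[cat] + prior_weight)
        for cat in counts
    }


if __name__ == "__main__":
    # Toy data: model "B" has a single row, so its raw mean is unreliable.
    cats = ["A", "A", "A", "A", "B"]
    ys = [1, 1, 1, 1, 0]
    print(smoothed_category_means(cats, ys, prior_weight=1.0))
```

The smoothed value replaces the raw category as a numeric predictor, so sparse vehicle models no longer drive extreme predictions at either end of the lift curve.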

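Data Preparation – Missing Values
The eight variables covering the coverage combination and the prior-year claim record have 2^8 = 256 possible missingness combinations; analyzing these combinations allows risk to be segmented further.
To assess the predictive effect of missingness, a missing-indicator (MI) variable was constructed for each of the eight variables, and an MVP variable was derived from their combinations. The table compares performance before and after adding MVP.

Performance
Without MVP          74.04%
With MVP smoothing   74.98%
Improvement          1.27%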
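
The construction of MI flags and a combined pattern variable can be sketched as follows. This is an illustration only: the variable names `x1` to `x8` are hypothetical, `None` is assumed to mark a missing value, and the pattern string is one plausible way to encode the combination that the slide calls MVP.

```python
def missing_indicators(row, variables):
    """Return one 0/1 missing-indicator (MI) flag per variable.

    A value of None, or an absent key, is treated as missing.
    """
    return [1 if row.get(name) is None else 0 for name in variables]


def missing_pattern(row, variables):
    """Combine the MI flags into one pattern string such as '01000001'."""
    return "".join(str(flag) for flag in missing_indicators(row, variables))


# Hypothetical names standing in for the eight variables on the slide.
VARIABLES = [f"x{i}" for i in range(1, 9)]

example_row = {
    "x1": 3.2, "x2": None, "x3": 0, "x4": 1,
    "x5": 7.5, "x6": 2, "x7": 0, "x8": None,
}

if __name__ == "__main__":
    print(missing_indicators(example_row, VARIABLES))
    print(missing_pattern(example_row, VARIABLES))
    # Eight binary flags give 2 ** 8 possible patterns.
    print(2 ** len(VARIABLES))
```

Because the pattern variable can take up to 256 levels, it behaves like a high-cardinality categorical variable, which may be why the slide pairs it with smoothing.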

Modeling – GLM
Before: without adequate data preparation
After: with adequate data preparation
Conclusion: with adequate data preparation, model performance improved.

Model        Gini     C statistic
Before       18.25%   86.50%
After        20.28%   90.46%
Improvement  11.12%   4.58%
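
The Improvement figures on this slide are consistent with a relative change, (after - before) / before, rather than a difference in percentage points. A short check under that reading (the function name is ours):

```python
def relative_improvement(before, after):
    """Relative change from before to after, expressed in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100.0


if __name__ == "__main__":
    # GLM figures from the slide: Gini and C statistic, before and after.
    print(f"{relative_improvement(18.25, 20.28):.2f}")  # 11.12
    print(f"{relative_improvement(86.50, 90.46):.2f}")  # 4.58
```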

Modeling – Neural Network
Before: without adequate data preparation
After: with adequate data preparation
Conclusion: using a neural network, model performance improved.

Model        Count    Average Claim   Combined
Before       75.10%   74.04%          86.50%
After        75.86%   74.26%          92.18%
Improvement  1.01%    0.30%           6.57%

Pricing with Models
- A good model has a higher C statistic
- Higher premiums for all customers whose predicted risk is higher than their current premium
- Lower premiums for all customers whose predicted risk is lower than their current premium
- Higher Gini index after the new pricing

Case Study II: Customer Risk Score

Data

Data available for this study
- Data came from several different sources
- There are 4,800 default customers and 100,000 normal customers

Study Goal

Study Goal
Utilize the hidden value in the data to:
- Estimate the loan default risk for potential customers
- Assess the appropriate loan amount for potential customers
- Identify the low-risk, high-value cluster of new customers so that the marketing and sales departments can expand

Challenges
- Data Quality
- Modeling Tool Selection
- Integration with Existing System

Challenges – Data Quality
Data quality issues:
- A large number of missing values
- Data came from multiple sources at different times
- Some out-of-range values
- A significant number of categorical variables with high cardinality
- A non-linear relationship between the target variable "FLAG" and many numerical predictors

Consequence – Missing Values (cont.)
- A large amount of information is lost
- Predictive power is reduced
- Prediction results are biased

Consequence – Multiple Sources
Ignoring the time dimension is the major reason that model performance is significantly lower on future scoring data than on the original modeling data.

Consequence – Out of Range Values
- Reduce the credibility of the study results
- Bias the estimation of model parameters
- Decrease scoring accuracy on future cases

Consequence – High Cardinality
- Low-frequency categories produce unreliable results
- High cardinality increases model dimensionality
- Ignoring important high-cardinality variables can significantly reduce model reliability

Data Quality – Non-linearity (cont.) [six slides of figures]

Consequence – Non-linearity
Most statistical procedures, including both multiple regression and generalized linear models, cannot fit the data adequately because of the nonlinearity.

Modeling Tool Selection
- Statistical modeling tools (Linear Regression, Ridge Regression, Lasso Regression, Elastic Net, Generalized Linear Model, Least Angle Regression) cannot handle nonlinearity well
- Tree-based modeling tools (Decision Trees, Random Forest, Gradient Boosting) cannot handle linear relationships effectively
- Neural networks cannot handle high-cardinality categorical variables very well, especially when some categories have very low frequency

Integration with Existing System
Integrate modeling results with our current system seamlessly, since I want to:
- Integrate the model with our IT system seamlessly
- Allow sales personnel to use model results to improve their sales performance
- Allow the marketing department to use model results to identify new marketing opportunities and make wise marketing decisions
- Allow the internet sales department to approve loan applications in a timely manner
I want to build models in a timely manner and spend more time using them to help my company.

Model Performance

Model Performance
- Model 1: uses all attributes provided by Company A
- Model 2: uses all attributes except Online Consumption Data
- Model 3: uses all attributes except Online Consumption Data and Other Consumption Data
Note: all date attributes were converted to numerical lengths before modeling.
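
The note about date attributes does not say how the conversion was done. One plausible reading, shown here purely as an assumption, is that each date becomes the number of days elapsed up to a fixed reference date. The function name and the example dates are hypothetical.

```python
from datetime import date


def days_since(event_date, reference_date):
    """Convert a date attribute to a numeric length: days elapsed until reference_date."""
    return (reference_date - event_date).days


if __name__ == "__main__":
    # Hypothetical example: an account opened on 2018-01-01, scored on 2018-11-28.
    print(days_since(date(2018, 1, 1), date(2018, 11, 28)))
```

A numeric length of this kind can be fed to any of the modeling tools listed earlier, whereas a raw calendar date cannot.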

Model Performance: based on the c-statistic (AUC)
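
For reference, the c-statistic equals the probability that a randomly chosen default customer receives a higher score than a randomly chosen normal customer, with ties counted as one half. The sketch below computes it directly from that definition. It assumes labels coded 1 for default and 0 for normal, and it is quadratic in the number of rows, so it is meant for illustration rather than for data of the size used in this study.

```python
def c_statistic(labels, scores):
    """Compute the c-statistic (AUC) as the share of concordant pairs.

    labels: 1 for the event (for example, default), 0 otherwise.
    scores: model scores, where a higher score means higher predicted risk.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")

    concordant = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5  # ties count as half a concordant pair
    return concordant / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(c_statistic([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```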

Model Performance: based on catch rate

Model Performance: based on lift
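
The slides do not define catch rate or lift. The sketch below assumes the usual definitions: rank customers by score, take the top fraction, and report the share of all defaults captured there (catch rate) and the ratio of the default rate in that group to the overall default rate (lift). The function name and the `top_fraction` default are our own choices.

```python
def catch_rate_and_lift(labels, scores, top_fraction=0.1):
    """Return (catch rate, lift) for the top_fraction highest-scored cases.

    labels: 1 for the event (for example, default), 0 otherwise.
    scores: model scores, where a higher score means higher predicted risk.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    total_events = sum(labels)
    if total_events == 0:
        raise ValueError("need at least one event")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    captured = sum(label for _, label in ranked[:k])

    catch_rate = captured / total_events
    lift = (captured / k) / (total_events / len(ranked))
    return catch_rate, lift


if __name__ == "__main__":
    labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    scores = [0.9, 0.2, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.15, 0.1]
    print(catch_rate_and_lift(labels, scores, top_fraction=0.2))  # (0.5, 2.5)
```

Under these definitions lift equals the catch rate divided by the fraction of customers selected, so the two charts carry closely related information.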

Model Performance
- The performance of all models is excellent
- Online Consumption Data plays a very limited role in improving model performance
- Other Consumption Data plays an important role in improving model performance

Questions and Next Steps