Free and Cheap Sources of External Data CAS 2007 Predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc.
Objectives Information sharing Introduce some useful sources of data to augment company internal databases Show examples of applications using external data
Why Augment Data? For small companies and new lines of business, internal data may not be sufficient Add variables (e.g., demographic and economic) that are not in the internal data
Some Kinds of External Data Demographic Geographic Economic –Unemployment rate, avg wage, etc –Financial Market Insurance data Occupational Weather
Zip Code Level Data The Census Bureau web site has a wealth of information: www.census.gov May require some processing effort to put into a useful format for analysis For a small fee, vendors pre-process some of the useful data One of them is zip-codes.com
Zip-codes.com
Some Useful Variables Average Income Population Average house value # people per house Latitude, longitude –Use to compute distances City, county
Distance formula
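The formula shown on this slide is not preserved in the text. One common way to turn two latitude/longitude pairs into a distance is the haversine great-circle formula; the sketch below assumes that choice and is not necessarily the formula from the slide. The function name and the Earth radius constant are illustrative assumptions.

```python
import math

# Assumed constant: mean Earth radius in miles (not taken from the slides).
EARTH_RADIUS_MILES = 3958.8


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))
```

With ZIP centroid coordinates from a vendor file, a call such as `haversine_miles(zip_lat, zip_lon, city_lat, city_lon)` could give, for example, the distance from each ZIP code to a city center.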
The Data
California Auto Data by ZIP BI Exposures BI Losses BI Claims PD Exposures PD Losses PD Claims
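Exposures, losses, and claim counts by ZIP code are enough to compute the usual ratemaking statistics. The sketch below shows the standard definitions of frequency, severity, and pure premium; the record layout and field names are hypothetical, not the layout of the California file.

```python
from collections import namedtuple

# Hypothetical record layout for one coverage (BI or PD) in one ZIP code.
ZipRecord = namedtuple("ZipRecord", "zip_code exposures losses claims")


def rating_stats(rec):
    """Return frequency, severity, and pure premium for one ZIP-level record.

    A statistic is None when its denominator is zero.
    """
    frequency = rec.claims / rec.exposures if rec.exposures else None
    severity = rec.losses / rec.claims if rec.claims else None
    pure_premium = rec.losses / rec.exposures if rec.exposures else None
    return {
        "frequency": frequency,
        "severity": severity,
        "pure_premium": pure_premium,
    }
```

Pure premium equals frequency times severity, so ZIP codes can be grouped by any of the three once the statistics are attached to each record.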
CAARP Data California Auto Assigned Risk Plan (CAARP) data Collected by state Aggregated data Request from Statistical Analysis Division of department
California Proposed Changes to Territory Rating
Effect of Change by County
Effect of Change by Pure Premium Group
Effect of Change by Average House Value
Effect of Change by Average Income
The Data used for Fraud Model Described in “Distinguishing the Forest From the Trees”, Derrig and Francis, 2005 CAS Winter Forum
The Fraud Surrogates used as Dependent Variables Independent Medical Exam (IME) requested Special Investigation Unit (SIU) referral –(IME successful) –(SIU successful) Data: Detailed Auto Injury Claim Database for Massachusetts Accident Years ( )
Predictor Variables Claim file variables –Provider bill, Provider type –Injury Derived from claim file variables –Attorneys per zip code –Docs per zip code Using external data –Average household income –Households per zip
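The derived and external predictors above can be attached to claim records with a simple lookup on ZIP code. The sketch below is one possible way to do this, assuming that "attorneys per zip code" means the number of distinct attorneys appearing on claims in that ZIP; the field names (`zip`, `attorney`, `avg_household_income`, `households`) are hypothetical.

```python
from collections import defaultdict


def distinct_per_zip(claims, field):
    """Count distinct non-empty values of `field` among claims in each ZIP code."""
    seen = defaultdict(set)
    for claim in claims:
        value = claim.get(field)
        if value:
            seen[claim["zip"]].add(value)
    return {zip_code: len(values) for zip_code, values in seen.items()}


def add_predictors(claims, zip_table):
    """Return copies of the claims with derived and external ZIP-level fields added.

    `zip_table` maps ZIP code -> dict of external variables (assumed layout).
    External fields are None when the ZIP code is missing from the table.
    """
    attorneys = distinct_per_zip(claims, "attorney")
    enriched = []
    for claim in claims:
        external = zip_table.get(claim["zip"], {})
        row = dict(claim)
        row["attorneys_per_zip"] = attorneys.get(claim["zip"], 0)
        row["avg_household_income"] = external.get("avg_household_income")
        row["households"] = external.get("households")
        enriched.append(row)
    return enriched
```

Leaving unmatched ZIP codes as None rather than dropping the claim keeps the claim count unchanged; how to treat those missing values is a separate modeling decision.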
Neural Network Ranking of Variables
Variable Importance for IME Requested for 3 Methods
Variable Importance (IME) Based on Average of Methods
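The chart for this slide is not preserved, and the text does not say how the methods were averaged. As an illustration only, the sketch below assumes each method reports a non-negative importance score per variable, rescales each method so its top variable scores 1, and averages the rescaled scores. The function name and input layout are hypothetical.

```python
def average_importance(scores_by_method):
    """Average variable importance across methods after rescaling each to a 0-1 range.

    `scores_by_method` maps method name -> {variable: non-negative score}.
    A variable missing from a method is treated as having importance 0 there.
    Returns (variable, average) pairs sorted from most to least important.
    """
    variables = set()
    for scores in scores_by_method.values():
        variables.update(scores)

    averages = {}
    for variable in variables:
        total = 0.0
        for scores in scores_by_method.values():
            top = max(scores.values())
            total += scores.get(variable, 0.0) / top if top else 0.0
        averages[variable] = total / len(scores_by_method)

    return sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

Rescaling first keeps a method with large raw scores from dominating the average; averaging ranks instead of scores would be another reasonable choice.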
Trends Using External Information People still rely on Masterson’s indices and other indices based on the CPI Shortcomings –Hedonic adjustment –Substitution –Imputed rental cost –Geometric chaining –See www.shadowstats.com or Getting Prices Right by Economic Policy Institute and Dean Baker Insurance inflation has typically been much higher than these indications Many need reliable trend indications on smaller segments of their data Trend is another weak link in the modeling process
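One way to get a trend indication for a segment of data or an external index is a log-linear (exponential) fit to equally spaced observations. The slides do not specify a fitting method, so the sketch below is an assumption; the function name is hypothetical.

```python
import math


def exponential_trend(values):
    """Fit log(value) = a + b * t by least squares and return the per-period trend.

    `values` are positive observations at equally spaced periods t = 0, 1, 2, ...
    The result is exp(b) - 1, e.g. 0.05 for a 5% trend per period.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = list(range(n))
    ys = [math.log(v) for v in values]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return math.exp(slope) - 1
```

Fitting the same function to a company's own average severities and to an external index gives two trend rates that can be compared directly. With few observations, as in a small segment, the fitted rate will be volatile.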
Questions?