Published by Kellie Reynolds. Modified over 9 years ago.
1
Dancing With Dirty Data: Methods for Exploring and Cleaning Data
2005 CAS Ratemaking Seminar
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
Louise_francis@msn.com
www.data-mines.com
2
Objectives
  Discuss topic of data quality
  Present methods for screening data for problems
    Errors
    Missing data
  Present methods of fixing problems
3
Data Quality: A Problem
  Actuary doing reconciliation on February 15
4
It’s Not Just Us
  “In just about any organization, the state of information quality is at the same low level”
  Olson, Data Quality
5
AAA Standards of Practice
  AAA SOP: Review data for completeness, accuracy and relevance
  IDMA and CAS White Paper: Evaluate data for
    Validity
    Accuracy
    Reasonableness
    Completeness
6
Example Data
  Private passenger auto
  Some variables are:
    Age
    Gender
    Marital status
    Zip code
    Earned premium
    Number of claims
    Incurred losses
    Paid losses
7
Screening Data
  Many of the methods have been in use for a while
  Pioneered in the field of exploratory data analysis
  More recently: missing data methods for dealing with some quality problems
8
Some Methods for Numeric Data
  Visual
    Histograms
    Box and Whisker Plots
  Statistical
    Descriptive statistics
    Data spheres
9
Histograms
  Can be produced in Microsoft Excel
10
Histograms
  Frequencies for Age Variable
11
Histograms of Age Variable
  Varying Window Size
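The effect of window size can be sketched outside of Excel as well. The following is a minimal illustration, not the presenter's workbook: the `ages` list is hypothetical, and `histogram` is a helper written for this example. It bins the same values at three window widths so that a suspicious value (here 0 and 103) can be seen to stand apart or blend in depending on the width.

```python
from collections import Counter

def histogram(values, width):
    """Count values in bins [k*width, (k+1)*width); returns {bin_start: count}."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical ages; 0 and 103 are deliberately suspicious entries.
ages = [0, 17, 18, 22, 25, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 71, 103]

for width in (5, 10, 20):
    print(f"bin width {width}:")
    for start, n in histogram(ages, width).items():
        # One '#' per record in the bin gives a crude text histogram.
        print(f"  {start:>3}-{start + width - 1:<3} {'#' * n}")
```

Narrow windows isolate the odd values in their own bins; wide windows can absorb them into neighboring bins.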
12
Formula for Window Width
13
Example of Suspicious Value
14
Discrete-Numeric Data
15
Filtered Data
  Filter out Unwanted Records
16
Box and Whisker Plot
17
Box and Whisker Example
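A box and whisker plot is built from the quartiles, and one common convention (assumed here, since the slide image is not available) flags points more than 1.5 interquartile ranges beyond the box. A minimal sketch with hypothetical ages:

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (low, high) fences at k interquartile ranges beyond Q1 and Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages; 210 is a likely keying error.
ages = [17, 18, 22, 25, 25, 31, 34, 38, 41, 45, 47, 52, 58, 63, 71, 103, 210]

low, high = tukey_fences(ages)
outliers = [a for a in ages if a < low or a > high]
print(f"fences: {low} to {high}; flagged: {outliers}")
```

Note that 103 falls inside the fences for this sample, so a value can be implausible for the business without being a statistical outlier.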
18
Plot of Heavy Tailed Data
  Paid Losses
19
Heavy Tailed Data – Log Scale
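Why the log scale helps can be shown with a small, hypothetical set of paid losses. On a linear scale nearly every record lands in the first bin; grouping by order of magnitude spreads the records out. This is only a sketch and assumes all losses are positive, since the logarithm is undefined at zero and below.

```python
import math
from collections import Counter

# Hypothetical paid losses with one very large claim.
paid_losses = [120, 450, 800, 1_500, 2_300, 4_000, 9_500, 18_000, 75_000, 1_200_000]

# Linear bins of width 100,000: the large claim stretches the axis.
linear_bins = Counter(int(x // 100_000) for x in paid_losses)

# Log10 bins: one bin per order of magnitude (2 means 100 to 999, and so on).
log_bins = Counter(math.floor(math.log10(x)) for x in paid_losses)

print("linear:", dict(sorted(linear_bins.items())))
print("log10: ", dict(sorted(log_bins.items())))
```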
20
Descriptive Statistics
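A table of descriptive statistics can be produced with a few lines of code. The sketch below uses a hypothetical `ages` list in which `None` marks a missing value; `describe` is a helper written for this example. The minimum, maximum and missing count are often the quickest way to spot impossible values.

```python
import statistics

def describe(values):
    """Summarize a numeric column, treating None as missing."""
    present = [v for v in values if v is not None]
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }

# Hypothetical ages: 0 and 103 deserve a second look.
ages = [25, 31, None, 47, 52, 0, 38, None, 103, 44]

summary = describe(ages)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```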
21
Mahalanobis Distance
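The Mahalanobis distance of a record x from the mean μ, given covariance matrix Σ, is D(x) = sqrt((x - μ)^T Σ^{-1} (x - μ)). It can flag a record whose individual values look plausible but whose combination does not. The sketch below handles only two variables so the matrix inverse can be written out by hand; the data and the `years_licensed` variable are hypothetical and not from the presentation.

```python
import math
import statistics

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Sample variances and covariance (divisor n - 1).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # zero if the variables are perfectly collinear
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form using the closed-form inverse of a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical (age, years_licensed) pairs; the last is 24 years old
# with 38 years licensed: each value is in range, the pair is not.
points = [(20, 2), (30, 12), (40, 20), (50, 32), (60, 42), (70, 50), (24, 38)]

distances = mahalanobis_2d(points)
for point, d in zip(points, distances):
    print(point, round(d, 2))
```

The inconsistent record receives the largest distance even though neither of its values is extreme on its own.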
22
Data Spheres
  Example: Longitude and Latitude
23
Sample from Highest Percentile
24
Categorical Data: Data Cubes
25
Example – Marital Status
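For a categorical variable such as marital status, a simple frequency count already exposes inconsistent coding. The following sketch assumes a hypothetical list of permitted codes (`VALID_MARITAL_STATUS`) and made-up records:

```python
from collections import Counter

# Hypothetical permitted codes: Married, Single, Divorced, Widowed.
VALID_MARITAL_STATUS = {"M", "S", "D", "W"}

# Hypothetical values as they might appear in a raw file.
records = ["M", "S", "m", "S", "", "D", "Married", "M", "W", "1", "S"]

counts = Counter(records)
unexpected = {code: n for code, n in counts.items()
              if code not in VALID_MARITAL_STATUS}

print("all codes: ", dict(counts))
print("unexpected:", unexpected)
```

Lowercase variants, spelled-out values, numeric codes and blanks all surface as separate categories in the tabulation.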
26
Screening for Missing Data
27
Blanks as Missing
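Counting blanks per field is a quick first screen for missing data. A minimal sketch, using a small hypothetical extract in CSV form; fields that are empty or contain only whitespace are treated as missing:

```python
import csv
import io

# Hypothetical extract; column names follow the example variables.
raw = """age,gender,marital_status,zip_code
34,F,M,19103
,M,S,19104
51, ,M,
29,F,,19010
"""

def missing_counts(rows):
    """Count empty or whitespace-only values per field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            if value is None or value.strip() == "":
                counts[field] = counts.get(field, 0) + 1
    return counts

rows = list(csv.DictReader(io.StringIO(raw)))
blank_counts = missing_counts(rows)
print(blank_counts)
```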
28
Types of Missing Values
  Missing completely at random
  Missing at random
  Informative missing
29
Methods for Missing Values
  Drop record if any variable used in model is missing
  Drop variable
  Data imputation
  Other
    CART, MARS use surrogate variables
    Expectation Maximization
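The trade-off between the first two options can be sketched with hypothetical records, where `None` marks a missing value. Dropping every incomplete record can discard much of the file; dropping a poorly populated variable keeps more records but loses that predictor.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "gender": "F", "zip_code": "19103"},
    {"age": None, "gender": "M", "zip_code": "19104"},
    {"age": 51, "gender": None, "zip_code": "19010"},
    {"age": 29, "gender": "F", "zip_code": None},
    {"age": 45, "gender": "M", "zip_code": "19087"},
]

def drop_incomplete(rows, variables):
    """Keep only rows where every listed variable is present."""
    return [r for r in rows if all(r[v] is not None for v in variables)]

# Option 1: drop any record missing a model variable.
complete = drop_incomplete(records, ["age", "gender", "zip_code"])

# Option 2: drop the zip_code variable, then drop incomplete records.
without_zip = drop_incomplete(records, ["age", "gender"])

print(len(records), len(complete), len(without_zip))
```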
30
Imputation
  A method to “fill in” a missing value
  Use other variables (which have values) to predict the value of the missing variable
  Involves building a model for the variable with the missing value
  Y = f(x_1, x_2, …, x_n)
31
Example: Age Variable
  About 14% of records have missing values
  Imputation will be illustrated with a simple regression model
  Age = a + b_1 X_1 + b_2 X_2 + … + b_n X_n
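Regression imputation can be sketched with a single predictor. This is an illustration under stated assumptions, not the model from the presentation: the predictor `years_licensed` and all data values are hypothetical, and only one X is used so the least-squares fit can be written out directly.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical (years_licensed, age) pairs; None marks a missing age.
records = [(2, 20), (10, 27), (20, 39), (30, 46), (15, None), (40, 58)]

# Fit the model on records where age is present.
complete = [(x, y) for x, y in records if y is not None]
a, b = fit_simple_regression(*zip(*complete))

# Fill in missing ages with the fitted value; leave observed ages alone.
imputed = [(x, y if y is not None else a + b * x) for x, y in records]
print(f"Age = {a:.2f} + {b:.3f} * years_licensed")
print(imputed)
```

Imputed values sit exactly on the regression line, so they understate the natural variability of the variable; that limitation applies to any single-value regression imputation.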
32
Model for Age
33
Censorship Problem
  Property and casualty insurance data is typically censored
  We do not know the final settlement value for claims that are still open
  Adjustments must be made to avoid erroneous models
    Use ultimates
    Mix adjust
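The "use ultimates" adjustment can be sketched as follows. The age-to-ultimate factors and loss amounts are hypothetical, chosen only to show the effect: recent accident years look cheaper on a reported basis simply because they are less mature.

```python
# Hypothetical age-to-ultimate development factors by maturity in months.
AGE_TO_ULTIMATE = {12: 2.00, 24: 1.40, 36: 1.15, 48: 1.05}

# Hypothetical reported losses by accident year as of one evaluation date.
accident_years = [
    {"year": 2001, "maturity_months": 48, "reported": 1_000_000},
    {"year": 2002, "maturity_months": 36, "reported": 900_000},
    {"year": 2003, "maturity_months": 24, "reported": 750_000},
    {"year": 2004, "maturity_months": 12, "reported": 520_000},
]

for row in accident_years:
    factor = AGE_TO_ULTIMATE[row["maturity_months"]]
    # Round to whole currency units to avoid floating-point noise.
    row["ultimate"] = round(row["reported"] * factor)
    print(row["year"], row["reported"], row["ultimate"])
```

Reported losses fall by almost half from 2001 to 2004, while the estimated ultimates are roughly level. A model fitted to the unadjusted amounts would mistake immaturity for a real decline.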
34
Example From Ignoring Censorship
35
Metadata
  Data about data
  Detailed description of the variables in the file, their meaning and permissible values
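Once permissible values are written down, records can be checked against them automatically. A minimal sketch, assuming a hypothetical metadata dictionary (`METADATA`) with made-up ranges and code lists:

```python
# Hypothetical metadata: expected type plus permissible range or codes.
METADATA = {
    "age": {"type": int, "min": 16, "max": 110},
    "gender": {"type": str, "allowed": {"F", "M"}},
    "marital_status": {"type": str, "allowed": {"M", "S", "D", "W"}},
}

def validate(record, metadata):
    """Return a list of problems found in one record (empty if clean)."""
    problems = []
    for field, rule in metadata.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
        elif "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: {value!r} not permitted")
        elif "min" in rule and not rule["min"] <= value <= rule["max"]:
            problems.append(f"{field}: {value} out of range")
    return problems

clean = {"age": 34, "gender": "F", "marital_status": "M"}
dirty = {"age": 210, "gender": "F", "marital_status": "X"}

print(validate(clean, METADATA))
print(validate(dirty, METADATA))
```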
36
Conclusions
  Data quality is a significant problem in insurance and in other industries
  Statistical methods can be used to detect and remediate data quality problems
  How do we get better data?
37
Conclusions
  “In the end, the best defense is relentless monitoring of data and metadata”
  Dasu and Johnson, Exploratory Data Mining and Data Cleaning