1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Francis Analytics and.

Slides:



Advertisements
Similar presentations
UNIT – 1 Data Preprocessing
Advertisements

Chapter 2 Exploring Data with Graphs and Numerical Summaries
© Tan,Steinbach, Kumar Introduction to Data Mining 8/05/ Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan,
Business Statistics: A Decision-Making Approach, 7e © 2008 Prentice-Hall, Inc. Chap 3-1 Business Statistics: A Decision-Making Approach 7 th Edition Chapter.
Distinguishing the Forest from the Trees University of Texas November 11, 2009 Richard Derrig, PhD, Opal Consulting Louise Francis,
Slide Slide 1 Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley. Created by Tom Wegleitner, Centreville, Virginia Section 3-5.
1 Economics 240A Power One. 2 Outline w Course Organization w Course Overview w Resources for Studying.
Chapter 2 Graphs, Charts, and Tables – Describing Your Data
Chapter 2 Describing Data Sets
Data Mining – Best Practices CAS 2008 Spring Meeting Quebec City, Canada Louise Francis, FCAS, MAAA
1 Economics 240A Power One. 2 Outline w Course Organization w Course Overview w Resources for Studying.
Data Tutorial Tutorial on Types of Graphs Used for Data Analysis, Along with How to Enter Them in MS Excel Carryn Bellomo University of Nevada, Las Vegas.
EFFECTIVE PREDICTIVE MODELING- DATA,ANALYTICS AND PRACTICE MANAGEMENT Richard A. Derrig Ph.D. OPAL Consulting LLC Karthik Balakrishnan Ph.D. ISO Innovative.
5 Number Summary Box Plots. The five-number summary is the collection of The smallest value The first quartile (Q 1 or P 25 ) The median (M or Q 2 or.
1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS 2007 Ratemaking Seminar Louise Francis, FCAS Francis Analytics and.
Overview DM for Business Intelligence.
© 2005 The McGraw-Hill Companies, Inc., All Rights Reserved. Chapter 12 Describing Data.
Exploratory Data Analysis. Height and Weight 1.Data checking, identifying problems and characteristics Data exploration and Statistical analysis.
Exploratory Data Analysis Statistics Introduction If you are going to find out anything about a data set you must first understand the data Basically.
Exploratory Data Analysis. Computing Science, University of Aberdeen2 Introduction Applying data mining (InfoVis as well) techniques requires gaining.
Outline 1. Histograms and boxplots 2. Mean and standard deviation 3. Proportions and bar charts 4. Sampling and allocation 5. Inference and confidence.
Statistics 3502/6304 Prof. Eric A. Suess Chapter 3.
First Quantitative Variable: Ear Length  The unit of measurement for this variable is INCHES.  A few possible values for this first quantitative variable.
Free and Cheap Sources of External Data CAS 2007 Predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc.
Copyright © Cengage Learning. All rights reserved. 2 Organizing Data.
Copyright © 2011 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill/Irwin Describing Data.
Exploratory Data Analysis
6-1 Numerical Summaries Definition: Sample Mean.
1 Data Quality: GIRO Working Party Paper & Methods to Detect Data Glitches CAS Annual Meeting Louise Francis Francis Analytics and Actuarial Data Mining,
Business Statistics: A Decision-Making Approach, 6e © 2005 Prentice-Hall, Inc. Chap 2-1 Business Statistics: A Decision-Making Approach 6 th Edition Chapter.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010.
3-5: Exploratory Data Analysis  Exploratory Data Analysis (EDA) data can be organized using a stem and leaf (as opposed to a frequency distribution) 
1 Further Maths Chapter 2 Summarising Numerical Data.
Chap 2-1 A Course In Business Statistics, 4th © 2006 Prentice-Hall, Inc. A Course in Business Statistics 4 th Edition Chapter 2 Graphs, Charts, and Tables.
Copyright © 2011 Deloitte Development LLC. All rights reserved. Data Visualization for Data QC March, 2012 Steve Berman Deloitte Consulting.
Data Preprocessing Dr. Bernard Chen Ph.D. University of Central Arkansas.
Descriptive statistics Petter Mostad Goal: Reduce data amount, keep ”information” Two uses: Data exploration: What you do for yourself when.
MMSI – SATURDAY SESSION with Mr. Flynn. Describing patterns and departures from patterns (20%–30% of exam) Exploratory analysis of data makes use of graphical.
Exploratory Data Analysis Exploratory Data Analysis Dr.Lutz Hamel Dr.Joan Peckham Venkat Surapaneni.
Dimension Reduction in Workers Compensation CAS predictive Modeling Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc.
Chapter 3, Part B Descriptive Statistics: Numerical Measures n Measures of Distribution Shape, Relative Location, and Detecting Outliers n Exploratory.
Predictive Modeling Spring 2005 CAMAR meeting Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc
Data Preprocessing Compiled By: Umair Yaqub Lecturer Govt. Murray College Sialkot.
Texas Algebra I Unit 3: Probability/Statistics Lesson 28: Box and Whiskers plots.
Copyright © 2010 Pearson Education, Inc. Publishing as Prentice Hall2(2)-1 Chapter 2: Displaying and Summarizing Data Part 2: Descriptive Statistics.
Eco 6380 Predictive Analytics For Economists Spring 2016 Professor Tom Fomby Department of Economics SMU.
Dancing With Dirty Data: Methods for Exploring and Cleaning Data 2005 CAS Ratemaking Seminar Louise Francis, FCAS, MAAA Francis Analytics and Actuarial.
Preparing to collect data. Make sure you have your materials Surveys –All surveys should have a unique numerical identifier on each page –You can write.
Introduction Exploring Categorical Variables Exploring Numerical Variables Exploring Categorical/Numerical Variables Selecting Interesting Subsets of Data.
Distinguishing the Forest from the Trees 2006 CAS Ratemaking Seminar Richard Derrig, PhD, Opal Consulting Louise Francis, FCAS,
Class Two Before Class Two Chapter 8: 34, 36, 38, 44, 46 Chapter 9: 28, 48 Chapter 10: 32, 36 Read Chapters 1 & 2 For Class Three: Chapter 1: 24, 30, 32,
Chapter 16: Exploratory data analysis: numerical summaries
Exploring, Displaying, and Examining Data
Data Mining CAS 2004 Ratemaking Seminar Philadelphia, Pa.
Data Mining: Concepts and Techniques
Fundamentals of Probability and Statistics
STATISTICS ELEMENTARY MARIO F. TRIOLA
Bar graphs are used to compare things between different groups
Topic 5: Exploring Quantitative data
Unit 3: Statistics Final Exam Review.
Unit 2: Statistics Final Exam Review.
CS3332(01) Course Description
THE STAGES FOR STATISTICAL THINKING ARE:
Data Quality Data Exploration
EECS3030(02) Course Description
Lecture 1: Descriptive Statistics and Exploratory
(-4)*(-7)= Agenda Bell Ringer Bell Ringer
Probability and Statistics
Dot Plot A _________ is a type of graphic display used to compare frequency counts within categories or groups using dots. __________ help you see how.
Presentation transcript:

1 Data Preparation Part 1: Exploratory Data Analysis & Data Cleaning, Missing Data CAS Predictive Modeling Seminar Louise Francis Francis Analytics and Actuarial Data Mining, Inc.

2 Objectives Introduce data preparation and where it fits in in modeling process Discuss Data Quality Focus on a key part of data preparation Exploratory data analysis Identify data glitches and errors Understanding the data Identify possible transformations What to do about missing data Provide resources on data preparation

3 CRISP-DM Guidelines for data mining projects Gives overview of life cycle of data mining project Defines different phases and activities that take place in phase

4 Modelling Process

5 Data Preprocessing

6 Data Quality Problem

7 Data Quality: A Problem Actuary reviewing a database

8 May’s Law

9 It’s Not Just Us “In just about any organization, the state of information quality is at the same low level” Olson, Data Quality

10 Some Consequences of poor data quality Affects quality (precision) of result Can’t do modeling project because of data problems If errors not found – modeling blunder

11 Data Exploration in Predictive Modeling

12 Exploratory Data Analysis Typically the first step in analyzing data Makes heavy use of graphical techniques Also makes use of simple descriptive statistics Purpose Find outliers (and errors) Explore structure of the data

13 Definition of EDA Exploratory data analysis (EDA) is that part of statistical practice concerned with reviewing, communicating and using data where there is a low level of knowledge about its cause system.. Many EDA techniques have been adopted into data mining and are being taught to young students as a way to introduce them to statistical thinking.data analysisstatistical practicedatacause systemdata mining -

14 Example Data Private passenger auto Some variables are: Age Gender Marital status Zip code Earned premium Number of claims Incurred losses Paid losses

15 Some Methods for Numeric Data Visual Histograms Box and Whisker Plots Stem and Leaf Plots Statistical Descriptive statistics Data spheres

16 Histograms Can do them in Microsoft Excel

17 Histograms Frequencies for Age Variable

18 Histograms of Age Variable Varying Window Size

19 Formula for Window Width

20 Example of Suspicious Value

21 Discrete-Numeric Data

22 Filtered Data Filter out Unwanted Records

23 Box Plot Basics: Five – Point Summary Minimum 1 st quartile Median 2 nd quartile Maximum

24 Functions for five point summary =min(data range) =quartile(data range1) =median(data range) =quartile(data range,3) =max(data range)

25 Box and Whisker Plot

26 Plot of Heavy Tailed Data Paid Losses

27 Heavy Tailed Data – Log Scale

28 Box and Whisker Example

29 Descriptive Statistics Analysis ToolPak

30 Descriptive Statistics Claimant age has minimum and maximums that are impossible

31 Data Spheres: The Mahalanobis Distance Statistic

32 Screening Many Variables at Once Plot of Longitude and Latitude of zip codes in data Examination of outliers indicated drivers in Ca and PR even though policies only in one mid-Atlantic state

33 Records With Unusual Values Flagged

34 Categorical Data: Data Cubes

35 Categorical Data Data Cubes Usually frequency tables Search for missing values coded as blanks

36 Categorical Data Table highlights inconsistent coding of marital status

37 Missing Data

38 Screening for Missing Data

39 Blanks as Missing

40 Types of Missing Values Missing completely at random Missing at random Informative missing

41 Methods for Missing Values Drop record if any variable used in model is missing Drop variable Data Imputation Other CART, MARS use surrogate variables Expectation Maximization

42 Imputation A method to “fill in” missing value Use other variables (which have values) to predict value on missing variable Involves building a model for variable with missing value Y = f(x 1,x 2,…x n )

43 Example: Age Variable About 14% of records missing values Imputation will be illustrated with simple regression model Age = a+b 1 X 1 +b 2 X 2 …b n X n

44 Model for Age

45 Missing Values A problem for many traditional statistical models Elimination of records missing on anything from analysis Many data mining procedures have techniques built in for handling missing values If too many records missing on a given variable, probably need to discard variable

46 Metadata

47 Metadata Data about data A reference that can be used in future modeling projects Detailed description of the variables in the file, their meaning and permissible values

48 Library for Getting Started Dasu and Johnson, Exploratory Data Mining and Data Cleaning, Wiley, 2003 Francis, L.A., “Dancing with Dirty Data: Methods for Exploring and Claeaning Data”, CAS Winter Forum, March 2005, Find a comprehensive book for doing analysis in Excel such as: John Walkebach, Excel 2003 Formulas or Jospeh Schmuller, Statistical Analysis With Excel for Dummies If you use R, get a book like: Fox, John, An R and S-PLUS Companion to Applied Regression, Sage Publications, 2002