Data Quality Data Exploration

Slides:



Advertisements
Similar presentations
Chapter Four Making and Describing Graphs of Quantitative Variables
Advertisements

Very simple to create with each dot representing a data value. Best for non continuous data but can be made for and quantitative data 2004 US Womens Soccer.
Chapter 2 Exploring Data with Graphs and Numerical Summaries
Lesson Describing Distributions with Numbers parts from Mr. Molesky’s Statmonkey website.
Section 2.1 Visualizing Distributions: Shape, Center and Spread.
Introduction to Summary Statistics
Introduction to Summary Statistics
DENSITY CURVES and NORMAL DISTRIBUTIONS. The histogram displays the Grade equivalent vocabulary scores for 7 th graders on the Iowa Test of Basic Skills.
Basic Statistical Concepts
Chapter 1 & 3.
Programming in R Describing Univariate and Multivariate data.
Data Handbook Chapter 4 & 5. Data A series of readings that represents a natural population parameter A series of readings that represents a natural population.
Univariate Data Chapters 1-6. UNIVARIATE DATA Categorical Data Percentages Frequency Distribution, Contingency Table, Relative Frequency Bar Charts (Always.
The Normal Distribution Chapter 6. Outline 6-1Introduction 6-2Properties of a Normal Distribution 6-3The Standard Normal Distribution 6-4Applications.
An Introduction to Statistics. Two Branches of Statistical Methods Descriptive statistics Techniques for describing data in abbreviated, symbolic fashion.
Categorical vs. Quantitative…
Displaying Quantitative Data Graphically and Describing It Numerically AP Statistics Chapters 4 & 5.
To be given to you next time: Short Project, What do students drive? AP Problems.
1 Descriptive Statistics 2-1 Overview 2-2 Summarizing Data with Frequency Tables 2-3 Pictures of Data 2-4 Measures of Center 2-5 Measures of Variation.
© 2008 Pearson Addison-Wesley. All rights reserved Chapter 6 Putting Statistics to Work.
Edpsy 511 Exploratory Data Analysis Homework 1: Due 9/19.
Statistics with TI-Nspire™ Technology Module E Lesson 1: Elementary concepts.
Data Description Chapter 3. The Focus of Chapter 3  Chapter 2 showed you how to organize and present data.  Chapter 3 will show you how to summarize.
SECONDARY MATH Normal Distribution. Graph the function on the graphing calculator Identify the x and y intercepts Identify the relative minimums.
(Unit 6) Formulas and Definitions:. Association. A connection between data values.
Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:
Unit 1 - Graphs and Distributions. Statistics 4 the science of collecting, analyzing, and drawing conclusions from data.
Interpreting Categorical and Quantitative Data. Center, Shape, Spread, and unusual occurrences When describing graphs of data, we use central tendencies.
Slide 1 Copyright © 2004 Pearson Education, Inc.  Descriptive Statistics summarize or describe the important characteristics of a known set of population.
Normal Distributions Overview. 2 Introduction So far we two types of tools for describing distributions…graphical and numerical. We also have a strategy.
Section 2.1 Visualizing Distributions: Shape, Center, and Spread.
Prof. Eric A. Suess Chapter 3
Describing Distributions
Section 2.1 Density Curves
Types of variables Discrete VS Continuous Discrete Continuous
Exploring Data: Summary Statistics and Visualizations
Advanced Quantitative Techniques
Lesson 11.1 Normal Distributions (Day 1)
Chapter 4: The Normal Distribution
Probability and the Normal Curve
Chapter 3 Describing Data Using Numerical Measures
Chapter 2: Describing Location in a Distribution
Unit 4 Statistical Analysis Data Representations
4. Interpreting sets of data
Chapter 1 & 3.
1st Semester Final Review Day 1: Exploratory Data Analysis
Distributions and Graphical Representations
Unit 1 - Graphs and Distributions
Basics of Statistics.
Chapter 3 Describing Data Using Numerical Measures
CHAPTER 3: The Normal Distributions
Statistical significance & the Normal Curve
Means & Medians Chapter 4.
Descriptive Statistics
An Introduction to Statistics
Topic 5: Exploring Quantitative data
Descriptive and inferential statistics. Confidence interval
Probability & Statistics Describing Quantitative Data
CHAPTER 5 Fundamentals of Statistics
Data Analysis and Statistical Software I Quarter: Spring 2003
Means & Medians.
The Shape of Distributions
Click the mouse button or press the Space Bar to display the answers.
Data Quality Data Exploration
Honors Statistics Review Chapters 4 - 5
CHAPTER 3: The Normal Distributions
Advanced Algebra Unit 1 Vocabulary
Types of variables. Types of variables Categorical variables or qualitative identifies basic differentiating characteristics of the population.
CHAPTER 3: The Normal Distributions
Lesson Plan Day 1 Lesson Plan Day 2 Lesson Plan Day 3
Presentation transcript:

Data Quality Data Exploration CSC 600: Data Mining Class 3

Today Data Quality Data Exploration

Data Quality Report A data quality report includes tabular reports that describe the characteristics of each feature in a dataset using standard statistical measures of central tendency and variation. In textbook, ABT refers to “Analytics Base Table” The tabular reports are accompanied by data visualizations: histogram for each continuous feature bar plot for each categorical feature also generally used for continuous features with cardinality < 10

Tabular Structure in a Data Quality Report Card = Cardinality Measures the number of distinct values present for a feature Note the differences between each table.

Case Study: ABT for Motor Insurance Claims Fraud Detection

Data Exploration: Getting to Know the Data For categorical features: Examine the mode, 2nd mode, mode %, and 2nd mode % Represent the most common levels within these features Will identify if any levels dominate the dataset. For continuous features: Examine the mean and standard deviation of each feature Get a sense of the central tendency and variation of the values Examine the minimum and maximum values to understand the range that is possible for each feature Histograms of continuous features will resemble the following well understood shapes (probability distributions) Recognizing the distribution of values for a feature will be useful when applying machine learning models

Uniform Distribution Sometimes indicative of a feature such as an ID, rather than something more interesting A uniform distribution indicates that a feature is equally likely to take a value in any of the ranges present.

Naturaly occurring phenomena (heights, weights of a randomly selected group of men, women) tend to follow a normal distribution. Normal Distribution Features following a normal distribution are characterized by a strong tendency towards a central value and symmetrical variation to either side of this. Unimodal: single peak around the central tendency

Skewed Distributions Skew is simply a tendency towards very high (right skew) or very low (left skew) values.

Exponential Distribution Examples: number of times a person has been married; number of times a person has made an insurance claim Exponential Distribution In a feature following an exponential distribution the likelihood of occurrence of a small number of low values is very high, but sharply diminishes as values increase.

Multimodal Distribution Example: measure of heights of a randomly selected group of Irish men and women Multimodal Distribution A feature characterized by a multimodal distribution has two or more very commonly occurring ranges of values that are clearly separated. Bi-modal distribution: two clear peaks “two normal distributions pushed together” Tends to occur when a feature contains a measurement made across two distinct groups

Normal Distribution The probability density function for the normal distribution (or Gaussian distribution) is x is any value μ (mu) and σ (sigma) are parameters that define the shape of the distribution the population mean and population standard deviation

Standard normal distribution: μ = 0 and σ = 1.

68-95-99.7 Rule The 68 − 95 − 99.7 rule is a useful characteristic of the normal distribution. The rule states that approximately: 68% of the observations will be within one σ of μ 95% of observations will be within two σ of μ 99.7% of observations will be within three σ of μ. Very low probability of observations occurring that differ from the mean by more than two standard deviations.

Case Study Become familiar with the central tendency and variation of each feature using the data quality report. Note bar graphs and histograms (earlier slides). Note number of levels and frequency of Injury Type. What is the type of probability distribution for each histogram? Exponential Distribution: all except Income and Fraud Flag Normal Distribution: Income (except for the 0 bar) Fraud Flag: not a typical continuous feature

Identifying Data Quality Issues A data quality issue is loosely defined as anything unusual about the data in an ABT. The most common data quality issues are: missing values Rule of thumb: remove feature is more than 60% of data is missing irregular cardinality Cardinality of 1: everything has the same value; no useful predictive information Continuous features will usually have a cardinality value close to the number of instances Investigate further if cardinality seems much lower or higher than expected outliers (invalid vs. valid) Investigate using domain knowledge Compare gap between 3rd quartile and max vs. median and 3rd quartile

Case Study (refer to earlier tables and graphs) Missing Values Remove Marital Status feature Note Income feature Irregular Cardinality No predictive information in Insurance Type Fraud Flag should be categorical feature Other valid features with very low cardinality Outliers Unusual minimum value in instance #3 Claim Amount, Total Claimed, Num Claims, Amount Received seem to have high maximum values compared to the 3rd quartile and median Locate instance in the dataset that leads to high maximum values (instance #460) Judge if it is a valid or invalid outlier

References Fundamentals of Machine Learning for Predictive Data Analytics, Kelleher et al., First Edition