
Assessing Normality and Data Transformations

Role of Normality
Many statistical methods require that the numeric variables we are working with have an approximately normal distribution. For example, t-tests, F-tests, and regression analyses all require, in some sense, that the numeric variables are approximately normally distributed.
[Figure: standardized normal distribution with empirical rule percentages.]
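The empirical rule percentages shown on the standardized normal curve can be reproduced from the normal CDF. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the helper name `within_k_sd` is made up for illustration and is not part of the original slides.

```python
from statistics import NormalDist


def within_k_sd(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


# The empirical rule refers to 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {within_k_sd(k):.4f}")
```

Because any normal variable can be standardized to a z-score, the same three probabilities apply regardless of the mean and standard deviation.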

Tools for Assessing Normality
- Histogram and boxplot
- Normal quantile plot (also called normal probability plot)
- Goodness-of-fit tests:
  - Shapiro-Wilk test (JMP)
  - Kolmogorov-Smirnov test (SPSS)
  - Anderson-Darling test (MINITAB)

Histograms and Boxplots
The cholesterol levels of the patients appear to be approximately normal, although there is some evidence of right skewness, as the mean is larger than the median. The red curve represents a normal distribution fit to these data, and the blue curve is the density estimate for these data. The two curves should agree if the data are normally distributed.

Histograms and Boxplots
The systolic volumes of the male heart patients in this study suggest that they come from a right-skewed population distribution. The red curve represents a normal distribution fit to these data, and the blue curve is the density estimated from the data, which does not agree with the fitted normal curve. Outliers are not consistent with normality.
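The mean-versus-median comparison used for the cholesterol data can be turned into a single number. The sketch below uses Pearson's median skewness coefficient, 3 × (mean − median) / sd, which is one convenient summary and not necessarily the one reported by the software behind the slides. The sample values are hypothetical, invented only to show a right-skewed case.

```python
from statistics import mean, median, stdev


def pearson_median_skewness(values):
    """Pearson's median skewness: 3 * (mean - median) / sample standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


# Hypothetical right-skewed measurements (not data from the study).
right_skewed_sample = [12, 15, 16, 18, 19, 21, 22, 25, 31, 44, 78]
# Hypothetical perfectly symmetric measurements.
symmetric_sample = [1, 2, 3, 4, 5]

# Positive when the mean exceeds the median (right skew), zero when they coincide.
print(pearson_median_skewness(right_skewed_sample))
print(pearson_median_skewness(symmetric_sample))
```

A value near zero is consistent with symmetry, but it does not by itself establish normality; the plots and tests discussed in the rest of the presentation are still needed.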

Normal Quantile Plot
A normal quantile plot compares the spacing of our data to the spacing we would expect to see if the data were approximately normal. If the data are approximately normally distributed, we should see spacing similar to that sketched on the normal curve on the right: very few observations in both tails, and increasingly more observations as we move toward the mean from either side. The spacing must also be symmetric about the mean.

Normal Quantile Plot
THE IDEAL PLOT: Here is an example where the data are perfectly normal. The plot on the right is a normal quantile plot with the data on the vertical axis and, on the horizontal axis, the z-scores we would expect if the data were normal. When the data are approximately normal, the two spacings agree, so the observations lie along the reference line in the normal quantile plot. The points should lie within the dashed lines.
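The coordinates of a normal quantile plot can be computed directly: sort the data and pair each value with the z-score expected at its rank. The sketch below is an illustration under stated assumptions. The plotting positions (i − 0.375) / (n + 0.25) are one common convention (Blom's), and JMP, SPSS, or MINITAB may use a different one; the function name and the sample values are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Return (expected z-score, observed value) pairs for a normal quantile plot."""
    n = len(data)
    standard_normal = NormalDist()
    ordered = sorted(data)
    points = []
    for i, observed in enumerate(ordered, start=1):
        # Assumed plotting position (Blom); other conventions exist.
        p = (i - 0.375) / (n + 0.25)
        points.append((standard_normal.inv_cdf(p), observed))
    return points


# Hypothetical measurements, for illustration only.
points = normal_quantile_points([5.6, 4.1, 7.3, 5.0, 6.2])
for z_expected, observed in points:
    print(f"{z_expected:+.3f}  {observed}")
```

Plotting the observed values (vertical axis) against the expected z-scores (horizontal axis) gives the orientation used on these slides.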

Normal Quantile Plot (right skewness)
The systolic volumes of the male heart patients are clearly right skewed. When the data are plotted against the expected z-scores, the normal quantile plot shows right skewness as an upward-bending curve.

Normal Quantile Plot (left skewness)
The distribution of birthweights from this study of very low birthweight infants is skewed left. When the data are plotted against the expected z-scores, the normal quantile plot shows left skewness as a downward-bending curve.

Normal Quantile Plot (leptokurtosis)
The distribution of sodium levels of patients in this right heart catheterization study has heavier tails than a normal distribution (i.e., leptokurtosis). When the data are plotted against the expected z-scores, the normal quantile plot shows an "S-shape", which indicates kurtosis.

Normal Quantile Plot (discrete data)
Although the distribution of the gestational age data of infants in the very low birthweight study is approximately normal, there is a "staircase" appearance in the normal quantile plot. This is due to the discrete coding of gestational age, which was recorded to the nearest week or half week.

Normal Quantile Plots
IMPORTANT NOTE:
If you plot DATA vs. NORMAL, as on the previous slides, then:
- downward bend = left skew
- upward bend = right skew
If you plot NORMAL vs. DATA, then:
- downward bend = right skew
- upward bend = left skew
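The dependence of the bend direction on the axis orientation can be checked numerically. The sketch below is an illustration, not part of the original slides: it uses exact quantiles of an exponential distribution as a stand-in for right-skewed data and compares the slope of the plot in its lower half with the slope in its upper half. A slope that increases from left to right is an upward bend; a slope that decreases is a downward bend.

```python
import math
from statistics import NormalDist

n = 99
probs = [i / (n + 1) for i in range(1, n + 1)]

# Expected z-scores under normality.
z_scores = [NormalDist().inv_cdf(p) for p in probs]
# Stand-in right-skewed "data": quantiles of an exponential distribution.
right_skewed = [-math.log(1 - p) for p in probs]


def slope(x, y, lo, hi):
    """Average slope of y against x between positions lo and hi."""
    return (y[hi] - y[lo]) / (x[hi] - x[lo])


mid = n // 2

# DATA (vertical) vs. NORMAL (horizontal), as on the slides.
data_vs_normal_lower = slope(z_scores, right_skewed, 0, mid)
data_vs_normal_upper = slope(z_scores, right_skewed, mid, n - 1)

# NORMAL (vertical) vs. DATA (horizontal): the axes are swapped.
normal_vs_data_lower = slope(right_skewed, z_scores, 0, mid)
normal_vs_data_upper = slope(right_skewed, z_scores, mid, n - 1)

print("data vs. normal bends upward:", data_vs_normal_upper > data_vs_normal_lower)
print("normal vs. data bends downward:", normal_vs_data_upper < normal_vs_data_lower)
```

Swapping the axes reflects the curve about the diagonal, which is why the same right-skewed data bend in opposite directions in the two orientations.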

Tests of Normality
There are several different tests that can be used to test the following hypotheses:
H0: The distribution is normal
HA: The distribution is NOT normal
Common tests of normality include:
- Shapiro-Wilk
- Kolmogorov-Smirnov
- Anderson-Darling
- Lilliefors
Problem: THEY DON'T ALWAYS AGREE!!

Tests of Normality
H0: The distribution of systolic volume is normal
HA: The distribution of systolic volume is NOT normal
Because p < .0001, we have strong evidence against normality for the systolic volume population distribution using the Shapiro-Wilk test.

Tests of Normality
H0: The distribution of systolic volume is normal
HA: The distribution of systolic volume is NOT normal
We do not have evidence at the α level against the normality of the population systolic volume distribution when using the Kolmogorov-Smirnov test from SPSS.

Tests of Normality
H0: The distribution of cholesterol level is normal
HA: The distribution of cholesterol level is NOT normal
We have no evidence against the normality of the population distribution of cholesterol levels for male heart patients (p = .2184).
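To give a sense of what a goodness-of-fit test measures, the sketch below computes the Kolmogorov-Smirnov distance: the largest vertical gap between the empirical CDF of the data and the CDF of a normal distribution fitted with the sample mean and standard deviation. This is only the test statistic, written with the standard library under stated assumptions. It does not compute a p-value, and because the mean and standard deviation are estimated from the same data, the usual Kolmogorov-Smirnov reference distribution does not apply without a correction such as Lilliefors's. The two samples are synthetic quantile grids, not data from the study.

```python
import math
from statistics import NormalDist, mean, stdev


def ks_distance(data):
    """Largest gap between the empirical CDF and a normal CDF fitted to the data."""
    ordered = sorted(data)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    gap = 0.0
    for i, x in enumerate(ordered, start=1):
        f = fitted.cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x; check both sides.
        gap = max(gap, i / n - f, f - (i - 1) / n)
    return gap


probs = [i / 100 for i in range(1, 100)]
# Synthetic, nearly normal sample: evenly spaced normal quantiles.
near_normal = [NormalDist().inv_cdf(p) for p in probs]
# Synthetic right-skewed sample: evenly spaced exponential quantiles.
right_skewed = [-math.log(1 - p) for p in probs]

print(round(ks_distance(near_normal), 3))
print(round(ks_distance(right_skewed), 3))
```

The other tests named above weigh departures differently (for example, giving more weight to the tails), which is one reason they do not always agree on the same data.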

Transformations to Improve Normality (removing skewness)
Many statistical methods require that the numeric variables you are working with have an approximately normal distribution. In reality, this is often not the case. One of the most common departures from normality is skewness, in particular right skewness.

Tukey's Ladder of Powers
Here V represents our variable of interest. We are going to consider this variable raised to a power λ, i.e. V^λ.
- UP the ladder (λ > 1): removes left skewness; the farther up, the bigger the impact.
- Middle rung: no transformation (λ = 1).
- DOWN the ladder (λ < 1): removes right skewness; the farther down, the bigger the impact.

To remove right skewness we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. V^0.5, V^0.333, log10(V) (think of V^0), V^-1, etc.
To remove left skewness we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. V^2, V^3, etc.
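The effect of moving along the ladder can be seen by applying each transformation to the same values and comparing a skewness measure. The sketch below is a hedged illustration: the data are hypothetical positive values invented to be right skewed, the rung labels and the `skewness` helper (the average cubed z-score) are choices made here, and the reciprocal is left out of the comparison because it reverses the order of the data and therefore flips the sign of the skewness.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Moment coefficient of skewness: the average cubed z-score."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Rungs of the ladder, from highest power to lowest (labels are illustrative).
LADDER = {
    "V^2": lambda v: v ** 2,
    "V (no transformation)": lambda v: v,
    "V^0.5": math.sqrt,
    "V^0.333": lambda v: v ** (1 / 3),
    "log10(V)": math.log10,
}

# Hypothetical right-skewed, strictly positive measurements.
V = [12, 15, 16, 18, 19, 21, 22, 25, 31, 44, 78]

results = {label: skewness([f(v) for v in V]) for label, f in LADDER.items()}
for label, value in results.items():
    print(f"{label:>22}: skewness = {value:.2f}")
```

For right-skewed data the skewness shrinks with each step down the ladder; in practice one stops at the rung where the histogram and normal quantile plot look acceptably normal. Note that the roots and the logarithm require positive values.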

Removing Right Skewness
Example 1: PDP-LI levels for cancer patients
On the log base 10 scale, the PDP-LI values are approximately normally distributed.

Removing Right Skewness
Example 2: Systolic Volume for Male Heart Patients
[Figure: distributions of sysvol, sysvol^0.5, sysvol^0.333, log10(sysvol), and 1/sysvol.]

Removing Right Skewness
Example 2: Systolic Volume for Male Heart Patients (1/sysvol)
The reciprocal of systolic volume is approximately normally distributed, and the Shapiro-Wilk test provides no evidence against normality (p = .5340).
CAUTION: The reciprocal transformation reverses the order of the data: the largest value becomes the smallest and the smallest becomes the largest. The units after transformation may or may not make sense; e.g. if the original units are mg/ml, then after transformation they are ml/mg.
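The order reversal described in the caution is easy to confirm on a few numbers. The values below are hypothetical concentrations chosen for illustration, not systolic volumes from the study.

```python
# Hypothetical positive measurements in mg/ml, listed from smallest to largest.
values = [20.0, 50.0, 125.0]

# After the reciprocal transformation the units become ml/mg.
reciprocals = [1 / v for v in values]

# The original largest value now has the smallest reciprocal, and vice versa.
largest_original = values.index(max(values))
smallest_reciprocal = reciprocals.index(min(reciprocals))
print(reciprocals)
print(largest_original == smallest_reciprocal)
```

When the original ordering matters for interpretation, a common workaround is to use the negative reciprocal, −1/V, which has the same shape but preserves the order.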