Descriptive Statistics for Numeric Variables. Types of measures: measures of location, measures of spread, measures of shape, and measures of relative standing.



What to describe? What is the “location” or “center” of the data? (“measures of location”) How do the data vary? (“measures of variability”)

Measures of Location Mean Median Mode

Mean: Another name for the average. If describing a population, it is denoted μ, the Greek letter “mu” (a PARAMETER). If describing a sample, it is denoted x̄, called “x-bar” (a STATISTIC). Appropriate for describing measurement data. Seriously affected by unusual values called “outliers”.

Calculating the Sample Mean. Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. That is, add up all of the data points and divide by the number of data points. Example (# ER arrivals in 1 hr, n = 5): sample mean = (sum of the 5 values)/5 = 3.6 arrivals.

Median Another name for 50th percentile. Appropriate for describing measurement data. “Robust to outliers,” that is, not affected much by unusual values.

Calculating the Sample Median (odd n). Order the data from smallest to largest. If there is an odd number of data points, the median is the middle value of the ordered data (example: # ER arrivals in 1 hr.).

Calculating the Sample Median (even n). Order the data from smallest to largest. If there is an even number of data points, the median is the average of the two middle values. Example (# ER arrivals in 1 hr.): the two middle values of the ordered data are 3 and 4, so median = (3+4)/2 = 3.5.

Mode: The value that occurs most frequently. One data set can have many modes. Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
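The three measures of location above can be sketched with Python's standard `statistics` module. The arrival counts here are hypothetical (the slide's actual data values are not preserved in the transcript); they were chosen only so that the results agree with the figures quoted on the slides.

```python
from statistics import mean, median, multimode

# Hypothetical ER arrival counts (NOT the slide's data);
# picked so the results match the values reported on the slides.
arrivals_odd = [2, 3, 3, 4, 6]       # n = 5, odd number of points
arrivals_even = [2, 3, 3, 4, 5, 6]   # n = 6, even number of points

sample_mean = mean(arrivals_odd)      # sum of values divided by n
median_odd = median(arrivals_odd)     # middle value of the ordered data
median_even = median(arrivals_even)   # average of the two middle values
modes = multimode(arrivals_odd)       # list of all most-frequent values

print(sample_mean, median_odd, median_even, modes)
```

`multimode` returns a list because, as noted above, one data set can have more than one mode.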

In JMP: Heart Attack Data. Select Analyze → Distribution (JMP Demo).

In JMP: Heart Attack Data Sample size n = 45 (don’t use N)

The most appropriate measure of location depends on … the shape of the data’s distribution. e.g.

Most appropriate measure of location Depends on whether or not data are “symmetric” or “skewed”. Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

The mean and the median are approximately the same as this distribution is nearly symmetric. Slight right skewness – see measures of shape.

Heights of College Students - Symmetric and Bimodal

Summary statistics table for heights (numeric values not preserved). Columns: n, Mean, Median, StdDev, SE Mean, Min, Max, Q1, Q3. Rows: Males, Females, All.

Heights of College Students - Symmetric and Bimodal. Figure annotations: mean height for females, mean height for males, overall mean.

Systolic Volume for Heart Attack Patients - Skewed Right. The sample mean (79.42) is substantially larger than the sample median (67.00), so the median is the “better” measure of average. The skewness statistic is > 1, suggesting pronounced right skewness (see measures of shape).

Time Until Outcome for Heart Attack Patients - Skewed Left. The sample mean (112.4) is substantially smaller than the sample median (138.00), so the median is the “better” measure of average. The skewness statistic is < -1, suggesting pronounced left skewness (see measures of shape).

Choosing Appropriate Measure of Location. If data are symmetric and unimodal, the mean, median, and mode will be approximately the same. If data are multimodal, report the mean, median and/or mode for each subgroup. If data are skewed, report the median.
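To illustrate why the median is preferred for skewed data, here is a small sketch using a made-up right-skewed sample (the values are hypothetical, not taken from the heart attack data).

```python
from statistics import mean, median

# Hypothetical right-skewed sample: most values are moderate, one is very large.
values = [50, 55, 60, 65, 70, 75, 300]

print(mean(values))    # pulled upward by the single large value
print(median(values))  # stays at the middle of the ordered data

# Dropping the extreme value shifts the mean a lot but the median only slightly.
trimmed = values[:-1]
print(mean(trimmed), median(trimmed))
```

With the extreme value removed the sample is symmetric, and the mean and median coincide.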

Measures of Variability Range Interquartile range (IQR) Variance and standard deviation Coefficient of variation (CV)

Range The difference between largest and smallest data point. Highly affected by outliers. Best for symmetric data with no outliers.

Cholesterol Level of Heart Attack Patients - Symmetric and Unimodal (approx.)

Max. = 93 (mmoles/l) Min. = 38 (mmoles/l) Range = 93 – 38 = 55 (mmoles/l)

Interquartile range: The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile), IQR = Q3 - Q1. It is the range spanned by the “middle half” of the values. Robust to outliers or extreme observations. Works well for skewed data.

Systolic Volume for Heart Attack Patients - Skewed Right. Q3 = …, Q1 = …, so IQR = Q3 - Q1 = 40.0. The range of the middle 50% of systolic volumes is 40 mmoles/l.
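The contrast between the range and the IQR can be sketched as follows. The data are hypothetical, and the quartile rule used (`method="inclusive"` in Python's `statistics.quantiles`) is an assumption; JMP and other packages may use a slightly different interpolation rule and so report slightly different quartiles.

```python
from statistics import quantiles

def spread(data):
    """Return (range, IQR) for a list of numbers."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return max(data) - min(data), q3 - q1

# Hypothetical samples: identical except for one extreme value.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(spread(clean))         # range and IQR without the outlier
print(spread(with_outlier))  # the range balloons, the IQR does not change
```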

Variance: If measuring the variance of a population, denoted by σ² (“sigma-squared”). If measuring the variance of a sample, denoted by s² (“s-squared”). Measures the average squared deviation of data points from their mean. Highly affected by outliers. Best for symmetric data. The problem is that the units are squared.

Formula for the Sample Variance (s²): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. This is nearly (if not for the n - 1 in the denominator) the average squared deviation from the sample mean for our observed data.

Standard deviation Sample standard deviation is square root of sample variance, and so is denoted by s. Units are the original units. Measures “average” deviation of data points from their mean. Also, highly affected by outliers.
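A minimal sketch of the sample variance and standard deviation, computed directly from the definition (divisor n - 1) and cross-checked against Python's `statistics` module. The data values are hypothetical.

```python
from math import sqrt
from statistics import stdev, variance

data = [2, 3, 3, 4, 6]  # hypothetical sample
n = len(data)
x_bar = sum(data) / n

# Sum of squared deviations from the sample mean, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)
s = sqrt(s2)  # back in the original units

print(s2, s)
print(variance(data), stdev(data))  # library versions use the same n - 1 divisor
```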

Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers. What differences in the distribution of time to fall asleep do we see when comparing the smokers to the non-smokers in this study? 1) Typical time to fall asleep is … minutes for both populations. 2) IQR for smokers is more than twice that for non-smokers. 3) Distribution for non-smokers is approx. normal, not so for smokers.

Sleep Study: Comparing Time to Fall Asleep of Smokers vs. Non-smokers. Smokers: s = 3.69 minutes, IQR = 7.05 minutes. Non-smokers: s = 2.28 minutes, IQR = 3.00 minutes. Both measures of spread are larger for smokers.

Empirical Rule – The standard deviation and the normal distribution. For unimodal, roughly symmetric data sets (i.e., approximately normally distributed data), approximately 68% of observations lie within 1 standard deviation of the mean and approximately 95% of observations lie within 2 standard deviations of the mean.

The Empirical Rule (figure: normal curve with the horizontal axis marked at x̄ - 3s, x̄ - 2s, x̄ - s, x̄, x̄ + s, x̄ + 2s, x̄ + 3s). 68% of observations are within 1 standard deviation of the mean (34% on each side); 95% are within 2 standard deviations (a further 13.5% on each side); 99.7% are within 3 standard deviations (a further 2.4% on each side), leaving about 0.1% in each tail.
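The percentages in the Empirical Rule can be checked against the exact normal distribution. This sketch uses the standard identity P(|Z| < k) = erf(k/√2) for a standard normal Z; the function name `within_k_sd` is just an illustrative choice.

```python
from math import erf, sqrt

def within_k_sd(k):
    """Probability that a normal observation lies within k SDs of its mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    # Prints the percentage of a normal distribution within k standard deviations.
    print(k, round(100 * within_k_sd(k), 2))
```

The exact values are slightly different from the rounded 68%, 95%, and 99.7%, which is why the rule is stated as approximate.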

Application of Empirical Rule – Medical Lab Tests. When you have blood drawn and it is screened for different chemical levels, any result more than two standard deviations below or more than two standard deviations above the mean for healthy individuals will get flagged as abnormal. Example: For potassium, healthy individuals have a mean level of 4.4 meq/l with an SD of 0.45 meq/l. Individuals with levels outside the range 4.4 - 2(0.45) to 4.4 + 2(0.45), i.e., 3.5 meq/l to 5.3 meq/l, would be flagged as having abnormal potassium.
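The flagging rule in the potassium example can be sketched as a small function. The mean and SD come from the slide; the two patient values tested (5.6 and 4.0 meq/l) and the function names are hypothetical.

```python
def reference_range(mean, sd, k=2):
    """Return (low, high) = mean -/+ k standard deviations."""
    return mean - k * sd, mean + k * sd

def is_flagged(value, mean, sd, k=2):
    """True if value falls outside the k-SD reference range."""
    low, high = reference_range(mean, sd, k)
    return value < low or value > high

# Potassium for healthy individuals: mean 4.4 meq/l, SD 0.45 meq/l (from the slide).
low, high = reference_range(4.4, 0.45)
print(round(low, 2), round(high, 2))

# Hypothetical patient results.
print(is_flagged(5.6, 4.4, 0.45))  # above the upper limit
print(is_flagged(4.0, 4.4, 0.45))  # inside the range
```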

Coefficient of Variation (CV) Ratio of sample standard deviation to sample mean multiplied by 100. Measures relative variability, that is, variability relative to the magnitude of the data. Unitless, so good for comparing variation between two groups and for comparing variability of measurements in completely different scales and/or units.

Heart Attack Data: Which volume measure has more variation, systolic or diastolic? SYSVOL CV = (39.95/79.42) × 100 = 50.3%. DIAVOL CV = (48.79/…) × 100 = 30.7%. Thus systolic volume has the greater variation in our sample on the basis of the CV.
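The systolic CV above can be reproduced in a couple of lines. The standard deviation and mean are the values quoted on the slide; the function name is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """CV as a percentage: 100 * standard deviation / mean."""
    return 100 * sd / mean

# Systolic volume: s = 39.95, mean = 79.42 (values from the slide).
sysvol_cv = coefficient_of_variation(39.95, 79.42)
print(round(sysvol_cv, 1))
```

Because the CV divides by the mean, it is only meaningful for variables measured on a scale with a true zero and a positive mean.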

The most appropriate measure of variability depends on … the shape of the data’s distribution.

Choosing Appropriate Measure of Variability. If data are symmetric, with no serious outliers, use the range and standard deviation. If data are skewed and/or have serious outliers, use the IQR. If comparing variation across two variables, use the coefficient of variation if the variables are in different units and/or scales. If the scales and units are roughly the same, direct comparison of the standard deviations is fine.

Measures of Shape – Skewness and Kurtosis Statistical software packages will give some measure of skewness and kurtosis for a given numeric variable. Skewness measures departure from symmetry and is usually characterized as being left or right skewed as seen previously. Kurtosis measures “peakedness” of a distribution and comes in two forms, platykurtosis and leptokurtosis.

Skewness. Pearson’s Skewness Coefficient: if skewness exceeds +.20, severe right skewness. Fisher’s Measure of Skewness has a complicated formula, but most software packages compute it. Fisher’s Skewness > 1.00: moderate right skewness; > 2.00: severe right skewness. Fisher’s Skewness < …: moderate left skewness; < …: severe left skewness.

Skewness. Skewness = …, suggesting slight left skewness. Skewness = …, suggesting strong right skewness.

Kurtosis: Measures peakedness of a distribution. On the scale used here (excess kurtosis), a normal distribution has Kurtosis = 0. Leptokurtotic distributions are more peaked than normal with fatter tails (Kurtosis > 0). Platykurtotic distributions are less peaked than normal, like a squashed normal (Kurtosis < 0).
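As a rough sketch of what software computes, skewness and excess kurtosis can be written in terms of central moments. This is the simple moment-based version; it is an assumption that it differs only by small-sample adjustments from the statistics JMP reports, so values will not match software output exactly. The data are hypothetical.

```python
def central_moment(data, k):
    """Average of (x - mean)^k over the data (divisor n)."""
    n = len(data)
    x_bar = sum(data) / n
    return sum((x - x_bar) ** k for x in data) / n

def skewness(data):
    """Moment-based skewness: 0 for perfectly symmetric data."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]         # hypothetical flat, symmetric sample
right_skewed = [1, 1, 2, 2, 3, 10]  # hypothetical sample with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))  # zero skew, negative kurtosis
print(skewness(right_skewed))                           # positive skew
```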

Kurtosis Example 1: Blood pH levels for subjects in right heart catheter study. Here we see a slightly left skewed (-1.22) but markedly leptokurtotic (3.49) distribution. The reference normal curve has been added, and the blue curve is the density estimate from the data.

Example 2: Kurtosis Times to fall asleep for non-smokers are approx. normal as both skewness and kurtosis are close to 0. Times to fall asleep for smokers are fairly platykurtotic. Kurtosis = -1.50

Transformations to Improve Normality (removing skewness). Many statistical methods require that the numeric variables you are working with have an approximately normal distribution. In reality, this is often not the case. One of the most common departures from normality is skewness, in particular, right skewness.

Tukey’s Ladder of Powers. Here V represents our variable of interest. We are going to consider this variable raised to a power λ, i.e. V^λ. The middle rung of the ladder is no transformation (λ = 1). We go up the ladder (λ > 1) to remove left skewness and down the ladder (λ < 1) to remove right skewness. The farther a rung is from the middle, the bigger its impact.

To remove right skewness we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. V^0.5, V^0.333, V^0 (which stands for the logarithm on the ladder), V^-1, etc. To remove left skewness we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. V^2, V^3, etc.

Removing Right Skewness. Example: PDP-LI levels for cancer patients. In the log base 10 scale, the PDP-LI values are approximately normally distributed.
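A minimal sketch of moving down the ladder of powers to remove right skewness. The data are hypothetical (not the PDP-LI values), the helper names are illustrative, and the skewness measure is the simple moment-based version rather than whatever a given software package reports.

```python
from math import log10

def ladder(values, power):
    """Apply a ladder-of-powers transformation; power 0 stands for log10."""
    if power == 0:
        return [log10(v) for v in values]
    return [v ** power for v in values]

def skewness(data):
    """Moment-based skewness (positive = right skewed)."""
    n = len(data)
    x_bar = sum(data) / n
    m2 = sum((x - x_bar) ** 2 for x in data) / n
    m3 = sum((x - x_bar) ** 3 for x in data) / n
    return m3 / m2 ** 1.5

# Hypothetical, strongly right-skewed positive values.
raw = [1, 10, 100, 1000, 10000]

rooted = ladder(raw, 0.5)  # one rung down: square root
logged = ladder(raw, 0)    # further down: log base 10

# Skewness shrinks as we move down the ladder.
print(skewness(raw), skewness(rooted), skewness(logged))
```

The logarithm and reciprocal rungs require strictly positive values, so data containing zeros or negatives would need to be shifted first.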