1
Descriptive Statistics: Part II
Each slide has its own narration in an audio file. For the explanation of any slide, click on the audio icon to start it.
Professor Friedman's Statistics Course by H & L Friedman is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.
2
Shape
A third important property of data, after location and dispersion, is its shape. Shape can be described by degree of asymmetry (i.e., skewness).
mean > median: positive or right-skewness
mean = median: symmetric or zero-skewness
mean < median: negative or left-skewness
Positive skewness can arise when the mean is increased by some unusually high values.
Negative skewness can arise when the mean is decreased by some unusually low values.
Descriptive Statistics II
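The mean-versus-median rule above can be tried out in a few lines of Python. This is only a sketch: the function name skew_direction is made up for illustration, and the two data sets are borrowed from later slides in this deck (the calculus exam scores and the small Z-score example).

```python
from statistics import mean, median

def skew_direction(data):
    """Classify shape by comparing the mean with the median."""
    m, med = mean(data), median(data)
    if m > med:
        return "right-skewed (positive)"
    if m < med:
        return "left-skewed (negative)"
    return "symmetric (zero skewness)"

# Calculus exam scores from the MS Excel example (mean 38.94, median 30)
exam_scores = [0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95]
# Data from the Z-score example (mean = median = 5)
even_numbers = [0, 2, 4, 6, 8, 10]

print(skew_direction(exam_scores))
print(skew_direction(even_numbers))
```

Comparing the mean with the median is a quick rule of thumb only; a skewness coefficient gives a more careful answer.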
3
Skewness
[Figure: three example distributions, labeled Left skewed, Right skewed, and Symmetric.]
Source: Levine et al., Business Statistics, Pearson, 2013.
4
Example: # hours to complete a task
Data (for n = 12 employees): [data values not preserved in this transcript]
Callout on the slowest employee: "This guy took a VERY long time!"
$\bar{X}$ = 180/12 = 15 hours
Median = 10 hours
The (extremely slow) employee who took 63 hours to complete the task skewed the entire distribution to the right.
$s^2$ = 2868/11 = 260.73
s = 16.15 hours
CV = 107.7% (a high value!)
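The arithmetic on this slide can be retraced from the three totals it gives (n = 12, sum = 180, sum of squared deviations = 2868). The sketch below assumes the usual sample formulas, with n - 1 in the denominator of the variance; the variable names are illustrative.

```python
import math

n = 12              # number of employees (from the slide)
total_hours = 180   # sum of the 12 task times (from the slide)
sum_sq_dev = 2868   # sum of squared deviations from the mean (from the slide)

mean_hours = total_hours / n             # 15 hours
variance = sum_sq_dev / (n - 1)          # sample variance s^2
std_dev = math.sqrt(variance)            # sample standard deviation s
cv_percent = std_dev / mean_hours * 100  # coefficient of variation, in percent

print(f"mean = {mean_hours:.0f}, s^2 = {variance:.2f}, "
      f"s = {std_dev:.2f}, CV = {cv_percent:.1f}%")
```

Computed without intermediate rounding, the CV comes out a shade below the 107.7% shown on the slide, which seems to divide the rounded s = 16.15 by 15.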
5
Example Using MS Excel
Scores of 17 students on a national calculus exam.
Data: 0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95
Open MS Excel. Go to Data Analysis > Analysis Tools > Descriptive Statistics.
If you do not have Data Analysis > Analysis Tools, you have to use the Add-in feature and add it to MS Excel.
Make sure to check the Summary Statistics box once you are in Descriptive Statistics.
See MS Excel output on next slide.
6
Using MS Excel
From the output:
mean is 38.94
median is 30
mode is 0
standard deviation is 33.44
variance is 1118.43
skewness is .78 (positive)
range is 95
n is 17
MS Excel uses a formula, the Pearson Coefficient of Skewness, to calculate skewness. You do not have to know the formula. If the coefficient is 0 or very close to it, you have a symmetric distribution.
TI: STAT>Edit>enter into L1, then STAT>Calc>1-Var Stat
Rcmdr>Statistics>Summary>Numerical Summaries>Click on Statistics and Choose
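For readers without Excel, the same summary can be reproduced with Python's standard library. The skewness line is a sketch that assumes Excel reports the adjusted sample form of the moment coefficient of skewness (the formula behind its SKEW worksheet function); on these 17 scores that formula does give the .78 shown on the slide.

```python
from statistics import mean, median, mode, stdev, variance

scores = [0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95]

n = len(scores)
x_bar = mean(scores)
s = stdev(scores)        # sample standard deviation (n - 1 in the denominator)
s2 = variance(scores)    # sample variance
data_range = max(scores) - min(scores)

# Adjusted sample skewness: n / ((n-1)(n-2)) * sum of cubed Z-scores.
# Assumed here to be the formula Excel applies.
skewness = n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in scores)

print(f"n = {n}, mean = {x_bar:.2f}, median = {median(scores)}, mode = {mode(scores)}")
print(f"s = {s:.2f}, variance = {s2:.2f}, range = {data_range}, skewness = {skewness:.2f}")
```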
7
Standardizing Data: Z-Scores
We can convert the original scores to new scores with $\bar{X}$ = 0 and s = 1. This will give us a pure number with no units of measurement.
Any score below the mean will now be negative.
Any score at the mean will be 0.
Any score above the mean will be positive.
8
Standardizing Data: Z-Scores
To compute the Z-scores: $Z = \frac{X - \bar{X}}{s}$
Example. Data: 0, 2, 4, 6, 8, 10
$\bar{X}$ = 30/6 = 5; s = 3.74

X     (X - 5)/3.74     Z
0     (0 - 5)/3.74     -1.34
2     (2 - 5)/3.74     -.80
4     (4 - 5)/3.74     -.27
6     (6 - 5)/3.74      .27
8     (8 - 5)/3.74      .80
10    (10 - 5)/3.74    1.34

Rcmdr>lines submit:
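The Z-score table can be checked with a short Python sketch. It assumes s is the sample standard deviation (dividing by n - 1), which is what gives s = 3.74 for these six values.

```python
from statistics import mean, stdev

data = [0, 2, 4, 6, 8, 10]

x_bar = mean(data)   # 5
s = stdev(data)      # sample standard deviation, about 3.74

# Z = (X - mean) / s for every observation
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(f"X = {x:>2}   Z = {z:+.2f}")
```

Standardized scores always have a mean of 0 and a standard deviation of 1, which is easy to confirm by passing z_scores back through mean and stdev.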
9
Data: Exam Scores
Z-scores for the original data, after changing 7 to 97, and after also changing 23 to 93:

X      Z (original)   Z (7 changed to 97)   Z (23 also changed to 93)
65     -0.45          -0.81                 -1.40
73     -0.11          -0.38                 -0.79
78      0.10          -0.10                 -0.40
69     -0.28          -0.60                 -1.09
7      -2.89           0.94 (as 97)          1.07 (as 97)
23     -2.21          -3.12                  0.76 (as 93)
98      0.94           0.99                  1.14
99      0.99           1.05                  1.22
97      0.90           0.94                  1.07
75     -0.02          -0.27                 -0.63
79      0.14          -0.05                 -0.32
85      0.40           0.28                  0.14
63     -0.53          -0.92                 -1.56
67     -0.36          -0.70                 -1.25
72     -0.15          -0.43                 -0.86
93      0.73           0.72                  0.76
95      0.82           0.83                  0.91
Mean   75.57          79.86                 83.19
s      23.75          18.24                 12.96

[Some rows of the original table are missing from this transcript.]
Note how correcting just 2 scores changes the mean, the standard deviation, and every Z-score.
10
Z-Scores
No matter what you are measuring, a Z-score of more than +5 or less than -5 would indicate a very, very unusual score.
For standardized data, if it is normally distributed (bell-shaped), about 95% of the data will be within ±2 standard deviations of the mean.
Ex: IQ scores with mean = 100, sd = 15. Mean ± 2*sd = (70, 130), so about 95% of IQ scores lie in these bounds.
If the data follows a normal distribution, 95% of the Z-scores will be between -1.96 and +1.96, 99.7% will fall between -3 and +3, and 99.99% will fall between -4 and +4.
Worst case scenario: at least 75% of the data are within 2 standard deviations of the mean, whatever the shape of the distribution. [Chebychev.]
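These percentages can be checked with the NormalDist class in Python's statistics module. The helper names share_within and chebyshev_lower_bound are made up for this sketch; the Chebyshev bound used is the standard 1 - 1/k^2.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

def chebyshev_lower_bound(k):
    """Minimum share within k standard deviations for ANY distribution (k > 1)."""
    return 1 - 1 / k ** 2

for k in (1.96, 2, 3, 4):
    print(f"within ±{k} sd: {share_within(k):.4%}")

print(f"Chebychev, k = 2: at least {chebyshev_lower_bound(2):.0%}")

# IQ example: mean 100, sd 15, bounds (70, 130)
iq = NormalDist(mu=100, sigma=15)
print(f"IQ between 70 and 130: {iq.cdf(130) - iq.cdf(70):.4%}")
```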
11
Five Number Summary
When examining a distribution for shape, sometimes the five number summary is useful:
Smallest | Q1 | Median | Q3 | Largest
Example: $\bar{X}$ = 15
5-number summary: 2 | 8 | 10 | 16.5 | 63
This data is right-skewed. In right-skewed distributions, the distance from Q3 to Xlargest (16.5 to 63) is significantly greater than the distance from Xsmallest to Q1 (2 to 8).
[Figure: number line running from 2 to 63 with Smallest, Q1, Median, Q3, and Largest marked.]
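A five number summary is easy to compute in Python. Because the task-time data are not fully preserved here, this sketch uses the 17 calculus exam scores from the MS Excel example instead. Quartile conventions differ between textbooks and software, so the choice of method="inclusive" below is an assumption, and other tools may report slightly different values for Q1 and Q3.

```python
from statistics import quantiles

scores = [0, 0, 10, 12, 15, 18, 20, 25, 30, 33, 34, 41, 56, 87, 92, 94, 95]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3
q1, med, q3 = quantiles(scores, n=4, method="inclusive")
smallest, largest = min(scores), max(scores)

print(f"{smallest} | {q1} | {med} | {q3} | {largest}")

# Right skew shows up as a longer upper tail than lower tail
upper_tail = largest - q3
lower_tail = q1 - smallest
print("upper tail:", upper_tail, "lower tail:", lower_tail)
```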
12
Boxplot
The boxplot is a way to graphically portray a distribution of data by means of its five-number summary. A boxplot can be drawn horizontally or vertically. For a horizontal boxplot:
Vertical line drawn within the box is the median
Vertical line at the left side of box is Q1
Vertical line at the right side of box is Q3
Line on left connects left side of box with Xsmallest (lower 25% of data)
Line on right connects right side of box with Xlargest (upper 25% of data)
13
A “bell-shaped” symmetric data distribution would look like this:
[Figure: boxplot of a symmetric distribution.]
Rcmdr>Graph>Boxplot
Note that if there are groups, like male/female, we can do a very informative side-by-side boxplot. Outliers would appear as separate dots.
Rcmdr>Data>Import>from text SSHA, by groups choose Sex
[Figure: boxplot of the test data from the previous slide.]
14
Categorical Data
We summarize categorical data using frequencies and graphical methods.
15
Working with Frequencies
A frequency distribution records data grouped into classes and the number of observations that fell into each class.
A frequency distribution can be used for:
categorical data
numerical data that can be grouped into intervals
numerical data with repeated observations
A percentage distribution records the percent of the observations that fell into each class.
16
Working with Frequencies
Example. A sample was taken of 200 professors at a (fictitious) local college. Each was asked for his or her (take-home) weekly salary. The responses ranged from about $520 to $590. If we wanted to display the data in, say, 7 equal intervals, we would use an interval width of $10.
Width of interval = Range / Number of classes = $70 / 7 = $10/class.
The Frequency / Percentage Distribution:

Take-home pay          frequency   percentage
$520 and under $530         6          3%
 530  "    "    540        30         15
 540  "    "    550        38         19
 550  "    "    560        52         26
 560  "    "    570        42         21
 570  "    "    580        24         12
 580      to    590         8          4
Total                     200        100
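The interval width and the percentage column follow directly from the numbers on the slide. Here is a minimal Python sketch; the class limits and frequencies are copied from the table, and the variable names are illustrative.

```python
low, high = 520, 590    # approximate smallest and largest responses, in dollars
num_classes = 7

# Width of interval = range / number of classes
width = (high - low) / num_classes

frequencies = [6, 30, 38, 52, 42, 24, 8]   # one count per $10 class
total = sum(frequencies)

# Percentage distribution: each class count as a share of all observations
percentages = [100 * f / total for f in frequencies]

for i, (f, p) in enumerate(zip(frequencies, percentages)):
    lower = low + i * width
    print(f"${lower:.0f} to ${lower + width:.0f}: {f:>3}  {p:.0f}%")
```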
17
Working with Frequencies
A Cumulative Distribution focuses on the number or percentage of cases that lie below or above specified values rather than within intervals.

Take-home pay      frequency   percentage
less than $520          0          0%
 "    "    530          6          3
 "    "    540         36         18
 "    "    550         74         37
 "    "    560        126         63
 "    "    570        168         84
 "    "    580        192         96
 "    "    590        200        100
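The cumulative columns are running totals of the class frequencies from the take-home pay example. A short sketch using itertools.accumulate:

```python
from itertools import accumulate

upper_limits = [530, 540, 550, 560, 570, 580, 590]   # "less than" boundaries
frequencies = [6, 30, 38, 52, 42, 24, 8]             # class counts from the frequency table

# Running total of the counts gives the "less than" cumulative frequencies
cumulative = list(accumulate(frequencies))
cumulative_percent = [100 * c / cumulative[-1] for c in cumulative]

for limit, c, p in zip(upper_limits, cumulative, cumulative_percent):
    print(f"less than ${limit}: {c:>3}  {p:.0f}%")
```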
18
Working with Frequencies
The Frequency Histogram:
Excel>Data>Data Analysis>Histogram
For SSHA data: Rcmdr>Graph>Histogram>choose Score
We can also do this by groups (Sex).
19
The Frequency Polygon
# Frequency polygon of Score from the SSHA data, using bins 20 points wide
library(ggplot2)
ggplot(SSHA, aes(Score)) +
  geom_freqpoly(binwidth = 20)
20
The Cumulative Frequency Distribution
Extra example on pulse rates:
21
Descriptive Statistics – 2 variables
Categorical Data: graphical representation
  Contingency Table
  Side-by-Side Bar Chart
Numerical Data: looking for relationships in bivariate data
  Scatter Plot
  Correlation
  The Regression Line
22
The Contingency Table
Two categorical variables are most easily displayed in a contingency table. This is a table of two-way frequencies. This also works for two-way percentages.
Example: “Who would you vote for in the next election?”

                        Male   Female   Total
Republican Candidate     250      500     750
Democrat Candidate       150      350     500
Total                    400      600    1000
23
The Side-by-Side Bar Chart
Excel>Select>Insert>Bar Chart
In Rcmdr, we first need to open a different data file, employee.txt:
Rcmdr>Data>Import Data>from text file etc… employee.txt
To create a contingency table: Rcmdr>Statistics>Contingency Tables>Two-way Table
To produce a bar graph: Rcmdr>Graph>Bar Graph>select variables and options as in
24
The Scatter Plot
Excel>Select>Insert>Scatter
What can we do with 2 numerical variables? We can graph them against each other.
Example: Grade and Height (in inches)
Y (Grade): 100, 95, 90, 80, 70, 65, 60, 40, 30, 20
X (Height): 73, 79, 62, 69, 74, 77, 81, 63, 68
After entering the data as before (this time into a new column, Rcmdr>Data>New Data), we can obtain the scatter plot with:
Rcmdr>Graph>Scatterplot:
25
The Scatter Plot
Correlation coefficient is r = .12
Coefficient of determination is $r^2$ = .01
We will learn about the above measures, as well as more about scatter plots, in the topic on CORRELATION.
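As a preview, the sketch below computes a correlation coefficient from its textbook definition. The grade and height pairing is not fully preserved in this transcript, so the four (height, grade) pairs here are hypothetical values invented purely for illustration; they are not the slide's data and do not reproduce r = .12.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only
heights = [60, 65, 70, 75]
grades = [70, 90, 60, 80]

r = pearson_r(heights, grades)
print(f"r = {r:.2f}, r^2 = {r ** 2:.2f}")

# The coefficient of determination is simply r squared
print(round(0.12 ** 2, 2))
```

Squaring the slide's r = .12 gives .0144, which is the .01 reported as the coefficient of determination.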
26
Homework
Practice, practice, practice.
As always, do lots and lots of problems. You can find these in the online lecture notes and homework assignments.
27
Extra examples of bar plots and side-by-side bar plots