Ten things about Descriptive Statistics AP Statistics, Second Semester Review
What are descriptive statistics? Describing or summarizing a set of data … numerically or graphically. Why do we summarize data? Data in its raw form is overwhelming. Descriptive statistics allow us to digest the data more easily.
Overwhelming… 3, 2, 2, 4, 7, 6, 11, 5, 3, 5, 4, 5, 5, 0, 0, 6, 7, 3, 7, 4, 7, 4, 4, 2, 10, 0, 3, 6, 4, 5, 0, 6, 7, 3, 2, 8, 2, 1, 8, 3, 8, 6, 3, 4, 5, 6, 8, 7, 13, 12, 3, 4, 7, 5, 5, 6, 1, 3, 5, 4, 3, 0, 3, 2, 9, 3, 3, 4, 3, 7, 8, 4, 6, 7, 5, 0
Numerical summaries. Center: Mean, Median. Spread: Standard Deviation, Interquartile Range, Range. Other: Min, Max, Q1, Q3, Percentiles.
Numerical Summaries: Center Mean: The sum of the data values, divided by the number of values. Median: The middle data value when the data are sorted from low to high (with an even number of values, the average of the two middle values).
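As a small illustration, both definitions can be checked in Python on the first eight values of the raw data list shown earlier. The variable names are assumptions made for this sketch.

```python
from statistics import median

# First eight values from the raw data list shown earlier.
values = [3, 2, 2, 4, 7, 6, 11, 5]

# Mean: the sum of the data values divided by the number of values.
data_mean = sum(values) / len(values)

# Median: the middle value of the sorted data. With an even count,
# the two middle values (here 4 and 5) are averaged.
data_median = median(values)

print(data_mean)    # 5.0
print(data_median)  # 4.5
```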
Numerical Summaries: Center When should we avoid using the mean? In the presence of outliers or skewness, the mean gets pulled toward the outliers or the long tail. The mean is not a resistant measure of center, but the median is. Why should we use the mean? Because the Central Limit Theorem and other great tools are based on the mean.
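A quick way to see what "resistant" means is to add one extreme value to a data set and watch how far each measure moves. The data below are hypothetical, made up only for this sketch.

```python
from statistics import mean, median

# Hypothetical data, made up for illustration.
typical = [2, 2, 3, 4, 5, 6, 7, 11]
with_outlier = typical + [100]  # the same data plus one extreme value

# How far does each measure of center move when the outlier is added?
mean_shift = mean(with_outlier) - mean(typical)
median_shift = median(with_outlier) - median(typical)

print(round(mean_shift, 2))  # 10.56
print(median_shift)          # 0.5
```

The mean moves by more than 10 units while the median moves by half a unit, which is why the median is the safer choice when outliers are present.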
Numerical Summaries: Spread Measures of spread describe the variation in a data set. Standard Deviation: roughly the typical distance of the individual data values from the mean (the square root of the average squared deviation). Interquartile Range (IQR): the distance from Q1 to Q3. Range: the distance from Min to Max.
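The three measures of spread can be sketched in Python as follows. The data are hypothetical, and the `quartiles` helper is an assumption of this sketch: it uses the common classroom convention of taking Q1 and Q3 as the medians of the lower and upper halves, whereas calculators and software may use slightly different rules.

```python
from statistics import median, stdev

def quartiles(data):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Hypothetical helper; with an odd count the middle value is left
    out of both halves.
    """
    s = sorted(data)
    lower = s[:len(s) // 2]
    upper = s[(len(s) + 1) // 2:]
    return median(lower), median(upper)

values = [2, 2, 3, 4, 5, 6, 7, 11]  # hypothetical data

q1, q3 = quartiles(values)
iqr = q3 - q1                           # distance from Q1 to Q3
data_range = max(values) - min(values)  # distance from Min to Max
sd = stdev(values)                      # sample standard deviation (divides by n - 1)

print(iqr)           # 4.0
print(data_range)    # 9
print(round(sd, 2))  # 3.02
```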
Numerical Summaries: Spread Remember: Measures of spread are single values. For example, the range is given as “4”, not “from 6 to 10”. When should we avoid using standard deviation? Like the mean, standard deviation is not resistant to the effects of outliers and skewness. Why should we use standard deviation? Because the Central Limit Theorem and other great tools are based on standard deviation.
Numerical Summaries: Percentiles “The pth percentile of a distribution is the value with p percent of the observations less than it.” The median is the 50th percentile. Q1 is the 25th percentile. Q3 is the 75th percentile.
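The quoted definition translates directly into a count. In this minimal sketch, `percent_below` is a hypothetical helper name and the data are made up for illustration.

```python
def percent_below(data, value):
    """Percent of observations strictly less than `value`."""
    count_below = sum(1 for x in data if x < value)
    return 100 * count_below / len(data)

values = [2, 2, 3, 4, 5, 6, 7, 11]  # hypothetical data

print(percent_below(values, 5))  # 50.0
print(percent_below(values, 7))  # 75.0
```

With small data sets, different percentile conventions can give slightly different answers, so this is only one way to apply the definition.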
Outliers Outliers are atypical data values: values that are unusually far from the center. Using Tukey’s rule for outliers: any data value greater than Q3 + 1.5*IQR is an outlier, and any data value less than Q1 - 1.5*IQR is an outlier.
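Tukey's rule can be sketched in a few lines of Python. The function name `tukey_outliers` and the data are hypothetical. The quartiles are taken as the medians of the lower and upper halves, which is an assumption of this sketch because other quartile conventions exist.

```python
from statistics import median

def tukey_outliers(data):
    """Return the values outside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    s = sorted(data)
    q1 = median(s[:len(s) // 2])        # median of the lower half
    q3 = median(s[(len(s) + 1) // 2:])  # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in s if x < low_fence or x > high_fence]

values = [2, 2, 3, 4, 5, 6, 7, 11, 20]  # hypothetical data

# Q1 = 2.5, Q3 = 9.0, IQR = 6.5, so the fences are -7.25 and 18.75.
print(tukey_outliers(values))  # [20]
```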
Comparing Apples to Oranges When comparing performance across different scales, you can look at … Percentiles. Standardized values: z-values or t-values.
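A z-value measures how many standard deviations a value lies above or below its own group's mean, which puts scores from different scales on a common footing. The test scores, means, and standard deviations below are hypothetical numbers chosen for this sketch.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that `value` lies from `mean`."""
    return (value - mean) / sd

# Hypothetical: an 85 on a test with mean 75 and standard deviation 10,
# versus an 80 on a different test with mean 70 and standard deviation 5.
math_z = z_score(85, mean=75, sd=10)
reading_z = z_score(80, mean=70, sd=5)

print(math_z)     # 1.0
print(reading_z)  # 2.0
```

Although 85 is the higher raw score, the 80 is the stronger performance relative to its own group.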
Graphical Display Choosing the right graphical display depends upon the kind of variable. There are two types of variables. Quantitative: variables that are numbers, where adding and averaging make sense. Categorical: variables that take on one of a list of categories.
Dotplots A quick summary of the distribution of the values What is the shape of this distribution?
Histograms Great for seeing the shape of the distribution. The vertical axis can show frequency or relative frequency (percents).
Stemplots Like a histogram, but with the ability to see all the data What is the shape of this distribution? What is the median? What is the Q1 score?
Ogives (Graphs of Cumulative Relative Frequency) What is the median of this data set? What is the Q1 score? What is the Q3 score?
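The heights plotted on an ogive are cumulative relative frequencies, which can be computed as in the sketch below. The data are hypothetical. To estimate the median from such a graph, find 0.50 on the vertical axis and read across to the curve. Q1 and Q3 are read the same way at 0.25 and 0.75.

```python
from collections import Counter
from itertools import accumulate

values = [2, 2, 3, 4, 5, 6, 7, 11]  # hypothetical data

counts = Counter(values)
distinct = sorted(counts)

# Running total of the counts, then divided by n to get proportions.
cumulative_counts = list(accumulate(counts[v] for v in distinct))
cumulative_rel_freq = [c / len(values) for c in cumulative_counts]

for v, crf in zip(distinct, cumulative_rel_freq):
    print(v, crf)
```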
Normal Probability Plots When the display is approximately linear, the data from which the display came are approximately normal.
Boxplots Emphasize the five-number summary and outliers.
Side by Side Boxplots Great for comparing two distributions. Compare and contrast the number of texts sent by males and females.
Graphical Displays: Categorical Variables Pie Charts Not my favorite Bar Charts Show frequencies or relative frequencies (percents) Stacked Bar Charts Each bar is 100%, but broken into sub-categories
Pie Charts
Bar Charts Show frequencies or relative frequencies (percents). The bars don’t touch because they don’t represent a continuum.
Segmented Bar Charts Used for showing the relationship between two categorical variables Each bar is 100%, but broken into sub-categories
Scatterplots Used for showing relationships between two quantitative variables Typically the explanatory variable is placed on the horizontal axis and the response variable is on the vertical axis
Residuals A residual is the difference between an observed y value and the predicted y value (observed minus predicted). A Least-Squares Regression Line (LSRL) minimizes the sum of the squared residuals.
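The least-squares idea can be checked numerically. This sketch fits a line to hypothetical (x, y) data with the standard slope and intercept formulas, then confirms that nudging the slope or the intercept makes the sum of squared residuals larger. All names and data are assumptions made for illustration.

```python
# Hypothetical (x, y) data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b = sxy / sxx
a = y_bar - b * x_bar

def sum_squared_residuals(intercept, slope):
    """Sum of (observed y - predicted y)^2 for the line intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
best = sum_squared_residuals(a, b)

print(round(a, 2), round(b, 2))  # 2.2 0.6
print(round(best, 2))            # 2.4
print(sum_squared_residuals(a, b + 0.1) > best)  # True
```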
Residual Plots Help us to determine whether the relationship between two quantitative variables is linear When the residual plot shows no pattern, the relationship is linear When the residual plot shows a pattern, the relationship is not linear
LSRL statistics a = vertical intercept. b = slope: the predicted change in the response variable for every one-unit increase in the explanatory variable. r = correlation coefficient: ranges from -1 to 1, and measures the strength and direction of the linear relationship. r² = coefficient of determination: the proportion of the variation in the response variable that can be explained by the linear relationship with the explanatory variable.
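As a rough check on these statistics, the sketch below computes r for a small hypothetical data set and shows that r squared matches the proportion of variation in y that the line accounts for, computed as 1 minus the ratio of squared residuals to total squared variation. The data and variable names are assumptions made for illustration.

```python
from math import sqrt

# Hypothetical (x, y) data, made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / sqrt(sxx * syy)  # correlation coefficient
r_squared = r ** 2

# Same quantity from the residuals of the least-squares line.
b = sxy / sxx
a = y_bar - b * x_bar
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
r_squared_from_residuals = 1 - sse / syy

print(round(r, 3))                         # 0.775
print(round(r_squared, 3))                 # 0.6
print(round(r_squared_from_residuals, 3))  # 0.6
```

Here about 60% of the variation in y is accounted for by the linear relationship with x.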