Topic 5: Exploring quantitative data
Dot plot, mean, and standard deviation
Data matrix for emails
Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.

        Spam   Num_char   Line_breaks   Format   Number
1       no     21706      551           html     small
2              7011       183                    big
3       yes    631        28            text     none
...
3921           2225       65

Variable      Description
Spam          Specifies whether the email is spam
Num_char      Number of characters in email
Line_breaks   Number of line breaks in email
Format        Specifies whether email was in html or text format
Number        Indicates if email contained no number, a small number (under 1,000,000), or a big number
Data matrix for emails: quantitative variables
In the data matrix above, Num_char and Line_breaks are quantitative variables.
Sample of email data
Let’s consider a random sample of 50 emails from the data. Rows 1, 2, 3, and 50 of the data matrix for these 50 randomly selected emails are displayed below.

      Spam   Num_char   Line_breaks   Format   Number
1     no     2454       61            text     small
2            41623      1088          html
3            57         5
...
50           15829      242
Dot plot
A dot plot provides a case-by-case view of data for one quantitative variable.

      Num_char
1     2454
2     41623
3     57
...
50    15829
Dot plot and the mean
The “placement” of data, as seen in a dot plot or some other representation, is called the distribution of the data. The mean (also called the average) is a common way to measure the center of the distribution. The mean for the data below is 10.704.
The mean
The sample mean, denoted by $\bar{x}$, can be calculated as

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.
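As a rough illustration of the sample mean formula, the sketch below computes it in Python for the four Num_char values visible on the slides (rows 1, 2, 3, and 50). The function name sample_mean is made up for this example. Because the other 46 rows of the sample are not shown, the result is not expected to match 10.704.

```python
# Num_char for the four sample rows visible on the slides (rows 1, 2, 3, 50).
# The remaining 46 rows are not shown, so this is only a partial illustration.
num_char = [2454, 41623, 57, 15829]


def sample_mean(values):
    """Add up the observed values and divide by how many there are."""
    return sum(values) / len(values)


x_bar = sample_mean(num_char)
print(x_bar)  # 14990.75
```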
Population mean and estimation The population mean is also computed the same way, but denoted by μ (the Greek letter mu). It is often not possible to compute μ because data on the entire population is not available. The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate is probably not perfect, but if the sample is representative of the population, it is usually a good estimate.
Distributions with the same mean Each dot plot displays 124 observations and the distributions all have a mean of 6. What makes them different?
Distributions with the same mean Order these distributions from the least spread out to the most spread out. A. B. C.
Standard Deviation
The standard deviation is the typical distance of an observation from the mean. The mean of the distribution is $\bar{x} = 6$ and the sample size is n = 124. The standard deviation is computed as follows:

$$s = \sqrt{\frac{(x_1 - 6)^2 + (x_2 - 6)^2 + \cdots + (x_{124} - 6)^2}{124 - 1}}$$
Standard deviation measures spread
The standard deviations of the three distributions are given.
A. Std. dev. = 1.361
B. Std. dev. = 2.550
C. Std. dev. = 1.482
The standard deviation
The standard deviation of a sample is denoted by s and is calculated by dividing the sum of squared deviations from the sample mean by n - 1, then taking the square root. The standard deviation of the population is computed in a similar way, except we use the population mean μ in place of the sample mean and divide by n instead of n - 1. The standard deviation of the population is denoted by σ (the Greek letter sigma).
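To make the difference between dividing by n - 1 and dividing by n concrete, here is a minimal Python sketch. The data values and the function names sample_sd and population_sd are hypothetical and chosen only so the arithmetic is easy to follow. Python's built-in statistics.stdev and statistics.pstdev are used as a cross-check.

```python
import math
import statistics

# Hypothetical data, not taken from the slides. The mean is 5 and the
# squared deviations from the mean add up to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]


def sample_sd(xs):
    """Standard deviation s of a sample: divide by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))


def population_sd(xs):
    """Standard deviation sigma of a population: divide by n."""
    n = len(xs)
    mu = sum(xs) / n
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / n)


print(sample_sd(values))      # slightly larger, since we divide by 7
print(population_sd(values))  # 2.0, since 32 / 8 = 4

# Cross-check against the standard library.
print(statistics.stdev(values), statistics.pstdev(values))
```

For the same data, s is always a little larger than the value obtained by dividing by n, and the gap shrinks as n grows.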
Histograms and the shape of a distribution
Histogram
A histogram plots binned counts as bars.

Characters (in thousands)   Count
0-5                         19
5-10                        12
10-15                       6
15-20                       3
20-25
25-30                       5
30-35
35-40
40-45                       2
Histograms
A histogram is another way to display the distribution of a quantitative variable. It works better than a stem-and-leaf plot for larger data sets, but it doesn’t retain the actual numerical values.
Basic Steps for Creating a Histogram
1. Divide the range of the data (smallest to largest) into classes of equal width. The classes should not overlap.
2. Count the number of observations that fall into each class. Recall that the counts are also called frequencies.
3. Draw a horizontal axis and mark off the classes along this axis. The vertical axis can be the count, the proportion, or the percentage.
4. Draw a rectangle (a vertical bar) above each class with the height equal to the count, the proportion, or the percentage.
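The counting steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name bin_counts is made up, the data are the four Num_char values visible on the slides (converted to thousands), and a value that falls exactly on a class boundary is placed in the higher class, which is one common convention that the slides do not specify.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall into each equal-width class.

    Class i covers [start + i*width, start + (i+1)*width), so classes do
    not overlap. Boundary values go to the higher class (an assumption).
    Values outside all classes are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


# Num_char, in thousands, for the four sample rows shown on the slides.
chars_thousands = [2.454, 41.623, 0.057, 15.829]

counts = bin_counts(chars_thousands, start=0, width=5, n_bins=9)

# A crude text histogram: one '#' per observation in the class.
for i, count in enumerate(counts):
    low = i * 5
    print(f"{low:>2}-{low + 5:<2} {'#' * count}")
```

Dividing each count by the number of observations gives proportions, which can be used for the vertical axis instead of counts.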
Bin width: height of MAT 117 students Bin width can alter the story we get from the histogram. ½ in. bins 1 in. bins 6 in. bins 33 in. bins
Shape of a Distribution: Modality
Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)?
Note: To determine modality, step back and imagine a smooth curve over the histogram. Imagine the bars are wooden blocks and you drop a limp spaghetti noodle over them; the shape the noodle takes can be viewed as a smooth curve.
Modality: height of MAT 117 students Which bin width most accurately presents the modality? ½ in. bins 1 in. bins 6 in. bins 33 in. bins
Shape of a Distribution: Skewness Is the histogram right skewed, left skewed, or symmetric? Note: Histograms are said to be skewed to the side of the long tail.
Shape of a Distribution: Unusual Observations
Are there any unusual observations or potential outliers?
Sample of email data
How would you describe the shape of the distribution of the number of characters contained in the emails?
Unimodal and right skewed, with a potentially unusual observation at 40,000 characters.
Box plot and the five number summary
Percentiles, quartiles, and the median
The p-th percentile is a value such that p percent of observations fall at or below that value.
Data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
0-th percentile: Minimum
25-th percentile: First quartile Q1
50-th percentile: Median, Second quartile Q2
75-th percentile: Third quartile Q3
100-th percentile: Maximum
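For the eleven values on this slide, the five numbers can be computed with Python's standard library. This is a sketch, not necessarily the convention used in the course: textbooks and software differ in how they interpolate quartiles, and statistics.quantiles itself offers two methods that give different answers for Q1 and Q3 here.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# n=4 asks for the three cut points that split the data into quarters.
# The default method is "exclusive".
q1, median, q3 = statistics.quantiles(data, n=4)

five_number = (min(data), q1, median, q3, max(data))
print(five_number)  # (1, 3.0, 6.0, 9.0, 11)

# The "inclusive" method interpolates differently for Q1 and Q3.
print(statistics.quantiles(data, n=4, method="inclusive"))  # [3.5, 6.0, 8.5]
```

Both methods agree on the median. The disagreement on the quartiles is small and becomes less noticeable as the data set grows.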
Height of female MAT 117 students
The figure marks the minimum, Q1, the median, Q3, and the maximum. We want to graphically represent these five numbers, called the five-number summary. This graph is called a box plot. As you can see, there is a bit more to it than just these five numbers.
Box plot: height of female MAT 117 students
Anatomy of the box plot
Labeled parts: lower whisker, Q1, median, Q3, upper whisker, and potential outliers.
IQR, whisker, and outliers
Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range (IQR).
IQR = Q3 – Q1
Whiskers of a box plot can extend up to 1.5 x IQR away from the quartiles:
Max upper whisker reach = Q3 + 1.5 x IQR
Max lower whisker reach = Q1 – 1.5 x IQR
A potential outlier is an observation beyond the maximum reach of the whiskers. It is an observation that appears to be extreme relative to the rest of the data.
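The whisker rule can be checked by hand on a small data set. The heights below are hypothetical (they are not the MAT 117 data), and the quartiles come from statistics.quantiles with its default method, which may differ slightly from the convention used to draw the box plot on the slides.

```python
import statistics

# Hypothetical heights in inches, made up for this illustration.
heights = [59, 61, 62, 63, 63, 64, 64, 65, 65, 66, 67, 74]

q1, _, q3 = statistics.quantiles(heights, n=4)
iqr = q3 - q1

upper_reach = q3 + 1.5 * iqr  # farthest the upper whisker may extend
lower_reach = q1 - 1.5 * iqr  # farthest the lower whisker may extend

# Observations beyond the maximum reach are flagged as potential outliers.
outliers = [h for h in heights if h < lower_reach or h > upper_reach]

# Each whisker stops at the most extreme observation still within reach.
upper_whisker = max(h for h in heights if h <= upper_reach)
lower_whisker = min(h for h in heights if h >= lower_reach)

print(q1, q3, iqr)                   # 62.25 65.75 3.5
print(lower_reach, upper_reach)      # 57.0 71.0
print(lower_whisker, upper_whisker)  # 59 67
print(outliers)                      # [74]
```

Here the upper whisker ends at 67 rather than 71, because 67 is the largest observation that does not exceed the maximum reach.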
Outliers
Why is it important to look for outliers?
Identify extreme skew in the distribution.
Identify data collection and entry errors.
Provide insight into interesting features of the data.
Resistant statistics
Extreme Observations: 2006 US household income

                  Mean     Std. Dev.   Median   IQR
Actual Data       69,319   90,662      39,718   71,267
700K to 1,400K    72,760   122,603     39,718   71,267
3K to 1,400K      76,300   130,564     40,262   71,888
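The same effect can be reproduced on a tiny data set. The incomes below are hypothetical (in thousands of dollars) and are not the 2006 household data. The sketch doubles the largest value, in the spirit of the 700K to 1,400K row, and compares the four summaries. The helper name summarize is made up for this example.

```python
import statistics

# Hypothetical incomes in thousands of dollars, made up for illustration.
incomes = [12, 25, 31, 38, 40, 47, 55, 68, 90, 700]

# Replace the largest observation (700) with 1400.
modified = incomes[:-1] + [1400]


def summarize(xs):
    """Return the mean, standard deviation, median, and IQR of xs."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.mean(xs),
        "sd": statistics.stdev(xs),
        "median": median,
        "iqr": q3 - q1,
    }


before = summarize(incomes)
after = summarize(modified)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:10.2f} -> {after[name]:10.2f}")
```

The mean and standard deviation both increase, while the median and IQR stay exactly the same. Statistics that behave like the median and IQR are called resistant (or robust) to extreme observations.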
Quantitative data pairs: scatterplots
Scatterplot
A scatterplot provides a case-by-case view of data for two quantitative variables.

      Num_char   Line_breaks
1     2454       61
2     41623      1088
3     57         5
...
50    15829      242
Scatterplots: trends Linear trend Nonlinear trend
Scatterplots: trends (continued) Cluster trend No apparent trend
Categorical-quantitative data pairs: comparing groups
A categorical-quantitative data pair
Typically the categorical variable is the explanatory variable, and the quantitative variable is the response variable:
Explanatory: categorical variable
Response: quantitative variable
We want to compare the quantitative variable (its mean, median, etc.) for the different groups formed by the categorical variable.
Sample of email data
Recall the random sample of 50 emails from the data. In this sample, Num_char is a quantitative variable and Format is a categorical variable.
Number of characters and the format of emails
The table below shows the mean and standard deviation for the number of characters in emails formatted as text or html.

Number of characters (in thousands)
        Mean     Standard deviation
Text    2.308    3.626
HTML    14.862   13.711
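A table like this one is built by splitting the quantitative variable into groups according to the categorical variable and then summarizing each group separately. The sketch below shows the idea in Python. The six records are hypothetical, so the output will not match the table, and the variable names are made up for the example.

```python
import statistics
from collections import defaultdict

# Hypothetical (format, characters in thousands) pairs, for illustration only.
emails = [
    ("text", 2.5),
    ("html", 41.6),
    ("text", 0.1),
    ("html", 15.8),
    ("text", 1.0),
    ("html", 9.2),
]

# Group the quantitative values by the categorical variable.
groups = defaultdict(list)
for fmt, chars in emails:
    groups[fmt].append(chars)

# Summarize each group with its mean and sample standard deviation.
group_summary = {
    fmt: (statistics.mean(values), statistics.stdev(values))
    for fmt, values in groups.items()
}

for fmt, (mean, sd) in group_summary.items():
    print(f"{fmt:<5} mean = {mean:6.3f}   sd = {sd:6.3f}")
```

Comparing the rows then answers the question posed for a categorical-quantitative pair: how the center and spread of the response differ across groups.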
Comparing box plots