Chapter 16: Exploratory data analysis: numerical summaries CIS 3033
16.1 The center of a dataset Sample mean: $\bar{x}_n = (x_1 + x_2 + \cdots + x_n)/n$. Sample median: $\mathrm{Med}_n$ is the middle element (if n is odd), or the average of the two middle elements (if n is even), after the elements are sorted. The sample mean and sample median of a dataset correspond to the expectation and median of a probability distribution, respectively. The sample mean is more sensitive to outliers than the sample median.
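The outlier sensitivity described above can be checked with a minimal sketch using Python's standard `statistics` module. The dataset and the extreme value 100 are illustrative assumptions, not data from the course.

```python
from statistics import mean, median

data = [2, 5, 7, 8]            # even n: the median averages the middle pair 5 and 7
with_outlier = data + [100]    # hypothetical extreme observation added for illustration

# Center of the original dataset
print(mean(data), median(data))

# Center after adding the outlier
print(mean(with_outlier), median(with_outlier))
```

With these numbers the mean moves from 5.5 to 24.4, while the median only moves from 6 to 7.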
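16.2 The amount of variability Sample variance: $s_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2$. Sample standard deviation: $s_n$, the square root of the sample variance. When the center of a dataset is represented by its sample median, the variability is often indicated by the median of absolute deviations, or MAD: $\mathrm{MAD} = \mathrm{Med}(|x_1 - \mathrm{Med}_n|, \ldots, |x_n - \mathrm{Med}_n|)$. The MAD is less sensitive to outliers than $s_n$.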
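As a rough illustration of these variability measures, the sketch below computes the sample variance and standard deviation (with the n-1 divisor) from the standard library and the MAD from a small helper. The helper name `mad` and the datasets are assumptions made for this example.

```python
from statistics import median, stdev, variance

def mad(xs):
    """Median of absolute deviations from the sample median."""
    med = median(xs)
    return median([abs(x - med) for x in xs])

data = [2, 5, 7, 8]
with_outlier = data + [100]    # hypothetical extreme observation

# variance() and stdev() use the n-1 divisor, i.e. the sample versions
print(variance(data), stdev(data), mad(data))
print(variance(with_outlier), stdev(with_outlier), mad(with_outlier))
```

Here the MAD changes only from 1.5 to 2 when the outlier is added, whereas the standard deviation grows from about 2.6 to more than 40.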
16.3 Empirical quantiles, quartiles, IQR The pth empirical quantile, qn(p), also called the 100p-th empirical percentile, is a number that is greater than (or equal to) a proportion p of the elements in the dataset, and less than (or equal to) a proportion 1−p of the elements. A quantile is not necessarily an element of the dataset. The order statistics of a dataset x1, x2, . . . , xn consist of the same elements as the original dataset, but ordered as x(1) ≤ x(2) ≤ · · · ≤ x(n).
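The ith order statistic x(i) is greater than or equal to i elements in the dataset, and less than or equal to n−i+1 elements (the element itself is counted on both sides), so it is taken as the i/(n+1) quantile. For other values of p, the quantile is obtained by linear interpolation between adjacent order statistics: $q_n(p) = x_{(k)} + \alpha \, (x_{(k+1)} - x_{(k)})$, where $k = \lfloor p(n+1) \rfloor$ and $\alpha = p(n+1) - k$. Example: for the dataset 2, 5, 7, 8 (n = 4) and p = 0.3, we have p(n+1) = 1.5, so k = 1 and α = 0.5, and qn(0.3) = 2 + (0.3*5 - 1) * (5 - 2) = 3.5.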
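The interpolation behind the worked example can be sketched as follows. The function name `empirical_quantile` is hypothetical, and returning the minimum or maximum when p(n+1) falls outside the range 1 to n is an assumption of this sketch, not part of the slide's definition.

```python
import math

def empirical_quantile(xs, p):
    """Interpolate between order statistics, treating x(i) as the i/(n+1) quantile."""
    ordered = sorted(xs)             # order statistics x(1) <= ... <= x(n)
    n = len(ordered)
    h = p * (n + 1)
    k = math.floor(h)                # index of the lower order statistic (1-based)
    alpha = h - k                    # fractional distance toward x(k+1)
    if k < 1:                        # assumption: clamp to the minimum
        return ordered[0]
    if k >= n:                       # assumption: clamp to the maximum
        return ordered[-1]
    return ordered[k - 1] + alpha * (ordered[k] - ordered[k - 1])

print(empirical_quantile([2, 5, 7, 8], 0.3))
```

For the dataset 2, 5, 7, 8 this reproduces the value 3.5 from the example, and p = 0.5 gives 6, which agrees with the sample median.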
The quantile qn(0.25) is called the lower quartile and qn(0.75) is called the upper quartile. The distance between the upper and lower quartiles, qn(0.75) − qn(0.25), is called the interquartile range, or IQR. Five-number summary of a dataset: minimum, lower (or first) quartile, median, upper (or third) quartile, maximum.
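A five-number summary and IQR might be computed as in the sketch below. It relies on `statistics.quantiles` with `method="exclusive"`, which places the ith sorted point at the i/(n+1) position; the helper name `five_number_summary` and the dataset are assumptions for illustration.

```python
from statistics import quantiles

def five_number_summary(xs):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters
    q1, med, q3 = quantiles(xs, n=4, method="exclusive")
    return min(xs), q1, med, q3, max(xs)

summary = five_number_summary([2, 5, 7, 8])
iqr = summary[3] - summary[1]      # upper quartile minus lower quartile
print(summary, iqr)
```

For 2, 5, 7, 8 the quartiles come out as 2.75 and 7.75, so the IQR is 5. Other software may use different quantile conventions and report slightly different quartiles.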
16.4 The box-and-whisker plot The box-and-whisker plot, or boxplot for short, visualizes the five-number summary. The horizontal width of the box is irrelevant. The height of the box is precisely the IQR: the box runs from the lower quartile to the upper quartile. The horizontal line inside the box corresponds to the sample median. The two whiskers extend from the box to the smallest and largest data elements that lie within 1.5 IQR of the box. All observations beyond the whiskers are called outliers.
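The whisker and outlier rule can be sketched numerically, without drawing anything. The helper name `boxplot_stats`, the dataset, and the use of `statistics.quantiles` with `method="exclusive"` for the quartiles are assumptions of this example.

```python
from statistics import quantiles

def boxplot_stats(xs):
    """Compute box edges, whisker ends, and outliers using the 1.5 IQR rule."""
    q1, med, q3 = quantiles(xs, n=4, method="exclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr     # furthest a lower whisker may reach
    high_fence = q3 + 1.5 * iqr    # furthest an upper whisker may reach
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "box": (q1, med, q3),
        "whiskers": (min(inside), max(inside)),   # most extreme points within the fences
        "outliers": sorted(x for x in xs if x < low_fence or x > high_fence),
    }

data = list(range(1, 11)) + [50]   # 1, 2, ..., 10 plus a hypothetical extreme value
stats = boxplot_stats(data)
print(stats)
```

In this dataset the box runs from 3 to 9 with the median at 6, the whiskers end at 1 and 10, and 50 is flagged as an outlier.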
Boxplots are especially useful for comparing several datasets side by side in a single, simple graphical display.