1 Empirical and Probability Distributions: 0.4 Exploratory Data Analysis
2 Exploratory data analysis. Given an unknown distribution, we often take a sample to explore its characteristics. Stem-and-leaf display: order the n observations in the sample from smallest to largest. Example: 50 test scores on a statistics examination. [Tables: stem-and-leaf display and ordered stem-and-leaf display of the 50 scores; columns Stems, Leaves, Frequency, Depths.]
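3 Order statistics of the sample. Ordering the n observations from smallest to largest gives the order statistics $y_1 \le y_2 \le \dots \le y_n$ (here, the order statistics of the 50 exam scores), which make the sample percentiles easy to compute. For $0 < 1/(n+1) \le p \le n/(n+1) < 1$, the (100p)th sample percentile is defined as the $(n+1)p$th order statistic if $(n+1)p$ is an integer. Otherwise, if $(n+1)p = r + t$ with integer $r$ and proper fraction $t$, it is the linear interpolation between $y_r$ and $y_{r+1}$, that is, $(1 - t)\,y_r + t\,y_{r+1}$.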
4 For n = 50: with p = 1/2, (n+1)p = 25.5, so the 50th percentile is $0.5\,y_{25} + 0.5\,y_{26}$; with p = 1/4, (n+1)p = 12.75, so the 25th percentile is $0.25\,y_{12} + 0.75\,y_{13}$; with p = 3/4, (n+1)p = 38.25, so the 75th percentile is $0.75\,y_{38} + 0.25\,y_{39}$. The 50th percentile is called the median of the sample. The 25th, 50th, and 75th percentiles are the first, second, and third quartiles of the sample. The 10th, 20th, …, and 90th percentiles are the deciles of the sample.
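The (n+1)p rule for sample percentiles can be sketched in Python as follows. This is a minimal illustration, not the slides' own code: the function name `sample_percentile` and the ten-value sample `scores` are made up for the example, since the 50 exam scores are not reproduced here.

```python
import math

def sample_percentile(data, p):
    """(100p)th sample percentile by the (n+1)p rule with linear interpolation."""
    y = sorted(data)                      # order statistics y_1 <= ... <= y_n
    n = len(y)
    if not 1 / (n + 1) <= p <= n / (n + 1):
        raise ValueError("p must satisfy 1/(n+1) <= p <= n/(n+1)")
    pos = round((n + 1) * p, 12)          # r + t; rounding guards against float noise
    r = math.floor(pos)                   # integer part r
    t = pos - r                           # proper fraction t, 0 <= t < 1
    if t == 0:
        return y[r - 1]                   # exactly the r-th order statistic (0-indexed list)
    return (1 - t) * y[r - 1] + t * y[r]  # interpolate between y_r and y_{r+1}

# Hypothetical sample with n = 10, so (n+1)p = 2.75, 5.5, 8.25 for the quartiles.
scores = [3, 5, 8, 10, 12, 15, 18, 21, 25, 30]
q1 = sample_percentile(scores, 0.25)
median = sample_percentile(scores, 0.50)
q3 = sample_percentile(scores, 0.75)
print(q1, median, q3)  # 7.25 13.5 22.0
```

With n = 10 the position for p = 1/4 is 2.75, so the first quartile is 0.25·y_2 + 0.75·y_3, the same pattern as the n = 50 case on the slide.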
5 Five-number summary: the minimum, the first quartile $q_1$, the median, the third quartile $q_3$, and the maximum. The interquartile range is IQR = $q_3 - q_1$. A box-and-whisker diagram (box plot) displays the five-number summary. Ex0.4-2: $y_1 = 34$, $q_1 = 58.75$, $q_2 = m = 71$, $q_3 = 81.25$, $y_{50} = 97$. The distribution is slightly skewed to the left.
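6 Ex0.4-5: IQR = 13.5 - 2 = 11.5 (so $q_1 = 2$ and $q_3 = 13.5$). Inner fences lie 1.5 × IQR = 1.5 × 11.5 = 17.25 beyond the quartiles, at 2 - 17.25 = -15.25 and 13.5 + 17.25 = 30.75. Outer fences lie 3 × IQR = 3 × 11.5 = 34.5 beyond the quartiles, at 2 - 34.5 = -32.5 and 13.5 + 34.5 = 48. Two suspected outliers are marked with an * in the box plot.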
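The fence arithmetic of Ex0.4-5 can be checked with a short Python sketch. The quartiles 2 and 13.5 come from the example; the function names and the three test values are hypothetical, and the labelling (between inner and outer fence: "suspected outlier", beyond the outer fence: "outlier") is a common convention assumed here rather than something the slide states.

```python
def fences(q1, q3):
    """Return (IQR, inner fences, outer fences) measured from the quartiles."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # 1.5 * IQR beyond the quartiles
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)   # 3 * IQR beyond the quartiles
    return iqr, inner, outer

def classify(value, inner, outer):
    """Label a value by its position relative to the fences (assumed convention)."""
    if value < outer[0] or value > outer[1]:
        return "outlier"
    if value < inner[0] or value > inner[1]:
        return "suspected outlier"
    return "ordinary"

iqr, inner, outer = fences(2, 13.5)            # quartiles from Ex0.4-5
print(iqr, inner, outer)                       # 11.5 (-15.25, 30.75) (-32.5, 48.0)

# Hypothetical observations, chosen only to exercise each branch.
labels = [classify(v, inner, outer) for v in (10, 35, 50)]
```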
7 Some functions of two or more order statistics. Measures of the middle: midrange = average of the extremes = $(y_1 + y_n)/2$; trimean = $(q_1 + 2q_2 + q_3)/4$. Measures of spread: range = difference of the extremes = $y_n - y_1$; interquartile range (IQR) = difference of the third and first quartiles = $q_3 - q_1$.
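As a quick illustration, these four quantities can be computed from the five-number summary of Ex0.4-2 (34, 58.75, 71, 81.25, 97). The variable names below are chosen for this sketch only.

```python
# Five-number summary from Ex0.4-2: min, q1, median, q3, max.
y_min, q1, q2, q3, y_max = 34, 58.75, 71, 81.25, 97

midrange = (y_min + y_max) / 2        # average of the extremes
trimean = (q1 + 2 * q2 + q3) / 4      # weighted average of the quartiles
data_range = y_max - y_min            # difference of the extremes
iqr = q3 - q1                         # interquartile range

print(midrange, trimean, data_range, iqr)  # 65.5 70.5 63 22.5
```

The midrange (65.5) falls below the trimean (70.5) and the median (71), which agrees with the slight left skew noted for these scores.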
8 0.5 Graphical comparisons of data sets. A stem-and-leaf display in which two samples share one column of stems is also called a back-to-back stem-and-leaf display; it is used to compare the characteristics of two populations of data. Ex0.5-1: the hardness results for Furnace 10 and Furnace 14. [Table: back-to-back stem-and-leaf display; columns Depths, Furnace 10 leaves, Stems, Furnace 14 leaves, Depths.]
9 Ex0.5-2: five-number summaries (minimum, $q_1$, median, $q_3$, maximum). F10: (46, 47, 48, 49, 51); F14: (36, 40.5, 43, 46.5, 51). Comparisons of three or more sets of data are also possible.
10 Quantile-quantile (q-q) plot. Take two sets of data of the same size, each in increasing order: $x_1 \le x_2 \le \dots \le x_n$ and $y_1 \le y_2 \le \dots \le y_n$. Then $x_r$ and $y_r$ are each called the quantile of order $r/(n+1)$, or the $100[r/(n+1)]$th percentile, of their sample. In a q-q plot, the quantiles of one sample are plotted against the corresponding quantiles of the other sample. If both samples are the same, the points $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ lie on a straight line with slope 1 and intercept 0.
11 If the first sample is shifted d units to the right of the second, the line still has slope 1 but its intercept is -d. The first sample has greater variability than the second if the slope is less than 1. The larger the slope, the smaller the variability of the first sample relative to the second. Curvature in the plot points to a difference in shape, for instance when the first sample is skewed to the left. Ex0.5-4: the average hourly number of misfeeding leads.
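The slope and intercept statements can be checked numerically with a small sketch. The data are invented for illustration, and `qq_points` and `fit_line` are hypothetical helper names; the line is fitted by ordinary least squares, which is one reasonable way to read a slope off the plot, not necessarily the one used in the course.

```python
def qq_points(first, second):
    """Pair the order statistics of two equal-size samples: (x_r, y_r)."""
    if len(first) != len(second):
        raise ValueError("samples must have the same size")
    return list(zip(sorted(first), sorted(second)))

def fit_line(points):
    """Least-squares slope and intercept of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx

second = [4.0, 1.0, 7.0, 3.0, 9.0]          # made-up second sample (y values)

shifted = [v + 2.0 for v in second]         # first sample shifted right by d = 2
slope_shift, intercept_shift = fit_line(qq_points(shifted, second))

spread = [2.0 * v for v in second]          # first sample twice as spread out
slope_spread, intercept_spread = fit_line(qq_points(spread, second))

print(slope_shift, intercept_shift)         # approximately 1 and -2
print(slope_spread, intercept_spread)       # approximately 0.5 and 0
```

Shifting the first sample by d = 2 leaves the slope at 1 and moves the intercept to about -2; doubling the spread of the first sample gives a slope of about 0.5, which is less than 1.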
12 Probability density and mass functions: probability density functions. The relative frequency (or density) histogram h(x) associated with n observations of a random variable X of the continuous type is a nonnegative function, and the area between its graph and the x-axis is 1. As n increases and the class-interval widths approach 0, h(x) approaches some function f(x) that describes the true probability distribution.
13 Probability density function (p.d.f.) f(x): (a) $f(x) > 0$, $x \in S$; (b) $\int_S f(x)\,dx = 1$; (c) the probability of the event $a < X < b$ is $P(a < X < b) = \int_a^b f(x)\,dx$. The corresponding distribution of probability is said to be of the continuous type.
14 Ex0.7-1: For a balanced spinner, the result of a spin is a random variable X whose space is $S = \{x : 0 \le x < 1\}$. Because the spinner is balanced, X has the p.d.f. $f(x) = 1$, $0 \le x < 1$.
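The p.d.f. properties can be checked numerically for the spinner. The sketch below uses a simple midpoint rule; the helper `integrate` and the interval (0.25, 0.75) are choices made for this illustration, not part of the example.

```python
def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def spinner_pdf(x):
    """p.d.f. of the balanced spinner: f(x) = 1 on 0 <= x < 1, else 0."""
    return 1.0 if 0 <= x < 1 else 0.0

total_area = integrate(spinner_pdf, 0, 1)       # property (b): should be 1
prob = integrate(spinner_pdf, 0.25, 0.75)       # P(0.25 < X < 0.75) = 0.5

print(total_area, prob)
```

For this p.d.f. the probability of any interval inside [0, 1) is just its length, so P(0.25 < X < 0.75) = 0.5.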
15 Probability mass function (p.m.f.) f(x): (a) $f(x) > 0$, $x \in S$; (b) $\sum_{x \in S} f(x) = 1$; (c) $P(X = u_i) = f(u_i)$, $i = 1, 2, \dots, k$, where $u_1, u_2, \dots, u_k$ are the points of S.
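A minimal sketch of checking properties (a) and (b) for a discrete distribution. The fair six-sided die is a hypothetical example, not one taken from the slides, and `is_pmf` is a name chosen here; exact fractions avoid rounding error in the sum.

```python
from fractions import Fraction

# Hypothetical p.m.f.: a fair six-sided die, f(x) = 1/6 for x = 1, ..., 6.
die_pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def is_pmf(pmf):
    """Check (a) every f(x) > 0 on S and (b) the values sum to 1."""
    return all(prob > 0 for prob in pmf.values()) and sum(pmf.values()) == 1

valid = is_pmf(die_pmf)
p_three = die_pmf[3]            # property (c): P(X = 3) = f(3)

print(valid, p_three)           # True 1/6
```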
17 Percentiles from a p.d.f. The (100p)th percentile is a number $\pi_p$ such that the area under f(x) to the left of $\pi_p$ is p: $p = \int_{-\infty}^{\pi_p} f(x)\,dx$. The 50th percentile $\pi_{0.5}$ is called the median, $m = \pi_{0.5}$. The 25th and 75th percentiles are called the first and third quartiles, $q_1 = \pi_{0.25}$ and $q_3 = \pi_{0.75}$ (and $m = q_2 = \pi_{0.5}$ is the second quartile). In the discrete case, percentiles are often harder to pin down because each point in the space S has positive probability, so the cumulative probability jumps and may skip over p.
18 Ex0.7-6: The largest value, Y, of two spins of the balanced spinner has the p.d.f. $f(y) = 2y$, $0 \le y < 1$. Find the median.
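One way to work Ex0.7-6: integrating the p.d.f. gives the area to the left of y as y² for 0 ≤ y < 1, so the median m solves m² = 1/2, that is, m = 1/√2 ≈ 0.7071. The sketch below confirms this numerically by bisection on that area function; the function names and the tolerance are choices made for this illustration.

```python
def cdf(y):
    """Area under f(y) = 2y to the left of y, for 0 <= y < 1: F(y) = y^2."""
    return y * y

def percentile(p, lo=0.0, hi=1.0, tol=1e-12):
    """Solve F(pi_p) = p by bisection; F is increasing on [0, 1]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid        # too little area to the left: move right
        else:
            hi = mid        # enough area: move left
    return (lo + hi) / 2

median = percentile(0.5)
print(median)               # close to 1/sqrt(2), about 0.7071
```

The same function gives the quartiles: `percentile(0.25)` is 0.5 up to the tolerance, and `percentile(0.75)` is about 0.866.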
19 Q-q plot for model evaluation. To examine how close a theoretical model is to the real distribution: for a coarse check, compute the mean $\mu$ and the variance $\sigma^2$ of the theoretical model, perform a random experiment, compute the sample mean $\bar{x}$ and the sample variance $s^2$ of the observed data, and compare these values; for a finer check, construct a quantile-quantile (q-q) plot.
20 The (100p)th percentile of a distribution is often called the quantile of order p, so the percentile $\pi_p$ of a theoretical distribution is its quantile of order p. Empirically, sort the n observations $\{x_1, x_2, \dots, x_n\}$ into the order statistics $y_1 \le y_2 \le \dots \le y_n$; then $y_r$ is the sample quantile of order $r/(n+1)$, i.e., the $100r/(n+1)$th percentile. Plot the points $(y_r, \pi_p)$, where $p = r/(n+1)$, $r = 1, \dots, n$. If the points lie close to a line of slope 1 through the origin, then $y_r \approx \pi_p$ and the theoretical model fits well; otherwise, the theoretical model is not good.
21 Ex0.7-7: use the p.d.f. $f(x) = 1$, $0 \le x < 1$, as a model for random numbers. From f(x), compute $\mu = 1/2$, $\sigma^2 = 1/12$, and $\sigma = 1/\sqrt{12} \approx 0.2887$. Pick the first 19 random numbers from Table IX in the Appendix (page 665) and sort them in ascending order. Compute the sample mean $\bar{x}$ and the sample standard deviation s, and compare them with $\mu$ and $\sigma$. Then construct the q-q plot.
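The steps of Ex0.7-7 can be sketched in Python. Table IX is not reproduced here, so the 19 values below come from Python's seeded `random` generator as a stand-in (an assumption of this sketch); the numbers will therefore differ from those in the example. For the model f(x) = 1 on [0, 1), the area to the left of x is x, so the theoretical quantile of order p is simply p.

```python
import random
import statistics

random.seed(0)                          # stand-in for Table IX; values are not the book's
n = 19
sample = [random.random() for _ in range(n)]

# Coarse check: compare model moments with sample moments.
mu, sigma = 0.5, (1 / 12) ** 0.5        # model mean and standard deviation
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)            # sample standard deviation (n - 1 divisor)

# Finer check: q-q points (y_r, pi_p) with p = r / (n + 1).
y = sorted(sample)                      # order statistics y_1 <= ... <= y_n
orders = [r / (n + 1) for r in range(1, n + 1)]
theoretical = list(orders)              # uniform model: pi_p = p
points = list(zip(y, theoretical))

print(f"mu={mu:.4f} sigma={sigma:.4f} x_bar={x_bar:.4f} s={s:.4f}")
for y_r, pi_p in points:
    print(f"{y_r:.4f}  {pi_p:.2f}")
```

With n = 19 the orders r/(n+1) are 0.05, 0.10, …, 0.95, so the theoretical quantiles fall on a convenient grid. Plotting `points` and comparing them with the line of slope 1 through the origin gives the q-q check described for model evaluation.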