1 Empirical and probability distributions 0.4 exploratory data analysis.

1 1 Empirical and probability distributions 0.4 exploratory data analysis

2 Exploratory data analysis
- Given an unknown distribution, we often take a sample to explore its characteristics.
- Stem-and-leaf display
- Order the n observations in a sample in ascending order.
- 50 test scores on a statistics examination:
  93 77 67 72 52 83 66 84 59 63
  75 97 84 73 81 42 61 51 91 87
  34 54 71 47 79 70 65 57 90 83
  58 69 82 76 71 60 38 81 74 69
  68 76 85 58 45 73 75 42 93 65
Table 3.1-4 Stem-and-leaf display
  Stems  Leaves                     Frequency  Depths
  3      4 8                        2          2
  4      2 7 5 2                    4          6
  5      2 9 1 4 7 8 8              7          13
  6      7 6 3 1 5 9 0 9 8 5        10         23
  7      7 2 5 3 1 9 0 6 1 4 6 3 5  13         (13)
  8      3 4 4 1 7 3 2 1 5          9          14
  9      3 7 1 0 3                  5          5
Table 3.1-5 Ordered stem-and-leaf display
  Stems  Leaves                     Frequency  Depths
  3      4 8                        2          2
  4      2 2 5 7                    4          6
  5      1 2 4 7 8 8 9              7          13
  6      0 1 3 5 5 6 7 8 9 9        10         23
  7      0 1 1 2 3 3 4 5 5 6 6 7 9  13         (13)
  8      1 1 2 3 3 4 4 5 7          9          14
  9      0 1 3 3 7                  5          5
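As a minimal sketch of how the ordered stem-and-leaf display can be built, the snippet below takes the tens digit as the stem and the units digit as the leaf. The function name `ordered_stem_and_leaf` is made up for illustration, and the depths column is not computed here.

```python
from collections import defaultdict

# The 50 test scores listed on the slide.
scores = [
    93, 77, 67, 72, 52, 83, 66, 84, 59, 63,
    75, 97, 84, 73, 81, 42, 61, 51, 91, 87,
    34, 54, 71, 47, 79, 70, 65, 57, 90, 83,
    58, 69, 82, 76, 71, 60, 38, 81, 74, 69,
    68, 76, 85, 58, 45, 73, 75, 42, 93, 65,
]


def ordered_stem_and_leaf(data):
    """Map each stem (tens digit) to its sorted list of leaves (units digit)."""
    display = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        display[stem].append(leaf)
    return dict(display)


display = ordered_stem_and_leaf(scores)
for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    # Print stem, leaves, and the frequency of the class in parentheses.
    print(f"{stem} | {leaves}  ({len(display[stem])})")
```

The printed frequencies can be compared against the Frequency column of the ordered display.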

3 Order statistics of the sample
- Order statistics of the 50 exam scores (y_1 ≤ y_2 ≤ … ≤ y_50):
  34 38 42 42 45 47 51 52 54 57
  58 58 59 60 61 63 65 65 66 67
  68 69 69 70 71 71 72 73 73 74
  75 75 76 76 77 79 81 81 82 83
  83 84 84 85 87 90 91 93 93 97
- The order statistics make it easy to compute the sample percentiles.
- For 0 < 1/(n+1) ≤ p ≤ n/(n+1) < 1, the (100p)th sample percentile is defined as
  - the (n+1)p-th order statistic, if (n+1)p is an integer; or
  - the linear interpolation y_r + t(y_{r+1} − y_r) between y_r and y_{r+1}, if (n+1)p = r + t, where r is an integer and t is a proper fraction.

4
- For p = 1/2: (n+1)p = 25.5, so the 50th percentile is 0.5 y_25 + 0.5 y_26 = 0.5(71) + 0.5(71) = 71.
- For p = 1/4: (n+1)p = 12.75, so the 25th percentile is 0.25 y_12 + 0.75 y_13 = 0.25(58) + 0.75(59) = 58.75.
- For p = 3/4: (n+1)p = 38.25, so the 75th percentile is 0.75 y_38 + 0.25 y_39 = 0.75(81) + 0.25(82) = 81.25.
- The 50th percentile is called the median of the sample.
- The 25th, 50th, and 75th percentiles are the first, second, and third quartiles of the sample.
- The 10th, 20th, …, and 90th percentiles are the deciles of the sample.
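The (n+1)p rule can be sketched in a few lines of Python. The helper name `sample_percentile` is an illustrative choice, not something from the slides, and the two guard clauses only protect against floating-point rounding at the ends of the allowed range of p.

```python
import math

# The 50 test scores from the stem-and-leaf example.
scores = [
    93, 77, 67, 72, 52, 83, 66, 84, 59, 63,
    75, 97, 84, 73, 81, 42, 61, 51, 91, 87,
    34, 54, 71, 47, 79, 70, 65, 57, 90, 83,
    58, 69, 82, 76, 71, 60, 38, 81, 74, 69,
    68, 76, 85, 58, 45, 73, 75, 42, 93, 65,
]


def sample_percentile(data, p):
    """(100p)th sample percentile by the (n+1)p rule with linear interpolation."""
    y = sorted(data)                 # order statistics y_1 <= ... <= y_n
    n = len(y)
    if not 1 / (n + 1) <= p <= n / (n + 1):
        raise ValueError("need 1/(n+1) <= p <= n/(n+1)")
    position = (n + 1) * p           # equals r + t
    r = math.floor(position)
    t = position - r
    if r < 1:                        # rounding guard at the lower end
        return float(y[0])
    if r >= n:                       # rounding guard at the upper end
        return float(y[-1])
    # y[r - 1] is y_r and y[r] is y_{r+1} (Python lists are 0-indexed).
    return y[r - 1] + t * (y[r] - y[r - 1])


print(sample_percentile(scores, 0.25))  # 58.75
print(sample_percentile(scores, 0.50))  # 71.0
print(sample_percentile(scores, 0.75))  # 81.25
```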

5 Five-number summary
- The five-number summary consists of the minimum, the first quartile q_1, the median, the third quartile q_3, and the maximum.
- IQR, the interquartile range, is q_3 − q_1.
- A box-and-whisker diagram (box plot) displays the five-number summary.
- Ex0.4-2: y_1 = 34, q_1 = 58.75, q_2 = m = 71, q_3 = 81.25, y_50 = 97.
- The distribution is slightly skewed to the left.
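For a quick check of Ex0.4-2, Python's standard `statistics.quantiles` with `method="exclusive"` uses the same (n+1)p positions as the slides. This is a sketch using the 50 exam scores; the variable names are illustrative.

```python
import statistics

# The 50 test scores from the stem-and-leaf example.
scores = [
    93, 77, 67, 72, 52, 83, 66, 84, 59, 63,
    75, 97, 84, 73, 81, 42, 61, 51, 91, 87,
    34, 54, 71, 47, 79, 70, 65, 57, 90, 83,
    58, 69, 82, 76, 71, 60, 38, 81, 74, 69,
    68, 76, 85, 58, 45, 73, 75, 42, 93, 65,
]

# n=4 asks for the three quartile cut points.
q1, median, q3 = statistics.quantiles(scores, n=4, method="exclusive")
five_number = (min(scores), q1, median, q3, max(scores))
iqr = q3 - q1

print(five_number)  # (34, 58.75, 71.0, 81.25, 97)
print(iqr)          # 22.5
```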

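6
- Ex0.4-5: q_1 = 2 and q_3 = 13.5, so IQR = 13.5 − 2 = 11.5.
- Inner fences: 1.5 × IQR = 1.5 × 11.5 = 17.25 beyond the quartiles.
- Outer fences: 3 × IQR = 3 × 11.5 = 34.5 beyond the quartiles.
- Two suspected outliers are marked with an asterisk (*).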
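A small sketch of the fence arithmetic for Ex0.4-5, assuming the usual convention that fences are measured outward from q_1 and q_3, that points between the inner and outer fences are "suspected outliers", and that points beyond the outer fences are "outliers". The observations of Ex0.4-5 are not given in the transcript, so the values passed to `classify` are hypothetical.

```python
def fences(q1, q3):
    """Return ((inner_low, inner_high), (outer_low, outer_high))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify(value, inner, outer):
    """Label a single observation relative to the fences."""
    if value < outer[0] or value > outer[1]:
        return "outlier"
    if value < inner[0] or value > inner[1]:
        return "suspected outlier"
    return "within fences"


# Quartiles read off the slide: q1 = 2, q3 = 13.5.
inner, outer = fences(2.0, 13.5)
print(inner)  # (-15.25, 30.75)
print(outer)  # (-32.5, 48.0)

# Hypothetical observations, for illustration only.
for value in (10.0, 35.0, 50.0):
    print(value, classify(value, inner, outer))
```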

7 Some functions of two or more order statistics
- Middle
  - Midrange = average of the extremes = (y_1 + y_n)/2
  - Trimean = (q_1 + 2q_2 + q_3)/4
- Spread
  - Range = difference of the extremes = y_n − y_1
  - Interquartile range = difference of the third and first quartiles = q_3 − q_1 (= IQR)
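These four measures are simple arithmetic on the five-number summary. As an illustration, the sketch below plugs in the values from Ex0.4-2 (34, 58.75, 71, 81.25, 97); the function names are illustrative.

```python
def midrange(y_min, y_max):
    """Average of the extremes."""
    return (y_min + y_max) / 2


def trimean(q1, q2, q3):
    """Weighted average of the quartiles, with the median counted twice."""
    return (q1 + 2 * q2 + q3) / 4


# Five-number summary of the 50 exam scores (Ex0.4-2).
y1, q1, q2, q3, yn = 34, 58.75, 71, 81.25, 97

print(midrange(y1, yn))     # 65.5
print(trimean(q1, q2, q3))  # 70.5
print(yn - y1)              # range: 63
print(q3 - q1)              # interquartile range: 22.5
```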

8 0.5 Graphical comparisons of data sets
- A back-to-back stem-and-leaf display is used to compare the characteristics of two populations of data.
- Ex0.5-1: The hardness results for Furnace 10 and Furnace 14.
  Depths  Furnace 10 leaves      Stems  Furnace 14 leaves  Depths
  0                              3s     6                  1
  0                              3.     8 8 8 9            5
  0                              4*     0 1 1 1 1          10
  0                              4t     3 3 3 3 3 3        (6)
  0                              4f     4                  9
  11      7 7 7 7 7 7 6 6 6 6 6  4s     6 6 7 7 7          8
  (11)    9 9 9 9 9 8 8 8 8 8 8  4.     8 9                3
  3       1 1 0                  5*     1                  1

9
- Ex0.5-2: five-number summaries (minimum, q_1, median, q_3, maximum) of the two furnaces:
  - F10: (46, 47, 48, 49, 51)
  - F14: (36, 40.5, 43, 46.5, 51)
- Comparisons of three or more sets of data are also possible.

10 Quantile-quantile (q-q) plot
- Consider two sets of ordered data: x_1 ≤ x_2 ≤ … ≤ x_n and y_1 ≤ y_2 ≤ … ≤ y_n.
- x_r and y_r are called the quantiles of order r/(n+1); they are the 100[r/(n+1)]th percentiles.
- In a q-q plot, the quantiles of one sample are plotted against the corresponding quantiles of the other sample.
- If the two samples were identical, the points (x_1, y_1), (x_2, y_2), …, (x_n, y_n) would lie on a straight line with slope 1 and intercept 0.

11
- If the first sample's mean is shifted by d units, the intercept is −d.
- The first sample has greater variability if the slope is less than 1.
- If the slope increases, the variability of the first sample decreases.
- For instance, the first sample is skewed to the left.
- Ex0.5-4: the average hourly number of misfeeding leads.
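The first two rules can be checked on a toy case. The sketch below pairs the order statistics of two equal-size samples and fits a least-squares line. The fit is used only as a convenient way to read off a slope and intercept (the slides do not prescribe a fitting method), and both samples are hypothetical.

```python
def qq_pairs(first, second):
    """Pair the r-th order statistics of two samples of equal size."""
    if len(first) != len(second):
        raise ValueError("samples must have the same size")
    return list(zip(sorted(first), sorted(second)))


def fit_line(pairs):
    """Least-squares slope and intercept of y on x for the q-q points."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


second = [1.0, 2.0, 3.0, 4.0, 5.0]           # hypothetical second sample

# First sample shifted by d = 2: slope stays 1, intercept becomes -d.
shifted = [value + 2.0 for value in second]
print(fit_line(qq_pairs(shifted, second)))    # (1.0, -2.0)

# First sample twice as spread out: slope drops below 1.
stretched = [2.0 * value for value in second]
print(fit_line(qq_pairs(stretched, second)))  # (0.5, 0.0)
```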

12 0.7 Probability density and mass functions: probability density functions
- The relative frequency (or density) histogram h(x) associated with n observations of a random variable X of the continuous type is a nonnegative function.
- The area between its graph and the x-axis is 1.
- As n increases, the widths of the class intervals approach 0.
- Then h(x) approaches some function f(x) that describes the true probability distribution.

13
- Probability density function (p.d.f.)
  - (a) f(x) > 0, x ∈ S
  - (b) ∫_S f(x) dx = 1
  - (c) The probability of the event a < X < b is P(a < X < b) = ∫_a^b f(x) dx
- The corresponding distribution of probability is said to be one of the continuous type.

14
- Ex0.7-1: For a balanced spinner, the result of a spin is a random variable X whose space is S = {x: 0 ≤ x < 1}.
- Because the spinner is balanced, X has the p.d.f. f(x) = 1, 0 ≤ x < 1.
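To see the density histogram h(x) settle toward f(x) = 1, the spinner can be simulated. This is only a sketch: `random.random()` stands in for the balanced spinner, and the seed, the sample size of 100,000, and the 10 class intervals are arbitrary choices, not values from the slides.

```python
import random


def density_histogram(values, num_classes):
    """Heights h(x) of a density histogram with equal classes on [0, 1)."""
    width = 1.0 / num_classes
    counts = [0] * num_classes
    for x in values:
        # min() keeps the index in range if x * num_classes rounds up.
        index = min(int(x * num_classes), num_classes - 1)
        counts[index] += 1
    n = len(values)
    # Height = relative frequency / class width, so the total area is 1.
    return [count / (n * width) for count in counts], width


rng = random.Random(0)  # fixed seed so the run is repeatable
spins = [rng.random() for _ in range(100_000)]

heights, width = density_histogram(spins, 10)
area = sum(h * width for h in heights)

print([round(h, 3) for h in heights])  # each height should be close to 1
print(area)                            # total area, equal to 1 up to rounding
```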

15 Probability mass function (p.m.f.)
- (a) f(x) > 0, x ∈ S
- (b) Σ_{x ∈ S} f(x) = 1
- (c) P(X = u_i) = f(u_i), i = 1, 2, ..., k
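Conditions (a) and (b) are easy to check mechanically. The sketch below uses exact fractions and a hypothetical p.m.f. f(x) = x/10 on S = {1, 2, 3, 4}; neither the p.m.f. nor the helper name `is_pmf` comes from the slides.

```python
from fractions import Fraction


def is_pmf(f, space):
    """Check (a) f(x) > 0 on S and (b) the values of f sum to 1 over S."""
    positive = all(f(x) > 0 for x in space)
    total_is_one = sum(f(x) for x in space) == 1
    return positive and total_is_one


space = [1, 2, 3, 4]  # hypothetical space S


def f(x):
    """Hypothetical p.m.f. f(x) = x/10, kept exact with Fraction."""
    return Fraction(x, 10)


print(is_pmf(f, space))  # True
print(f(3))              # (c) P(X = 3) = f(3) = 3/10
```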

16 16

17 Percentiles from the p.d.f.
- The (100p)th percentile is a number π_p such that the area under f(x) to the left of π_p is p: p = ∫_{−∞}^{π_p} f(x) dx.
- The 50th percentile π_0.5 is called the median, m = π_0.5.
- The 25th and 75th percentiles are called the first and third quartiles: q_1 = π_0.25 and q_3 = π_0.75 (m = q_2 = π_0.5 is the second quartile).
- In the discrete case, percentiles are often harder to pin down because each point in the space S has positive probability.

18
- Ex0.7-6: The distribution of the largest value, Y, of two spins of the balanced spinner has the p.d.f. f(y) = 2y, 0 ≤ y < 1. Find the median.
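For Ex0.7-6 the area to the left of y is y², so the median solves m² = 1/2, giving m = 1/√2 ≈ 0.7071. The sketch below finds the same number numerically, as one possible way to compute π_p when no closed form is handy. It assumes the p.d.f. is positive on a bounded interval [lower, upper]; the function names are illustrative.

```python
def area_left_of(pdf, lower, x, steps=1000):
    """Approximate the area under pdf from lower to x with the midpoint rule."""
    width = (x - lower) / steps
    return sum(pdf(lower + (i + 0.5) * width) for i in range(steps)) * width


def percentile_from_pdf(pdf, p, lower, upper, tol=1e-9):
    """Find pi_p with area_left_of(pi_p) = p by bisection on [lower, upper]."""
    lo, hi = lower, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if area_left_of(pdf, lower, mid) < p:
            lo = mid   # too little area: pi_p lies to the right
        else:
            hi = mid   # enough area: pi_p lies to the left
    return (lo + hi) / 2


def f(y):
    """p.d.f. of the larger of two spins, on 0 <= y < 1."""
    return 2 * y


median = percentile_from_pdf(f, 0.5, 0.0, 1.0)
print(round(median, 4))  # 0.7071
```

The same call with p = 0.25 or p = 0.75 returns the first and third quartiles.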

19 Q-q plot for model evaluation
- To examine how close a theoretical model is to the real distribution:
- Coarse check:
  - Compute the mean μ and the variance σ² of the theoretical model.
  - Perform a random experiment and compute the sample mean x̄ and the sample variance s² of the observed data.
  - Compare these values.
- Finer check:
  - Construct the quantile-quantile (q-q) plot.

20
- The (100p)th percentile of a distribution is often called the quantile of order p.
- The percentile π_p of a theoretical distribution is its quantile of order p.
- Empirically, sort the n observations {x_1, x_2, …, x_n} into the order statistics {y_1, y_2, …, y_n} (y_1 ≤ y_2 ≤ … ≤ y_n).
- y_r is the sample quantile of order r/(n+1), that is, the 100r/(n+1)th percentile.
- Plot the points (y_r, π_p), where p = r/(n+1), r = 1, …, n.
- If the points lie close to a line of slope 1, then y_r ≈ π_p.
- Otherwise, the theoretical model is not a good fit.
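The plotting recipe reduces to building the list of points (y_r, π_p). A minimal sketch follows, where `quantile` is any function returning π_p for the theoretical model. The four-value sample is hypothetical and the model is the spinner distribution, for which π_p = p.

```python
def qq_points(sample, quantile):
    """Return the points (y_r, pi_p) with p = r/(n+1), r = 1, ..., n."""
    y = sorted(sample)  # order statistics
    n = len(y)
    return [(y[r - 1], quantile(r / (n + 1))) for r in range(1, n + 1)]


def uniform_quantile(p):
    """Quantile of order p for f(x) = 1, 0 <= x < 1: pi_p = p."""
    return p


sample = [0.7, 0.1, 0.4, 0.9]  # hypothetical observations
for y_r, pi_p in qq_points(sample, uniform_quantile):
    print(y_r, pi_p)
```

Passing the resulting pairs to any plotting tool gives the q-q plot; points near the line y = x suggest the model is reasonable.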

21
- Ex0.7-7: use the p.d.f. f(x) = 1, 0 ≤ x < 1, as a model for random numbers.
- From f(x), compute μ = 1/2, σ² = 1/12, and σ = 0.2887.
- Pick the first 19 random numbers from Table IX in the Appendix (page 665).
- Sort them in ascending order:
  0.0315 0.0460 0.1233 0.2055 0.2581 0.2906 0.3384 0.4658 0.4779 0.4871
  0.4930 0.5334 0.6814 0.6960 0.7244 0.7843 0.8071 0.8287 0.9705
- Compute x̄ = 0.4865 and s = 0.2797.
- Compare these means and standard deviations.
- Construct the q-q plot.
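The numbers in Ex0.7-7 can be reproduced directly from the 19 sorted values. In this sketch `statistics.stdev` uses the n − 1 divisor, which is what matches s = 0.2797. The largest gap between y_r and π_p = r/20 is an extra summary added here for illustration, not something the slide reports.

```python
import math
import statistics

# The 19 sorted random numbers listed on the slide.
sample = [
    0.0315, 0.0460, 0.1233, 0.2055, 0.2581, 0.2906, 0.3384,
    0.4658, 0.4779, 0.4871, 0.4930, 0.5334, 0.6814, 0.6960,
    0.7244, 0.7843, 0.8071, 0.8287, 0.9705,
]
n = len(sample)

# Coarse check: theoretical mean and standard deviation versus the sample's.
mu = 0.5
sigma = math.sqrt(1 / 12)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation, divisor n - 1

print(round(mu, 4), round(x_bar, 4))  # 0.5 0.4865
print(round(sigma, 4), round(s, 4))   # 0.2887 0.2797

# Finer check: q-q points (y_r, pi_p) with pi_p = p = r/(n+1) for this model.
points = [(y, r / (n + 1)) for r, y in enumerate(sorted(sample), start=1)]
largest_gap = max(abs(y - p) for y, p in points)
print(largest_gap)  # how far the points stray from the line y = x
```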

