Those who don’t know statistics are condemned to reinvent it… David Freedman.

1 Those who don’t know statistics are condemned to reinvent it… David Freedman

2 All you ever wanted to know about the histogram and more...

3 Distribution of number of graphics on web pages (N = 1873): Mean = 17.93, Median = 16.00, Std. Dev = 17.92 [Histogram; x-axis: graphic count]

4 Horizontal Scale

5 Distribution of redundant link % on web pages (N = 1861): Mean = 22.1, Median = 14, Std. Dev = 37.33

6 Plotting a histogram: endpoint convention, plot frequencies, make equal intervals etc.

7 Frequency table convention: include the left endpoint in the class interval
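The left-endpoint convention can be sketched in a few lines of Python. The function name `frequency_table`, the interval width, and the sample values are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def frequency_table(values, width):
    """Count values per class interval [lo, lo + width).

    Each interval includes its left endpoint and excludes its right one,
    so a value exactly on a boundary is counted in the interval to its right.
    All intervals have the same width.
    """
    counts = Counter((v // width) * width for v in values)
    return {(lo, lo + width): counts[lo] for lo in sorted(counts)}

# Made-up data: 10 and 20 sit exactly on class boundaries.
table = frequency_table([0, 3, 10, 10, 19, 20], width=10)
for (lo, hi), n in table.items():
    print(f"[{lo}, {hi}): {n}")
```

Here both 10s are counted in [10, 20) rather than [0, 10), which is what including the left endpoint means.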

8 Frequency/Probability

9 Number of fonts used on a web page [Histogram; x-axis: number of fonts, 1 to 15; y-axis: frequency / probability, with 0 = 0, 200 = .1, 400 = .2, 600 = .3, 800 = .4, 1000 = .5]
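A histogram can be labelled with frequencies or with probabilities; the second scale is just each frequency divided by the total count. A minimal sketch, where `to_probabilities` and the counts are hypothetical and not the slide's data:

```python
def to_probabilities(frequencies):
    """Convert bar heights from counts to relative frequencies (probabilities)."""
    total = sum(frequencies.values())
    return {key: count / total for key, count in frequencies.items()}

# Hypothetical counts of pages by number of fonts used.
font_counts = {1: 100, 3: 400, 5: 500}
font_probs = to_probabilities(font_counts)
print(font_probs)
```

The shape of the histogram is unchanged; only the vertical scale differs.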

10 Cleaning up a histogram: getting rid of outliers

11 Distribution of word count (N = 1903): Mean = 393.2, Median = 223, Std. Dev = 725.24, Minimum = 0, Maximum = 20,357

12 Distribution of word count (N = 1897), top six removed: Mean = 368.0, Median = 223, Std. Dev = 474.04, Minimum = 0, Maximum = 4132

13 Distribution of word count (N = 1873): Mean = 333.4, Median = 220, Std. Dev = 360.30, Minimum = 0, Maximum = 4132
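The effect seen in slides 11 to 13 (the mean and SD drop sharply when the largest values are removed, while the median barely moves) can be reproduced on a toy list. This is a sketch with made-up numbers; `summarize` and `drop_top` are hypothetical helper names, and the slides do not say which rule was used to pick the removed pages.

```python
import statistics

def summarize(values):
    """Return (mean, median, sample SD) for a list of numbers."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.stdev(values))

def drop_top(values, k):
    """Return the values in ascending order with the k largest removed."""
    return sorted(values)[:len(values) - k]

# Made-up word counts with one extreme page.
word_counts = [100, 150, 200, 250, 300, 20000]
print(summarize(word_counts))               # all pages
print(summarize(drop_top(word_counts, 1)))  # largest page removed
```

Removing a single extreme page moves the mean from 3500 to 200, while the median only moves from 225 to 200.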

14 What can histograms tell you

15 Distribution of link count on good & bad web pages [Histograms; panels: Good Sites, Bad Sites]

16 Making inferences from histograms: incidence of riots and temperature [Histogram; x-axis: temperature, 30 to 110]

17 Mean and Median. The mean shifts around; the median does not shift much and is more stable. Computing the median: for odd N, find the middle number; for even N, interpolate between the middle two, e.g. if they are 7 and 9, then 8 is the median. The mean is the arithmetic average; the median is the 50% point. The mean is the point where the graph balances.
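The rule for computing the median can be written out directly. This is a minimal sketch; Python's standard library also offers `statistics.median`, which follows the same odd/even rule.

```python
def median(values):
    """Middle value for odd N; midpoint of the two middle values for even N."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 7, 9, 12]))  # even N: the middle two are 7 and 9
print(median([3, 7, 9]))      # odd N: the middle number is 7
```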

18 The instability of means and standard deviations

19 Add two numbers: watch the mean, median, & SD

20 Add one outlier...

21 Standard Deviation: a measure of spread

22 Same mean, different spread [Two histograms, each marked with its SD]

23 The Standard Deviation

24 The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.

25 Understanding the standard deviation. Let's start with a list: 1, 2, 2, 3. The histogram is symmetric about 2: 2 is the mean, with 50% of the area to the left of 2 and 50% to the right. [Histogram; y-axis: 0% to 50%]

26 List: 1, 2, 2, 3; Average = 2; SD = 0.8. List: 1, 2, 2, 5; Average = 2.5; SD = 1.73. List: 1, 2, 2, 7; Average = 3; SD = 2.71. [Three histograms; y-axis: 0% to 50%]

27 Computing the standard deviation. List: 20, 10, 15, 15. Average = 15. Find the deviations from the average: 5, −5, 0, 0. Square the deviations and add them up: 5² + (−5)² + 0² + 0² = 50. Divide by N − 1: 50/3 = 16.67. Take the square root: √16.67 = 4.08.
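The same steps, written as a short Python function. The name `sample_sd` is illustrative; the N − 1 divisor follows the slide, and the standard library's `statistics.stdev` uses the same divisor.

```python
import math

def sample_sd(values):
    """Sample standard deviation with the N - 1 divisor, step by step."""
    n = len(values)
    average = sum(values) / n
    deviations = [v - average for v in values]       # 5, -5, 0, 0 for the slide's list
    sum_of_squares = sum(d * d for d in deviations)  # 50 for the slide's list
    return math.sqrt(sum_of_squares / (n - 1))

print(round(sample_sd([20, 10, 15, 15]), 2))
```

Applied to the lists 1, 2, 2, 5 and 1, 2, 2, 7 from the previous slide, the function gives 1.73 and 2.71, which suggests those SDs were also computed with the N − 1 divisor.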

28 Properties of the standard deviation. The standard deviation is in the same units as the mean. The sample standard deviation depends on sample size: in small samples it tends to underestimate the spread of the population (therefore as a measure of spread it is biased). In normally distributed data, about 68% of the sample lies within 1 SD of the mean.

29 Properties of the Normal Probability Curve. The graph is symmetric about the mean (the part to the right is a mirror image of the part to the left). The total area under the curve equals 100%. The curve is always above the horizontal axis. It appears to stop after a certain point (the curve gets really low).

30 The graph is symmetric about the mean. The total area under the curve equals 100%. Mean ± 1 SD covers about 68% of the area, mean ± 2 SD about 95%, and mean ± 3 SD about 99.7%. You can disregard the rest of the curve.
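These percentages can be checked numerically. For a normal curve, the area within k SDs of the mean equals erf(k / √2); the sketch below uses `math.erf` from the Python standard library, and `area_within` is an illustrative name.

```python
import math

def area_within(k):
    """Fraction of the area under a normal curve within k SDs of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {area_within(k):.1%}")
```

The slide's 68%, 95%, and 99.7% are these values rounded.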

31 Distribution of judges' ratings for the Webby Awards (N = 1867): Mean = 6.3, Median = 6.3, Std. Dev = 1.98, Skewness = −.43, Kurtosis = −.201

32 What about when the histogram is not normal... It is a remarkable fact that many histograms in real life tend to follow the normal curve. For such histograms, the mean and SD are good summary statistics. The average pins down the center, while the SD gives the spread. For histograms which do not follow the normal curve, the mean and SD are not good summary statistics.

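33 Distribution of word count on web pages: Mean = 348.3, Std. Dev = 384.83. 3 SD = 384 × 3 = 1152, so mean − 3 SD = 348.3 − 1152 = −803.7, which is impossible for a count. Read literally, the normal curve would mean about 30% of the sample had a negative number of links.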

34 When the SD is influenced by outliers, use the interquartile range: 75th percentile − 25th percentile. Note: a percentile is a score below which a certain % of the sample falls.
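A minimal sketch of the interquartile range, using `statistics.quantiles` from the Python standard library. The helper name `iqr` and the data are made up, and percentile conventions differ between packages; this uses the library's default "exclusive" method, so other tools may report slightly different quartiles.

```python
import statistics

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = list(range(1, 12))    # 1, 2, ..., 11
dirty = clean[:-1] + [1000]   # largest value replaced by an outlier

print(iqr(clean), statistics.stdev(clean))
print(iqr(dirty), statistics.stdev(dirty))
```

The outlier inflates the SD many times over but leaves the interquartile range unchanged, because the quartiles depend only on the middle of the sorted list.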

35 Measures of Normality. Visual examination. Skewness: a measure of symmetry. [Three curves: Positively Skewed, Negatively Skewed, Symmetric]

36 Kurtosis: Does it cluster in the middle? Kurtosis is based on a distribution's tail. Distributions with a large tail: leptokurtic. Distributions with a small tail: platykurtic. Distributions with a normal tail: mesokurtic. [Three curves: Large Tail, Small Tail, Normal Tail]
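Skewness and kurtosis can both be computed from central moments. The sketch below uses the simple moment-based formulas and reports excess kurtosis (kurtosis minus 3), so a normal curve scores 0 on both. That convention is an assumption: statistics packages usually apply small-sample corrections, so their output will not match these functions exactly.

```python
def central_moment(values, order):
    """Average of (value - mean) ** order over the list."""
    n = len(values)
    average = sum(values) / n
    return sum((v - average) ** order for v in values) / n

def skewness(values):
    """Positive for a long right tail, negative for a long left tail, 0 if symmetric."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Positive for heavier tails than the normal curve, negative for lighter tails."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

print(skewness([1, 2, 2, 3]))  # symmetric list
print(skewness([1, 2, 2, 7]))  # long right tail
print(skewness([1, 6, 6, 7]))  # long left tail
```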

37 Positively skewed and leptokurtic: word count (N = 1903): Mean = 393.2, Median = 223, Std. Dev = 725.24, Skewness = 13.62, Kurtosis = 321.84

38 Distribution of word count (N = 1897), top six removed: Mean = 368.0, Median = 223, Std. Dev = 474.04, Skewness = 3.49, Kurtosis = 16.40

39 Degrees of Freedom. The number of independent pieces of information remaining after estimating one or more parameters. Example: List = 1, 2, 3, 4; Average = 2.5. For the average to remain the same, three of the numbers can be anything you want, but the fourth is fixed. New List = 1, 5, 2.5, __; Average = 2.5.
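The constraint in this example can be made explicit: once the average and all but one of the numbers are chosen, the last number is determined. A small sketch, where `forced_last_value` is a hypothetical helper name:

```python
def forced_last_value(free_values, target_average):
    """Value the remaining number must take so the full list keeps target_average."""
    n = len(free_values) + 1
    return n * target_average - sum(free_values)

# Three numbers chosen freely; the fourth is fixed by the average of 2.5.
print(forced_last_value([1, 5, 2.5], 2.5))
```

With four numbers and a fixed average, only three are free to vary, which is why the list has N − 1 = 3 degrees of freedom.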
