Anscombe’s Quartet.

Anscombe’s Quartet

Why Statistics Can Be Wrong
Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.[1]

The Statistical Properties
Property Value Mean of x in each case 9 (exact) Sample variance of x in each case 11 (exact) Mean of y in each case 7.50 (to 2 decimal places) Sample variance of y in each case 4.122 or 4.127 (to 3 decimal places) Correlation between x and y in each case 0.816 (to 3 decimal places) Linear regression line in each case y = 3.00 + 0.500x (to 2 and 3 decimal places, respectively)

All four sets are identical when examined using simple summary statistics, but vary considerably when graphed

The Datasets I II III IV x y 10.0 8.04 9.14 7.46 8.0 6.58 6.95 8.14 6.77 5.76 13.0 7.58 8.74 12.74 7.71 9.0 8.81 8.77 7.11 8.84 11.0 8.33 9.26 7.81 8.47 14.0 9.96 8.10 7.04 6.0 7.24 6.13 6.08 5.25 4.0 4.26 3.10 5.39 19.0 12.50 12.0 10.84 9.13 8.15 5.56 7.0 4.82 7.26 6.42 7.91 5.0 5.68 4.74 5.73 6.89 A procedure to generate similar data sets with identical statistics and dissimilar graphics has since been developed.

The Representation The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two variables correlated and following the assumption of normality. The second graph (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation coefficient is not relevant (a more general regression and the corresponding coefficient of determination would be more appropriate). In the third graph (bottom left), the distribution is linear, but with a different regression line, which is offset by the one outlier which exerts enough influence to alter the regression line and lower the correlation coefficient from 1 to (a robust regression would have been called for). Finally, the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear. The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.[2][3][4][5][6]

References Anscombe, F. J. (1973). "Graphs in Statistical Analysis". American Statistician 27 (1): 17– 21. JSTOR Elert, Glenn. "Linear Regression". The Physics Hypertextbook. Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media, Inc. pp. 65–66. ISBN Chatterjee, Samprit; Hadi, Ali S. (2006). Regression analysis by example. John Wiley and Sons. p. 91. ISBN Saville, David J.; Wood, Graham R. (1991). Statistical methods: the geometric approach. Springer. p. 418. ISBN Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A Follow up to the Anscombe Dataset". American Statistician 61 (3): 248–254.doi: / X

Anscombe’s Quartet.

Similar presentations

Presentation on theme: "Anscombe’s Quartet."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Anscombe’s Quartet.

Similar presentations

Presentation on theme: "Anscombe’s Quartet."— Presentation transcript:

Similar presentations

About project

Feedback