
1 January 5, 2009 - afternoon session 1 Statistics Micro Mini Statistics Review January 5-9, 2009 Beth Ayers

2 January 5, 2009 - afternoon session 2 Monday 1pm-4pm Session Datasets with two variables ‒ Role-type classification ‒ Numerical summaries ‒ Graphical displays ‒ Significance testing Association versus causation Simpson’s paradox

3 January 5, 2009 - afternoon session 3 Datasets with 2 variables Decide the role of each variable – which is the explanatory and which is the response variable Classify the type of both the explanatory and response variable as either categorical or quantitative Use the classifications to decide numerical summaries and graphical displays

4 January 5, 2009 - afternoon session 4

5 5 Explanatory = Categorical Response = Categorical Graphical Summary ‒ Two-way table Numerical Summary ‒ Conditional percents Test of significance ‒ χ² test of independence

6 January 5, 2009 - afternoon session 6 Assumptions of Testing Events are independent Events have the same distribution The outcomes of each event are mutually exclusive Sufficiently large number of expected observations in each cell ‒ Rule of thumb is 5 per cell ‒ If fewer than 5 per cell, need a continuity correction

7 January 5, 2009 - afternoon session 7 Consequences of Wrong Assumptions The χ² distribution that is used will be wrong Depending on which assumption is violated, the test statistic will be over- or under-estimated P-value and conclusions will be wrong

8 January 5, 2009 - afternoon session 8 Hypothesis Testing The null hypothesis is always that the variables are independent (or that there is no relationship between them) The alternative hypothesis is always that the variables are dependent (or that there is a relationship between them)

9 January 5, 2009 - afternoon session 9 Test Statistic Based on the independence assumption, we can calculate the expected number of observations in each cell of the table Expected count = (row total × column total) / table total Test statistic: χ² k = Σ (observed − expected)² / expected, summed over all cells of the table The subscript k is the degrees of freedom (df or dof) of the test. This is calculated as: (# rows - 1)x(# columns - 1)

10 January 5, 2009 - afternoon session 10 Calculating the p-value Similar to the Normal tables, there are tables for the χ² distribution Computer programs will automatically calculate the test statistic, the degrees of freedom, and the p-value. All three values should be included in any written summary.

11 January 5, 2009 - afternoon session 11 Example 1 Curious if gender and computer interface (A or B) preference are related ‒ Explanatory variable = gender ‒ Response variable = interface preferred Graphical display

12 January 5, 2009 - afternoon session 12 Example 1 (cont) Numerical summary We can note that 60% of males preferred interface A and 68% of females preferred interface B

13 January 5, 2009 - afternoon session 13 Example 1 (cont) H 0 : gender does not affect preference, the variables are independent H 1 : gender does affect preference, the variables are dependent

14 January 5, 2009 - afternoon session 14 Example 1 (cont) Test statistic ‒ χ² = 2.130 + 1.815 + 2.130 + 1.815 = 7.89 Degrees of freedom ‒ (2-1)x(2-1) = 1 P-value = 0.005
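
The counts behind Example 1 are not reproduced in this transcript, but one 2x2 table consistent with the quoted percentages and χ² contributions is 50 males (30 preferring A, 20 preferring B) and 50 females (16 A, 34 B). A minimal Python sketch (scipy assumed) that reproduces the test from such a table:

# Hypothetical counts reconstructed to match the slide's percentages;
# the original data table is not shown in the transcript.
import scipy.stats as stats

table = [[30, 20],   # males:   30 prefer A, 20 prefer B
         [16, 34]]   # females: 16 prefer A, 34 prefer B

# correction=False turns off the Yates continuity correction so the result
# matches the uncorrected chi-square statistic shown on the slide
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print(chi2, dof, p)   # roughly 7.89, 1, 0.005

Note that correction=False is exactly the kind of package-dependent continuity-correction behavior warned about on slide 17.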

15 January 5, 2009 - afternoon session 15 Example 1 (cont) Finally we draw conclusions, relating back to the hypotheses Since the p-value is less than α = 0.05 we will reject H 0 and conclude that gender and interface preference are dependent Use the numerical summary to infer that males are more likely to prefer interface A and females are more likely to prefer interface B

16 January 5, 2009 - afternoon session 16 Notes about Example Why a χ² test? ‒ Formally it has to do with the assumptions we make about the data ‒ Also known as Pearson’s Chi-Square Test (of Association) ‒ For more information, you can look this up in most statistics books or simply Google it

17 January 5, 2009 - afternoon session 17 Notes of Caution Different statistical packages do different things! ‒ Some automatically do a continuity correction and you must tell it not to ‒ Different packages will do slightly different continuity corrections ‒ A few packages will run internal simulations to calculate p-values Moral of the story – take a look at your package’s help file to make sure you know exactly which test you are running!

18 January 5, 2009 - afternoon session 18 Example 2 Want to know if birth order is related to juvenile delinquency ‒ Explanatory variable = Birth order ‒ Response variable = Delinquency Graphical display

19 January 5, 2009 - afternoon session 19 Example 2 (cont) Numerical summary From this it appears that youngest-born have a higher rate of delinquency

20 January 5, 2009 - afternoon session 20 Example 2 (cont) H 0 : birth order does not affect juvenile delinquency, the variables are independent H 1 : birth order does affect juvenile delinquency, the variables are dependent

21 January 5, 2009 - afternoon session 21 Example 2 (cont) Test statistic ‒ χ² = 0.288 + 0.003 + 0.472 + 3.216 + 0.037 + 5.280 = 9.296 Degrees of freedom ‒ (3-1)x(2-1) = 2 P-value = 0.010

22 January 5, 2009 - afternoon session 22 Example 2 (cont) Finally we draw conclusions, relating back to the hypotheses Since the p-value is less than α = 0.05 we will reject H 0 and conclude that birth order and delinquency are dependent Use the numerical summary to infer that youngest-born are more delinquent

23 January 5, 2009 - afternoon session 23 Explanatory = Categorical Response = Quantitative Graphical Summary ‒ Side-by-side box-plots ‒ Y-axis is the response variable, with one box-plot for each level of the explanatory variable Numerical Summary ‒ Descriptive Statistics ‒ Mean and 5 number summary, variance/standard deviation Test of significance ‒ T-test or One-way ANOVA
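
A minimal Python sketch of these summaries, using invented response values for two levels of a hypothetical explanatory variable (not course data):

import numpy as np
import matplotlib.pyplot as plt

# Made-up response measurements for two levels of a categorical explanatory variable
group_a = np.array([12.1, 15.3, 14.2, 16.8, 13.5, 15.0, 14.7])
group_b = np.array([17.4, 18.9, 16.2, 19.5, 18.1, 17.7, 20.3])

for name, g in [("A", group_a), ("B", group_b)]:
    five_num = np.percentile(g, [0, 25, 50, 75, 100])   # min, Q1, median, Q3, max
    print(name, "mean =", g.mean(), "sd =", g.std(ddof=1), "five-number summary =", five_num)

# Side-by-side box-plots: response on the y-axis, one box per level
plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Level A", "Level B"])
plt.ylabel("Response")
plt.show()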

24 January 5, 2009 - afternoon session 24 Paired vs. Two-sample T-test Paired T-test (matched pairs) ‒ Each member of a sample has a relationship with a member of the other sample ‒ Examples –Same person under two treatments –Siblings Two-sample or independent T-test ‒ Individuals are randomly assigned to a group ‒ Examples ‒ 100 students are randomly assigned to two groups

25 January 5, 2009 - afternoon session 25 One-way ANOVA ANalysis Of VAriance ‒ Partitions the observed variance based on explanatory variables ‒ Compare partitions to test significance of explanatory variables One-way ANOVA is used when ‒ Only testing the effect of one explanatory variable ‒ Each subject has only one treatment or condition Gives the same results as two-sample T-test if explanatory variable has 2 levels

26 January 5, 2009 - afternoon session 26 Assumptions Independence between observations ‒ If not, then the T-test/One-way ANOVA is the incorrect test Normality of observations within each level of the explanatory variable ‒ If not, the T-test/One-way ANOVA is not the most powerful test and could lead to incorrect conclusions

27 January 5, 2009 - afternoon session 27 Assumptions Equal population variances ‒ The amount of inequality determines the consequences: if it is small a correction can be used, but if the inequality is large it will lead to incorrect conclusions Rule of thumb ‒ The larger standard deviation should be less than twice the smaller standard deviation

28 January 5, 2009 - afternoon session 28 Inference for non-normal samples Another non-normal distribution may fit the data; inference procedures for this distribution may exist If the data are skewed, perform transformations so that the distribution is as close to normal as possible Use distribution-free or nonparametric inference procedures These methods are beyond the scope of this course; a simple Google search will yield online help pages
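
These alternatives are beyond the scope of the course, but as a hedged pointer, a log transform and scipy's Wilcoxon signed-rank test (one distribution-free alternative to the paired T-test) look like this on invented skewed data:

import numpy as np
from scipy import stats

before = np.array([2.3, 5.1, 1.8, 9.4, 3.3, 6.7, 2.9, 4.4])   # made-up, right-skewed scores
after  = np.array([3.0, 6.2, 2.1, 11.8, 3.9, 8.1, 3.5, 5.0])

# Option 1: transform toward normality, then use the usual paired T-test
t_stat, p_t = stats.ttest_rel(np.log(before), np.log(after))

# Option 2: a distribution-free (nonparametric) alternative to the paired T-test
w_stat, p_w = stats.wilcoxon(before, after)
print(p_t, p_w)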

29 January 5, 2009 - afternoon session 29 Paired T-test Two measurements or observations on each individual. Want to examine the change from the first to second observation. Typically the observations are “before” and “after” some event For each individual subtract the “before” score from the “after” score Analyze the difference

30 January 5, 2009 - afternoon session 30 Hypothesis Testing Paired T-test The null hypothesis is that there is no difference in the mean (of the response variable) between the two groups ‒ H 0 : μ d = 0 The alternative hypothesis is that there is a difference in the mean (of the response variable) between the two groups ‒ H 1 : μ d > 0 ‒ H 1 : μ d < 0 ‒ H 1 : μ d ≠ 0

31 January 5, 2009 - afternoon session 31 Paired T-test: test statistic Calculate the test statistic: t = x̄ d / (s / √n) ‒ x̄ d is the mean of the differences ‒ s is the standard deviation of the differences ‒ n is the sample size Degrees of freedom ‒ k = n-1 Compare the test statistic to a t-distribution with k degrees of freedom to obtain the p-value
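
A minimal sketch of this calculation in Python, on invented before/after scores, checked against scipy's built-in paired T-test:

import numpy as np
from scipy import stats

before = np.array([24.0, 31.5, 28.2, 22.7, 30.1, 26.4])   # made-up "before" scores
after  = np.array([26.3, 33.0, 27.9, 25.1, 32.4, 28.0])   # made-up "after" scores

d = after - before                                  # analyze the differences
t = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))    # t = mean of differences / (s / sqrt(n))
df = len(d) - 1                                     # k = n - 1 degrees of freedom
p = 2 * stats.t.sf(abs(t), df)                      # two-sided p-value from the t-distribution

# The same test via scipy
t_scipy, p_scipy = stats.ttest_rel(after, before)
print(t, df, p, "vs", t_scipy, p_scipy)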

32 January 5, 2009 - afternoon session 32 Interpreting the p-value If the p-value is less than α, reject the null hypothesis and conclude that there is a difference in the “before” and “after” scores Computer programs will automatically calculate the test statistic, the degrees of freedom, and the p-value. All three values should be included in any written summary.

33 January 5, 2009 - afternoon session 33 Paired T-test guidelines n < 15: use the T-test if the data are close to normal. If the data are clearly not normal do not use the T-test n at least 15: the T-test can be used except in cases of outliers or strong skewness Large samples: the T-test can be used for clearly skewed data when the sample size is large, roughly n ≥ 40

34 January 5, 2009 - afternoon session 34 Two-sample T-test/ One-way ANOVA For now assume that the explanatory variable only has two levels The goal of inference is to compare the responses in the two groups Each group is considered to be a sample from a distinct population The responses in each group are independent of those in the other group

35 January 5, 2009 - afternoon session 35 Hypothesis Testing The null hypothesis is that there is no difference in the mean (of the response variable) between the two groups ‒ H 0 : μ 1 = μ 2 ‒ H 0 : μ 1 - μ 2 = 0 The alternative hypothesis is that there is a difference in the mean (of the response variable) between the two groups ‒ H 1 : μ 1 - μ 2 > 0 ‒ H 1 : μ 1 - μ 2 < 0 ‒ H 1 : μ 1 - μ 2 ≠ 0

36 January 5, 2009 - afternoon session 36 Two-sample T-test: test statistic If the variance of each group is known: z = (x̄ 1 − x̄ 2 ) / √(σ 1 ²/n 1 + σ 2 ²/n 2 ) ‒ x̄ i is the mean of group i ‒ σ i ² is the variance of group i ‒ n i is the sample size of group i z has a N(0,1) distribution Compare z to a N(0,1) to find the p-value

37 January 5, 2009 - afternoon session 37 Two-sample T-test: test statistic If the variance of each group is unknown: t = (x̄ 1 − x̄ 2 ) / √(s 1 ²/n 1 + s 2 ²/n 2 ) ‒ x̄ i is the mean of group i ‒ s i is the standard deviation of group i ‒ n i is the sample size of group i t has an approximate t(k) distribution k is the smaller of (n 1 -1) and (n 2 -1) or a number calculated from the data (often not a whole number)

38 January 5, 2009 - afternoon session 38 Two-sample T-test: test statistic If the variance of each group is unknown but we can assume equal variances: t = (x̄ 1 − x̄ 2 ) / (s p √(1/n 1 + 1/n 2 )) ‒ Where s p ² = [(n 1 -1)s 1 ² + (n 2 -1)s 2 ²] / (n 1 +n 2 -2) is the pooled variance t has a t(n 1 +n 2 -2) distribution
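
A sketch of both unknown-variance versions in scipy on invented data: equal_var=True corresponds to the pooled statistic on this slide, equal_var=False to the approximate (unequal-variance) statistic on slide 37:

import numpy as np
from scipy import stats

group1 = np.array([12.4, 15.1, 13.8, 16.0, 14.2, 15.5, 13.1, 14.9])  # made-up sample 1
group2 = np.array([16.8, 18.2, 17.1, 19.4, 16.5, 18.9, 17.6, 18.0])  # made-up sample 2

# Pooled two-sample T-test (assumes equal population variances), df = n1 + n2 - 2
t_pooled, p_pooled = stats.ttest_ind(group1, group2, equal_var=True)

# Welch's T-test (variances not assumed equal); scipy computes the adjusted df internally
t_welch, p_welch = stats.ttest_ind(group1, group2, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)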

39 January 5, 2009 - afternoon session 39 Two-sample T-test guidelines Similar to those of the paired T-test on slide 33 Equal sample sizes are recommended

40 January 5, 2009 - afternoon session 40 Two-sample T-test cautions The choice to use the equation on slide 37 or 38 depends on your data. You should look at box-plots and standard deviations and perform tests to determine if the equal variances assumption is correct. Different software packages assume different things. Be sure to read help pages so you know what test you are actually performing.

41 January 5, 2009 - afternoon session 41 ANOVA: test statistic ANOVA partitions the variance based on the explanatory variable and the residual error. These partitions are then compared to determine if there are differences among the levels of the explanatory variable. This breakdown is arranged in an ANOVA table; we will discuss this more later The degrees of freedom depend on the number of levels of the explanatory variable and the sample size For now, just know that the test statistic is denoted by F and has an F-distribution

42 January 5, 2009 - afternoon session 42 Example 3 Curious if keyboard type has an effect on words per minute typed, suppose 25 students per keyboard ‒ Explanatory variable = keyboard type ‒ Response variable = words per minute Graphical display

43 January 5, 2009 - afternoon session 43 Example 3 (cont) Numerical Summary ‒ Keyboard 1 ‒ Mean = 27.1 ‒ Standard deviation = 16.1 ‒ Keyboard 2 ‒ Mean = 36.8 ‒ Standard deviation = 17.0 We can note that 17.0 < 2*16.1 = 32.2, so the equal-variance rule of thumb is satisfied Keyboard 2 has a higher mean number of words per minute typed

44 January 5, 2009 - afternoon session 44 Example 3 – paired T-test Assume that students use both keyboards, use a paired T-test to see if there is a difference Hypotheses ‒ H 0 : μ d = 0 ‒ H 1 : μ d ≠ 0 Results ‒ Test statistic: t = -1.81 ‒ Degrees of freedom = 24 ‒ P-value = 0.08 At α = 0.05, we would not reject the null hypothesis. There is not enough evidence to suggest a difference.

45 January 5, 2009 - afternoon session 45 Two-sample vs. ANOVA Assume that two different groups of students use the keyboards Two-sample T-test ‒ Test statistic: t = -2.0825 ‒ Degrees of freedom = 47.81 ‒ P-value = 0.04267 One-way ANOVA ‒ Test statistic: F 1,49 = 4.337 ‒ P-value = 0.04265 At α = 0.05 we would reject the null hypothesis and conclude there is a difference in words per minute typed between the two keyboards Things to note ‒ F = t²; (-2.0825)² = 4.337 ‒ P-values are the same! ‒ Simply two different ways to test the hypothesis
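
The F = t² relationship can be seen directly in a sketch like the following; the words-per-minute values are simulated stand-ins (random numbers with roughly the slide's means and standard deviations), so the exact statistics will differ from the slide:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
keyboard1 = rng.normal(27.1, 16.1, 25)   # simulated stand-ins for the course data
keyboard2 = rng.normal(36.8, 17.0, 25)

t_stat, p_t = stats.ttest_ind(keyboard1, keyboard2, equal_var=True)  # two-sample T-test
f_stat, p_f = stats.f_oneway(keyboard1, keyboard2)                   # one-way ANOVA

print(t_stat**2, f_stat)   # F equals t squared
print(p_t, p_f)            # and the p-values agree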

46 January 5, 2009 - afternoon session 46 Notes about example 3 Recall ‒ Paired T-test ‒ Test statistic: t = -1.81 ‒ df = 24 ‒ P-value = 0.08 ‒ Two-sample T-test ‒ Test statistic: t = -2.0825 ‒ df = 47.81 ‒ P-value = 0.04267 In the two-sample T-test we reject the null hypothesis but we do not in the paired T-test Paired and two-sample T-tests often lead to different conclusions; you must use the right test to avoid incorrect conclusions

47 January 5, 2009 - afternoon session 47 Notes about example 3 Use of two-sample T-test versus ANOVA is often a matter of preference since they yield the same results Calculation of test statistics ‒ Either the t-statistic or the F-statistic; for details you can check most statistics books ‒ There will be degrees of freedom associated with each!

48 January 5, 2009 - afternoon session 48 One-way ANOVA Let’s look at an example with one explanatory variable that has k levels Null hypothesis: all means are equal ‒ H 0 : μ 1 = μ 2 = … = μ k Alternative hypothesis: not all means are equal ‒ This could be because all means are different or because one mean is different from the rest. Will need to do more analysis to determine this.

49 January 5, 2009 - afternoon session 49 One-way ANOVA The test statistic is still denoted by F and has an F-distribution The degrees of freedom depends on the number of levels of the explanatory variable and the sample size

50 January 5, 2009 - afternoon session 50 Example 4 Curious if keyboard type has an effect on words per minute typed, but now with 4 different keyboards ‒ Explanatory variable = keyboard type ‒ Response variable = words per minute Graphical display

51 January 5, 2009 - afternoon session 51 Example 4 (cont) Numerical Summary ‒ Keyboard 1 ‒ Mean = 27.08 ‒ Standard deviation = 16.03 ‒ Keyboard 2 ‒ Mean = 36.84 ‒ Standard deviation = 17.08 ‒ Keyboard 3 ‒ Mean = 27.20 ‒ Standard deviation = 15.90 ‒ Keyboard 4 ‒ Mean = 30.06 ‒ Standard deviation = 15.03 We can note that 17.08 < 2*15.03 = 30.06

52 January 5, 2009 - afternoon session 52 Example 4 (cont) Hypotheses ‒ H 0 : μ 1 = μ 2 = μ 3 = μ 4 ‒ H 1 : not all the means are equal Results ‒ Test statistic: F 3,99 = 2.04 ‒ P-value = 0.114 Conclusions ‒ Do not reject the null hypothesis. There is not enough evidence to suggest the mean words per minute varies among the groups

53 January 5, 2009 - afternoon session 53 Example 4 (cont) Suppose that the mean of keyboard 2 was slightly higher Numerical Summary ‒ Keyboard 1 ‒ Mean = 27.08 ‒ Standard deviation = 16.03 ‒ Keyboard 2 ‒ Mean = 41.84 ‒ Standard deviation = 17.08 ‒ Keyboard 3 ‒ Mean = 27.20 ‒ Standard deviation = 15.90 ‒ Keyboard 4 ‒ Mean = 30.06 ‒ Standard deviation = 15.03

54 January 5, 2009 - afternoon session 54 Example 4 (cont) Hypotheses ‒ H 0 : μ 1 = μ 2 = μ 3 = μ 4 ‒ H 1 : not all the means are equal Results ‒ Test statistic: F 3,99 = 4.768 ‒ P-value = 0.004 Conclusions ‒ Reject the null hypothesis. At least one of the keyboards has a mean words per minute typed that differs from the rest. ‒ Though we can infer that it is keyboard 2 from the box-plots and summary, we must perform additional testing to formally determine which keyboard differs.
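
One common choice for that additional testing (a hedged suggestion, not necessarily the procedure used in the course) is Tukey's HSD for all pairwise comparisons, available in statsmodels; the data here are simulated stand-ins for the four keyboards:

import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
# Simulated words-per-minute for 4 keyboards, 25 students each (stand-in data)
scores = np.concatenate([rng.normal(m, 16, 25) for m in (27, 42, 27, 30)])
labels = np.repeat(["kb1", "kb2", "kb3", "kb4"], 25)

# All pairwise comparisons with a family-wise error rate of 0.05;
# the significant pairs indicate which keyboard means differ
print(pairwise_tukeyhsd(scores, labels, alpha=0.05))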

55 January 5, 2009 - afternoon session 55 Explanatory = Quantitative Response = Quantitative Graphical Summary ‒ Scatter plot Numerical Summary ‒ Correlation ‒ R 2 ‒ Regression equation ‒ Response = β 0 + β 1 * explanatory Test of significance ‒ Test significance of regression equation coefficients

56 January 5, 2009 - afternoon session 56 Scatter plot Shows relationship between two quantitative variables ‒ y-axis = response variable ‒ x-axis = explanatory variable

57 January 5, 2009 - afternoon session 57 Correlation Values between -1 and +1 ‒ A negative correlation indicates a negative or inverse relationship ‒ A correlation of 0 indicates no relationship ‒ A positive correlation indicates a positive relationship “Large” correlations vary by field

58 January 5, 2009 - afternoon session 58 R 2 R 2 is the square of the correlation Also referred to as the coefficient of determination Represents the proportion of the variability in the data accounted for by the linear regression model Values between 0 and 1 ‒ An R 2 value of 1 indicates that the regression equation perfectly fits the data “Large” values of R 2 vary by field

59 January 5, 2009 - afternoon session 59 Linear Regression Equation ‒ Response = β 0 + β 1 * explanatory ‒ β 0 is the intercept ‒ the value of the response variable when the explanatory variable is 0 ‒ β 1 is the slope ‒ For each 1 unit increase in the explanatory variable, the response variable increases by β 1 β 0 and β 1 are most often found using least squares estimation

60 January 5, 2009 - afternoon session 60 Assumptions of linear regression Linearity ‒ Check by looking at either observed vs. predicted or residual vs. predicted plot ‒ If non-linear, predictions will be wrong Independence of errors ‒ Can often be checked by knowing how data was collected. If not sure, autocorrelation plots can be used. Homoscedasticity (constant variance) ‒ Look at residuals versus predicted plot ‒ If non-constant variance, predictions will have wrong confidence intervals and estimated coefficients may be wrong Normality of errors ‒ Look at normal probability plot ‒ If non-normal, confidence intervals and estimated coefficients will be wrong

61 January 5, 2009 - afternoon session 61 Predicted Values and Residuals For any value of the explanatory variable X, the predicted value of the response variable is found using the estimates of β 0 and β 1 Residuals are the difference between the observed value of the response variable and the predicted value Example ‒ The estimated values of β 0 = 5 and β 1 = -2 ‒ We’d like to know the predicted value for X=1 ‒ Predicted value = 5-2*1 = 3 ‒ If the observed value is 3.25, then the residual is 3.25 – 3 = 0.25

62 January 5, 2009 - afternoon session 62 Assumptions of linear regression If the assumptions are not met, the estimates of β 0, β 1, their standard deviations, and estimates of R 2 will be incorrect It may be possible to do transformations to either the explanatory or response variable to make the relationship linear

63 January 5, 2009 - afternoon session 63 Hypothesis testing Want to test if there is a significant linear relationship between the variables ‒ H 0 : there is no linear relationship between the variables ( β 1 = 0) ‒ H 1 : there is a linear relationship between the variables ( β 1 ≠ 0) Testing β 0 = 0 may or may not be interesting and/or valid

64 January 5, 2009 - afternoon session 64 Example 5 Curious if typing speed (words per minute) affects efficiency (as measured by number of minutes required to finish a paper) Graphical display

65 January 5, 2009 - afternoon session 65 Example 5 (cont) Numerical summary ‒ Correlation = -0.946 ‒ R 2 = 0.8944 ‒ Efficiency = 85.99 – 0.52*speed For each additional word per minute typed, the number of minutes needed to complete an assignment decreases by 0.52 minutes The intercept does not make sense since it corresponds to a speed of zero words per minute

66 January 5, 2009 - afternoon session 66 Interpretation of r and R 2 r = -0.946 ‒ This indicates a strong negative linear relationship R 2 = 0.8944 ‒ 89.44% of the variability in efficiency can be explained by words per minute typed

67 January 5, 2009 - afternoon session 67 Example 5 (cont) To test the significance of β 1 ‒ H 0 : there is no linear relationship between the speed and efficiency ( β 1 = 0) ‒ H 1 : there is a linear relationship between the speed and efficiency ( β 1 ≠ 0) Test statistic: t = -20.16 P-value = 0.000 In this case, testing β 0 = 0 is not interesting; however, it may be in some experiments
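
A hedged sketch of the same style of analysis with scipy's linregress, on simulated speed/efficiency data (so the numbers only roughly resemble the slide's):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
speed = rng.uniform(20, 90, 50)                        # simulated words per minute
efficiency = 86 - 0.52 * speed + rng.normal(0, 3, 50)  # simulated minutes to finish a paper

res = stats.linregress(speed, efficiency)
print("intercept =", res.intercept, "slope =", res.slope)
print("r =", res.rvalue, "R^2 =", res.rvalue**2)
# linregress reports the p-value for testing H 0 : slope (beta 1) = 0
print("p-value for the slope =", res.pvalue)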

68 January 5, 2009 - afternoon session 68 Example 5 (cont) The output of a linear regression varies from package to package. All will list the value of the estimates, their standard errors, the t-statistic for testing if it is equal to 0, and the p-value

69 January 5, 2009 - afternoon session 69 Example 5 (cont) Checking assumptions ‒ Plot on left: residual vs. predicted ‒ Want to see no pattern ‒ Plot on right: normal probability plot ‒ Want to see points fall on line
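
The two plots described here (not reproduced in this transcript) can be drawn as in the following sketch, using the same kind of simulated data:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(20, 90, 50)                  # simulated explanatory values
y = 86 - 0.52 * x + rng.normal(0, 3, 50)     # simulated response values

res = stats.linregress(x, y)
predicted = res.intercept + res.slope * x
residuals = y - predicted

fig, (left, right) = plt.subplots(1, 2, figsize=(9, 4))
left.scatter(predicted, residuals)                   # want to see no pattern
left.axhline(0)
left.set_xlabel("Predicted")
left.set_ylabel("Residual")
stats.probplot(residuals, dist="norm", plot=right)   # want points to fall on the line
plt.show()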

70 January 5, 2009 - afternoon session 70 Example 6 Suppose we have an explanatory and response variable and would like to know if there is a significant linear relationship Graphical display

71 January 5, 2009 - afternoon session 71 Example 6 (cont) Numerical summary ‒ Correlation = 0.971 ‒ R 2 = 0.942 ‒ Response = -21.19 + 19.63*explanatory For each additional unit of the explanatory variable, the response variable increases by 19.63 units When the explanatory variable has a value of 0, the response variable has a value of -21.19

72 January 5, 2009 - afternoon session 72 Example 6 (cont) To test the significance of β 1 ‒ H 0 : there is no linear relationship between the explanatory and response ( β 1 = 0) ‒ H 1 : there is a linear relationship between the explanatory and response ( β 1 ≠ 0) Test statistic: t = 49.145 P-value = 0.000 It appears as though there is a significant linear relationship between the variables

73 January 5, 2009 - afternoon session 73 Example 6 (cont) Sample output for this example; we can see both coefficients are highly significant

74 January 5, 2009 - afternoon session 74 Example 6 (cont) Checking assumptions ‒ Plot on left: residual vs. predicted ‒ Want to see no pattern ‒ Plot on right: normal probability plot ‒ Want to see points fall on line

75 January 5, 2009 - afternoon session 75 Example 6 (cont) Checking assumptions ‒ In the residual vs. predicted plot we see that the residual values are higher for lower and higher predicted values and lower for values in the middle ‒ In the normal probability plot we see that the points fall off the line at the two ends This indicates that one of the assumptions was not met! In this case there is a quadratic relationship between the variables ‒ With experience you’ll be able to determine what relationships are present given the residual versus predicted plot

76 January 5, 2009 - afternoon session 76 Data with Linear Prediction Line When we add the predicted linear relationship, we can clearly see misfit

77 January 5, 2009 - afternoon session 77 Explanatory = Quantitative Response = Categorical Use logistic regression ‒ Predict which group individuals belong in The most common logistic regression has only two response categories, though it is possible to have more Ordinal logistic regression ‒ Outcomes are ordered ‒ Freshman, Sophomore, Junior, Senior Nominal logistic regression ‒ Outcomes are simply different groups ‒ Republican, Democrat, etc. Many books are available for more information
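
As a hedged sketch (statsmodels is just one of several packages that fit these models), a two-category logistic regression predicting group membership from one quantitative explanatory variable, on made-up data:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 200)                        # quantitative explanatory variable (made up)
prob = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))        # assumed underlying group-1 probabilities
y = rng.binomial(1, prob)                        # binary (two-category) response

model = sm.Logit(y, sm.add_constant(x)).fit()    # fit intercept + slope on the logit scale
print(model.summary())                           # coefficients, standard errors, p-values

# Predict which group new individuals most likely belong in
new_x = sm.add_constant(np.array([-1.0, 0.0, 1.0]))
print(model.predict(new_x) > 0.5)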

78 January 5, 2009 - afternoon session 78 Association versus Causation Association between two variables does not imply that one causes changes in the other To state that one variable causes a change in another variable, we need the data to come from a controlled experiment Correlation does not imply causation!

79 January 5, 2009 - afternoon session 79 Association versus Causation Examples 1 and 2 ‒ Since we cannot control gender or birth order, we cannot make causal conclusions Examples 3 and 4 ‒ It depends on how the subjects were assigned to keyboards and how the study was run ‒ If subjects were randomly assigned to keyboards then we can make causal claims Example 5 ‒ Since we can’t control a person’s typing speed, we cannot make causal conclusions here

80 January 5, 2009 - afternoon session 80 Anscombe dataset 5 datasets with 2 variables ‒ All have the same correlation and regression equation However, when we look at the scatter plots we see significant differences in the relationship between the variables across the datasets

81 January 5, 2009 - afternoon session 81

82 January 5, 2009 - afternoon session 82 Moral of the story Correlation only measures linear association and fitting a straight line only makes sense when the overall pattern is linear Make sure that you do graphical summaries!

83 January 5, 2009 - afternoon session 83 Lurking variables A lurking variable is a variable not among the explanatory or response variables that may influence the interpretation of the relationships among those variables Example – suppose we observe an association between number of computers per household and life expectancy ‒ Lurking variable could be wealth

84 January 5, 2009 - afternoon session 84 Simpson’s paradox Simpson’s paradox occurs when the apparent relationship between two variables changes when groups within the dataset are combined Simpson's Paradox is caused by a combination of a lurking variable and data from unequal sized groups being combined into a single data set.
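
A small made-up illustration: hospital B has the higher survival rate within both severity groups, but because the group sizes are very unequal, the combined data make hospital A look better:

# Made-up counts illustrating Simpson's paradox (not data from the course)
groups = {
    "mild":   {"A": (810, 900), "B": (95, 100)},    # (survivors, patients)
    "severe": {"A": (30, 100),  "B": (315, 900)},
}

totals = {"A": [0, 0], "B": [0, 0]}
for severity, hospitals in groups.items():
    for h, (alive, n) in hospitals.items():
        print(severity, h, "survival rate =", alive / n)   # B is higher in BOTH groups
        totals[h][0] += alive
        totals[h][1] += n

for h, (alive, n) in totals.items():
    print("combined", h, "survival rate =", alive / n)     # yet A is higher overall

Splitting by the lurking variable (severity) reverses the conclusion suggested by the combined table.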

85 January 5, 2009 - afternoon session 85 Example 1

86 January 5, 2009 - afternoon session 86 Lurking Variable Example

87 January 5, 2009 - afternoon session 87 Moral of the story If you know that a dataset has groups, you should calculate the numerical summaries for each group as well as the whole Breaking the dataset into smaller groups may lead to drastically different conclusions Do not combine data sets of different sizes from different sources In experiments, try to identify possible lurking variables and control them, eliminate them, or hold them constant across all groups

88 January 5, 2009 - afternoon session 88 Conclusions

