Review of ANOVA and linear regression. Review of simple ANOVA.

Presentation on theme: "Review of ANOVA and linear regression. Review of simple ANOVA."— Presentation transcript:

1 Review of ANOVA and linear regression

2 Review of simple ANOVA

3 ANOVA for comparing means between more than 2 groups

4 Hypotheses of One-Way ANOVA
H0: $\mu_1 = \mu_2 = \dots = \mu_k$. All population means are equal, i.e., there is no treatment effect (no variation in means among groups).
H1: Not all of the $\mu_i$ are equal. At least one population mean is different, i.e., there is a treatment effect. This does not mean that all population means are different (some pairs may be the same).

5 The F-distribution A ratio of variances follows an F-distribution. The F-test tests the hypothesis that two variances are equal. F will be close to 1 if the two sample variances are nearly equal.

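6 How to calculate ANOVAs by hand…
n = 10 obs./group, k = 4 groups

| Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 |
|---|---|---|---|
| $y_{1,1}$ | $y_{2,1}$ | $y_{3,1}$ | $y_{4,1}$ |
| $y_{1,2}$ | $y_{2,2}$ | $y_{3,2}$ | $y_{4,2}$ |
| $y_{1,3}$ | $y_{2,3}$ | $y_{3,3}$ | $y_{4,3}$ |
| $y_{1,4}$ | $y_{2,4}$ | $y_{3,4}$ | $y_{4,4}$ |
| $y_{1,5}$ | $y_{2,5}$ | $y_{3,5}$ | $y_{4,5}$ |
| $y_{1,6}$ | $y_{2,6}$ | $y_{3,6}$ | $y_{4,6}$ |
| $y_{1,7}$ | $y_{2,7}$ | $y_{3,7}$ | $y_{4,7}$ |
| $y_{1,8}$ | $y_{2,8}$ | $y_{3,8}$ | $y_{4,8}$ |
| $y_{1,9}$ | $y_{2,9}$ | $y_{3,9}$ | $y_{4,9}$ |
| $y_{1,10}$ | $y_{2,10}$ | $y_{3,10}$ | $y_{4,10}$ |

The group means: $\bar{y}_{i\cdot} = \frac{1}{10}\sum_{j=1}^{10} y_{ij}$
The (within) group variances: $s_i^2 = \frac{\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2}{10-1}$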

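7 Sum of Squares Within (SSW), or Sum of Squares Error (SSE)
The (within) group variances: $s_i^2 = \frac{\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2}{10-1}$ for groups $i = 1, \dots, 4$, where $\bar{y}_{i\cdot}$ is the mean of group $i$.
Sum of Squares Within (SSW) (or SSE, for chance error) adds up the numerators of the four group variances: $SSW = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2$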

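8 Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)
Overall mean of all 40 observations (“grand mean”): $\bar{y}_{\cdot\cdot} = \frac{1}{40}\sum_{i=1}^{4}\sum_{j=1}^{10} y_{ij}$
Sum of Squares Between (SSB): $SSB = 10\sum_{i=1}^{4}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2$, where $\bar{y}_{i\cdot}$ is the mean of group $i$. Variability of the group means compared to the grand mean (the variability due to the treatment).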

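9 Total Sum of Squares (TSS)
Total sum of squares (TSS): $TSS = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{\cdot\cdot})^2$, where $\bar{y}_{\cdot\cdot}$ is the overall mean. Squared difference of every observation from the overall mean (the numerator of the variance of Y!).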

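10 Partitioning of Variance
$\sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{i\cdot})^2 + 10\sum_{i=1}^{4}(\bar{y}_{i\cdot}-\bar{y}_{\cdot\cdot})^2 = \sum_{i=1}^{4}\sum_{j=1}^{10}(y_{ij}-\bar{y}_{\cdot\cdot})^2$
SSW + SSB = TSS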

11 ANOVA Table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between (k groups) | k-1 | SSB (sum of squared deviations of group means from grand mean) | SSB/(k-1) | [SSB/(k-1)] / [SSW/(nk-k)] | Go to F(k-1, nk-k) chart |
| Within (n individuals per group) | nk-k | SSW (sum of squared deviations of observations from their group mean) | s² = SSW/(nk-k) | | |
| Total variation | nk-1 | TSS (sum of squared deviations of observations from grand mean) | | | |

TSS = SSB + SSW

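12 Example
Heights in inches:

| Treatment 1 | Treatment 2 | Treatment 3 | Treatment 4 |
|---|---|---|---|
| 60 | 50 | 48 | 47 |
| 67 | 52 | 49 | 67 |
| 42 | 43 | 50 | 54 |
| 67 | 67 | 55 | 67 |
| 56 | 67 | 56 | 68 |
| 62 | 59 | 61 | 65 |
| 64 | 67 | 61 | 65 |
| 59 | 64 | 60 | 56 |
| 72 | 63 | 59 | 60 |
| 71 | 65 | 64 | 65 |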

13 Example (same data as slide 12)
Step 1) Calculate the sum of squares between groups:
Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4
Grand mean = 59.85
$SSB = [(62-59.85)^2 + (59.7-59.85)^2 + (56.3-59.85)^2 + (61.4-59.85)^2] \times n$ per group $= 19.65 \times 10 = 196.5$

14 Example (same data as slide 12)
Step 2) Calculate the sum of squares within groups:
$(60-62)^2 + (67-62)^2 + (42-62)^2 + (67-62)^2 + (56-62)^2 + (62-62)^2 + (64-62)^2 + (59-62)^2 + (72-62)^2 + (71-62)^2 + (50-59.7)^2 + (52-59.7)^2 + (43-59.7)^2 + (67-59.7)^2 + (67-59.7)^2 + (59-59.7)^2 + \dots$ (sum of 40 squared deviations) $= 2060.6$

15 Step 3) Fill in the ANOVA table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between | 3 | 196.5 | 65.5 | 1.14 | .344 |
| Within | 36 | 2060.6 | 57.2 | | |
| Total | 39 | 2257.1 | | | |

16 Step 3) Fill in the ANOVA table (table as on slide 15)
INTERPRETATION of ANOVA: How much of the variance in height is explained by treatment group?
$R^2$ = “Coefficient of Determination” = SSB/TSS = 196.5/2257.1 = 9%
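The three steps of the height example can be checked with a short script. This is only a sketch using the Python standard library: the function name `one_way_anova` and the dictionary keys are made up for illustration, and the cell that is blank in the transcript (Treatment 2, row 4) is assumed to be 67, the value that appears in the Step 2 sum.

```python
# Heights (inches) from the worked example. The cell that is blank in the
# transcript (Treatment 2, row 4) is assumed to be 67, the value that
# appears in the Step 2 sum and reproduces the group mean of 59.7.
groups = {
    "Treatment 1": [60, 67, 42, 67, 56, 62, 64, 59, 72, 71],
    "Treatment 2": [50, 52, 43, 67, 67, 59, 67, 64, 63, 65],
    "Treatment 3": [48, 49, 50, 55, 56, 61, 61, 60, 59, 64],
    "Treatment 4": [47, 67, 54, 67, 68, 65, 65, 56, 60, 65],
}


def one_way_anova(groups):
    """Return the entries of a one-way ANOVA table as a dict."""
    all_obs = [y for ys in groups.values() for y in ys]
    grand_mean = sum(all_obs) / len(all_obs)
    means = {name: sum(ys) / len(ys) for name, ys in groups.items()}

    # Between: squared distance of each group mean from the grand mean,
    # weighted by the group size.
    ssb = sum(len(ys) * (means[name] - grand_mean) ** 2
              for name, ys in groups.items())
    # Within: squared distance of each observation from its own group mean.
    ssw = sum((y - means[name]) ** 2
              for name, ys in groups.items() for y in ys)
    # Total: squared distance of each observation from the grand mean.
    tss = sum((y - grand_mean) ** 2 for y in all_obs)

    df_between = len(groups) - 1
    df_within = len(all_obs) - len(groups)
    msb = ssb / df_between
    msw = ssw / df_within
    return {
        "SSB": ssb, "SSW": ssw, "TSS": tss,
        "df_between": df_between, "df_within": df_within,
        "MSB": msb, "MSW": msw,
        "F": msb / msw, "R2": ssb / tss,
    }


table = one_way_anova(groups)
for name, value in table.items():
    print(f"{name:>10}: {value:.2f}")
```

The p-value is left out because the standard library has no F-distribution; a statistics package (for example `scipy.stats.f.sf`) would be needed for that last column.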

17 Coefficient of Determination The proportion of the variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).

18 ANOVA example
Table 6. Mean micronutrient intake from the school lunch by school

| | | S1 (a), n=25 | S2 (b), n=25 | S3 (c), n=25 | P-value (d) |
|---|---|---|---|---|---|
| Calcium (mg) | Mean | 117.8 | 158.7 | 206.5 | 0.000 |
| | SD (e) | 62.4 | 70.5 | 86.2 | |
| Iron (mg) | Mean | 2.0 | | | 0.854 |
| | SD | 0.6 | | | |
| Folate (μg) | Mean | 26.6 | 38.7 | 42.6 | 0.000 |
| | SD | 13.1 | 14.5 | 15.1 | |
| Zinc (mg) | Mean | 1.9 | 1.5 | 1.3 | 0.055 |
| | SD | 1.0 | 1.2 | 0.4 | |

(a) School 1 (most deprived; 40% subsidized lunches).
(b) School 2 (medium deprived; <10% subsidized).
(c) School 3 (least deprived; no subsidization, private school).
(d) ANOVA; significant differences are highlighted in bold (P<0.05).
FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England - are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

19 Answer
Step 1) Calculate the sum of squares between groups:
Mean for School 1 = 117.8
Mean for School 2 = 158.7
Mean for School 3 = 206.5
Grand mean: 161
$SSB = [(117.8-161)^2 + (158.7-161)^2 + (206.5-161)^2] \times 25$ per group $= 3941.78 \times 25 = 98{,}544.5$

20 Answer
Step 2) Calculate the sum of squares within groups:
S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2
Therefore, the sum of squares within is: $(24)[62.4^2 + 70.5^2 + 86.2^2] = 391{,}066.8$

21 Answer
Step 3) Fill in your ANOVA table

| Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value |
|---|---|---|---|---|---|
| Between | 2 | 98,544.5 | 49,272.3 | 9.07 | <.05 |
| Within | 72 | 391,066.8 | 5,431.5 | | |
| Total | 74 | 489,611.3 | | | |

**$R^2$ = 98,544.5/489,611.3 = 20%
School explains 20% of the variance in lunchtime calcium intake in these kids.
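The calcium answer needs only the published means, SDs, and group size, so the table can be rebuilt from summary statistics alone. The sketch below uses plain Python; the variable names are illustrative, and it relies on the three groups having equal size (n = 25), so the grand mean is the simple average of the group means. The printed values come straight from the summary statistics, so they may differ slightly from hand-rounded figures.

```python
# Calcium (mg) summary statistics from Table 6; n = 25 children per school.
means = [117.8, 158.7, 206.5]
sds = [62.4, 70.5, 86.2]
n = 25
k = len(means)

# With equal group sizes the grand mean is the average of the group means.
grand_mean = sum(means) / k

# Between: n times the squared distance of each school mean from the grand mean.
ssb = n * sum((m - grand_mean) ** 2 for m in means)
# Within: each variance times its degrees of freedom (n - 1) recovers the
# within-school sum of squared deviations.
ssw = (n - 1) * sum(sd ** 2 for sd in sds)
tss = ssb + ssw

msb = ssb / (k - 1)
msw = ssw / (n * k - k)
f_stat = msb / msw
r2 = ssb / tss

print(f"grand mean = {grand_mean:.1f}")
print(f"SSB = {ssb:.1f}, SSW = {ssw:.1f}, TSS = {tss:.1f}")
print(f"MSB = {msb:.1f}, MSW = {msw:.1f}")
print(f"F({k - 1}, {n * k - k}) = {f_stat:.2f}, R^2 = {r2:.2f}")
```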

22 Beyond one-way ANOVA Often, you may want to test more than 1 treatment. ANOVA can accommodate more than 1 treatment or factor, so long as they are independent. Again, the variation partitions beautifully! TSS = SSB1 + SSB2 + SSW

23 Linear regression review

24 What is “Linear”? Remember this: Y = mX + B? (m is the slope; B is the intercept.)

25 What’s Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

26 Regression equation… Expected value of y at a given level of x: $E(y_i \mid x_i) = \alpha + \beta x_i$

27 Predicted value for an individual… $y_i = \alpha + \beta x_i + \text{random error}_i$. The $\alpha + \beta x_i$ part is fixed (exactly on the line); the random error follows a normal distribution.

28 Assumptions (or the fine print)
Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent**
**When we talk about repeated measures starting next week, we will violate this assumption and hence need more sophisticated regression models!

29 The standard error of Y given X ($s_{y/x}$) is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X.

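30 Regression Picture
[Figure: scatter plot with the fitted regression line and the naïve mean of y. For each observation $y_i$, A is its distance from the naïve mean of y, B is the distance from the regression line to the naïve mean of y, and C is its distance from the regression line.]
*Least squares estimation gave us the line (β) that minimized $C^2$.
$A^2 = B^2 + C^2$
$A^2$ = SS_total: total squared distance of observations from the naïve mean of y (total variation).
$B^2$ = SS_reg: distance from the regression line to the naïve mean of y (variability due to x, i.e., regression).
$C^2$ = SS_residual: variance around the regression line (additional variability not explained by x, which is what the least squares method aims to minimize).
$R^2 = SS_{reg}/SS_{total}$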
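The regression picture can be reproduced numerically. The sketch below fits a least-squares line to five made-up (x, y) points (hypothetical data, not from the slides) and checks that SS_total splits into SS_reg plus SS_residual, that R² is their ratio, and that the fitted line passes through the point of means.

```python
# Hypothetical toy data, chosen only so the arithmetic is easy to follow.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept for a single predictor.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
beta = sxy / sxx
alpha = y_bar - beta * x_bar

fitted = [alpha + beta * xi for xi in x]

# A^2: observations vs. the naive mean of y.
ss_total = sum((yi - y_bar) ** 2 for yi in y)
# B^2: regression line vs. the naive mean of y.
ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
# C^2: observations vs. the regression line (what least squares minimizes).
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

r2 = ss_reg / ss_total

print(f"y = {alpha:.2f} + {beta:.2f} * x")
print(f"SS_total = {ss_total:.2f}, SS_reg = {ss_reg:.2f}, SS_resid = {ss_resid:.2f}")
print(f"R^2 = {r2:.2f}")
# The fitted line always passes through (x_bar, y_bar).
print(f"line at x_bar: {alpha + beta * x_bar:.2f}, y_bar: {y_bar:.2f}")
```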

31 Recall example: cognitive function and vitamin D Hypothetical data loosely based on [1]; cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST). 1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

32 Distribution of vitamin D Mean= 63 nmol/L Standard deviation = 33 nmol/L

33 Distribution of DSST Normally distributed Mean = 28 points Standard deviation = 10 points

34 Four hypothetical datasets I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
0
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L

35 Dataset 1: no relationship

36 Dataset 2: weak relationship

37 Dataset 3: weak to moderate relationship

38 Dataset 4: moderate relationship

39 The “Best fit” line Regression equation: $E(Y_i) = 28 + 0 \times \text{vit D}_i$ (in 10 nmol/L)

40 The “Best fit” line Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is! Regression equation: $E(Y_i) = 26 + 0.5 \times \text{vit D}_i$ (in 10 nmol/L)

41 The “Best fit” line Regression equation: $E(Y_i) = 22 + 1.0 \times \text{vit D}_i$ (in 10 nmol/L)

42 The “Best fit” line Regression equation: $E(Y_i) = 20 + 1.5 \times \text{vit D}_i$ (in 10 nmol/L) Note: all the lines go through the point (63, 28)!

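43 Significance testing…
Slope. Distribution of slope: $\hat{\beta} \sim T_{n-2}(\beta, \text{s.e.}(\hat{\beta}))$
$H_0: \beta_1 = 0$ (no linear relationship)
$H_1: \beta_1 \neq 0$ (linear relationship does exist)
$T_{n-2} = \frac{\hat{\beta} - 0}{\text{s.e.}(\hat{\beta})}$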

44 Example: dataset 4 Standard error (beta) = 0.03; $T_{98} = 0.15/0.03 = 5$, p<.0001; 95% confidence interval = 0.09 to 0.21
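The numbers for dataset 4 follow directly from the slope and its standard error. In this sketch the slope of 0.15 is read as DSST points per 1 nmol/L (that is, 1.5 points per 10 nmol/L), n = 100 comes from the study description, and the critical value 1.984 is an approximate two-sided 95% value for 98 degrees of freedom taken from a t table. All three are assumptions layered on the slide, not values it states in this form.

```python
slope = 0.15   # estimated slope, assumed to be DSST points per 1 nmol/L
se = 0.03      # standard error of the slope
n = 100        # sample size from the study description
df = n - 2     # two parameters (intercept and slope) are estimated

# Test statistic for H0: slope = 0.
t_stat = (slope - 0) / se

# Approximate two-sided 95% critical value for a t distribution with 98 df
# (assumption: read from a t table rather than computed here).
t_crit = 1.984
ci_low = slope - t_crit * se
ci_high = slope + t_crit * se

print(f"T_{df} = {t_stat:.1f}")
print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}")
```

Because the interval excludes 0, it agrees with the small p-value reported on the slide.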

45 Multiple linear regression… What if age is a confounder here? Older men have lower vitamin D. Older men have poorer cognition. “Adjust” for age by putting age in the model: DSST score = intercept + slope_1 × vitamin D + slope_2 × age

46 2 predictors: age and vit D…

47 Different 3D view…

48 Fit a plane rather than a line… On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

49 Equation of the “Best fit” plane… DSST score = 53 + 0.0039 × vitamin D (in 10 nmol/L) - 0.46 × age (in years). P-value for vitamin D >> .05. P-value for age < .0001. Thus, the relationship with vitamin D was due to confounding by age!

50 Multiple Linear Regression More than one predictor… $E(y) = \alpha + \beta_1 X + \beta_2 W + \beta_3 Z \dots$ Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
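To see how adjusting for a confounder changes a slope, the sketch below fits a one-predictor and a two-predictor model by solving the normal equations in plain Python. The data are hypothetical and noise-free: DSST is constructed to depend on age only (53 - 0.46 × age), while vitamin D merely tends to fall with age. The helper names `solve` and `ols` are made up for this illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ols(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] via X'X b = X'y."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


# Hypothetical data: vitamin D (in 10 nmol/L) tends to be lower in older men.
age = [45, 50, 55, 60, 65, 70, 75, 80]
vitd = [9.0, 7.5, 8.0, 6.0, 6.5, 4.5, 5.0, 3.0]
# DSST depends on age only; vitamin D has no effect of its own here.
dsst = [53 - 0.46 * a for a in age]

simple = ols([vitd], dsst)          # DSST ~ vitamin D
multiple = ols([vitd, age], dsst)   # DSST ~ vitamin D + age

print(f"unadjusted vitamin D slope: {simple[1]:.2f}")
print(f"adjusted vitamin D slope:   {multiple[1]:.4f}")
print(f"age slope:                  {multiple[2]:.2f}")
```

The unadjusted vitamin D slope is clearly positive, but once age is in the model it drops to essentially zero, which is the pattern the best-fit plane shows.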

51 Functions of multivariate analysis: Control for confounders Test for interactions between predictors (effect modification) Improve predictions

52 ANOVA is linear regression! Divide vitamin D into three groups:
Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), reference group
$\text{DSST} = \alpha$ (= value for sufficient) $+ \beta_{\text{insufficient}} \times$ (1 if insufficient) $+ \beta_{\text{deficient}} \times$ (1 if deficient)
This is called “dummy coding”, where multiple binary variables are created to represent being in each category (or not) of a categorical variable.

53 The picture… Sufficient vs. Insufficient Sufficient vs. Deficient

54 Results…
Parameter Estimates

Variable      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept      1            40.07407         1.47817    27.11    <.0001
deficient      1            -9.87407         3.73950    -2.64    0.0096
insufficient   1            -6.87963         2.33719    -2.94    0.0041

Interpretation: The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group. The insufficient group has a mean DSST 6.88 points lower than the reference (sufficient) group.
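The link between dummy coding and group means can be checked directly. In the sketch below the DSST scores are hypothetical (not the data behind the table above). The intercept is set to the reference-group mean and each dummy coefficient to a difference in means, and the code then verifies that these values satisfy the least-squares normal equations, i.e. the residuals are orthogonal to every column of the dummy-coded design.

```python
# Hypothetical DSST scores by vitamin D group (illustration only).
scores = {
    "sufficient": [42, 38, 41, 39],   # reference group
    "insufficient": [35, 31, 33],
    "deficient": [30, 32, 28],
}

means = {group: sum(ys) / len(ys) for group, ys in scores.items()}

# Claimed least-squares solution under dummy coding.
intercept = means["sufficient"]
b_insufficient = means["insufficient"] - intercept
b_deficient = means["deficient"] - intercept

# Dummy-coded design rows: (1, is_insufficient, is_deficient), plus the outcome.
rows = [((1, int(group == "insufficient"), int(group == "deficient")), y)
        for group, ys in scores.items() for y in ys]

coefs = (intercept, b_insufficient, b_deficient)
residuals = [y - sum(c * xj for c, xj in zip(coefs, x)) for x, y in rows]

# Least squares holds exactly when the residuals are orthogonal to each column.
normal_eqs = [sum(x[j] * r for (x, _), r in zip(rows, residuals))
              for j in range(3)]

# The residual sum of squares of this regression is the ANOVA SSW.
ssw = sum(r ** 2 for r in residuals)

print(f"intercept = {intercept:.1f} (mean of sufficient group)")
print(f"insufficient = {b_insufficient:.1f}, deficient = {b_deficient:.1f}")
print(f"normal equations: {normal_eqs}")
print(f"residual SS (= SSW) = {ssw:.1f}")
```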

