Review of ANOVA and linear regression. Review of simple ANOVA.


Review of ANOVA and linear regression

Review of simple ANOVA

ANOVA: comparing means across more than two groups

Hypotheses of One-Way ANOVA

H0: all population means are equal (µ1 = µ2 = … = µk), i.e., no treatment effect (no variation in means among groups)
H1: at least one population mean is different, i.e., there is a treatment effect. This does not mean that all population means are different; some pairs may be the same.

The F-distribution

A ratio of sample variances follows an F-distribution:

F = s1²/s2² ~ F(n1 − 1, n2 − 1)

The F-test tests the hypothesis that two population variances are equal. F will be close to 1 if the sample variances are equal.
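To make the ratio concrete, here is a minimal Python sketch: the two samples are hypothetical, drawn with equal true variances, and the ratio of their sample variances is referred to the F-distribution with scipy.

```python
import numpy as np
from scipy import stats

# Two hypothetical samples drawn with the SAME true variance
rng = np.random.default_rng(0)
x = rng.normal(loc=0, scale=2.0, size=30)
y = rng.normal(loc=0, scale=2.0, size=25)

# F is the ratio of the two sample variances (ddof=1: unbiased estimates)
F = np.var(x, ddof=1) / np.var(y, ddof=1)

# Two-sided p-value from the F-distribution with (n1-1, n2-1) df
df1, df2 = len(x) - 1, len(y) - 1
p = 2 * min(stats.f.cdf(F, df1, df2), stats.f.sf(F, df1, df2))
print(f"F = {F:.2f}, p = {p:.3f}")  # F near 1 -> no evidence the variances differ
```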

How to calculate ANOVAs by hand…

Treatment 1   Treatment 2   Treatment 3   Treatment 4
y11           y21           y31           y41
y12           y22           y32           y42
…             …             …             …
y1,10         y2,10         y3,10         y4,10

n = 10 observations per group, k = 4 groups

The group means: ȳi = (yi1 + yi2 + … + yi,10)/10
The (within) group variances: si² = Σj (yij − ȳi)² / (10 − 1)

Sum of Squares Within (SSW), or Sum of Squares Error (SSE)

Summing the (within) group variances over the k groups:

SSW = (10 − 1)s1² + (10 − 1)s2² + (10 − 1)s3² + (10 − 1)s4² = Σi Σj (yij − ȳi)²

This is the Sum of Squares Within (SSW), or SSE, for chance error.

Sum of Squares Between (SSB), or Sum of Squares Regression (SSR)

SSB = 10 × Σi (ȳi − ȳ)²

where ȳ is the overall mean of all 40 observations (the “grand mean”). SSB is the variability of the group means compared to the grand mean, i.e., the variability due to the treatment.

Total Sum of Squares (TSS)

TSS = Σi Σj (yij − ȳ)²

the squared difference of every observation from the overall mean (this is the numerator of the variance of Y!).

Partitioning of Variance

Σi Σj (yij − ȳi)² + 10 × Σi (ȳi − ȳ)² = Σi Σj (yij − ȳ)²

SSW + SSB = TSS

ANOVA Table

Source of variation              | d.f. | Sum of squares                                                        | Mean Sum of Squares   | F-statistic | p-value
Between (k groups)               | k−1  | SSB (sum of squared deviations of group means from grand mean)        | MSB = SSB/(k−1)       | MSB/MSW     | Go to the F(k−1, nk−k) chart
Within (n individuals per group) | nk−k | SSW (sum of squared deviations of observations from their group mean) | MSW = s² = SSW/(nk−k) |             |
Total variation                  | nk−1 | TSS (sum of squared deviations of observations from grand mean)       |                       |             |

TSS = SSB + SSW
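The whole table takes only a few lines to compute. A sketch with hypothetical data (k = 4 groups of n = 10, matching the layout above); the numbers are illustrative, not the lecture's data.

```python
import numpy as np
from scipy import stats

# Hypothetical data: k=4 groups (rows) with n=10 observations each
rng = np.random.default_rng(1)
y = rng.normal(loc=60, scale=5, size=(4, 10))
k, n = y.shape

grand_mean = y.mean()
group_means = y.mean(axis=1)

SSB = n * np.sum((group_means - grand_mean) ** 2)   # between groups
SSW = np.sum((y - group_means[:, None]) ** 2)       # within groups
TSS = np.sum((y - grand_mean) ** 2)                 # total
assert np.isclose(TSS, SSB + SSW)                   # the partition holds exactly

F = (SSB / (k - 1)) / (SSW / (n * k - k))
p = stats.f.sf(F, k - 1, n * k - k)
print(f"F = {F:.2f}, p = {p:.3f}, R^2 = {SSB / TSS:.1%}")
```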

Example

(Plot: heights, in inches, of the ten subjects in each of Treatments 1–4; the y-axis is marked at 60 inches.)

Example

Step 1) Calculate the sum of squares between groups:

Mean for group 1 = 62.0
Mean for group 2 = 59.7
Mean for group 3 = 56.3
Mean for group 4 = 61.4
Grand mean = 59.85

SSB = [(62.0 − 59.85)² + (59.7 − 59.85)² + (56.3 − 59.85)² + (61.4 − 59.85)²] × n per group = 19.65 × 10 = 196.5

Example

Step 2) Calculate the sum of squares within groups:

(60 − 62)² + (67 − 62)² + (42 − 62)² + (67 − 62)² + (56 − 62)² + (62 − 62)² + (64 − 62)² + (59 − 62)² + (72 − 62)² + (71 − 62)² + … (sum of all 40 squared deviations of observations from their own group means) = 2078.6

Step 3) Fill in the ANOVA table

Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between             |      |                |                     |             |
Within              |      |                |                     |             |
Total               |      |                |                     |             |

Step 3) Fill in the ANOVA table

Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between             | 3    | 196.5          | 65.5                | 1.13        | >.05 (NS)
Within              | 36   | 2078.6         | 57.7                |             |
Total               | 39   | 2275.1         |                     |             |

INTERPRETATION of ANOVA: How much of the variance in height is explained by treatment group?
R² = “Coefficient of Determination” = SSB/TSS = 196.5/2275.1 = 9%

Coefficient of Determination The amount of variation in the outcome variable (dependent variable) that is explained by the predictor (independent variable).

ANOVA example

Table 6. Mean micronutrient intake from the school lunch by school

                 S1 (n=25) [a]   S2 (n=25) [b]   S3 (n=25) [c]   P-value [d]
Calcium (mg)     Mean / SD       Mean / SD       Mean / SD
Iron (mg)        Mean / SD       Mean / SD       Mean / SD
Folate (µg)      Mean / SD       Mean / SD       Mean / SD
Zinc (mg)        Mean / SD       Mean / SD       Mean / SD

[a] School 1 (most deprived; 40% subsidized lunches). [b] School 2 (medium deprived; <10% subsidized). [c] School 3 (least deprived; no subsidization, private school). [d] ANOVA; significant differences are highlighted in bold (P<0.05).

FROM: Gould R, Russell J, Barker ME. School lunch menus and 11 to 12 year old children's food choice in three secondary schools in England: are the nutritional standards being met? Appetite. 2006 Jan;46(1):86-92.

Answer

Step 1) Calculate the sum of squares between groups. Take the three school means for calcium from the table; the grand mean is 161. Then:

SSB = [(mean S1 − 161)² + (mean S2 − 161)² + (mean S3 − 161)²] × 25 per group = 98,113

Answer

Step 2) Calculate the sum of squares within groups:

S.D. for S1 = 62.4
S.D. for S2 = 70.5
S.D. for S3 = 86.2

Therefore, the sum of squares within is: (24)[62.4² + 70.5² + 86.2²] = 391,066

Answer

Step 3) Fill in your ANOVA table

Source of variation | d.f. | Sum of squares | Mean Sum of Squares | F-statistic | p-value
Between             | 2    | 98,113         | 49,056.5            | 9.03        | <.05
Within              | 72   | 391,066        | 5,431.5             |             |
Total               | 74   | 489,179        |                     |             |

R² = 98,113/489,179 = 20%. School explains 20% of the variance in lunchtime calcium intake in these kids.
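As a check, the same table can be reproduced in Python from summary statistics alone. The SDs and SSB below are the values given on the slides; since the three school means themselves are not shown here, SSB is plugged in directly rather than recomputed.

```python
from scipy import stats

# Summary statistics from the slides (lunchtime calcium intake, mg)
sds = [62.4, 70.5, 86.2]   # per-school standard deviations
n, k = 25, 3               # 25 children per school, 3 schools
SSB = 98_113               # between-school sum of squares, as given on the slide

SSW = sum((n - 1) * sd**2 for sd in sds)   # each school contributes (n-1)*SD^2
F = (SSB / (k - 1)) / (SSW / (k * n - k))
p = stats.f.sf(F, k - 1, k * n - k)
print(f"SSW = {SSW:,.0f}, F = {F:.2f}, p = {p:.4f}")   # F ~ 9.0, p < .05
print(f"R^2 = {SSB / (SSB + SSW):.0%}")                # ~ 20%
```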

Beyond one-way ANOVA

Often you may want to test more than one treatment. ANOVA can accommodate more than one treatment or factor, so long as they are independent. Again, the variation partitions beautifully:

TSS = SSB1 + SSB2 + SSW

Linear regression review

What is “Linear”?

Remember this: Y = mX + B? (m is the slope; B is the intercept.)

What’s Slope? A slope of 2 means that every 1-unit change in X yields a 2-unit change in Y.

Regression equation…

Expected value of y at a given level of x: E(yi | xi) = α + β·xi

Predicted value for an individual…

yi = α + β·xi + random errori

The random error follows a normal distribution; α + β·xi is fixed, exactly on the line.

Assumptions (or the fine print)

Linear regression assumes that…
1. The relationship between X and Y is linear
2. Y is distributed normally at each value of X
3. The variance of Y at every value of X is the same (homogeneity of variances)
4. The observations are independent**

**When we talk about repeated measures starting next week, we will violate this assumption and hence need more sophisticated regression models!
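Assumptions 1 and 3 are usually checked visually. A minimal sketch, on hypothetical data, of the standard residuals-versus-fitted plot: a curved pattern suggests non-linearity, and a fan shape suggests non-constant variance.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data; the residuals-vs-fitted plot checks assumptions 1 and 3
rng = np.random.default_rng(2)
x = rng.uniform(10, 130, 100)
y = 20 + 0.15 * x + rng.normal(0, 10, 100)

slope, intercept = np.polyfit(x, y, 1)   # least-squares fit (degree-1 polynomial)
fitted = intercept + slope * x
residuals = y - fitted

plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")   # a curve suggests non-linearity; a fan, unequal variance
plt.show()
```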

The standard error of Y given X, sy/x, is the average variability around the regression line at any given value of X. It is assumed to be equal at all values of X (estimated as sy/x = √[SSresidual/(n − 2)]).

Regression Picture

(Plot: y versus x with the least-squares line; for one observation, the distances from the observation to the naïve mean of y, from the observation to the regression line, and from the line to the naïve mean are marked.)

Least squares estimation gave us the line (β) that minimized the residual sum of squares.

SStotal = total squared distance of observations from the naïve mean of y (total variation)
SSreg = squared distance from the regression line to the naïve mean of y (variability due to x, the regression)
SSresidual = variance around the regression line (additional variability not explained by x; what the least squares method aims to minimize)

R² = SSreg/SStotal
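A short sketch, on hypothetical data, verifying the decomposition numerically: fit the least-squares line, check that SStotal = SSreg + SSresidual, and compute R².

```python
import numpy as np

# Hypothetical data; verify SStotal = SSreg + SSresidual for the least-squares line
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 50)
y = 5 + 2 * x + rng.normal(0, 3, 50)

slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

SS_total = np.sum((y - y.mean()) ** 2)    # observations vs. the naive mean of y
SS_reg = np.sum((y_hat - y.mean()) ** 2)  # regression line vs. the naive mean of y
SS_resid = np.sum((y - y_hat) ** 2)       # observations vs. the regression line
assert np.isclose(SS_total, SS_reg + SS_resid)
print(f"R^2 = {SS_reg / SS_total:.2f}")
```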

Recall example: cognitive function and vitamin D

Hypothetical data loosely based on [1]; a cross-sectional study of 100 middle-aged and older European men. Cognitive function is measured by the Digit Symbol Substitution Test (DSST).

1. Lee DM, Tajar A, Ulubaev A, et al. Association between 25-hydroxyvitamin D levels and cognitive performance in middle-aged and older European men. J Neurol Neurosurg Psychiatry. 2009 Jul;80(7):722-9.

Distribution of vitamin D

Mean = 63 nmol/L
Standard deviation = 33 nmol/L

Distribution of DSST

Normally distributed
Mean = 28 points
Standard deviation = 10 points

Four hypothetical datasets

I generated four hypothetical datasets, with increasing TRUE slopes (between vit D and DSST):
0 points per 10 nmol/L
0.5 points per 10 nmol/L
1.0 points per 10 nmol/L
1.5 points per 10 nmol/L
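Datasets like these are easy to simulate. A sketch assuming the vitamin D distribution from the earlier slide (mean 63, SD 33 nmol/L), a DSST mean of 28, and an arbitrary noise SD of 10 points; the lecture's actual generating code is not shown here.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
vitd = rng.normal(63, 33, n)  # vitamin D (nmol/L), mean/SD from the earlier slide

# True slopes in DSST points per 10 nmol/L of vitamin D
for slope_per_10 in [0.0, 0.5, 1.0, 1.5]:
    b1 = slope_per_10 / 10                                # per 1 nmol/L
    dsst = 28 + b1 * (vitd - 63) + rng.normal(0, 10, n)   # noise SD of 10 assumed
    r = np.corrcoef(vitd, dsst)[0, 1]
    print(f"true slope {slope_per_10} per 10 nmol/L -> observed r = {r:.2f}")
```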

Dataset 1: no relationship

Dataset 2: weak relationship

Dataset 3: weak to moderate relationship

Dataset 4: moderate relationship

The “Best fit” line

Regression equation: E(Yi) = intercept + slope × vit Di (in 10 nmol/L)

The “Best fit” line

Note how the line is a little deceptive; it draws your eye, making the relationship appear stronger than it really is!

Regression equation: E(Yi) = intercept + slope × vit Di (in 10 nmol/L)

The “Best fit” line

Regression equation: E(Yi) = intercept + slope × vit Di (in 10 nmol/L)

The “Best fit” line

Regression equation: E(Yi) = intercept + slope × vit Di (in 10 nmol/L)

Note: all the lines go through the point (63, 28)!

Significance testing…

The estimated slope is distributed as: β̂1 ~ Tn−2(β1, s.e.(β̂1))

H0: β1 = 0 (no linear relationship)
H1: β1 ≠ 0 (a linear relationship does exist)

Test statistic: Tn−2 = β̂1 / s.e.(β̂1)

Example: dataset 4

Estimated slope = 0.15; standard error (beta) = 0.03
T98 = 0.15/0.03 = 5, p < .0001
95% confidence interval = 0.09 to 0.21
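The same inference is one call in scipy. A sketch on simulated stand-in data (true slope 0.015 points per nmol/L, i.e., 1.5 per 10 nmol/L; noise SD assumed), not the lecture's actual dataset 4.

```python
import numpy as np
from scipy import stats

# Simulated stand-in for dataset 4 (true slope 0.015 per nmol/L; noise SD assumed)
rng = np.random.default_rng(5)
x = rng.normal(63, 33, 100)
y = 28 + 0.015 * (x - 63) + rng.normal(0, 10, 100)

res = stats.linregress(x, y)        # returns slope, intercept, r, p-value, stderr
t = res.slope / res.stderr          # t-statistic with n-2 df
half_width = stats.t.ppf(0.975, len(x) - 2) * res.stderr
print(f"slope = {res.slope:.4f}, t = {t:.2f}, p = {res.pvalue:.4f}")
print(f"95% CI = [{res.slope - half_width:.4f}, {res.slope + half_width:.4f}]")
```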

Multiple linear regression…

What if age is a confounder here? Older men have lower vitamin D; older men have poorer cognition. “Adjust” for age by putting age in the model:

DSST score = intercept + slope1 × vitamin D + slope2 × age

2 predictors: age and vit D…

Different 3D view…

Fit a plane rather than a line… On the plane, the slope for vitamin D is the same at every age; thus, the slope for vitamin D represents the effect of vitamin D when age is held constant.

Equation of the “Best fit” plane…

DSST score = intercept + slope1 × vitamin D (in 10 nmol/L) + slope2 × age (in years)

P-value for vitamin D >> .05
P-value for age < .0001

Thus, the relationship with vitamin D was due to confounding by age!

Multiple Linear Regression

More than one predictor: E(y) = α + β1·X + β2·W + β3·Z…

Each regression coefficient is the amount of change in the outcome variable that would be expected per one-unit change of the predictor, if all other variables in the model were held constant.
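A sketch of the age-adjusted model using statsmodels, on simulated data built so that age drives both vitamin D and DSST; every coefficient below is made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data in which age drives BOTH vitamin D and DSST (confounding);
# all generating coefficients are invented for illustration
rng = np.random.default_rng(6)
age = rng.uniform(40, 80, 100)
vitd = 120 - 0.9 * age + rng.normal(0, 20, 100)   # older men: lower vitamin D
dsst = 55 - 0.45 * age + rng.normal(0, 8, 100)    # older men: poorer cognition

X = sm.add_constant(np.column_stack([vitd, age])) # intercept + two predictors
fit = sm.OLS(dsst, X).fit()
print(fit.summary())  # vitamin D coefficient shrinks toward 0 once age is included
```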

Functions of multivariate analysis:
Control for confounders
Test for interactions between predictors (effect modification)
Improve predictions

ANOVA is linear regression!

Divide vitamin D into three groups:
Deficient (<25 nmol/L)
Insufficient (>=25 and <50 nmol/L)
Sufficient (>=50 nmol/L), the reference group

DSST = α (= value for sufficient) + βinsufficient × (1 if insufficient) + βdeficient × (1 if deficient)

This is called “dummy coding,” where multiple binary variables are created to represent being in each category (or not) of a categorical variable.
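A sketch of dummy coding in practice, using statsmodels' formula interface on made-up group data; the group means below are arbitrary, chosen only so the contrasts are visible.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up group labels and DSST scores; the group means are arbitrary
rng = np.random.default_rng(7)
groups = rng.choice(["sufficient", "insufficient", "deficient"], size=100)
means = {"sufficient": 30.0, "insufficient": 23.0, "deficient": 20.0}
df = pd.DataFrame({"group": groups})
df["dsst"] = df["group"].map(means) + rng.normal(0, 8, len(df))

# Treatment (dummy) coding with 'sufficient' as the reference category
fit = smf.ols("dsst ~ C(group, Treatment(reference='sufficient'))", data=df).fit()
print(fit.params)  # intercept ~ sufficient mean; other terms: differences from it
```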

The picture…

(Plot of DSST by vitamin D group: Sufficient vs. Insufficient and Sufficient vs. Deficient.)

Results…

Parameter Estimates

Variable     | DF | Estimate | Standard Error | t Value | Pr > |t|
Intercept    | 1  |          |                |         | <.0001
deficient    | 1  | −9.87    |                |         |
insufficient | 1  | −6.87    |                |         |

Interpretation: The deficient group has a mean DSST 9.87 points lower than the reference (sufficient) group. The insufficient group has a mean DSST 6.87 points lower than the reference (sufficient) group.