Experimental design and statistical analyses of data
Lesson 4: Analysis of variance II
Topics: a posteriori tests, model control, and how to choose the best model
Example: growth of bean plants in four different media (Zn, Cu, Mn, Control) in a completely randomized design (one-way ANOVA). The response is the biomass (y) of each plant, with n_i plants per treatment and an overall mean across all treatments.
How to do it with SAS
DATA medium;
  /* 20 bean plants exposed to 4 different treatments (5 plants per treatment)
     Mn = extra manganese added to the soil
     Zn = extra zinc added to the soil
     Cu = extra copper added to the soil
     K  = control soil
     The dependent variable (mass) is the biomass of the plants at harvest */
  INPUT treat $ mass;   /* treat = treatment, mass = biomass of a plant */
  CARDS;
zn 61.7
zn 59.4
zn 60.5
zn 59.2
zn 57.6
cu 57.0
cu 58.4
cu 57.3
cu 57.8
cu 59.9
mn 62.3
mn 66.2
mn 65.2
mn 63.7
mn 64.1
k 58.1
k 56.3
k 58.9
k 57.4
k 56.1
;
PROC SORT;   /* sort the observations according to treatment */
  BY treat;
RUN;

/* compute average and 95% confidence limits for each treatment */
PROC MEANS N MEAN CLM;
  BY treat;
RUN;
PROC MEANS output (Analysis Variable: MASS): for each treatment (cu, k, mn, zn) the procedure reports N, the mean, and the lower and upper 95% confidence limits (numerical values omitted).
PROC GLM;
  CLASS treat;
  MODEL mass = treat /SOLUTION;   /* SOLUTION gives the estimated parameter values */
RUN;
General Linear Models Procedure
Class Level Information: TREAT, 4 levels (cu k mn zn); number of observations in data set = 20.
Dependent Variable: MASS. The ANOVA table reports DF, Sum of Squares, Mean Square, F Value, and Pr > F for Model, Error, and Corrected Total, together with R-Square, C.V., Root MSE, and the MASS mean, followed by the Type I SS and Type III SS for TREAT (numerical values omitted).
Parameter estimates (SOLUTION option): for INTERCEPT and for each level of TREAT (cu, k, mn, zn) the output gives the estimate, t for H0: Parameter = 0, Pr > |T|, and the standard error of the estimate; each estimate is flagged with the letter 'B'.
NOTE: The X'X matrix has been found to be singular and a generalized inverse was used to solve the normal equations. Estimates followed by the letter 'B' are biased, and are not unique estimators of the parameters.
Because the model is over-parameterized, PROC GLM sets the parameter for the last level (zn) to zero, so the INTERCEPT estimates the zn mean and each TREAT estimate is the difference between that treatment's mean and the zn mean.
PROC GLM;
  CLASS treat;
  MODEL mass = treat /SOLUTION;   /* SOLUTION gives the estimated parameter values */
  /* Test for pairwise differences between treatments by linear contrasts
     (coefficients refer to the class levels in the order cu k mn zn) */
  CONTRAST 'Cu vs K'  treat  1 -1  0  0;
  CONTRAST 'Cu vs Mn' treat  1  0 -1  0;
  CONTRAST 'Cu vs Zn' treat  1  0  0 -1;
  CONTRAST 'K vs Mn'  treat  0  1 -1  0;
  CONTRAST 'K vs Zn'  treat  0  1  0 -1;
  CONTRAST 'Mn vs Zn' treat  0  0  1 -1;
  /* test whether the 3 treatments with added minerals differ from the control */
  CONTRAST 'K vs Cu, Mn Zn' treat 1 -3 1 1;
RUN;
Contrast output: for each contrast (Cu vs K, Cu vs Mn, Cu vs Zn, K vs Mn, K vs Zn, Mn vs Zn, K vs Cu, Mn Zn) the output reports DF, Contrast SS, Mean Square, F Value, and Pr > F (numerical values omitted).
PROC GLM;
  CLASS treat;
  MODEL mass = treat /SOLUTION;   /* SOLUTION gives the estimated parameter values */
  /* Test for differences between levels of treatment */
  MEANS treat / BON DUNCAN SCHEFFE TUKEY DUNNETT('k');
RUN;
Tukey's Studentized Range (HSD) Test for variable: MASS
NOTE: This test controls the type I experimentwise error rate.
Alpha = 0.05, Confidence = 0.95, df = 16 (MSE, critical value of the Studentized range, and minimum significant difference omitted). For each pair of treatments the output gives the difference between means with simultaneous lower and upper confidence limits; comparisons significant at the 0.05 level are indicated by '***'.
Significant: mn - zn, mn - cu, mn - k (and the reverse comparisons).
Not significant: zn - cu, zn - k, cu - k (and the reverse comparisons).

Bonferroni (Dunn) T tests for variable: MASS
NOTE: This test controls the type I experimentwise error rate but generally has a higher type II error rate than Tukey's for all pairwise comparisons.
Alpha = 0.05, Confidence = 0.95, df = 16 (MSE, critical value of T, and minimum significant difference omitted).
Significant: mn - zn, mn - cu, mn - k (and the reverse comparisons).
Not significant: zn - cu, zn - k, cu - k (and the reverse comparisons).

Scheffe's test for variable: MASS
NOTE: This test controls the type I experimentwise error rate but generally has a higher type II error rate than Tukey's for all pairwise comparisons.
Alpha = 0.05, Confidence = 0.95, df = 16 (MSE, critical value of F, and minimum significant difference omitted).
Significant: mn - zn, mn - cu, mn - k (and the reverse comparisons).
Not significant: zn - cu, zn - k, cu - k (and the reverse comparisons).

Dunnett's T tests for variable: MASS
NOTE: This test controls the type I experimentwise error rate for comparisons of all treatments against a control.
Alpha = 0.05, Confidence = 0.95, df = 16 (MSE, critical value of Dunnett's T, and minimum significant difference omitted).
Significant: mn - k, zn - k.
Not significant: cu - k.
Comparison between multiple tests

The slide tabulates the minimum significant difference for each test (Duncan, Dunnett, Tukey, Bonferroni, Scheffe; values omitted), increasing in that order from the most liberal to the most conservative test.

Duncan's test exaggerates the risk of Type I errors; Scheffe's test exaggerates the risk of Type II errors. Tukey's test is recommended as the best.
PROC GLM;
  CLASS treat;
  MODEL mass = treat /SOLUTION;   /* SOLUTION gives the estimated parameter values */
  /* Test for differences between different levels of treatment */
  MEANS treat / BON DUNCAN SCHEFFE TUKEY LINES;
RUN;
General Linear Models Procedure

Duncan's Multiple Range Test for variable: MASS
NOTE: This test controls the type I comparisonwise error rate, not the experimentwise error rate.
Alpha = 0.05, df = 16 (MSE and critical ranges omitted). Means with the same letter are not significantly different.
Duncan grouping: mn = A; zn = B; cu = B C; k = C.

Tukey's Studentized Range (HSD) Test for variable: MASS
NOTE: This test controls the type I experimentwise error rate, but generally has a higher type II error rate than REGWQ.
Alpha = 0.05, df = 16 (MSE, critical value of the Studentized range, and minimum significant difference omitted). Means with the same letter are not significantly different.
Tukey grouping: mn = A; zn = B; cu = B; k = B.

Bonferroni (Dunn) T tests for variable: MASS
NOTE: This test controls the type I experimentwise error rate, but generally has a higher type II error rate than REGWQ.
Alpha = 0.05, df = 16, Critical Value of T = 3.01 (MSE and minimum significant difference omitted). Means with the same letter are not significantly different.
Bonferroni grouping: mn = A; zn = B; cu = B; k = B.

Scheffe's test for variable: MASS
NOTE: This test controls the type I experimentwise error rate but generally has a higher type II error rate than REGWF for all pairwise comparisons.
Alpha = 0.05, df = 16 (MSE, critical value of F, and minimum significant difference omitted). Means with the same letter are not significantly different.
Scheffe grouping: mn = A; zn = B; cu = B; k = B.
PROC GLM;
  CLASS treat;
  MODEL mass = treat /SOLUTION;   /* SOLUTION gives the estimated parameter values */
  /* In unbalanced (and balanced) designs LSMEANS can be used: */
  LSMEANS treat /TDIFF PDIFF;
RUN;
The GLM Procedure: Least Squares Means
For each treatment (cu, k, mn, zn) the output lists the mass LSMEAN and an LSMEAN number, followed by the matrix of t values for H0: LSMean(i) = LSMean(j) with Pr > |t| (some p-values are shown as < .0001; the full matrix is omitted here).
NOTE: To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.
Is this P-value significant?
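The note above warns that the unadjusted PDIFF p-values do not protect the experimentwise error rate. As an aside, a minimal sketch of one way to obtain multiplicity-adjusted p-values directly from LSMEANS via the ADJUST= option (shown here with a Tukey adjustment; this is an addition for illustration, not part of the course code):

PROC GLM;
  CLASS treat;
  MODEL mass = treat;
  /* PDIFF gives pairwise p-values; ADJUST=TUKEY makes them
     experimentwise-adjusted instead of comparisonwise */
  LSMEANS treat / PDIFF ADJUST=TUKEY;
RUN;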
The sequential Bonferroni test

The sequential Bonferroni test is less conservative than the ordinary Bonferroni test.

Procedure: first order the k P-values in increasing order, and let P(i) denote the i-th P-value after ordering. Then compute

  α(i) = α / (k − i + 1),

where α is the significance level that would be used if there were only a single P-value (usually 0.05). If P(i) < α(i), the i-th P-value is significant.

(The slide's table lists i, P(i), α(i), and P(i) − α(i) and marks the significant P-values; the numerical values are omitted here.)
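A minimal sketch of this procedure in a SAS DATA step, using hypothetical P-values (not the course data):

DATA pvals;                        /* hypothetical P-values from k = 4 tests */
  INPUT test $ p;
  CARDS;
A 0.001
B 0.012
C 0.020
D 0.300
;
PROC SORT DATA=pvals;              /* order the P-values in increasing order */
  BY p;
RUN;
DATA seqbon;
  SET pvals NOBS=k;                /* k = total number of P-values */
  RETAIN failed 0;
  alpha_i = 0.05 / (k - _N_ + 1);  /* adjusted level for the i-th ordered P-value */
  IF p >= alpha_i THEN failed = 1; /* once one test fails, all larger P-values fail too */
  significant = (failed = 0);      /* 1 if significant, 0 otherwise */
RUN;
PROC PRINT DATA=seqbon; RUN;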
Model assumptions and model control

All GLMs are based on the assumption that
(1) ε is independently distributed,
(2) ε is normally distributed with mean = 0,
(3) the variance of ε (denoted σ²) is the same for all values of the independent variable(s) (variance homogeneity).

Mathematically this is written as: ε is iid ND(0; σ²), where iid = independently and identically distributed.
Transformation of data

Transformation of data serves two purposes:
(1) to remove variance heteroscedasticity,
(2) to make data more normal.

Usually a transformation meets both purposes, but if this is not possible, variance homoscedasticity is regarded as the most important, especially if sample sizes are large.
How to choose the appropriate transformation?
Power transformations have the form y* = y^p. We have to find a value of p such that the transformed values of y (denoted y*) are normally distributed and have a variance that is independent of the mean of y*. A useful method for finding p is to fit Taylor's power law to the data.
Taylor's power law

Taylor's power law states that the variance increases with the mean as a power function, s² = a·x̄^b, i.e. log s² = log a + b·log x̄. It can be shown that p = 1 − b/2 gives the appropriate transformation we are searching for.
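A minimal sketch of how the slope b could be estimated in SAS by regressing the log variance on the log mean; the data set and variable names (counts, sample, count) are hypothetical:

PROC MEANS DATA=counts NOPRINT;    /* mean and variance per sample */
  CLASS sample;
  VAR count;
  OUTPUT OUT=stats MEAN=m VAR=s2;
RUN;
DATA stats;
  SET stats;
  IF _TYPE_ = 1;                   /* keep only the per-sample rows */
  logm  = LOG10(m);
  logs2 = LOG10(s2);
RUN;
PROC REG DATA=stats;               /* the slope b gives p = 1 - b/2 */
  MODEL logs2 = logm;
RUN;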
If y is a proportion, i.e. 0 ≤ y ≤ 1, an appropriate transformation is often the arcsine square root transformation, y* = arcsin(√y).
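A small sketch of how these transformations could be applied in a DATA step; the data set and variable names (rawdata, y) are hypothetical:

DATA transformed;
  SET rawdata;
  y_log    = LOG10(y + 1);     /* counts: logarithmic transformation (p -> 0) */
  y_sqrt   = SQRT(y);          /* p = 0.5, often suitable for Poisson-like counts */
  y_arcsin = ARSIN(SQRT(y));   /* proportions between 0 and 1 */
RUN;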
Example: fitting Taylor's power law to counts of two mite species gave
T. urticae: log s² = log a + b·log x̄ (intercept, slope, and r² not reproduced here), giving y* = log(y + 1);
P. persimilis: log s² = log a + b·log x̄ (intercept, slope, and r² not reproduced here), giving y* = log(y + 1).
(A fitted slope b near 2 gives p = 1 − b/2 ≈ 0, which corresponds to the logarithmic transformation.)
Exponential growth

Deterministic model: dN/dt = rN, with r = b − d, where
  N = population size at time t
  r = net growth rate per capita (the instantaneous growth rate)
  b = birth rate per capita
  d = death rate per capita

Stochastic model: ΔN = (BΔt + ε) − (DΔt + δ), where
  ΔN = change in N during Δt
  B = birth rate, D = death rate
  ε = noise associated with births, δ = noise associated with deaths

The number of births during a time interval follows a Poisson distribution with mean BΔt. The number of deaths during a time interval is binomially distributed with parameters (θ, N), where θ = DΔt/N is the probability that an individual dies during Δt.
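A minimal simulation sketch of the stochastic model above in a DATA step; the parameter values (b, d, Δt, N0, number of steps) are hypothetical:

DATA simgrowth;
  b = 0.10;  d = 0.05;  dt = 1;           /* per-capita birth and death rates, time step */
  N = 20;                                  /* initial population size N0 */
  DO t = 0 TO 50 BY dt;
    OUTPUT;
    IF N <= 0 THEN LEAVE;                  /* stop if the population goes extinct */
    births = RAND('POISSON', b*N*dt);      /* births ~ Poisson with mean B*dt = b*N*dt */
    theta  = MIN(d*dt, 1);                 /* probability that an individual dies during dt */
    deaths = RAND('BINOMIAL', theta, N);   /* deaths ~ Binomial(theta, N) */
    N = N + births - deaths;
  END;
RUN;
PROC GPLOT DATA=simgrowth;                 /* plot the simulated trajectory */
  PLOT N*t;
RUN;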
Type I, II, III and IV SS Example: Mites in stored grain influenced by temperature (T) and humidity (H)
DATA mites;
  INFILE 'h:\lin-mod\besvar\opg1-1.prn' FIRSTOBS=2;
  INPUT pos $ depth T H Mites;
  /* pos   = position in store */
  /* depth = depth in m */
  /* T     = temperature of grain */
  /* H     = humidity of grain */
  /* Mites = number of mites in sampling unit */
  logMites = log10(Mites+1);   /* log transformation of Mites */
  T2 = T**2;                   /* square of temperature */
  H2 = H**2;                   /* square of humidity */
  TH = T*H;                    /* product of temperature and humidity */
PROC GLM;
  CLASS pos;
  MODEL logMites = T T2 H H2 TH /SOLUTION SS1 SS3;
RUN;
General Linear Models Procedure, Dependent Variable: LOGMITES
The ANOVA table reports DF, Sum of Squares, Mean Square, F Value, and Pr > F for Model, Error, and Corrected Total, together with R-Square, C.V., Root MSE, and the LOGMITES mean, followed by the parameter estimates (estimate, t for H0: Parameter = 0, Pr > |T|, and standard error) for INTERCEPT, T, T2, H, H2, and TH (numerical values omitted).
General Linear Models Procedure, Dependent Variable: LOGMITES
In addition to the overall ANOVA table (Model, Error, Corrected Total; R-Square, C.V., Root MSE, LOGMITES mean), the output lists the Type I SS and Type III SS with Mean Square, F Value, and Pr > F for each of T, T2, H, H2, and TH (numerical values omitted).
Example: the parameter β3, the coefficient of H in the model logMites = β0 + β1·T + β2·T² + β3·H + β4·H² + β5·T·H.

Type I SS (SS I) is used to compare the model
  logMites = β0 + β1·T + β2·T² + β3·H
with
  logMites = β0 + β1·T + β2·T²,
i.e. H is tested after the terms that precede it in the MODEL statement.

Type III SS (SS III) is used to compare the model
  logMites = β0 + β1·T + β2·T² + β3·H + β4·H² + β5·T·H
with
  logMites = β0 + β1·T + β2·T² + β4·H² + β5·T·H,
i.e. H is tested after all other terms in the model.
General Linear Models Procedure, Dependent Variable: LOGMITES
The output again lists the Type I SS and Type III SS (with Mean Square, F Value, and Pr > F) for T, T2, H, H2, and TH (numerical values omitted). The two types of SS lead to different conclusions for H:
H is significant if it is added after T and T² (Type I SS);
H is not significant if it is added after T, T², H², and TH (Type III SS).
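A small sketch illustrating this order dependence: re-running PROC GLM with the terms in a different order changes the Type I SS for H, but not its Type III SS (the re-ordering is an addition for illustration, not part of the original analysis):

PROC GLM DATA=mites;
  MODEL logMites = T T2 H H2 TH / SS1 SS3;   /* H entered after T and T2 */
RUN;
PROC GLM DATA=mites;
  MODEL logMites = H2 TH T T2 H / SS1 SS3;   /* H entered last */
RUN;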
How do we choose the best model?
DATA mites;
  INFILE 'h:\lin-mod\besvar\opg1-1.prn' FIRSTOBS=2;
  INPUT pos $ depth T H Mites;
  /* pos   = position in store */
  /* depth = depth in m */
  /* T     = temperature of grain */
  /* H     = humidity of grain */
  /* Mites = number of mites in sampling unit */
  logMites = log10(Mites+1);   /* log transformation of Mites */
  T2 = T**2;                   /* square of temperature */
  H2 = H**2;                   /* square of humidity */
  TH = T*H;                    /* product of temperature and humidity */
PROC STEPWISE;
  MODEL logMites = T T2 H H2 TH /MAXR;
RUN;
Maximum R-square Improvement for Dependent Variable LOGMITES
(Each step reports R-square, C(p), the ANOVA table for the current model, the parameter estimates with standard errors and Type II SS, and bounds on the condition number; the numerical values are omitted below.)

Step 1: Variable H2 entered. The above model is the best 1-variable model found.
Step 2: Variable T entered.
Step 3: Variable H2 removed, variable TH entered. The above model is the best 2-variable model found.
Step 4: Variable T2 entered. The above model is the best 3-variable model found.
Step 5: Variable H2 entered.
Step 6: Variable TH removed, variable H entered. The above model is the best 4-variable model found.
Step 7: Variable TH entered. The above model is the best 5-variable model found. No further improvement in R-square is possible.
Candidate models (for each model the slides list R², F, and P; values omitted):

Models with 1 variable: T; T²; H; H²; T*H

Models with 2 variables: T T²; T H; T H²; T T*H; T² H; T² H²; T² T*H; H H²; H T*H; H² T*H

Models with 3 variables: T T² H; T T² H²; T T² T*H; T H H²; T H T*H; T H² T*H; T² H H²; T² H T*H; T² H² T*H; H H² T*H

Models with 4 variables: T T² H H²; T T² H T*H; T T² H² T*H; T H H² T*H; T² H H² T*H

Models with 5 variables: T T² H H² T*H
Best models (for each model the slide lists R², F, P, and Mallows' C(p); values omitted):
1 variable: H²
2 variables: T, T*H
3 variables: T, T², T*H
4 variables: T, T², H, H²
5 variables: T, T², H, H², T*H
Judged by Mallows' C(p), one of these models may overall be considered the best model.
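As an aside, the R² and Mallows' C(p) values for the candidate subsets can also be obtained directly; a minimal sketch using PROC REG's SELECTION= option on the mites data (an alternative to PROC STEPWISE, not the code used above):

PROC REG DATA=mites;
  /* list the best models of each size with their R-square and Mallows' C(p) */
  MODEL logMites = T T2 H H2 TH / SELECTION=RSQUARE CP BEST=3;
RUN;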
Model control
DATA mites;
  INFILE 'h:\lin-mod\besvar\opg1-1.prn' FIRSTOBS=2;
  INPUT pos $ depth T H Mites;
  LogMites = log10(Mites+1);   /* transform dependent variable */
  T2 = T**2;                   /* square of temperature */
  H2 = H**2;                   /* square of humidity */
  TH = T*H;                    /* interaction between temperature and humidity */
PROC REG;                      /* multiple regression analysis */
  MODEL logMites = T T2 H H2 TH;
  OUTPUT OUT=new P=pred R=res;
RUN;
/* Model control */
PROC GPLOT;
  /* plot observed values against predicted values together with the line of equality */
  PLOT LogMites*pred pred*pred /OVERLAY;
  SYMBOL1 COLOR=blue VALUE=circle HEIGHT=1;
  SYMBOL2 COLOR=red INTERPOL=line WIDTH=1;
  /* plot residuals against the predicted values */
  PLOT res*pred;
  SYMBOL1 COLOR=blue VALUE=circle HEIGHT=1;
RUN;
Observed values of LogMites against predicted values
Residuals plotted against predicted values of LogMites
PROC UNIVARIATE FREQ PLOT NORMAL DATA=new;
  /* PROC UNIVARIATE gives information about the variables defined by VAR */
  /* FREQ, PLOT, NORMAL etc. are options:
     FREQ   = number of observations of a given value
     PLOT   = plots of the observations
     NORMAL = test of whether the variable is normally distributed */
  VAR res;   /* information about the residuals */
RUN;
Univariate Procedure, Variable = RES
Moments and quantiles of the residuals: N = 20, Mean = 0, Sum = 0, Range = 4.1, together with the standard deviation, variance, skewness, kurtosis, quantiles, sign test, and signed rank test (most numerical values omitted). The Shapiro-Wilk statistic (W:Normal) with its Pr < W tests
H0: the residuals are normally distributed.
Pr < W is the probability of getting a deviation from the normal distribution equal to or greater than the observed one by chance, given that H0 is true.
Extremes
Lowest: -2.08 (obs 20), -2.00 (obs 11), -1.26 (obs 10), -1.08 (obs 1), -1.06 (obs 7)
Highest: 0.90 (obs 13), 1.54 (obs 8), 1.82 (obs 5), 1.90 (obs 12), 2.02 (obs 16)
(The output also includes a stem-and-leaf plot and a box plot of the residuals.)
Normal probability plot of the residuals: the points should follow a straight line if the data are normally distributed.
Frequency table of the residual values (Value, Count, Cell %, and Cumulative %; values omitted).