AOV Assumption Checking and Transformations (§8.4-8.5)

Prepared by Lloyd R. Jaisingh

AOV Assumption Checking and Transformations (§8.4-8.5)
How do we check the Normality of residuals assumption in AOV?
How do we check the Homogeneity of variances assumption in AOV? (§7.4)
What do we do if these assumptions are not met?

Model Assumptions
Homoscedasticity (common group variances).
Normality of residuals.
Independence of residuals (hopefully achieved through randomization).
Effect additivity (only an issue in multi-way AOV; discussed later).

Checking the Equal Variance Assumption
H0: all group variances are equal. HA: some of the variances differ from each other.
Hartley's Fmax Test (little work, but little power): a logical extension of the F test for t = 2. Compute Fmax = s²(max) / s²(min), the ratio of the largest to the smallest sample variance. Requires equal replication, n, among groups, and requires normality. Reject H0 if Fmax > F(α, t, n−1), tabulated in Table 12.
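As an illustration (not part of the original slides), Hartley's Fmax statistic is easy to compute directly. This Python sketch uses hypothetical data and assumes equal group sizes, as the test requires; the decision still needs the tabulated critical value F(α, t, n−1).

```python
import numpy as np

def hartley_fmax(*groups):
    """Hartley's Fmax: ratio of the largest to the smallest sample variance.

    Assumes equal replication n across the t groups; compare the result
    to the tabulated critical value F(alpha, t, n-1).
    """
    variances = [np.var(g, ddof=1) for g in groups]  # sample variances
    return max(variances) / min(variances)

# Three hypothetical groups of n = 5 observations each
g1 = [19, 21, 20, 18, 22]
g2 = [30, 35, 25, 32, 28]
g3 = [10, 11, 9, 10, 10]
print(hartley_fmax(g1, g2, g3))  # variances 2.5, 14.5, 0.5 -> Fmax = 29.0
```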

Bartlett's Test
More work, but better power. Allows unequal replication, but requires normality.
T.S.: C = (nT − t) ln(sp²) − Σi (ni − 1) ln(si²), where sp² is the pooled sample variance.
If C > χ²(t−1, α), then apply the correction factor CF = 1 + [Σi 1/(ni − 1) − 1/(nT − t)] / [3(t − 1)].
R.R.: Reject H0 if C/CF > χ²(t−1, α).
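Alongside the Minitab, SAS, and R commands shown later, Bartlett's test is also available in Python via scipy. This sketch uses hypothetical data (the slides do not reproduce the textbook's sand data):

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative only)
g1 = [19.0, 21.0, 20.0, 18.0, 22.0]
g2 = [30.0, 35.0, 25.0, 32.0, 28.0]
g3 = [10.0, 11.0, 9.0, 10.0, 10.0]

stat, p = stats.bartlett(g1, g2, g3)
print(f"Bartlett statistic = {stat:.3f}, p-value = {p:.4f}")
# Reject H0 of equal variances when p < alpha (e.g. 0.05)
```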

Levene's Test
More work, but a robust result. Let z_ij = |y_ij − ỹ_i|, where ỹ_i is the sample median of the i-th group.
T.S.: the ANOVA F statistic computed on the z_ij, with df1 = t − 1 and df2 = nT − t.
R.R.: Reject H0 if F > F(α, df1, df2); use Table 8.
Essentially an AOV on the z_ij.
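scipy's levene function with center='median' computes exactly this median-centered version (sometimes labeled Brown-Forsythe, as in the SAS output later). Data here are hypothetical, not the slides' sand data:

```python
from scipy import stats

# Hypothetical data for three groups (illustrative only)
g1 = [19.0, 21.0, 20.0, 18.0, 22.0]
g2 = [30.0, 35.0, 25.0, 32.0, 28.0]
g3 = [10.0, 11.0, 9.0, 10.0, 10.0]

# center='median' gives the median-centered z_ij described on the slide
stat, p = stats.levene(g1, g2, g3, center='median')
print(f"Levene (median-centered) F = {stat:.3f}, p-value = {p:.4f}")
```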

Minitab
Stat > ANOVA > Test for Equal Variances
Response: Resist   Factors: Sand   ConfLvl: 95.0000

Bonferroni confidence intervals for standard deviations

   Lower     Sigma     Upper   N   Factor Levels
 1.70502   3.28634   14.4467   5   15
 1.89209   3.64692   16.0318   5   20
 1.07585   2.07364    9.1157   5   25
 1.07585   2.07364    9.1157   5   30
 1.48567   2.86356   12.5882   5   35

Bartlett's Test (normal distribution)
Test Statistic: 1.890    P-Value: 0.756

Levene's Test (any continuous distribution)
Test Statistic: 0.463    P-Value: 0.762

Minitab Help: Use Bartlett's test when the data come from normal distributions; Bartlett's test is not robust to departures from normality. Use Levene's test when the data come from continuous, but not necessarily normal, distributions. The computational method for Levene's test is a modification of Levene's procedure [10] developed by [2]: it considers the distances of the observations from their sample median rather than their sample mean, which makes the test more robust for smaller samples.

Conclusion: do not reject H0, since the p-value > 0.05 (the traditional α).

SAS Program

proc glm data=stress;
  class sand;
  model resistance = sand / solution;
  means sand / hovtest=bartlett;
  means sand / hovtest=levene(type=abs);
  means sand / hovtest=levene(type=square);
  means sand / hovtest=bf; /* Brown and Forsythe mod of Levene */
  title1 'Compression resistance in concrete beams as';
  title2 ' a function of percent sand in the mix';
run;

Note: HOVTEST only works when there is one factor in the (right-hand side of the) model.

SAS Output

hovtest=bartlett;
Bartlett's Test for Homogeneity of resistance Variance
Source   DF   Chi-Square   Pr > ChiSq
sand      4       1.8901       0.7560

hovtest=levene(type=abs);
Levene's Test for Homogeneity of resistance Variance
ANOVA of Absolute Deviations from Group Means
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4           8.8320        2.2080      0.95   0.4573
Error    20          46.6080        2.3304

hovtest=levene(type=square);
Levene's Test for Homogeneity of resistance Variance
ANOVA of Squared Deviations from Group Means
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4            202.2       50.5504      0.85   0.5076
Error    20           1182.8       59.1400

hovtest=bf;
Brown and Forsythe's Test for Homogeneity of resistance Variance
ANOVA of Absolute Deviations from Group Medians
Source   DF   Sum of Squares   Mean Square   F Value   Pr > F
sand      4           7.4400        1.8600      0.46   0.7623
Error    20          80.4000        4.0200

SPSS
Test of Homogeneity of Variances: RESIST

Levene Statistic   df1   df2   Sig.
           .947      4    20   .457

Since the p-value (0.457) is greater than our (typical) α = 0.05 Type I error risk level, we do not reject the null hypothesis. Note that this is Levene's original test, in which the z_ij are centered on group means, not medians.

R: Tests of Homogeneity of Variances
bartlett.test(): Bartlett's Test.
fligner.test(): Fligner-Killeen Test (nonparametric).

Checking for Normality
Reminder: normality of the RESIDUALS is assumed. The original data are assumed normal as well, but each group may have a different mean if HA is true. Standard practice is to fit the model first, THEN output the residuals, and then test the residuals for normality. This approach is always correct.
Tools:
Histogram of all residuals (e_ij).
Normal probability (Q-Q) plot.
Formal test for normality.

Histogram of Residuals

proc glm data=stress;
  class sand;
  model resistance = sand / solution;
  output out=resid r=r_resis p=p_resis;
  title1 'Compression resistance in concrete beams as';
  title2 ' a function of percent sand in the mix';
run;

proc capability data=resid;
  histogram r_resis / normal;
  ppplot r_resis / normal square;
run;

Probability Plots
A scatter plot of the percentiles of the residuals versus the percentiles of a standard normal distribution. The basic idea: if the residuals are truly normally distributed, these points should lie on a straight line.
Compute and sort the residuals e_(1), e_(2), …, e_(n).
Associate with each residual a standard normal percentile: z_(i) = NORMSINV((i − 0.5)/n).
Plot z_(i) versus e_(i) and compare to a straight line (we don't care so much about which line, only about straightness).
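The recipe above can be sketched in Python (an illustration, not from the slides); scipy.stats.norm.ppf plays the role of NORMSINV, and the residuals here are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical residuals from a fitted model
residuals = np.array([-2.1, 0.4, 1.3, -0.6, 0.2, -1.0, 1.8, 0.0])

n = len(residuals)
e_sorted = np.sort(residuals)            # e_(1) <= ... <= e_(n)
p = (np.arange(1, n + 1) - 0.5) / n      # plotting positions (i - 0.5)/n
z = stats.norm.ppf(p)                    # standard normal percentiles

# A straight-line pattern in (z, e_sorted) supports normality; the
# correlation of the two columns is a crude numeric summary of that.
r = np.corrcoef(z, e_sorted)[0, 1]
print(f"Q-Q correlation = {r:.3f}")
```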

Spreadsheet & R
In Excel: scatterplot of the residual percentiles versus the Normal percentiles; use the AddLine option. Percentile p_i = (i − 0.5)/n; Normal percentile = NORMSINV(p_i).
In R, with the residuals in y: qqnorm(y); qqline(y).

Excel Probability Plot

Probability Plot: Minitab and SAS versions (note the axes are interchanged between the two). These look normal!

Formal Normality Tests
Many, many tests exist (a favorite pastime of statisticians is developing new tests for normality):
Kolmogorov-Smirnov test.
Shapiro-Wilk test (n < 50).
D'Agostino's test (n >= 50).
All are quite conservative: they fail to reject the null hypothesis of normality more often than they should.

Shapiro-Wilk W Test
Let e_(1), e_(2), …, e_(n) represent the data ranked from smallest to largest.
H0: The population has a normal distribution.
HA: The population does not have a normal distribution.
T.S.: W = b² / Σi (e_(i) − ē)², where b = Σ_{i=1..k} a_i (e_(n+1−i) − e_(i)), with k = n/2 if n is even and k = (n − 1)/2 if n is odd. The coefficients a_i come from a table.
R.R.: Reject H0 if W < W_0.05. Critical values of W_α come from a table.
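Rather than looking up the a_i coefficients and W table by hand, the test is available in scipy (an illustration, not part of the slides); the residuals here are hypothetical:

```python
from scipy import stats

# Hypothetical residuals (n < 50, where Shapiro-Wilk is recommended)
residuals = [-2.1, 0.4, 1.3, -0.6, 0.2, -1.0, 1.8, 0.0,
             0.7, -0.3, 1.1, -1.5, 0.9, -0.8, 0.5]

w, p = stats.shapiro(residuals)
print(f"W = {w:.4f}, p-value = {p:.4f}")
# Small W (equivalently, small p) is evidence against normality
```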

Shapiro-Wilk Coefficients


Shapiro-Wilk W Table

D'Agostino's Test
Let e_(1), e_(2), …, e_(n) represent the data ranked from smallest to largest.
H0: The population has a normal distribution.
HA: The population does not have a normal distribution.
T.S.: D = Σ_{i=1..n} [i − (n + 1)/2] e_(i) / (n² √m2), where m2 = Σi (e_i − ē)² / n; then Y = √n (D − 0.28209479) / 0.02998598.
R.R. (two-sided test): Reject H0 if Y < Y_0.025 or Y > Y_0.975, where Y_0.025 and Y_0.975 come from a table of percentiles of the Y statistic.
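The statistic can be coded directly. This Python sketch follows the formula as reconstructed above on simulated data (an illustration, not from the slides), so treat it as a sketch rather than a validated implementation:

```python
import numpy as np

def dagostino_D_Y(x):
    """D'Agostino's D and its standardized version Y (intended for n >= 50)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    T = np.sum((i - (n + 1) / 2) * x)
    m2 = np.sum((x - x.mean()) ** 2) / n          # biased second moment
    D = T / (n ** 2 * np.sqrt(m2))
    Y = np.sqrt(n) * (D - 0.28209479) / 0.02998598
    return D, Y

rng = np.random.default_rng(0)
D, Y = dagostino_D_Y(rng.normal(size=1000))
print(f"D = {D:.5f}, Y = {Y:.3f}")
# For normal data, D should be near 0.28209479 and Y near 0
```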

Transformations to Achieve Homoscedasticity
What can we do if the homoscedasticity (equal variances) assumption is rejected?
Declare that the AOV model is not an adequate model for the data, and look for alternative models (later).
Try to "cheat" by forcing the data to be homoscedastic through a transformation of the response variable Y (variance-stabilizing transformations).

Square Root Transformation
Response is positive and continuous. This transformation works when we notice the variance changing as a linear function of the mean: σ² = kμ, k > 0.
Useful for count data (Poisson distributed). For small values of Y, use √(Y + 0.5).
Typical use: counts of items when counts are between 0 and 10.
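A small illustration with simulated data (hypothetical, not from the slides): for Poisson counts the variance grows with the mean, and the square-root transformation damps that:

```python
import numpy as np

rng = np.random.default_rng(1)
# Poisson counts: group variances grow linearly with the group means
low = rng.poisson(2.0, size=200)
high = rng.poisson(20.0, size=200)

# sqrt(Y + 0.5) is the slide's recommendation for small counts
t_low = np.sqrt(low + 0.5)
t_high = np.sqrt(high + 0.5)

print(np.var(low), np.var(high))      # very unequal
print(np.var(t_low), np.var(t_high))  # much closer
```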

Logarithmic Transformation
Response is positive and continuous. This transformation tends to work when the variance is a linear function of the square of the mean: σ² = kμ², k > 0.
Replace Y by Y + 1 if zeros occur. Useful if effects are multiplicative (later), or if there is considerable heterogeneity in the data.
Typical uses: growth over time; concentrations; counts of items when counts are greater than 10.
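A parallel illustration with simulated, hypothetical data: when the standard deviation scales with the mean (variance proportional to mean squared), the log brings the group variances together:

```python
import numpy as np

rng = np.random.default_rng(2)
# Lognormal-style data: standard deviation proportional to the mean
low = 5.0 * rng.lognormal(0.0, 0.5, size=200)
high = 50.0 * rng.lognormal(0.0, 0.5, size=200)

t_low = np.log(low + 1)    # the slide's Y + 1 guard against zeros
t_high = np.log(high + 1)

print(np.var(low), np.var(high))      # roughly a 100-fold gap
print(np.var(t_low), np.var(t_high))  # comparable after the log
```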

Arcsine Square Root Transformation
Response is a proportion. With proportions, the variance is a linear function of mean × (1 − mean), where the mean is the expected proportion.
Y is a proportion (a decimal between 0 and 1); transform to arcsin(√Y). Zero counts should be replaced by 1/4, and N by N − 1/4, before converting to proportions.
Typical uses: proportion of seeds germinating; proportion responding.
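A sketch of the bookkeeping with hypothetical germination counts (not from the slides), including the 1/4 adjustments for boundary counts:

```python
import numpy as np

# Hypothetical germination counts out of N = 20 seeds per plot
N = 20
counts = np.array([0, 3, 7, 12, 18, 20])

# The slide's guard: replace zero counts by 1/4 and N by N - 1/4
adj = counts.astype(float)
adj[adj == 0] = 0.25
adj[adj == N] = N - 0.25

p = adj / N                       # proportions strictly inside (0, 1)
transformed = np.arcsin(np.sqrt(p))
print(transformed)                # values lie in (0, pi/2)
```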

Reciprocal Transformation
Response is positive and continuous. This transformation works when the variance is a linear function of the fourth power of the mean.
Use 1/(Y + 1) if zeros occur. Useful if the reciprocal of the original scale has meaning.
Typical use: survival time.
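An illustration with simulated, hypothetical survival-time-like data, constructed so that the standard deviation is proportional to the square of the mean (i.e. variance proportional to the fourth power of the mean):

```python
import numpy as np

rng = np.random.default_rng(3)
# sd proportional to mean^2, so variance ~ mean^4
mu_low, mu_high = 2.0, 8.0
low = mu_low + (mu_low ** 2) * 0.02 * rng.standard_normal(200)
high = mu_high + (mu_high ** 2) * 0.02 * rng.standard_normal(200)

t_low = 1.0 / low
t_high = 1.0 / high

print(np.var(low), np.var(high))      # hugely unequal
print(np.var(t_low), np.var(t_high))  # roughly comparable
```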

Power Family of Transformations (1)
Suppose we apply the power transformation Z = Y^p, and suppose the true situation is that the standard deviation is proportional to the k-th power of the mean (σ ∝ μ^k). In the transformed variable we will have, by the delta method, Var(Z) ∝ μ^(2(p − 1 + k)). If p is taken as 1 − k, then the variance of Z will not depend on the mean, i.e. the variance will be constant. This is a variance-stabilizing transformation.

Power Family of Transformations (2)
With replicated data, k can sometimes be found empirically by fitting log(s_i) = a + k log(ȳ_i) across the groups: k is estimated by least squares (regression; next unit) as the slope of log standard deviation versus log mean. The suggested power is then p = 1 − k; if 1 − k is zero, use the logarithmic transformation.
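A sketch of this empirical estimate (illustrative, not from the slides), using data constructed so that each group's sample standard deviation exactly equals its mean, i.e. k = 1:

```python
import numpy as np

# For the group [m - d, m, m + d] the sample sd is exactly d;
# choosing d = m makes sd proportional to the mean with k = 1
means = np.array([2.0, 4.0, 8.0, 16.0])
groups = [np.array([0.0, m, 2.0 * m]) for m in means]

ybar = np.array([g.mean() for g in groups])
s = np.array([g.std(ddof=1) for g in groups])

# Least-squares slope of log(s) on log(ybar) estimates k
k_hat, intercept = np.polyfit(np.log(ybar), np.log(s), 1)
print(f"k_hat = {k_hat:.3f}")  # ~1, suggesting p = 1 - k = 0: log transform
```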

Box and Cox Transformations
Suggested transformation: Y^(λ) = (Y^λ − 1) / (λ · GM^(λ−1)) for λ ≠ 0, and Y^(λ) = GM · ln(Y) for λ = 0, where GM is the geometric mean of the original data. The exponent λ is unknown, so the model can be viewed as having an additional parameter that must be estimated: choose the value of λ that minimizes the residual sum of squares.
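In practice the search over λ is automated; scipy's boxcox, for example, picks λ by maximum likelihood (an illustration, not part of the slides). The response values here are hypothetical:

```python
import numpy as np
from scipy import stats

# Hypothetical positive, right-skewed response values
y = np.array([1.2, 1.5, 2.0, 2.2, 3.1, 4.8, 7.5, 12.0, 19.0, 31.0])

y_transformed, lam = stats.boxcox(y)   # lam chosen by maximum likelihood
print(f"estimated lambda = {lam:.3f}")

# With a fixed lambda = 0, the transformation reduces to the natural log
print(stats.boxcox(y, lmbda=0))
```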

Handling Heterogeneity (flowchart)
Regression or ANOVA? If ANOVA: fit the effects model. If regression: fit the linear model.
Test for homoscedasticity and plot the residuals.
Accept (residuals OK): proceed with the analysis.
Reject (not OK): transform the data (Box/Cox family, power family, or a traditional transformation) and refit using the transformed data.

Transformations to Achieve Normality (flowchart)
Regression or ANOVA? If ANOVA: estimate the group means. If regression: fit the linear model.
Examine the residuals with a probability plot and formal tests.
Residuals normal? If yes: proceed. If no: transform the data, or consider a different model.