Statistical models in R - Part II

R has several statistical function packages. We have already covered a few of the functions:
- t-tests (one- and two-sample, paired)
- Wilcoxon tests
- hypothesis testing
- chi-squared tests

Here we will cover:
- linear and multiple regression
- analysis of variance
- correlation coefficients

Explore further on your own.
Linear regression

To really get at the regression model, you need to learn how to access the data found by the lm command. The lm function fits a linear model. Here is a short list:

> summary(lm(y ~ x))   # to view the results; y is modeled as a function of x
> resid(lm(y ~ x))     # to access the residuals
> coef(lm(y ~ x))      # to view the coefficients
> fitted(lm(y ~ x))    # to get the fitted values
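Refitting the model inside each accessor repeats the same work; a common pattern is to fit once and reuse the stored model object. A minimal sketch with simulated data (the names x, y, and fit are illustrative, not from the slides):

```r
# Simulated data: true intercept 3, true slope 2 (illustrative)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)

fit <- lm(y ~ x)      # fit once, store the model object
summary(fit)          # full results table
coef(fit)             # estimated intercept and slope
head(resid(fit))      # first few residuals
head(fitted(fit))     # first few fitted values
```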
Multiple linear regression

Linear regression was used to model the effect of one variable, an explanatory variable, on another. Multiple linear regression does the same, only there are multiple explanatory variables, or regressors.

In this case, the model formula syntax is easy to use. In simple regression we used: z ~ x. To add another explanatory variable you just "add" it to the right side of the formula. That is, to add y we use z ~ x + y instead of simply z ~ x.
Multiple linear regression

Let's investigate the model:

> x = 1:10
> y = sample(1:100, 10)
> z = x + y        # notice no error term -- sigma = 0
> lm(z ~ x + y)    # we use lm() as before
...
Coefficients:
(Intercept)            x            y
    4.2e-15      1.0e+00      1.0e+00
# model finds b_0 = 0, b_1 = 1, b_2 = 1 as expected
Multiple linear regression (continued)

> z = x + y + rnorm(10, 0, 2)    # now sigma = 2
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
     0.4694       0.9765       0.9891
# found b_0 = 0.4694, b_1 = 0.9765, b_2 = 0.9891

> z = x + y + rnorm(10, 0, 10)   # more noise -- sigma = 10
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
    10.5365       1.2127       0.7909
# noisier data gives estimates further from the true values
Multiple linear regression

The lm command only returns the coefficients (and the formula call) by default. The two methods summary and anova can yield more information. The output of summary is similar to that for simple regression.

In the multiple linear regression example, the R command is:

> summary(lm(z ~ x + y))
Analysis of variance

The t-test was used to test hypotheses about the means of two independent samples, for example, to test if there is a difference between control and treatment groups. The analysis of variance (ANOVA) allows one to compare means for more than two independent samples.
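As a minimal sketch of the idea, assuming three simulated groups that share the same true mean (all names and data here are illustrative, not from the slides):

```r
# Three simulated samples with the same true mean (illustrative)
set.seed(1)
g1 <- rnorm(20, mean = 5)
g2 <- rnorm(20, mean = 5)
g3 <- rnorm(20, mean = 5)

dat <- data.frame(value = c(g1, g2, g3),
                  group = factor(rep(c("g1", "g2", "g3"), each = 20)))

# One-way ANOVA: is there evidence that the group means differ?
oneway.test(value ~ group, data = dat, var.equal = TRUE)
```

Since all three groups are drawn from the same distribution, we would expect a large p-value here, i.e. no evidence of a difference in means.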
One-way analysis of variance

Example: Scholarship grading

Suppose a school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, suppose 6 are used. The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally.

One approach to checking if the graders are using the same scale is to randomly assign each grader 50 exams and have them grade. Then compare the grades for the 6 graders, knowing that the differences should be due to chance errors if the graders all grade equally.

To illustrate, suppose we have just 24 tests and 3 graders (not 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is on the range 1-5, with 5 being the best.
One-way analysis of variance

Data for the scholarship grading example:

Grader 1: 4 3 4 5 2 3 4 5
Grader 2: 4 4 5 5 4 5 4 4
Grader 3: 3 4 2 4 5 5 4 4

> x = c(4,3,4,5,2,3,4,5)       # enter this into our R session
> y = c(4,4,5,5,4,5,4,4)
> z = c(3,4,2,4,5,5,4,4)
> scores = data.frame(x,y,z)   # make a data frame
> boxplot(scores)              # compare the three distributions

From the boxplots, it appears that grader 2 is different from graders 1 and 3.
One-way analysis of variance

Scholarship grading example: analysis of variance allows us to investigate whether all the graders have the same mean.

The R function to do the analysis of variance hypothesis test, oneway.test, requires the data to be in a different format: a single variable holding the scores, and a factor describing the grader or category. The stack command will do this for us.
One-way analysis of variance

Scholarship grading example:

> scores = stack(scores)   # look at scores if not clear
> names(scores)
[1] "values" "ind"
> oneway.test(values ~ ind, data=scores, var.equal=T)

Notice: we state explicitly that the variances are equal with var.equal=T.

Result: we see a p-value of 0.34, so we fail to reject the null hypothesis of equal means.
One-way analysis of variance

Scholarship grading example: the anova function gives more detail; you need to call it on the result of lm:

> anova(lm(values ~ ind, data=scores))

Notice that the output is identical to that given by oneway.test. Alternatively, you could use the aov function to replace the combination anova(lm()). However, to get similar output you need to apply the summary command to the output of aov (for more on this, get help: enter ?aov).
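The equivalence described above can be sketched side by side. The snippet below rebuilds the stacked scores data frame from the grading example so it runs on its own:

```r
# Rebuild the grading data and stack it (as on the earlier slides)
x <- c(4,3,4,5,2,3,4,5)
y <- c(4,4,5,5,4,5,4,4)
z <- c(3,4,2,4,5,5,4,4)
scores <- stack(data.frame(x, y, z))

anova(lm(values ~ ind, data = scores))      # ANOVA table via lm
summary(aov(values ~ ind, data = scores))   # same table via aov + summary
```

Both calls produce the same F statistic and p-value; the choice between them is a matter of which fitted object you want to keep working with.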
One-way analysis of variance

Example: T4data

# 1. Read and explore the data
> t4data=read.delim("t4data.txt")
> attach(t4data)
> names(t4data)
> plot(FT4~Group)
> boxplot(FT4~Group,notch=T)

# 2. Fit with aov
> aov(FT4~Group)
> lm(FT4~Group)
> a1=aov(FT4~Group)
> summary(a1)
> names(a1)
> a1$coefficients

# 3. Fit with lm
> l1=lm(FT4~Group)
> anova(l1)
> names(l1)
> l1$coefficients
> l1$effects
Two-way analysis of variance

Example: ISdata

> ISdata=read.delim("ISdata.txt")
> attach(ISdata)
> names(ISdata)
> boxplot(IS~HT+BMI)
> table(HT,BMI)
> t2=tapply(IS,list(HT,BMI),median)
> t2[2,3]
> lm(IS~BMI*HT)                  # '*' means all interactions
> lm(IS~BMI+HT+BMI:HT)           # equivalent expanded form
> anova(lm(IS~BMI+HT+BMI:HT))
> anova(lm(IS~BMI+HT))
> anova(lm(IS~HT+BMI))
> aov(IS~BMI+HT)
Two-way analysis of variance

Example: ISdata (continued)

> TukeyHSD(aov(IS~BMI+HT))
> par(mfrow=c(1,2))
> plot(TukeyHSD(aov(IS~BMI+HT),which="BMI"))
> plot(TukeyHSD(aov(IS~BMI+HT),which="HT"))
> par(mfrow=c(2,2))
> plot(TukeyHSD(aov(IS~BMI+HT),which="BMI"))
> plot(TukeyHSD(aov(IS~BMI+HT),which="HT"))
> plot(TukeyHSD(aov(IS~HT+BMI),which="BMI"))
> plot(TukeyHSD(aov(IS~HT+BMI),which="HT"))
Correlation coefficient

Example: t4data

> ls()
> attach(t4data)
> names(t4data)
> plot(FT4,FT3)
> plot(FT4,FT3,pch=c(2,4)[Gender])   # plot symbol according to gender
> plot(FT4,FT3,pch=c(2,4)[Gender],col=c(2,4)[Gender])
> table(Gender)
> cor(FT4,FT3)                   # to find R
> cor(FT4,FT3)^2                 # to find R^2
> cor.test(FT4,FT3)
> cor.test(FT4,FT3,method="s")   # Spearman's rho (rank correlation)
> cor.test(FT4,FT3,method="k")   # Kendall's tau
Miscellaneous exercises

Due to time restrictions, we cannot cover all areas. Find the accompanying "R_miscellaneous_practical.doc"; the document contains some useful R commands. Work through them in your own time.
Sources of help

The R project website: http://www.r-project.org
The 'official' introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Manuals, tutorials, etc. provided by users of R: http://cran.R-project.org/other-docs.html