Statistical models in R - Part II

R has several statistical function packages. We have already covered a few of the functions:
- t-tests (one- and two-sample, paired)
- Wilcoxon tests
- hypothesis testing
- chi-squared tests

Here we will cover:
- linear and multiple regression
- analysis of variance
- correlation coefficients

Explore further on your own.
Linear regression

To really get at the regression model, you need to learn how to access the data found by the lm command. The lm function fits a linear model. Here is a short list:

> summary(lm(y ~ x))   # to view the results; y is modeled as a function of x
> resid(lm(y ~ x))     # to access the residuals
> coef(lm(y ~ x))      # to view the coefficients
> fitted(lm(y ~ x))    # to get the fitted values
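Refitting the model inside each accessor repeats the same work; a common pattern is to fit once and reuse the stored model object. A minimal sketch with simulated data (the names x, y, and fit are illustrative, not from the slides):

```r
# Simulated data: true intercept 3, true slope 2 (illustrative)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 1)

fit <- lm(y ~ x)      # fit once, store the model object
summary(fit)          # full results table
coef(fit)             # estimated intercept and slope
head(resid(fit))      # first few residuals
head(fitted(fit))     # first few fitted values
```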
Multiple linear regression

Linear regression was used to model the effect of one variable, an explanatory variable, on another. Multiple linear regression does the same, only there are multiple explanatory variables, or regressors.

In this case, the model formula syntax is easy to use. In simple regression we used: z ~ x. To add another explanatory variable you just "add" it to the right side of the formula. That is, to add y we use z ~ x + y instead of simply z ~ x.
Multiple linear regression

Let's investigate the model:

> x = 1:10
> y = sample(1:100, 10)
> z = x + y        # notice no error term -- sigma = 0
> lm(z ~ x + y)    # we use lm() as before
...
Coefficients:
(Intercept)            x            y
    4.2e-15      1.0e+00      1.0e+00
# model finds b_0 = 0, b_1 = 1, b_2 = 1 as expected
Multiple linear regression (continued)

> z = x + y + rnorm(10, 0, 2)    # now sigma = 2
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
     0.4694       0.9765       0.9891
# found b_0 = 0.4694, b_1 = 0.9765, b_2 = 0.9891

> z = x + y + rnorm(10, 0, 10)   # more noise -- sigma = 10
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
    10.5365       1.2127       0.7909
# noisier data gives estimates further from the true values
Multiple linear regression

The lm command only returns the coefficients (and the formula call) by default. The two methods summary and anova can yield more information. The output of summary is similar to that for simple regression.

In the multiple linear regression example, the R command is:

> summary(lm(z ~ x + y))
Analysis of variance

The t-test was used to test hypotheses about the means of two independent samples, for example, to test if there is a difference between control and treatment groups. The analysis of variance (ANOVA) allows one to compare means for more than two independent samples.
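As a minimal sketch of the idea, assuming three simulated groups that share the same true mean (all names and data here are illustrative, not from the slides):

```r
# Three simulated samples with the same true mean (illustrative)
set.seed(1)
g1 <- rnorm(20, mean = 5)
g2 <- rnorm(20, mean = 5)
g3 <- rnorm(20, mean = 5)

dat <- data.frame(value = c(g1, g2, g3),
                  group = factor(rep(c("g1", "g2", "g3"), each = 20)))

# One-way ANOVA: is there evidence that the group means differ?
oneway.test(value ~ group, data = dat, var.equal = TRUE)
```

Since all three groups are drawn from the same distribution, we would expect a large p-value here, i.e. no evidence of a difference in means.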
One-way analysis of variance

Example: Scholarship grading

Suppose a school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, suppose 6 are used. The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally.

One approach to checking if the graders are using the same scale is to randomly assign each grader 50 exams and have them grade. Then compare the grades for the 6 graders, knowing that the differences should be due to chance errors if the graders all grade equally.

To illustrate, suppose we have just 24 tests and 3 graders (not 300 and 6, to simplify data entry). Furthermore, suppose the grading scale is on the range 1-5, with 5 being the best.
One-way analysis of variance

Data for the scholarship grading example:

Grader 1: 4 3 4 5 2 3 4 5
Grader 2: 4 4 5 5 4 5 4 4
Grader 3: 3 4 2 4 5 5 4 4

> x = c(4,3,4,5,2,3,4,5)       # enter this into our R session
> y = c(4,4,5,5,4,5,4,4)
> z = c(3,4,2,4,5,5,4,4)
> scores = data.frame(x,y,z)   # make a data frame
> boxplot(scores)              # compare the three distributions

From the boxplots, it appears that grader 2 is different from graders 1 and 3.
One-way analysis of variance

Scholarship grading example: analysis of variance allows us to investigate whether all the graders have the same mean.

The R function to do the analysis of variance hypothesis test, oneway.test, requires the data to be in a different format: a single variable holding the scores, and a factor describing the grader or category. The stack command will do this for us.
One-way analysis of variance

Scholarship grading example:

> scores = stack(scores)   # look at scores if not clear
> names(scores)
[1] "values" "ind"
> oneway.test(values ~ ind, data=scores, var.equal=T)

Notice: we state explicitly that the variances are equal with var.equal=T.

Result: we see a p-value of 0.34, so we fail to reject the null hypothesis of equal means.
One-way analysis of variance

Scholarship grading example: the anova function gives more detail; you need to call it on the result of lm:

> anova(lm(values ~ ind, data=scores))

Notice that the output is identical to that given by oneway.test. Alternatively, you could use the aov function to replace the combination anova(lm()). However, to get similar output you need to apply the summary command to the output of aov (for more on this, get help: enter ?aov).
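The equivalence described above can be sketched side by side. The snippet below rebuilds the stacked scores data frame from the grading example so it runs on its own:

```r
# Rebuild the grading data and stack it (as on the earlier slides)
x <- c(4,3,4,5,2,3,4,5)
y <- c(4,4,5,5,4,5,4,4)
z <- c(3,4,2,4,5,5,4,4)
scores <- stack(data.frame(x, y, z))

anova(lm(values ~ ind, data = scores))      # ANOVA table via lm
summary(aov(values ~ ind, data = scores))   # same table via aov + summary
```

Both calls produce the same F statistic and p-value; the choice between them is a matter of which fitted object you want to keep working with.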
One-way analysis of variance

Example: T4data

# 1. Read and explore the data
> t4data=read.delim("t4data.txt")
> attach(t4data)
> names(t4data)
> plot(FT4~Group)
> boxplot(FT4~Group,notch=T)

# 2. Fit with aov
> aov(FT4~Group)
> lm(FT4~Group)
> a1=aov(FT4~Group)
> summary(a1)
> names(a1)
> a1$coefficients

# 3. Fit with lm
> l1=lm(FT4~Group)
> anova(l1)
> names(l1)
> l1$coefficients
> l1$effects
Two-way analysis of variance

Example: ISdata

> ISdata=read.delim("ISdata.txt")
> attach(ISdata)
> names(ISdata)
> boxplot(IS~HT+BMI)
> table(HT,BMI)
> t2=tapply(IS,list(HT,BMI),median)
> t2[2,3]
> lm(IS~BMI*HT)                  # '*' means all interactions
> lm(IS~BMI+HT+BMI:HT)           # equivalent expanded form
> anova(lm(IS~BMI+HT+BMI:HT))
> anova(lm(IS~BMI+HT))
> anova(lm(IS~HT+BMI))
> aov(IS~BMI+HT)
Two-way analysis of variance

Example: ISdata (continued)

> TukeyHSD(aov(IS~BMI+HT))
> par(mfrow=c(1,2))
> plot(TukeyHSD(aov(IS~BMI+HT),which="BMI"))
> plot(TukeyHSD(aov(IS~BMI+HT),which="HT"))
> par(mfrow=c(2,2))
> plot(TukeyHSD(aov(IS~BMI+HT),which="BMI"))
> plot(TukeyHSD(aov(IS~BMI+HT),which="HT"))
> plot(TukeyHSD(aov(IS~HT+BMI),which="BMI"))
> plot(TukeyHSD(aov(IS~HT+BMI),which="HT"))
Correlation coefficient

Example: t4data

> ls()
> attach(t4data)
> names(t4data)
> plot(FT4,FT3)
> plot(FT4,FT3,pch=c(2,4)[Gender])   # plot symbol according to gender
> plot(FT4,FT3,pch=c(2,4)[Gender],col=c(2,4)[Gender])
> table(Gender)
> cor(FT4,FT3)                   # to find R
> cor(FT4,FT3)^2                 # to find R^2
> cor.test(FT4,FT3)
> cor.test(FT4,FT3,method="s")   # Spearman's rho (rank correlation)
> cor.test(FT4,FT3,method="k")   # Kendall's tau
Miscellaneous exercises

Due to time restrictions, we cannot cover all areas. Find the accompanying "R_miscellaneous_practical.doc"; the document contains some useful R commands. Work through them in your own time.
Sources of help

The R project website: http://www.r-project.org
The 'official' introduction to R: http://cran.r-project.org/doc/manuals/R-intro.pdf
Manuals, tutorials, etc. provided by users of R: http://cran.R-project.org/other-docs.html