Statistical models in R - Part II

- R has several statistical function packages
- We have already covered a few of the functions:
  - t-tests (one- and two-sample, paired)
  - Wilcoxon tests
  - hypothesis testing
  - chi-squared tests
- Here we will cover:
  - linear and multiple regression
  - analysis of variance
  - correlation coefficients
- Explore further on your own

Linear regression

- To really get at the regression model, you need to learn how to access the data found by the lm command
- The lm function is the one that fits a linear model
- Here is a short list:

> summary(lm(y ~ x))   # to view the results; y is modelled as a function of x
> resid(lm(y ~ x))     # to access the residuals
> coef(lm(y ~ x))      # to view the coefficients
> fitted(lm(y ~ x))    # to get the fitted values
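A minimal sketch tying these accessors together on simulated data (the variables and values here are hypothetical, chosen only for illustration):

> x = 1:20
> y = 3 + 2*x + rnorm(20)   # hypothetical data: intercept 3, slope 2, plus noise
> fit = lm(y ~ x)           # fit once and store the model object
> coef(fit)                 # estimated intercept and slope, close to 3 and 2
> head(resid(fit))          # first few residuals
> head(fitted(fit))         # first few fitted values
> summary(fit)              # coefficient table, standard errors, R-squared

Storing the fitted model in fit and reusing it avoids refitting the model for every accessor call.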

Multiple linear regression

- Linear regression was used to model the effect of one variable, an explanatory variable, on another
- Multiple linear regression does the same, only there are multiple explanatory variables, or regressors
- In this case, the model formula syntax is easy to use
- In simple regression we used: z ~ x
- To add another explanatory variable, you just "add" it to the right side of the formula
- That is, to add y we use z ~ x + y instead of simply z ~ x

Multiple linear regression

Let's investigate the model:

> x = 1:10
> y = sample(1:100, 10)
> z = x + y               # notice no error term -- sigma = 0
> lm(z ~ x + y)           # we use lm() as before
...
Coefficients:
(Intercept)            x            y
    4.2e-16      1.0e+00      1.0e+00
# model finds b_0 = 0, b_1 = 1, b_2 = 1 as expected

Multiple linear regression

Continuation:

> z = x + y + rnorm(10, 0, 2)    # now sigma = 2
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
     0.4694          ...          ...
# found b_0 = 0.4694; b_1 and b_2 remain close to 1

> z = x + y + rnorm(10, 0, 10)   # more noise -- sigma = 10
> lm(z ~ x + y)
...
Coefficients:
(Intercept)            x            y
        ...          ...          ...
# with sigma = 10 the estimates are typically further from 0, 1, 1
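Because sample() and rnorm() draw random numbers, each run gives different coefficients. A sketch of making the example reproducible and inspecting the uncertainty of the estimates (the seed value 42 is arbitrary):

> set.seed(42)                # any fixed seed makes the random draws repeatable
> x = 1:10
> y = sample(1:100, 10)
> z = x + y + rnorm(10, 0, 2)
> fit = lm(z ~ x + y)
> summary(fit)$coefficients   # estimates with standard errors, t values and p-values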

Multiple linear regression

- The lm command only returns the coefficients (and the formula call) by default
- The two methods summary and anova can yield more information
- The output of summary is similar to that for simple regression
- In the example of multiple linear regression, the R command is: summary(lm(z ~ x + y))
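The object returned by summary can also be queried directly; a short sketch of pulling out commonly used pieces, using the fit object for z ~ x + y stored in the sketch above:

> s = summary(fit)
> s$r.squared        # R^2 of the fit
> s$adj.r.squared    # adjusted R^2
> s$sigma            # residual standard error, the estimate of sigma
> s$fstatistic       # overall F statistic with its degrees of freedom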

Analysis of variance

- The t-test was used to test hypotheses about the means of two independent samples, for example to test if there is a difference between control and treatment groups
- The analysis of variance (ANOVA) allows one to compare means for more than two independent samples

One-way analysis of variance

Example: scholarship grading

- Suppose a school is trying to grade 300 different scholarship applications. As the job is too much work for one grader, suppose 6 are used
- The scholarship committee would like to ensure that each grader is using the same grading scale, as otherwise the students aren't being treated equally. One approach to checking whether the graders use the same scale is to randomly assign each grader 50 exams to grade, and then compare the grades across the 6 graders, knowing that the differences should be due to chance error if all graders grade equally
- To illustrate, and to simplify data entry, suppose we have just 24 tests and 3 graders rather than 300 and 6. Furthermore, suppose the grading scale is the range 1-5, with 5 being the best

One-way analysis of variance

Data for the scholarship grading example:

> x = c(4,3,4,5,2,3,4,5)       # enter this into our R session
> y = c(4,4,5,5,4,5,4,4)
> z = c(3,4,2,4,5,5,4,4)
> scores = data.frame(x,y,z)   # make a data frame
> boxplot(scores)              # compare the three distributions

(Boxplots of the scores for Grader 1, Grader 2 and Grader 3)

- From the boxplots, it appears that grader 2 is different from graders 1 and 3

One-way analysis of variance

Scholarship grading example:

- Analysis of variance allows us to investigate whether all the graders have the same mean
- The R function for the analysis of variance hypothesis test (oneway.test) requires the data to be in a different format
- It wants the data with a single variable holding the scores and a factor describing the grader or category; the stack command will do this for us

One-way analysis of variance

Scholarship grading example:

> scores = stack(scores)       # look at scores if not clear
> names(scores)
[1] "values" "ind"
> oneway.test(values ~ ind, data=scores, var.equal=T)

Notice: we state explicitly that the variances are equal with var.equal=T
Result: we see a p-value of 0.34, so we do not reject the null hypothesis of equal means
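By default oneway.test has var.equal=FALSE and applies Welch's correction for unequal variances, so if the equal-variance assumption is in doubt, the plain call is the safer choice:

> oneway.test(values ~ ind, data=scores)   # Welch's ANOVA; does not assume equal variances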

One-way analysis of variance

Scholarship grading example:

- The anova function gives more detail; you need to call it on the result of lm

> anova(lm(values ~ ind, data=scores))

- Notice that the output is identical to that given by oneway.test
- Alternatively, you could use the aov function to replace the combination anova(lm())
- However, to get similar output you need to apply the summary command to the output of aov (for more on this, get help: enter ?aov)
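A sketch of that aov route, which reproduces the ANOVA table above:

> a = aov(values ~ ind, data=scores)   # fit the one-way ANOVA model
> summary(a)                           # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)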

One-way analysis of variance

Example: T4data

# 1 -- read the data and inspect it graphically
> t4data = read.delim("t4data.txt")
> attach(t4data)
> names(t4data)
> plot(FT4 ~ Group)                 # if Group is a factor, this draws boxplots
> boxplot(FT4 ~ Group, notch=T)     # notches give a rough check for differing medians

# 2 -- fit via aov
> aov(FT4 ~ Group)
> lm(FT4 ~ Group)
> a1 = aov(FT4 ~ Group)
> summary(a1)                       # the ANOVA table
> names(a1)                         # components of the fitted object
> a1$coefficients

# 3 -- fit via lm
> l1 = lm(FT4 ~ Group)
> anova(l1)                         # the same ANOVA table via lm
> names(l1)
> l1$coefficients
> l1$effects

Two-way analysis of variance

Example: ISdata

> ISdata = read.delim("ISdata.txt")
> attach(ISdata)
> names(ISdata)
> boxplot(IS ~ HT + BMI)                  # IS split by each HT/BMI combination
> table(HT, BMI)                          # counts per cell
> t2 = tapply(IS, list(HT,BMI), median)   # table of cell medians
> t2[2,3]
> lm(IS ~ BMI*HT)                         # '*' means main effects plus all interactions
> lm(IS ~ BMI + HT + BMI:HT)              # the same model written out; ':' is the interaction
> anova(lm(IS ~ BMI + HT + BMI:HT))
> anova(lm(IS ~ BMI + HT))
> anova(lm(IS ~ HT + BMI))                # term order matters for sequential sums of squares
> aov(IS ~ BMI + HT)
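anova can also compare two nested fits directly, which is a convenient way to test whether the interaction term is needed; a short sketch:

> fit.add = lm(IS ~ BMI + HT)             # additive model
> fit.int = lm(IS ~ BMI + HT + BMI:HT)    # model with interaction
> anova(fit.add, fit.int)                 # F test: does the interaction improve the fit?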

Two-way analysis of variance

Example: ISdata cont.

> TukeyHSD(aov(IS ~ BMI + HT))           # all pairwise comparisons, family-wise error controlled
> par(mfrow=c(1,2))                      # two plots side by side
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="BMI"))
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="HT"))
> par(mfrow=c(2,2))                      # four plots: both term orders
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="BMI"))
> plot(TukeyHSD(aov(IS ~ BMI + HT), which="HT"))
> plot(TukeyHSD(aov(IS ~ HT + BMI), which="BMI"))
> plot(TukeyHSD(aov(IS ~ HT + BMI), which="HT"))

Correlation coefficient

Example: t4data

> ls()
> attach(t4data)
> names(t4data)
> plot(FT4, FT3)
> plot(FT4, FT3, pch=c(2,4)[Gender])       # plotting symbol according to gender
> plot(FT4, FT3, pch=c(2,4)[Gender], col=c(2,4)[Gender])
> table(Gender)
> cor(FT4, FT3)                    # to find R
> cor(FT4, FT3)^2                  # to find R^2
> cor.test(FT4, FT3)               # Pearson correlation, with test and confidence interval
> cor.test(FT4, FT3, method="s")   # Spearman's rho
> cor.test(FT4, FT3, method="k")   # Kendall's tau
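Correlation and simple regression are two views of the same linear relationship, so a natural follow-up (a sketch reusing lm from earlier) is to overlay the least-squares line on the scatterplot:

> plot(FT4, FT3)
> abline(lm(FT3 ~ FT4))   # add the fitted regression line of FT3 on FT4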

Miscellaneous exercises

- Due to time restrictions, we cannot cover all areas
- Find the accompanying "R_miscellaneous_practical.doc"
- The document contains some useful R commands
- Work through the R commands in your own time

Sources of help

- The R project website: https://www.r-project.org
- The 'official' introduction to R, "An Introduction to R", at https://cran.r-project.org/manuals.html
- Manuals, tutorials, etc. provided by users of R, located at https://cran.r-project.org/other-docs.html