Multiple Linear Regression Lab


Lab – Multiple Linear Regression

Basic tasks you need to know:
1. Can you read in a CSV file?
2. Can you create multiple scatter plots and a correlation matrix?
3. Can you tell from Task 2 whether it is worth building a linear regression model?
4. Can you build a model using the correct function?
5. Can you print a summary and interpret the coefficients, i.e., have we added some unnecessary x variables?
6. Can you refine the model?
7. Can you plot the residuals and interpret them?

Example 1: Predicting life expectancy

We need data to build our regression model (Task 1):

> stateX77 <- read.csv( 'C:/state_data_77.csv', header=T )

This data set includes:
"State": names of states
"Popul": population estimate as of July 1, 1975
"Income": per capita income (1974)
"Illit": illiteracy (1970, percent of population)
"LifeExp": life expectancy in years (1969–71)
"Murder": murder and non-negligent manslaughter rate per 100,000 population (1976)
"HSGrad": percent high-school graduates (1970)
"Frost": mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
"Area": land area in square miles
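If you don't have the CSV to hand, the same variables ship with base R as the built-in state.x77 matrix (with state.name holding the state names). A stand-in data frame can be built as below; the shortened column names are an assumed mapping onto the lab's CSV headers, not part of the lab itself:

```r
# Build a stand-in for state_data_77.csv from base R's state.x77 matrix.
# The renamed columns are an assumed mapping onto the lab's CSV headers.
stateX77 <- data.frame(State = state.name, state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit",
                     "LifeExp", "Murder", "HSGrad", "Frost", "Area")
str(stateX77)   # 50 observations of 9 variables
```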

Visualise

Let's create multiple scatter plots (Task 2):

> pairs(stateX77)

There appear to be some negative and positive linear relationships in the data. We can confirm this by looking at the correlation matrix (Task 2):

> cor(stateX77)
Error in cor(stateX77) : 'x' must be numeric

One of the variables is not numeric:

> str(stateX77)

There are 9 variables and the first column (State) is not numeric. We can select just the numeric columns as follows:

> cor(stateX77[,2:9])

To make the output neater we can round the numbers:

> round( cor(stateX77[,2:9]), 2 )
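These steps can be run end-to-end on base R's built-in state.x77 matrix, which holds the same eight numeric variables; the renaming below (to match the lab's CSV headers) is my assumption:

```r
# state.x77 is a built-in 50 x 8 numeric matrix of US state statistics.
df <- as.data.frame(state.x77)
names(df) <- c("Popul", "Income", "Illit", "LifeExp",
               "Murder", "HSGrad", "Frost", "Area")
pairs(df)            # multiple scatter plots (Task 2)
round(cor(df), 2)    # correlation matrix rounded to 2 d.p.
# LifeExp vs Murder shows a strong negative correlation (about -0.78).
```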

Build Model

Let's build a model for life expectancy and see if we can predict it from the other variables. There are some weak to strong correlations between life expectancy and the other variables, and the plots indicate that the relationships may be linear (Task 3). In building the model we may have some theory that suggests which variables we should be interested in or need to include. Here, we will start with all the variables and then remove those that aren't contributing much to the model.

For the present, let's not use State when we build the model (Task 4):

> stateX77.fit1 <- lm( LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area, data=stateX77 )

Now that we have built the model, let's inspect it (Task 5):

> summary( stateX77.fit1 )

Summary

Call:
lm(formula = LifeExp ~ Popul + Income + Illit + Murder + HSGrad +
    Frost + Area, data = stateX77)

Residuals:
     Min       1Q   Median       3Q      Max
-1.48895 -0.51232 -0.02747  0.57002  1.49447

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  7.094e+01  1.748e+00  40.586  < 2e-16 ***
Popul        5.180e-05  2.919e-05   1.775   0.0832 .
Income      -2.180e-05  2.444e-04  -0.089   0.9293
Illit        3.382e-02  3.663e-01   0.092   0.9269
Murder      -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
HSGrad       4.893e-02  2.332e-02   2.098   0.0420 *
Frost       -5.735e-03  3.143e-03  -1.825   0.0752 .
Area        -7.383e-08  1.668e-06  -0.044   0.9649
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7448 on 42 degrees of freedom
Multiple R-squared: 0.7362,  Adjusted R-squared: 0.6922
F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10
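Rather than reading the p-values off the printout, they can be pulled out of the summary object programmatically. This sketch refits the same model on base R's built-in state.x77 data, renamed (my assumption) to match the lab's variable names:

```r
df <- as.data.frame(state.x77)
names(df) <- c("Popul", "Income", "Illit", "LifeExp",
               "Murder", "HSGrad", "Frost", "Area")
fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder +
             HSGrad + Frost + Area, data = df)
ctab  <- summary(fit1)$coefficients      # estimate, std. error, t, p
pvals <- ctab[-1, "Pr(>|t|)"]            # drop the intercept row
names(pvals)[pvals < 0.05]               # only Murder and HSGrad pass 5%
```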

Eject Insignificance

Let's look at how the coefficients contribute to the prediction. Roughly speaking, the higher a coefficient's p-value, the less that variable contributes to the model. Let's remove the variables that clearly aren't significant – Income, Illit and Area (Task 6):

> stateX77.fit2 <- lm( LifeExp ~ Popul + Murder + HSGrad + Frost, data=stateX77 )

Can R Help?

Deciding what should be kept or left out of a regression model has been studied for many years, and there are a number of standard approaches. There is even a function in R that does this for us:

> final.stateX77 <- step( stateX77.fit1 )

How does 'final.stateX77' compare to 'stateX77.fit2'?
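A sketch of what step() does with the full model, again on the built-in state.x77 data (renamed to the lab's variable names, my assumption); trace = 0 suppresses the step-by-step AIC printout:

```r
df <- as.data.frame(state.x77)
names(df) <- c("Popul", "Income", "Illit", "LifeExp",
               "Murder", "HSGrad", "Frost", "Area")
fit1  <- lm(LifeExp ~ ., data = df)   # all seven predictors
final <- step(fit1, trace = 0)        # backward elimination by AIC
formula(final)
# step() retains Popul, Murder, HSGrad and Frost -- the same four
# predictors kept by hand in stateX77.fit2.
```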

Residual Checking

We still need to check the residuals of our model. Ideally, they should follow a normal distribution; if they don't, we need to examine the source data more carefully and rebuild the model. At this stage we will only check the residuals (Task 7). We want to see all the points on, or as close as possible to, the line:

> qqnorm( stateX77.fit2$residuals, ylab="Residuals" )
> qqline( stateX77.fit2$residuals )

What can you see? There is a problem here – the points do not all fall along the line.
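The QQ plot can be reproduced on the built-in state.x77 data (renamed to the lab's variable names, my assumption), and the Shapiro–Wilk test gives a numerical companion to the visual check:

```r
df <- as.data.frame(state.x77)
names(df) <- c("Popul", "Income", "Illit", "LifeExp",
               "Murder", "HSGrad", "Frost", "Area")
fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost, data = df)
qqnorm(fit2$residuals, ylab = "Residuals")   # normal QQ plot
qqline(fit2$residuals)                       # reference line
# A small Shapiro-Wilk p-value would indicate non-normal residuals.
shapiro.test(fit2$residuals)
```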

Our conclusion is that the residuals show some departure from a normal distribution, so we need to revise the model further. Although the adjusted R-squared is relatively high – the model explains about 69% of the variability of the response around its mean – the residual plot suggests there is some bias in our model.
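A residuals-versus-fitted plot is the usual way to look for such bias; a sketch on the built-in state.x77 data (renamed to the lab's variable names, my assumption):

```r
df <- as.data.frame(state.x77)
names(df) <- c("Popul", "Income", "Illit", "LifeExp",
               "Murder", "HSGrad", "Frost", "Area")
fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost, data = df)
plot(fit2$fitted.values, fit2$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)   # residuals should scatter randomly about zero
# Curvature or a funnel shape here points to bias or unequal variance.
```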

Your Turn – Milk Production

The Dairy Herd Improvement Cooperative (DHI) in Upstate New York collects and analyzes data on milk production. One question of interest here is how to develop a suitable model to predict current milk production from a set of measured variables. The response variable (current milk production in pounds) and the predictor variables are given in Table 1.1. Samples are taken once a month during milking. The period during which a cow gives milk is called lactation; the number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation.

http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470905840.html

What can you discover?

Given the dataset in 'milk_production.csv', build a multiple linear regression model:
1. Show the multiple scatter plots and correlation matrix for the data. (10 marks)
2. Build the initial model for predicting milk production and show the summary of this model. (20 marks)
3. In your opinion, can the model be improved further? If so, show the summary of the improved model. (10 marks)
4. Plot the residuals and give a short interpretation of them. (10 marks)