1
Multi Linear Regression Lab
2
Lab – Multiple Linear Regression
Basic tasks you need to know:
1. Can you read in a CSV file?
2. Can you create multiple scatter plots and a correlation matrix?
3. Can you see from the output of Task 2 whether it is worth creating a linear regression model?
4. Can you build a model using the correct function?
5. Can you print a summary and interpret the coefficients, i.e., have you added some unnecessary x variables?
6. Can you refine the model?
7. Can you plot the residuals and interpret them?
3
Example 1: Predicting life expectancy
We need data to build our regression model (Task 1):
> stateX77 <- read.csv( 'C:/state_data_77.csv', header=T )
This data set includes:
"State": names of states
"Popul": population estimate as of July 1, 1975
"Income": per capita income (1974)
"Illit": illiteracy (1970, percent of population)
"LifeExp": life expectancy in years (1969–71)
"Murder": murder and non-negligent manslaughter rate per 100,000 population (1976)
"HSGrad": percent high-school graduates (1970)
"Frost": mean number of days with minimum temperature below freezing (1931–1960) in capital or large city
"Area": land area in square miles
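The read step in Task 1 can be tried without the original file. The sketch below assumes the lab's CSV was derived from R's built-in state.x77 matrix (the variable descriptions match, but this is an assumption, not something the slides state). It writes a stand-in file to a temporary path and reads it back with read.csv.

```r
# Assumption: the lab's CSV mirrors R's built-in state.x77 matrix.
stand_in <- data.frame(State = rownames(state.x77), state.x77)
names(stand_in) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

# Write a stand-in CSV to a temporary file (hypothetical path, not C:/...).
csv_path <- tempfile(fileext = ".csv")
write.csv(stand_in, csv_path, row.names = FALSE)

# Read it back the way the lab does. header = TRUE is spelled out
# because the shorthand T is an ordinary variable that can be reassigned.
stateX77 <- read.csv(csv_path, header = TRUE)

# Inspect variable names and types before doing anything else.
str(stateX77)
```

Checking str() straight after reading is a cheap way to spot a column that was read as text when a number was expected.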
4
Visualise
Let’s create multiple scatter plots (Task 2):
> pairs(stateX77)
There appear to be some negative and positive linear relationships between the variables. We can confirm this by looking at the correlation matrix (Task 2):
> cor(stateX77)
Error in cor(stateX77) : 'x' must be numeric
One of the variables is not numeric.
> str(stateX77)
There are 9 variables and the first column is not numeric. We can select just the numeric columns as follows:
> cor(stateX77[,2:9])
To make the output neater we can round the numbers:
> round( cor(stateX77[,2:9]), 2)
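As a variation on the hard-coded 2:9 column range, the numeric columns can be picked out by type. This is a minimal sketch on a stand-in data frame built from R's built-in state.x77 (an assumption about where the lab's CSV comes from); the names numeric_cols, cor_matrix and life_cor are illustrative only.

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

# Select columns by type instead of by position, so the code still works
# if the non-numeric column is not the first one.
numeric_cols <- sapply(stateX77, is.numeric)
cor_matrix <- round(cor(stateX77[, numeric_cols]), 2)

# Correlation of each predictor with LifeExp, strongest first.
life_cor <- cor_matrix["LifeExp", colnames(cor_matrix) != "LifeExp"]
print(life_cor[order(abs(life_cor), decreasing = TRUE)])
```

Sorting by absolute value puts strong negative and strong positive relationships side by side, which is what matters when judging whether a linear model is worth building.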
5
Build Model
Let’s build a model for life expectancy and see if we can predict it from the other variables. There are some weak to strong correlations between the life expectancy variable and some of the others, and the plots indicate that these relationships may be linear (Task 3). In building the model we may have some theory that suggests which variables we should be interested in or need to include. Here we will start with all the variables and then remove those which aren’t contributing much to the model.
6
For the present let’s not use State when we build the model (Task 4).
> stateX77.fit1 <- lm( LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area, data=stateX77)
Now that we have built the model, let’s inspect it (Task 5):
> summary( stateX77.fit1 )
7
Summary
Call:
lm(formula = LifeExp ~ Popul + Income + Illit + Murder + HSGrad + Frost + Area, data = stateX77)
[Residual quantiles and coefficient table: numeric values not preserved]
Coefficients flagged as significant: (Intercept) ***, p < 2e-16; Murder ***, p of order 1e-08; HSGrad *
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error on 42 degrees of freedom
F-statistic on 7 and 42 DF, p-value: 2.534e-10
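The coefficient table shown by summary() can also be extracted as a matrix, which makes it easier to sort by p-value. The sketch below refits the model on a stand-in data frame built from R's built-in state.x77 (an assumption about the source of the lab's CSV); coef_table is an illustrative name.

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

stateX77.fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad +
                      Frost + Area, data = stateX77)

# coef(summary(...)) returns the matrix of estimates, standard errors,
# t values and p-values that summary() prints.
coef_table <- coef(summary(stateX77.fit1))

# Order rows so the smallest p-values come first.
print(coef_table[order(coef_table[, "Pr(>|t|)"]), ])

# Residual degrees of freedom: observations minus estimated coefficients.
print(df.residual(stateX77.fit1))
```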
8
Eject Insignificance
Let’s look at how the coefficients contribute to the prediction. Roughly speaking, the higher the p-value of a coefficient, the less that variable contributes to the model. Let’s remove the variables that contribute least, all in one go: Income, Illit and Area (Task 6).
> stateX77.fit2 <- lm( LifeExp ~ Popul + Murder + HSGrad + Frost, data=stateX77)
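One way to check that dropping several variables at once has not hurt the model is a nested-model F test with anova(). This step is not part of the lab itself; it is a sketch on a stand-in data frame built from R's built-in state.x77 (an assumption about the source of the lab's CSV).

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

stateX77.fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad +
                      Frost + Area, data = stateX77)
stateX77.fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost,
                    data = stateX77)

# Smaller model first, larger model second. The F test asks whether the
# extra terms (Income, Illit, Area) reduce the residual sum of squares
# by more than chance would explain.
comparison <- anova(stateX77.fit2, stateX77.fit1)
print(comparison)
```

A large p-value in the second row suggests the dropped variables were not adding much; a small one suggests the reduced model has lost something.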
9
Can R Help?
Deciding what should be kept or left out in a regression model has been investigated for many years and there are a number of standard approaches. There is even a function in R that does this for us:
> final.stateX77 <- step( stateX77.fit1 )
How does ‘final.stateX77’ compare to ‘stateX77.fit2’?
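To compare the two models programmatically, the terms kept and dropped by step() can be listed. The sketch below runs on a stand-in data frame built from R's built-in state.x77 (an assumption about the source of the lab's CSV); trace = 0 only silences the step-by-step log.

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

stateX77.fit1 <- lm(LifeExp ~ Popul + Income + Illit + Murder + HSGrad +
                      Frost + Area, data = stateX77)

# step() searches for a model with lower AIC, by default dropping terms
# from the model it is given.
final.stateX77 <- step(stateX77.fit1, trace = 0)

full_terms  <- attr(terms(stateX77.fit1), "term.labels")
final_terms <- attr(terms(final.stateX77), "term.labels")

print(formula(final.stateX77))          # the model step() settled on
print(setdiff(full_terms, final_terms)) # the terms it removed
```

Note that step() selects on AIC rather than on individual p-values, so it need not agree with a model refined by hand.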
10
Residual Checking
We still need to check the residuals of our model. Ideally, they should follow a normal distribution. If they don’t, then we need to examine the source data more carefully and rebuild the model. At this stage we will only check the residuals (Task 7). We want to see all the points along the line or as close to it as possible.
> qqnorm( stateX77.fit2$residuals )
> qqnorm( stateX77.fit2$residuals, ylab="Residuals" )
> qqline( stateX77.fit2$residuals )
What can you see? There is a problem here: the points do not seem to fall along the line.
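A Q-Q plot is judged by eye, so a numeric check can be a useful companion. The sketch below is not part of the lab: it applies the Shapiro-Wilk normality test and lists the largest standardised residuals, using a stand-in data frame built from R's built-in state.x77 (an assumption about the source of the lab's CSV).

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

stateX77.fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost,
                    data = stateX77)

# Shapiro-Wilk test: the null hypothesis is that the residuals are normal,
# so a small p-value is evidence against normality.
normality <- shapiro.test(residuals(stateX77.fit2))
print(normality)

# Standardised residuals put every observation on a comparable scale;
# the states with the largest absolute values are the ones to look at.
std_res <- rstandard(stateX77.fit2)
names(std_res) <- stateX77$State
print(head(sort(abs(std_res), decreasing = TRUE), 5))
```

With only 50 observations the test has limited power, so it is best read alongside the Q-Q plot rather than in place of it.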
11
Our conclusion is that the residuals show some departure from a normal distribution and we need to revise the model further. Although we have a relatively high adjusted R-squared value, which indicates that the model explains about 69% of the variability of the response data around its mean, the residual plot suggests there is some bias in our model.
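The adjusted R-squared value quoted in the summary can be reproduced by hand, which helps make clear that it is a proportion (multiply by 100 for a percentage). This is a minimal sketch on a stand-in data frame built from R's built-in state.x77 (an assumption about the source of the lab's CSV); n, p and adj_manual are illustrative names.

```r
# Assumption: stand-in for the lab's CSV, built from the built-in state.x77.
stateX77 <- data.frame(State = rownames(state.x77), state.x77)
names(stateX77) <- c("State", "Popul", "Income", "Illit", "LifeExp",
                     "Murder", "HSGrad", "Frost", "Area")

stateX77.fit2 <- lm(LifeExp ~ Popul + Murder + HSGrad + Frost,
                    data = stateX77)
fit_summary <- summary(stateX77.fit2)

n <- nrow(stateX77)                  # number of observations
p <- length(coef(stateX77.fit2)) - 1 # number of predictors (no intercept)

# Adjusted R-squared penalises R-squared for the number of predictors:
# 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_manual <- 1 - (1 - fit_summary$r.squared) * (n - 1) / (n - p - 1)

print(c(manual = adj_manual, from_summary = fit_summary$adj.r.squared))
```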
12
Your Turn – Milk Production
The Dairy Herd Improvement Cooperative (DHI) in Upstate New York collects and analyzes data on milk production. One question of interest here is how to develop a suitable model to predict current milk production from a set of measured variables. The response variable (current milk production in pounds) and the predictor variables are given in Table 1.1. Samples are taken once a month during milking. The period that a cow gives milk is called lactation. Number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before beginning the next lactation.
13
http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0470905840.html
14
What can you discover?
Given the dataset in ‘milk_production.csv’, build a multiple linear regression model.
Show the multiple scatter plots and correlation matrix for the data. (10 marks)
Build the initial model for predicting milk production and show the summary of this model. (20 marks)
In your opinion, can the model be improved further? If so, show the summary of the improved model. (10 marks)
Plot the residuals and give a short interpretation of them. (10 marks)