Dr J Frost (jfrost@tiffin.kingston.sch.uk) www.drfrostmaths.com S1: Chapter 7 Regression Last modified: 22nd January 2016
What is regression?
[Graph: exam mark ($y$) against time spent revising ($x$), with the line $y = 20 + 3x$.]
I record people’s exam marks as well as the time they spent revising. I want to predict how well someone will do based on the time they spent revising. How would I do this?
What we’ve done here is come up with a model to explain the data, i.e. a line $y = a + bx$. We’ve then tried to set $a$ and $b$ such that the resulting $y$ values match the actual exam marks as closely as possible. The ‘regression’ bit is the act of setting the parameters of our model (here the gradient and y-intercept of the line of best fit) to best explain the data.
What is regression?
[Graph: rabbit population ($y$) against time ($x$).]
In this chapter we only cover linear regression, where our chosen model is a straight line. But in general we could use any model that might best explain the data. Population tends to grow exponentially rather than linearly, so we might make our model $y = a \times b^x$ and then use regression to work out the best $a$ and $b$.
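One possible way to fit a model of the form $y = a \times b^x$ is sketched below. It is an assumption about method, not something these slides prescribe: taking logs gives $\ln y = \ln a + x \ln b$, which is a straight line in $x$, so linear least squares can be applied to the points $(x, \ln y)$. Note that this minimises squared error in $\ln y$ rather than in $y$. The function name `fit_exponential` and the data are hypothetical; the data are generated exactly from $y = 2 \times 3^x$ so the recovered parameters can be checked.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * b**x by linear least squares on (x, ln y). Requires every y > 0."""
    n = len(xs)
    logs = [math.log(y) for y in ys]
    x_bar = sum(xs) / n
    l_bar = sum(logs) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xl = sum((x - x_bar) * (l - l_bar) for x, l in zip(xs, logs))
    slope = s_xl / s_xx                # estimates ln b
    intercept = l_bar - slope * x_bar  # estimates ln a
    return math.exp(intercept), math.exp(slope)

# Hypothetical data lying exactly on y = 2 * 3**x (not taken from the slides)
times = [0, 1, 2, 3, 4]
population = [2 * 3 ** t for t in times]

a, b = fit_exponential(times, population)
print(f"a = {a:.3f}, b = {b:.3f}")
```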
Explanatory and Response Variables
[Graph: exam mark ($y$) against time spent revising ($x$).]
! An independent (or explanatory) variable is one that is set independently of other variables. It goes on the x-axis.
! A dependent (or response) variable is one whose values are determined by the values of the independent variable. It goes on the y-axis.
So how do we numerically find the line of best fit?
[Graph: data points with residuals $e_1, e_2, \dots, e_7$ marked between each point and the line.]
The residuals are the errors between the $y$ value predicted by the model and the $y$ value of each data point.
We minimise the total of the squares of the residuals, $\sum e_i^2$. The resulting line is known as a least squares regression line.
Why squared?
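The quantity being minimised can be sketched in a few lines of Python. The function name `sum_squared_residuals` is hypothetical, and the data are borrowed from the mass and length Example slide later in this chapter. The comparison lines (gradient 0.25, intercept 45.0) are arbitrary choices for illustration only.

```python
def sum_squared_residuals(a, b, xs, ys):
    """Total of e_i**2, where e_i = observed y - predicted y for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Data from the Example slide (mass in kg, length in cm)
mass = [20, 40, 60, 80, 100]
length = [48, 55.1, 56.3, 61.2, 68]

# The regression line from the Example slide, and two arbitrary alternatives
best = sum_squared_residuals(43.89, 0.2305, mass, length)
steeper = sum_squared_residuals(43.89, 0.25, mass, length)
higher = sum_squared_residuals(45.0, 0.2305, mass, length)

print(best, steeper, higher)
```

Both alternative lines should give a larger total than the regression line, which is what "least squares" means.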
So how do we numerically find the line of best fit?
[Graph: the line $y = a + bx$ with residuals $e_1, e_2, \dots, e_7$.]
Notice that in regression, we write the terms in ascending powers of $x$, contrary to algebraic convention. Hence $a$ is the $y$-intercept, not the gradient.
It turns out (using differentiation techniques you’ll see in C2) that the $a$ and $b$ which minimise the total squared error are:
$b = \frac{S_{xy}}{S_{xx}}$, $a = \bar{y} - b\bar{x}$
The mean point $(\bar{x}, \bar{y})$ is on the line, i.e. $\bar{y} = a + b\bar{x}$. Hence this gives us $a$.
To remember the gradient, I think of the chromosomes of men (XY) and women (XX). Men come out top!
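As a minimal sketch, the two formulas translate directly into code. It uses the standard S1 definitions $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$. The function name `regression_line` and the data are hypothetical; the points are chosen to lie exactly on $y = 20 + 3x$ so the result is easy to check by eye.

```python
def regression_line(xs, ys):
    """Return (a, b) for the least squares regression line y = a + b*x of y on x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum(x * x for x in xs) - sum(xs) ** 2 / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    b = s_xy / s_xx        # gradient
    a = y_bar - b * x_bar  # intercept, since (x_bar, y_bar) lies on the line
    return a, b

# Hypothetical points lying exactly on y = 20 + 3x
time_revising = [1, 2, 4, 7]
marks = [20 + 3 * t for t in time_revising]

a, b = regression_line(time_revising, marks)
print(a, b)
```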
Example
Mass, $x$ (kg):    20   40    60    80    100
Length, $y$ (cm):  48   55.1  56.3  61.2  68
a) Calculate $S_{xx}$ and $S_{xy}$. (You may use that $\sum x = 300$, $\sum x^2 = 22\,000$, $\bar{x} = 60$, $\sum xy = 18\,238$, $\sum y^2 = 16\,879.14$, $\sum y = 288.6$, $\bar{y} = 57.72$.)
$S_{xx} = 4000$, $S_{xy} = 922$
b) Calculate the regression line of $y$ on $x$.
$b = \frac{S_{xy}}{S_{xx}} = 0.2305$, $a = \bar{y} - b\bar{x} = 43.89$, so $y = 43.89 + 0.2305x$
Broculator Tip: Your calculator will calculate $a$ and $b$ while in STATS mode (under the Reg menu).
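The working for this example can be checked from the summary statistics alone. The sketch below assumes the standard definitions $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and $S_{xy} = \sum xy - \frac{\sum x \sum y}{n}$; the variable names are illustrative.

```python
# Summary statistics quoted in the example (n = 5 data points)
n = 5
sum_x, sum_y = 300, 288.6
sum_x2, sum_xy = 22_000, 18_238

s_xx = sum_x2 - sum_x ** 2 / n     # sum of x^2 minus (sum of x)^2 / n
s_xy = sum_xy - sum_x * sum_y / n  # sum of xy minus (sum of x)(sum of y) / n

b = s_xy / s_xx                    # gradient
a = sum_y / n - b * sum_x / n      # intercept: y_bar - b * x_bar

print(f"S_xx = {s_xx:.0f}, S_xy = {s_xy:.0f}")
print(f"y = {a:.2f} + {b:.4f}x")
```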
Test Your Understanding
May 2009 Q5
Note that once you have found $a$ and $b$, you still need to write the equation at the end for the final mark!
A common error is to do $\frac{S_{wl}}{S_{ww}}$. The first row (the explanatory variable) is always the ‘$x$’ one.
! For ‘comment on reliability of estimate’ questions, always one of:
Reliable (1) because inside the range of the data/interpolating (1).
Unreliable (1) because outside the range of the data/extrapolating (1).
Reliable (1) because just outside the range of the data (1).
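The interpolating/extrapolating distinction is a simple range check, sketched below. The function name `classify_estimate` is hypothetical, and the data are the mass values from the earlier Example slide rather than the exam question. The "just outside the range" case is a judgement call and is deliberately not automated here.

```python
def classify_estimate(x_new, xs):
    """Say whether predicting at x_new interpolates or extrapolates the observed x values."""
    if min(xs) <= x_new <= max(xs):
        return "interpolating"
    return "extrapolating"

# Mass values (kg) from the earlier Example slide
mass = [20, 40, 60, 80, 100]

inside = classify_estimate(50, mass)    # within 20..100
outside = classify_estimate(150, mass)  # well beyond 100
print(inside, outside)
```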
Exercises
On provided sheet. Answers on next slides.
(Note that Q7 and Q8 use ‘coding’. We will cover this next lesson.)
Help with wordy questions:
“Explain why this diagram would support the fitting of a regression line of $y$ onto $x$.” The variables have a linear relationship, i.e. the points are close to the implied straight line of best fit.
“Interpret the gradient/slope of the line/interpret $b$.” As (x) increases by 1, (y) increases/decreases by ___.
“Interpret the y-intercept/interpret $a$.” The value (y) takes when (x) is 0.
“Which is the explanatory variable? Explain your answer.” (x) is the explanatory variable because (x) influences (y).
“Explain the method of least squares.” “We minimise the sum of the squares of the residuals” (draw a diagram).
Exercises
Coding
We’ve previously considered how coding affects means, variances and the PMCC. So how does it affect the regression line?
Eight samples of carbon steel were produced with different percentages, $c$, of carbon in them. Each sample was heated in a furnace until it melted and the temperature, $m$ in °C, at which it melted was recorded. The results were coded such that $x = 10c$ and $y = \frac{m - 700}{5}$. Suppose that we found the regression line of $y$ on $x$ was $y = 36.216 - 4.048x$. Then what is the regression line in terms of the original variables $c$ and $m$?
Just replace the variables using the substitution and rearrange. That’s it!
$\frac{m - 700}{5} = 36.216 - 4.048(10c)$
$m = 881.08 - 202.4c$
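The substitute-and-rearrange step can be written once for any linear coding. The sketch assumes the coding has the form $x = \alpha X + \beta$ and $y = \gamma Y + \delta$, where $X$ and $Y$ are the original variables. Substituting into $y = a + bx$ gives $Y = \frac{a + b\beta - \delta}{\gamma} + \frac{b\alpha}{\gamma} X$. The function name `decode_regression_line` and its parameter names are hypothetical.

```python
def decode_regression_line(a, b, alpha, beta, gamma, delta):
    """Coded line y = a + b*x, with x = alpha*X + beta and y = gamma*Y + delta.

    Returns (intercept, gradient) of the line Y = intercept + gradient*X
    in the original variables.
    """
    intercept = (a + b * beta - delta) / gamma
    gradient = b * alpha / gamma
    return intercept, gradient

# Carbon steel example: x = 10c, and y = (m - 700)/5 = 0.2*m - 140
intercept, gradient = decode_regression_line(
    a=36.216, b=-4.048, alpha=10, beta=0, gamma=0.2, delta=-140
)
print(f"m = {intercept:.2f} + ({gradient:.1f})c")
```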
More Examples
The length $x$ and height $y$ of an Ewok were coded using $q = x - 30$ and $r = 2y + 11$. If the equation of the regression line of $r$ on $q$ is $r = -3 + 20q$, what is the equation of the regression line of $y$ on $x$?
$2y + 11 = 20(x - 30) - 3$
$y = -307 + 10x$
The maths mark $x$ and English mark $y$ of some stormtroopers are coded using $a = \frac{x}{2}$ and $b = y - 10$. If the equation of the regression line of $b$ on $a$ is $b = 4 + 5a$, what is the equation of the regression line of $y$ on $x$?
$y - 10 = 5\left(\frac{x}{2}\right) + 4$
$y = 14 + 2.5x$
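One way to sanity-check answers like these is a round trip: code an $x$ value, predict with the coded line, decode the result, and compare with the claimed line in the original variables. The helper name `lines_agree` and the test $x$ values are hypothetical choices for illustration.

```python
def lines_agree(coded_line, code_x, decode_y, original_line, xs, tol=1e-9):
    """True if coding x, predicting with coded_line, then decoding matches original_line."""
    return all(
        abs(decode_y(coded_line(code_x(x))) - original_line(x)) <= tol
        for x in xs
    )

# Ewok: q = x - 30, r = 2y + 11, r = -3 + 20q; claimed answer y = -307 + 10x
ewok_ok = lines_agree(
    coded_line=lambda q: -3 + 20 * q,
    code_x=lambda x: x - 30,
    decode_y=lambda r: (r - 11) / 2,
    original_line=lambda x: -307 + 10 * x,
    xs=[0, 30, 45.5],
)

# Stormtroopers: a = x/2, b = y - 10, b = 4 + 5a; claimed answer y = 14 + 2.5x
trooper_ok = lines_agree(
    coded_line=lambda a: 4 + 5 * a,
    code_x=lambda x: x / 2,
    decode_y=lambda b: b + 10,
    original_line=lambda x: 14 + 2.5 * x,
    xs=[0, 10, 63],
)

print(ewok_ok, trooper_ok)
```

Two distinct $x$ values are enough to pin down a straight line, so agreement at three points is a reasonable check for a linear coding.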
Exercises (continued)
Just For Fun…