Regression Techniques
Linear Regression

A technique used to predict/model a continuous response (outcome) from a set of continuous independent variables. The linear relationship can be written as:

y = m x + c

Here x is the independent variable, which controls the dependent/outcome variable y; m is the slope of the regression line and c is the intercept term.
One variable linear regression: An example

Years of Experience | Salary ($K)
 3 | 30
 8 | 57
 9 | 64
13 | 72
 – | 36
 6 | 43
11 | 59
21 | 90
 1 | 20
16 | 83

y = salary ($K) and x = experience (years)
If an employee has 10 years of experience, what will be their expected salary?
Target: minimise (actual salary − expectation), with y = m x + c (m = ?, c = ?)

Regression coefficient: m = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{N} (x_i − x̄)²
Intercept: c = ȳ − m x̄
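The closed-form formulas above can be sketched in a few lines of Python. This is a minimal illustration using only the table rows whose Years values are fully legible, so the fitted slope and intercept come out slightly different from the slide's rounded m = 3.5 and c = 23.6.

```python
# Least-squares fit of y = m*x + c using the closed-form formulas:
#   m = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
#   c = y_bar - m * x_bar
# Data: the legible (experience, salary) rows of the slide's table.
x = [3, 8, 9, 13, 6, 11, 21, 1, 16]       # experience (years)
y = [30, 57, 64, 72, 43, 59, 90, 20, 83]  # salary ($K)

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
    / sum((xi - x_bar) ** 2 for xi in x)
c = y_bar - m * x_bar

print(round(m, 2), round(c, 2))  # slope and intercept
print(round(m * 10 + c, 1))      # expected salary at 10 years of experience
```

The prediction for 10 years of experience is simply the fitted line evaluated at x = 10.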
One variable linear regression

The best model has the least error: minimise (actual salary − expectation).

Residual at the i-th value: ε_i = y_i − m x_i − c
ε_1 = y_1 − 3.5 x_1 − 23.6 = 30 − 3.5·3 − 23.6 = −4.1

Sum of squared errors: SSE = Σ_{i=1}^{N} ε_i²
Root-mean-squared error: RMSE = √(Σ_{i=1}^{N} ε_i² / N) = √(SSE / N)
Goodness of fit

The variability explained by the regression model is measured by R² (R² = 1 − SSE/SST):

R² = 1 − Σ_{i=1}^{N} (y_i − f(x_i))² / Σ_{i=1}^{N} (y_i − ȳ)²

For one-variable regression this equals the squared correlation:

R² = [Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ)]² / [Σ_{i=1}^{N} (x_i − x̄)² · Σ_{i=1}^{N} (y_i − ȳ)²]

R² varies from 0 to 1:
R² = 1 means perfect prediction, with SSE = 0
R² = 0 means prediction no better than the average line (SSE = SST)

The correlation coefficient r measures the strength of the linear relationship between y and x:

r = Σ_{i=1}^{N} (x_i − x̄)(y_i − ȳ) / √(Σ_{i=1}^{N} (x_i − x̄)² · Σ_{i=1}^{N} (y_i − ȳ)²)

so for one-variable regression, r² = R².
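A short sketch ties these definitions together on the salary example, checking numerically that r² equals R² for a one-variable fit (data are the legible table rows, so the numbers are illustrative):

```python
import math

# R^2 and correlation r for the one-variable salary example.
x = [3, 8, 9, 13, 6, 11, 21, 1, 16]
y = [30, 57, 64, 72, 43, 59, 90, 20, 83]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

m = sxy / sxx            # fitted slope
c = y_bar - m * x_bar    # fitted intercept

sse = sum((yi - (m * xi + c)) ** 2 for xi, yi in zip(x, y))
sst = syy
r2 = 1 - sse / sst              # variability explained by the model
r = sxy / math.sqrt(sxx * syy)  # correlation coefficient

print(round(r2, 3), round(r, 3))
assert abs(r ** 2 - r2) < 1e-9  # one-variable regression: r^2 == R^2
```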
Linear regression: an example

Reading the regression summary, for each coefficient row ((Intercept) and each predictor):
Estimate – the coefficient estimated by the linear regression
Std. Error – the uncertainty of the coefficient estimate
t value – ratio of the Estimate to its Std. Error (a larger |t| is the target)
Pr(>|t|) – the probability of seeing a |t| this large if the true coefficient were 0 (a smaller value is the target)

Multiple R²
Adjusted R² – a correction factor for the number of explanatory variables:
R²_adj = 1 − ((N − 1)/(N − d)) (1 − R²), where d = total number of variables
p-value – significance of the whole model (the lower the better)
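The adjusted-R² correction is easy to compute by hand. A minimal sketch, with illustrative numbers that are assumed rather than taken from the slide:

```python
def adjusted_r2(r2, n, d):
    """Adjusted R^2 per the slide's correction:
    1 - ((N - 1) / (N - d)) * (1 - R^2), d = total number of variables."""
    return 1 - ((n - 1) / (n - d)) * (1 - r2)

# Assumed example: R^2 = 0.83 from N = 25 observations and d = 5 variables.
# The penalty grows with d, so adding weak predictors lowers adjusted R^2.
print(round(adjusted_r2(0.83, 25, 5), 3))
```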
Exploring the Predictor variables
Relation between dependent and independent variables: correlation and R²
Multiple Linear Regression

A set of predictor variables controls the outcome y. The linear regression function can be extended as:

y = m₁ x₁ + m₂ x₂ + … + c
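With several predictors, the coefficients are found by solving the normal equations (XᵀX)b = Xᵀy. A self-contained sketch, using synthetic data (generated from y = 2x₁ + 3x₂ + 5, an assumed toy relationship) purely to show the mechanics:

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting (a is a list of rows)."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit(rows, y):
    """rows: list of [x1, x2, ...]; returns [m1, m2, ..., c]."""
    X = [row + [1.0] for row in rows]  # append an intercept column
    npts, p = len(X), len(X[0])
    # Build X^T X and X^T y, then solve the normal equations.
    xtx = [[sum(X[i][j] * X[i][k] for i in range(npts)) for k in range(p)]
           for j in range(p)]
    xty = [sum(X[i][j] * y[i] for i in range(npts)) for j in range(p)]
    return solve(xtx, xty)

rows = [[1, 2], [2, 1], [3, 4], [4, 2], [5, 5]]
y = [2 * x1 + 3 * x2 + 5 for x1, x2 in rows]   # exact linear data
print([round(b, 3) for b in fit(rows, y)])     # recovers [2.0, 3.0, 5.0]
```

Because the toy data lie exactly on a plane, the solver recovers the generating coefficients; real data would leave residual error.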
Combining multiple variables for Regression

Multicollinearity can impact R²: adding Year together with Age, or WinterRain together with Age, has a similar impact, but removing any of these variables can degrade the model performance.
Considering All variables for Regression

Combining all variables reduced the adjusted R²
Not all variables are significant
The best model is the one with maximum R² and minimum SSE
However, R² = 1 is not always an indicator of a good predictor of the dependent variable
Further testing on an independent dataset can ensure the reliability of the regression model
Independent testing of the Model

The best model has 4 variables (AGST, Age, HarvestRain and WinterRain), with R² = 0.83.
Testing this best-performing regression model on a new dataset:
SSE = 0.069, SST = 0.336
R² = 1 − SSE/SST = 0.7944
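Out-of-sample R² uses the same 1 − SSE/SST formula, but with SSE computed from the model's predictions on the held-out data. A small sketch (the helper name is ours, not from the slide):

```python
def holdout_r2(y_true, y_pred):
    """Out-of-sample R^2 = 1 - SSE/SST on a held-out test set."""
    y_bar = sum(y_true) / len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    sst = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - sse / sst

# With the slide's (rounded) totals SSE = 0.069 and SST = 0.336:
print(round(1 - 0.069 / 0.336, 2))  # roughly 0.79
```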
Regression with Categorical variables

Linear regression is inapplicable when the response variable is categorical, such as yes/no, fail/pass, alive/dead, good/bad.

Logistic regression models the probability of occurrence of a categorical event as a linear function of a set of predictor variables. The model assumes the response Y follows a binomial distribution, giving the logistic or 'S'-shaped curve:

log_e(Y / (1 − Y)) = β₀ + Σ_{i=1}^{N} β_i x_i

which converts to

Y = 1 / (1 + e^−(β₀ + Σ_{i=1}^{N} β_i x_i))

N = number of predictor variables; Y varies between 0 and 1.
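The logistic link can be sketched directly: the log-odds are linear in the predictors, and inverting gives the S-shaped curve above (coefficient values below are assumed for illustration).

```python
import math

def logistic(b0, betas, xs):
    """Y = 1 / (1 + e^-(b0 + sum(b_i * x_i))) -- the inverted log-odds."""
    z = b0 + sum(b * x for b, x in zip(betas, xs))  # linear predictor
    return 1.0 / (1.0 + math.exp(-z))

# A zero linear predictor sits at the midpoint of the S-curve:
print(logistic(0.0, [1.0], [0.0]))  # 0.5

# Probabilities stay in (0, 1) and rise monotonically with the predictor:
probs = [logistic(-1.0, [0.5], [x]) for x in range(-4, 5)]
assert all(p0 < p1 for p0, p1 in zip(probs, probs[1:]))
assert all(0.0 < p < 1.0 for p in probs)
```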
An example

Credit classification dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/
Attributes of the Dataset
Trends in the Good and Bad credit approval groups
Finding facts Is there any trend/pattern within the dataset that can link to Good/Bad credit?
The Logistic regression model

Considering Age, Loan amount and Duration of the current credit account as predictors of the probability of a Good or Bad loan.

AIC (Akaike Information Criterion) scores a model by trading goodness of fit against the number of fitted parameters; among candidate models, the one with the lowest AIC is preferred.