Regression Techniques
Linear Regression: A technique used to predict/model a continuous outcome from a set of continuous independent variables. The linear relationship can be written as: y = mx + c. Here, x is the independent variable that controls the dependent/outcome variable y; m is the slope of the regression line and c is the intercept term.
One variable linear regression: An example
Years of Experience    Salary ($K)
3                      30
8                      57
9                      64
13                     72
?                      36
6                      43
11                     59
21                     90
1                      20
16                     83

y = salary ($K) and x = experience (years). If an employee has 10 years of experience, what will be their expected salary?
Target: minimize (actual salary ā expectation), with y = mx + c (m = ? and c = ?)
Regression coefficient: m = Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ) / Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)²
Intercept: c = ȳ ā m·x̄
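The slope and intercept formulas above can be sketched in plain Python. The data below are the table rows with both values legible (the row whose experience value is unreadable is omitted), so the fitted coefficients differ slightly from the slide's rounded m ā 3.5 and c ā 23.6:

```python
# One-variable least squares:
#   m = Ī£(xi - xĢ„)(yi - yĢ„) / Ī£(xi - xĢ„)^2,   c = yĢ„ - m * xĢ„
xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]       # years of experience
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]  # salary ($K)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)

m = num / den              # slope (regression coefficient)
c = y_bar - m * x_bar      # intercept

# Expected salary for 10 years of experience
predicted_salary_10yrs = m * 10 + c
```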
One variable linear regression
The best model will have the least error: minimize (actual salary ā expectation).
Error at the i-th observation: eᵢ = yᵢ ā m·xᵢ ā c
e₁ = y₁ ā 3.5·x₁ ā 23.6 = 30 ā 3.5·3 ā 23.6 = ā4.1
Sum of squared errors: SSE = Ī£ᵢ₌₁ⁿ eᵢ²
Root-mean-squared error: RMSE = ā(Ī£ᵢ₌₁ⁿ eᵢ² / n) = ā(SSE/n)
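A short sketch of the residual, SSE, and RMSE computations, using the slide's fitted values m = 3.5 and c = 23.6 on the legible table rows:

```python
import math

# Fitted model from the slide: salary ā 3.5 * experience + 23.6
m, c = 3.5, 23.6
xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]

# Residual at each observation: e_i = y_i - m*x_i - c
errors = [y - m * x - c for x, y in zip(xs, ys)]

sse = sum(e ** 2 for e in errors)      # sum of squared errors
rmse = math.sqrt(sse / len(errors))    # root-mean-squared error

e1 = errors[0]  # first residual: 30 - 3.5*3 - 23.6 = -4.1, as on the slide
```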
Goodness of fit
The variability explained by the regression model is measured by R² (R² = 1 ā SSE/SST):
R² = 1 ā Ī£ᵢ₌₁ⁿ (f(xᵢ) ā yᵢ)² / Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)² = (Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ))² / (Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)² · Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)²)
R² varies from 0 to 1.
R² = 1 means perfect prediction, with SSE = 0.
R² = 0 means poor prediction: predicting through the average line (SSE = SST).
The correlation coefficient r measures the strength of the linear relationship between y and x:
r = Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ) / ā(Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)² · Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)²)
so in one-variable regression r² = R², i.e. r = ±āR² with the sign of the slope.
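A sketch checking the identity r² = R² for one-variable regression, in plain Python on the legible rows of the earlier salary table:

```python
import math

xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sums of squares and cross-products
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Fit the least-squares line, then compute R^2 = 1 - SSE/SST
m = sxy / sxx
c = y_bar - m * x_bar
sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)
```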
Linear regression: An example
Estimate ā estimated coefficient from the linear regression.
Std. Error ā variation of the coefficient around the estimated value.
t value ā ratio of the estimate to its standard error (a higher |t| value is better).
Pr(>|t|) ā probability of observing such an estimate if the true coefficient were 0 (a lower value is better).
Multiple R²
Adjusted R² ā a correction factor for the number of explanatory variables:
adjusted R² = 1 ā ((n ā 1)/(n ā d)) · (1 ā R²), where d = total number of variables.
P-value ā significance of the whole model (the lower the better).
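The adjusted-R² correction can be computed directly. The values below (R² = 0.83, n = 25 observations, d = 5 variables) are made-up numbers for illustration, not taken from the slides:

```python
# Adjusted R^2 = 1 - ((n - 1) / (n - d)) * (1 - R^2)
# Hypothetical values for illustration:
r_squared = 0.83   # multiple R^2
n = 25             # number of observations
d = 5              # total number of variables (including the intercept term)

adjusted_r_squared = 1 - ((n - 1) / (n - d)) * (1 - r_squared)
# Adding weak predictors raises R^2 but can lower adjusted R^2.
```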
Exploring the Predictor variables
Relation between dependent and independent variables: Corr and R2
Multiple Linear Regression
A set of predictor variables controls the outcome, y. The linear regression function can be extended as: y = m₁x₁ + m₂x₂ + … + mₙxₙ + c
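Multiple linear regression fits all the coefficients at once by least squares. A minimal sketch using NumPy's `lstsq` on made-up data generated from y = 2x₁ + 3x₂ + 5 (the data and coefficients are illustrative, not from the slides):

```python
import numpy as np

# Made-up observations of two predictor variables
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2 * x1 + 3 * x2 + 5          # exact linear outcome, no noise

# Design matrix with a column of ones for the intercept c
A = np.column_stack([x1, x2, np.ones_like(x1)])

# Least-squares solution of A @ [m1, m2, c] = y
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
m1, m2, c = coef
```

With noise-free data the recovered coefficients match the generating model exactly; with real data they are the least-squares estimates.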
Combining multiple variables for Regression
Multicollinearity among predictors can affect R². Adding Year together with Age, or WinterRain together with Age, has a similar impact, but removing any of these variables can degrade the model's performance.
Considering All variables for Regression
Combining all variables has reduced the adjusted R².
Not all variables are significant.
The best model will be the one with maximum R² and minimum SSE.
However, R² = 1 is not always an indicator of a good predictor of the dependent variable.
Further testing on an independent dataset can ensure the reliability of the regression model.
Independent testing of the Model
The best model has 4 variables (AGST, Age, HarvestRain and WinterRain) and R² = 0.83. Testing this best-performing regression model on a new dataset gives SSE = 0.069 and SST = 0.336, so out-of-sample R² = 1 ā SSE/SST ā 0.79.
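The out-of-sample R² follows directly from the SSE and SST reported on the slide:

```python
# Goodness of fit on the held-out dataset
sse = 0.069   # sum of squared errors of the model's predictions
sst = 0.336   # total sum of squares around the test-set mean

r_squared = 1 - sse / sst
# Slightly lower than the training R^2 of 0.83, as is typical on unseen data.
```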
Regression with Categorical variables
Linear regression is inapplicable where the response variable is categorical, such as yes/no, fail/pass, alive/dead, good/bad.
Logistic regression models the probability of occurrence of a categorical event as a function of a linear combination of predictor variables.
The model assumes the probability follows a logistic or "S"-shaped curve:
log(Y / (1 ā Y)) = β₀ + Ī£ⱼ₌₁ᓺ βⱼxⱼ
which converts to
Y = 1 / (1 + e^(ā(β₀ + Ī£ⱼ₌₁ᓺ βⱼxⱼ)))
where N is the number of predictor variables and Y varies between 0 and 1.
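The log-odds-to-probability conversion above can be sketched as follows (the coefficients and predictor values are made-up numbers for illustration, not fitted from any dataset):

```python
import math

def logistic_probability(betas, xs):
    """Convert a linear combination of predictors into a probability.

    betas[0] is the intercept beta_0; betas[1:] pair with the predictors xs.
    """
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))  # log-odds
    return 1 / (1 + math.exp(-z))                             # logistic "S" curve

# Hypothetical coefficients and predictor values
betas = [-1.0, 0.8, -0.5]
xs = [2.0, 1.0]

y = logistic_probability(betas, xs)   # always strictly between 0 and 1
log_odds = math.log(y / (1 - y))      # recovers beta_0 + sum(beta_j * x_j) = 0.1
```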
An example: Credit classification dataset
Attributes of the Dataset
Trends in Good and Bad credit approval group
Finding facts: Is there any trend/pattern within the dataset that can be linked to Good/Bad credit?
The Logistic regression model
Considering Age, Loan amount, and Duration of current credit account as predictors to predict the probability of a Good or Bad loan. AIC (Akaike Information Criterion) trades off the model's goodness of fit against the number of parameters it uses; a lower AIC indicates a better model.