Regression Techniques
Linear Regression: A technique used to predict/model a continuous outcome from a set of continuous independent variables. The linear relationship can be written as: y = mx + c. Here, x is the independent variable that controls the dependent/outcome variable y; m is the slope of the regression line and c is the intercept term.
One variable linear regression: An example
Years of Experience    Salary ($K)
3                      30
8                      57
9                      64
13                     72
?                      36
6                      43
11                     59
21                     90
1                      20
16                     83

y = salary ($K) and x = experience (years). If an employee has 10 years of experience, what will be their expected salary?
Target: minimize (actual salary ā expectation), with y = mx + c (m = ? and c = ?)
Regression coefficient: m = Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ) / Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)²
Intercept: c = ȳ ā m·x̄
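The slope and intercept formulas above can be sketched in plain Python. The data below are the table rows with both values legible (the row whose experience value is unreadable is omitted), so the fitted coefficients differ slightly from the slide's rounded m ā 3.5 and c ā 23.6:

```python
# One-variable least squares:
#   m = Ī£(xi - xĢ„)(yi - yĢ„) / Ī£(xi - xĢ„)^2,   c = yĢ„ - m * xĢ„
xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]       # years of experience
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]  # salary ($K)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)

m = num / den              # slope (regression coefficient)
c = y_bar - m * x_bar      # intercept

# Expected salary for 10 years of experience
predicted_salary_10yrs = m * 10 + c
```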
One variable linear regression
The best model will have the least error: minimize (actual salary ā expectation).
Error at the i-th observation: eᵢ = yᵢ ā m·xᵢ ā c
e₁ = y₁ ā 3.5·x₁ ā 23.6 = 30 ā 3.5·3 ā 23.6 = ā4.1
Sum of squared errors: SSE = Ī£ᵢ₌₁ⁿ eᵢ²
Root-mean-squared error: RMSE = ā(Ī£ᵢ₌₁ⁿ eᵢ² / n) = ā(SSE/n)
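A short sketch of the residual, SSE, and RMSE computations, using the slide's fitted values m = 3.5 and c = 23.6 on the legible table rows:

```python
import math

# Fitted model from the slide: salary ā 3.5 * experience + 23.6
m, c = 3.5, 23.6
xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]

# Residual at each observation: e_i = y_i - m*x_i - c
errors = [y - m * x - c for x, y in zip(xs, ys)]

sse = sum(e ** 2 for e in errors)      # sum of squared errors
rmse = math.sqrt(sse / len(errors))    # root-mean-squared error

e1 = errors[0]  # first residual: 30 - 3.5*3 - 23.6 = -4.1, as on the slide
```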
Goodness of fit
The variability explained by the regression model is measured by R² (R² = 1 ā SSE/SST):
R² = 1 ā Ī£ᵢ₌₁ⁿ (f(xᵢ) ā yᵢ)² / Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)² = (Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ))² / (Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)² · Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)²)
R² varies from 0 to 1.
R² = 1 means perfect prediction, with SSE = 0.
R² = 0 means poor prediction: predicting through the average line (SSE = SST).
The correlation coefficient r measures the strength of the linear relationship between y and x:
r = Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)(yᵢ ā ȳ) / ā(Ī£ᵢ₌₁ⁿ (xᵢ ā x̄)² · Ī£ᵢ₌₁ⁿ (yᵢ ā ȳ)²)
so in one-variable regression r² = R², i.e. r = ±āR² with the sign of the slope.
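A sketch checking the identity r² = R² for one-variable regression, in plain Python on the legible rows of the earlier salary table:

```python
import math

xs = [3, 8, 9, 13, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 43, 59, 90, 20, 83]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sums of squares and cross-products
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Fit the least-squares line, then compute R^2 = 1 - SSE/SST
m = sxy / sxx
c = y_bar - m * x_bar
sse = sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))
r_squared = 1 - sse / syy

# Pearson correlation coefficient
r = sxy / math.sqrt(sxx * syy)
```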
Linear regression: An example
Estimate ā estimated coefficient from the linear regression.
Std. Error ā variation of the coefficient around the estimated value.
t value ā ratio of the estimate to its standard error (a higher |t| value is better).
Pr(>|t|) ā probability of observing such an estimate if the true coefficient were 0 (a lower value is better).
Multiple R²
Adjusted R² ā a correction factor for the number of explanatory variables:
adjusted R² = 1 ā ((n ā 1)/(n ā d)) · (1 ā R²), where d = total number of variables.
P-value ā significance of the whole model (the lower the better).
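The adjusted-R² correction can be computed directly. The values below (R² = 0.83, n = 25 observations, d = 5 variables) are made-up numbers for illustration, not taken from the slides:

```python
# Adjusted R^2 = 1 - ((n - 1) / (n - d)) * (1 - R^2)
# Hypothetical values for illustration:
r_squared = 0.83   # multiple R^2
n = 25             # number of observations
d = 5              # total number of variables (including the intercept term)

adjusted_r_squared = 1 - ((n - 1) / (n - d)) * (1 - r_squared)
# Adding weak predictors raises R^2 but can lower adjusted R^2.
```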
Exploring the Predictor variables
Relation between dependent and independent variables: Corr and R2
Multiple Linear Regression
A set of predictor variables controls the outcome, y. The linear regression function can be extended as: y = m₁x₁ + m₂x₂ + … + mₙxₙ + c
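Multiple linear regression fits all the coefficients at once by least squares. A minimal sketch using NumPy's `lstsq` on made-up data generated from y = 2x₁ + 3x₂ + 5 (the data and coefficients are illustrative, not from the slides):

```python
import numpy as np

# Made-up observations of two predictor variables
x1 = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
x2 = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
y = 2 * x1 + 3 * x2 + 5          # exact linear outcome, no noise

# Design matrix with a column of ones for the intercept c
A = np.column_stack([x1, x2, np.ones_like(x1)])

# Least-squares solution of A @ [m1, m2, c] = y
coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
m1, m2, c = coef
```

With noise-free data the recovered coefficients match the generating model exactly; with real data they are the least-squares estimates.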
Combining multiple variables for Regression
Multicollinearity among predictors can affect R². Adding Year together with Age, or WinterRain together with Age, has a similar impact, but removing any of these variables can degrade the model's performance.
Considering All variables for Regression
Combining all variables has reduced the adjusted R².
Not all variables are significant.
The best model will be the one with maximum R² and minimum SSE.
However, R² = 1 is not always an indicator of a good predictor of the dependent variable.
Further testing on an independent dataset can ensure the reliability of the regression model.
Independent testing of the Model
The best model has 4 variables (AGST, Age, HarvestRain and WinterRain) and R² = 0.83. Testing this best-performing regression model on a new dataset gives SSE = 0.069 and SST = 0.336, so out-of-sample R² = 1 ā SSE/SST ā 0.79.
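The out-of-sample R² follows directly from the SSE and SST reported on the slide:

```python
# Goodness of fit on the held-out dataset
sse = 0.069   # sum of squared errors of the model's predictions
sst = 0.336   # total sum of squares around the test-set mean

r_squared = 1 - sse / sst
# Slightly lower than the training R^2 of 0.83, as is typical on unseen data.
```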
Regression with Categorical variables
Linear regression is inapplicable where the response variable is categorical, such as yes/no, fail/pass, alive/dead, good/bad.
Logistic regression models the probability of occurrence of a categorical event as a function of a linear combination of predictor variables.
The model assumes the probability follows a logistic or "S"-shaped curve:
log(Y / (1 ā Y)) = β₀ + Ī£ⱼ₌₁ᓺ βⱼxⱼ
which converts to
Y = 1 / (1 + e^(ā(β₀ + Ī£ⱼ₌₁ᓺ βⱼxⱼ)))
where N is the number of predictor variables and Y varies between 0 and 1.
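The log-odds-to-probability conversion above can be sketched as follows (the coefficients and predictor values are made-up numbers for illustration, not fitted from any dataset):

```python
import math

def logistic_probability(betas, xs):
    """Convert a linear combination of predictors into a probability.

    betas[0] is the intercept beta_0; betas[1:] pair with the predictors xs.
    """
    z = betas[0] + sum(b * x for b, x in zip(betas[1:], xs))  # log-odds
    return 1 / (1 + math.exp(-z))                             # logistic "S" curve

# Hypothetical coefficients and predictor values
betas = [-1.0, 0.8, -0.5]
xs = [2.0, 1.0]

y = logistic_probability(betas, xs)   # always strictly between 0 and 1
log_odds = math.log(y / (1 - y))      # recovers beta_0 + sum(beta_j * x_j) = 0.1
```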
An example: Credit classification dataset
Attributes of the Dataset
Trends in Good and Bad credit approval group
Finding facts: Is there any trend/pattern within the dataset that can be linked to Good/Bad credit?
The Logistic regression model
Considering Age, Loan amount, and Duration of current credit account as predictors to predict the probability of a Good or Bad loan. AIC (Akaike Information Criterion) trades off the model's goodness of fit against the number of parameters it uses; a lower AIC indicates a better model.