Regression Models
Fit data
– Time-series data: Forecast
– Other data: Predict

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

Use in Data Mining
One of the major analytic models
– Linear regression
    The standard: ordinary least squares (OLS) regression
    Can be used for discriminant analysis
    Can apply stepwise regression
– Nonlinear regression
    More complex (but less reliable) data fitting
– Logistic regression
    For categorical data (usually binary)

OLS Regression
Uses intercept and slope coefficients (β) to minimize the sum of squared error terms over all i observations
Fits the data with a linear model
Time-series data:
– Observations over past periods
– Best-fit line (in terms of minimizing the sum of squared errors)
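
As a minimal sketch of the best-fit line for time-series data, the closed-form OLS solution for a single independent variable can be written in plain Python. The `periods` and `sales` values are hypothetical illustration data, not figures from the slides, and `fit_line` is a name chosen here.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = co-variation of x and y divided by variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical observations over five past periods.
periods = [1, 2, 3, 4, 5]
sales = [12.0, 15.0, 17.0, 20.0, 21.0]

intercept, slope = fit_line(periods, sales)
forecast = intercept + slope * 6  # extend the fitted line one period ahead
print(f"intercept={intercept:.1f} slope={slope:.1f} forecast={forecast:.1f}")
```

Forecasting here simply extends the fitted line to the next period; it assumes the linear trend continues.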

Regression Tests
Fit:
– SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
– R²: proportion of variance in the dependent variable explained by the model
– Adjusted R²: adjusts the calculation to penalize for the number of independent variables
Significance:
– F-test: test of overall model significance
– t-test: tests whether a model coefficient differs significantly from zero
– P: probability of observing a coefficient estimate at least this far from zero if the true coefficient were zero (a small P is evidence that the coefficient is not zero)

Regression Model Tests
SSE (sum of squared errors)
– For each observation, subtract the model value from the observed value, square the difference, and total over all observations
– By itself means nothing
– Can be compared across models fit to the same data (lower is better)
– Can be used to evaluate the proportion of variance in the data explained by the model
R²
– Ratio of the explained sum of squares (labeled MSR here) to the total sum of squares (SST)
– SST = MSR + SSE, so R² = MSR / SST = 1 - SSE / SST
– 0 ≤ R² ≤ 1
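
The SSE and R² definitions above can be sketched as follows. The observed and predicted values are hypothetical (the predictions come from an assumed least-squares line 10.1 + 2.3x), and `fit_stats` is a name introduced for illustration.

```python
def fit_stats(observed, predicted):
    """Return (SSE, SST, R squared) for a fitted model."""
    mean = sum(observed) / len(observed)
    # SSE: squared gaps between observed and model values.
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # SST: squared gaps between observed values and their mean.
    sst = sum((o - mean) ** 2 for o in observed)
    r2 = 1 - sse / sst
    return sse, sst, r2


# Hypothetical data: five observations and the values a fitted line gives.
observed = [12.0, 15.0, 17.0, 20.0, 21.0]
predicted = [12.4, 14.7, 17.0, 19.3, 21.6]

sse, sst, r2 = fit_stats(observed, predicted)
print(f"SSE={sse:.2f} SST={sst:.2f} R2={r2:.3f}")
```

The identity R² = 1 - SSE / SST used here holds for a least-squares fit that includes an intercept.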

Multiple Regression
Can include more than one independent variable
– Trade-off:
    Too many variables: many are spurious or carry overlapping information
    Too few variables: miss important content
– Adding a variable never decreases R² (and almost always increases it)
– Adjusted R² penalizes for additional independent variables

Example: Hiring Data
Dependent variable: Sales
Independent variables:
– Years of Education
– College GPA
– Age
– Gender
– College Degree

Regression Model
Sales = 269025
        - 17148 * YrsEd     P = 0.175
        -  7172 * GPA       P = 0.812
        +  4331 * Age       P = 0.116
        - 23581 * Male      P = 0.266
        + 31001 * Degree    P = 0.450
R² = 0.252    Adj R² = -0.015
Weak model: no independent variable (IV) is significant at the 0.10 level

Improved Regression Model
Sales = 173284
        -  9991 * YrsEd     P = 0.098*
        +  3537 * Age       P = 0.141
        - 18730 * Male      P = 0.328
R² = 0.218    Adj R² = 0.070
* P below 0.10
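
The adjusted R² values reported for the two hiring models can be checked with the standard formula 1 - (1 - R²)(n - 1)/(n - k - 1), where k is the number of independent variables. The sample size n = 20 is an assumption: the slides do not state it, and 20 is simply the value that reproduces the reported figures to within rounding.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


N_OBS = 20  # assumed sample size, not stated in the slides

full_model = adjusted_r2(0.252, N_OBS, 5)     # five independent variables
reduced_model = adjusted_r2(0.218, N_OBS, 3)  # three independent variables
print(f"full={full_model:.3f} reduced={reduced_model:.3f}")
```

Dropping GPA and Degree lowers R² slightly but raises adjusted R², because the penalty for the two extra variables outweighs the small amount of fit they contributed.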

Logistic Regression
Data are often ordinal or nominal
Regression based on continuous numbers is not appropriate
– Need dummy variables
Binary outcome (either is or is not)
– LOGISTIC REGRESSION (probability of either 1 or 0)
Two or more categories
– DISCRIMINANT ANALYSIS (perform a regression for each outcome; pick the one that fits best)

Logistic Regression
For dependent variables that are nominal or ordinal
Probability of acceptance of case i to class j
Sigmoidal function
– (in English, an S curve from 0 to 1)

Insurance Claim Model
Fraud = 81.824
        -  2.778 * Age               P = 0.789
        - 75.893 * Male              P = 0.758
        +  0.017 * Claim             P = 0.757
        - 36.648 * Tickets           P = 0.824
        +  6.914 * Prior             P = 0.935
        - 29.362 * Attorney Smith    P = 0.776
Can get a probability by running the score through the logistic formula
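
Running a score through the logistic formula, p = 1 / (1 + e^(-score)), might look like the sketch below. The coefficients are those of the insurance claim model; the example claimant is hypothetical, and treating Age as years, Claim as a dollar amount, and Male and Attorney Smith as 0/1 dummies is an assumption about how the model's inputs are coded.

```python
import math


def logistic(score):
    """Map any real score onto the S curve between 0 and 1."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # For negative scores use the equivalent form to avoid overflow.
    e = math.exp(score)
    return e / (1.0 + e)


def fraud_score(age, male, claim, tickets, prior, attorney_smith):
    """Linear score from the insurance claim model coefficients."""
    return (81.824 - 2.778 * age - 75.893 * male + 0.017 * claim
            - 36.648 * tickets + 6.914 * prior - 29.362 * attorney_smith)


# Hypothetical claimant: age 30, not male, 100 claim, no tickets or priors.
score = fraud_score(age=30, male=0, claim=100, tickets=0, prior=0,
                    attorney_smith=0)
probability = logistic(score)
print(f"score={score:.3f} probability={probability:.3f}")
```

Because every P value in the model is well above 0.10, probabilities from it should be read as an illustration of the mechanics rather than as reliable estimates.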

Linear Discriminant Analysis
Groups objects into a predetermined set of outcome classes
Regression is one means of performing discriminant analysis
– 2 groups: find a cutoff for the regression score
– More than 2 groups: multiple cutoffs

Centroid Method (NOT regression)
Binary data
Divide the training set into two groups by binary outcome
– Standardize the data to remove scales
Identify the mean of each independent variable by group (the CENTROID)
Calculate the distance function

Standardized & Sorted Fraud Data
             Age     Claim   Tickets  Prior   Outcome
             1       0.60    1        0.5     0
             0.9     0.64    1        1       0
             0       0.88    0        0       0
Group mean   0.633   0.707   0.667    0.500   0
             0.05    0       1        0       1
             1       0.16    1        0       1
Group mean   0.525   0.080   1.000    0.000   1

Distance Calculations
          New    To 0                      To 1
Age       0.50   (0.633-0.5)² = 0.018      (0.525-0.5)² = 0.001
Claim     0.30   (0.707-0.3)² = 0.166      (0.08-0.3)²  = 0.048
Tickets   0      (0.667-0)²   = 0.445      (1-0)²       = 1.000
Prior     1      (0.5-1)²     = 0.250      (0-1)²       = 1.000
Totals                          0.879                     2.049
The new case is closer to the centroid of group 0 (0.879 < 2.049), so it is assigned to outcome 0
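
The centroid calculation can be reproduced with a short sketch that uses the five standardized training cases and the new case from the distance table. Function and variable names are chosen here for illustration. Group means are not rounded before squaring, so the total for group 0 comes out near 0.878 rather than the 0.879 obtained from three-decimal means.

```python
# Standardized training cases: (Age, Claim, Tickets, Prior), outcome.
TRAINING = [
    ((1.0, 0.60, 1.0, 0.5), 0),
    ((0.9, 0.64, 1.0, 1.0), 0),
    ((0.0, 0.88, 0.0, 0.0), 0),
    ((0.05, 0.0, 1.0, 0.0), 1),
    ((1.0, 0.16, 1.0, 0.0), 1),
]


def centroid(rows):
    """Mean of each variable across the rows of one group."""
    return tuple(sum(column) / len(rows) for column in zip(*rows))


def squared_distance(a, b):
    """Sum of squared differences between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


centroids = {
    label: centroid([features for features, outcome in TRAINING
                     if outcome == label])
    for label in (0, 1)
}

new_case = (0.5, 0.3, 0.0, 1.0)
distances = {label: squared_distance(new_case, center)
             for label, center in centroids.items()}
predicted = min(distances, key=distances.get)  # nearest centroid wins
print(distances, predicted)
```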

Discriminant Analysis with Regression
Standardized data, binary outcomes
Intercept       0.430    P = 0.670
Age            -0.421    P = 0.671
Gender          0.333    P = 0.733
Claim          -0.648    P = 0.469
Tickets         0.584    P = 0.566
Prior Claims   -1.091    P = 0.399
Attorney        0.573    P = 0.607
R² = 0.804
Cutoff (average of the group averages): 0.429
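
A sketch of classifying with the regression score and cutoff follows. The coefficients and the 0.429 cutoff are the reported ones. The example case is hypothetical, and assigning scores above the cutoff to outcome 1 is an assumption based on the outcome being coded 0/1.

```python
INTERCEPT = 0.430
COEFFICIENTS = {
    "Age": -0.421,
    "Gender": 0.333,
    "Claim": -0.648,
    "Tickets": 0.584,
    "Prior": -1.091,
    "Attorney": 0.573,
}
CUTOFF = 0.429  # average of the two group-average scores


def discriminant_score(case):
    """Regression score for a case given as a dict of standardized values."""
    return INTERCEPT + sum(COEFFICIENTS[name] * case[name]
                           for name in COEFFICIENTS)


def classify(score, cutoff=CUTOFF):
    """Assumed rule: scores above the cutoff go to outcome 1, others to 0."""
    return 1 if score > cutoff else 0


# Hypothetical standardized case (Gender and Attorney values are assumed).
new_case = {"Age": 0.5, "Gender": 0.0, "Claim": 0.3,
            "Tickets": 0.0, "Prior": 1.0, "Attorney": 0.0}

score = discriminant_score(new_case)
outcome = classify(score)
print(f"score={score:.4f} outcome={outcome}")
```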

Case: Stepwise Regression
Stepwise regression
– Automatic selection of independent variables
    Look at the F scores of the simple regressions
    Add the variable with the greatest F statistic
    Check the partial F scores for adding each variable not yet in the model
    Delete variables that are no longer significant
    If no variable outside the model is significant, quit
Considered inferior to selection of variables by experts
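
The selection loop can be sketched in plain Python under simplifying assumptions: only the forward (add) steps are implemented, the deletion step is omitted, and a fixed F-to-enter threshold of 4.0 stands in for a proper significance test. The data set and all names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def sse(columns, y):
    """SSE of an OLS fit of y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))


def forward_select(data, y, f_enter=4.0):
    """Add the variable with the largest partial F until none passes."""
    chosen, current = [], sse([], y)
    while True:
        best = None
        for name in data:
            if name in chosen:
                continue
            df = len(y) - len(chosen) - 2  # residual df of the trial model
            if df <= 0:
                continue
            trial = sse([data[v] for v in chosen + [name]], y)
            f_stat = (current - trial) / (trial / df)
            if best is None or f_stat > best[0]:
                best = (f_stat, name, trial)
        if best is None or best[0] < f_enter:
            return chosen
        chosen.append(best[1])
        current = best[2]


# Hypothetical data: y depends on x1; x2 is unrelated to y.
data = {"x1": [1, 2, 3, 4, 1, 2, 3, 4], "x2": [0, 0, 0, 0, 1, 1, 1, 1]}
y = [3.5, 5.5, 8.5, 12.5, 2.5, 6.5, 9.5, 11.5]
selected = forward_select(data, y)
print(selected)
```

A fuller version would re-test the variables already in the model after each addition and drop any that are no longer significant, as the procedure describes.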

Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association
Data on 244,000 credit card accounts
– 12-month period
– 1 percent default
– Cost of granting a loan that defaults: almost $5,000
– Cost of denying a loan that would have been paid: about $50
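
To illustrate why the two error costs matter, the sketch below compares expected costs for a given default probability. It rounds the costs to $5,000 and $50, and the decision rule is an illustration added here, not the method used in the study.

```python
COST_DEFAULT = 5000.0   # approximate cost of granting a loan that defaults
COST_DENY_GOOD = 50.0   # approximate cost of denying a loan that would pay


def breakeven_probability(cost_default=COST_DEFAULT,
                          cost_deny=COST_DENY_GOOD):
    """Default probability at which granting and denying cost the same."""
    return cost_deny / (cost_default + cost_deny)


def decide(p_default):
    """Pick the action with the lower expected cost."""
    expected_cost_grant = p_default * COST_DEFAULT
    expected_cost_deny = (1 - p_default) * COST_DENY_GOOD
    return "deny" if expected_cost_grant > expected_cost_deny else "grant"


threshold = breakeven_probability()
print(f"break-even default probability: {threshold:.4f}")
print(decide(0.02), decide(0.005))
```

With costs this lopsided, the break-even default probability is just under 1 percent, which is close to the base default rate in the data.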

Data Treatment
Divided the observations into 5 groups
– Used one group for training
– Any smaller training set would have had problems due to insufficient default cases
– Used the remaining 80% of the data for detailed testing
Regression performed better than the C5 model
– Even though C5 used the costs and regression did not

Summary
Regression is a basic classical model
– Many forms
Logistic regression is very useful in data mining
– Outcomes are often binary
– Can also be used on categorical data
Can be used for discriminant analysis
– To classify
