Chapter 6 Regression Algorithms in Data Mining
Fit data:
- Time-series data: forecast
- Other data: predict
Contents
- Describes OLS (ordinary least squares) regression and logistic regression
- Describes linear discriminant analysis and centroid discriminant analysis
- Demonstrates the techniques on small data sets
- Reviews real applications of each model
- Shows the application of the models to larger data sets
Use in Data Mining
Example: turnover (churn) in the telecommunications industry. Regression is one of the major analytic models for classification problems.
- Linear regression: the standard is ordinary least squares regression; it can be used for discriminant analysis, and stepwise regression can be applied
- Nonlinear regression: more complex (but less reliable) data fitting
- Logistic regression: used when data are categorical (usually binary)
OLS Model
OLS Regression
- Uses intercept and slope coefficients (b) to minimize the squared error terms over all i observations
- Fits the data with a linear model
- For time-series data: observations over past periods, with the best-fit line minimizing the sum of squared errors
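In symbols (a standard statement of the OLS criterion, matching the slide's description rather than quoted from it), the coefficients are chosen to minimize the sum of squared errors:

\[ \min_{b_0, b_1} \sum_{i=1}^{n} \left( y_i - b_0 - b_1 x_i \right)^2 \]

For a single independent variable this has the familiar closed-form solution:

\[ b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \]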
Regression Output
- R²: 0.987
- Intercept: 0.642 (t = 0.286, P = 0.776)
- Week: t = 53.27, P = 0.000
- Fitted model: Requests = 0.642 + … × Week
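A minimal sketch of fitting and extrapolating such a trend line in Python. The weekly request counts below are invented placeholders (the slides report only the fit statistics, and the slope coefficient did not survive the source), so the printed values will not match the output above:

import numpy as np

# Placeholder weekly request counts (hypothetical, not from the slides)
weeks = np.arange(1, 11)
requests = np.array([105, 118, 132, 141, 155,
                     168, 177, 193, 204, 215])

# Fit Requests = b0 + b1 * Week by ordinary least squares
b1, b0 = np.polyfit(weeks, requests, deg=1)

# Goodness of fit: R^2 = 1 - SSE/SST
fitted = b0 + b1 * weeks
sse = np.sum((requests - fitted) ** 2)
sst = np.sum((requests - requests.mean()) ** 2)
r2 = 1 - sse / sst

# Forecast the next period by extrapolating the line
forecast = b0 + b1 * 11
print(f"Requests = {b0:.1f} + {b1:.2f}*Week, R^2 = {r2:.3f}, "
      f"week-11 forecast = {forecast:.0f}")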
Example
R² is computed from the error decomposition:

\[ R^2 = 1 - \frac{SSE}{SST} \]
[Figure: a graph of the time-series model]
Time-Series Forecast
Regression Tests
Fit:
- SSE: sum of squared errors (synonym: SSR, sum of squared residuals)
- R²: proportion of variance explained by the model
- Adjusted R²: adjusts the calculation to penalize for the number of independent variables
Significance:
- F-test: test of overall model significance
- t-test: test of whether a model coefficient differs significantly from zero
- P: probability that the coefficient is zero (or at least on the other side of zero from the estimate)
See p. 103.
Regression Model Tests
SSE (sum of squared errors):
- For each observation, subtract the model value from the observed value, square the difference, and total over all observations
- By itself means nothing, but it can be compared across models (lower is better)
- Can be used to evaluate the proportion of variance in the data explained by the model
R²:
- Ratio of the explained sum of squares (MSR) to the total sum of squares (SST)
- SST = MSR + SSE, so R² = MSR/SST = 1 − SSE/SST
- 0 ≤ R² ≤ 1
See p. 104.
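A small helper that mirrors these definitions; the function name and variables are illustrative only:

import numpy as np

def fit_statistics(y, y_hat):
    """Return SSE, MSR, SST, and R^2 for observations y and model values y_hat."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    sse = np.sum((y - y_hat) ** 2)        # unexplained squared error
    sst = np.sum((y - y.mean()) ** 2)     # total squared variation
    msr = sst - sse                       # explained portion (SST = MSR + SSE)
    return sse, msr, sst, msr / sst       # R^2 = MSR / SST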
Multiple Regression
Can include more than one independent variable. The trade-off:
- Too many variables: many spurious effects and overlapping information
- Too few variables: important content is missed
Adding variables always increases R², so adjusted R² penalizes for additional independent variables (the standard formula follows).
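For n observations and k independent variables, the textbook adjustment (consistent with the slide's description) is:

\[ \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1} \]

The penalty grows with k, so an added variable raises adjusted R² only if it explains more than its share of noise.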
Example: Hiring Data
Dependent variable: Sales
Independent variables:
- Years of education
- College GPA
- Age
- Gender
- College degree
See page
Regression Model
Sales = 269,025
  − 17,148 × YrsEd (P = 0.175)
  − 7,172 × GPA (P = 0.812)
  + 4,331 × Age (P = 0.116)
  − 23,581 × Male (P = 0.266)
  + 31,001 × Degree (P = 0.450)
R² = …; Adj R² = …
A weak model: no variable is significant at the 0.10 level.
Improved Regression Model
Sales = …
  − 9,991 × YrsEd (P = 0.098*)
  + 3,537 × Age (P = 0.141)
  − 18,730 × Male (P = 0.328)
R² = …; Adj R² = 0.070
Logistic Regression
Data are often ordinal or nominal, so regression based on continuous numbers is not appropriate; dummy variables are needed (binary: either are or are not).
- Binary outcome: logistic regression (probability of either 1 or 0)
- Two or more categories: discriminant analysis (perform a regression for each outcome and pick the one that fits best)
Logistic Regression
- For dependent variables that are nominal or ordinal
- Models the probability that case i belongs to class j
- Uses a sigmoidal function (in English, an S-curve from 0 to 1)
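The sigmoidal function referred to here is the standard logistic form, where the b coefficients come from the fitted regression:

\[ P_i = \frac{1}{1 + e^{-(b_0 + b_1 x_{1i} + \cdots + b_k x_{ki})}} \]

For any value of the linear score, the result lies strictly between 0 and 1.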
Insurance Claim Model
Fraud score = 81.824
  − 2.778 × Age (P = 0.789)
  + … × Male (P = 0.758)
  + … × Claim (P = 0.757)
  + … × Tickets (P = 0.824)
  + … × Prior (P = 0.935)
  + … × Atty Smith (P = 0.776)
The probability of fraud is obtained by running this score through the logistic formula. See pp. 107-109.
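A sketch of that last step in Python. Only the intercept (81.824) and the Age coefficient (-2.778) are legible in the source, so the function below uses just those two terms; the remaining variables would enter the score the same way:

import math

def fraud_probability(age, intercept=81.824, age_coef=-2.778):
    """Convert a linear fraud score into a probability via the logistic formula.

    Only the intercept and Age terms survive in the source slide; Male,
    Claim, Tickets, Prior, and Atty Smith would be added to the score
    analogously.
    """
    score = intercept + age_coef * age       # linear predictor
    return 1.0 / (1.0 + math.exp(-score))    # logistic transform

print(fraud_probability(age=30))   # roughly 0.18 with these two terms only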
Linear Discriminant Analysis
- Groups objects into a predetermined set of outcome classes
- Regression is one means of performing discriminant analysis
- Two groups: find a cutoff for the regression score
- More than two groups: use multiple cutoffs
Centroid Method (NOT regression)
For binary data:
1. Divide the training set into two groups by binary outcome
2. Standardize the data to remove scale effects
3. Identify the mean of each independent variable by group (the centroid)
4. Calculate a distance function from each new case to the two centroids
A sketch of this procedure appears below, followed by a worked example.
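A compact sketch of the procedure in Python; it assumes the features have already been standardized (step 2), and the toy arrays stand in for the fraud training data:

import numpy as np

def centroid_classify(X_train, y_train, x_new):
    """Assign x_new to the class whose centroid is nearest.

    X_train: standardized feature matrix (one row per training case)
    y_train: binary outcomes (0 = OK, 1 = Fraud)
    Returns the winning class and the squared distance to each centroid.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}
    dist = {c: float(np.sum((np.asarray(x_new) - m) ** 2))
            for c, m in centroids.items()}
    return min(dist, key=dist.get), dist

# Placeholder standardized cases (Age, Claim, Tickets, Prior)
X = [[0.60, 0.50, 0.0, 0.5], [0.88, 0.63, 0.0, 0.5],
     [0.05, 1.00, 1.0, 0.0], [0.16, 0.80, 1.0, 1.0]]
y = [0, 0, 1, 1]
print(centroid_classify(X, y, [0.50, 0.30, 1.0, 1.0]))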
Fraud Data
[Table: training cases with columns Age, Claim, Tickets, Prior, and Outcome (OK or Fraud)]
Standardized & Sorted Fraud Data
[Table: the same cases with each variable standardized to a 0-1 scale and sorted by outcome]
Distance Calculations

Variable   New case   Squared distance to centroid 0 (OK)   Squared distance to centroid 1 (Fraud)
Age        0.50       0.018                                  0.001
Claim      0.30       0.166                                  0.048
Tickets    1          0.445                                  (1 - 0)^2 = 1.000
Prior      1          (0.5 - 1)^2 = 0.250                    (0 - 1)^2 = 1.000
Total                 0.879                                  2.049

The new case is closer to the OK centroid (0.879 < 2.049), so it is classified as not fraudulent.
Discriminant Analysis with Regression
Regression on the standardized data with binary outcomes gives:
- Intercept (P = 0.670)
- Age (P = 0.671)
- Gender (P = 0.733)
- Claim (P = 0.469)
- Tickets (P = 0.566)
- Prior claims (P = 0.399)
- Attorney (P = 0.607)
R² = 0.804. Cutoff (the average of the two group averages): 0.429.
Case: Stepwise Regression
Automatic selection of independent variables:
1. Look at the F scores of the simple one-variable regressions
2. Add the variable with the greatest F statistic
3. Check the partial F scores for adding each variable not yet in the model
4. Delete variables that are no longer significant
5. If no remaining outside variable is significant, quit
Considered inferior to selection of variables by experts. A sketch of the procedure follows.
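A hedged sketch of the forward part of this procedure; it uses coefficient p-values in place of partial F scores (equivalent when one variable is added at a time, since F = t²), and assumes a pandas DataFrame X and the statsmodels package:

import statsmodels.api as sm

def forward_stepwise(X, y, enter_p=0.05):
    """Greedy forward selection: repeatedly add the candidate variable
    with the smallest p-value, stopping when none is below enter_p."""
    selected, candidates = [], list(X.columns)
    while candidates:
        pvals = {}
        for var in candidates:
            exog = sm.add_constant(X[selected + [var]])
            pvals[var] = sm.OLS(y, exog).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= enter_p:
            break                     # no remaining variable is significant
        selected.append(best)
        candidates.remove(best)
    return selected

A full stepwise routine would also re-test the already-selected variables after each addition and drop any whose p-value rises above a removal threshold (step 4 above).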
Credit Card Bankruptcy Prediction
Foster & Stine (2004), Journal of the American Statistical Association.
- Data on 244,000 credit card accounts over a 12-month period
- About 1 percent of accounts default
- Cost of granting a loan that defaults: almost $5,000
- Cost of denying a loan that would have paid: about $50
Data Treatment
- Divided the observations into 5 groups and used one for training; anything smaller would have had problems due to insufficient default cases
- Used 80% of the data for detailed testing
- Regression performed better than the C5 model, even though C5 used the misclassification costs and the regression did not
Summary
- Regression is a basic classical model with many forms
- Logistic regression is very useful in data mining, where outcomes are often binary; it can also be used on categorical data
- Regression can be used for discriminant analysis, that is, to classify