Summer Course: Data Mining
Regression Analysis
Presenter: Georgi Nalbantov
August 2009
Structure
- Regression analysis: definition and examples
- Classical Linear Regression
- LASSO and Ridge Regression (linear and nonlinear)
- Nonparametric (local) regression estimation: kNN for regression, Decision trees, Smoothers
- Support Vector Regression (linear and nonlinear)
- Variable/feature selection (AIC, BIC, R^2-adjusted)
Feature Selection, Dimensionality Reduction, and Clustering in the KDD Process
U. M. Fayyad, G. Piatetsky-Shapiro and P. Smyth (1995)
Common Data Mining tasks: Clustering, Classification, Regression
[Figure: three scatter plots over X1 and X2, with points marked + and -, illustrating the three tasks.]
Example methods per task:
- Clustering: k-th Nearest Neighbour, Parzen Window, Unfolding, Conjoint Analysis, Cat-PCA
- Classification: Linear Discriminant Analysis, QDA, Logistic Regression (Logit), Decision Trees, LSSVM, NN, VS
- Regression: Classical Linear Regression, Ridge Regression, NN, CART
Linear regression analysis: examples
The Regression task
Given: (x_1, y_1), ..., (x_m, y_m) in R^n x R, i.e. data on m observations of n explanatory variables and one explained variable, where the explained variable takes real values in R.
Find: a function f : R^n -> R that gives the "best" fit, where the "best function" is the one whose expected error on unseen data (x_{m+1}, y_{m+1}), ..., (x_{m+k}, y_{m+k}) is minimal.
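A minimal Python sketch of this setup on synthetic data; the variable names and the straight-line candidate function are illustrative, and the "best" function is judged by its mean squared error on data it has not seen:

```python
# Minimal illustration of the regression task: fit f on (x_1,y_1),...,(x_m,y_m)
# and judge it by its error on unseen pairs (x_{m+1},y_{m+1}),...
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))            # explanatory variable(s)
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 200)  # explained variable with noise

# split into "seen" (training) and "unseen" (test) data
X_train, y_train = X[:150], y[:150]
X_test,  y_test  = X[150:], y[150:]

# any candidate function f; here a straight line fitted by least squares
b1, b0 = np.polyfit(X_train[:, 0], y_train, deg=1)
f = lambda x: b0 + b1 * x

# "best" f = small expected error on the unseen data, estimated by the test MSE
test_mse = np.mean((y_test - f(X_test[:, 0])) ** 2)
print(f"test MSE: {test_mse:.3f}")
```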
Classical Linear Regression (OLS)
- Explanatory and response variables are numeric.
- The relationship between the mean of the response variable and the level of the explanatory variable is assumed to be approximately linear (a straight line).
- Model: y = b0 + b1*x + e, with random error e of mean zero.
- b1 > 0: positive association; b1 < 0: negative association; b1 = 0: no association.
Classical Linear Regression (OLS)
- b0: mean response when x = 0 (y-intercept)
- b1: change in mean response when x increases by 1 unit (slope)
- b0, b1 are unknown population parameters (like mu)
- b0 + b1*x: mean response when the explanatory variable takes on the value x
Task: minimize the sum of squared errors SSE(b0, b1) = sum_i ( y_i - (b0 + b1*x_i) )^2.
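A minimal numpy sketch of the resulting closed-form estimates; the helper name ols_simple and the synthetic data are illustrative:

```python
import numpy as np

# closed-form least-squares estimates for the simple linear model y = b0 + b1*x + e
def ols_simple(x, y):
    x_bar, y_bar = x.mean(), y.mean()
    b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
    b0 = y_bar - b1 * x_bar                                            # intercept
    return b0, b1

# example with synthetic data
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 1.5 + 0.8 * x + rng.normal(0, 0.5, 100)
b0, b1 = ols_simple(x, y)
sse = np.sum((y - (b0 + b1 * x)) ** 2)   # the minimized sum of squared errors
print(b0, b1, sse)
```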
Classical Linear Regression (OLS)
- Parameter: slope in the population model (b1)
- Estimator: least squares estimate b1_hat = sum_i (x_i - x_bar)(y_i - y_bar) / sum_i (x_i - x_bar)^2
- Estimated standard error: SE(b1_hat) = s / sqrt( sum_i (x_i - x_bar)^2 ), where s^2 = SSE / (m - 2)
- Methods of making inference about the population: hypothesis tests (2-sided or 1-sided) and confidence intervals.
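A sketch of the corresponding inference computations, assuming scipy is available; the helper name slope_inference is illustrative:

```python
import numpy as np
from scipy import stats

# standard error, t-test and confidence interval for the slope b1 (simple OLS)
def slope_inference(x, y, level=0.95):
    m = len(x)
    x_bar = x.mean()
    Sxx = np.sum((x - x_bar) ** 2)
    b1 = np.sum((x - x_bar) * (y - y.mean())) / Sxx
    b0 = y.mean() - b1 * x_bar
    resid = y - (b0 + b1 * x)
    s2 = np.sum(resid ** 2) / (m - 2)        # residual variance estimate
    se_b1 = np.sqrt(s2 / Sxx)                # estimated standard error of b1_hat
    t_stat = b1 / se_b1                      # test of H0: b1 = 0 (no association)
    p_value = 2 * stats.t.sf(abs(t_stat), df=m - 2)
    t_crit = stats.t.ppf(0.5 + level / 2, df=m - 2)
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, p_value, ci
```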
Classical Linear Regression (OLS)
Coefficient of determination (r^2): the proportion of variation in y "explained" by the regression on x:
r^2 = 1 - SSE / SST, where SSE = sum_i ( y_i - y_hat_i )^2 and SST = sum_i ( y_i - y_bar )^2.
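In code, r^2 follows directly from the fitted values; the helper name r_squared is illustrative:

```python
import numpy as np

# coefficient of determination: share of variation in y explained by the regression
def r_squared(y, y_hat):
    sse = np.sum((y - y_hat) ** 2)          # residual (unexplained) variation
    sst = np.sum((y - y.mean()) ** 2)       # total variation in y
    return 1.0 - sse / sst
```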
Classical Linear Regression (OLS): Multiple regression
- Numeric response variable (y) and p numeric predictor variables
- Model: y = b0 + b1*x1 + ... + bp*xp + e
- Partial regression coefficients: bi is the effect (on the mean response) of increasing the i-th predictor variable by 1 unit, holding all other predictors constant.
Classical Linear Regression (OLS): Ordinary Least Squares estimation
- Population model for the mean response: E(y) = b0 + b1*x1 + ... + bp*xp
- Least squares fitted (predicted) equation, minimizing SSE: y_hat = b0_hat + b1_hat*x1 + ... + bp_hat*xp, with SSE = sum_i ( y_i - y_hat_i )^2.
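A minimal numpy sketch of the multiple-regression fit; the helper names ols_multiple and predict are illustrative:

```python
import numpy as np

# OLS for multiple regression: stack a column of ones for b0 and solve min ||y - Xb||^2
def ols_multiple(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])       # [1, x_1, ..., x_p]
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)    # (b0, b1, ..., bp)
    return beta

# fitted (predicted) values for new observations
def predict(beta, X_new):
    return beta[0] + X_new @ beta[1:]
```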
Classical Linear Regression (OLS): Ordinary Least Squares estimation
- Model: y = b0 + b1*x1 + ... + bp*xp + e
- OLS estimation: minimize sum_i ( y_i - b0 - b1*x_i1 - ... - bp*x_ip )^2
- LASSO estimation: minimize the same sum of squared errors plus an L1 penalty, lambda * sum_j |b_j|
- Ridge regression estimation: minimize the same sum of squared errors plus an L2 penalty, lambda * sum_j b_j^2
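A scikit-learn sketch comparing the three estimators on synthetic data; the penalty weights alpha=0.1 and alpha=1.0 are arbitrary illustrative choices that play the role of lambda above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# hypothetical data: 50 observations, 10 predictors, only 3 of them relevant
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 50)

ols   = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)    # L1 penalty: shrinks and sets some coefficients to 0
ridge = Ridge(alpha=1.0).fit(X, y)    # L2 penalty: shrinks coefficients towards 0

print("OLS   :", np.round(ols.coef_, 2))
print("LASSO :", np.round(lasso.coef_, 2))
print("Ridge :", np.round(ridge.coef_, 2))
```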
LASSO and Ridge estimation of model coefficients
[Figure: coefficient paths for LASSO and Ridge, plotted against sum(|beta|).]
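A plot of this kind can be reproduced with scikit-learn's lasso_path; the sketch below reuses the same kind of synthetic data as the previous example and assumes matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

# same kind of data as above: a few relevant predictors among many
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.3, 50)

# trace the LASSO coefficients as the penalty is relaxed
alphas, coefs, _ = lasso_path(X, y)      # coefs has shape (n_features, n_alphas)
l1_norm = np.abs(coefs).sum(axis=0)      # sum(|beta|) at each point on the path

for j in range(coefs.shape[0]):
    plt.plot(l1_norm, coefs[j])
plt.xlabel("sum(|beta|)")
plt.ylabel("coefficient value")
plt.title("LASSO coefficient paths")
plt.show()
```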
Nonparametric (local) regression estimation: k-NN, Decision trees, smoothers
Nonparametric (local) regression estimation: k-NN, Decision trees, smoothers
How to choose k or h?
- When k or h is small, single instances matter; bias is small and variance is large (undersmoothing): high complexity.
- As k or h increases, we average over more instances; variance decreases but bias increases (oversmoothing): low complexity.
- Cross-validation is used to fine-tune k or h, as in the sketch below.
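A sketch of choosing k by 5-fold cross-validation for k-NN regression on synthetic data; the candidate values of k are illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# choose k by cross-validation: small k -> low bias / high variance (undersmoothing),
# large k -> high bias / low variance (oversmoothing)
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

scores = {}
for k in [1, 3, 5, 10, 25, 50]:
    knn = KNeighborsRegressor(n_neighbors=k)
    mse = -cross_val_score(knn, X, y, cv=5,
                           scoring="neg_mean_squared_error").mean()
    scores[k] = mse

best_k = min(scores, key=scores.get)
print(scores, "best k:", best_k)
```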
Linear Support Vector Regression
[Figure: three plots of Expenditures vs. Age with epsilon-tubes of small, middle-sized and biggest area around the fitted line, illustrating a "suspiciously smart case" (overfitting), the "compromise case", SVR (good generalisation), and a "lazy case" (underfitting); the points lying on the tube are the "support vectors".]
The thinner the "tube", the more complex the model.
Nonlinear Support Vector Regression
Map the data into a higher-dimensional space.
[Figure: Expenditures vs. Age example.]
Nonlinear Support Vector Regression: Technicalities
The SVR function: f(x) = w . phi(x) + b, where phi maps the data into the higher-dimensional space.
To find the unknown parameters of the SVR function, solve:
minimize (1/2) * ||w||^2 + C * sum_i ( xi_i + xi_i* )
Subject to:
y_i - ( w . phi(x_i) + b ) <= epsilon + xi_i
( w . phi(x_i) + b ) - y_i <= epsilon + xi_i*
xi_i >= 0, xi_i* >= 0
How to choose C, epsilon and the kernel K? RBF kernel: K(x_i, x_j) = exp( -gamma * ||x_i - x_j||^2 ).
Find C, epsilon, gamma and the kernel from a cross-validation procedure.
SVR Technicalities: Model Selection
Do 5-fold cross-validation to find C and gamma for several fixed values of epsilon, as in the sketch below.
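A minimal scikit-learn sketch of this selection step, assuming an RBF-kernel SVR, synthetic data, and an illustrative grid of C and gamma values with epsilon held fixed at 0.1:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# 5-fold cross-validation over C and gamma (RBF kernel), epsilon held fixed
rng = np.random.default_rng(4)
X = rng.uniform(0, 10, size=(150, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 150)

pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf", epsilon=0.1))
grid = {"svr__C": [0.1, 1, 10, 100],
        "svr__gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(pipe, grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print(search.best_params_, -search.best_score_)   # best (C, gamma) and its CV MSE
```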
SVR Study: Model Training, Selection and Prediction
[Figure: cross-validated MSE, CVMSE (IR*, HR*, CR*), and true returns (red) against raw predictions (blue).]
SVR: Individual Effects
SVR Technicalities: SVR vs. OLS
Performance on the test set: SVR MSE = 0.04, OLS MSE = 0.23.
Technical Note: Number of Training Errors vs. Model Complexity
[Figure: minimum number of training errors and test errors plotted against model complexity, with candidate functions ordered in increasing complexity; training errors keep falling while test errors mark a best trade-off at intermediate complexity.]
MATLAB video here…
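A small illustration of this trade-off under assumed settings: synthetic quadratic data, with polynomial fits of increasing degree standing in for "functions ordered in increasing complexity":

```python
import numpy as np

# training error keeps falling with complexity, test error is U-shaped:
# the best trade-off lies between underfitting and overfitting
rng = np.random.default_rng(5)
x = rng.uniform(-3, 3, 60)
y = x ** 2 + rng.normal(0, 1, 60)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

for degree in range(1, 10):              # functions ordered in increasing complexity
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)
    test_mse  = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2f}, test MSE {test_mse:.2f}")
```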
Variable selection for regression
Akaike Information Criterion (AIC). Final prediction error:
AIC = -2 * log-likelihood + 2 * (number of estimated parameters).
Variable selection for regression
Bayesian Information Criterion (BIC), also known as the Schwarz criterion. Final prediction error:
BIC = -2 * log-likelihood + ln(m) * (number of estimated parameters).
BIC penalizes model size more heavily, so it tends to choose simpler models than AIC.
Variable selection for regression
R^2-adjusted: R^2_adj = 1 - (1 - R^2) * (m - 1) / (m - p - 1), where p is the number of predictors; unlike R^2, it does not automatically increase when variables are added.
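A sketch that computes all three criteria for a fitted Gaussian linear model; the helper name selection_criteria and the AIC/BIC forms written up to additive constants are illustrative assumptions:

```python
import numpy as np

# variable-selection criteria for a Gaussian linear model with p predictors plus intercept
def selection_criteria(y, y_hat, p):
    m = len(y)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    k = p + 1                                   # number of estimated coefficients
    aic = m * np.log(sse / m) + 2 * k           # AIC, up to an additive constant
    bic = m * np.log(sse / m) + k * np.log(m)   # heavier penalty -> simpler models
    r2 = 1 - sse / sst
    r2_adj = 1 - (1 - r2) * (m - 1) / (m - p - 1)
    return aic, bic, r2_adj
```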
Conclusion / Summary / References
- Classical Linear Regression: any introductory statistical/econometric book
- LASSO and Ridge Regression (linear and nonlinear): http://www-stat.stanford.edu/~tibs/lasso.html ; Bishop, 2006
- Nonparametric (local) regression estimation (kNN for regression, Decision trees, Smoothers): Alpaydin, 2004; Hastie et al., 2001
- Support Vector Regression (linear and nonlinear): Smola and Schoelkopf, 2003
- Variable/feature selection (AIC, BIC, R^2-adjusted): Hastie et al., 2001; any statistical/econometric book