ICS 278: Data Mining Lecture 5: Regression Algorithms


1 ICS 278: Data Mining Lecture 5: Regression Algorithms
Padhraic Smyth, Department of Information and Computer Science, University of California, Irvine

2 Notation
- Variables X, Y, ... with values x, y (lower case)
- Vectors indicated by X; components of X indicated by Xj, with values xj
- "Matrix" data set D with n rows and p columns: the jth column contains values for variable Xj; the ith row contains a vector of measurements on object i, indicated by x(i)
- The jth measurement value for the ith object is xj(i)
- Unknown parameter for a model: θ (other Greek letters such as α, β, δ, γ are also used)
- Vector of parameters: θ

3 Example: Multivariate Linear Regression
- Task: predict real-valued Y, given real-valued vector X
- Score function, e.g., S(θ) = Σi [y(i) - f(x(i); θ)]^2
- Model structure: f(x; θ) = α0 + Σj αj xj
- Model parameters: θ = {α0, α1, ..., αp}

4 S = Σ e^2 = e'e = (y - Xa)'(y - Xa)
  = y'y - a'X'y - y'Xa + a'X'Xa
  = y'y - 2a'X'y + a'X'Xa

Taking the derivative of S with respect to the components of a gives

  dS/da = -2X'y + 2X'Xa

Set this to 0 to find the extremum (minimum) of S as a function of a:

  -2X'y + 2X'Xa = 0, i.e., X'Xa = X'y

Letting X'X = C and X'y = b, we have Ca = b, i.e., a set of linear equations. We could solve this directly by matrix inversion, i.e., a = C^{-1} b = (X'X)^{-1} X'y, but there are better ways to do this.
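A minimal numerical sketch of this derivation in Python/NumPy (an illustration, not part of the original slides; the simulated data and coefficient values are arbitrary). It forms the normal equations Ca = b explicitly, and also shows the numerically preferable route via a QR/SVD-based solver:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design matrix with intercept column
    y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=n)

    # Normal equations: Ca = b with C = X'X and b = X'y
    C = X.T @ X
    b = X.T @ y
    a_normal = np.linalg.solve(C, b)

    # "Better way": QR/SVD-based least squares, more stable when C is ill-conditioned
    a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)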

5 Comments on Multivariate Linear Regression
- Prediction is a linear function of the parameters
- Score function: quadratic in predictions and parameters
- Derivative of the score is linear in the parameters
- Leads to a linear algebra optimization problem, i.e., Ca = b
- Model structure is simple: a p-1 dimensional hyperplane in p dimensions
- Linear weights => interpretability
- Useful as a baseline model against which to compare more complex models

6 Limitations of Linear Regression
- The true relationship of X and Y might be non-linear
  - Suggests generalizations to non-linear models
- Complexity: O(p^3) - could be a problem for large p
- Correlation among the X variables
  - Can cause numerical instability (C may be ill-conditioned)
  - Problems in interpretability (identifiability)
- Includes all variables in the model: what if p = 100 but only 3 variables are actually related to Y?

7 Finding the k best variables
- Find the subset of k variables that predicts best
  - This is a generic problem when p is large (it arises with all types of models, not just linear regression)
- Now we have models with different complexity, e.g.,
  - p models with a single variable
  - p(p-1)/2 models with 2 variables, etc.
  - 2^p possible models in total
- Note that when we add or delete a variable, the optimal weights on the other variables will in general change
  - The best k-variable subset is not the same as the best k individual variables
- What does "best" mean here? We return to this later

8 Search Problem
- How can we search over all 2^p possible models?
  - Exhaustive search is clearly infeasible
- Heuristic search is used to search over model space:
  - Forward search (greedy; a minimal sketch follows below)
  - Backward search (greedy)
  - Generalizations (add or delete)
    - Think of operators in a search space
  - Branch-and-bound techniques
- This type of variable selection problem is common to many data mining algorithms
  - Outer loop that searches over variable combinations
  - Inner loop that evaluates each combination
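A hedged sketch of greedy forward search (the helper names rss and forward_select are illustrative, not from the slides): at each step the variable that most reduces the training residual sum of squares is added. Scoring on held-out data, as discussed on the following slides, would be the more defensible choice of "best".

    import numpy as np

    def rss(X, y, cols):
        """Residual sum of squares of a least-squares fit using the given columns."""
        Xs = X[:, cols]
        a, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        r = y - Xs @ a
        return r @ r

    def forward_select(X, y, k):
        """Greedily add k columns of X, one at a time, minimizing training RSS."""
        selected, remaining = [], list(range(X.shape[1]))
        for _ in range(k):
            best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
            selected.append(best)
            remaining.remove(best)
        return selected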

9 Defining what “best” means
- How do we measure "best"?
- Best performance on the training data?
  - k = p will be best (i.e., use all variables), so this is not useful
  - Note: performance on the training data will in general be optimistic
- Alternatives:
  - Measure performance on a single validation set
  - Measure performance using multiple validation sets: cross-validation
  - Add a penalty term to the score function that "corrects" for optimism, e.g., GCV (generalized cross-validation)

10 Empirical Learning
- Squared-error score (as an example; we could use other scores): S(θ) = Σi [y(i) - f(x(i); θ)]^2, where S(θ) is defined on the training data D
- We are really interested in finding the f(x; θ) that best predicts y on future data, i.e., minimizing E[S] = E[y - f(x; θ)]^2
- Empirical learning
  - Minimize S(θ) on the training data Dtrain
  - If Dtrain is large and the model is simple, we are assuming that the best f on the training data is also the best predictor f on future test data Dtest

11 Complexity and Generalization
[Figure: score as a function of model complexity, where complexity = degrees of freedom in the model (e.g., number of variables). The training score Strain(θ) keeps decreasing as complexity grows, while the test score Stest(θ) decreases and then rises again; the optimal model complexity is where Stest(θ) is lowest.]

12 Using Validation Data
- Training data: use this data to find the best θ for each model fk(x; θ)
- Validation data: use this data to calculate an estimate of Sk(θ) for each fk(x; θ) and select k* = arg mink Sk(θ)
- Test data: use this data to calculate an unbiased estimate of Sk*(θ) for the selected model
(A minimal sketch of such a three-way split follows below.)
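The sketch below illustrates the split; the function name three_way_split and the 60/20/20 proportions are arbitrary choices for illustration, not prescribed by the slides:

    import numpy as np

    def three_way_split(n, frac_train=0.6, frac_val=0.2, seed=0):
        """Return disjoint train/validation/test index arrays for n objects."""
        idx = np.random.default_rng(seed).permutation(n)
        n_train = int(frac_train * n)
        n_val = int(frac_val * n)
        return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

    # train:      fit theta for each candidate model f_k
    # validation: estimate S_k(theta) and pick k* = argmin_k S_k
    # test:       unbiased estimate of S_{k*}(theta) for the chosen model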

13 Two different (but related) issues here
1. Finding the function f that minimizes S(θ) for future data.
2. Getting a good estimate of S(θ), using the chosen function, on future data. For example, we might have selected the best function f, but our estimate of its performance will be optimistically biased if the estimate of the score uses any of the same data that was used to fit and select the model.

14 Non-linear models, linear in parameters
- We can add additional polynomial terms to our equations, e.g., all "2nd-order" terms: f(x; θ) = α0 + Σj αj xj + Σij βij xi xj (see the sketch below)
- Note that this is a non-linear functional form, but it is linear in the parameters (so it is still referred to as "linear regression")
  - We can just treat the xi xj terms as additional fixed inputs
- In fact we can add in any non-linear input functions, e.g., f(x; θ) = α0 + Σj αj φj(x)
- Comments:
  - Exactly the same linear algebra applies for optimization (same math)
  - The size of the model space has now exploded -> a very large search problem
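A sketch of the 2nd-order expansion (the helper name second_order_features is an assumption for illustration): the pairwise products are simply appended as extra fixed columns, and the expanded matrix is passed to the same least-squares solver as before.

    import numpy as np

    def second_order_features(X):
        """Append all pairwise products x_i * x_j (i <= j) to the design matrix."""
        n, p = X.shape
        cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
        return np.column_stack([np.ones(n), X] + cross)

    # Same linear algebra as before, e.g.:
    #   a, *_ = np.linalg.lstsq(second_order_features(X), y, rcond=None)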

15 Non-linear (both model and parameters)
- We can generalize further to models that are non-linear in all aspects: f(x; θ) = α0 + Σk αk gk(βk0 + Σj βkj xj), where the g's are non-linear functions with fixed functional forms
  - In machine learning this is called a neural network; in statistics this might be referred to as a generalized linear model or projection-pursuit regression
- For almost any score function of interest, e.g., squared error, the score is now a non-linear function of the parameters
  - Closed-form (analytical) solutions are rare
  - Thus, we have a multivariate non-linear optimization problem (which may be quite difficult!)

16 Optimization of a non-linear score function
- We seek the minimum of a function in d dimensions, where d is the number of parameters (d could be large!)
- There is a multitude of heuristic search techniques (see Chapter 8):
  - Steepest descent (follow the gradient; a minimal sketch follows below)
  - Newton methods (use 2nd-derivative information)
  - Conjugate gradient
  - Line search
  - Stochastic search
  - Genetic algorithms
- Two cases:
  - Convex (nice -> a single global optimum)
  - Non-convex (multiple local optima => need multiple starts)
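A minimal steepest-descent sketch, using a finite-difference gradient so it works for any score function you can evaluate (the fixed step size, iteration count, and example quadratic are illustrative assumptions, not a robust optimizer):

    import numpy as np

    def steepest_descent(score, theta0, step=0.01, n_iters=1000, eps=1e-6):
        """Minimize score(theta) by following a finite-difference estimate of the gradient."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iters):
            grad = np.zeros_like(theta)
            for j in range(theta.size):            # numerical gradient, one coordinate at a time
                d = np.zeros_like(theta)
                d[j] = eps
                grad[j] = (score(theta + d) - score(theta - d)) / (2 * eps)
            theta -= step * grad
        return theta

    # Convex example: converges to the minimizer (1, 2) of a simple quadratic.
    print(steepest_descent(lambda t: (t[0] - 1) ** 2 + (t[1] - 2) ** 2, [0.0, 0.0]))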

17 Other non-linear models
- Splines
  - "Patch" together low-order polynomials; works well in 1 dimension, less well in higher dimensions
- Memory-based models
  - y' = Σ w(x', x) y, where the y's are from the training data and w(x', x) is a function of the distance of x from x' (a sketch follows below)
- Local linear regression
  - y' = α0 + Σj αj xj, where the α's are fit at prediction time using only the (y, x) pairs that are close to x'
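A hedged sketch of the memory-based idea, using a normalized Gaussian weight on squared distance (the function name, the Gaussian form, and the bandwidth h are illustrative assumptions; the slide only requires that w(x', x) depend on distance):

    import numpy as np

    def kernel_predict(x_new, X_train, y_train, h=1.0):
        """Predict y at x_new as a normalized, distance-weighted sum of training y's."""
        d2 = np.sum((X_train - x_new) ** 2, axis=1)   # squared distances of x_new to each training x
        w = np.exp(-d2 / (2 * h ** 2))                # Gaussian weights w(x', x)
        return np.sum(w * y_train) / np.sum(w)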

18 Time-series prediction as regression
- Measurements over time: x1, ..., xt
- We want to predict xt+1 given x1, ..., xt
- Autoregressive model: f(x1, ..., xt; θ) = Σk αk xt-k
  - The number of coefficients K = the memory of the model
- We can take advantage of regression techniques in general to solve this problem (e.g., linear in parameters, score function = squared error, etc.); a sketch follows below
- Generalizations
  - Vector x
  - Non-linear function instead of linear
  - Add terms for time trends (linear, seasonal), for "jumps", etc.
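A sketch of fitting an order-K autoregressive model by ordinary least squares on a lagged design matrix (the helper name fit_ar is an assumption; no intercept or trend terms are included):

    import numpy as np

    def fit_ar(series, K):
        """Fit x_t ~ sum_k a_k * x_{t-k} by least squares on lagged copies of the series."""
        x = np.asarray(series, dtype=float)
        # Column k holds lag k+1: row i corresponds to target x[K + i]
        X = np.column_stack([x[K - k - 1: len(x) - k - 1] for k in range(K)])
        y = x[K:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a   # a[k] is the coefficient on x_{t-k-1}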

19 Other aspects of regression
- Diagnostics
  - Useful in low dimensions
- Weighted regression
  - Useful when rows have different weights
- Different score functions
  - E.g., absolute error
- Predicting y values constrained to a certain range
- Predicting binary y values
  - Regression as a generalization of classification

20 Classification as a special case of Regression
- Let y take values 0 or 1
- Regression (in general) tries to learn a function f that matches E[y | x]
- For binary y we have E[y | x] = 1 · p(y=1 | x) + 0 · p(y=0 | x) = p(y=1 | x)
- Thus, if regression drives f towards E[y | x], then for binary classification problems a regression model will try to approximate p(y=1 | x) (the posterior class probabilities)
  - E.g., this is what neural networks do

21 Predicting an output between 0 and 1
- We often have a problem where y lies between 0 and 1
  - The probability that a person with attributes X will survive 10 years
  - The proportion of people in Zip code X who will buy a product
- We could use linear regression, but instead we can use the logistic function: log [ p(y=1|x) / p(y=0|x) ] = α0 + Σj αj xj
- Equivalently, p(y=1|x) = 1 / [1 + exp(-α0 - Σj αj xj)]
- We model the log-odds as a linear function of the input variables; this is known as logistic regression
  - Could be viewed as a simple neural network with a single hidden unit

22 Logistic Regression
- How do we estimate the α parameters?
- Problem: the linear algebra solution no longer applies, since our functional form (and score) is now non-linear in the parameters
- Common approach:
  - Score function = likelihood = probability of the observed data
  - Select the parameters to maximize the log-likelihood ("maximum likelihood"): S(θ) = Σi log p(y(i) | x(i); α) = Σi [ y(i) log p(y=1 | x(i); α) + (1 - y(i)) log(1 - p(y=1 | x(i); α)) ]
  - Can compute the 2nd derivative directly as a weighted matrix
  - This forms the basis for an iterative 2nd-order Newton scheme, known as iteratively reweighted least squares (see the sketch below)
- The log-likelihood here is convex, so the procedure is quite stable (there is only one global maximum!)
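A compact sketch of the Newton / iteratively-reweighted-least-squares idea (the function name, the fixed iteration count, and the assumption that the weighted matrix X'WX stays nonsingular are illustrative simplifications, not a production routine):

    import numpy as np

    def fit_logistic_irls(X, y, n_iters=20):
        """Maximize the logistic-regression log-likelihood with Newton / IRLS steps."""
        n, p = X.shape
        alpha = np.zeros(p)
        for _ in range(n_iters):
            mu = 1.0 / (1.0 + np.exp(-X @ alpha))   # p(y=1 | x; alpha)
            W = mu * (1.0 - mu)                     # diagonal of the weight matrix
            grad = X.T @ (y - mu)                   # gradient of the log-likelihood
            hess = X.T @ (X * W[:, None])           # weighted matrix X' W X
            alpha += np.linalg.solve(hess, grad)    # Newton update
        return alpha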

23 Tree-Structured Regression
- Functional form of the model is a "regression tree"
  - Univariate thresholds at internal nodes
  - Constant or linear surfaces at the leaf nodes
  - Yields a piecewise constant (or linear) surface (like a classification tree, but for regression)
- A very crude functional form... but
  - Can be very useful in high-dimensional problems
  - Can be very interpretable
- Search problem
  - Finding the optimal tree is NP-hard (for any reasonable score)
  - In practice: greedy algorithms

24 More on regression trees
- Greedy search algorithm (the split-search step is sketched below)
  - For each variable, find the best split point such that predicting the mean of Y on either side of the split minimizes the mean-squared error
  - Select the variable with the minimum average error
  - Partition the data using the threshold
  - Recursively apply this selection procedure to each "branch"
- What size tree? A full tree will likely overfit the data
- Common methods for tree selection:
  - Grow a large tree and select an appropriate subtree by cross-validation
  - Grow a number of small fixed-size trees and average their predictions
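A sketch of the inner step of the greedy algorithm for a single variable (the helper name best_split and the midpoint thresholds are assumptions for illustration): candidate thresholds are scanned, and each split is scored by the squared error around the two side means.

    import numpy as np

    def best_split(xj, y):
        """Return (threshold, squared error) of the best split on a single variable xj."""
        order = np.argsort(xj)
        xs, ys = xj[order], y[order]
        best_t, best_err = None, np.inf
        for i in range(1, len(xs)):
            if xs[i] == xs[i - 1]:
                continue                               # no threshold between equal values
            left, right = ys[:i], ys[i:]
            err = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if err < best_err:
                best_t, best_err = (xs[i] + xs[i - 1]) / 2.0, err
        return best_t, best_err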

25 Model Averaging
- Can average over parameters and models
- Why? Any point estimate of the parameters, or any single model, has only a small chance of being the best one
  - Averaging makes our predictions more stable and less sensitive to random variations in a particular data set (good for less stable models like trees); a small sketch follows below
- Model averaging flavors:
  - Fully Bayesian: average over parameters and models
  - "Empirical Bayesian": learn weights over multiple models, e.g., stacking and bagging
  - Build multiple simple models in a systematic way and combine them, e.g., boosting
- More on this in later lectures
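A small bagging-style sketch of the "build many models and average their predictions" idea (an illustration, not the slides' method; a linear base model is used only to keep it self-contained, even though averaging pays off most for unstable models like trees):

    import numpy as np

    def bagged_predictions(X, y, X_new, n_models=25, seed=0):
        """Average least-squares predictions from models fit to bootstrap resamples of the data."""
        rng = np.random.default_rng(seed)
        n = len(y)
        preds = []
        for _ in range(n_models):
            idx = rng.integers(0, n, size=n)          # bootstrap sample of rows
            a, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
            preds.append(X_new @ a)
        return np.mean(preds, axis=0)                 # averaged (more stable) prediction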

26 Components of Data Mining Algorithms
- Representation: determining the nature and structure of the representation to be used
- Score function: quantifying and comparing how well different representations fit the data
- Search/optimization method: choosing an algorithmic process to optimize the score function
- Data management: deciding what principles of data management are required to implement the algorithms efficiently

27 What’s in a Data Mining Algorithm?
- Task
- Representation
- Score Function
- Search/Optimization
- Data Management
- Models, Parameters

28 Multivariate Linear Regression
- Task: Regression
- Representation: Y = weighted linear sum of the X's
- Score Function: Least squares
- Search/Optimization: Linear algebra
- Data Management: None specified
- Models, Parameters: Regression coefficients

29 Autoregressive Time Series Models
- Task: Time series regression
- Representation: X = weighted linear sum of earlier X's
- Score Function: Least squares
- Search/Optimization: Linear algebra
- Data Management: None specified
- Models, Parameters: Regression coefficients

30 Neural Networks
- Task: Regression
- Representation: Y = non-linear function of the X's
- Score Function: Least squares
- Search/Optimization: Gradient descent
- Data Management: None specified
- Models, Parameters: Network weights

31 Logistic Regression
- Task: Regression
- Representation: Log-odds(Y) = linear function of the X's
- Score Function: Log-likelihood
- Search/Optimization: Iterative (Newton) method
- Data Management: None specified
- Models, Parameters: Logistic weights

32 Software
- MATLAB
  - Many free "toolboxes" on the Web for regression and prediction, e.g., the CompStats toolbox
- R
  - General-purpose statistical computing environment (successor to S)
  - Free(!)
  - Widely used by statisticians; has a huge library of functions and visualization tools
- Commercial tools
  - SAS and other statistical packages
  - Data mining packages
  - Often not programmable: they offer a fixed menu of items

33 Suggested Reading in Text
- Chapter 4: general statistical aspects of model fitting; pages 93 to 116, plus Section 4.7 on sampling
- Chapter 5: "reductionist" view of learning algorithms (can skim this)
- Chapter 6: different functional forms for modeling; pages 165 to 183
- Chapter 8: Section 8.3 on multivariate optimization
- Chapter 11: linear regression and related methods (can skip Section 11.3)

34 Good References
- N. R. Draper and H. Smith, Applied Regression Analysis, 2nd edition, Wiley, 1981 (the "bible" for classical regression methods)
- T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001 (a statistically oriented overview of modern ideas in regression and classification)

