
ICS 278: Data Mining
Lecture 7: Regression Algorithms
Padhraic Smyth
Department of Information and Computer Science
University of California, Irvine

Notation
– Variables X, Y, … with values x, y (lower case)
– Vectors indicated by X
– Components of X indicated by X_j, with values x_j
– "Matrix" data set D with n rows and p columns
  – The jth column contains values for variable X_j
  – The ith row contains a vector of measurements on object i, indicated by x(i)
  – The jth measurement value for the ith object is x_j(i)
– Unknown parameter for a model = θ
  – Can also use other Greek letters, e.g., α, β
  – Vector of parameters = θ

Example: Multivariate Linear Regression
– Task: predict real-valued Y, given a real-valued vector X
– Score function, e.g., S(θ) = Σ_i [ y(i) - f(x(i); θ) ]^2
– Model structure: f(x; θ) = α_0 + Σ_j α_j x_j
– Model parameters = θ = {α_0, α_1, …, α_p}
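A minimal NumPy sketch of this model structure and score function (the function names and the toy data are mine, not from the slides):

```python
import numpy as np

def predict_linear(X, alpha):
    """f(x; theta) = alpha_0 + sum_j alpha_j * x_j, applied to each row of X."""
    # X: (n, p) data matrix; alpha: length p+1, with alpha[0] the intercept alpha_0
    return alpha[0] + X @ alpha[1:]

def squared_error_score(X, y, alpha):
    """S(theta) = sum_i [ y(i) - f(x(i); theta) ]^2."""
    r = y - predict_linear(X, alpha)
    return float(r @ r)

# Toy usage with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + rng.normal(scale=0.1, size=100)
print(squared_error_score(X, y, np.array([2.0, 1.0, -0.5, 0.3])))
```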

S = Σ e^2 = e'e = (y - Xa)'(y - Xa)
  = y'y - a'X'y - y'Xa + a'X'Xa
  = y'y - 2a'X'y + a'X'Xa
Taking the derivative of S with respect to the components of a gives:
  dS/da = -2X'y + 2X'Xa
Set this to 0 to find the extremum (minimum) of S as a function of a:
  -2X'y + 2X'Xa = 0  =>  X'Xa = X'y
Letting X'X = C and X'y = b, we have Ca = b, i.e., a set of linear equations.
We could solve this directly by matrix inversion, i.e., a = C^(-1) b = (X'X)^(-1) X'y,
but there are more numerically stable ways to do this (e.g., LU decomposition).
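A sketch of the closed-form fit in NumPy, assuming a design matrix X (n rows, p columns) and target vector y; np.linalg.lstsq uses an SVD-based solver, which avoids explicitly forming and inverting C = X'X:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares solution of X'X a = X'y, with an intercept column prepended
    so that a[0] plays the role of alpha_0."""
    Xa = np.column_stack([np.ones(len(y)), X])
    # SVD-based solver: more numerically stable than a = inv(X'X) X'y
    a, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    return a

# The direct (less stable) route the slide warns about would be:
#   C = Xa.T @ Xa; b = Xa.T @ y; a = np.linalg.solve(C, b)
```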

Comments on Multivariate Linear Regression
– The prediction is a linear function of the parameters
– The score function is quadratic in the predictions and parameters
  => the derivative of the score is linear in the parameters
  => leads to a linear algebra optimization problem, i.e., Ca = b
– Model structure is simple:
  – A (p-1)-dimensional hyperplane in p dimensions
  – Linear weights => interpretability
– Useful as a baseline model to compare more complex models to

Limitations of Linear Regression
– The true relationship of X and Y might be non-linear
  – Suggests generalizations to non-linear models
– Complexity: O(p^3), which could be a problem for large p
– Correlation/collinearity among the X variables
  – Can cause numerical instability (C may be ill-conditioned)
  – Problems in interpretability (identifiability)
– Includes all variables in the model
  – But what if p = 100 and only 3 variables are actually related to Y?
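An illustrative sketch (the synthetic data are my own construction) of how near-collinear columns make C = X'X ill-conditioned:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = x1 + 1e-6 * rng.normal(size=200)       # x2 is almost a copy of x1
X = np.column_stack([np.ones(200), x1, x2])

C = X.T @ X
print(np.linalg.cond(C))   # enormous condition number: C is ill-conditioned,
                           # so the estimated weights are numerically unstable
```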

Finding the k best variables
– Find the subset of k variables that predicts best
  – This is a generic problem when p is large (it arises with all types of models, not just linear regression)
– Now we have models with different complexity, e.g.,
  – p models with a single variable
  – p(p-1)/2 models with 2 variables, etc.
  – 2^p possible models in total
  – Note that when we add or delete a variable, the optimal weights on the other variables will in general change
– The best subset of k variables is not the same as the best k individual variables
– What does "best" mean here? We return to this later.

Search Problem
– How can we search over all 2^p possible models?
  – Exhaustive search is clearly infeasible
  – Heuristic search is used to search over model space:
    – Forward search (greedy)
    – Backward search (greedy)
    – Generalizations (add or delete); think of these as operators in a search space
    – Branch-and-bound techniques
– This type of variable selection problem is common to many data mining algorithms:
  – An outer loop that searches over variable combinations
  – An inner loop that evaluates each combination
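A sketch of greedy forward search under the squared-error score (function names are mine); for simplicity it scores subsets on the training SSE, whereas the later slides argue that "best" should really be judged on validation data:

```python
import numpy as np

def training_sse(X, y):
    """Training SSE of the least-squares fit on the given columns (plus intercept)."""
    Xa = np.column_stack([np.ones(len(y)), X])
    a, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    r = y - Xa @ a
    return float(r @ r)

def forward_select(X, y, k):
    """Greedily add the variable giving the largest SSE reduction, k times.
    The weights are refit from scratch at each step, since adding a variable
    changes the optimal weights on the variables already selected."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best_j = min(remaining, key=lambda j: training_sse(X[:, selected + [j]], y))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```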

Empirical Learning
– Squared error score (as an example; we could use other scores):
    S(θ) = Σ_i [ y(i) - f(x(i); θ) ]^2
  where S(θ) is defined on the training data D
– We are really interested in finding the f(x; θ) that best predicts y on future data, i.e., minimizing
    E[S] = E[ y - f(x; θ) ]^2
– Empirical learning:
  – Minimize S(θ) on the training data D_train
  – If D_train is large and the model is simple, we are assuming that the best f on the training data is also the best predictor f on future test data D_test

Complexity versus Goodness of Fit
[Figure: a scatterplot of training data (y versus x) shown with three candidate fitted curves, labeled "Too simple?", "Too complex?", and "About right?"]

Complexity and Generalization
[Figure: score function (e.g., squared error) plotted against model complexity (degrees of freedom in the model, e.g., number of variables), with one curve for S_train(θ) and one for S_test(θ); the optimal model complexity is marked where S_test(θ) is lowest.]

Defining what "best" means
– How do we measure "best"?
  – Best performance on the training data? Then k = p (i.e., use all variables) always wins, so this is not useful
  – Note: performance on the training data will in general be optimistic
– Alternatives:
  – Measure performance on a single validation set
  – Measure performance using multiple validation sets (cross-validation)
  – Add a penalty term to the score function that "corrects" for optimism
    – e.g., "regularized" regression: SSE + sum of squared weights
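A sketch of the penalized score in the last bullet: ridge regression, which minimizes SSE plus λ times the sum of squared weights (the closed form and the choice not to penalize the intercept are standard, but the function name is mine):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize  SSE + lam * sum_j alpha_j^2  (intercept alpha_0 not penalized).
    Closed form: a = (X'X + lam*I)^(-1) X'y, solved without an explicit inverse."""
    Xa = np.column_stack([np.ones(len(y)), X])
    penalty = lam * np.eye(Xa.shape[1])
    penalty[0, 0] = 0.0                      # leave the intercept unshrunk
    return np.linalg.solve(Xa.T @ Xa + penalty, Xa.T @ y)
```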

Using Validation Data
– Training data: use this data to find the best θ for each model f_k(x; θ)
– Validation data: use this data to (1) calculate an estimate of S_k(θ) for each f_k(x; θ) and (2) select k* = arg min_k S_k(θ)
– Test data: use this data to calculate an unbiased estimate of S_k*(θ) for the selected model
– This scheme generalizes to cross-validation.
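A sketch of this three-way split in code (the interface is mine): each candidate model k is fit on the training data, k* is chosen by its validation score, and only the test data is used for the final, unbiased score estimate:

```python
import numpy as np

def select_and_assess(candidates, train, val, test):
    """candidates: list of (fit, predict) pairs, one per model f_k(x; theta).
    train/val/test: (X, y) tuples. Returns the chosen index k* and its test score."""
    fitted, val_scores = [], []
    for fit, predict in candidates:
        theta = fit(*train)                       # best theta for this model, on training data
        fitted.append((predict, theta))
        r = val[1] - predict(val[0], theta)       # estimate S_k(theta) on validation data
        val_scores.append(float(r @ r))
    k_star = int(np.argmin(val_scores))           # k* = arg min_k S_k(theta)
    predict, theta = fitted[k_star]
    r = test[1] - predict(test[0], theta)         # unbiased estimate for the selected model
    return k_star, float(r @ r)
```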

Two different (but related) issues here
1. Finding the function f that minimizes S(θ) for future data
2. Getting a good estimate of S(θ), using the chosen function, on future data
   – e.g., we might have selected the best function f, but our estimate of its performance will be optimistically biased if the estimate of the score uses any of the same data that was used to fit and select the model

Non-linear models, linear in parameters
– We can add additional polynomial terms to our equations, e.g., all "2nd-order" terms:
    f(x; θ) = α_0 + Σ_j α_j x_j + Σ_{i,j} α_ij x_i x_j
– Note that this is a non-linear functional form, but it is linear in the parameters (so it is still referred to as "linear regression")
  – We can just treat the x_i x_j terms as additional fixed inputs
  – In fact we can add in any non-linear input functions, e.g.,
    f(x; θ) = α_0 + Σ_j α_j f_j(x)
– Comments:
  – Exactly the same linear algebra for optimization (same math)
  – The number of parameters has now exploded, so there is a greater chance of overfitting
  – Ideally we would like to select only the useful quadratic terms
  – Can generalize this idea to higher-order interactions
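A sketch of the basis-expansion idea: precompute the x_i * x_j products as extra input columns and then reuse exactly the same least-squares machinery (the helper name is mine):

```python
import numpy as np

def add_second_order_terms(X):
    """Append all products x_i * x_j (i <= j) as additional fixed inputs.
    The expanded model is still linear in its parameters, but the number of
    parameters grows on the order of p^2 / 2, raising the risk of overfitting."""
    n, p = X.shape
    cross = [X[:, i] * X[:, j] for i in range(p) for j in range(i, p)]
    return np.column_stack([X] + cross)

# X2 = add_second_order_terms(X)     # then fit exactly as before, e.g. with
# a, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X2)), X2]), y, rcond=None)
```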

Non-linear (both model and parameters)
– We can generalize further to models that are non-linear in all aspects:
    f(x; θ) = α_0 + Σ_k α_k g_k(β_k0 + Σ_j β_kj x_j)
  where the g's are non-linear functions with fixed functional forms
– In machine learning this is called a neural network; in statistics it might be referred to as a generalized linear model or projection-pursuit regression
– For almost any score function of interest, e.g., squared error, the score function is a non-linear function of the parameters, and closed-form (analytical) solutions are rare
– Thus, we have a multivariate non-linear optimization problem (which may be quite difficult!)

Optimization of a non-linear score function
– We seek the minimum of a function in d dimensions, where d is the number of parameters (and d could be large!)
– There are a multitude of heuristic search techniques (see Chapter 8):
  – Steepest descent (follow the gradient)
  – Newton methods (use 2nd-derivative information)
  – Conjugate gradient
  – Line search
  – Stochastic search
  – Genetic algorithms
– Two cases:
  – Convex (nice: there is a single global optimum)
  – Non-convex (multiple local optima, so multiple starts are needed)
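A sketch of steepest descent on the squared-error score for the network form f(x; θ) = α_0 + Σ_k α_k tanh(β_k0 + Σ_j β_kj x_j); the gradients are hand-derived, the step size and iteration count are illustrative guesses, and because the score is non-convex different random starts can land in different local optima:

```python
import numpy as np

def fit_small_net(X, y, n_hidden=3, lr=0.01, n_steps=5000, seed=0):
    """Steepest descent on S(theta) = (1/n) * sum_i [y(i) - f(x(i); theta)]^2
    for f(x) = a0 + sum_k a[k] * tanh(b[k] + W[k] . x)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(scale=0.1, size=(n_hidden, p))   # beta_kj
    b = np.zeros(n_hidden)                          # beta_k0
    a = rng.normal(scale=0.1, size=n_hidden)        # alpha_k
    a0 = float(np.mean(y))                          # alpha_0
    for _ in range(n_steps):
        H = np.tanh(X @ W.T + b)                    # hidden activations, shape (n, n_hidden)
        r = y - (a0 + H @ a)                        # residuals
        # gradients of the mean squared error w.r.t. each parameter block
        g_a0 = -2.0 * r.mean()
        g_a  = -2.0 * (H.T @ r) / n
        back = (r[:, None] * a) * (1.0 - H**2)      # chain rule through tanh
        g_b  = -2.0 * back.sum(axis=0) / n
        g_W  = -2.0 * (back.T @ X) / n
        # follow the negative gradient (steepest descent)
        a0 -= lr * g_a0
        a  -= lr * g_a
        b  -= lr * g_b
        W  -= lr * g_W
    return a0, a, b, W
```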

Other non-linear models
– Splines
  – "Patch" together different low-order polynomials over different parts of the x-space
  – Work well in 1 dimension, less well in higher dimensions
– Memory-based models
    y' = Σ w(x', x) y, where the y's are from the training data and w(x', x) is a function of the distance of x from x'
– Local linear regression
    y' = α_0 + Σ_j α_j x_j, where the α's are fit at prediction time using just the (y, x) pairs that are close to x'
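A sketch of the memory-based predictor, using a Gaussian kernel of the distance as the weight function w(x', x) (the kernel choice and bandwidth are illustrative assumptions, not from the slides):

```python
import numpy as np

def memory_based_predict(x_new, X_train, y_train, bandwidth=1.0):
    """y' = sum_i w(x', x(i)) * y(i), with weights that decay with distance
    and are normalized to sum to one."""
    d2 = np.sum((X_train - x_new)**2, axis=1)      # squared distances to x'
    w = np.exp(-0.5 * d2 / bandwidth**2)           # Gaussian kernel weights
    return float(w @ y_train / w.sum())
```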

To be continued in Lecture 8

Suggested Reading in Text
– Chapter 4: general statistical aspects of model fitting; pages 93 to 116, plus Section 4.7 on sampling
– Chapter 5: "reductionist" view of learning algorithms (can skim this)
– Chapter 6: different functional forms for modeling; pages 165 to 183
– Chapter 8: Section 8.3 on multivariate optimization
– Chapter 11: linear regression and related methods (can skip Section 11.3)

Useful References
– N. R. Draper and H. Smith, Applied Regression Analysis, 2nd edition, Wiley, 1981 (the "bible" for classical regression methods in statistics)
– T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer-Verlag, 2001 (a statistically oriented overview of modern ideas in regression and classification; mixes machine learning and statistics)