Data mining and statistical learning, lecture 5

Outline
- Summary of regressions on correlated inputs
  - Ridge regression
  - PCR (principal components regression)
  - PLS (partial least squares regression)
- Model selection using cross-validation
- Linear classification models
  - Logistic regression
  - Regression on indicator matrices
  - Linear discriminant analysis (LDA)

Ridge regression
The ridge regression coefficients minimize a penalized residual sum of squares:

  \hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}

or, equivalently, they minimize the residual sum of squares subject to the constraint \sum_{j=1}^{p} \beta_j^2 \le t. Normally, the inputs are centred prior to the estimation of the regression coefficients.

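A minimal numerical sketch of this criterion (not from the lecture; data and names are illustrative): with centred inputs and response, the penalized residual sum of squares is minimized by the closed-form solution b = (X'X + λI)^{-1} X'y.

import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution (X'X + lam*I)^{-1} X'y on centred data."""
    Xc = X - X.mean(axis=0)          # centre the inputs
    yc = y - y.mean()                # centre the response
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)   # two strongly correlated inputs
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=50)
print(ridge_coefficients(X, y, lam=1.0))

Unlike ordinary least squares, the matrix X'X + λI is always invertible for λ > 0, which is what stabilizes the coefficients of the two nearly collinear inputs above.
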
Regression methods using derived input directions - Partial Least Squares Regression
Extract linear combinations of the inputs as derived features z_1, ..., z_M, and then model the target (response) y as a linear function of these features.
[Diagram: the inputs x_1, ..., x_p feed into the derived directions z_1, z_2, ..., z_M, which in turn feed into y]
Select the derived directions so that their covariance with the response variable is maximized. Normally, the inputs are standardized to mean zero and variance one prior to the PLS analysis.

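A sketch of the first derived direction (illustrative data): among unit-length weight vectors w, the sample covariance between z = Xw and the response is maximized by w proportional to X'y, which is the first step of the PLS algorithm.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 1.0, 0.0, 0.0, 0.0]) + rng.normal(size=100)

# Standardize inputs to mean zero and variance one, centre the response
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# First PLS direction: w proportional to X'y maximizes cov(Xw, y) over unit-length w
w = Xs.T @ yc
w /= np.linalg.norm(w)
z1 = Xs @ w                       # first derived feature
print(np.corrcoef(z1, yc)[0, 1])  # z1 is strongly related to y
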
PLS vs PCR - absorbance records for chopped meat
In general, PLS models have fewer factors than PCR models.

Common characteristics of ridge regression, PCR, and PLS
- Ridge regression, PCR, and PLS can all handle high-dimensional inputs.
- In contrast to ordinary least squares regression, these methods can be used for prediction even if the number of inputs (x-variables) exceeds the number of cases (see the sketch below).
- For minimizing prediction error, ridge regression, PCR, and PLS are generally preferable to variable subset selection in ordinary least squares regression.

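A quick demonstration of the second point (synthetic data, not from the lecture): with 50 inputs and only 20 cases, X'X is singular and ordinary least squares has no unique solution, but the ridge system is still solvable.

import numpy as np

rng = np.random.default_rng(2)
n, p = 20, 50                     # more inputs than cases
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.0]       # only three inputs truly matter
y = X @ beta + 0.1 * rng.normal(size=n)

# X'X has rank at most n < p, so OLS fails; adding lam*I makes it invertible
lam = 1.0
b = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(b[:5])                      # the large true coefficients dominate
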
Behaviour of ridge regression, PCR, and PLS
- Ridge regression, PCR, and PLS tend to behave similarly.
- Ridge regression shrinks all directions, but shrinks low-variance directions more (quantified below).
- Principal components regression leaves M high-variance directions alone, and discards the rest.
- Partial least squares regression tends to shrink the low-variance directions, but may inflate some of the higher-variance directions.

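The second bullet can be made precise with a standard result (quoted here for reference; the SVD notation X = UDV' is not on the slide): the ridge fit shrinks the component of y along the j-th principal direction u_j by the factor d_j^2/(d_j^2 + \lambda),

  X\hat{\beta}^{\mathrm{ridge}} = \sum_{j=1}^{p} \mathbf{u}_j \, \frac{d_j^{2}}{d_j^{2} + \lambda} \, \mathbf{u}_j^{\top} \mathbf{y}

so directions with small singular values d_j (low variance) are shrunk the most. PCR corresponds to replacing each factor by one for the first M directions and by zero for the rest.
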
Model selection: ordinary cross-validation
Split the data into a training set and a test set. For each model, do the following:
(i) Fit the model to the training set of inputs and responses
(ii) Use the fitted model to predict the response values in the test set and compute the prediction errors
Select the model that produces the smallest PRESS value (PRESS = Prediction Error Sum of Squares).

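A minimal sketch of the procedure (illustrative throughout: synthetic data, a fixed 80/40 train/test split, and ridge fits with different λ values as the competing models):

import numpy as np

def fit_ridge(X, y, lam):
    """Centred ridge fit; returns coefficients and the centring constants."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xc, yc = X - xbar, y - ybar
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return b, xbar, ybar

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=120)

train, test = np.arange(80), np.arange(80, 120)  # fixed train/test split
for lam in [0.01, 1.0, 100.0]:                   # candidate models
    b, xbar, ybar = fit_ridge(X[train], y[train], lam)
    pred = (X[test] - xbar) @ b + ybar
    press = np.sum((y[test] - pred) ** 2)        # prediction error sum of squares
    print(lam, press)                            # pick the model with smallest PRESS
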
Model selection: leave-one-out cross-validation
For each model, do the following:
(i) Leave out one case and fit the model to the remaining data
(ii) Use the fitted model to predict the response value of the case that was left out and compute the prediction error
(iii) Repeat steps (i) and (ii) for all cases and compute the PRESS value (prediction error sum of squares)
Select the model that produces the smallest PRESS value.

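The same idea with the leave-one-out loop made explicit (again an illustrative ridge example):

import numpy as np

def ridge_fit_predict(Xtr, ytr, xnew, lam):
    """Fit a centred ridge model and predict the response at xnew."""
    xbar, ybar = Xtr.mean(axis=0), ytr.mean()
    Xc, yc = Xtr - xbar, ytr - ybar
    b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xtr.shape[1]), Xc.T @ yc)
    return (xnew - xbar) @ b + ybar

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=40)
n = len(y)

for lam in [0.1, 1.0, 10.0]:           # candidate models
    press = 0.0
    for i in range(n):                 # leave out case i, fit, predict case i
        keep = np.arange(n) != i
        pred = ridge_fit_predict(X[keep], y[keep], X[i], lam)
        press += (y[i] - pred) ** 2
    print(lam, press)
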
Model selection: K-fold (block) cross-validation
Divide the data set into K blocks of roughly equal size and do the following for each model:
(i) Leave out one block of cases and fit the model to the remaining data
(ii) Use the fitted model to predict the response values in the block that was left out and compute the sum of squared prediction errors
(iii) Repeat steps (i) and (ii) for all K blocks and compute the PRESS value (prediction error sum of squares)
Select the model that produces the smallest PRESS value.
[Diagram: data divided into blocks 1, ..., j, ..., K; one block at a time is held out]

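A sketch of the K-fold loop with K = 5 (illustrative data and candidate models):

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(size=100)

K = 5
folds = np.array_split(rng.permutation(len(y)), K)  # K blocks of roughly equal size

for lam in [0.1, 1.0, 10.0]:                        # candidate models
    press = 0.0
    for j in range(K):                              # hold out block j
        test = folds[j]
        train = np.concatenate([folds[k] for k in range(K) if k != j])
        xbar, ybar = X[train].mean(axis=0), y[train].mean()
        Xc, yc = X[train] - xbar, y[train] - ybar
        b = np.linalg.solve(Xc.T @ Xc + lam * np.eye(3), Xc.T @ yc)
        press += np.sum((y[test] - ((X[test] - xbar) @ b + ybar)) ** 2)
    print(lam, press)
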
Classification
The task of assigning objects to one of several predefined categories, for example:
- detecting spam among e-mails
- credit scoring
- classifying tumours as malignant or benign

Customer relations management - an example
Consider a database in which 2470 customers have been registered. For each customer the enterprise has recorded a binary response variable Y (Y = 1: multiple purchases, Y = 0: single purchase) and several predictors. We shall model the probability that Y = 1.

Logistic regression for a binary response variable Y - single input
Set p = P(Y = 1). The log of the odds ratio is linear in x:

  \log\frac{p}{1-p} = \beta_0 + \beta_1 x

or, equivalently,

  p = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}

Logistic regression of multiple purchases vs first amount spent
[Figure: fitted logistic regression of multiple purchases on first amount spent]

Logistic regression of multiple purchases vs first amount spent - inference from a model comprising a single input

Response Information
Variable            Value  Count
Multiple_purchases  1         34  (Event)
                    0         66
                    Total    100

Logistic Regression Table
                                                         Odds  95% CI
Predictor             Coef       SE Coef     Z      P   Ratio  Lower  Upper
Constant             -2.50310   0.450895  -5.55  0.000
First_amount_spent  0.0014381  0.0003063   4.69  0.000    1.00   1.00   1.00

Log-Likelihood = -43.215
Test that all slopes are zero: G = 41.776, DF = 1, P-Value = 0.000

Logistic regression for a binary response variable - multiple inputs
Consider a binary response variable Y and set p = P(Y = 1). Assume that the log odds ratio is a linear function of m predictors x_1, ..., x_m:

  \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_m x_m

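A sketch of fitting such a model to synthetic data (statsmodels is an assumed software choice here; its output resembles the coefficient table shown earlier):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 500
X = rng.normal(size=(n, 2))                   # two predictors x1, x2
logit = -0.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]  # true linear log odds
p = 1.0 / (1.0 + np.exp(-logit))
y = rng.binomial(1, p)                        # binary responses

model = sm.Logit(y, sm.add_constant(X)).fit() # maximum likelihood fit
print(model.summary())                        # coefficients, SEs, z, p-values
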
Logistic regression - inference from a model comprising two inputs
The estimated coefficient -1.19297 represents the change in the log of P(low pulse)/P(high pulse) when the subject smokes compared to when he/she does not smoke, with the covariate Weight held constant. The odds of a smoker in the sample having a low pulse are about 30% of the odds of a non-smoker having a low pulse (exp(-1.19297) ≈ 0.30).

Logistic regression for an ordinal response variable Y
For an ordinal response with levels 1, ..., J, the standard cumulative-logit formulation models each cumulative log odds as a linear function of the inputs:

  \log\frac{P(Y \le j)}{1 - P(Y \le j)} = \alpha_j + \beta_1 x_1 + \dots + \beta_m x_m, \qquad j = 1, \dots, J-1

Classification using logistic regression
Assign the object to the class k that maximizes the estimated class probability \hat{P}(Y = k \mid x).

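A sketch of this rule for a three-class problem (illustrative; scikit-learn's multinomial logistic regression is an assumed software choice, and the data are synthetic):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
# Three Gaussian classes in the plane
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [3, 0], [0, 3])])
y = np.repeat([0, 1, 2], 50)

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba([[2.0, 1.0]])   # estimated P(Y = k | x) for each class k
print(probs, probs.argmax())              # assign to the class maximizing it
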
Regression of an indicator matrix
Find a linear function f_1(x) which is (on average) one for objects in class 1 and otherwise (on average) zero, and a linear function f_2(x) which is (on average) one for objects in class 2 and otherwise (on average) zero. Assign a new object to class 1 if \hat{f}_1(x) > \hat{f}_2(x).

3D-plot of an indicator matrix for class 1 [figure]

3D-plot of an indicator matrix for class 2 [figure]

Regression of an indicator matrix - discriminating function
Estimate a discriminant function for each class, and then classify a new object to the class with the largest value of its discriminant function.

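A sketch of the whole procedure for two classes (illustrative data and names): the indicator matrix has one 0/1 column per class, each column is fitted by ordinary least squares, and a new object is assigned to the class with the largest fitted value.

import numpy as np

rng = np.random.default_rng(8)
# Two Gaussian classes in the plane
X = np.vstack([rng.normal(loc=[0, 0], size=(50, 2)),
               rng.normal(loc=[2, 2], size=(50, 2))])
labels = np.repeat([0, 1], 50)

Y = np.eye(2)[labels]                        # indicator matrix: one 0/1 column per class
Xa = np.hstack([np.ones((100, 1)), X])       # add an intercept column
B, *_ = np.linalg.lstsq(Xa, Y, rcond=None)   # least squares fit of each indicator column

xnew = np.array([1.0, 0.5, 0.5])             # intercept term plus a new point (0.5, 0.5)
fhat = xnew @ B                              # fitted values f1(x), f2(x)
print(fhat, fhat.argmax())                   # assign to the class with the largest value
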
Linear discriminant analysis (LDA)
LDA is an optimal classification method when the data arise from Gaussian distributions with different means and a common covariance matrix.

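A sketch using scikit-learn's implementation (an assumed software choice; the lecture's own recommendation, SAS Proc DISCRIM, appears on the next slide). The data are simulated from LDA's ideal setting: two Gaussians with different means and a common covariance matrix.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(9)
cov = np.array([[1.0, 0.5], [0.5, 1.0]])     # common covariance matrix
X = np.vstack([rng.multivariate_normal([0, 0], cov, size=50),
               rng.multivariate_normal([2, 1], cov, size=50)])
y = np.repeat([0, 1], 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.predict([[1.0, 0.5]]))             # class assignment for a new point
print(lda.predict_proba([[1.0, 0.5]]))       # estimated posterior probabilities
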
Software recommendation: SAS Proc DISCRIM

Proc discrim data=mining.lda;
  CLASS class;
  VAR x1 x2;
Run;