PLS: Partial Least Squares, but also Projection on Latent Variables
STEP 1 First marginal (partial) regression through the origin, of X1 on Y
Second marginal regression through the origin, of X2 on Y
STEP 2 Covariance and direction cosines: the marginal slopes w, after normalization, describe the first LATENT VARIABLE
STEP 3 The scores t on the latent variable are computed
STEP 4 Regression of the response variable y on the latent variable t. This regression gives the model and the residuals, which are then used to compute the next latent variables.
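The four steps above can be sketched in plain Python for a single latent variable and a single response. This is only a minimal sketch: the function name `pls_one_component`, the mean-centering of the data before the regressions through the origin, and the toy data are assumptions made for illustration, and the deflation of the predictors needed for further latent variables is not shown.

```python
import math

def pls_one_component(X, y):
    """One PLS latent variable for a single response (illustrative sketch).

    X is a list of rows, y a list of responses. Columns of X and y are
    mean-centered first (an assumption of this sketch), so the regressions
    through the origin below act on centered data.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]

    # Step 1: marginal slope of each predictor on y, through the origin.
    syy = sum(value * value for value in yc)
    d = [sum(Xc[i][j] * yc[i] for i in range(n)) / syy for j in range(p)]

    # Step 2: normalize the slopes to unit length (direction cosines).
    norm = math.sqrt(sum(value * value for value in d))
    w = [value / norm for value in d]

    # Step 3: scores of each object on the latent variable.
    t = [sum(w[j] * Xc[i][j] for j in range(p)) for i in range(n)]

    # Step 4: regress y on t through the origin and form the y residuals.
    f = sum(t[i] * yc[i] for i in range(n)) / sum(value * value for value in t)
    residuals = [yc[i] - f * t[i] for i in range(n)]

    # Slopes and intercept expressed in the original predictors.
    b = [f * wj for wj in w]
    a = y_mean - sum(b[j] * x_mean[j] for j in range(p))
    return w, t, f, b, a, residuals

# Toy data invented for illustration: two identical predictors, y = 5 * x.
X_toy = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
y_toy = [5.0, 10.0, 20.0]
w, t, f, b, a, residuals = pls_one_component(X_toy, y_toy)
print("weights:", w)
print("slopes:", b, "intercept:", a)
```

With two identical predictors the weights come out equal and the slope is shared equally between them.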
KNOWLEDGE of DATA (EXPERIENCE)
A very simple numerical example:

Object   Predictor 1   Predictor 2   Response
1            5.363         5.360       26.810
2            9.979         9.974       49.880
3           35.447        35.440      177.210
4           36.040        36.041      180.180
5           52.107        52.109      260.549
6           72.069        72.069      360.360

You can easily see that the two predictors are about equal. In fact they were obtained from two repetitions of the measurement of the same quantity.
Strategy A
We know that there is really only one predictor, measured twice, so we decide to use only the first measurement. With the method of least squares we compute slope and intercept. This is the regression model: y = a + b x1 with a = 0.01380 and b = 4.99966
With $\sigma^2$ the variance of the predictor (obviously the same for the two predictors), the variance of the estimate of the response is, by the law of propagation of variances: $\sigma_y^2 = b^2 \sigma^2 \approx 25\,\sigma^2$
Strategy A: we used only the chemical knowledge. $\sigma_y^2 \approx 25\,\sigma^2$
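Strategy A can be reproduced with a few lines of plain Python. This is a sketch under stated assumptions: the helper `fit_line` is a hypothetical name, and the data are the rounded values from the table, so the fitted intercept and slope need not agree with the quoted a and b to every digit. The factor multiplying the predictor variance stays close to 25 either way.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

# First predictor and response, copied from the table of the example.
x1 = [5.363, 9.979, 35.447, 36.040, 52.107, 72.069]
y = [26.810, 49.880, 177.210, 180.180, 260.549, 360.360]

a, b = fit_line(x1, y)

# Propagation of variances: var(y_hat) = b^2 * sigma^2.
variance_factor_a = b ** 2
print("a =", a, "b =", b)
print("variance factor =", variance_factor_a)
```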
Strategy B
We know that the mean of two repetitions has a variance one half that of a single repetition. So we decide to use as the single predictor the mean m of the two determinations. With the method of least squares we compute slope and intercept. This is the regression model: y = a + b m with a = -0.00685 and b = 5.00013
With $\sigma^2$ the variance of the predictors, the variance of the mean is $\sigma^2/2$. So the variance of the response is computed as: $\sigma_y^2 = b^2 \sigma^2/2 \approx 12.5\,\sigma^2$
With Strategy B we used both the chemical knowledge and the knowledge of statistics. $\sigma_y^2 \approx 12.5\,\sigma^2$
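Strategy B differs from Strategy A only in the predictor, which is now the mean of the two determinations. The sketch below makes the same assumptions as before (hypothetical helper `fit_line`, rounded table values), so small differences from the quoted a and b are possible.

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

x1 = [5.363, 9.979, 35.447, 36.040, 52.107, 72.069]
x2 = [5.360, 9.974, 35.440, 36.041, 52.109, 72.069]
y = [26.810, 49.880, 177.210, 180.180, 260.549, 360.360]

# Mean of the two determinations, object by object.
m = [(p + q) / 2 for p, q in zip(x1, x2)]

a, b = fit_line(m, y)

# The mean of two repetitions has variance sigma^2 / 2,
# so var(y_hat) = b^2 * sigma^2 / 2.
variance_factor_b = b ** 2 / 2
print("a =", a, "b =", b)
print("variance factor =", variance_factor_b)
```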
STRATEGY C
We use least-squares multiple regression (MLR or OLS). With the method of least squares we compute two slopes and the intercept. This is the regression model: y = a + b1 x1 + b2 x2 with a = 0.0128, b1 = 1.5019 and b2 = 3.49844
The variance of the estimate of the response is obtained from the law of propagation of variances as: $\sigma_y^2 = b_1^2 \sigma^2 + b_2^2 \sigma^2 \approx 14.5\,\sigma^2$
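The OLS fit with two predictors can be sketched by solving the 2x2 normal equations on centered data. The function name `fit_two_predictors` is hypothetical, and because the two predictors are almost collinear the individual slopes are very sensitive to rounding in the table, so they may not match the quoted b1 and b2 closely. Their sum is much more stable.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = a + b1 x1 + b2 x2 via centered normal equations."""
    n = len(y)
    mean1, mean2, mean_y = sum(x1) / n, sum(x2) / n, sum(y) / n
    u = [value - mean1 for value in x1]
    v = [value - mean2 for value in x2]
    r = [value - mean_y for value in y]

    s11 = sum(p * p for p in u)
    s22 = sum(q * q for q in v)
    s12 = sum(p * q for p, q in zip(u, v))
    s1y = sum(p * q for p, q in zip(u, r))
    s2y = sum(p * q for p, q in zip(v, r))

    # The determinant is close to zero when the predictors are nearly collinear.
    det = s11 * s22 - s12 * s12
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = mean_y - b1 * mean1 - b2 * mean2
    return a, b1, b2

x1 = [5.363, 9.979, 35.447, 36.040, 52.107, 72.069]
x2 = [5.360, 9.974, 35.440, 36.041, 52.109, 72.069]
y = [26.810, 49.880, 177.210, 180.180, 260.549, 360.360]

a, b1, b2 = fit_two_predictors(x1, x2, y)

# Propagation of variances: var(y_hat) = (b1^2 + b2^2) * sigma^2.
variance_factor_c = b1 ** 2 + b2 ** 2
print("a =", a, "b1 =", b1, "b2 =", b2)
print("b1 + b2 =", b1 + b2)
print("variance factor =", variance_factor_c)
```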
We were very lucky!!! In its effort to minimize the sum of the squares of the residuals, OLS can do even worse. In fact, note that the sum of the two slopes, b1 = 1.5019 and b2 = 3.49844, is 5.00034, about the same as the single slope obtained with strategies A and B. Apparently, with two almost equal predictors, what matters is the sum of the slopes: it must be about 5. So the result b1 = 15 and b2 = -10 seems acceptable, BUT …….
$b_1^2 = 15^2 = 225$ and $b_2^2 = (-10)^2 = 100$, so $\sigma_y^2 = b_1^2 \sigma^2 + b_2^2 \sigma^2 = 325\,\sigma^2$
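The comparison between slope pairs with the same sum can be checked with a short sketch. The helper `variance_factor` is a hypothetical name. It assumes, as the slides do, that the predictor errors are independent and share the same variance.

```python
def variance_factor(slopes):
    """Factor k in var(y_hat) = k * sigma^2.

    Assumes independent predictor errors with a common variance sigma^2.
    """
    return sum(b * b for b in slopes)

# Slope pairs that all sum to 5, as in the discussion above.
for slopes in ([2.5, 2.5], [1.5, 3.5], [15.0, -10.0]):
    print(slopes, "->", variance_factor(slopes))
```

For a fixed sum of the slopes, the factor is smallest when the slopes are equal and grows quickly as they move apart.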
Conclusion: OLS, using all the experimental information, never gives a model better than that of Strategy B (knowledge of data and of statistics), and the result can be worse than that obtained from Strategy A, which uses only a fraction of the information.
Strategy PLS First step: Regression of each of the two predictors on the response variable:
x1 = c1 + d1 y with c1 = -0.00276, d1 = 0.20001
x2 = c2 + d2 y with c2 = 0.00550, d2 = 0.19998
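The first PLS step swaps the roles of predictor and response: each predictor is regressed on y. A minimal sketch, assuming a hypothetical helper `fit_line` and the rounded table values (so the last digits may differ from those quoted):

```python
def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope

x1 = [5.363, 9.979, 35.447, 36.040, 52.107, 72.069]
x2 = [5.360, 9.974, 35.440, 36.041, 52.109, 72.069]
y = [26.810, 49.880, 177.210, 180.180, 260.549, 360.360]

# Note the argument order: y is the regressor, the predictor is the regressand.
c1, d1 = fit_line(y, x1)
c2, d2 = fit_line(y, x2)
print("c1 =", c1, "d1 =", d1)
print("c2 =", c2, "d2 =", d2)
```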
Strategy PLS Second step: Normalization of the slopes to unit length, $w_j = d_j / \sqrt{d_1^2 + d_2^2}$. Result: w1 = 0.70716, w2 = 0.70705
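The normalization can be checked directly from the two slopes quoted in the first step. This sketch assumes that normalization means scaling the slope vector to unit Euclidean length, which reproduces the weights given above.

```python
import math

# Slopes of the predictors on the response, as quoted in the first step.
d1, d2 = 0.20001, 0.19998

# Scale the slope vector to unit length.
norm = math.sqrt(d1 ** 2 + d2 ** 2)
w1, w2 = d1 / norm, d2 / norm
print("w1 =", round(w1, 5), "w2 =", round(w2, 5))
```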
Strategy PLS Step 3 Definition of a LATENT VARIABLE, a combination of the two predictors by means of the coefficients w: t = w1 x1 + w2 x2
Strategy PLS STEP 4 Regression of the response on the latent variable. We obtain the regression model as a function of the latent variable: y = e + f t with e = -0.00685 and f = 3.53564
From y = e + f t, taking into account that t = w1 x1 + w2 x2 = 0.70716 x1 + 0.70705 x2 and that f = 3.53564, we obtain: y = -0.00685 + 2.50026 x1 + 2.49987 x2 (PLS closed form)
Finally, from y = -0.00685 + 2.50026 x1 + 2.49987 x2 we can compute the variance of the response: $\sigma_y^2 = b_1^2 \sigma^2 + b_2^2 \sigma^2 = (6.2513 + 6.2494)\,\sigma^2 = 12.5007\,\sigma^2$
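The closed form and its variance factor follow from the quoted values of f, w1 and w2 by simple arithmetic, sketched here. The variable names are chosen for illustration only.

```python
# Values quoted in steps 2 and 4 of the PLS strategy.
e = -0.00685
f = 3.53564
w = [0.70716, 0.70705]

# Substituting t = w1 x1 + w2 x2 into y = e + f t gives one slope per predictor.
b = [f * wj for wj in w]

# Propagation of variances: var(y_hat) = (b1^2 + b2^2) * sigma^2.
variance_factor_pls = sum(bj * bj for bj in b)

print("y = %.5f + %.5f x1 + %.5f x2" % (e, b[0], b[1]))
print("variance factor = %.4f" % variance_factor_pls)
```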
The PLS model gives the same uncertainty on the response as that of Strategy B (knowledge of data and statistics, use of all the information). PLS “understands” that the two predictors have the same importance (slopes more or less equal). PLS is an intelligent technique.