Seasonal Forecasting Using the Climate Predictability Tool Principal Components Regression Simon Mason simon@iri.columbia.edu
Linear Regression in CPT In CPT, linear regression is performed using the MLR (multiple linear regression) option, which allows for more than one predictor, say k of them. But what happens when we have lots of predictors (k is large)? …
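As a rough illustration of what a multiple linear regression with k predictors does, the sketch below fits y = b0 + b1*x1 + ... + bk*xk by ordinary least squares with NumPy. It is not CPT's own code: the function names `fit_mlr` and `predict_mlr`, the symbols b0..bk, and the random data are all assumptions made for this example.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ... + bk*xk.

    X has one row per year and one column per predictor.
    Returns the intercept b0 and the array of k slopes.
    """
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # ordinary least squares
    return coef[0], coef[1:]

def predict_mlr(b0, b, X):
    """Apply the fitted equation to new predictor values."""
    return b0 + X @ b

# Synthetic data for illustration only (not real SSTs or rainfall):
# 50 "years" and k = 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([0.5, -0.2, 0.0]) + rng.normal(scale=0.1, size=50)

b0, b = fit_mlr(X, y)
print("intercept:", b0)
print("slopes:", b)
```

With only 50 years of data, every extra column in X adds another coefficient to estimate, which is where the problems with large k begin.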
Problems with Multiple Linear Regression (MLR) Multicollinearity: predictors are strongly correlated. Predicting MAM 1961 – 2010 rainfall for Thailand from NIÑO4 SSTs: the correlation between NINO4Jan and NINO4Feb is 0.97. The correlation between June and July SST is 0.88. For the first half of the data (1961 – 1985) only:
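The effect of strongly correlated predictors can be sketched with synthetic data. The example below builds two predictors with a correlation of roughly 0.97 and compares the regression coefficients fitted on the first half of the record with those fitted on the full record. The data are random numbers, not the NIÑO4 indices. The variance inflation factor, 1 / (1 - r^2), is a standard diagnostic that does not appear in the slides and is added here only as an illustration.

```python
import numpy as np

def fit_coefs(X, y):
    """Return the least-squares slopes (intercept fitted but dropped)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1:]

def variance_inflation(r):
    """Variance inflation factor for two predictors with correlation r."""
    return 1.0 / (1.0 - r ** 2)

# Synthetic illustration only: 50 "years", two nearly identical predictors.
rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 0.25 * rng.normal(size=n)   # population correlation is about 0.97
y = x1 + rng.normal(size=n)           # the predictand depends on x1 only
X = np.column_stack([x1, x2])

r = np.corrcoef(x1, x2)[0, 1]
coefs_first_half = fit_coefs(X[:25], y[:25])
coefs_full = fit_coefs(X, y)

print("sample correlation:", r)
print("slopes, first half:", coefs_first_half)
print("slopes, full record:", coefs_full)
print("variance inflation at r = 0.97:", variance_inflation(0.97))
```

When the two slopes change a lot between the half record and the full record while their sum stays fairly stable, the regression is struggling to divide the signal between two predictors that carry almost the same information.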
Problems with Multiple Linear Regression Multiplicity: too many predictors from which to choose. With more than a handful of candidate predictors, the probability of including at least one spurious predictor (and therefore of subsequently making a bad prediction) becomes very high.
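How quickly this probability grows can be sketched with a simple calculation. Assume, purely for illustration, that each of k candidate predictors is unrelated to the predictand, that the candidates are independent of one another, and that each is screened at a 5% significance level (none of these assumptions comes from the slides). The chance that at least one passes the screen by accident is then 1 - (1 - 0.05)^k.

```python
def prob_at_least_one_spurious(k, alpha=0.05):
    """Chance that at least one of k unrelated, independent candidate
    predictors passes a significance test at level alpha by accident."""
    return 1.0 - (1.0 - alpha) ** k

for k in (1, 5, 10, 20, 50):
    print(k, round(prob_at_least_one_spurious(k), 3))
```

Under these assumptions, ten candidates already give about a 40% chance of at least one accidental "hit". Real candidate predictors are rarely independent, so the numbers are indicative only.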
Exercise Using the NINO indices, how well can we predict rainfall over Thailand at increasing lead-times? Create a file combining two or more lead-times as separate predictors. Repeat the calculations using this new file. Does the skill improve? Compare the regression equation for the three predictors with the equations for the three months individually. Now try calculating a seasonal average of the predictors for lead-times of interest. Which gives the best results: one-month predictors, predictors for multiple months, or seasonal predictors?
Principal Components Each principal component is defined much like a weighted average of the original data. If the “weights” summed to 1.0, the principal components would be true weighted averages. Instead, the squares of the weights are made to sum to 1.0, so that the variance of the original data is retained.
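The two properties described above (squared weights summing to 1.0, and the total variance being retained) can be checked numerically. The sketch below computes principal components from the eigenvectors of the covariance matrix. The function name `principal_components` and the random data are assumptions for this example, and CPT's own implementation may differ in detail.

```python
import numpy as np

def principal_components(X):
    """Principal components of X (rows = years, columns = variables).

    Returns the weights (one column per component), the scores (one
    column per component) and the variance of each component, ordered
    from largest to smallest variance.
    """
    Xc = X - X.mean(axis=0)                  # work with anomalies
    cov = np.cov(Xc, rowvar=False)           # covariance between variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # largest variance first
    weights = eigvecs[:, order]
    scores = Xc @ weights                    # weighted sums of the anomalies
    return weights, scores, eigvals[order]

# Synthetic illustration only: 40 "years" and 5 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
weights, scores, variances = principal_components(X)

print("sum of squared weights per component:", (weights ** 2).sum(axis=0))
print("total variance of original data:", X.var(axis=0, ddof=1).sum())
print("total variance of components:", scores.var(axis=0, ddof=1).sum())
```

The last two printed values agree: the components redistribute the variance of the original variables without adding or losing any.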
Principal Components Regression Instead of using the original data as predictors, we can use the principal components as predictors in the same simple regression model. Each principal component contains information from many of the original predictors, and so a complex MLR model can be simplified considerably.
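A minimal sketch of principal components regression, assuming synthetic data and hypothetical function names (`fit_pcr`, `predict_pcr`): the predictor field is reduced to a few leading modes with a singular value decomposition, and the predictand is then regressed on the scores of those modes. This illustrates the idea only and is not CPT's implementation.

```python
import numpy as np

def fit_pcr(X, y, n_modes):
    """Regress y on the scores of the leading n_modes principal components of X."""
    x_mean = X.mean(axis=0)
    Xc = X - x_mean                                    # anomalies
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)  # rows of vt are the modes
    loadings = vt[:n_modes].T                          # (n_gridboxes, n_modes)
    scores = Xc @ loadings                             # (n_years, n_modes)
    A = np.column_stack([np.ones(len(y)), scores])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return {"x_mean": x_mean, "loadings": loadings, "coef": coef}

def predict_pcr(model, X_new):
    """Project new fields onto the retained modes and apply the regression."""
    scores = (X_new - model["x_mean"]) @ model["loadings"]
    return model["coef"][0] + scores @ model["coef"][1:]

# Synthetic illustration only: 40 "years" of a 200-gridbox field that
# shares one large-scale signal with the predictand.
rng = np.random.default_rng(2)
n_years, n_grid = 40, 200
signal = rng.normal(size=n_years)
pattern = rng.normal(size=n_grid)
X = np.outer(signal, pattern) + 0.5 * rng.normal(size=(n_years, n_grid))
y = 2.0 * signal + 0.5 * rng.normal(size=n_years)

model = fit_pcr(X, y, n_modes=2)
fitted = predict_pcr(model, X)
print("regression coefficients:", model["coef"])
print("correlation of fitted and observed:", np.corrcoef(fitted, y)[0, 1])
```

In this example there are more gridboxes (200) than years (40), so an MLR with every gridbox as a predictor could not be fitted uniquely, whereas the PCR model estimates only three coefficients. The correlation printed here is an in-sample fit, not a measure of forecast skill.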
Principal Components A principal component is a weighted sum of a set of original variables, with the weights set so that the principal component has maximum variance. Instead of using all the gridboxes as individual predictors, we define a spatial pattern (or “mode” or “principal component”) of, in this case, SSTs, and then calculate how similar the observed pattern of SSTs is to this mode. We then use this measure of similarity as the predictor. So in the example shown, the predictor indicates whether we have large-scale warming in the central Pacific Ocean (i.e., something akin to El Niño). If there is large-scale warming, the predictor scores strongly positive (e.g., as in 1997); if the observed SSTs are the opposite of the mode, the predictor scores strongly negative; if the observed SSTs do not resemble the mode at all, the predictor scores zero. A completely different spatial pattern can then be defined to give another predictor, and so on. The modes are defined to have as much variance as possible, which essentially means that the observed SSTs resemble the modes in as many of the years as possible. We can therefore use only a few modes to represent much of the total variability in the original gridded dataset. Representing the original data with only a few modes gives us a small number of predictors in our prediction model, which considerably reduces the multiplicity problem. In addition, because the modes are uncorrelated with each other (the patterns of SST are defined to be completely different), we eliminate the multicollinearity problem. Scores and loadings for first principal component of February 1961 – 2000 sea-surface temperatures.
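The "measure of similarity" described above can be sketched as a projection of the observed anomaly field onto the mode. The four-gridbox fields below are made up for illustration, and `mode_score` is a hypothetical helper, not a CPT function.

```python
import numpy as np

def mode_score(anomaly_field, mode):
    """Score of one year's anomaly field on a mode: the sum of the
    gridbox anomalies weighted by the mode's loadings."""
    return float(np.dot(anomaly_field, mode))

# A made-up mode over four gridboxes: uniform warming.
# The squared loadings sum to 1.0.
mode = np.array([0.5, 0.5, 0.5, 0.5])

warm_year = np.array([1.0, 1.0, 1.0, 1.0])         # resembles the mode
cold_year = -warm_year                             # opposite of the mode
unrelated_year = np.array([1.0, -1.0, 1.0, -1.0])  # no resemblance to the mode

print(mode_score(warm_year, mode))       # 2.0
print(mode_score(cold_year, mode))       # -2.0
print(mode_score(unrelated_year, mode))  # 0.0
```

A year that matches the pattern scores positive, a year with the reverse pattern scores negative, and a year whose anomalies cancel across the pattern scores zero.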
Principal Components The score indicates how intensely developed the loading pattern is for each year.
Principal Components Separate patterns (“modes”) of variability can be defined. We can use just a few of these modes to represent the SST variability throughout the domain. Scores and loadings for second principal component of February 1961 – 2000 sea-surface temperatures.
Why PCR? When using principal components of sea-surface temperatures, the components have desirable features: they explain maximum amounts of variance, and therefore are representative of sea temperature variability over large areas; they are uncorrelated, and so errors in estimating the regression parameters are minimized; only a few need be retained, and so the dangers of fishing (searching through many candidate predictors until one appears to work) are minimized.
Summary Multiple regression has two serious problems: multicollinearity: if predictors are correlated, the coefficients become difficult to understand and can be very sensitive to the sample; multiplicity: if there are lots of predictors, the chances of one or more of them working well by accident become very large. Principal components regression can resolve the multicollinearity problem; it can reduce the multiplicity problem.
Exercise Use gridded SSTs to predict Thailand rainfall. What considerations can we apply for selecting an appropriate SST domain and setting the number of modes?
CPT Help Desk web: iri.columbia.edu/cpt/ @climatesociety cpt@iri.columbia.edu @climatesociety …/climatesociety