Principal Component Analysis
What is it?
Why use it?
–Filter on your data
–Gain insight on important processes
The PCA Machinery
–How to do it
–Examples (Matlab demo: script on website)
Things to keep in mind
Caveats
Principal Component Analysis (PCA)
Principal Component Analysis is one way to find order in a data set.
PCA is a way to represent your data in a very compact form by identifying the most frequently recurring (energetic) spatial structures in the data, and projecting the data onto these structures.
PCA can, sometimes, be a way to identify true physical modes of the system.
PCA is also known as Factor Analysis, Empirical Orthogonal Function (EOF) Analysis, and a host of other names, depending on the discipline you were raised in.
Principal Component Analysis (PCA)
PCA of a data set Z results in a series of patterns (empirical orthogonal functions, EOFs) with accompanying time series (principal components, PCs) that together are an alternative, exact representation of the data set.
–Let's say we have M data points (locations) at N times.
The EOFs are orthogonal to each other, and so too are the PCs.
Formally, PCA is the eigenanalysis of the covariance matrix C:
–The EOFs are the eigenvectors E of C.
PCA (cont)
The EOFs (aka empirical modes) contain all of the variance and structure (covariance) in the data.
Since the EOFs are orthogonal, the eigenvalues tell you how much of the total variance in the data set is explained by each mode (pattern).
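The statement about explained variance can be written as a one-line formula. The symbols here are assumptions introduced for illustration: \lambda_k is the k-th eigenvalue of the covariance matrix C, and f_k is the fraction of total variance explained by mode k.

```latex
f_k = \frac{\lambda_k}{\sum_{j=1}^{M} \lambda_j},
\qquad
\sum_{k=1}^{M} f_k = 1
```

For example, if the eigenvalues were 6, 3 and 1, the leading mode would explain 6/10 = 60% of the total variance.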
PCA Application #1
Since PCA selects for patterns that explain all the variance and covariance in the data, the EOFs with the largest variance explained also tend to have large spatial scales.
–EOFs with small variance also tend to have localized patterns. The common assumption is that these are unwanted noise.
Hence, reconstituting the data using only the modes with the largest variance, and truncating the sum at, say, 90% of the total variance, is a way to filter out small-scale features that are assumed to be uninteresting noise.
A filter on the data to reduce small-scale (unwanted) variance or instrumental noise
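The filtering idea can be sketched in a few lines of NumPy. This is only an illustration, not the Matlab demo referred to in these slides: the function name `pca_filter`, the synthetic sine pattern, the noise level, and the 1/N covariance normalization are all assumptions made for the example.

```python
import numpy as np

def pca_filter(Z, n_modes):
    """Rebuild an M x N data matrix from its n_modes leading EOFs (hypothetical helper)."""
    Z = np.asarray(Z, dtype=float)
    time_mean = Z.mean(axis=1, keepdims=True)      # mean at each location
    anomalies = Z - time_mean
    C = anomalies @ anomalies.T / Z.shape[1]       # 1/N normalization is an assumption
    eigvals, E = np.linalg.eigh(C)                 # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # largest variance first
    E_k = E[:, order[:n_modes]]                    # leading EOFs
    P_k = E_k.T @ anomalies                        # their PCs
    return time_mean + E_k @ P_k                   # truncated reconstruction

# Synthetic data: one large-scale pattern plus small-scale noise (made-up values).
rng = np.random.default_rng(2)
pattern = np.sin(np.linspace(0.0, np.pi, 20))             # M = 20 locations
t = np.arange(500)                                        # N = 500 times
signal = np.outer(pattern, np.cos(2.0 * np.pi * t / 50.0))
Z = signal + 0.1 * rng.standard_normal(signal.shape)

Z_filtered = pca_filter(Z, n_modes=1)
```

In this contrived case the single retained mode is closer to the noise-free signal than the raw data are. With real data the discarded modes may contain signal as well as noise, which is why the "assumed to be uninteresting" qualifier matters.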
PCA Application #2
When there is a lot of structure in the data set, it takes only a few EOFs to express most of the variance and covariance in the whole data set.
–For example, it might take only two patterns to explain 90% of the variability in the whole data set.
–The PCs of these leading EOFs can then be analyzed to ascertain the temporal properties of these special patterns.
A way to identify special (physically meaningful) structures in a big data set
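PCA Machinery
Say we have a data set stored in the matrix Z. The data are gathered at M locations, and at N time increments:
$$ Z = \begin{pmatrix} z_{11} & z_{12} & \cdots & z_{1N} \\ z_{21} & z_{22} & \cdots & z_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ z_{M1} & z_{M2} & \cdots & z_{MN} \end{pmatrix} $$
After removing the time mean at each point, the covariance matrix of Z is C (an M x M matrix; some authors normalize by N - 1 instead of N):
$$ C = \frac{1}{N} Z Z^{T} $$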
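A minimal NumPy sketch of this step, assuming Z is stored with one row per location and one column per time, and assuming a 1/N normalization (1/(N-1) is also common). The helper name `covariance_matrix` and the random test data are hypothetical.

```python
import numpy as np

def covariance_matrix(Z):
    """Return the anomalies and the M x M spatial covariance matrix of an M x N array."""
    Z = np.asarray(Z, dtype=float)
    # Remove the time mean at each location (each row).
    anomalies = Z - Z.mean(axis=1, keepdims=True)
    n_times = Z.shape[1]
    # Assumption: normalize by N rather than N - 1.
    C = anomalies @ anomalies.T / n_times
    return anomalies, C

rng = np.random.default_rng(0)
Z = rng.standard_normal((4, 200))    # made-up data: M = 4 locations, N = 200 times
anomalies, C = covariance_matrix(Z)
```

The result should agree with `np.cov(Z, bias=True)`, which also treats rows as variables and normalizes by N.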
PCA Machinery
The eigenvectors (EOFs) are the eigenmodes of C.
The eigenvalues express the amount of variance in each orthogonal eigenmode.
–The sum of the eigenvalues is the total variance in the data.
The data can be expressed in terms of the EOFs E and PCs P (more later):
$$ Z = E P $$
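PCA Machinery
The EOFs (E) and PCs P are of the form
$$ E = \begin{pmatrix} e_1 & e_2 & \cdots & e_M \end{pmatrix}, \qquad P = \begin{pmatrix} p_1^{T} \\ p_2^{T} \\ \vdots \\ p_M^{T} \end{pmatrix} $$
where each column e_k of E (length M) is one EOF and each row p_k^T of P (length N) is the corresponding PC time series.
The PCs P are found by projecting the data onto the eigenvectors:
$$ P = E^{T} Z $$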
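The whole machinery fits in a short NumPy sketch. It is an illustration under stated assumptions, not the Matlab demo: the function name `pca`, the random test data, the row-per-location layout of Z, and the 1/N normalization are all choices made here.

```python
import numpy as np

def pca(Z):
    """Eigenanalysis of the covariance matrix of an M x N data array (hypothetical helper)."""
    Z = np.asarray(Z, dtype=float)
    anomalies = Z - Z.mean(axis=1, keepdims=True)   # remove the time mean at each location
    C = anomalies @ anomalies.T / Z.shape[1]        # M x M covariance, 1/N assumed
    eigvals, E = np.linalg.eigh(C)                  # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1]               # sort modes by variance, largest first
    eigvals, E = eigvals[order], E[:, order]
    P = E.T @ anomalies                             # project the data onto the EOFs
    return eigvals, E, P, anomalies

rng = np.random.default_rng(1)
Z = rng.standard_normal((5, 300))                   # made-up data: M = 5, N = 300
eigvals, E, P, anomalies = pca(Z)

# Properties stated on the slides, checked numerically:
reconstructs = np.allclose(E @ P, anomalies)                        # Z = E P
eofs_orthogonal = np.allclose(E.T @ E, np.eye(5))                   # EOFs are orthonormal
pcs_orthogonal = np.allclose(P @ P.T / 300, np.diag(eigvals))       # PCs are uncorrelated
variance_sums = np.isclose(eigvals.sum(), anomalies.var(axis=1).sum())
```

`np.linalg.eigh` is used because C is symmetric. It does not sort eigenvalues by decreasing variance, hence the explicit reordering.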
Examples of PCA Matlab demo
When using PCA as a filter
You need to figure out how many modes to retain.
In general, keep enough to explain most (e.g., 90%) of the total variance in the data set.
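One way to turn the 90% guideline into a number of modes is to accumulate the eigenvalues until the target fraction is reached. The sketch below assumes the eigenvalues have already been computed; the function name `modes_needed` and the example eigenvalues are made up.

```python
import numpy as np

def modes_needed(eigvals, fraction=0.9):
    """Smallest number of leading modes whose variance reaches the given fraction."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # largest first
    cumulative = np.cumsum(lam) / lam.sum()                 # cumulative variance fraction
    # Index of the first cumulative fraction >= target, converted to a count.
    k = int(np.searchsorted(cumulative, fraction)) + 1
    return min(k, lam.size)                                 # guard against round-off

eigvals = [6.0, 3.5, 0.2, 0.2, 0.1]    # hypothetical eigenvalue spectrum
k = modes_needed(eigvals, 0.9)         # two modes explain 95% here
```

The threshold itself is a judgment call. 90% is the example value from the slide, not a rule.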
When using PCA to find a special (physically meaningful) pattern
You need to make sure the eigenmodes are not noise and are distinct (not overlapping).
[Figure labels: "Expected slope due to noise"; "Only distinct eigenvalue/vector"]
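The slide does not say which test its figure used. One widely used check, swapped in here as an assumption, is the rule of thumb of North et al. (1982): the sampling error of an eigenvalue is roughly the eigenvalue times sqrt(2/N*), where N* is the number of independent samples. A mode is treated as distinct only if its distance to each neighboring eigenvalue exceeds that error. The function name `north_distinct` and the example numbers are hypothetical, and comparing each gap against the mode's own error bar is a simplification.

```python
import numpy as np

def north_distinct(eigvals, n_independent):
    """Flag eigenvalues separated from both neighbours by more than their sampling error."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]    # largest first
    err = lam * np.sqrt(2.0 / n_independent)                 # rule-of-thumb sampling error
    gaps = lam[:-1] - lam[1:]                                # spacing between neighbours
    gap_to_larger = np.concatenate(([np.inf], gaps))         # first mode has no larger neighbour
    gap_to_smaller = np.concatenate((gaps, [np.inf]))        # last mode has no smaller neighbour
    distinct = np.minimum(gap_to_larger, gap_to_smaller) > err
    return err, distinct

# Made-up spectrum: modes 2 and 3 are nearly degenerate.
err, distinct = north_distinct([10.0, 4.0, 3.8, 1.0], n_independent=100)
```

In this example the first and last modes pass, while the second and third are flagged as a likely mixed pair. Note that `n_independent` is usually smaller than the number of time steps when the data are autocorrelated.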
Caveats
The method tends to favor places where variance is large.
–Example: circulation vs geopotential
–Hence, EOF analysis on geopotential (dynamic height) would tend to favor midlatitudes compared to an EOF analysis of winds (currents).
Caveats (cont)
The technique works best when data are linearly related across space (because the eigenvalue decomposition is a linear decomposition of the covariance matrix).
When there are nonlinear relationships in space (which is almost always the case), you have to be very careful when you assign physical meaning to the eigenvectors.
In a paleo context, analogous troubles arise if the proxy index is not linearly related to the climate variable that you are reconstructing.
Caveats (cont)
Since all of the variance and covariance is contained within the eigenvectors, the EOFs tend to have large spatial structures.
–Since, in the atmosphere and ocean, large spatial structures also tend to be lower-frequency phenomena, the EOFs will tend to emphasize large-scale, lower-frequency phenomena.
Caveats (cont)
WARNING: When the eigenvalues are not well separated, the eigenanalysis will often scramble information between the modes, and one should be very cautious about interpreting these modes physically. In fact, in general, don't try to interpret them physically.
An example of such a problem can be seen using the supplied Matlab program.
Caveats (cont)
When are the EOFs true physical modes?
Define the true physical modes as the eigensolution to the linear dynamical system
$$ \frac{dx}{dt} = M x $$
where x is the state vector and M is a matrix that contains the physics and thermodynamics (here M is the dynamics matrix, not the number of locations used earlier).
The eigenvectors of M are the true physical modes of the system.
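To see why the eigenvectors of the dynamics matrix deserve the name "modes", consider the linear system dx/dt = M x and assume M is diagonalizable. The symbols v_k (eigenvectors), \sigma_k (eigenvalues) and a_k (amplitudes) are introduced here for illustration and do not come from the slides. Each eigenvector then evolves independently of the others, growing, decaying or oscillating at a rate set by its own eigenvalue:

```latex
M v_k = \sigma_k v_k,
\qquad
x(t) = \sum_k a_k \, e^{\sigma_k t} \, v_k,
\qquad
x(0) = \sum_k a_k v_k
```

The amplitudes a_k are fixed by the initial state. Nothing in this construction requires the v_k to be orthogonal to one another.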
Caveats (cont)
If M is not Hermitian, then the eigenvectors of M are, in general, not orthogonal. Hence, there cannot be a one-to-one relationship between the true modes and the EOF modes (which are always orthogonal) of the output from this system.
What would make M non-Hermitian? Anything that makes M not symmetric. For example:
–sheared mean flow
–coupling between the atmosphere and ocean (because they have different Rossby numbers).
Hence, the EOFs are almost never true modes of the dynamical system.
–They can be close, however, and so there are times when it is useful and appropriate to think of the two as being nearly synonymous.
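A two-by-two example makes the orthogonality point concrete. Both matrices below are made up for illustration; the off-diagonal 5.0 in `sheared` merely stands in for an asymmetric coupling such as shear and is not derived from any physical model.

```python
import numpy as np

def eigenvector_overlap(M):
    """Absolute inner product of the two unit-length eigenvectors of a 2 x 2 matrix."""
    _, V = np.linalg.eig(M)                 # columns of V are normalized eigenvectors
    return abs(np.vdot(V[:, 0], V[:, 1]))   # 0 means orthogonal, 1 means parallel

symmetric = np.array([[2.0, 1.0],
                      [1.0, 3.0]])          # Hermitian: eigenvectors are orthogonal
sheared = np.array([[-1.0, 5.0],
                    [0.0, -2.0]])           # hypothetical non-symmetric dynamics matrix

overlap_symmetric = eigenvector_overlap(symmetric)
overlap_sheared = eigenvector_overlap(sheared)
```

For `sheared` the eigenvectors are (1, 0) and (5, -1)/sqrt(26), so their overlap is 5/sqrt(26), about 0.98: the two true modes are nearly parallel. EOFs computed from the output of such a system would still be exactly orthogonal, so they could not both match these modes.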