An introduction to principal component analysis. Ralph Burton, IAS; Simon Vosper, Met Office; Stephen Mobbs, IAS.



Outline of talk
1. PCA: what the analysis can do
2. Simple examples of use
3. Application to radiosonde data: detection of inversions
4. Summary

INTRODUCTION: PCA
An objective method for determining underlying patterns in data. Many meteorological (usually climatological) applications. Determining the underlying structures is a very simple matter; interpreting the structures is the difficult part, and often the results have no obvious physical significance.

What you need: some data
[Table: columns Time, Temp. 1, Temp. 2, Temp. 3, RH (%), Cloud cover; the column headings are annotated "some variables"]

Mathematical aspects
1. Form the data matrix X containing your data; X is of size K x N (K stations, measurement points, grid points, etc.; N samples).
2. Calculate the covariance matrix S, based on X.
3. Solve Se = λe for the eigenvectors e and eigenvalues λ (K EOFs and eigenvalues).
4. Calculate the principal components by projecting the data onto the eigenvectors, P = Xe (N PC values per EOF; with X stored as K x N as above, the product is taken as Xᵀe).
Many off-the-shelf packages, e.g. IDL, have PCA routines.
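The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the routine used in the talk: the function name `pca` and the random test matrix are assumptions introduced here, and the data matrix follows the K x N layout described on the slide.

```python
import numpy as np

def pca(X):
    """PCA of a K x N data matrix (K variables, N samples).

    Returns eigenvalues (descending), EOFs (one per column) and
    principal components (K x N, one time series per row).
    """
    # Remove the time mean of each variable before forming the covariance.
    anomalies = X - X.mean(axis=1, keepdims=True)
    # np.cov treats rows as variables by default, so S is K x K.
    S = np.atleast_2d(np.cov(anomalies))
    # eigh is appropriate for the symmetric matrix S; it returns
    # eigenvalues in ascending order, so reverse to put EOF 1 first.
    eigenvalues, eofs = np.linalg.eigh(S)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eofs = eigenvalues[order], eofs[:, order]
    # Project the anomalies onto each EOF to get the PC time series.
    pcs = eofs.T @ anomalies
    return eigenvalues, eofs, pcs

# Hypothetical test data: 3 variables sampled 30 times.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 30))
eigenvalues, eofs, pcs = pca(X)
explained = eigenvalues / eigenvalues.sum()  # fraction of variance per EOF
```

The variance of each PC time series equals the corresponding eigenvalue, which is why the eigenvalues measure the overall importance of each EOF.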

PCA – what you get
PCA produces three types of analysis:
- The empirical orthogonal functions (EOFs): the patterns, or structures, in the data;
- The principal components (PCs): a time series, reflecting the relative contribution of each EOF at a given time;
- The eigenvalues: give the overall importance of each EOF.
N.B. The theory states that the EOFs must be orthogonal to each other, regardless of the underlying physical processes…

EOFs: Simple example
Daily maximum temperatures for November 1985 from Ilkley, Bradford and Jersey were subjected to two separate PC analyses:
I. Ilkley and Bradford
II. Ilkley and Jersey
This will reveal if there is any relationship between the temperatures at these locations for the selected times. Here, each PCA will have two variables sampled at thirty points.

[Figure: scatter plot; axes: temperature in Ilkley /degrees C, temperature in Bradford /degrees C]

[Figure: EOF directions E1 and E2 marked on the plot; axes: temperature in Ilkley /degrees C, temperature in Bradford /degrees C]
EOF 1 explains 99.4% of the total variance in the data.

[Figure: scatter plot; axes: temperature in Ilkley /degrees C, temperature in Jersey /degrees C]

[Figure: EOF directions E1 and E2 marked on the plot; axes: temperature in Ilkley /degrees C, temperature in Jersey /degrees C]
EOF 1 explains 83% of the total variance in the data.

PCA results
In this simple example, the EOFs may be interpreted as defining an alternative co-ordinate system in which to view the data:
EOF 1: reflects the maximum temperature in the Ilkley – Bradford/Jersey area;
EOF 2: variations (possibly random) departing from the overall regional value.
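The two-station picture can be reproduced with synthetic numbers. The sketch below does not use the November 1985 observations; the "regional" signal, the noise levels and the station names are all hypothetical, chosen only so that one pair of stations is tightly coupled and the other is not.

```python
import numpy as np

def explained_and_pcs(X):
    """Variance fractions and PCs for a K x N data matrix."""
    anomalies = X - X.mean(axis=1, keepdims=True)
    eigenvalues, eofs = np.linalg.eigh(np.cov(anomalies))
    # eigh returns ascending order; flip so EOF 1 comes first.
    eigenvalues, eofs = eigenvalues[::-1], eofs[:, ::-1]
    pcs = eofs.T @ anomalies
    return eigenvalues / eigenvalues.sum(), pcs

# Synthetic data (assumption): 30 daily values sharing a regional signal.
rng = np.random.default_rng(42)
regional = rng.normal(10.0, 3.0, size=30)
station_a = regional + rng.normal(0.0, 0.2, size=30)
station_near = regional + rng.normal(0.0, 0.2, size=30)      # closely coupled
station_far = 0.5 * regional + rng.normal(0.0, 2.0, size=30)  # weakly coupled

frac_near, pcs_near = explained_and_pcs(np.vstack([station_a, station_near]))
frac_far, pcs_far = explained_and_pcs(np.vstack([station_a, station_far]))

# In the EOF co-ordinate system the two PCs are uncorrelated.
pc_covariance = np.cov(pcs_near)[0, 1]
```

With these settings EOF 1 takes almost all of the variance for the closely coupled pair and a smaller share for the weakly coupled pair, mirroring the contrast between the two analyses in the slides.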

PC time series
Principal components are time series which represent how much each EOF contributes. Thus:
- A relatively large value of PC i implies that EOF i is dominant at that point;
- A relatively low value of PC i implies that EOF i is not contributing much to the structure.

Data compression
Consider a time series of pressures, measured at three points; 9 samples. In this idealised example, EOF1 accounts for 100% of the variance in the data.
[Figure panels: pressure /hPa against distance /km; PC1 score against sample number; the EOF]
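A small numerical sketch of the compression idea follows. The pressure values are invented for illustration (they are not the values plotted on the slide): every sample is the same spatial pattern scaled by a different amplitude, so a single EOF and its PC reproduce the full data set.

```python
import numpy as np

# Hypothetical idealised data: 3 measurement points, 9 samples.
pattern = np.array([1.0, 2.0, 1.0])                      # spatial shape
amplitude = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0, -2.0, -1.0, 0.0])
mean_pressure = np.array([1000.0, 1002.0, 1001.0])       # hPa
X = mean_pressure[:, None] + np.outer(pattern, amplitude)  # 3 x 9

# PCA on the anomalies about the time mean.
time_mean = X.mean(axis=1, keepdims=True)
anomalies = X - time_mean
eigenvalues, eofs = np.linalg.eigh(np.cov(anomalies))
eof1 = eofs[:, -1]            # eigh sorts ascending, so the last is EOF 1
pc1 = eof1 @ anomalies        # one PC1 score per sample
explained = eigenvalues[-1] / eigenvalues.sum()

# Rebuild the data from the time mean, EOF 1 and PC 1 only.
reconstructed = time_mean + np.outer(eof1, pc1)
```

Here the 27 original values are recovered from 3 mean values, 3 EOF components and 9 PC scores.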

Which EOFs are significant? - eigenvalues
An initial problem is to determine the “signal” from the “noise”; not all EOFs are significant. The most widely used and robust method is to compare the PCA of your data with a PCA of random data; the so-called Rule N.
Rule N
1. Substitute randomly generated data for your data;
2. Perform PCA on this random data; retain eigenvalues;
3. Repeat steps 1-2 a large number (of order 1000) of times, a “Monte-Carlo” (MC) simulation;
4. Calculate the mean eigenvalues from the above;
5. Compare your data eigenvalues with the Monte-Carlo eigenvalues.

Example: national lottery results.
Are there patterns in lottery results?… A PCA of two years' worth of lottery results was performed (not including the bonus ball):
[Figure: EOF 1]
EOF 1 explains 23% of the variance in the data! Pick: lowest value, highest value, then 4 lower values… It could be you… But…

Rule N states that for a PC to be significant, the corresponding eigenvalue must be higher than the 95% confidence limit on the MC simulations. A set of 1000 Monte-Carlo simulations was compared with the lottery data: unfortunately, the patterns in lottery data cannot be distinguished from noise.
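The Rule N procedure can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the talk: the random data are Gaussian white noise of the same shape as the data matrix, eigenvalues are compared as fractions of the total variance, and the 95% limit is taken as the 95th percentile of the Monte-Carlo eigenvalues at each rank. The function names and the test data are hypothetical.

```python
import numpy as np

def eigenvalue_fractions(X):
    """Eigenvalues of the covariance of a K x N matrix, descending,
    expressed as fractions of the total variance."""
    values = np.linalg.eigvalsh(np.atleast_2d(np.cov(X)))[::-1]
    return values / values.sum()

def rule_n(X, n_trials=1000, percentile=95.0, seed=0):
    """Compare the data eigenvalues with those of random data."""
    rng = np.random.default_rng(seed)
    K, N = X.shape
    data_fractions = eigenvalue_fractions(X)
    mc = np.empty((n_trials, K))
    for trial in range(n_trials):
        # Steps 1-2: substitute white noise and retain its eigenvalues.
        mc[trial] = eigenvalue_fractions(rng.standard_normal((K, N)))
    # Per-rank confidence limit from the Monte-Carlo distribution.
    threshold = np.percentile(mc, percentile, axis=0)
    return data_fractions, threshold, data_fractions > threshold

# Hypothetical data: one shared signal at 6 points plus unit noise.
rng = np.random.default_rng(7)
signal = 3.0 * rng.standard_normal(100)
X = np.outer(np.ones(6), signal) + rng.standard_normal((6, 100))
data_fractions, threshold, significant = rule_n(X)
```

For this constructed example only the first eigenvalue exceeds its limit; the remaining ones fall below the noise level and would be discarded.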

More typically…
[Figure: eigenvalue (e-value) against PC number; panel annotations: "Keep the first two eigenvalues" and "Keep the first three eigenvalues"]

Thus, we must be very careful in interpreting PCA results:
- Are the results significant (in the sense just described)?
- Can the results be interpreted in a physical manner?

Application: inversion detecting
Inversions are thought to play a crucial part in the formation of rotor clouds on the Falkland Islands. Thus, an algorithm for detecting inversions is desirable. However, it is actually quite difficult to construct a robust algorithm which works for all inversions.
[Figure: four sketches of temperature against height, with labels H1, H2, T1, T2 and "??"; annotated "Easy…" and "Not easy…"]

PCA was applied to radiosonde data from Mount Pleasant Airport (MPA), Falkland Islands. The PCA allows the dominant thermal structures to be revealed objectively; no algorithm is used to estimate where the inversion starts/stops etc. A series of 499 ascents was used. The lowest 2 km of each profile was selected.
[Figure: temperature against height]
[Figure: orography in vicinity of MPA]

Physical interpretation
- The first EOF reflects the strength of the inversion; a higher PC score will imply a stronger inversion.
- EOF2 acts to change the vertical location of the inversion.

[Figure: PC1 score against time, showing peaks in the time series]

Ground observations at the 11 events

Event #   Comments
1         direction highly variable; gusts up to 40 kts
2         gusts up to 45 kts
3         direction variable, gusts up to 30 kts
4         N/A
5         gusts up to 30 kts
6         direction highly variable; gusts up to 40 kts
7         gusts up to 65 kts
8         gusts up to 35 kts
9         gusts up to 30 kts
10        gusts up to 60 kts
11        N/A

[Figure: anemograph trace for time 1; panels: direction, speed]

[Figure: anemograph trace for time 7; panels: direction, speed; annotation: 60 kts]

3dVOM Measurements Event no. 1: 09/02/01

3dVOM Measurements Event no. 2: 26/02/01

3dVOM Measurements Event no. 3: 30/03/01

3dVOM Measurements Event no. 4: 10/04/01

3dVOM Measurements Event no. 5: 06/05/01

3dVOM Measurements Event no. 6: 27/06/01

3dVOM Measurements Event no. 7: 20/08/01

3dVOM Measurements Event no. 8: 30/09/01

3dVOM Measurements Event no. 9: 06/10/01

3dVOM Measurements Event no. 10: 17/10/01

It appears that a high PC1 score, coupled with a northerly upstream wind direction, occurs during severe weather at the ground, as reflected in both the model and the observations.

Application to nowcasting
It has been seen that high PC1 scores appear to be related to what is going on at ground level, in terms of wind at least. Can a “new” ascent be assimilated into the matrix to determine its significance?
[Figure: temperature against height; solid line: high PC1 score (event 7); dashed line: very low PC1 score]

To test the validity of this approach, append a week’s worth of ascents with no inversion, followed by the strong inversion. As can be seen, the time series gives a peak when the inversion is present.
[Figure: PC1 score against date]
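The append-and-recompute test can be imitated with synthetic profiles. Everything numerical below is an assumption made for illustration: the background lapse rate, the Gaussian bump standing in for an inversion, the noise level and the archive of 50 ascents are hypothetical and are not the MPA radiosonde data. Rows are ascents here (N x K), the transpose of the layout used earlier in the talk.

```python
import numpy as np

heights = np.linspace(0.0, 2000.0, 41)               # lowest 2 km, in metres
background = 10.0 - 0.0065 * heights                 # hypothetical profile, deg C
bump = np.exp(-((heights - 1000.0) / 200.0) ** 2)    # hypothetical inversion shape

def ascent(strength, rng):
    """One synthetic temperature profile with a given inversion strength."""
    noise = rng.normal(scale=0.1, size=heights.size)
    return background + strength * bump + noise

rng = np.random.default_rng(1)
# Existing archive: ascents with a range of inversion strengths.
archive = [ascent(s, rng) for s in rng.uniform(0.0, 5.0, size=50)]
# Appended week: seven ascents with no inversion, then one strong inversion.
new_week = [ascent(0.0, rng) for _ in range(7)] + [ascent(6.0, rng)]

X = np.array(archive + new_week)                     # one row per ascent
anomalies = X - X.mean(axis=0)
eigenvalues, eofs = np.linalg.eigh(np.cov(anomalies, rowvar=False))
eof1 = eofs[:, -1]                                   # largest eigenvalue is last
# The sign of an eigenvector is arbitrary; orient EOF 1 so that a
# stronger inversion gives a higher PC1 score.
if eof1 @ bump < 0:
    eof1 = -eof1
pc1 = anomalies @ eof1
appended_scores = pc1[-8:]
```

In this sketch the PC1 score of the final appended ascent stands well above the seven preceding scores, which is the kind of peak the slide describes.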

Application to forecasting
Can a similar approach be used to predict extreme events? Answer: use UM forecast profiles instead of sonde profiles.
[Figure: sonde and UM forecast profiles for event 7]
The sonde and forecast profiles show good agreement here. N.B. the resolution of the UM profile is lower than that for the sonde.

A set of UM forecast profiles was subjected to a PCA; the EOFs (not shown) are similar to those for the sonde profiles. The PCs are shown below.
[Figure: PC score against time; solid line: sonde; dashed line: UM]

Result of the intercomparison
- The first PCs for the sonde and UM profiles show good agreement;
- The first PC for sonde ascents can be related to severe weather at the ground;
- The first PC for UM profiles may be used in a PCA to deduce severe weather.

Summary
PCA has been successfully applied to a series of radiosonde ascents:
- The first EOF reflects the strength of the inversion;
- The time series of PCs shows a series of distinct peaks (or “events”);
- During most of these events, both modelling studies and observations show severe weather at the ground;
- …application to forecasting.