Principal Components: A Conceptual Introduction

Presentation transcript:

Principal Components: A Conceptual Introduction
Simon Mason
International Research Institute for Climate Prediction
The Earth Institute of Columbia University
Linking Science to Society

Linking Science to Sport!

What makes a good soccer team?
Everybody(?) has their favourite soccer team. But which is the best team, and how can we determine that it is the best? We usually justify our choice of best team by describing it in rather vague ways such as “good at scoring goals”, “excellent defensive line”, “fair players”. We need some quantifiable metrics rather than vague descriptions.

Soccer-Playing Metrics
Metrics can be defined for measuring the quality of a soccer team objectively. Each metric could be measured over a season or a number of seasons.

Soccer-Playing Metrics
- Frequency of home wins (home wins).
- Frequency of home losses (home losses).
- Frequency of home goals scored (home for).
- Frequency of home goals conceded (home against).
- Frequency of away wins (away wins).
- Frequency of away losses (away losses).
- Frequency of away goals scored (away for).
- Frequency of away goals conceded (away against).
- Number of bookings (bookings).
- Average attendance (attendance).

English Premiership Teams 2003/04
Arsenal, Aston Villa, Birmingham, Blackburn Rovers, Bolton Wanderers, Charlton Athletic, Chelsea, Everton, Fulham, Leeds United, Leicester City, Liverpool, Manchester City, Manchester United, Middlesbrough, Newcastle United, Portsmouth, Southampton, Tottenham Hotspur, Wolverhampton Wanderers

The Premiership Metric
In the Premiership the teams are ranked according to the number of games they win and draw, and then by goal difference if there are ties.
[Equation not recovered from the slide.]
I.e., a weighted sum of the metrics is used to rank the teams.
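
The idea of ranking by a weighted sum can be sketched in a few lines of Python. The sketch assumes the standard league scoring of three points for a win and one for a draw, which the surviving slide text does not spell out. The team names and records are invented for illustration, and the goal-difference tie-break is left out.

```python
# Invented season records for three hypothetical teams (not real results).
records = {
    "Team A": {"wins": 20, "draws": 10, "losses": 8},
    "Team B": {"wins": 18, "draws": 12, "losses": 8},
    "Team C": {"wins": 9, "draws": 6, "losses": 23},
}

# Assumed league weights: 3 points per win, 1 per draw, 0 per loss.
WEIGHTS = {"wins": 3, "draws": 1, "losses": 0}


def league_points(record):
    """Return the weighted sum of a team's results."""
    return sum(WEIGHTS[key] * record[key] for key in WEIGHTS)


points = {team: league_points(record) for team, record in records.items()}

# Rank the teams from most to fewest points (tie-breaks are ignored here).
ranking = sorted(points, key=points.get, reverse=True)

for team in ranking:
    print(team, points[team])
```

The weights here are fixed by the league rules. Principal components instead derive the weights from the data.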

A General Metric
A good team should score highly on all the metrics. (Losses, against and bookings can be multiplied by -1 so that high scores indicate good play.) If we can combine the ten original metrics into one new metric that captures as much of their information as possible, we will have a general metric that we can use as an overall measure of the quality of a soccer team.

Variance
The differences between the teams on the various metrics provide the information we can use to distinguish good teams from bad ones. On some metrics (e.g., attendance) the differences are large, but on others (e.g., home losses) most teams score about the same. The variance of each metric tells us how much information that metric provides for distinguishing the teams. The total information available is the sum of the variances of all the metrics.
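
As a rough illustration of how one metric can dominate the total variance, the sketch below computes the variance of three metrics and each metric's share of the total. The numbers are invented for four hypothetical teams and are not the 2003/04 data. Population variance is used, which is an arbitrary choice for this sketch.

```python
from statistics import pvariance

# Invented values for four hypothetical teams (not real data).
metrics = {
    "home wins": [15, 12, 9, 6],
    "bookings": [40, 55, 60, 45],
    "attendance": [38000, 67000, 30000, 25000],
}

# Variance of each metric: how much the teams differ on it.
variances = {name: pvariance(values) for name, values in metrics.items()}

# Total information available: the sum of the individual variances.
total_variance = sum(variances.values())

# Fraction of the total variance contributed by each metric.
share = {name: var / total_variance for name, var in variances.items()}

for name in metrics:
    print(f"{name}: variance = {variances[name]:.2f}, share = {share[name]:.6f}")
```

Because attendance is measured in tens of thousands, its variance swamps the others unless the metrics are rescaled.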

Since virtually all of the total variance is contributed by attendance, teams need to perform well on this metric to score well overall. Alternatively, the metrics could be standardized to give them equal weight.

Standardize?
If we want to give each metric the same weight, we should standardize the data first. Otherwise a team that performs poorly on a metric with high variance is likely to score badly overall: it will be difficult to make up the large deficit from metrics on which teams tend to score similarly. The variance of each standardized metric is 1.0, so the total standardized variance is 10.0 (the number of metrics).
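
A minimal sketch of standardization, using invented numbers for four hypothetical teams and three metrics. Each metric is centred on its mean and divided by its standard deviation (the population form is assumed here), so every standardized metric has variance 1.0 and the total equals the number of metrics.

```python
from statistics import mean, pstdev, pvariance


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


# Invented values for four hypothetical teams (not real data).
metrics = {
    "home wins": [15, 12, 9, 6],
    "bookings": [40, 55, 60, 45],
    "attendance": [38000, 67000, 30000, 25000],
}

standardized = {name: standardize(values) for name, values in metrics.items()}

# Every standardized metric has variance 1.0 (up to rounding error) ...
standardized_variances = {name: pvariance(z) for name, z in standardized.items()}

# ... so the total variance equals the number of metrics (3 in this sketch).
total_standardized_variance = sum(standardized_variances.values())

print(standardized_variances)
print(total_standardized_variance)
```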

The Average
The simplest combined score is the average of the scores (or standardized scores) on each metric. But information is lost: the variance of the average scores is only about 0.59, compared to the total variance of 10.0.
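
The loss of information from averaging can be checked on a small example. The sketch below averages three standardized metrics for four hypothetical teams and compares the variance of the average with the total variance. The data are invented (bookings are already multiplied by -1 so that high is good), so the result will not match the 0.59 quoted on the slide.

```python
from statistics import mean, pstdev, pvariance


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


# Invented values for four hypothetical teams (not real data).
metrics = {
    "home for": [40, 35, 25, 20],
    "away for": [33, 29, 22, 16],
    "bookings (negated)": [-40, -55, -60, -45],
}

# One list of standardized scores per metric.
z = [standardize(values) for values in metrics.values()]

# Average the standardized scores team by team.
average_score = [mean(team_scores) for team_scores in zip(*z)]

variance_of_average = pvariance(average_score)
total_variance = sum(pvariance(column) for column in z)

print(f"variance of the average: {variance_of_average:.3f}")
print(f"total variance:          {total_variance:.3f}")
```

The average of standardized metrics can never have a variance above 1.0, however many metrics are combined, which is why so much of the total is lost.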

The Average
Also, the simple average is not very informative: if we ask why a team is good, the only way to answer is to refer to all ten metrics, which is inefficient for two reasons:
- there are too many metrics to refer to;
- some of the metrics are very similar, so if we know that a team scored well on one metric we can assume that it probably scored well on a similar metric …

Correlations Between the Metrics
Some of the metrics seem to measure similar characteristics. For example, home for and away for both relate to the team’s goal-scoring achievements. Correlations between the metrics can be used to tell us whether the metrics are measuring similar aspects of the quality of a soccer team.

Correlations Between the Metrics
[Table: correlation matrix of the ten metrics.]
Sum of diagonals = 10.
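
A correlation matrix like the one on this slide can be built as follows. This is a sketch on invented data for four hypothetical teams and three metrics, so the diagonal sums to 3 rather than 10. The helper `correlation` is written out by hand for illustration.

```python
from statistics import mean, pstdev


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = mean(x), mean(y)
    covariance = mean([(a - mean_x) * (b - mean_y) for a, b in zip(x, y)])
    return covariance / (pstdev(x) * pstdev(y))


# Invented values for four hypothetical teams (not real data).
metrics = {
    "home for": [40, 35, 25, 20],
    "away for": [33, 29, 22, 16],
    "bookings": [40, 55, 60, 45],
}

names = list(metrics)
matrix = [[correlation(metrics[a], metrics[b]) for b in names] for a in names]

# Each metric correlates perfectly with itself, so every diagonal entry is 1
# and the diagonal sums to the number of metrics.
diagonal_sum = sum(matrix[i][i] for i in range(len(names)))

for name, row in zip(names, matrix):
    print(name, [round(value, 3) for value in row])
print("sum of diagonals:", round(diagonal_sum, 6))
```

In this invented data, home for and away for are strongly correlated, the kind of overlap between metrics that the slide describes.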

Independent Metrics
Positive correlations between the metrics show that they are measuring similar aspects of the quality of a soccer team. We would like to combine the metrics somehow so that common aspects are measured on a single metric, and each combination measures a different aspect of the quality of a soccer team (i.e., the correlations between these new metrics are zero). The single metric must have high variance so that teams can be distinguished effectively.

Independent Metrics
Objectives:
- the new metrics are uncorrelated;
- each metric in turn summarizes as much information as possible (its variance is maximized);
- there is no loss of information.
New metrics that meet these objectives are called principal components.

Principal Components
Principal components are weighted sums of the original metrics. Weighted sums are like weighted averages, except that the weights do not have to add up to 1.0. Instead, with principal components the squares of the weights add up to 1.0 (note 1). The weights are known as eigenvectors, and are frequently referred to as loadings. The weighted sums are the scores on the new metrics. The new metrics are called principal components.
Note 1: A few authors draw the following distinction: for EOFs the sum of the squared weights is 1; for principal components the sum is equal to the eigenvalue.
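
The difference between weights that sum to 1 and weights whose squares sum to 1 is easiest to see with only two standardized metrics, where the loadings are known in closed form. The sketch below assumes a correlation of 0.8 between the two metrics (an illustrative value, not taken from the slides) and checks that the two loading vectors are unit-length, mutually orthogonal eigenvectors of the correlation matrix.

```python
import math

# Assumed correlation between two standardized metrics (illustrative value).
r = 0.8
R = [[1.0, r], [r, 1.0]]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, v) for row in m]


# Weighted average: the weights themselves add up to 1.
average_weights = [0.5, 0.5]

# Principal components: the squared weights add up to 1.
w1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # loadings of component 1
w2 = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # loadings of component 2

# Each loading vector is an eigenvector of R, i.e. R w = eigenvalue * w.
eigenvalue1 = dot(w1, matvec(R, w1))  # equals 1 + r
eigenvalue2 = dot(w2, matvec(R, w2))  # equals 1 - r

print("sum of average weights:", sum(average_weights))
print("sum of squared loadings:", dot(w1, w1), dot(w2, w2))
print("eigenvalues:", eigenvalue1, eigenvalue2)
```

With more than two metrics the loadings generally have no such simple form and are found numerically.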

Covariances Between the Principal Components
[Table: covariance matrix of the ten principal components.]
Sum of diagonals = 10.

Eigenvalues
The variances of the principal components are called eigenvalues. The total variance explained by all the principal components is the same as that of the original standardized metrics, so no information is lost. But most of the total variance is explained by only a few components. Compare the variance of the average of the standardized scores (0.59). Principal components with variances > 1.0 carry more information than any single original standardized metric.
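
These properties can be checked numerically. The sketch below assumes NumPy is available and uses synthetic stand-in data (20 hypothetical teams, 10 metrics that share a common "quality" factor plus noise), not the Premiership data. It obtains the components from the eigen-decomposition of the correlation matrix, which is one common way to compute them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_metrics = 20, 10

# Synthetic stand-in data: a shared quality factor plus independent noise.
quality = rng.normal(size=(n_teams, 1))
data = quality + 0.7 * rng.normal(size=(n_teams, n_metrics))

# Standardize each metric (population standard deviation).
z = (data - data.mean(axis=0)) / data.std(axis=0)

# Correlation matrix of the standardized metrics.
corr = (z.T @ z) / n_teams

# eigh returns eigenvalues in ascending order; reverse to put the largest first.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Scores on the new metrics, and the covariances between them.
scores = z @ eigenvectors
score_cov = (scores.T @ scores) / n_teams

explained = eigenvalues / eigenvalues.sum()
variance_of_average = z.mean(axis=1).var()

print("eigenvalues:", np.round(eigenvalues, 3))
print("sum of eigenvalues:", round(float(eigenvalues.sum()), 6))
print("share explained by component 1:", round(float(explained[0]), 3))
print("variance of the simple average:", round(float(variance_of_average), 3))
```

The off-diagonal entries of `score_cov` are zero up to rounding, the diagonal holds the eigenvalues, and the diagonal sums to the number of metrics.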

Soccer Team Principal Component 1

Soccer Team Principal Component 1
We can obtain a score for a team by calculating the weighted sum of its scores on the 10 original metrics. We can get a score for each team …
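
A minimal sketch of computing team scores as weighted sums, reduced to two standardized metrics so that the loadings can be written down directly (1/sqrt(2) for each metric on component 1, and +1/sqrt(2) and -1/sqrt(2) on component 2). The team names and goal counts are invented, and the loadings are those of the two-metric case, not the ones from the slides.

```python
import math
from statistics import mean, pstdev, pvariance


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    centre, spread = mean(values), pstdev(values)
    return [(value - centre) / spread for value in values]


# Invented goal counts for four hypothetical teams (not real data).
teams = ["Team A", "Team B", "Team C", "Team D"]
home_for = standardize([40, 35, 25, 20])
away_for = standardize([33, 29, 22, 16])

weight = 1 / math.sqrt(2)

# Component 1: weighted sum with equal loadings (overall goal scoring).
pc1 = [weight * h + weight * a for h, a in zip(home_for, away_for)]

# Component 2: weighted sum with opposite loadings (home versus away).
pc2 = [weight * h - weight * a for h, a in zip(home_for, away_for)]

scores = dict(zip(teams, zip(pc1, pc2)))
best_team = max(teams, key=lambda team: scores[team][0])

# The two components are uncorrelated and together keep all the variance.
covariance = mean([x * y for x, y in zip(pc1, pc2)])
total_variance = pvariance(pc1) + pvariance(pc2)

for team in teams:
    print(team, round(scores[team][0], 3), round(scores[team][1], 3))
```

Here the first component has a much larger variance than the second, so it separates the teams far better than either original metric alone.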

Soccer Team Principal Component 1

Soccer Team Principal Component 1
The score tells us whether the team outperforms its opponents while playing fairly and drawing large crowds.

Soccer Team Principal Component 2

Soccer Team Principal Component 2
The score tells us whether the team plays better at home or away.