1
Dimension Reduction via PCA (Principal Component Analysis)
2
Motivation
Derivation/calculation of PCA
PCA in practice
Things to note
Examples in R and Python
3
Motivation
In "big data" we often discuss how to handle a large n (number of rows). However, there are often also issues with a large p (number of variables/parameters).
In addition, variables are often correlated, which makes some of them partly redundant. For example, web traffic data has both visits and page views, and the two are usually correlated.
We would like a way to reduce p so that we can analyze a smaller, uncorrelated set of variables.
We can then use this smaller data set in other algorithms such as linear regression, clustering, etc.
4
Geometric Introduction
We begin with a sample dataset of 20 observations.
We have two dimensions, or variables, called x and y.
x and y are correlated.
We would like to project the data onto one dimension in an intelligent way.
Note: y is not a dependent variable here. Think of x and y as something like height and weight, which we would use to predict basketball performance.
5
Geometric Introduction
Projecting raw data onto the x-axis
6
Geometric Introduction
Projecting the raw data onto the y-axis
7
Geometric Introduction
We would like to rotate and stretch our data with a linear combination so that the new variables are uncorrelated and "stretched" as much as possible.
Rotating the data, we get the new points shown in red.
The data in red have more separation among the points.
8
Geometric Introduction
Our new X and Y are now uncorrelated.
9
Geometric Introduction
Now if we project the rotated data onto one dimension, we still retain a lot of the variation among the points.
A similar one-dimensional projection of the raw data does not preserve as much variation.
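The rotation described in these geometric slides is exactly the PCA transformation applied to two variables. The snippet below is an illustrative sketch (not the code from the original slides, and the data are made up): it builds a small correlated dataset, rotates it with the eigenvectors of its covariance matrix, and compares how much variance a one-dimensional projection keeps before and after the rotation.

import numpy as np

rng = np.random.default_rng(0)

# 20 correlated observations of two variables (stand-ins for "x" and "y")
x = rng.normal(size=20)
y = 0.8 * x + rng.normal(scale=0.3, size=20)
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)               # center the data

# Eigenvectors of the covariance matrix define the rotation
S = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
rotated = Xc @ eigvecs

# Variance kept when projecting onto a single dimension
print("var of raw x projection:  ", Xc[:, 0].var(ddof=1))
print("var of raw y projection:  ", Xc[:, 1].var(ddof=1))
print("var of best rotated axis: ", rotated[:, -1].var(ddof=1))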
10
Algebraic motivation/explanation
We want to find a linear combination, a, that maximizes the variance of the projection a'x.
Maximize var(a'x) = a'Sa, where S is the covariance matrix of x.
Since a'Sa can be made arbitrarily large simply by scaling a, we normalize by a'a and maximize λ, where
λ = (a'Sa) / (a'a)
λ a'a = a'Sa
a'Sa − λ a'a = 0
(S − λI) a = 0
This implies a is an eigenvector of S with eigenvalue λ.
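To make the eigenvector result concrete, here is a small numerical check (an illustrative sketch with random data, not part of the original slides): for any unit vector a, the projected variance a'Sa never exceeds the largest eigenvalue of S, and that maximum is attained by the corresponding eigenvector.

import numpy as np

rng = np.random.default_rng(1)

# A sample covariance matrix S from some hypothetical correlated data
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)
a_best = eigvecs[:, -1]               # eigenvector with the largest eigenvalue

# Compare a'Sa for the top eigenvector vs. a random unit vector
a_rand = rng.normal(size=3)
a_rand /= np.linalg.norm(a_rand)

print("largest eigenvalue lambda: ", eigvals[-1])
print("a'Sa for top eigenvector:  ", a_best @ S @ a_best)   # equals lambda
print("a'Sa for a random unit a:  ", a_rand @ S @ a_rand)   # <= lambda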
11
Algebraic motivation/explanation (cont)
For each component k = 1, ..., p, we continue to maximize the variance, subject to a_k being orthogonal to the previous a's:
y_i1 = a_1' x_i
y_i2 = a_2' x_i
...
y_ip = a_p' x_i
This gives us Y (n x p) = X (n x p) * A (p x p), where the columns of Y are the transformed variables, or "scores".
We can use just a few columns of Y to approximate X.
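Below is an illustrative sketch (with made-up data, not from the slides) of forming the full score matrix Y = XA and checking that the scores are uncorrelated, i.e. that their covariance matrix is diagonal.

import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))   # correlated columns
Xc = X - X.mean(axis=0)

S = np.cov(Xc, rowvar=False)
eigvals, A = np.linalg.eigh(S)
A = A[:, ::-1]                       # order columns by decreasing eigenvalue

Y = Xc @ A                           # scores: Y (n x p) = X (n x p) * A (p x p)

# Off-diagonal covariances of the scores are (numerically) zero
print(np.round(np.cov(Y, rowvar=False), 6))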
12
Algebraic motivation/explanation (cont)
If we take a subset of A, we can work in a lower dimension:
Y (n x 1) = X (n x p) * A (p x 1)
And since A is orthogonal, A' = A^-1, so the reduced scores approximately reconstruct the data:
Y (n x 1) * A' (1 x p) ≈ X (n x p)
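As a sketch of the reconstruction idea (again with hypothetical data, not from the slides), keeping only the first column of A gives a one-dimensional representation whose back-projection approximates the original centered data:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

eigvals, A = np.linalg.eigh(np.cov(Xc, rowvar=False))
A = A[:, ::-1]                       # decreasing eigenvalue order

A1 = A[:, :1]                        # keep one component: A (p x 1)
Y1 = Xc @ A1                         # Y (n x 1)
X_approx = Y1 @ A1.T                 # Y (n x 1) * A' (1 x p) ~ X (n x p)

print("relative reconstruction error:",
      np.linalg.norm(Xc - X_approx) / np.linalg.norm(Xc))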
13
PCA in Practice
Run the principal component analysis algorithm.
Determine the number of components to use. Three common ways to do this:
- Retain enough components to account for k% of the total variability (where k is, say, 80 or 90). The proportion of variability accounted for by the i-th component is λ_i / Σ λ_i.
- Retain components whose variance (eigenvalue) is greater than the average eigenvalue (Σ λ_i / p).
- Use a scree plot to find the natural break between the "large/important" components and the "small/unimportant" components.
Note: none of these three methods has a theoretical justification, so use whichever makes the most sense for your data.
Use the "scores" for the chosen number of dimensions in your new algorithm.
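The three rules of thumb above can be read directly off the eigenvalues. The sketch below is illustrative only (it uses scikit-learn and matplotlib with hypothetical data, which the original slides may not have used): it prints the cumulative proportion of variance, flags components above the average eigenvalue, and draws a scree plot.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # hypothetical data

pca = PCA()
pca.fit(StandardScaler().fit_transform(X))

eigvals = pca.explained_variance_
# Rule 1: cumulative percentage of total variability
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_))
# Rule 2: components whose eigenvalue exceeds the average eigenvalue
print("above-average eigenvalue:", eigvals > eigvals.mean())
# Rule 3: scree plot -- look for the natural break ("elbow")
plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.xlabel("component"); plt.ylabel("eigenvalue"); plt.title("Scree plot")
plt.show()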
14
Things to Note
PCA is NOT scale invariant!
Changing the units of each column will change the results of PCA, because parameters with high variance influence the principal components more. A good practice is to center and scale your variables (prcomp in R does not scale automatically).
Because PCA is based on orthogonal vectors, the sign of each vector is arbitrary: you could run PCA on the same data set and get the same numbers with the signs flipped. This does not really change anything, but it is good to know.
One drawback of PCA is that we usually lose the interpretability of the original variables.
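To illustrate the scale-sensitivity point, here is a small sketch (hypothetical data, not from the slides): rescaling one column changes the leading principal component unless the variables are standardized first.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000             # same quantity measured in different units

# Without scaling, the rescaled column dominates the first component
print(PCA().fit(X).components_[0])
print(PCA().fit(X_rescaled).components_[0])

# After standardizing, the two agree (PCA signs are arbitrary in general)
print(PCA().fit(StandardScaler().fit_transform(X)).components_[0])
print(PCA().fit(StandardScaler().fit_transform(X_rescaled)).components_[0])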
15
Examples
See code (R and Python).
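The code referenced here is not included in this transcript. As a stand-in, the following is a minimal Python sketch of a typical PCA workflow using scikit-learn (the R version would follow the same steps with prcomp, with centering and scaling enabled); the data and the 90% cutoff are assumptions for illustration.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # hypothetical correlated data

# 1. Center and scale, then run PCA
Xs = StandardScaler().fit_transform(X)
pca = PCA()
scores = pca.fit_transform(Xs)

# 2. Choose how many components to keep (here: enough for 90% of the variance)
k = int(np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.90) + 1)
print("keeping", k, "components")

# 3. Use the first k score columns in a downstream model (regression, clustering, ...)
X_reduced = scores[:, :k]
print(X_reduced.shape)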