Xuhua Xia Slide 1 Principal Components Analysis Objectives: –Understand the principles of principal components analysis (PCA) –Recognize conditions under.

Presentation transcript:

Xuhua Xia Slide 1 Principal Components Analysis Objectives: –Understand the principles of principal components analysis (PCA) –Recognize conditions under which PCA may be useful –Use the SAS procedure PRINCOMP to perform a principal components analysis and interpret PRINCOMP output.

Xuhua Xia Slide 2 Typical Form of Data A data set in an 8x3 matrix. The rows could be species and the columns sampling sites. X = [matrix not recovered] A matrix is often referred to as an n x p matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, so it is an 8x3 matrix. A variance-covariance matrix has n = p and is called a square matrix of order n.

Xuhua Xia Slide 3 What are Principal Components? Principal components are linear combinations of the observed variables: $Y = b_1 X_1 + b_2 X_2 + \dots + b_n X_n$. The coefficients of these principal components are chosen to meet three criteria. What are the three criteria?

Xuhua Xia Slide 4 What are Principal Components? The three criteria: –There are exactly p principal components (PCs), each being a linear combination of the observed variables; –The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated); –The components are extracted in order of decreasing variance.

Xuhua Xia Slide 5 A Simple Data Set [Table not recovered: five (X, Y) observations, with the correlation matrix of X and Y (all entries equal to 1) and the covariance matrix of X and Y.]

Xuhua Xia Slide 6 General Patterns The total variance is 3 (= 1 + 2). The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. PCA is a dimension-reduction technique. What would happen if we apply PCA to the data?

Xuhua Xia Slide 7 Graphic PCA [Figure: plot of Y against X]

Xuhua Xia Slide 8 SAS Program (the data lines between "cards;" and the terminating semicolon were not recovered)

    data pca;
      input x y;
      cards;
    ;
    proc princomp cov out=pcscore;
    proc print;
      var prin1 prin2;
    proc princomp data=pca out=pcscore;
    proc print;
      var prin1 prin2;
    run;

The COV option in the first PROC PRINCOMP step requests that the PCA be carried out on the covariance matrix rather than the correlation matrix. Without the COV option, as in the second PROC PRINCOMP step, PCA is carried out on the correlation matrix.
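To make the difference between the two matrices concrete, here is a minimal sketch in Python rather than SAS. The data values and the helper names (`covariance`, `correlation`) are hypothetical and are not taken from the slides; the sketch only shows that a variable measured on a larger scale dominates the covariance matrix, while the correlation matrix puts both variables on the same footing.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Sample covariance with the n - 1 denominator."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def correlation(u, v):
    """Pearson correlation: covariance rescaled by both standard deviations."""
    return covariance(u, v) / math.sqrt(covariance(u, u) * covariance(v, v))

# Hypothetical data (not the slide's values): y is on a much larger scale than x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 40.0, 20.0, 50.0, 30.0]

cov_matrix = [[covariance(x, x), covariance(x, y)],
              [covariance(y, x), covariance(y, y)]]
corr_matrix = [[1.0, correlation(x, y)],
               [correlation(y, x), 1.0]]

print(cov_matrix)   # the variance of y is far larger than the variance of x
print(corr_matrix)  # both diagonal entries are 1
```

In this sketch the variance of y is 100 times that of x, so a PCA on the covariance matrix would be driven almost entirely by y.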

Xuhua Xia Slide 9 A positive definite matrix When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all nonzero vectors z with real entries, where z’ is the transpose of z. Given our correlation matrix with all entries being 1, it is easy to find a z that leads to z’Mz = 0, so the matrix is not positive definite. Exercise: replace the correlation matrix with the covariance matrix and solve for z.
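As a quick check of the definition, the sketch below evaluates z’Mz for the all-ones correlation matrix. It is written in Python for illustration only; the function name `quadratic_form` and the choice z = (1, −1) are assumptions made here, not part of the slides.

```python
def quadratic_form(m, z):
    """Return z' M z for a square matrix m (list of rows) and a vector z."""
    n = len(z)
    return sum(z[i] * m[i][j] * z[j] for i in range(n) for j in range(n))

# Correlation matrix of two perfectly correlated variables: every entry is 1.
corr = [[1.0, 1.0],
        [1.0, 1.0]]

z_zero = [1.0, -1.0]   # nonzero vector chosen for illustration
z_pos = [1.0, 1.0]     # another nonzero vector, for comparison

value_zero = quadratic_form(corr, z_zero)   # 1 - 1 - 1 + 1
value_pos = quadratic_form(corr, z_pos)     # 1 + 1 + 1 + 1

print(value_zero, value_pos)
```

Because one nonzero z gives z’Mz = 0, the strict inequality fails and the matrix is not positive definite, even though other vectors give positive values.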

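Xuhua Xia Slide 10 SAS Output [Tables not recovered: eigenvalues of the covariance matrix (Eigenvalue, Difference, Proportion, Cumulative for PRIN1 and PRIN2); eigenvectors (PRIN1 and PRIN2 coefficients for X and Y); principal component scores (OBS, PRIN1, PRIN2).] The eigenvalue table gives the variance accounted for by each principal component. What’s the variance in PC1? How are the principal component scores computed? PC1 is a linear combination of the two variables, weighted by the entries of the first eigenvector (coefficients not recovered).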

Xuhua Xia Slide 11 SAS Output [Table not recovered: principal component scores (OBS, PRIN1, PRIN2).]

Xuhua Xia Slide 12 SAS Output [Tables not recovered: eigenvalues of the correlation matrix (Eigenvalue, Difference, Proportion, Cumulative for PRIN1 and PRIN2); eigenvectors (PRIN1 and PRIN2 coefficients for X and Y); principal component scores (OBS, PRIN1, PRIN2).] The eigenvalue table gives the variance accounted for by each principal component. What’s the variance in PC1?

Xuhua Xia Slide 13 Steps in a PCA –Have at least two variables –Generate a correlation or variance-covariance matrix –Obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example) –Generate principal component (PC) scores –Plot the PC scores in the space with reduced dimensions. All these steps can be automated using SAS.

Xuhua Xia Slide 14 Covariance or Correlation Matrix? [Figure: abundance of two species, Sp1 and Sp2]

Xuhua Xia Slide 15 Covariance or Correlation Matrix?

Xuhua Xia Slide 16 Covariance or Correlation Matrix?

Xuhua Xia Slide 17 The Eigenvalue Problem Given the covariance matrix $A$, the eigenvalues are the values of $\lambda$ that satisfy $|A - \lambda I| = 0$ [numeric matrix and resulting eigenvalues not recovered]. There are n eigenvalues for n variables. The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix. Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
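For a 2x2 covariance matrix the eigenvalue problem reduces to a quadratic equation. The following Python sketch solves it in closed form; the matrix [[2, 1], [1, 2]] and the function name `eigenvalues_2x2_symmetric` are hypothetical, chosen only because the slide's own numbers were not recovered.

```python
import math

def eigenvalues_2x2_symmetric(a, b, d):
    """Return the eigenvalues of [[a, b], [b, d]], largest first.

    They are the roots of the characteristic equation
    lambda^2 - (a + d) * lambda + (a * d - b^2) = 0.
    """
    trace = a + d
    # The discriminant trace^2 - 4 * det simplifies to (a - d)^2 + 4 * b^2 >= 0.
    root = math.sqrt((a - d) ** 2 + 4.0 * b * b)
    return (trace + root) / 2.0, (trace - root) / 2.0

# Hypothetical covariance matrix [[2, 1], [1, 2]]; not the one on the slide.
lam1, lam2 = eigenvalues_2x2_symmetric(2.0, 1.0, 2.0)

print(lam1, lam2)    # eigenvalues, largest first
print(lam1 + lam2)   # equals the sum of the two variances on the diagonal
```

The last line illustrates the property stated on the slide: the eigenvalues add up to the total variance.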

Xuhua Xia Slide 18 Get the Eigenvectors An eigenvector is a vector $\mathbf{x}$ that satisfies the following condition: $A\mathbf{x} = \lambda\mathbf{x}$, where $\lambda$ is the corresponding eigenvalue. In our case $A$ is a variance-covariance matrix of order 2, and $\mathbf{x}$ is a vector specified by $x_1$ and $x_2$.

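Xuhua Xia Slide 19 Get the Eigenvectors We want to find an eigenvector of unit length, i.e., $x_1^2 + x_2^2 = 1$. Combining this constraint with the eigenvector condition from the previous slide gives an equation that can be solved for $x_1$ [equations not recovered]. The first eigenvector is the one associated with the largest eigenvalue.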
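The unit-length constraint can be applied by solving one row of the eigenvector condition and then rescaling. This is a minimal Python sketch under stated assumptions: the matrix [[2, 1], [1, 2]] with eigenvalues 3 and 1 is hypothetical, the off-diagonal entry must be nonzero, and `unit_eigenvector` is a name introduced here for illustration.

```python
import math

def unit_eigenvector(a, b, d, lam):
    """Unit-length eigenvector of [[a, b], [b, d]] for eigenvalue lam.

    Assumes b != 0. The first row of (A - lam * I) x = 0 gives
    x2 = (lam - a) / b * x1; the pair is then rescaled to length 1.
    """
    x1 = 1.0
    x2 = (lam - a) / b
    length = math.hypot(x1, x2)
    return x1 / length, x2 / length

# Hypothetical covariance matrix [[2, 1], [1, 2]] with eigenvalues 3 and 1.
v1 = unit_eigenvector(2.0, 1.0, 2.0, 3.0)   # first eigenvector (largest eigenvalue)
v2 = unit_eigenvector(2.0, 1.0, 2.0, 1.0)   # second eigenvector

print(v1, v2)
print(v1[0] * v2[0] + v1[1] * v2[1])        # dot product of the two eigenvectors
```

The dot product printed at the end checks that the two eigenvectors are orthogonal, as required of principal components.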

Xuhua Xia Slide 20 Get the PC Scores [Matrix product not recovered: the original data (x and y) multiplied by the eigenvectors gives the first and second PC scores.] The original data in a two-dimensional space is reduced to one dimension.
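The projection step can be sketched as follows in Python. The data are hypothetical (y = 2x exactly, so the two variables are perfectly correlated) and are not the slide's values; the eigenvectors are those of the resulting covariance matrix [[2.5, 5], [5, 10]], and mean-centring before projection is an assumption of this sketch.

```python
import math

# Hypothetical, perfectly correlated data (y = 2x); not the values from the slide.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [-4.0, -2.0, 0.0, 2.0, 4.0]

# Unit-length eigenvectors of the covariance matrix [[2.5, 5], [5, 10]]:
# (1, 2)/sqrt(5) for eigenvalue 12.5 and (2, -1)/sqrt(5) for eigenvalue 0.
root5 = math.sqrt(5.0)
e1 = (1.0 / root5, 2.0 / root5)
e2 = (2.0 / root5, -1.0 / root5)

def pc_scores(xs, ys, vec):
    """Project mean-centred observations onto one eigenvector."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return [(a - mx) * vec[0] + (b - my) * vec[1] for a, b in zip(xs, ys)]

pc1 = pc_scores(x, y, e1)
pc2 = pc_scores(x, y, e2)

# Sample variance of the first PC scores (n - 1 denominator; the scores have mean 0).
var_pc1 = sum(s * s for s in pc1) / (len(pc1) - 1)

print(pc1)
print(pc2)       # all scores on the second component are zero
print(var_pc1)   # matches the first eigenvalue
```

Since every second-component score is zero, the first component alone carries all of the variation, which is the sense in which two dimensions are reduced to one.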

Xuhua Xia Slide 21 What Are Principal Components? Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties: –Because of the decreasing variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique. –Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression).
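How much information is lost by dropping the last few PCs can be read from the proportion and cumulative proportion of variance, the quantities that PRINCOMP reports next to each eigenvalue. The Python sketch below computes them; the eigenvalues are hypothetical and `variance_summary` is a name introduced here for illustration.

```python
def variance_summary(eigenvalues):
    """Return (eigenvalue, proportion, cumulative proportion) rows, largest first."""
    total = sum(eigenvalues)
    rows = []
    cumulative = 0.0
    for lam in sorted(eigenvalues, reverse=True):
        proportion = lam / total
        cumulative += proportion
        rows.append((lam, proportion, cumulative))
    return rows

# Hypothetical eigenvalues of a 4-variable correlation matrix (they sum to 4).
rows = variance_summary([2.4, 1.0, 0.4, 0.2])

for lam, proportion, cumulative in rows:
    print(f"{lam:6.2f} {proportion:8.3f} {cumulative:8.3f}")
```

With these made-up eigenvalues, the first two components account for 85% of the total variance, so the last two could be dropped at a cost of 15%.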

Xuhua Xia Slide 22 Index of hidden variables An illustrative example: the ranking of Asian universities by the Asian Week –HKU is ranked second in financial resources, but seventh in academic research –How did HKU get ranked third? –Is there a more objective way of ranking?

Xuhua Xia Slide 23 A Simple Data Set School 5 is clearly the best school School 1 is clearly the worst school

Xuhua Xia Slide 24 Graphic PCA

Xuhua Xia Slide 25 Crime Data in 50 States [Table not recovered: crime rates by state, with columns STATE, MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, AUTO.] PROC PRINCOMP OUT=CRIMCOMP;

[The numeric data lines of this program were not recovered; only the state names survive.]

    DATA CRIME;
      TITLE 'CRIME RATES PER 100,000 POP BY STATE';
      INPUT STATENAME $1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
      CARDS;
    Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
    Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana
    Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana
    Nebraska Nevada New Hampshire New Jersey New Mexico New York North Carolina
    North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina
    South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia
    Wisconsin Wyoming
    ;
    PROC PRINCOMP out=crimcomp;
    run;
    PROC PRINT;
      ID STATENAME;
      VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
    run;
    PROC GPLOT;
      PLOT PRIN2*PRIN1=STATENAME;
      TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
    run;
    PROC PRINCOMP data=CRIME COV OUT=crimcomp;
    run;
    PROC PRINT;
      ID STATENAME;
      VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
    run;
    /* Add to have a map view*/
    proc sort data=crimcomp out=crimcomp;
      by STATENAME;
    run;
    proc sort data=maps.us2 out=mymap;
      by STATENAME;
    run;
    data both;
      merge mymap crimcomp;
      by STATENAME;
    run;
    proc gmap data=both;
      id _map_geometry_;
      choro PRIN1 PRIN2/levels=15;
      /* choro PRIN1/discrete; */
    run;

Xuhua Xia Slide 28 Correlation Matrix [Table not recovered: 7x7 correlation matrix among MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO.] If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.

Xuhua Xia Slide 29 Eigenvalues [Table not recovered: Eigenvalue, Difference, Proportion and Cumulative for PRIN1 to PRIN7.]

Xuhua Xia Slide 30 Eigenvectors [Table not recovered: coefficients of MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO on PRIN1 to PRIN7.] Do these eigenvectors mean anything? –All crimes have positive loadings on the first eigenvector, which is therefore interpreted as a measure of overall crime rate. –The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime.

Xuhua Xia Slide 31 PC Plot: Crime Data [Figure: plot of the first two principal components, with labelled groups of states: North and South Dakota; Nevada, New York, California; Mississippi, Alabama, Louisiana, South Carolina; Maryland.]

Plot of PC1

Plot of PC2