Published by Sara Thomas. Modified over 9 years ago.
1
Dimension Reduction Examples:
1. DNA MICROARRAYS: Khan et al. (2001): 4 types of small round blue cell tumors (SRBCT):
- Neuroblastoma (NB)
- Rhabdomyosarcoma (RMS)
- Ewing family of tumors (EWS)
- Burkitt lymphomas (BL)
Arrays: Training set = 63 arrays (23 EWS, 20 RMS, 12 NB, 8 BL). Testing set = 25 arrays (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other).
Genes: 2308 genes were selected because they passed a filter for a minimal level of expression.
2. PLASTIC EXPLOSIVES: The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the discrete x-components of the X-ray absorption spectrum. The objective is to detect the suitcases with explosives. 2993 suitcases were used for training and 60 for testing (see web page for dataset).
2
Covariance Vs Correlation Matrix
1. Use the covariance or the correlation matrix? If the variables are not measured in the same units, use correlations.
2. Dim(V) = Dim(R) = p x p (V: covariance matrix, R: correlation matrix), so if p is large, dimension reduction is needed.
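To illustrate why units matter, here is a minimal sketch using NumPy. The data (a height in metres and a weight in grams) and the helper name `pca_variance_shares` are hypothetical and are not taken from the slides.

```python
import numpy as np

def pca_variance_shares(data, use_correlation):
    """Share of total variance carried by each principal component."""
    if use_correlation:
        matrix = np.corrcoef(data, rowvar=False)
    else:
        matrix = np.cov(data, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(matrix)[::-1]  # largest first
    return eigenvalues / eigenvalues.sum()

# Hypothetical data: two unrelated variables recorded in very different units.
rng = np.random.default_rng(0)
height_m = rng.normal(1.7, 0.1, size=200)
weight_g = rng.normal(70_000, 10_000, size=200)
data = np.column_stack([height_m, weight_g])

cov_shares = pca_variance_shares(data, use_correlation=False)
corr_shares = pca_variance_shares(data, use_correlation=True)
print(cov_shares)   # almost everything goes to the variable with the large units
print(corr_shares)  # both variables contribute on a comparable scale
```

With the covariance matrix the first component is essentially the weight variable alone, which is an artifact of the units; the correlation matrix removes that effect.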
3
[Figures: Sample Correlation Matrix; Scatterplot Matrix]
4
Principal Components: Geometrical Intuition
- The data cloud is approximated by an ellipsoid.
- The axes of the ellipsoid represent the natural components of the data.
- The length of each semi-axis represents the variability of the corresponding component.
[Figure: Data in the plane of Variable X1 and Variable X2, with Component 1 and Component 2]
5
DIMENSION REDUCTION
- When some of the components show very small variability, they can be omitted.
- The graph shows that Component 2 has low variability, so it can be removed.
- The dimension is reduced from dim=2 to dim=1.
[Figure: Data in the plane of Variable X1 and Variable X2, with Component 1 and Component 2]
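The geometric picture can be sketched numerically. The example below assumes a hypothetical two-variable data cloud in which X2 is roughly X1 plus noise; the eigenvectors of the covariance matrix play the role of the ellipsoid axes and the eigenvalues measure the variability along them.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data cloud: X2 is roughly X1 plus a little noise.
x1 = rng.normal(0.0, 2.0, size=500)
x2 = x1 + rng.normal(0.0, 0.3, size=500)
data = np.column_stack([x1, x2])

centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigenvectors are the axes of the ellipsoid; eigenvalues are the variances along them.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

share = eigenvalues / eigenvalues.sum()
scores_1d = centered @ eigenvectors[:, 0]   # keep Component 1 only (dim 2 -> dim 1)
approx = np.outer(scores_1d, eigenvectors[:, 0]) + data.mean(axis=0)
print(share)
```

Here `scores_1d` is the one-dimensional representation and `approx` maps it back to the original two coordinates, so the two can be compared with `data` to see how little is lost.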
6
Linear Algebra
Linear algebra is useful to write computations in a convenient way.
Singular Value Decomposition: $X_{n \times p} = U_{n \times p} \, D_{p \times p} \, V'_{p \times p}$
X centered => $S_{p \times p} = \frac{1}{n-1} X'X = \frac{1}{n-1} V D^2 V'$
Principal Components (PC): columns of V.
Eigenvalues (variances of the PCs): diagonal elements of $D^2/(n-1)$.
Correlation Matrix: subtract from each column of X its mean, divide by its standard deviation, and calculate the covariance.
If p > n, then SVD: $X'_{p \times n} = U_{p \times n} \, D_{n \times n} \, V'_{n \times n}$ and $S = \frac{1}{n-1} U D^2 U'$.
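The relation between the SVD and the covariance matrix can be checked numerically. This is a sketch on hypothetical random data; it assumes the sample covariance uses the divisor n - 1, so the eigenvalues of S are the squared singular values divided by n - 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # hypothetical correlated data
Xc = X - X.mean(axis=0)                                # center each column

U, d, Vt = np.linalg.svd(Xc, full_matrices=False)      # Xc = U diag(d) Vt
V = Vt.T                                               # columns of V: principal components

S = np.cov(Xc, rowvar=False)                           # sample covariance, divisor n - 1
S_from_svd = V @ np.diag(d**2 / (n - 1)) @ V.T

pc_variances = d**2 / (n - 1)                          # eigenvalues of S
scores = Xc @ V                                        # equals U * d column by column
print(np.allclose(S, S_from_svd))
```

Working from the SVD of the centered data avoids forming the p x p matrix S explicitly, which is the point of the p > n case.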
7
PRINCIPAL COMPONENTS TABLE
Loadings (blank entries are small loadings omitted from the printout):
          Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7
MURDER     0.329   0.588   0.190  -0.217   0.521  -0.377   0.223
RAPE       0.429   0.182  -0.221           0.299   0.746  -0.285
ROBBERY    0.392           0.489  -0.590  -0.467   0.190
ASSAULT    0.395   0.355           0.606  -0.543           0.217
BURGLARY   0.435  -0.219  -0.228                  -0.505  -0.673
LARCENY    0.355  -0.380  -0.572  -0.227                   0.589
AUTO       0.287  -0.546   0.543   0.424   0.352           0.145
Importance of components:
                           Comp.1     Comp.2     Comp.3     Comp.4      Comp.5
Standard deviation      2.0436891  1.0763811  0.8621946  0.5664485  0.50353374
Proportion of Variance  0.5966664  0.1655138  0.1061971  0.0458377  0.03622089
Cumulative Proportion   0.5966664  0.7621802  0.8683773  0.9142150  0.95043587
Analysis:
Dimension Reduction: 2 components explain 76.2% of the variability.
First component: represents the sum or average of all crimes because the loadings are very similar. PC1 = violent crimes + non-violent crimes
Second component: Violent crimes: MURDER, RAPE and ASSAULT have positive coefficients (the ROBBERY coefficient is blank, that is, close to zero). Non-violent crimes: BURGLARY, LARCENY and AUTO all have negative coefficients. PC2 = violent crimes - non-violent crimes
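The "Proportion of Variance" row can be reproduced from the standard deviations. The sketch below assumes the analysis was run on the correlation matrix of the 7 variables, so the total variance is 7; that assumption is inferred from the numbers and is not stated on the slide.

```python
# Standard deviations of the first five components, copied from the table above.
std_devs = [2.0436891, 1.0763811, 0.8621946, 0.5664485, 0.50353374]
p = 7  # number of variables; assumed total variance for a correlation-matrix PCA

proportions = [sd**2 / p for sd in std_devs]

cumulative = []
running = 0.0
for prop in proportions:
    running += prop
    cumulative.append(running)

for k, (prop, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Comp.{k}: proportion={prop:.4f} cumulative={cum:.4f}")
```

The second cumulative value is the 76.2% quoted in the analysis.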
8
Geometrical Intuition
Before rotation: PC1 = Violent + Non-Violent, PC2 = Violent - Non-Violent.
After a 45º rotation: PC1 = Non-Violent, PC2 = Violent.
[Figure: axes Violent and Non-Violent, with PC1 and PC2 at 45º to them]
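The effect of the 45º rotation can be verified with a small sketch. The idealised directions below (equal weights on the two crime groups) and the coordinate order (violent, non-violent) are assumptions made for illustration; the real loadings are only approximately of this form.

```python
import math

def rotate(vec, degrees):
    """Rotate a 2-D vector counterclockwise by the given angle."""
    t = math.radians(degrees)
    x, y = vec
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

s = 1 / math.sqrt(2)
# Coordinates are (violent, non_violent); idealised unit directions.
pc1 = (s, s)    # violent + non-violent
pc2 = (s, -s)   # violent - non-violent

pc1_rot = rotate(pc1, 45)  # lands on the non-violent axis
pc2_rot = rotate(pc2, 45)  # lands on the violent axis
print(pc1_rot, pc2_rot)
```

After the rotation each component lines up with one group of crimes, which is what makes the rotated solution easier to name.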
9
Biplot
Combination of two graphs into one:
1. Graph of the observations in the coordinates of the two principal components.
2. Graph of the variables projected onto the plane of the two principal components.
3. The variables are represented as arrows, the observations as points or labels.
10
[Figure: Variances and Biplot]
11
Analysis after rotation:
First component: Non-violent crimes
Second component: Violent crimes
12
Principal components of 100 genes. PC2 vs PC1.
(a) Cells are the observations, genes are the variables.
(b) Genes are the observations, cells are the variables.
13
Dimension reduction: Choosing the number of PCs
1. k components explain some percentage of the variance: 70%, 80%.
2. k eigenvalues are greater than the average (the average is 1 when the correlation matrix is used).
3. Scree plot: graph the eigenvalues, look for the last sharp decline, and choose k as the number of points above the cutoff.
4. Test the null hypothesis that the last m eigenvalues are equal (0).
The same idea can be applied to factor analysis.
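The first two rules are easy to express in code. The sketch below uses hypothetical eigenvalues of a 6-variable correlation matrix, sorted from largest to smallest; the function names are illustrative only.

```python
def k_by_variance_share(eigenvalues, threshold):
    """Smallest k whose leading eigenvalues reach the given share of total variance."""
    total = sum(eigenvalues)
    running = 0.0
    for k, value in enumerate(eigenvalues, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(eigenvalues)

def k_above_average(eigenvalues):
    """Number of eigenvalues greater than the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value > average)

# Hypothetical eigenvalues (they sum to 6, so the average is 1).
eigenvalues = [3.0, 1.5, 0.7, 0.4, 0.3, 0.1]

print(k_by_variance_share(eigenvalues, 0.70))  # rule 1 with a 70% target
print(k_by_variance_share(eigenvalues, 0.80))  # rule 1 with an 80% target
print(k_above_average(eigenvalues))            # rule 2
```

The rules need not agree: a stricter variance target asks for more components than the average rule does on these numbers.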
14
1. The top 5 eigenvalues explain 81% of the variability.
2. Five eigenvalues are greater than the average (2.5%).
3. Scree plot.
4. Test statistic is 4 significant for 6 and highly significant for 2.
[Figure: scree plot with the average marked]
15
More general biplots
Graphical display of X in which two sets of markers are plotted.
One set of markers, $a_1, \dots, a_G$, represents the rows of X.
The other set of markers, $b_1, \dots, b_p$, represents the columns of X.
For example: $X = U D V'$ and its rank-2 approximation $X_2 = U_2 D_2 V_2'$. Take $A = U_2 D_2^{a}$ and $B = V_2 D_2^{b}$ with $a + b = 1$, so $X_2 = A B'$.
The biplot is the graph of A and B together in the same graph.
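The factorization can be sketched with NumPy as follows. The function name `biplot_markers`, the random data matrix, and the reading of the subscript 2 as "first two singular triplets" are assumptions made for illustration.

```python
import numpy as np

def biplot_markers(X, a=0.5):
    """Row markers A and column markers B such that A @ B.T is the rank-2 fit of X."""
    b = 1.0 - a
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T
    A = U2 * d2**a   # same as U2 @ diag(d2**a)
    B = V2 * d2**b   # same as V2 @ diag(d2**b)
    return A, B

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 4))   # hypothetical data matrix
X = X - X.mean(axis=0)         # center the columns

A, B = biplot_markers(X, a=1.0)
X2 = A @ B.T                   # rank-2 approximation of X
print(A.shape, B.shape)
```

With a = 1 the row markers are the principal component scores and the column markers are the loadings; other choices of a only redistribute the singular values between the two sets of markers while leaving the product A B' unchanged.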
16
[Figure: Biplot of the first two principal components.]