1 Dimension Reduction Examples:

1. DNA MICROARRAYS: Khan et al. (2001): 4 types of small round blue cell tumors (SRBCT):
Neuroblastoma (NB), Rhabdomyosarcoma (RMS), Ewing family of tumors (EWS), Burkitt lymphomas (BL).
Arrays: Training set = 63 arrays (23 EWS, 20 RMS, 12 NB, 8 BL); Testing set = 25 arrays (6 EWS, 5 RMS, 6 NB, 3 BL, 5 other).
Genes: 2308 genes were retained after filtering out genes that showed only minimal expression levels.

2. PLASTIC EXPLOSIVES: The data come from a study on the detection of plastic explosives in suitcases using X-ray signals. The 23 variables are the discrete components of the X-ray absorption spectrum. The objective is to detect the suitcases containing explosives. A set of suitcases was used for training and 60 for testing (see web page for dataset).

2 Covariance vs. Correlation Matrix
1. Use the covariance or the correlation matrix? If the variables are not measured in the same units, use correlations.
2. dim(V) = dim(R) = p×p, and if p is large, dimension reduction is needed.
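As a minimal R sketch (X here is just a random placeholder matrix): the correlation matrix is the covariance matrix of the standardized variables, and princomp lets you pick either one.

```r
X <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)  # placeholder n x p data

V <- cov(X)  # covariance matrix, p x p
R <- cor(X)  # correlation matrix, p x p
all.equal(R, cov(scale(X)))  # correlations = covariances of standardized data

pc_cov <- princomp(X)              # PCA on the covariance matrix
pc_cor <- princomp(X, cor = TRUE)  # PCA on the correlation matrix
```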

3 [Figures: sample correlation matrix and scatterplot matrix of the data.]

4 Principal Components: Geometrical Intuition
- The data cloud is approximated by an ellipsoid.
- The axes of the ellipsoid represent the natural components of the data.
- The length of each semi-axis represents the variability of the corresponding component.
[Figure: data cloud in the (X1, X2) plane, with Component 1 and Component 2 as the axes of the ellipsoid.]

5 DIMENSION REDUCTION
- When some of the components show very small variability, they can be omitted.
- The graph shows that Component 2 has low variability, so it can be removed.
- The dimension is reduced from dim = 2 to dim = 1.
[Figure: the same (X1, X2) data cloud, now projected onto Component 1 after dropping Component 2.]
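A minimal R sketch of this 2-to-1 reduction on simulated data (all names illustrative): generate a strongly correlated cloud, confirm that the second component has little variance, and keep only the first score vector.

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- 2 * x1 + rnorm(200, sd = 0.3)  # nearly collinear with x1
X  <- cbind(x1, x2)

pc <- princomp(X)
pc$sdev^2                  # component variances: Comp.2 is tiny
scores1 <- pc$scores[, 1]  # one-dimensional representation of the data
```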

6 Linear Algebra
Linear algebra is useful for writing the computations in a convenient way.
Singular Value Decomposition: X = U D V', where X is n×p, U is n×p, D is p×p diagonal, and V is p×p.
If X is centered, then X'X = V D² V' (p×p), so the covariance matrix S = X'X/(n-1) has the columns of V as its eigenvectors.
Principal Components (PCs): columns of V.
Eigenvalues (variances of the PCs): diagonal elements of D² (divided by n-1).
Correlation matrix: subtract the mean of each column of X, divide by its standard deviation, and calculate the covariance.
If p > n, apply the SVD to X' instead: X' = U D V', where X' is p×n, U is p×n, D and V are n×n, and, up to the same 1/(n-1) factor, S = U D² U'.
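These identities can be checked with a short R sketch (random data; names illustrative): take the SVD of the centered matrix and compare with the eigendecomposition of the covariance matrix.

```r
set.seed(2)
X  <- matrix(rnorm(50 * 4), nrow = 50)        # n = 50, p = 4
Xc <- scale(X, center = TRUE, scale = FALSE)  # center the columns

s <- svd(Xc)                      # Xc = U D V'
V <- s$v                          # columns of V = principal components
eigvals <- s$d^2 / (nrow(X) - 1)  # variances of the PC scores

eigen(cov(X))$values  # same values as eigvals
scores <- Xc %*% V    # PC scores, equal to s$u %*% diag(s$d)
```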

7 PRINCIPAL COMPONENTS TABLE
Loadings:
[Table: loadings of MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY, AUTO on Comp.1 through Comp.7; values omitted.]
Importance of components:
[Table: standard deviation, proportion of variance, and cumulative proportion for Comp.1 through Comp.5; values omitted.]
Analysis:
Dimension reduction: 2 components explain 76.2% of the variability.
First component: represents the sum or average of all crimes, because the loadings are all very similar.
PC1 = violent crimes + non-violent crimes
Second component: the violent crimes (MURDER, RAPE, ROBBERY, ASSAULT) all have positive coefficients; the non-violent crimes (BURGLARY, LARCENY, AUTO) all have negative coefficients.
PC2 = violent crimes - non-violent crimes
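Tables in this format are what R's princomp prints. A sketch, assuming a data frame crime with the seven crime-rate columns above (the name and data are hypothetical here):

```r
# crime: assumed data frame with columns MURDER, RAPE, ROBBERY,
# ASSAULT, BURGLARY, LARCENY, AUTO (one row per city/state)
pc <- princomp(crime, cor = TRUE)  # PCA on the correlation matrix

loadings(pc)  # prints the "Loadings:" table
summary(pc)   # prints the "Importance of components:" table
```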

8 Geometrical Intuition
A 45° rotation takes the components
PC1 = Violent + Non-Violent, PC2 = Violent - Non-Violent
to the rotated components
PC1 = Non-Violent, PC2 = Violent.
[Figure: the (Non-Violent, Violent) plane, showing the original components and their 45° rotation.]

9 Biplot
Combination of two graphs into one:
1. Graph of the observations in the coordinates of the two principal components.
2. Graph of the variables projected into the plane of the two principal components.
3. The variables are represented as arrows, the observations as points or labels.
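In R this is one call (continuing the hypothetical crime example):

```r
pc <- princomp(crime, cor = TRUE)
biplot(pc)  # observations as labels, variables as arrows,
            # both in the plane of the first two components
```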

10 Variances and Biplot
[Figures: variances of the components and the corresponding biplot.]

11 Analysis after rotation:
First component: non-violent crimes.
Second component: violent crimes.
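One standard way to compute such a rotation in R is the varimax criterion (a sketch, again on the hypothetical crime data; the slide's 45° rotation is one particular orthogonal rotation, varimax chooses one automatically):

```r
pc  <- princomp(crime, cor = TRUE)
rot <- varimax(loadings(pc)[, 1:2])  # orthogonal rotation of the first two loadings
rot$loadings  # after rotation: one component per crime group
```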

12 Principal components of 100 genes: PC2 vs. PC1.
(a) Cells are the observations; genes are the variables.
(b) Genes are the observations; cells are the variables.
[Figure: the two PC2-vs-PC1 scatterplots.]
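The two panels amount to running PCA on the expression matrix and on its transpose. A sketch (expr is an assumed genes-by-cells matrix; prcomp is used because, unlike princomp, it also works when there are more variables than observations, the p > n case of slide 6):

```r
# expr: assumed 100 x k matrix, rows = genes, columns = cells
pc_a <- prcomp(t(expr))  # (a) cells as observations, genes as variables
pc_b <- prcomp(expr)     # (b) genes as observations, cells as variables

plot(pc_a$x[, 1:2], xlab = "PC1", ylab = "PC2")
plot(pc_b$x[, 1:2], xlab = "PC1", ylab = "PC2")
```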

13 Dimension reduction: Choosing the number of PCs
1. k components explain some fixed percentage of the variance: 70%, 80%.
2. k eigenvalues are greater than the average eigenvalue (which equals 1 when the correlation matrix is used).
3. Scree plot: graph the eigenvalues and look for the last sharp decline; choose k as the number of points above the cutoff.
4. Test the null hypothesis that the last m eigenvalues are equal, so that the corresponding components can be dropped.
The same ideas can be applied to factor analysis. (Rules 1-3 are sketched in the code below.)
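Rules 1-3 in R (a sketch on the hypothetical crime data; the 80% threshold is one common choice):

```r
pc   <- princomp(crime, cor = TRUE)
vars <- pc$sdev^2  # eigenvalues

# Rule 1: smallest k explaining at least 80% of the variance
k1 <- which(cumsum(vars) / sum(vars) >= 0.80)[1]

# Rule 2: number of eigenvalues above the average (1 for a correlation matrix)
k2 <- sum(vars > mean(vars))

# Rule 3: scree plot; look for the last sharp decline
screeplot(pc, type = "lines")
```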

14 Applying these rules to the data:
1. The top 5 eigenvalues explain 81% of the variability.
2. Five eigenvalues are greater than the average (2.5%).
3. Scree plot (below, with the average eigenvalue marked).
4. The test statistic is significant for m = 6 and highly significant for m = 2.
[Figure: scree plot with the average eigenvalue indicated.]

15 More general biplots
Graphical display of X in which two sets of markers are plotted. One set of markers, a_1, ..., a_G, represents the rows of X; the other set, b_1, ..., b_p, represents the columns of X.
For example, from X = U D V' take the rank-2 truncation X_2 = U_2 D_2 V_2'. Set A = U_2 D_2^a and B = V_2 D_2^b with a + b = 1, so that X_2 = A B'.
The biplot is the graph of the rows of A and the rows of B together in the same plot.
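A sketch of this construction in R (names illustrative; a = 1 recovers the usual principal-component biplot, a = 0 moves all of the scale onto the column markers):

```r
biplot_markers <- function(X, a = 1) {
  Xc <- scale(X, center = TRUE, scale = FALSE)
  s  <- svd(Xc)
  A  <- s$u[, 1:2] %*% diag(s$d[1:2]^a)        # row markers a_1, ..., a_G
  B  <- s$v[, 1:2] %*% diag(s$d[1:2]^(1 - a))  # column markers b_1, ..., b_p
  list(A = A, B = B)  # rank-2 approximation: Xc ~ A %*% t(B)
}

m <- biplot_markers(matrix(rnorm(30 * 4), 30, 4))
plot(m$A, xlab = "Axis 1", ylab = "Axis 2")    # rows as points
arrows(0, 0, m$B[, 1], m$B[, 2], col = "red")  # columns as arrows
```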

16 [Figures: biplots of the first two principal components.]