1
The University of Kansas
EECS 800 Research Seminar: Mining Biological Data
Instructor: Luke Huan
Fall 2006
2
Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 9/25/2006 Data Types
Administrative
Project design is due Oct 30th (~3 weeks from now).
Include the following items in the document:
- The goal of the project
- A brief introduction of the overall project
- A list of background materials that will be covered in the final report
- A high-level design of your project
- A testing plan
3
Overview
Gain insight into high-dimensional space by projection pursuit (feature reduction).
PCA: Principal component analysis
- A data analysis tool
- Mathematical background
- PCA and gene expression profile analysis (briefly)
4
A Group of Related Techniques
Unsupervised:
- Principal Component Analysis (PCA)
- Latent Semantic Indexing (LSI): truncated SVD
- Independent Component Analysis (ICA)
- Canonical Correlation Analysis (CCA)
Supervised:
- Linear Discriminant Analysis (LDA)
Semi-supervised:
- Research topic
5
Rediscovery – Renaming of PCA
- Statistics: Principal Component Analysis (PCA)
- Social Sciences: Factor Analysis (PCA is a subset)
- Probability / Electrical Eng.: Karhunen–Loève expansion
- Applied Mathematics: Proper Orthogonal Decomposition (POD)
- Geo-Sciences: Empirical Orthogonal Functions (EOF)
6
An Interesting Historical Note
The 1st (?) application of PCA to Functional Data Analysis:
Rao, C. R. (1958) Some statistical methods for comparison of growth curves, Biometrics, 14, 1-17.
1st paper with the “Curves as Data” viewpoint.
7
What is Principal Component Analysis?
Principal component analysis (PCA):
- Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables.
- Retains most of the sample's information.
- Is useful for the compression and classification of data.
By “information” we mean the variation present in the sample, given by the correlations between the original variables.
The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.
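The two properties claimed for the PCs (uncorrelated, ordered by retained variation) can be checked on a small case. The following is a minimal sketch in plain Python, assuming a made-up 2-d sample and using the closed-form eigen-decomposition of a symmetric 2x2 matrix; the data values and all variable names are illustrative and do not come from the lecture.

```python
import math

# Hypothetical 2-d sample: each pair is one observation (x, y).
data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-1.5, -1.4)]
n = len(data)

# Center each variable at its sample mean.
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n
centered = [(x - mx, y - my) for x, y in data]

# Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
sxx = sum(x * x for x, _ in centered) / (n - 1)
syy = sum(y * y for _, y in centered) / (n - 1)
sxy = sum(x * y for x, y in centered) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
half_trace = (sxx + syy) / 2
radius = math.hypot((sxx - syy) / 2, sxy)
lam1, lam2 = half_trace + radius, half_trace - radius

# Unit eigenvector for lam1 (valid when sxy != 0); PC2 is its 90-degree rotation.
length = math.hypot(sxy, lam1 - sxx)
pc1 = (sxy / length, (lam1 - sxx) / length)
pc2 = (-pc1[1], pc1[0])

# Scores: coordinates of each centered point along PC1 and PC2.
scores1 = [x * pc1[0] + y * pc1[1] for x, y in centered]
scores2 = [x * pc2[0] + y * pc2[1] for x, y in centered]

# Sample covariance between the two score variables (should be ~0).
score_cov = sum(a * b for a, b in zip(scores1, scores2)) / (n - 1)
fraction_pc1 = lam1 / (lam1 + lam2)
print(f"PC1 keeps {fraction_pc1:.1%} of the variance; score covariance = {score_cov:.2e}")
```

Because the two made-up variables are strongly correlated, almost all of the variation ends up in PC1, which is the situation in which dropping PC2 loses little.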
8
A Geometric Picture
- The 1st PC is the line in the space such that the “projected” data set has the largest total variance.
- The 2nd PC is the line, orthogonal to the 1st PC, that captures the most of the remaining variance.
- PCs are a series of linear fits to a sample, each orthogonal to all the previous ones.
9
Connect Math to Graphics
2-d toy example (panels: Feature Space, Object Space)
Data points (curves) are the columns of the data matrix, X.
10
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
Sample mean, X̄
11
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
Residuals from mean = data - mean
12
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
Recentered data = mean residuals, shifted to 0
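The mean, residual, and recentering steps of the toy example amount to a few lines of arithmetic. A minimal sketch, assuming three made-up 2-d points (the figure's actual data is not recoverable from the transcript):

```python
# Hypothetical 2-d toy data: one (x, y) point per observation.
points = [(4.0, 1.0), (6.0, 3.0), (8.0, 2.0)]
n = len(points)

# Sample mean: average each coordinate over the observations.
mean = tuple(sum(p[j] for p in points) / n for j in range(2))

# Residuals from the mean = data - mean; this is the recentered data.
centered = [tuple(p[j] - mean[j] for j in range(2)) for p in points]

# The recentered data has mean zero in every coordinate.
centered_mean = tuple(sum(c[j] for c in centered) / n for j in range(2))
print(mean, centered, centered_mean)
```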
13
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
PC1 direction = η = eigenvector (w/ biggest λ)
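One simple way to obtain the eigenvector with the biggest λ is power iteration on the sample covariance matrix. The slide does not say how the eigenvector is computed, so this is a swapped-in technique shown as a sketch only; the function names `covariance` and `top_eigenpair` and the data are hypothetical.

```python
import math

def covariance(points):
    """Sample covariance matrix (list of rows) of d-dimensional points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for a unit vector v.
    lam = sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))
    return lam, v

# Hypothetical toy data, roughly elongated along the direction (2, 1).
points = [(2.0, 1.0), (1.0, 0.5), (-1.0, -0.5), (-2.0, -1.0), (0.0, 0.4), (0.0, -0.4)]
lam1, eta1 = top_eigenpair(covariance(points))
print("PC1 direction:", eta1, "eigenvalue:", lam1)
```

Power iteration converges quickly here because the second eigenvalue is much smaller than the first; when the two are close, a full symmetric eigen-solver from a linear algebra library would be the more usual choice.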
14
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
Shown: centered data, PC1 projection, residual
15
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
PC2 direction = η = eigenvector (w/ 2nd biggest λ)
16
Connect Math to Graphics (Cont.)
2-d toy example (panels: Feature Space, Object Space)
Shown: centered data, PC2 projection, residual
17
Connect Math to Graphics (Cont.)
Note for this 2-d example:
- PC1 residuals = PC2 projections
- PC2 residuals = PC1 projections
(i.e. colors are common across these pictures)
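The note holds because, in 2-d, PC1 and PC2 together form an orthonormal basis of the whole plane. A small sketch, assuming an arbitrary pair of orthonormal directions and one made-up centered point:

```python
import math

# Hypothetical orthonormal PC directions in 2-d (any rotation angle works).
angle = math.radians(30)
pc1 = (math.cos(angle), math.sin(angle))
pc2 = (-math.sin(angle), math.cos(angle))

def project(point, direction):
    """Orthogonal projection of a 2-d point onto the line spanned by a unit vector."""
    coeff = point[0] * direction[0] + point[1] * direction[1]
    return (coeff * direction[0], coeff * direction[1])

x = (1.5, -0.5)  # one centered data point
proj1, proj2 = project(x, pc1), project(x, pc2)

# Residual after removing the PC1 projection.
resid1 = (x[0] - proj1[0], x[1] - proj1[1])
print("PC1 residual:", resid1, "PC2 projection:", proj2)
```

In more than two dimensions the PC1 residual still contains every remaining PC, so the equality is specific to the 2-d case.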
18
PCA and Complex Data Analysis
- The data set is a set of curves.
- How to find clusters?
- Treat curves as points in a high-dimensional space.
- Applications in gene expression profile analysis.
Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808.
19
N-D Toy Example
- Upper left: the mean.
- Upper right: residuals from the mean.
- Lower left: projections of the mean residuals in the PC1 direction.
- Lower right: further residuals from the PC1 projections.
20
Yeast Cell Cycle Data
Central question: which genes are “periodic” over 2 cell cycles?
21
Yeast Cell Cycle Data, PCA analysis
Periodic genes?
Naïve approach: simple PCA
22
Yeast Cell Cycle Data, FDA View
Central question: which genes are “periodic” over 2 cell cycles?
Naïve approach: simple PCA
- Doesn’t work.
- No apparent (2-cycle) periodic structure?
- Eigenvalues suggest a large amount of “variation”.
- PCA finds “directions of maximal variation”.
- Often, but not always, these are the same as the “interesting directions”.
- Here we need a better approach to study periodicities.
23
Yeast Cell Cycles, Freq. 2 Proj.
PCA on the frequency-2 periodic component of the data
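One plausible reading of the "frequency-2 periodic component" is the least-squares projection of each expression profile onto a sine and cosine that complete exactly two cycles over the record. The slide does not spell out the procedure, so the sketch below is an assumption: the function name `freq2_component`, the choice of 18 equally spaced time points, and the synthetic profile are all hypothetical.

```python
import math

def freq2_component(curve):
    """Project a curve sampled at n equally spaced times in [0, 1) onto
    span{cos(4*pi*t), sin(4*pi*t)}, i.e. exactly two cycles over the record."""
    n = len(curve)
    times = [i / n for i in range(n)]
    cos2 = [math.cos(4 * math.pi * t) for t in times]
    sin2 = [math.sin(4 * math.pi * t) for t in times]
    # For n > 4 equally spaced points the two basis curves are orthogonal,
    # each with squared norm n / 2, so least squares reduces to inner products.
    a = 2 / n * sum(y * c for y, c in zip(curve, cos2))
    b = 2 / n * sum(y * s for y, s in zip(curve, sin2))
    return [a * c + b * s for c, s in zip(cos2, sin2)]

# Hypothetical expression profile: offset + two-cycle signal + a faster wiggle.
n = 18
profile = [3.0 + 1.5 * math.cos(4 * math.pi * i / n) + 0.7 * math.sin(10 * math.pi * i / n)
           for i in range(n)]
periodic_part = freq2_component(profile)
print([round(v, 3) for v in periodic_part])
```

PCA would then be run on the projected profiles of all genes, so that the directions of maximal variation are restricted to the two-cycle structure of interest.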
24
PCA for 2D Surfaces
2-d M-Rep example: corpus callosum
Figure labels: atoms, spokes, implied boundary
25
Pros and Cons
PCA works for:
- Multi-dimensional Gaussian distributions
It doesn’t work for:
- Gaussian mixtures
- Data in non-Euclidean spaces
26
Detailed Look at PCA
Three important (and interesting) viewpoints:
- Mathematics
- Numerics
- Statistics
1st: review linear algebra and multivariate probability.
27
Review of Linear Algebra
Vector space: a set of “vectors”, $x$, and “scalars”, $a$ (coefficients, i.e. elements of a field), “closed” under “linear combination” ($\sum_i a_i x_i$ is in the space).
For example: “$d$-dimensional Euclidean space”, $\mathbb{R}^d$.
28
Subspace
Subspace: a subset that is again a vector space, i.e. that is closed under linear combination.
Examples:
- Lines through the origin
- Planes through the origin
- All linear combinations of a subset of vectors (= a hyperplane through the origin)
29
Basis
Basis of a subspace: a set of vectors that
- span, i.e. everything is a linear combination of them;
- are linearly independent, i.e. the linear combination is unique.
Example: the “unit vector basis” in $\mathbb{R}^d$, e.g. $e_1 = (1, 0, \dots, 0)^T, \dots, e_d = (0, \dots, 0, 1)^T$.
30
Basis Matrix
Basis matrix of a subspace of $\mathbb{R}^d$:
Given a basis $v_1, \dots, v_n$, create the $d \times n$ matrix of columns: $B = (v_1 \ \cdots \ v_n)$.
Then a “linear combination” is a matrix multiplication: $\sum_{i=1}^{n} a_i v_i = B a$, where $a = (a_1, \dots, a_n)^T$.
31
Linear Transformation
Aside on matrix multiplication (linear transformation):
For matrices $A$ ($m \times k$) and $B$ ($k \times n$), define the “matrix product” $(AB)_{ij} = \sum_{l=1}^{k} a_{il} b_{lj}$
(“inner products” of the rows of $A$ with the columns of $B$)
(composition of linear transformations).
32
Matrix Trace
For a square matrix $A$ ($m \times m$), define $\mathrm{tr}(A) = \sum_{i=1}^{m} a_{ii}$.
Trace commutes with matrix multiplication: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$.
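The trace identity holds even when the two products have different sizes. A quick numerical check in plain Python, with made-up matrices and hypothetical helper names `matmul` and `trace`:

```python
def matmul(a, b):
    """Matrix product: entry (i, j) is the inner product of row i of a with column j of b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def trace(m):
    """Sum of the diagonal entries of a square matrix."""
    return sum(m[i][i] for i in range(len(m)))

A = [[1, 2, 3], [4, 5, 6]]    # 2 x 3
B = [[1, 0], [2, 1], [0, 3]]  # 3 x 2

# AB is 2 x 2 and BA is 3 x 3, yet their traces agree.
print(trace(matmul(A, B)), trace(matmul(B, A)))  # prints "28 28"
```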
33
Dimension
Dimension of a subspace (a notion of “size”): the number of elements in a basis (unique).
$\dim(\mathbb{R}^d) = d$ (use the unit vector basis above).
Examples:
- The dimension of a line is 1.
- The dimension of a plane is 2.
Dimension is “degrees of freedom”.
34
Vector Norm
In $\mathbb{R}^d$: $\|x\| = \left( \sum_{j=1}^{d} x_j^2 \right)^{1/2} = (x^T x)^{1/2}$.
Idea: the “length” of the vector.
“Length-normalized vector”: $x / \|x\|$ (has length one, thus lies on the surface of the unit sphere).
Get “distance” as: $d(x, y) = \|x - y\|$.
35
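Inner Product
Inner (dot, scalar) product: for vectors $x$ and $y$, $\langle x, y \rangle = \sum_{j=1}^{d} x_j y_j = x^T y$.
- Related to the norm via $\|x\| = \langle x, x \rangle^{1/2}$.
- Measures the “angle between $x$ and $y$” as: $\cos \theta = \dfrac{\langle x, y \rangle}{\|x\| \, \|y\|}$.
- Key to “orthogonality”, i.e. “perpendicularity”: $x \perp y$ if and only if $\langle x, y \rangle = 0$.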
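Norm, distance, angle, and orthogonality all follow from the inner product. A short sketch with two made-up vectors (the helper names `inner` and `norm` are illustrative):

```python
import math

def inner(x, y):
    """Inner (dot) product of two vectors of equal length."""
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    """Euclidean length, defined through the inner product."""
    return math.sqrt(inner(x, x))

x, y = (3.0, 4.0), (-4.0, 3.0)

unit_x = tuple(a / norm(x) for a in x)               # length-normalized vector
distance = norm(tuple(a - b for a, b in zip(x, y)))  # d(x, y) = ||x - y||
cos_angle = inner(x, y) / (norm(x) * norm(y))        # 0 means orthogonal

print(norm(x), unit_x, distance, cos_angle)
```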
36
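Orthonormal Basis
Orthonormal basis $v_1, \dots, v_n$:
- All are orthogonal to each other, i.e. $\langle v_i, v_j \rangle = 0$ for $i \neq j$.
- All have length 1, i.e. $\langle v_i, v_i \rangle = 1$ for $i = 1, \dots, n$.
“Spectral representation”: $x = \sum_{i=1}^{n} a_i v_i$, where $a_i = \langle x, v_i \rangle$.
Check: $\langle x, v_i \rangle = \left\langle \sum_{j} a_j v_j, v_i \right\rangle = \sum_{j} a_j \langle v_j, v_i \rangle = a_i$.
Matrix notation: $x = B a$, where $a = B^T x$, i.e. $a$ is called the “transform (e.g. Fourier, wavelet) of $x$”.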
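The spectral representation is a two-step recipe: take inner products with the basis vectors to get the coefficients, then sum the scaled basis vectors to reconstruct. A minimal sketch, assuming a made-up orthonormal basis of R^2:

```python
import math

s = 1 / math.sqrt(2)
basis = [(s, s), (s, -s)]  # hypothetical orthonormal basis of R^2
x = (3.0, 1.0)

# Transform: coefficient a_i is the inner product of x with basis vector v_i.
coeffs = [sum(xj * vj for xj, vj in zip(x, v)) for v in basis]

# Reconstruction: x = sum_i a_i v_i.
recon = tuple(sum(a * v[j] for a, v in zip(coeffs, basis)) for j in range(2))
print(coeffs, recon)
```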
37
Vector Projection
Projection of a vector $x$ onto a subspace $V$:
Idea: the member of $V$ that is closest to $x$ (i.e. an “approximation”).
Find $P_V x \in V$ that solves: $\min_{v \in V} \|x - v\|$ (“least squares”).
General solution in $\mathbb{R}^d$, for basis matrix $B_V$: $P_V x = B_V (B_V^T B_V)^{-1} B_V^T x$.
So the “projection operator” is a “matrix multiplication”: $P_V = B_V (B_V^T B_V)^{-1} B_V^T$
(thus projection is another linear operation).
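The projection formula follows from the least-squares problem in one step. The derivation below is a sketch in standard notation that may differ from the symbols on the original slide: $B$ denotes the basis matrix of $V$ and $a$ the unknown coefficient vector.

```latex
\begin{align*}
  % Every member of V is a linear combination of the basis columns: v = B a.
  \min_{a} \; \|x - B a\|^2
    &= \min_{a} \; (x - B a)^T (x - B a) \\
  % Set the gradient with respect to a to zero (normal equations).
  -2 B^T (x - B a) = 0
    &\;\Longrightarrow\; B^T B \, a = B^T x \\
  % B has linearly independent columns, so B^T B is invertible.
  a &= (B^T B)^{-1} B^T x \\
  P_V x = B a &= B (B^T B)^{-1} B^T x
\end{align*}
```

Invertibility of $B^T B$ relies on the columns of $B$ being a basis, i.e. linearly independent.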
38
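Vector Projection (cont.)
Projection using an orthonormal basis $v_1, \dots, v_n$:
The basis matrix is “orthonormal”: $B^T B = I$.
So $P_V x = B B^T x$ = Recon(Coeffs of $x$ “in the direction of $V$”).
For the “orthogonal complement”, $V^{\perp}$: $x = P_V x + P_{V^{\perp}} x$ and $\|x\|^2 = \|P_V x\|^2 + \|P_{V^{\perp}} x\|^2$.
Parseval inequality: $\|P_V x\|^2 = \sum_{i=1}^{n} \langle x, v_i \rangle^2 \le \|x\|^2$.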
39
Random Vectors
Given a “random vector” $X = (X_1, \dots, X_d)^T$:
A “center” of the distribution is the mean vector, $\mu = E X = (E X_1, \dots, E X_d)^T$.
A “measure of spread” is the covariance matrix: $\Sigma = \mathrm{cov}(X) = E\left[ (X - \mu)(X - \mu)^T \right]$.
40
Empirically
Given a random sample $X_1, \dots, X_n$, estimate the theoretical mean, $\mu$, with the sample mean: $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$.
41
Empirically (Cont.)
And estimate the “theoretical covariance”, $\Sigma$, with the “sample covariance”: $\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^T$.
42
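With Linear Algebra
Outer product representation: $\hat{\Sigma} = \tilde{X} \tilde{X}^T$, where $\tilde{X} = \frac{1}{\sqrt{n-1}} \left( X_1 - \bar{X} \ \cdots \ X_n - \bar{X} \right)$ is the $d \times n$ matrix of centered, rescaled data columns.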
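The sample covariance can be computed either as a scaled sum of outer products or as a single matrix product of the centered, rescaled data matrix with its transpose. The sketch below checks that both forms agree on a made-up sample; the 1/(n-1) scaling is the usual unbiased convention and is an assumption here, since the slide's formula did not survive extraction.

```python
import math

# Hypothetical sample: n = 4 observations of a d = 2 random vector.
sample = [(1.0, 2.0), (3.0, 1.0), (4.0, 5.0), (0.0, 4.0)]
n, d = len(sample), len(sample[0])

# Sample mean vector.
mean = [sum(x[j] for x in sample) / n for j in range(d)]

# Direct form: scaled sum of outer products of the centered observations.
cov_direct = [[sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in sample) / (n - 1)
               for j in range(d)] for i in range(d)]

# Outer-product form: columns of x_tilde are (x_i - mean) / sqrt(n - 1).
scale = 1 / math.sqrt(n - 1)
x_tilde = [[scale * (x[i] - mean[i]) for x in sample] for i in range(d)]  # d x n
cov_outer = [[sum(x_tilde[i][k] * x_tilde[j][k] for k in range(n))
              for j in range(d)] for i in range(d)]

print(mean, cov_direct, cov_outer)
```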
43
PCA as an Optimization Problem
Find the “direction of greatest variability”:
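The slide's formula is not recoverable from the transcript, so the following is the standard textbook formulation, given as a sketch: $u$ denotes a candidate unit direction, $\hat{\Sigma}$ the sample covariance matrix, and $\lambda$ a Lagrange multiplier.

```latex
\begin{align*}
  % Sample variance of the data projected onto a unit direction u.
  \widehat{\mathrm{var}}(u^T X) &= u^T \hat{\Sigma} \, u \\
  % Direction of greatest variability.
  \eta_1 &= \arg\max_{\|u\| = 1} \; u^T \hat{\Sigma} \, u \\
  % Lagrangian for the unit-length constraint.
  L(u, \lambda) &= u^T \hat{\Sigma} \, u - \lambda \, (u^T u - 1) \\
  % Stationarity gives an eigenvalue equation.
  \nabla_u L = 2 \hat{\Sigma} u - 2 \lambda u = 0
    &\;\Longrightarrow\; \hat{\Sigma} u = \lambda u \\
  % At any such u the objective equals the eigenvalue.
  u^T \hat{\Sigma} \, u &= \lambda \, u^T u = \lambda
\end{align*}
```

So the maximizer is the eigenvector of $\hat{\Sigma}$ with the biggest eigenvalue, which matches the PC1 direction in the 2-d toy example.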
44
Applications of PCA
- Eigenfaces for recognition. Turk and Pentland. 1991.
- Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
- Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
- Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808.