The University of Kansas, EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan. Fall 2006.

Administrative
Project design is due Oct 30th, about 3 weeks from now. Include the following items in the document:
- The goal of the project
- A brief introduction of the overall project
- A list of background materials that will be covered in the final report
- A high-level design of your project
- A testing plan

Overview
Gain insight into high-dimensional spaces by projection pursuit (feature reduction).
PCA: principal component analysis
- A data analysis tool
- Mathematical background
- PCA and gene expression profile analysis, briefly

A Group of Related Techniques
Unsupervised:
- Principal Component Analysis (PCA)
- Latent Semantic Indexing (LSI): truncated SVD
- Independent Component Analysis (ICA)
- Canonical Correlation Analysis (CCA)
Supervised:
- Linear Discriminant Analysis (LDA)
Semi-supervised:
- Still a research topic

Rediscovery – Renaming of PCA
- Statistics: Principal Component Analysis (PCA)
- Social Sciences: Factor Analysis (PCA is a subset)
- Probability / Electrical Engineering: Karhunen-Loeve expansion
- Applied Mathematics: Proper Orthogonal Decomposition (POD)
- Geosciences: Empirical Orthogonal Functions (EOF)

An Interesting Historical Note
The 1st (?) application of PCA to Functional Data Analysis:
Rao, C. R. (1958) Some statistical methods for comparison of growth curves, Biometrics, 14.
The 1st paper with the "Curves as Data" viewpoint.

What is Principal Component Analysis?
Principal component analysis (PCA):
- Reduces the dimensionality of a data set by finding a new, smaller set of variables that retains most of the sample's information.
- Useful for the compression and classification of data.
- By "information" we mean the variation present in the sample, given by the correlations between the original variables.
- The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.
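The deck itself contains no code, so the following is only a minimal illustrative sketch (NumPy, with a made-up 2-d Gaussian sample; all names are my own) of the idea above: center the data, eigendecompose the sample covariance, and order the uncorrelated components by the fraction of variance each retains.

```python
import numpy as np

def pca(X, n_components=2):
    """Toy PCA sketch: rows of X are observations, columns are variables."""
    X_centered = X - X.mean(axis=0)             # remove the sample mean
    cov = np.cov(X_centered, rowvar=False)      # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue (variance)
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    directions = eigvecs[:, :n_components]      # principal directions
    scores = X_centered @ directions            # uncorrelated PC scores for each sample
    explained = eigvals[:n_components] / eigvals.sum()
    return directions, scores, explained

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]],
                            size=200)
directions, scores, explained = pca(X)
print(explained)    # most of the variance should load on the first component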

A Geometric Picture
- The 1st PC is the line in the space such that the "projected" data set has the largest total variance.
- The 2nd PC is the line, orthogonal to the 1st, that captures the largest share of the remaining variance.
- The PCs are a series of linear fits to a sample, each orthogonal to all the previous ones.

Connect Math to Graphics
2-d toy example (feature space and object space views): the data points (curves) are the columns of the data matrix X.

Connect Math to Graphics (Cont.)
2-d toy example: the sample mean of X.

Connect Math to Graphics (Cont.)
2-d toy example: residuals from the mean = data - mean.

Connect Math to Graphics (Cont.)
2-d toy example: recentered data = the mean residuals, shifted to 0.

Connect Math to Graphics (Cont.)
2-d toy example: the PC1 direction η is the eigenvector (of the covariance matrix) with the biggest eigenvalue λ.

Connect Math to Graphics (Cont.)
2-d toy example: the centered data, their PC1 projections, and the PC1 residuals.

Connect Math to Graphics (Cont.)
2-d toy example: the PC2 direction η is the eigenvector (of the covariance matrix) with the 2nd biggest eigenvalue λ.

Connect Math to Graphics (Cont.)
2-d toy example: the centered data, their PC2 projections, and the PC2 residuals.

Connect Math to Graphics (Cont.)
Note for this 2-d example:
- PC1 residuals = PC2 projections
- PC2 residuals = PC1 projections
(i.e., the same colors appear across these pictures)
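As a hedged illustration of the toy example (the data, distribution, and variable names below are invented, not the deck's), this NumPy sketch follows the slides' convention that the data objects are the columns of X, and checks numerically that in 2-d the PC1 residuals coincide with the PC2 projections.

```python
import numpy as np

# Hypothetical 2-d toy data; columns of X are the data objects, as in the slides.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([2.0, 1.0], [[2.0, 1.2], [1.2, 1.0]], size=100).T  # shape (2, n)

mean = X.mean(axis=1, keepdims=True)       # sample mean (a point in feature space)
resid = X - mean                           # residuals from the mean = recentered data

cov = resid @ resid.T / (X.shape[1] - 1)   # 2x2 sample covariance
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues returned in ascending order
pc1 = eigvecs[:, [1]]                      # eigenvector with the biggest eigenvalue
pc2 = eigvecs[:, [0]]                      # eigenvector with the 2nd biggest eigenvalue

proj1 = pc1 @ (pc1.T @ resid)              # PC1 projections of the centered data
proj2 = pc2 @ (pc2.T @ resid)              # PC2 projections
resid1 = resid - proj1                     # PC1 residuals

# In this 2-d example: PC1 residuals = PC2 projections, and vice versa.
print(np.allclose(resid1, proj2))          # expected: True
print(np.allclose(resid - proj2, proj1))   # expected: True
```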

PCA and Complex Data Analysis
- The data set is a set of curves. How do we find clusters?
- Treat the curves as points in a high-dimensional space.
- Applications in gene expression profile analysis:
Zhao, X., Marron, J. S. and Wells, M. T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14.

N-D Toy Example
- Upper left: the mean.
- Upper right: residuals from the mean.
- Lower left: projections of the mean residuals onto the PC1 direction.
- Lower right: the further residuals left after removing the PC1 projections.

Yeast Cell Cycle Data
Central question: which genes are "periodic" over 2 cell cycles?

Yeast Cell Cycle Data, PCA Analysis
Periodic genes? Naïve approach: simple PCA.

Yeast Cell Cycle Data, FDA View
Central question: which genes are "periodic" over 2 cell cycles?
Naïve approach: simple PCA. It doesn't work:
- No apparent (2-cycle) periodic structure, yet the eigenvalues suggest a large amount of "variation".
- PCA finds "directions of maximal variation", which are often, but not always, the same as the "interesting directions".
- Here we need a better approach to study periodicities.

Yeast Cell Cycles, Frequency-2 Projection
PCA on the frequency-2 periodic component of the data.
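The slides do not show how the frequency-2 component is extracted, so the following is only one plausible sketch, not necessarily the method used in the cited analysis: project each (made-up) expression profile onto an orthonormalized sine/cosine pair with two cycles over the time course, then run PCA (via SVD) on that periodic component. All array sizes and the time grid are assumptions.

```python
import numpy as np

# Made-up dimensions: 500 gene expression profiles sampled at 18 time points
# spanning two cell cycles (time rescaled to [0, 1)).
n_genes, n_times = 500, 18
t = np.linspace(0.0, 1.0, n_times, endpoint=False)
rng = np.random.default_rng(2)
expr = rng.standard_normal((n_genes, n_times))        # placeholder expression matrix

# Orthonormal basis for the frequency-2 subspace (two full cycles over [0, 1)).
basis = np.stack([np.cos(4 * np.pi * t), np.sin(4 * np.pi * t)], axis=1)   # (n_times, 2)
basis, _ = np.linalg.qr(basis)                        # orthonormalize the columns

# Frequency-2 periodic component of each profile: project onto that subspace.
freq2 = (expr @ basis) @ basis.T                      # shape (n_genes, n_times)

# PCA of the periodic component, via SVD of the centered matrix.
freq2_centered = freq2 - freq2.mean(axis=0)
_, sing_vals, vt = np.linalg.svd(freq2_centered, full_matrices=False)
scores = freq2_centered @ vt[:2].T                    # each gene's first two PC scores
print(scores.shape)                                   # (500, 2)
```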

PCA for 2-D Surfaces
2-d m-rep example: corpus callosum (atoms, spokes, implied boundary).

Pros and Cons
PCA works well for:
- Multi-dimensional Gaussian distributions
It doesn't work well for:
- Gaussian mixtures
- Data in non-Euclidean spaces

Detailed Look at PCA
Three important (and interesting) viewpoints:
- Mathematics
- Numerics
- Statistics
First: review linear algebra and multivariate probability.

Review of Linear Algebra
Vector space: a set of "vectors" and "scalars" (coefficients, i.e., elements of a field) that is "closed" under "linear combination" (any linear combination of vectors in the space stays in the space).
Example: d-dimensional Euclidean space, R^d.

Subspace
Subspace: a subset that is again a vector space, i.e., closed under linear combination.
Examples:
- lines through the origin
- planes through the origin
- the set of all linear combinations of a subset of vectors (a hyperplane through the origin)

Basis
A basis of a subspace is a set of vectors that:
- span the subspace, i.e., everything in it is a linear combination of them
- are linearly independent, i.e., that linear combination is unique
Example: the "unit vector basis" e_1, ..., e_d of R^d, where e_i has a 1 in the i-th coordinate and 0 elsewhere.

Basis Matrix
For a subspace of R^d with a given basis, create the matrix whose columns are the basis vectors. Then a "linear combination" is a matrix multiplication (see the reconstruction below).
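The slide's formulas did not survive extraction; this is a minimal reconstruction in standard notation (the symbols B, v_i, c are my own choices, not copied from the deck).

```latex
% Basis matrix of a subspace of R^d with basis v_1, ..., v_k,
% and a linear combination written as a matrix multiplication.
B = \begin{pmatrix} v_1 & v_2 & \cdots & v_k \end{pmatrix} \in \mathbb{R}^{d \times k},
\qquad
Bc = \sum_{i=1}^{k} c_i v_i
\quad \text{for } c = (c_1, \dots, c_k)^{T}.
```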

Linear Transformation
Aside on matrix multiplication (a linear transformation): for matrices A and B, the "matrix product" is defined entrywise via "inner products" of the rows of A with the columns of B, and it corresponds to the composition of the two linear transformations (see below).
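Again the formulas are missing; a standard reconstruction (notation assumed, not copied from the slides):

```latex
% Matrix product: entry (i, j) is the inner product of row i of A with column j of B;
% the product represents the composition of the two linear maps.
(AB)_{ij} = \sum_{k} A_{ik} B_{kj},
\qquad
(AB)\,x = A\,(Bx).
```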

Matrix Trace
For a square matrix, define the trace as the sum of the diagonal entries. The trace "commutes" with matrix multiplication (see below).
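A standard reconstruction of the missing trace formulas:

```latex
\operatorname{tr}(A) = \sum_{i} a_{ii},
\qquad
\operatorname{tr}(AB) = \operatorname{tr}(BA).
```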

Dimension
Dimension of a subspace (a notion of "size"): the number of elements in a basis (this number is unique; with the unit vector basis above, R^d has dimension d).
Examples:
- the dimension of a line is 1
- the dimension of a plane is 2
Dimension counts "degrees of freedom".

Vector Norm
In R^d, the norm captures the idea of the "length" of a vector. The "length-normalized vector" has length one, and thus lies on the surface of the unit sphere. The norm also gives a "distance" between vectors (formulas reconstructed below).
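The norm formulas on this slide were images that did not survive; a standard reconstruction for x in R^d:

```latex
\|x\| = \sqrt{\sum_{i=1}^{d} x_i^{2}} = \sqrt{x^{T}x},
\qquad
\text{unit vector: } \frac{x}{\|x\|},
\qquad
d(x, y) = \|x - y\|.
```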

Inner Product
The inner (dot, scalar) product of two vectors is related to the norm and measures the "angle between" the vectors. It is the key to "orthogonality" (perpendicularity): two vectors are orthogonal if and only if their inner product is zero (formulas below).
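A standard reconstruction of the missing inner product formulas:

```latex
\langle x, y \rangle = x^{T} y = \sum_{i=1}^{d} x_i y_i,
\qquad
\|x\| = \sqrt{\langle x, x \rangle},
\qquad
\cos\theta = \frac{\langle x, y \rangle}{\|x\|\,\|y\|},
\qquad
x \perp y \iff \langle x, y \rangle = 0.
```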

Orthonormal Basis
An orthonormal basis: all basis vectors are orthogonal to each other, and all have length 1.
"Spectral representation": every vector is the sum of its inner products with the basis vectors times those basis vectors.
In matrix notation, the coefficient vector is called the "transform" (e.g., Fourier, wavelet) of the original vector (see the reconstruction below).
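A reconstruction in standard notation (e_i, B, and θ are my own symbols):

```latex
\langle e_i, e_j \rangle = 0 \quad (i \neq j), \qquad \|e_i\| = 1,
\qquad
x = \sum_{i=1}^{d} \langle x, e_i \rangle \, e_i .

\text{Matrix notation: } B = (e_1 \ \cdots \ e_d), \quad B^{T}B = I, \quad
\theta = B^{T} x \ \text{(the "transform" of } x\text{)}, \quad x = B\theta .
```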

Vector Projection
Projection of a vector onto a subspace: the member of the subspace that is closest to the vector (i.e., its best "approximation"), found by solving a "least squares" problem. The general solution is written in terms of the basis matrix, so the "projection operator" is a "matrix multiplication" (thus projection is another linear operation); see below.
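A standard reconstruction of the missing projection formulas (V, B, and P_V are assumed symbols):

```latex
% Projection of x onto a subspace V whose basis matrix B has a basis of V as columns.
P_V x = \operatorname*{arg\,min}_{v \in V} \|x - v\|^{2},
\qquad
P_V x = B\,(B^{T}B)^{-1}B^{T} x .
```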

Vector Projection (Cont.)
Projection using an orthonormal basis: the basis matrix is "orthonormal", so the projection is simply the reconstruction from the coefficients of the vector "in the direction of" each basis vector. The vector splits into its projection onto the subspace and its projection onto the "orthogonal complement", and the squared norms add up (the slide's "Parseval inequality"); see the reconstruction below.
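A reconstruction in standard notation; note that for a k-dimensional subspace the coefficient form of the last relation is Bessel's inequality, with equality (Parseval) when the orthonormal set spans the whole space.

```latex
% With an orthonormal basis matrix B (B^T B = I), the projection simplifies,
% the vector splits into subspace and orthogonal-complement parts, and norms add.
P_V x = B B^{T} x = \sum_{i=1}^{k} \langle x, e_i \rangle \, e_i,
\qquad
x = P_V x + P_{V^{\perp}} x,
\qquad
\|x\|^{2} = \|P_V x\|^{2} + \|P_{V^{\perp}} x\|^{2},
\qquad
\sum_{i=1}^{k} \langle x, e_i \rangle^{2} \le \|x\|^{2}.
```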

Random Vectors
Given a "random vector" X:
- A "center" of the distribution is the mean vector.
- A "measure of spread" is the covariance matrix.
(Formulas reconstructed below.)
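A standard reconstruction of the missing definitions:

```latex
\mu = E[X],
\qquad
\Sigma = \operatorname{Cov}(X) = E\!\left[(X - \mu)(X - \mu)^{T}\right].
```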

Empirically
Given a random sample, estimate the theoretical mean with the sample mean.

Empirically (Cont.)
And estimate the theoretical covariance with the sample covariance (both estimators are written out below).
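A standard reconstruction of the estimators on this and the previous slide; the 1/(n-1) divisor is an assumption, since the slide's own choice of divisor is not recoverable.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i,
\qquad
\hat{\Sigma} = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^{T}.
```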

With Linear Algebra
Outer product representation: the sample covariance can be written as an outer product of the mean-centered data matrix with itself (see below).
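A reconstruction under the deck's columns-are-objects convention (the centered-matrix symbol and divisor are my own assumptions):

```latex
% Outer product representation of the sample covariance, where the columns of the
% centered data matrix are the mean-centered observations.
\widetilde{X} = \begin{pmatrix} X_1 - \bar{X} & \cdots & X_n - \bar{X} \end{pmatrix},
\qquad
\hat{\Sigma} = \frac{1}{n-1}\, \widetilde{X}\widetilde{X}^{T}.
```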

PCA as an Optimization Problem
Find the "direction of greatest variability" (formulation reconstructed below).
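A standard reconstruction of the optimization problem:

```latex
% Direction of greatest variability: a unit vector maximizing the projected variance.
\max_{\|u\| = 1} \widehat{\operatorname{Var}}\!\left(u^{T}X\right)
= \max_{\|u\| = 1} u^{T}\hat{\Sigma}\,u .
```

The maximizer is the eigenvector of the sample covariance with the largest eigenvalue, and the maximum value is that eigenvalue; each subsequent PC direction maximizes the same objective subject to orthogonality to all previous directions.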

Applications of PCA
- Eigenfaces for recognition. Turk and Pentland.
- Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo.
- Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien.
- Zhao, X., Marron, J. S. and Wells, M. T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14.