BINF6201/8201 Principal components analysis (PCA) -- Visualization of amino acids using their physico-chemical properties 08-31-2010.



Physico-chemical properties and amino acids
- The physico-chemical properties of the amino acids in a peptide chain determine its folding process, and therefore its 3D structure and functions.
- A full understanding of these properties is necessary for understanding the folding process, and therefore for designing algorithms that predict protein 3D structures from amino acid sequences.
- Commonly known physico-chemical properties of amino acids that affect protein folding:
1. Volume/size:
   - Volume: the sum of the van der Waals volumes of the atoms.
   - Partial volume: the increase in water volume when the amino acid is dissolved in water.
   - Bulkiness: the ratio of the side-chain volume to its length, i.e. the cross-sectional area.

Physico-chemical properties and amino acids
2. Polarity index: the electrostatic force that the amino acid exerts on its surroundings at a distance of 10 Å, which combines the force due to its electrical charge and the force due to the dipole moment of the polarized amino acid.
3. Isoelectric point (pI): the pH of the solution at which the net charge on the amino acid is zero.
[Figure: protonation state of an amino acid at pH < pI, pH = pI, and pH > pI; a positive test charge placed at a distance from the amino acid.]
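As a small worked example of the isoelectric point: for an amino acid whose side chain does not ionize, the pI is commonly taken as the mean of the two backbone pKa values. The sketch below assumes this simplification and uses commonly tabulated pKa values for glycine (about 2.34 and 9.60); the function names are illustrative and not part of the lecture.

```python
def isoelectric_point(pka_carboxyl: float, pka_amino: float) -> float:
    """pI of an amino acid with a non-ionizable side chain (assumed simplification):
    the mean of the alpha-carboxyl and alpha-amino pKa values."""
    return (pka_carboxyl + pka_amino) / 2.0


def net_charge_sign(ph: float, pi: float) -> int:
    """Sign of the net charge: +1 below the pI, 0 at the pI, -1 above it."""
    if ph < pi:
        return 1
    if ph > pi:
        return -1
    return 0


# Glycine, using commonly tabulated pKa values (assumed here).
pi_gly = isoelectric_point(2.34, 9.60)
print(f"pI(Gly) = {pi_gly:.2f}")
print("net charge sign at pH 2:", net_charge_sign(2.0, pi_gly))
print("net charge sign at pH 11:", net_charge_sign(11.0, pi_gly))
```

Amino acids with ionizable side chains need the two pKa values that bracket the neutral form, so this shortcut does not apply to them directly.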

Physico-chemical properties and amino acids
4. Hydrophobicity: a measure of the solubility of an AA in water.
   - Hydrophobic: difficult to dissolve in water.
   - Hydrophilic: easy to dissolve in water.
   Hydrophobicity scales:
   - Kyte and Doolittle scale: based on the free energy cost of moving an AA from the inside of a protein to its surface.
   - Engelman, Steitz and Goldman scale: based on the free energy cost of moving an AA from a lipid bilayer membrane to water.
   Hydrophobic AAs have positive hydrophobicity values, and hydrophilic AAs have negative hydrophobicity values.
5. Water-accessible area: related to the portion of the side chain that is buried in a folded protein.
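The sign convention for hydrophobicity values can be illustrated with the Kyte and Doolittle scale. The sketch below uses the hydropathy values as commonly tabulated for that scale (one-letter amino acid codes) and simply splits the 20 AAs by sign; the variable names are illustrative only.

```python
# Kyte-Doolittle hydropathy values as commonly tabulated (one-letter codes).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Positive values: hydrophobic; negative values: hydrophilic.
hydrophobic = sorted(aa for aa, h in KYTE_DOOLITTLE.items() if h > 0)
hydrophilic = sorted(aa for aa, h in KYTE_DOOLITTLE.items() if h < 0)

print("hydrophobic:", hydrophobic)
print("hydrophilic:", hydrophilic)
```

Other scales rank some residues differently, which is one reason to analyze several properties together instead of relying on a single scale.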

Physico-chemical properties and amino acids
- How can we visualize and quantitatively analyze the similarities and differences among AAs based on these, and even more, physico-chemical properties?

Analysis of amino acids based on their physico-chemical properties
- A simple visualization of the amino acids using molecular volume and isoelectric point (pI).
- Such a result is not satisfactory, and more sophisticated methods are needed.

Principal component analysis (PCA)
- Simply speaking, PCA rescales and transforms the data based on the relationships among the properties of the data, so that the data points can be separated using a few newly computed properties (principal components).
[Figure: a data table with rows d_1, d_2, ..., d_11 and property columns p_1, p_2, p_3 is rescaled, then transformed into component columns c_1, c_2, c_3, then reduced to c_1, c_2 (dimension reduction) and plotted on axes y_1, y_2.]

Principal component analysis (PCA)
- A general dataset can be represented as an N (rows) x P (columns) matrix X, with one row per object d_1, d_2, ..., d_i, ..., d_N and one column per property p_1, p_2, ..., p_j, ..., p_P:
  $X = (x_{ij}), \quad i = 1, \dots, N, \quad j = 1, \dots, P$
  where N is the number of objects/data points (e.g. 20 amino acids), and P is the number of properties that each object has (e.g. 8 properties of AAs).

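Principal component analysis (PCA)
- Rescale the data by normalization, where $x_{ij}$ is the entry of X in row i and column j:
1. Compute the mean of each column j: $\bar{x}_j = \frac{1}{N} \sum_{i=1}^{N} x_{ij}$
2. Compute the standard deviation $s_j$ of each column j.
3. Normalization: $z_{ij} = (x_{ij} - \bar{x}_j) / s_j$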
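The three normalization steps can be sketched with NumPy as follows. This is a minimal sketch under two assumptions: the data are synthetic (random numbers standing in for an N x P table of properties), and the standard deviation uses the sample form with an N - 1 denominator (`ddof=1`). A population standard deviation would work equally well as long as the same convention is used later for the correlation matrix.

```python
import numpy as np


def normalize_columns(X: np.ndarray) -> np.ndarray:
    """Return Z, the column-wise z-scores of X."""
    mean = X.mean(axis=0)          # step 1: mean of each column j
    std = X.std(axis=0, ddof=1)    # step 2: sample standard deviation (assumed N - 1)
    return (X - mean) / std        # step 3: normalization


# Synthetic stand-in for a 20 x 3 property table (NOT real amino acid data).
rng = np.random.default_rng(0)
X = rng.normal(loc=[100.0, 6.0, 0.0], scale=[30.0, 2.0, 3.0], size=(20, 3))
Z = normalize_columns(X)

print("column means of Z:", Z.mean(axis=0).round(12))
print("column std devs of Z:", Z.std(axis=0, ddof=1).round(12))
```

After normalization every column has mean 0 and standard deviation 1, so properties measured in very different units contribute on the same scale.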

Principal component analysis (PCA)
- Rescale the data (matrix X) by normalization, generating matrix Z.
[Tables: matrix X and the normalized matrix Z.]

Principal component analysis (PCA)
- Transform the matrix Z into a new matrix Y by multiplying Z by a special matrix V: $Y_{N \times P} = Z_{N \times P} V_{P \times P}$
- V is computed from the relationships among the properties of the data, and has the following properties:
1. Any two distinct columns j and k of V are orthogonal: $\sum_{i=1}^{P} v_{ij} v_{ik} = 0$ for $j \neq k$
2. Each column of V is a unit vector: $\sum_{i=1}^{P} v_{ij}^2 = 1$
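The two properties of V amount to saying that V is an orthogonal matrix, i.e. that V^T V is the identity. The sketch below checks this numerically on synthetic data. It assumes V is obtained as the eigenvectors of the correlation matrix (as the later slides describe) and uses `numpy.linalg.eigh`, which is one standard way to compute them.

```python
import numpy as np

# Synthetic stand-in for an N x P data table (NOT real amino acid data).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))

# Normalize columns (sample standard deviation assumed).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# V: eigenvectors of the P x P correlation matrix, one per column.
C = np.corrcoef(Z, rowvar=False)
eigenvalues, V = np.linalg.eigh(C)

# Transform the normalized data: Y (N x P) = Z (N x P) V (P x P).
Y = Z @ V

# Orthogonal columns of unit length  <=>  V^T V = I.
gram = V.T @ V
print(np.round(gram, 10))
```

The off-diagonal entries of `gram` are the dot products between distinct columns (property 1) and the diagonal entries are the squared column lengths (property 2).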

Principal component analysis (PCA)
- Each column of Y is called a principal component of the data.
- If we rank the principal components by their variance, a few columns are predominant relative to the others, and they can be used to visualize the data in a reduced-dimensional space.
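Ranking the components by variance and keeping the first two can be sketched as below. The data are synthetic and deliberately built from two hidden factors plus a little noise, so that a few components should dominate. V is again assumed to come from an eigen-decomposition of the correlation matrix.

```python
import numpy as np

# Synthetic N x P table driven by 2 hidden factors (NOT real amino acid data).
rng = np.random.default_rng(2)
latent = rng.normal(size=(20, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(20, 5))

# Normalize, compute V, and transform.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, V = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
Y = Z @ V

# Rank the columns of Y (the principal components) by decreasing variance.
variances = Y.var(axis=0, ddof=1)
order = np.argsort(variances)[::-1]
Y_ranked = Y[:, order]
fractions = variances[order] / variances.sum()

# Keep the two leading components for a 2D plot.
Y_2d = Y_ranked[:, :2]

print("variance fraction per component:", fractions.round(3))
print("reduced data shape:", Y_2d.shape)
```

`Y_2d` holds the coordinates one would plot to view the data in the reduced space, and `fractions` shows how much of the total variance each component carries.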

Principal component analysis (PCA)
- Most of the variation in the data can be visualized in the reduced space.
[Figure: plot of the 20 AAs on the first two components of the PCA, i.e. using the two vectors in V that give the largest variance in the corresponding two columns of Y.]
[Figure: plot on molecular volume and isoelectric point (pI).]

Principal component analysis (PCA)
- To compute the matrix V, we first compute C, the P x P matrix of correlation coefficients between each pair of columns of X (or Z), where $c_{ij}$ is the correlation coefficient between columns i and j.
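A minimal sketch of the correlation matrix computation, on synthetic data. It assumes the columns were normalized with the sample standard deviation (N - 1 denominator), in which case C can be written as Z^T Z / (N - 1). The result is compared with `numpy.corrcoef`, which computes the same Pearson correlation coefficients directly from X.

```python
import numpy as np

# Synthetic stand-in for an N x P data table (NOT real amino acid data).
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 4))
N, P = X.shape

# Normalize columns (sample standard deviation assumed).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# P x P correlation matrix from the normalized data.
C = Z.T @ Z / (N - 1)

# Reference: Pearson correlation between columns of X.
C_ref = np.corrcoef(X, rowvar=False)

print("matches numpy.corrcoef:", np.allclose(C, C_ref))
print("symmetric:", np.allclose(C, C.T))
print("unit diagonal:", np.allclose(np.diag(C), 1.0))
```

Because correlation is unaffected by shifting and rescaling each column, computing C from X or from Z gives the same matrix.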

Principal component analysis (PCA)
- The correlation coefficient matrix is symmetric, $c_{ij} = c_{ji}$, and $c_{ii} = 1$.
- The correlation coefficient matrix of the 20 AAs based on the 8 properties:
[Table: 8 x 8 correlation coefficient matrix.]

Principal component analysis (PCA)
- It can be shown that the columns of V are eigenvectors of C, i.e., $C v_n = \lambda_n v_n$, where $v_n$ is the n-th column of V and $\lambda_n$ is the corresponding eigenvalue.
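The eigenvector relation can be checked numerically. The sketch below, on synthetic data, uses `numpy.linalg.eigh` (suited to symmetric matrices such as C) and sorts the eigenpairs by decreasing eigenvalue. It also checks two consequences: the eigenvalues sum to P (the trace of C, since every diagonal entry is 1), and each eigenvalue equals the variance of the corresponding column of Y = Z V, assuming the N - 1 convention throughout.

```python
import numpy as np

# Synthetic stand-in for an N x P data table (NOT real amino acid data).
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 4))
P = X.shape[1]

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
C = np.corrcoef(Z, rowvar=False)

# Eigen-decomposition of the symmetric matrix C; eigh returns ascending eigenvalues.
eigenvalues, V = np.linalg.eigh(C)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, V = eigenvalues[order], V[:, order]

# C v_n = lambda_n v_n for every column n (broadcasting scales column n by lambda_n).
residual = C @ V - V * eigenvalues

# Variance of each principal component (column of Y).
Y = Z @ V
component_variances = Y.var(axis=0, ddof=1)

print("max |C V - V diag(lambda)|:", np.abs(residual).max())
print("sum of eigenvalues:", eigenvalues.sum())
print("eigenvalues:", eigenvalues.round(4))
print("component variances:", component_variances.round(4))
```

With the eigenpairs sorted this way, the first columns of Y are the components with the largest variance.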

Principal component analysis
- It can be shown that the variance of the n-th component (the n-th column of the matrix Y) is equal to the eigenvalue of the n-th eigenvector: $\mathrm{Var}(y_n) = \lambda_n$
- The total variance of the data is P, so the fraction of the variance carried by the first few (e.g. 2) components, e.g. $(\lambda_1 + \lambda_2) / P$, measures how well these principal components represent the dataset.
- In our PCA analysis of the 20 AAs dataset,