A Statistical Geometry Approach to the Study of Protein Structure
Majid Masso
Bioinformatics and Computational Biology, George Mason University
Protein Basics
- formed by linearly linking amino acid residues (amino acids are the building blocks of proteins)
- 20 distinct amino acid types: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y
Protein Basics
- genes: code, or “blueprint”
- proteins: product, or “building”
- protein structure gives rise to function
- why do “things go wrong”?
  - mistakes in the “blueprint”
  - incorrectly built, or nonexistent, “buildings”
- Protein Data Bank (PDB): repository of protein structural data, including 3D coordinates of all atoms
PDB ID: 1REZ
Structure reference: Muraki M., Harata K., Sugita N., Sato K., Origin of carbohydrate recognition specificity of human lysozyme revealed by affinity labeling, Biochemistry 35 (1996)
Computational Geometry Approach to Protein Structure Prediction
Tessellation
- protein structure represented as a set of points in 3D, using Cα coordinates
- Voronoi tessellation: convex polyhedra, each containing one Cα; all interior points are closer to this Cα than to any other
- Delaunay tessellation: connect the four Cα atoms whose Voronoi polyhedra meet at a common vertex
- vertices of Delaunay simplices objectively define sets of four nearest-neighbor residues (quadruplets)
- 5 classes of Delaunay simplices
- Quickhull algorithm (qhull program), Barber et al., UMN Geometry Center
Figure: Voronoi/Delaunay tessellation in 2D space. Voronoi tessellation: dashed line; Delaunay tessellation: solid line. (Adapted from Singh R.K., et al. J. Comput. Biol., 1996, 3)
Figure: Five classes of Delaunay simplices. (Adapted from Singh R.K., et al. J. Comput. Biol., 1996, 3)
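The slide names five classes of Delaunay simplices without listing them. The sketch below assumes the classification by sequence contiguity described in the cited Singh et al. paper, where a simplex is labeled by the lengths of its runs of consecutive residue numbers: 1-1-1-1, 2-1-1, 2-2, 3-1, or 4. The function name `simplex_class` is hypothetical and this is only an illustration, not the code used for the talk.

```python
def simplex_class(residue_numbers):
    """Label a 4-residue simplex by its runs of consecutive residue numbers.

    Assumption: the five classes are the contiguity patterns
    '1-1-1-1', '2-1-1', '2-2', '3-1', and '4'.
    """
    nums = sorted(residue_numbers)
    if len(nums) != 4 or len(set(nums)) != 4:
        raise ValueError("a simplex needs four distinct residue numbers")

    # Walk along the sorted numbers and measure each run of neighbors
    # that are adjacent in the primary sequence.
    runs = [1]
    for prev, cur in zip(nums, nums[1:]):
        if cur == prev + 1:
            runs[-1] += 1
        else:
            runs.append(1)

    # Report run lengths from longest to shortest, e.g. [1, 2, 1] -> '2-1-1'.
    return "-".join(str(r) for r in sorted(runs, reverse=True))
```

For example, `simplex_class([3, 4, 10, 20])` gives `'2-1-1'`: residues 3 and 4 are sequence neighbors, while 10 and 20 are isolated.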
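Counting Quadruplets
- assuming order independence among the residues comprising Delaunay simplices, the maximum number of distinct quadruplets that can form such simplices is 8855
- this is the number of ways to choose 4 residues from 20 types with repetition allowed: $\binom{20+4-1}{4} = \binom{23}{4} = 8855$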
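As a quick check of the count, the quadruplets can be enumerated directly. The sketch below treats a quadruplet as an unordered selection of four one-letter residue codes with repetition allowed, and also tallies the quadruplets by composition pattern (for example, `(2, 1, 1)` means one residue type appears twice and two others once). The variable names are illustrative only.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes

# Every unordered quadruplet of residue types, repetition allowed.
quadruplets = list(combinations_with_replacement(AMINO_ACIDS, 4))
n_enumerated = len(quadruplets)

# Closed form for multisets of size 4 drawn from 20 types.
n_closed_form = comb(20 + 4 - 1, 4)

# Tally by composition: sorted multiplicities of the residue types.
patterns = Counter(
    tuple(sorted(Counter(q).values(), reverse=True)) for q in quadruplets
)

print(n_enumerated, n_closed_form)
for pattern, count in sorted(patterns.items()):
    print(pattern, count)
```

The five composition patterns contribute 4845, 3420, 190, 380, and 20 quadruplets, which sum to 8855.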
Residue Environment Scores
- log-likelihood: $q_{ijkl} = \log\left(f_{ijkl} / p_{ijkl}\right)$
- $f_{ijkl}$ = normalized frequency of quadruplets containing residues i, j, k, l in a representative training set of high-resolution protein structures with low primary sequence identity, i.e., $f_{ijkl}$ = total number of quadruplets in the dataset containing only residues i, j, k, l, divided by the total number of observed quadruplets
- $p_{ijkl} = C\,a_i a_j a_k a_l$ = frequency of random occurrence of the quadruplet (multinomial)
- $a_i$ = total number of occurrences of residue i divided by the total number of residues in the dataset
- $C = 4! / \prod_{i=1}^{n} t_i!$, where n = number of distinct residue types in the quadruplet and $t_i$ is the number of residues of type i
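The log-likelihood score can be sketched in a few lines once the quadruplet and residue counts are available. The function names, the dictionary layout of the counts, and the choice of a base-10 logarithm are all assumptions made for illustration; the slide does not state the base. Quadruplets that were never observed are not handled here.

```python
from collections import Counter
from math import factorial, log10


def multinomial_coefficient(quad):
    """C = 4! / prod(t_i!), with t_i the multiplicity of each residue type."""
    c = factorial(4)
    for t in Counter(quad).values():
        c //= factorial(t)
    return c


def log_likelihood(quad, quad_counts, residue_counts):
    """Score one quadruplet as log10(observed frequency / expected frequency).

    quad_counts: dict mapping a sorted 4-tuple of residue codes to its count.
    residue_counts: dict mapping a residue code to its count in the dataset.
    Both layouts are hypothetical; base 10 is an assumption.
    """
    key = tuple(sorted(quad))  # order independence among the four residues
    observed = quad_counts[key] / sum(quad_counts.values())

    total_residues = sum(residue_counts.values())
    expected = multinomial_coefficient(key)
    for aa in key:
        expected *= residue_counts[aa] / total_residues

    return log10(observed / expected)
```

A positive score means the quadruplet occurs more often in the training set than expected by chance, and a negative score means it occurs less often.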
Residue Environment Scores
- total statistical potential (topological score) of a protein: sum the log-likelihoods of all quadruplets forming the Delaunay simplices
- individual residue potentials: sum the log-likelihoods of all quadruplets in which the residue participates (yields a 3D-1D potential profile)
Figure: HIV-1 Protease Monomer, 99 amino acids (total potential 27.93). PDB ID: 3phv
Structure reference: R. Lapatto, T. Blundell, A. Hemmings, et al., X-ray analysis of HIV-1 proteinase at 2.7 Å resolution confirms structural homology among retroviral enzymes, Nature 342 (1989)
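Both sums can be accumulated in a single pass over the simplices. The sketch below assumes the simplices are given as 4-tuples of 0-based sequence positions and that the log-likelihood scores are stored in a dictionary keyed by sorted residue-type tuples; the function name and both layouts are hypothetical.

```python
def potential_profile(sequence, simplices, scores):
    """Return (total potential, per-residue potential profile).

    sequence: string of one-letter residue codes.
    simplices: iterable of 4-tuples of 0-based positions (assumed layout).
    scores: dict mapping a sorted 4-tuple of residue codes to its
            log-likelihood (assumed layout).
    """
    profile = [0.0] * len(sequence)
    total = 0.0
    for simplex in simplices:
        key = tuple(sorted(sequence[pos] for pos in simplex))
        q = scores[key]
        total += q                # every simplex counts once toward the total
        for pos in simplex:
            profile[pos] += q     # and once for each of its four residues
    return total, profile
```

Because each simplex contributes its score to four residues, the profile entries sum to four times the total potential.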
HIV-1 Protease Comprehensive Mutational Profile (CMP)
- mutate the residue present at each of the 99 positions in the primary sequence to each of the 19 other amino acids
- compute the total potential and potential profile of each artificially created mutant protein
- create a 20×99 matrix containing the total potentials of all single-residue mutants
- columns are labeled with the residues in the primary sequence of the wild-type (WT) HIV-1 protease monomer; rows are labeled with the 20 naturally occurring amino acids
- subtract the WT total potential (TP) from each cell, then average the columns to get the CMP
- $\mathrm{CMP}_j = \operatorname{mean}_i\,[(\text{mutant TP})_{ij} - (\text{WT TP})] = \operatorname{mean}_i\,[(\text{mutant TP})_{ij}] - (\text{WT TP}), \quad j = 1, \ldots, 99$
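The CMP calculation can be sketched as a loop over positions. Two assumptions are made here and should be checked against the original work: the column average is taken over the 19 substitutions (excluding the wild-type residue itself), and the total potential of each mutant is supplied by a hypothetical callback `mutant_tp(position, residue)`, which stands in for whatever procedure rescores the mutant structure.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 one-letter codes


def comprehensive_mutational_profile(wt_sequence, wt_tp, mutant_tp):
    """Return one CMP score per position of the wild-type sequence.

    wt_tp: total potential of the wild-type protein.
    mutant_tp: hypothetical callback, mutant_tp(position, residue) -> total
               potential of the single-residue mutant at a 0-based position.
    Assumption: the mean runs over the 19 non-wild-type residues.
    """
    cmp_scores = []
    for j, wt_residue in enumerate(wt_sequence):
        # Change in total potential for each of the 19 substitutions.
        diffs = [
            mutant_tp(j, aa) - wt_tp
            for aa in AMINO_ACIDS
            if aa != wt_residue
        ]
        cmp_scores.append(sum(diffs) / len(diffs))
    return cmp_scores
```

A strongly negative CMP score at a position indicates that substitutions there tend to lower the total potential, on average, relative to the wild type.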
Structure-Function Correlations
- 536 single point missense mutations
  - 336 published mutants: Loeb D.D., Swanstrom R., Everitt L., Manchester M., Stamper S.E., Hutchison III C.A. Complete mutagenesis of the HIV-1 protease. Nature, 1989, 340
  - additional mutants provided by R. Swanstrom (UNC)
- each mutant placed in one of 3 phenotypic categories (positive, negative, or intermediate) based on activity
- mutant activity compared with the change in sequence-structure compatibility elucidated by potential data
Observations
- mutants with unaffected protease activity exhibit a minimal (negative) change in potential
- mutants that inactivate the protease exhibit a large negative change in potential, weighted heavily by NC
- mutants with intermediate phenotypes exhibit a moderate negative change in potential (similar among C and NC); the intermediate phenotype spans a wide range in the experiments
Acknowledgements
- Iosif Vaisman (Ph.D. advisor; first to apply Delaunay tessellation to protein structure)
- Zhibin Lu (Java programs for calculating statistical potentials from tessellations)
- Ronald Swanstrom (experimental HIV-1 protease mutants and activity measurements)