Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families Jaap Heringa Integrative Bioinformatics.

Bioinformatics master course DNA/Protein structure-function analysis and prediction Lecture 5: Protein Fold Families Jaap Heringa Integrative Bioinformatics Institute VU (IBIVU) Faculty of Sciences / Faculty Earth and Life Sciences Vrije Universiteit Amsterdam

Protein structure evolution Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Protein structure evolution Insertion/deletion of structural domains can ‘easily’ be done at loop sites

Folds: how many? Chothia (1992): approx. 1,000 folds. Estimates vary from 1,000 to 10,000. With 30,000 human genes, ≥3 genes per fold on average. Fold classification: four broad structural protein fold classes: all-α, all-β, α/β, α+β.
Chothia, C., Proteins. One thousand families for the molecular biologist. Nature, 1992, (6379).
Zhang, C. and C. DeLisi, Estimating the number of protein folds. J Mol Biol, (5).

The first protein structure in 1960: Myoglobin - α fold

Tropomyosin: coiled-coil domains. This long protein is involved in muscle contraction.

5(βα) fold Flavodoxin fold Flavodoxin family - TOPS diagrams (Flores et al., 1994) α/β fold

Greek key β-strand motif

Plait motif; Alpha-beta barrel

αβα 3-layer motifs (two layers of helices with a β-sheet in between) are often specified as x-y-z, where x is the number of helices in the first helical layer, y is the number of strands in the β-sheet, and z is the number of helices in the second helical layer
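The x-y-z notation can be read mechanically. The sketch below is only an illustration: the class name, the function name and the label "2-5-3" are hypothetical and are not taken from the lecture.

```python
from typing import NamedTuple


class ThreeLayerMotif(NamedTuple):
    """Counts for a helix layer / beta-sheet / helix layer arrangement."""
    helices_first_layer: int    # x
    strands_in_sheet: int       # y
    helices_second_layer: int   # z


def parse_three_layer_label(label: str) -> ThreeLayerMotif:
    """Split a label of the form 'x-y-z' into its three integer counts."""
    parts = label.split("-")
    if len(parts) != 3:
        raise ValueError(f"expected a label of the form x-y-z, got {label!r}")
    x, y, z = (int(part) for part in parts)
    return ThreeLayerMotif(x, y, z)


# Hypothetical label, chosen only to show the parsing.
motif = parse_three_layer_label("2-5-3")
print(motif)
```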

For  proteins, there are no good classification systems. You can only count…

How many folds – Chothia 1992 The first explicit estimate of the number of protein families was made by Chothia in 1992. At that time about 120 structural families were known. Chothia summarized the results of several genome projects and found that the chance that a random protein belongs to one of the known sequence families is approximately 1/3. According to a sequence comparison of the PDB with sequence databases (Sander, Schneider 1991), about 1/4 of all sequences were similar to one of the PDB entries at the 25% identity level. Assuming an equal distribution of proteins among the families, Chothia concluded that the total number of protein structural families should be 120*3*4 = 1440.
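The arithmetic behind this estimate can be written out as a small sketch. The variable names are assumptions made for illustration; the three numbers are the ones quoted on the slide, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

known_structural_families = 120
# Chance that a random protein belongs to a known sequence family.
fraction_in_known_sequence_families = Fraction(1, 3)
# Share of sequences similar to a PDB entry at the 25% identity level.
fraction_similar_to_pdb = Fraction(1, 4)

# Scale the known families up by the inverse of each coverage fraction.
estimated_total_families = (
    known_structural_families
    / fraction_in_known_sequence_families
    / fraction_similar_to_pdb
)
print(estimated_total_families)
```

The scaling only holds under the slide's assumption that proteins are distributed equally among the families.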

How many folds – Alexandrov & Go, 1994, updated The Pfam-2.1 database consists of 101,724 domains of proteins from SwissProt (Bairoch & Apweiler, 1996) release 34, clustered in 13,816 families. There were also 7,694 proteins of 30 or more amino acids in SwissProt-34 that are not present in Pfam and are not similar to other proteins. We added them to the database, which then contained 109,418 domains in 21,510 families. We eliminated very similar sequences to make the database more homogeneous. In the final classification there were 60,601 domains distributed among 21,510 families. All families were ranked by the number of domains in each family. The resulting distribution fits Zipf’s law well.

How many folds Distribution of protein sequences among protein families. The distribution is far from uniform. Its shape is described very well by Zipf’s law, $n(r) = a\,r^{-b}$, with a = 640 and b = 0.64. The correlation coefficient of this approximation equals … Here r is the rank of the family, n(r) is the number of proteins in the r-th family, a is a scaling constant that depends on the number of proteins in the dataset, and b is the exponent. The constant b does not depend on the size of the dataset.
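As a rough illustration of the power law, the sketch below evaluates n(r) for the quoted constants and recovers the exponent from two points on the curve. The function names are assumptions made for this example, not part of the original analysis.

```python
import math


def zipf_family_size(rank: int, a: float = 640.0, b: float = 0.64) -> float:
    """Expected number of proteins in the family of the given rank: a * rank**(-b)."""
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return a * rank ** (-b)


def exponent_from_two_ranks(r1: int, r2: int, a: float = 640.0, b: float = 0.64) -> float:
    """Recover b as the negative slope of log n(r) against log r."""
    n1 = zipf_family_size(r1, a, b)
    n2 = zipf_family_size(r2, a, b)
    return -(math.log(n2) - math.log(n1)) / (math.log(r2) - math.log(r1))


for rank in (1, 10, 100, 1000):
    print(rank, round(zipf_family_size(rank), 1))
```

On a log-log plot the law is a straight line with slope -b, which is why the exponent can be read off from any two ranks.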

Fold number according to Alexandrov & Go 60,000 protein sequence families in 14,000 different folds

Fold number according to Alexandrov & Go An important feature of the Zipf distribution is its very long tail of clusters with only a few members each. For example, if b=0.7, half of all proteins are located in 10% of all clusters.
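The "half of all proteins in 10% of clusters" figure can be checked numerically. The sketch below is a minimal check and assumes cluster sizes proportional to r^(-b). The cluster count of 21,510 is borrowed from the family count quoted earlier in the lecture, and the function names are hypothetical.

```python
def share_in_top_clusters(n_clusters: int, b: float, top_fraction: float = 0.1) -> float:
    """Fraction of all proteins held by the largest `top_fraction` of clusters."""
    weights = [rank ** (-b) for rank in range(1, n_clusters + 1)]
    n_top = max(1, int(n_clusters * top_fraction))
    return sum(weights[:n_top]) / sum(weights)


def continuous_share(b: float, top_fraction: float = 0.1) -> float:
    """Large-N approximation for b < 1: the share tends to top_fraction**(1 - b)."""
    return top_fraction ** (1.0 - b)


discrete = share_in_top_clusters(21510, b=0.7)
approx = continuous_share(b=0.7)
print(round(discrete, 2), round(approx, 2))
```

Both values come out close to one half, in line with the statement on the slide.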

General fold classification systems The definitions of four broad structural classes, all-α, all-β, α/β, and α+β, based on secondary structure compositions and β-sheet topologies [Levitt & Chothia, 1976] represented the first step towards a global characterization of the protein fold space. These definitions have been generally accepted and are used by many classification systems to organize the fold hierarchy [Murzin et al., 1995; Orengo et al., 1997]. However, methods are needed that represent the full range of structural relationships among folds, for a better understanding of the organizing principles and features of the protein fold space. Fold family trees such as those built by Efimov [1997], Zhang and Kim [2000] and Taylor [2002] are very informative, but constructing such trees involves extensive manual operations and, sometimes, considerable human judgment. An alternative approach is to apply a uniform measure of structural similarity across all fold types and map the structural relationships into a low-dimensional space. Two such maps have been introduced: one is represented in the CATH database by Orengo and colleagues [1997], the other in the DALI database by Holm and Sander [1993]. Although the two maps are based on different structural alignment algorithms and multivariate analysis methods, they give similar two-dimensional projections featuring three large clusters corresponding to α, β, and α/β folds, respectively.

General fold classification system references
Levitt, M. and C. Chothia, Structural patterns in globular proteins. Nature, 1976, (5561).
Murzin, A.G., et al., SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol, 1995, (4).
Orengo, C.A., et al., CATH--a hierarchic classification of protein domain structures. Structure, 1997, (8).
Taylor, W.R., A 'periodic table' for protein structures. Nature, 2002, (6881).
Orengo, C.A., et al., Identification and classification of protein fold families. Protein Eng, (5).
Efimov, A.V., Structural trees for protein superfamilies. Proteins, 1997, (2).
Zhang, C. and S.H. Kim, A comprehensive analysis of the Greek key motifs in protein beta-barrels and beta-sandwiches. Proteins, 2000, (3).
Holm, L. and C. Sander, Protein structure comparison by alignment of distance matrices. J Mol Biol, 1993, (1).

Fold distribution Metric matrix distance geometry method applied to all pair-wise “distances” (structural dissimilarities) to assign three- dimensional coordinates to a set of 498 SCOP folds such that the relative distance between two folds is inversely correlated with the DALI alignment score. The results of the mapping are shown in the figure on the left.

The first 20 eigenvalues of the metric matrix calculated from the 498×498 DALI structural alignment scores.

Comparing the fold usages between two species in the eubacterial domain (Chlamydia versus Aquifex, A) and between those of two different domains (Chlamydia of bacteria versus Halobacterium of archaea, B). The usages of the 498 folds by the second organism are subtracted from the fold usages by the first organism. A contour surface (mesh) is then constructed and set at the values of 0.4% for blue and –0.4% for red. Regions within the blue contour include folds that appear more frequently in the first organism, whereas regions within the red contour include folds that occur more frequently in the second organism.
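The subtraction step can be sketched as follows. This is a simplification: the figure draws contour surfaces over the mapped fold space, whereas the sketch only thresholds the per-fold difference at ±0.4%. The function names, fold labels and counts are made up for illustration.

```python
def usage_fractions(counts):
    """Convert per-fold domain counts into fractions of the organism's total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("organism has no fold assignments")
    return {fold: n / total for fold, n in counts.items()}


def compare_fold_usage(first, second, threshold=0.004):
    """Subtract the second organism's fold usage from the first's.

    Returns the per-fold differences plus the folds above +threshold
    (more frequent in the first organism) and below -threshold
    (more frequent in the second organism).
    """
    f1, f2 = usage_fractions(first), usage_fractions(second)
    folds = sorted(set(f1) | set(f2))
    diff = {fold: f1.get(fold, 0.0) - f2.get(fold, 0.0) for fold in folds}
    more_in_first = [fold for fold in folds if diff[fold] > threshold]
    more_in_second = [fold for fold in folds if diff[fold] < -threshold]
    return diff, more_in_first, more_in_second


# Made-up counts for two hypothetical organisms.
organism_a = {"fold_1": 50, "fold_2": 30, "fold_3": 20}
organism_b = {"fold_1": 30, "fold_2": 30, "fold_3": 40}
diff, more_in_a, more_in_b = compare_fold_usage(organism_a, organism_b)
print(more_in_a, more_in_b)
```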

CATH database

Domain size The size of individual structural domains varies widely, from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998). The majority (90%) have fewer than 200 residues (Siddiqui and Barton, 1995), with an average of about 100 residues (Islam et al., 1995). Small domains (fewer than 40 residues) are often stabilised by metal ions or disulphide bonds. Large domains (more than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).