NIGMS Protein Structure Initiative: Target Selection Workshop
ADDA and remote homologue detection
Liisa Holm, Institute of Biotechnology, University of Helsinki



Definitions
Nrdbxx = nrdb where no two sequences are more than xx % identical; redundant sequences are mapped to a representative
  – Uniprot + Genpept + PIR + PDB + …
  – Nrdb100 – Nrdb90 – … – Nrdb40 – Nrdb30 = “modelling family”
PairsDB = database of all-against-all comparisons
  – Blast in nrdb90, PSI-Blast in nrdb40
BIG = family detected by profile comparison
  – Profile needs seed set (alignment); automatic iterative profile construction has poor convergence
  – Profiles → partially overlapping neighbour sets → need to cluster sequences → clustering artefacts when true cluster shape is non-spherical
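The Nrdbxx definition above can be illustrated with a minimal sketch of greedy redundancy reduction. This is not the procedure used to build the actual nrdb sets: the function names `percent_identity` and `reduce_redundancy`, the toy sequences, and the ungapped identity measure are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def percent_identity(a: str, b: str) -> float:
    """Toy identity measure (assumption): matching positions over the shorter
    length, comparing the sequences ungapped from position 0. A real pipeline
    would take identities from an alignment program."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def reduce_redundancy(
    seqs: Dict[str, str],
    cutoff: float,
    identity: Callable[[str, str], float] = percent_identity,
) -> Dict[str, str]:
    """Map every sequence id to a representative id.

    Longer sequences are visited first. A sequence becomes a new
    representative unless it is more than `cutoff` percent identical to an
    existing representative, so no two representatives exceed the cutoff.
    """
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    for sid in sorted(seqs, key=lambda s: (-len(seqs[s]), s)):
        for rep in representatives:
            if identity(seqs[sid], seqs[rep]) > cutoff:
                mapping[sid] = rep  # redundant: point to its representative
                break
        else:
            representatives.append(sid)
            mapping[sid] = sid
    return mapping


toy = {
    "a": "MKTAYIAKQR",
    "b": "MKTAYIAKQK",  # 9 of 10 positions match "a"
    "c": "GSHMLEDPVQ",  # no positions match "a" or "b"
}
nrdb90 = reduce_redundancy(toy, cutoff=90.0)
nrdb80 = reduce_redundancy(toy, cutoff=80.0)
```

With these toy sequences, "b" stays its own representative at the 90 % level and is mapped to "a" at the 80 % level, which is the nesting Nrdb100 – Nrdb90 – … – Nrdb30 expressed on a small scale.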

(graph) covering ≠ clustering ≠ classification
Incomplete detection of homologous set by profile models
Example: Urease et al. superfamily
[Figure: two panels, IDEAL and REAL]

ADDA: clustering of domains into families
ADDA = Automatic Domain Definition Algorithm
  – Heger & Holm (2003) J Mol Biol 328
  – Heger et al. (2005) Nucl. Acids Res. 33, Database Issue, D188-D191
Principles of ADDA
  – Blast all-against-all comparison in nrdb90
  – Domains are optimally covered by alignments
    Complete domain coverage; every residue belongs to a domain
  – Minimum/maximum spanning tree of domains
  – Remove links where profile-profile score is below threshold
  – Connected components are domain families
Quality assessment
  – Most ADDA families are pure, containing one PFAM family or SCOP superfamily (plus previously unclassified members)
  – Occasionally members from different PFAM families are merged in one ADDA family (contamination or PFAM misclassification)
  – Domain size distribution is reasonable
    For example, much less over-fragmentation than by the Prodom algorithm
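The last three principles (spanning tree, link removal, connected components) can be sketched as follows. This is an illustration under stated assumptions, not the ADDA implementation: the names `maximum_spanning_tree`, `domain_families` and `link_score`, the toy scores, and the choice of a maximum spanning tree over similarity scores are all hypothetical. With distances instead of similarities the same idea would use a minimum spanning tree.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Edge = Tuple[Hashable, Hashable, float]  # (domain, domain, alignment score)


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, x: Hashable) -> Hashable:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a: Hashable, b: Hashable) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already connected: adding the edge would close a cycle
        self.parent[ra] = rb
        return True


def maximum_spanning_tree(edges: Iterable[Edge]) -> List[Edge]:
    """Kruskal's algorithm, taking the highest-scoring edges first."""
    forest = DisjointSet()
    return [e for e in sorted(edges, key=lambda e: -e[2]) if forest.union(e[0], e[1])]


def domain_families(
    domains: Iterable[str],
    edges: Iterable[Edge],
    link_score: Callable[[str, str], float],
    threshold: float,
) -> List[List[str]]:
    """Keep tree links whose (hypothetical) profile-profile score reaches the
    threshold, then report connected components as families."""
    domains = list(domains)
    families = DisjointSet()
    for u, v, _ in maximum_spanning_tree(edges):
        if link_score(u, v) >= threshold:
            families.union(u, v)
    groups: Dict[Hashable, List[str]] = {}
    for d in domains:
        groups.setdefault(families.find(d), []).append(d)
    return sorted(sorted(g) for g in groups.values())


# Toy input: alignment scores build the tree, a separate score prunes it.
edges = [("d1", "d2", 90), ("d2", "d3", 80), ("d1", "d3", 70),
         ("d3", "d4", 40), ("d4", "d5", 85)]
profile = {frozenset(("d1", "d2")): 0.9, frozenset(("d2", "d3")): 0.8,
           frozenset(("d3", "d4")): 0.1, frozenset(("d4", "d5")): 0.95}
families = domain_families(
    ["d1", "d2", "d3", "d4", "d5"], edges,
    link_score=lambda u, v: profile[frozenset((u, v))], threshold=0.5)
```

Restricting the expensive profile-profile check to tree links keeps the number of comparisons linear in the number of domains, which is presumably why a tree is built before any links are removed.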

ADDA purity and domain size
[Figure panels: PFAM, SCOP]
Accuracy of domain boundaries
  – Red: best possible in domain tree
  – Black: actually selected

3D coverage of model proteomes
PDB entries from May 2006
  – Required greater than 80 % overlap between PDB sequence and ADDA domain to call family structurally covered
ADDA domain families
  – BIG families
    … families have more than ten members in nrdb100
      – 2383 structurally covered BIG families
    8820 families have more than ten members in nrdb40
      – 1869 structurally covered BIG families
NCBI genome sets
  – H sapiens, C elegans, D melanogaster, A thaliana, E coli, B anthracis, T maritima
  – Mapped to ADDA families
    6770 BIG(nrdb40) families occur in model genome set
      – 1705 structurally covered
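The 80 % overlap criterion can be written down as a small interval check. The slide does not say which length the overlap is measured against; the sketch below assumes it is the ADDA domain length, and the names `overlap_fraction` and `structurally_covered` are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # inclusive residue range (start, end), 1-based


def overlap_fraction(domain: Interval, pdb_region: Interval) -> float:
    """Fraction of the domain's residues that fall inside the PDB-matched region.

    Assumption: the overlap is normalised by the domain length.
    """
    d_start, d_end = domain
    p_start, p_end = pdb_region
    shared = min(d_end, p_end) - max(d_start, p_start) + 1
    return max(shared, 0) / (d_end - d_start + 1)


def structurally_covered(
    domain: Interval, pdb_regions: Iterable[Interval], min_overlap: float = 0.8
) -> bool:
    """True if any PDB-matched region overlaps the domain by more than min_overlap."""
    return any(overlap_fraction(domain, r) > min_overlap for r in pdb_regions)


# A 100-residue domain with two candidate PDB matches on the same sequence.
covered = structurally_covered((1, 100), [(51, 120), (11, 100)])
borderline = structurally_covered((1, 100), [(21, 100)])  # exactly 80 %, not greater
```

A family would then count as structurally covered when at least one of its member domains passes this check; that aggregation rule is also an assumption.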

Model genome coverage – BIG families in nrdb100
T. maritima would be covered by 1000 BIG families and is two thirds done

6770 BIG families in nrdb40
Multigene families in eukaryotes
… domains per euk. gene; 1.3 domains per prok. gene

Seven model genomes
[Figure: genome groups Human; Worm, fly, plant; Prokaryotes (E coli, B anthracis, T maritima)]
Human BIG target families are almost exclusively eukaryote-specific
Universal BIG families are almost covered
5065 white BIG target families
1705 structurally covered BIG families

Covering all modelling families will have astronomical cost
Nrdbxx updates; Nrdb30 = “modelling family”

Fine-grained coverage
MF (modelling family): Structural core shrinks rapidly below 30 % sequence identity
  – → Need less naïve modelling software capable of building those parts ab initio which are not covered by template
  – Misalignment is major source of error → transitive alignment covers more of the structurally equivalent core
Average coverage of structural core (152 pairs in 11 superfamilies):
  Transitive                    51 %
  Global alignment (HMMer)      43 %
  Local alignment (PSI-Blast)   34 %
Error        RMSD/Å
  Template     32
  Misaligned   16
  Loops        8
  Backbone     4
  Rotamers     2
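One reading of "transitive alignment" is that residue equivalences between two remote sequences are derived by chaining alignments through an intermediate sequence. The sketch below illustrates that reading only; it is not the method behind the 51 % figure, and the names `compose` and `core_coverage` as well as the toy residue mappings are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Alignment = Dict[int, int]  # residue index in one sequence -> index in the other


def compose(a_to_b: Alignment, b_to_c: Alignment) -> Alignment:
    """Chain A->B and B->C into A->C, keeping residues aligned in both steps."""
    return {i: b_to_c[j] for i, j in a_to_b.items() if j in b_to_c}


def core_coverage(alignment: Alignment, core: Iterable[Tuple[int, int]]) -> float:
    """Fraction of structurally equivalent residue pairs reproduced by the alignment."""
    core = list(core)
    hits = sum(1 for i, k in core if alignment.get(i) == k)
    return hits / len(core)


# Toy data: the direct A->C alignment recovers only part of the structural core.
core_ac = [(2, 1), (3, 2), (4, 3), (5, 4), (6, 5)]
direct_ac = {2: 1, 3: 2}
a_to_b = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
b_to_c = {2: 1, 3: 2, 4: 3, 5: 4}
transitive_ac = compose(a_to_b, b_to_c)
```

In this toy case the chained alignment reproduces four of the five core pairs while the direct one reproduces two, the same direction of effect as the coverage figures on the slide.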

Coarse-grained coverage
BIG/BIGGER: Homology detection
  – Difficulty of aligning remote homologues → shared sequence motifs suggest conserved biochemical mechanism → functional classification
  – Sequence comparison only detects half of remote homologue pairs → structure comparison reveals missing links
Transitive search for conserved motifs detects more remote homologues than profile-profile comparison

Clustering PFAM families
Comparison of ADDA to PFAM-A resulted in extension but no discovery of completely new large families
PFAM-A v.19: 7340 families, 2451 covered according to PFAM’s assignments, 1396 families in 205 clans
Our method achieved 30 % coverage of clan relationships at 5 % error rate, compared to 23 % coverage at 5 % error rate by profile-profile comparison
  – 1083 unclassified PFAMs linked to 205 known clans
    1219 white PFAMs linked to known structure in 155 clusters
  – 1256 PFAMs clustered in 470 predicted clans
    336 white PFAMs linked to known structure in 222 clusters
  – 3610 PFAMs remained singletons
    2352 white PFAMs
→ 2451 covered, ~1555 fold assignments, ~3334 targets
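The closing totals on this slide follow from the counts listed above it. The short check below makes the arithmetic explicit; the variable names are illustrative, and the numbers are taken directly from the slide.

```python
# Counts quoted on the slide (PFAM-A v.19).
total_families = 7340
structurally_covered = 2451

# "White" PFAMs newly linked to a known structure, from the two linked groups.
white_linked_via_known_clans = 1219
white_linked_via_predicted_clans = 336

# Fold assignments are the white families that gained a structural link.
fold_assignments = white_linked_via_known_clans + white_linked_via_predicted_clans

# Remaining targets are families with neither a structure nor a fold assignment.
targets = total_families - structurally_covered - fold_assignments
```

This gives 1555 fold assignments and 3334 remaining targets, matching the approximate figures in the last line of the slide.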

Conclusions
ADDA → ~3000 human target families
  – ~40 % reduction in number of PFAM target families by fold assignment (based on sequence only)
Coarse-grained coverage yields information out of reach of sequence comparison
  – Need to improve measures of sequence similarity to infer homology
    Sequence motif-based functional classification
  – Need to increase the radius of convergence in template-based structure prediction
Protein complexes → hypothesis-driven research
  – Large conformational changes
  – Multigene receptor-ligand pair discrimination involves rotations in docking orientation

Acknowledgements
Andreas Heger, Oxford University
Swapan Mallick, Ashwin Sivakumar, Chris Wilton, Institute of Biotechnology
Funding: Academy of Finland, Sigrid Juselius Foundation, EU