Comparing and Classifying Domain Structures

Comparing and Classifying Domain Structures
Methods for comparing protein structures
Protein structural classifications
How do structures and functions diverge in protein superfamilies?
What proportion of genome sequences can be predicted to belong to superfamilies of known structure?

Protein Domain Family Classifications
Known domain structures: Alexey Murzin, LMB, Cambridge
Predicted domain structures: Julian Gough, Bristol University
Known and predicted domain structures: Christine Orengo, UCL
Domain sequences: Alex Bateman, Sanger

Domains are important evolutionary units.
60-80% of genes in genomes code for multidomain proteins.

Evolution gives rise to families of proteins (homologues)
A domain superfamily contains both orthologues and paralogues.
[Figure: domain superfamily relatives from human, yeast, M. tuberculosis and Th. thermophilus]
Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Structural diversity in the CATH Domain Family
P-loop hydrolases
Cocaine esterase, acetylcholinesterase, cutinase
Structure is more highly conserved than sequence during evolution. At least 40-50% of the structure is conserved.

Challenges in comparing protein structures
Residue substitutions due to single base mutations.
Insertions or deletions (indels) of residues, usually not in the secondary structures but in the connecting loops.
Usually the structural cores are highly conserved.
Although structure is much more conserved than sequence, there can still be considerable structural differences between relatives outside the core.

Residue insertions usually occur in the loops connecting secondary structures.
Substitutions can cause shifts in the orientations of secondary structures.

Superposition of OB fold Structures

Related structures: RMSD usually < 3.5 Å
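
As a minimal sketch of the RMSD measure quoted above, the snippet below computes the root-mean-square deviation between two coordinate sets. It assumes the two structures have already been superposed and that equivalent residues are paired by list index; the function name `rmsd` and the toy coordinates are illustrative only and do not come from the slides.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    Assumes the structures are already superposed and that
    coords_a[i] is the residue equivalent to coords_b[i].
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates (hypothetical): every equivalent pair is exactly 1.0 apart.
a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(rmsd(a, b))  # 1.0
```

In practice the superposition itself has to be found first, so the value depends on which residues are treated as equivalent.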

Coping with Insertions and Deletions
Ignore the variable loop regions and only compare the secondary structures.
Use algorithms which can explicitly handle insertions/deletions, e.g. dynamic programming, simulated annealing.

Fast structure comparison by secondary structures

Graphs can be compared using the Bron-Kerbosch algorithm to find the largest common graph.
In this example the common graph contains 5 nodes.
[Figure: secondary structure graphs with nodes labelled E (strand) and H (helix)]
Generally ~1000 times faster than residue-based methods.
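
The Bron-Kerbosch algorithm mentioned above enumerates maximal cliques, and the largest clique in a compatibility graph corresponds to the largest common graph. The sketch below is the basic version without pivoting, run on a small hypothetical adjacency map chosen so that the largest clique has 5 nodes; it is not the graph from the slide, and building the compatibility graph from two secondary structure graphs is not shown.

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Collect every maximal clique of the graph described by adj.

    r: nodes in the clique being grown
    p: candidate nodes that could still extend r
    x: nodes already explored (prevents reporting a clique twice)
    """
    if not p and not x:
        cliques.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}

def largest_clique(adj):
    """Return one clique of maximum size."""
    cliques = []
    bron_kerbosch(set(), set(adj), set(), adj, cliques)
    return max(cliques, key=len)

# Hypothetical graph: nodes 0-4 are all mutually connected, node 5 only touches node 4.
adj = {
    0: {1, 2, 3, 4},
    1: {0, 2, 3, 4},
    2: {0, 1, 3, 4},
    3: {0, 1, 2, 4},
    4: {0, 1, 2, 3, 5},
    5: {4},
}
print(sorted(largest_clique(adj)))  # [0, 1, 2, 3, 4]
```

Because the nodes are secondary structure elements rather than residues, the graphs stay small, which is consistent with the speed advantage noted above.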

STRUCTAL
Align sequences.
Superpose structures.
Score distances between superposed residues in path matrix.
Use dynamic programming to find best path.
Use equivalences given by the best path to re-superpose the structures.
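
The scoring and dynamic programming steps can be sketched as follows. This is an illustration under stated assumptions, not the STRUCTAL implementation: the score function `20 / (1 + (d / 2.24)^2)` and the gap penalty of 10 are assumed values, the coordinates are hypothetical and treated as already superposed, and the re-superposition step is omitted.

```python
import math

def distance_score(d, scale=2.24, top=20.0):
    """Turn an inter-residue distance into a similarity score (assumed form)."""
    return top / (1.0 + (d / scale) ** 2)

def align_by_distance(coords_a, coords_b, gap=10.0):
    """Global dynamic programming over a distance-derived path matrix.

    Returns the best path score and the list of equivalent residue pairs.
    """
    n, m = len(coords_a), len(coords_b)
    score = [[distance_score(math.dist(a, b)) for b in coords_b] for a in coords_a]

    # best[i][j] = best score aligning the first i residues of A with the first j of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = -gap * i
    for j in range(1, m + 1):
        best[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],  # residues i and j equivalent
                best[i - 1][j] - gap,                      # residue i of A unmatched
                best[i][j - 1] - gap,                      # residue j of B unmatched
            )

    # Trace back along the best path to recover the equivalences.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Hypothetical example: B is A with one extra residue inserted at index 2.
coords_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
coords_b = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
total, pairs = align_by_distance(coords_a, coords_b)
print(pairs)  # [(0, 0), (1, 1), (2, 3), (3, 4)]
```

The returned pairs are the equivalences that would be fed back into the superposition step, and the cycle repeats until the alignment stops changing.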

Structure Comparison Algorithms

Algorithm    Authors              Structure classification

Secondary structure based:
SSM          Henrick              PDB
GRATH        Harrison & Orengo    CATH

Residue based:
SSAP         Taylor and Orengo    CATH
DALI         Holm and Sander      SCOP
Comparer     Sali and Blundell    HOMSTRAD
FatCat       Adam Godzik          PDB
Structal     Levitt               PDB

Structural Bioinformatics, Ed: Phil Bourne, Wiley 2003
Bioinformatics: Genes, Proteins and Computers, Bios, 2003

CATH domain structure database (Orengo & Thornton 1993)
C: Class
A: Architecture
T: Topology or Fold Group
H: Homologous Superfamily
~200,000 domains, 2600 domain superfamilies

CATH domain database: ~200,000 domains
Class: 3
Architecture: ~40
Topology or Fold: ~1200

CATH Architectures
Orthogonal bundle, up-down bundle, α-horseshoe, α-solenoid, αα-barrel, β-ribbon, β-sheet, β-roll, β-barrel

CATH Architectures
Clam, 2-layer β-sandwich, trefoil, orthogonal β-prism, parallel β-prism, 3-layer β-sandwich, β-solenoid, αβ-roll, β-propeller

CATH Architectures
αβ-barrel, 2-layer (αβ) sandwich, 3-layer (αβα) sandwich, 3-layer (ββα) sandwich, 3-layer (βαβ) sandwich, 4-layer (αββα) sandwich, αβ-prism, αβ-box, αβ-horseshoe

CATH
Topology or Fold Group: ~1200
Homologous Superfamily: ~2600
Sequence Family (30%)
[Figure labels: ~200,000 domain entries; 40,000 domain entries]

Divergent Evolution and Convergent Evolution
[Figure: example sequence fragments ..VILST…, ..KLST…, ...SLTRF... shown for both cases]

Homologous Structures

Heat labile enterotoxin compared with:
                     SSAP score    Sequence identity
cholera toxin        97            79%
pertussis toxin      81            12%

High structure similarity score; RMSD often < 4 Å.
May have detectable sequence similarity, e.g. by HMMs.
Related functions.
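
Sequence identity values like those quoted above are computed from an alignment. The sketch below uses one common convention, identical positions divided by the number of columns where both sequences have a residue; other conventions use a different denominator, so this choice is an assumption. The aligned strings are hypothetical and are not the toxin sequences.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for ra, rb in zip(aligned_a, aligned_b):
        if ra == gap or rb == gap:
            continue  # skip insertion/deletion columns
        compared += 1
        if ra == rb:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared

# Hypothetical aligned fragments: 6 compared columns, 5 of them identical.
print(round(percent_identity("VILSTRF", "KILS-RF"), 1))  # 83.3
```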

Evolutionary Ancestry Uncertain
Structural similarity
No sequence similarity
No functional similarity

How do proteins evolve new functions?

Evolution of Protein Functions in Domain Superfamilies
Domain duplication
Residue mutations and domain structure embellishments
Domain fusion, change in domain partner
Oligomerisation

Mutation of Residues
TIM barrel glycosyl hydrolases
Chitinase A: Glu acts as general acid.
Narbonin: Glu is incorporated in a salt-bridge, and this blocks substrate access.

Changes in domain function in paralogous relatives
Pantetheine-phosphate adenyltransferase, EC code: 2.7.7.3
Glycerol-3-phosphate cytidylyl transferase, EC code: 2.7.7.39
Changes in the domain structure can modify the binding site or domain surface.
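
The two EC codes above differ only in their last field. A small sketch, with a hypothetical helper name `shared_ec_levels`, counts how many leading levels two EC numbers have in common:

```python
def shared_ec_levels(ec_a, ec_b):
    """Count the leading EC levels (0 to 4) that two EC numbers share."""
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b:
            break
        shared += 1
    return shared

# The two paralogous relatives share the first three levels (2.7.7).
print(shared_ec_levels("2.7.7.3", "2.7.7.39"))  # 3
```

Comparing whole fields rather than string prefixes matters here: "2.7.7.3" is a textual prefix of "2.7.7.39", yet the fourth levels are different.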

Binding site comparison
Pantetheine-phosphate adenyltransferase (1od6A00)
Arginyl-tRNA synthetase (1f7uA01)
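
Identifiers such as 1od6A00 and 1f7uA01 appear to follow a pattern of four-character PDB code, chain letter, and two-digit domain number. The sketch below parses them on that assumption; the pattern and the names `CathDomainId` and `parse_domain_id` are illustrative and not taken from the slides.

```python
import re
from typing import NamedTuple

class CathDomainId(NamedTuple):
    pdb_code: str
    chain: str
    domain: int

# Assumed layout: 4-character PDB code, 1-character chain, 2-digit domain number.
_DOMAIN_ID = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])([0-9]{2})$")

def parse_domain_id(identifier):
    """Split a domain identifier into PDB code, chain and domain number."""
    match = _DOMAIN_ID.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised domain identifier: {identifier!r}")
    return CathDomainId(match.group(1), match.group(2), int(match.group(3)))

print(parse_domain_id("1od6A00"))  # CathDomainId(pdb_code='1od6', chain='A', domain=0)
print(parse_domain_id("1f7uA01"))  # CathDomainId(pdb_code='1f7u', chain='A', domain=1)
```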

Arginyl-tRNA synthetase

Changes in the domain partnerships can modify the binding site.
Pantetheine-phosphate adenyltransferase, Asparagine synthetase B

Change in Oligomerisation
Thioredoxin superfamily: peroxidase, calsequestrin

The Mosaic Theory of Protein Evolution (Teichmann et al. 2001, 2003; Gerstein et al. 2001)
60-80% of proteins are multi-domain.
A few thousand domain superfamilies (< 10,000 in CATH, SCOP and Pfam).
More than two million domain combinations (multi-domain architectures).

Similarity in Chemistry
Conserved: 19%
Semiconserved: 67%
Poorly conserved: 7%
Unconserved: 7%
[Figure: reaction schemes labelled I, P, I’, P’]
Nearly 90% of families show full or partial conservation of functions.

Chemistry is conserved or semi-conserved across the family but the substrate can change:
cytochrome P450s
FAD/NAD(P)(H)-dependent disulphide oxidoreductases
hexapeptide repeat proteins

blade domain

fulcrum domain

handle domain

How representative are these structural superfamilies (i.e. in CATH, SCOP) of all proteins in nature?

Domain structure predictions in genome sequences
Protein sequences from UniProt are scanned against a library of sequence patterns (HMMs) for CATH.
~26 million domain sequences assigned to CATH superfamilies.
~6000 annotated genomes.

Pfam-A: 10,340 curated families with annotation
[Figure: Pfam-A, Pfam-B, Other]

CATH and Pfam coverage of genomes NewFam?

Protein Family Databases Each family is represented by a sequence profile or HMM