CATH: Class, Architecture, Topology or Fold Group, Homologous Superfamily


CATH domain database (Orengo & Thornton, 1994)
C: Class
A: Architecture
T: Topology or Fold Group
H: Homologous Superfamily
The CATH domain database and associated resources: DHS, Gene3D
How do we determine domain boundaries?
How do we identify fold groups and evolutionary superfamilies?
What is the distribution of the CATH domain families in the PDB and in the genomes?

Multidomain proteins
~20,000 chains from the Protein Data Bank (PDB)
~50,000 domains in the CATH structure database
~40% of the entries in CATH are multidomain

Domains are important evolutionary units
Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain

~30% of multidomains in CATH are discontinuous
Structures shown: Carboxypeptidase A (2ctc), Carboxypeptidase G2 (1cg2A)

Algorithms for Recognising Domain Boundaries
DETECTIVE (Swindells, 1995): each domain should have a recognisable hydrophobic core
DOMAK (Siddiqui & Barton, 1995): residues comprising a domain make more internal contacts than external ones
PUU (Holm & Sander, 1994): parser for protein folding units; maximal interaction within domains and minimal interaction between domains
Consensus is sought between the three methods; on average this occurs about 20% of the time
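The consensus step can be illustrated with a minimal sketch. The slides do not say how agreement between DETECTIVE, DOMAK and PUU is measured, so the rule used here (same number of boundaries, each within a fixed residue tolerance of the first method's prediction) is an assumption, and the function name, the tolerance value and the residue numbers are all hypothetical.

```python
from typing import List


def boundaries_agree(predictions: List[List[int]], tolerance: int = 10) -> bool:
    """Return True if all methods predict the same number of domain
    boundaries and each boundary lies within `tolerance` residues of the
    corresponding boundary from the first method (assumed criterion)."""
    reference = sorted(predictions[0])
    for other in predictions[1:]:
        candidate = sorted(other)
        if len(candidate) != len(reference):
            return False
        if any(abs(a - b) > tolerance for a, b in zip(reference, candidate)):
            return False
    return True


# Hypothetical boundary positions (residue numbers) from the three methods.
detective = [120, 245]
domak = [118, 250]
puu = [125, 240]

consensus = boundaries_agree([detective, domak, puu])
no_consensus = boundaries_agree([detective, domak, [125]])
print(consensus, no_consensus)
```

A chain for which no consensus is reached would then need some other treatment, which this sketch does not attempt to model.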

[Figure: Homologues/analogues. Labels: close homologues, twilight zone, midnight zone, homologues/analogues. Values as printed: 74%, 29%, 21%, 4%, 11%.]

Algorithms for Recognising Homologues
Sequence-based methods
close homologues: BLAST (Altschul et al.), SSEARCH (Smith & Waterman)
remote homologues: SAM-T99 (Karplus et al.)
Structure-based methods
close & remote homologues: CATHEDRAL (Harrison, Thornton, Orengo), SSAP (Taylor & Orengo), CORA (Orengo)

[Figure: Homologues/analogues, annotated with the methods used. Close homologues: SSEARCH. Twilight zone: HMMs, SSAP. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values as printed: 74%, 29%, 21%, 4%, 11%.]

Hidden Markov Models (HMMs)
SAM-T99 (Karplus Group)
SAMOSA (Orengo Group)
[Figure: a query sequence is scanned against the non-redundant GenBank database to return hits]
These methods can currently identify ~70% of remote homologues (3 times more powerful than BLAST)

[Figure: Percentage of PDB structures classified in CATH by different methods over the last 2 years. Segments: remote homologues by SSAP (8.6), analogues by SSAP (1.9), novel folds, remote homologues (<30%) by HMMs, close homologues (>30%) by SSEARCH, near-identical by SSEARCH. Remaining segment values as printed: 2.0, 7.6, 20.7, 59.2.]

[Figure: Percentage of structural genomics PDB structures classified in CATH by different methods over the last 2 years. Segments: near-identical by SSEARCH, novel folds, analogues by SSAP, close homologues (>30%) by SSEARCH, remote homologues by SSAP, remote homologues (<30%) by HMMs. Values as printed: 22.0, 8.0, 28.4, 7.7, 11.8.]

Structure-Based Algorithms for Recognising Homologues
CATHEDRAL: pairwise alignment, secondary structure comparison
SSAP: pairwise alignment, residue comparison
CORA: multiple alignment, residue comparison

[Figure: Homologues/analogues, annotated with the methods used. Close homologues: SSEARCH. Twilight zone: HMMs. Midnight zone: CATHEDRAL, SSAP. Homologues/analogues: CATHEDRAL, SSAP. Values as printed: 74%, 29%, 21%, 4%, 11%.]

Structure is much more highly conserved than sequence
Heat labile enterotoxin compared with:
cholera toxin: structure similarity (SSAP) score 97, sequence identity 79%
pertussis toxin: structure similarity (SSAP) score 81, sequence identity 12%

[Figure: Pairwise Sequence Identities and Structure Similarity (SSAP) Scores in CATH Domain Families. Axes: structure similarity (SSAP) score against sequence identity (%). Series: same function, different function.]

Residue insertions in the loops connecting secondary structures
Shifts in the orientations of secondary structures

Structural variation in the P-loop Hydrolase Superfamily

Structural variation in the Galectin Binding Superfamily

Fast Structure Comparison Method (CATHEDRAL)
Andrew Harrison et al., JMB, 2002
ignore the variable loop regions and only compare the secondary structures
derive vectors through secondary structure elements
compare closest approach distances and vector orientations using graph theory

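[Figure: two secondary structure vectors a and b separated by a closest approach distance d]
$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}| \cos\theta$
+ dihedral angle $\phi$
+ chirality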
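As a small worked example of the dot-product relation on this slide, the sketch below recovers the angle between two secondary structure vectors from their dot product. The vectors and the function names are hypothetical, and the closest approach distance, dihedral angle and chirality are not computed here.

```python
import math


def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def angle_between(a, b):
    """Angle in degrees, from a . b = |a||b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# Hypothetical axis vectors for two secondary structure elements.
helix_axis = (1.0, 0.0, 0.0)
strand_axis = (0.0, 2.0, 0.0)

theta = angle_between(helix_axis, strand_axis)
print(round(theta, 6))
```

The angle does not depend on the vector lengths, which is why the second vector can be twice as long as the first without changing the result.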

CATHEDRAL: CATH's Existing Domain Recognition ALgorithm
Compares graphs of proteins
[Figure: protein graph. Each node is a secondary structure (labelled H in the figure); each edge carries d, $\theta$, $\phi$ and chirality.]

Comparing proteins with similar folds identifies an overlap graph with the largest common structural motif
[Figure: one protein graph with nodes A, B, C and another with nodes a, b, c, d, with edges labelled I to V; the matched node pairs are A,a, B,c and C,d]
The overlap graph has a structural motif of 3 secondary structures

Graphs are compared using the Bron-Kerbosch algorithm to find the largest common graph
In this example the common graph contains 5 nodes.
1000 times faster than residue-based methods (e.g. SSAP)
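The following is a minimal sketch of the basic Bron-Kerbosch recursion (without pivoting), which enumerates maximal cliques. It assumes the usual formulation in which each node of a correspondence graph is a candidate pairing of secondary structures from the two proteins and an edge means the two pairings are mutually compatible; the toy graph and node names are hypothetical and are not CATHEDRAL's actual data structures.

```python
def bron_kerbosch(r, p, x, adj, cliques):
    """Collect every maximal clique of the graph `adj` into `cliques`.

    r: nodes in the clique being built
    p: candidate nodes that could still extend r
    x: nodes already explored, used to avoid reporting duplicates
    """
    if not p and not x:
        cliques.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}


# Toy correspondence graph: "Aa" means element A of protein 1 paired with
# element a of protein 2. Edges join pairings that are compatible.
adj = {
    "Aa": {"Bc", "Cd"},
    "Bc": {"Aa", "Cd", "Ab"},
    "Cd": {"Aa", "Bc"},
    "Ab": {"Bc"},
}

cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
largest = max(cliques, key=len)
print(sorted(largest))
```

The largest clique corresponds to the largest set of mutually consistent pairings, that is, the largest common structural motif.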

Performance

Statistical significance can be assessed by scanning a protein ‘graph’ against ‘graphs’ of all known structures
$$\text{Score} \sim \frac{\text{common graph size}}{(\text{size}_{\text{protein 1}} \cdot \text{size}_{\text{protein 2}})^{1/2}}$$
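A short worked example of this normalisation, assuming the slide's layout denotes a fraction (common graph size divided by the geometric mean of the two graph sizes). The function name and the sizes are hypothetical.

```python
import math


def normalised_score(common_size: int, size1: int, size2: int) -> float:
    """Common graph size divided by the geometric mean of both graph sizes."""
    return common_size / math.sqrt(size1 * size2)


# Hypothetical example: a 5-node common graph between graphs of 8 and 10 nodes.
score = normalised_score(5, 8, 10)
identical = normalised_score(6, 6, 6)
print(round(score, 3), identical)
```

Under this reading, two identical graphs score 1, and the score falls as either protein grows relative to the shared motif.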

Scores for unrelated structures exhibit an extreme value distribution
$F = A e^{-b \cdot \text{score}}$
$\log F = \log A - b \cdot \text{score}$
This allows you to calculate the probability (P-value, E-value) of obtaining any score by chance
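Because log F is linear in the score, A and b can be estimated with an ordinary least-squares line fit. The sketch below does this on synthetic data generated from assumed values A = 2.0 and b = 0.5; these numbers, the function names and the interpretation of F as the frequency of unrelated comparisons at a given score are illustrative assumptions, not values from the slides.

```python
import math


def fit_log_linear(scores, freqs):
    """Least-squares fit of log F = log A - b * score; returns (A, b)."""
    n = len(scores)
    logs = [math.log(f) for f in freqs]
    mean_s = sum(scores) / n
    mean_l = sum(logs) / n
    covariance = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, logs))
    variance = sum((s - mean_s) ** 2 for s in scores)
    slope = covariance / variance
    intercept = mean_l - slope * mean_s
    return math.exp(intercept), -slope


def chance_frequency(score, A, b):
    """Evaluate F = A * exp(-b * score) for a given score."""
    return A * math.exp(-b * score)


# Synthetic frequencies generated from assumed parameters A = 2.0, b = 0.5.
scores = [1.0, 2.0, 3.0, 4.0]
freqs = [2.0 * math.exp(-0.5 * s) for s in scores]

A, b = fit_log_linear(scores, freqs)
f6 = chance_frequency(6.0, A, b)
print(round(A, 6), round(b, 6), round(f6, 6))
```

With fitted parameters in hand, the expected frequency for a new score can be read off the curve, which is the basis for attaching a P-value or E-value to a match.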

Using CATHEDRAL to Identify Domain Boundaries
Graph-based secondary structure comparison is very fast: 1000 times faster than residue-based methods
New multi-domain structures can be rapidly scanned against the library of CATH domains
E-values can be used to identify significant matches
85-90% of domains in new multi-domain structures have relatives in CATH

[Figure: CATHEDRAL matches secondary structures in a new multi-domain structure by graph comparison, then SSAP produces a residue alignment. Residues in the new multi-domain structure are aligned against residues in CATH domain family 1 (Fold A) and CATH domain family 2 (Fold B).]

SSAP (Taylor & Orengo, J. Mol. Biol. 1989)
Residue-based structure comparison method using dynamic programming
Scores range from 0-100
[Figure: comparison matrix, residues in protein A against residues in protein B]
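To illustrate the dynamic programming step only, here is a minimal sketch that finds the best-scoring path through a precomputed residue similarity matrix. It is a generic global alignment, not SSAP itself: how SSAP derives its residue similarities and normalises the result to 0-100 is not described on the slide, and the matrix values, gap penalty and function name below are hypothetical.

```python
def align_score(sim, gap=-1.0):
    """Best global alignment score through a residue similarity matrix.

    sim[i][j] is the similarity between residue i of protein A and
    residue j of protein B; `gap` is the penalty for an unmatched residue.
    """
    n, m = len(sim), len(sim[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + sim[i - 1][j - 1],  # pair residues i and j
                best[i - 1][j] + gap,                    # residue i unmatched
                best[i][j - 1] + gap,                    # residue j unmatched
            )
    return best[n][m]


# Hypothetical similarities: protein A has 2 residues, protein B has 3.
sim = [
    [5.0, 0.0, 0.0],
    [0.0, 5.0, 0.0],
]

total = align_score(sim)
print(total)
```

Here the best path pairs the first two residues of each protein and leaves the third residue of protein B unmatched.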

CATHEDRAL
One third of known multi-domain structures are discontinuous

Reasons for Structural Similarity
Divergence: similarity arises due to divergent evolution from a common ancestor; structure is much more highly conserved than sequence
Convergence: similarity is due to there being a limited number of ways of packing helices and strands in 3D space

Domain structure database (Orengo & Thornton, 1994)
C: Class
A: Architecture
T: Topology or Fold Group
H: Homologous Superfamily
~50,000 domains in PDB
~1500 domain superfamilies in CATH

CATH domain database
C: Class, 3
A: Architecture, ~36
T: Topology or Fold, ~810
~50,000 domains

CATH
T: Topology or Fold Group, ~810
H: Homologous Superfamily (Domain Family), ~1500
S: Sequence Family (35%, 60%, 95%)
Domain entries: ~50,000 (the slide also carries a 40,000 label)

DHS: Dictionary of Homologous Superfamilies
http://www.biochem.ucl.ac.uk/bsm/dhs
Description of structural and functional characteristics for each superfamily

Variation in Secondary Structures Across Superfamily
DHS: Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs

Functional annotations from GO, EC, COGs, KEGG
DHS: Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/dhs

Multiple structure alignments with conserved residues highlighted
DHS: Dictionary of Homologous Superfamilies http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D

Population of CATH Families and Structural Groups
~50,000 structural domains
S: ~4000 sequence families (35%), clustering proteins with similar sequences
H: ~1,500 homologous superfamilies, clustering proteins with similar structures and functions
T: ~810 fold groups, clustering proteins with similar structures
A: ~36 architectures
C: 3 major protein classes

Nearly one third of the superfamilies belong to <10 fold groups
[Figure: CATH fold groups shown: Arc repressor-like, Up-down, Rossmann, SH3-like, OB fold, Immunoglobulin, Jelly Roll, Alpha-beta plait, TIM barrel]

CATH numbering scheme: 2.40.50.100
Class: 2 (Mainly beta)
Architecture: 40 (Barrel)
Topology: 50 (OB Fold)
Homology: 100 (Heat labile enterotoxin superfamily)
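The four-part numbering scheme maps naturally onto a small record type. The sketch below parses a code such as 2.40.50.100 into its class, architecture, topology and homology levels; the type and function names are hypothetical, and the second code used for comparison (2.40.50.140) is included only to illustrate a different superfamily number within the same fold.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    class_: int        # C: class (trailing underscore avoids the keyword)
    architecture: int  # A: architecture
    topology: int      # T: topology or fold group
    homology: int      # H: homologous superfamily


def parse_cath_code(code: str) -> CathCode:
    """Split a dotted C.A.T.H code into its four integer levels."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(p) for p in parts))


ob = parse_cath_code("2.40.50.100")
# Two domains share a fold group when their first three levels match.
same_fold = ob[:3] == parse_cath_code("2.40.50.140")[:3]
print(ob, same_fold)
```

Comparing prefixes of different lengths gives the corresponding level of relationship: one level for class, two for architecture, three for fold group, four for superfamily.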

CATH domain structure database http://www.biochem.ucl.ac.uk/bsm/cath

CATH class level http://www.biochem.ucl.ac.uk/bsm/cath

CATH architecture level http://www.biochem.ucl.ac.uk/bsm/cath

CATH Topology or fold group level http://www.biochem.ucl.ac.uk/bsm/cath

CATH homologous superfamilies in each fold group http://www.biochem.ucl.ac.uk/bsm/cath

CATH homologous superfamily level http://www.biochem.ucl.ac.uk/bsm/cath

CATH sequence families (>=35% identity) in each superfamily http://www.biochem.ucl.ac.uk/bsm/cath

CATH classification information for individual domains http://www.biochem.ucl.ac.uk/bsm/cath

CATH structural relatives listed for each domain http://www.biochem.ucl.ac.uk/bsm/cath

CATH server http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl

CATH server http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl
Structural matches and statistics listed for query domain

Expanding CATH with sequence relatives from genomes
Library of HMMs built for representative sequences from each CATH domain superfamily
Scan protein sequences from genomes against the CATH HMM library
Assign domains to CATH superfamilies

Expanding CATH
~1400 Domain Structure Superfamilies
Sequences added from GenBank, genomes, SWISS-PROT/TrEMBL
[Figure: using CATH-HMMs, each Homologous Superfamily (H) grows from sequence families S1 to S3 to sequence families S1 to S5]
~50,000 sequences in ~4,000 sequence families grow to ~600,000 sequences in ~24,000 sequence families
Up to 70% of sequences in completed genomes can be assigned to CATH domain superfamilies

Gene3D
[Figure: fold groups shown: Arc repressor-like, Up-down, Alpha horseshoe, SH3-like, OB fold, Rossmann, Immunoglobulin, Jelly Roll, TIM barrel, Alpha-beta plait, Four helix bundle]

Gene3D http://www.biochem.ucl.ac.uk/bsm/Gene3D
CATH domain structure annotations for complete genomes

Gene3D http://www.biochem.ucl.ac.uk/bsm/Gene3D
Individual genome statistics

Gene3D http://www.biochem.ucl.ac.uk/bsm/Gene3D
Assignment of sequences to Gene3D protein families

Gene3D http://www.biochem.ucl.ac.uk/bsm/Gene3D
Functional annotations for individual sequences

Gene3D http://www.biochem.ucl.ac.uk/bsm/Gene3D
Domain annotations for individual sequences

Summary
CATH currently identifies ~1500 superfamilies in the ~50,000 structural domains from the PDB
These domain families contain over 600,000 domain sequences from the genomes and sequence databases
Up to 70% of genome sequences can be assigned to domain structure families using HMMs and threading

Acknowledgements
Janet Thornton, Frances Pearl, Ian Sillitoe, Oliver Redfern, Mark Dibley, Tony Lewis, Chris Bennett, Andrew Harrison, Gabrielle Reeves, Alastair Grant, David Lee
http://www.biochem.ucl.ac.uk/bsm/cath
Funding: Medical Research Council, Wellcome Trust, NIH, Biotechnology and Biological Sciences Research Council