Exploiting Structural and Comparative Genomics to Reveal Protein Functions  Predicting domain structure families and their domain contexts  Exploring.

Presentation transcript:

Exploiting Structural and Comparative Genomics to Reveal Protein Functions: predicting domain structure families and their domain contexts; exploring how structural divergence in domain families correlates with functional change; predicting domain relatives likely to have significantly different structures and functions. CATH: domain families of known structure. Gene3D: protein families and domain annotations for completed genomes.

Thanks to Amos, Rolf and the Swiss-Prot Team!!!! Congratulations Swiss-Prot - 20 Years!!

CATH hierarchy: Class (3), Architecture (36), Topology or Fold (1100), Homologous superfamily (2100). 86,000 domains. Orengo and Thornton (1994).

Gene3D: domain annotations in genome sequences. More than 2 million protein sequences from 300 completed genomes and UniProt are scanned against a library of HMMs (~2100 CATH, ~8300 Pfam) to assign domains to CATH and Pfam superfamilies. Benchmarking against structural data shows that 76% of remote homologues can be identified using the HMMs.

Gene3D: domain annotations in genome sequences. DomainFinder: structural domains from CATH take precedence. [Figure: a UniProt sequence, N to C terminus, with its assigned domains labelled CATH-1, Pfam-1, Pfam-2 and NewFam.]
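The precedence rule on this slide can be sketched in a few lines. This is not the DomainFinder algorithm itself, which the slides do not describe; it is a minimal greedy illustration that assumes each HMM hit is a residue range with a source label. The `Hit` type, the tie-breaking by length, and the `min_len` cutoff for unassigned regions are all hypothetical.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int   # first residue of the match (1-based, inclusive)
    end: int     # last residue of the match (inclusive)
    source: str  # "CATH" or "Pfam"
    family: str  # family label, e.g. "CATH-1"


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def resolve(hits: list[Hit]) -> list[Hit]:
    """Greedy resolution: CATH hits are considered before Pfam hits.

    Within a source, longer hits are considered first (an assumption,
    not something stated in the slides). A hit is kept only if it does
    not overlap a hit that was already kept.
    """
    priority = {"CATH": 0, "Pfam": 1}
    ordered = sorted(
        hits, key=lambda h: (priority[h.source], -(h.end - h.start), h.start)
    )
    kept: list[Hit] = []
    for hit in ordered:
        if not any(overlaps(hit, k) for k in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


def unassigned(kept: list[Hit], length: int, min_len: int = 50) -> list[tuple[int, int]]:
    """Regions of at least min_len residues not covered by any kept hit.

    `kept` must be sorted by start and non-overlapping, as returned by
    resolve(). Such regions are the candidates for NewFam-style clustering.
    """
    gaps = []
    pos = 1
    for h in kept:
        if h.start - pos >= min_len:
            gaps.append((pos, h.start - 1))
        pos = h.end + 1
    if length - pos + 1 >= min_len:
        gaps.append((pos, length))
    return gaps


hits = [
    Hit(10, 120, "Pfam", "Pfam-1"),
    Hit(100, 250, "CATH", "CATH-1"),
    Hit(300, 400, "Pfam", "Pfam-2"),
]
kept = resolve(hits)
print([h.family for h in kept])
print(unassigned(kept, length=500))
```

In this toy input the Pfam-1 hit is dropped because it overlaps the CATH-1 hit, while Pfam-2 is kept because it covers a region with no structural assignment.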

Domain families ranked by size (number of domain sequences). More than 90% of domain sequences in UniProt can be assigned to ~7000 domain families. [Figure: percentage of all domain family sequences in UniProt against rank by family size, for CATH superfamilies of known structure, Pfam families of unknown structure, and NewFam families of unknown structure (>50,000 families).]

Domain families ranked by size (number of domain sequences). The 100 largest families of known structure account for 30% of domain sequences in UniProt. [Figure: percentage of all domain family sequences in UniProt against rank by family size, for CATH superfamilies of known structure, Pfam families of unknown structure, and NewFam families of unknown structure (>50,000 families).]
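The coverage figures on these two slides come from ranking families by size and accumulating the share of sequences they hold. A minimal sketch of that calculation follows; the function names and the toy family sizes are illustrative and are not taken from Gene3D.

```python
def coverage_of_top(sizes: list[int], n: int) -> float:
    """Fraction of all domain sequences held by the n largest families."""
    ordered = sorted(sizes, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


def families_needed(sizes: list[int], fraction: float) -> int:
    """Smallest number of families, largest first, covering `fraction` of sequences."""
    total = sum(sizes)
    if total == 0:
        raise ValueError("no sequences to cover")
    running = 0
    for rank, size in enumerate(sorted(sizes, reverse=True), start=1):
        running += size
        if running >= fraction * total:
            return rank
    return len(sizes)


# Toy family sizes (number of domain sequences per family), not real data.
sizes = [50, 30, 10, 5, 5]
print(coverage_of_top(sizes, 2))
print(families_needed(sizes, 0.85))
```

With real data, `sizes` would hold one entry per CATH, Pfam or NewFam family, and the two functions would give the values plotted against rank.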

Correlation of sequence and structural variability of CATH families with the number of different functional groups. [Figure axes: population in genomes, structural diversity.]

Exploiting Structural and Comparative Genomics to Reveal Protein Functions: predicting domain structure families and their domain contexts; exploring how structural divergence in domain families correlates with functional change; predicting domain relatives likely to have significantly different structures and functions. CATH: domain families of known structure. Gene3D: protein families and domain annotations for completed genomes.

Multiple structural alignment by CORA allows identification of consensus secondary structures and secondary structure embellishments. Some superfamilies show great structural diversity: in 117 superfamilies, relatives have expanded by more than 2-fold. 2DSEC algorithm: Gabrielle Reeves, J. Mol. Biol. (2006).

Structural embellishments can modify the active site. Galectin binding superfamily.

Structural embellishments can modulate domain interactions. The additional secondary structures shown at (a) are involved in subunit interactions. [Figure: glucose 6-phosphate dehydrogenase and dihydrodipicolinate reductase; panels labelled side orientation and face orientation.]

Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions. ATP-grasp superfamily. [Figure: biotin carboxylase, D-alanine-D-alanine ligase, and a dimer of biotin carboxylase.]

Secondary structure insertions are distributed along the chain but aggregate in 3D

For ~70% of domains analysed, 80% of the secondary structure embellishments are co-located in 3D with 3 or more other embellishments. In 80% of domains, 1 or more embellishments contact other domains or subunits. 85% of insertions comprise only 1 or 2 secondary structures. [Figure: frequency (%) against size of insertion (number of secondary structures); the inset for indel frequencies below 1% is labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06%, 0.02%.]
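One way to make "co-located in 3D with 3 or more other embellishments" concrete is to count, for each embellishment, how many others lie within a distance cutoff. The sketch below assumes each embellishment is summarised by a single centroid coordinate and uses a 10 Å cutoff; both choices are assumptions for illustration, and the slides do not state how co-location was measured.

```python
import math

Point = tuple[float, float, float]


def co_located_fraction(
    centroids: list[Point], cutoff: float = 10.0, min_neighbours: int = 3
) -> float:
    """Fraction of embellishments with at least `min_neighbours` others within `cutoff`.

    `centroids` holds one (x, y, z) coordinate per embellishment. The cutoff
    (in the same units as the coordinates) is a hypothetical value.
    """
    if not centroids:
        return 0.0
    hits = 0
    for i, p in enumerate(centroids):
        neighbours = sum(
            1
            for j, q in enumerate(centroids)
            if j != i and math.dist(p, q) <= cutoff
        )
        if neighbours >= min_neighbours:
            hits += 1
    return hits / len(centroids)


# Toy coordinates: four embellishments packed together, one far away.
centroids = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (0, 0, 3), (50, 50, 50)]
print(co_located_fraction(centroids))
```

The pairwise loop is quadratic in the number of embellishments, which is fine for a single domain where that number is small.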

Many structurally diverse superfamilies adopt folds with these regular layered architectures: 2-layer beta sandwich, 3-layer alpha/beta sandwich, 2-layer alpha/beta, alpha/beta barrel.


Exploiting Structural and Comparative Genomics to Reveal Protein Functions  Predicting domain structure families and their domain contexts  Exploring how structural divergence in domain families correlates with functional change  Predicting domain relatives likely to have significantly different structures and functions C A T H Domain families of known structure Gene3D Protein families and domain annotations for completed genomes

GEMMA – GEne Model and Model Annotation Algorithm for Predicting Sequence Homologues with Similar Structures and Functions. Each structural superfamily is divided into subfamilies of close sequence relatives (>=60% sequence identity) predicted to have similar functions. The largest 100 CATH families have more than 20,000 subfamilies.
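Grouping close relatives at a sequence identity threshold can be illustrated with single-linkage clustering over precomputed pairwise identities. This is a generic sketch and not the GEMMA implementation; the slides give only the 60% threshold, so the linkage rule, the function name, and the input format are assumptions.

```python
def cluster_subfamilies(
    ids: list[str],
    identity: dict[tuple[str, str], float],
    threshold: float = 60.0,
) -> list[list[str]]:
    """Single-linkage clustering: join two sequences if identity >= threshold.

    `identity` maps a pair of sequence ids to percent identity. Pairs that
    are missing from the mapping are treated as below the threshold.
    """
    parent = {s: s for s in ids}

    def find(x: str) -> str:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), pct in identity.items():
        if pct >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: dict[str, list[str]] = {}
    for s in ids:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Toy identities (percent), not real data.
ids = ["a", "b", "c", "d"]
identity = {("a", "b"): 72.0, ("b", "c"): 61.0, ("c", "d"): 35.0, ("a", "d"): 20.0}
print(cluster_subfamilies(ids, identity))
```

Single linkage lets "a" and "c" share a subfamily through "b" even if their direct identity is below 60%; a stricter rule such as complete linkage would produce more, smaller subfamilies.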

GEMMA – Predicting Functional Groups in CATH Superfamilies. Build multiple sequence alignments for each subfamily. [Figure: a structural superfamily containing subfamilies of close relatives predicted to have similar function (>60% identity).]

GEMMA – Predicting Functional Groups in CATH Superfamilies. Cluster subfamilies predicted to have similar functions into functional groups. [Figure: a structural superfamily containing subfamilies of close relatives predicted to have similar function (>60% identity).]

ATP Grasp Family: 192 subfamilies. [Figure: pyruvate phosphate dikinase (subfamily 1), pyruvate phosphate dikinase (subfamily 15) and succinyl-CoA synthetase (subfamily 22), compared by SSAP score and PSS score; score values not preserved.]

Pyruvate phosphate dikinase: subfamily profiles coloured by residue conservation (red = high, blue = low). Profiles aligned using profile-profile comparison (MAFFT). Many fully conserved positions: 6/7 positions are fully conserved. Equivalent functions. Scorecons (Valdar and Thornton, Profunc).

Succinyl-CoA synthetase and pyruvate phosphate dikinase: subfamily profiles coloured by residue conservation (red = high, blue = low). Profiles aligned using profile-profile comparison (MAFFT). Figure labels: fully conserved positions; no fully conserved positions. Different functions. Scorecons (Valdar and Thornton, Profunc).
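The comparison on these two slides, counting how many fully conserved positions two subfamily profiles share, can be sketched as follows. This is not Scorecons, which computes graded conservation scores; it is a simplified all-or-nothing check. It assumes the two subfamily alignments have already been aligned to each other so that column i means the same position in both, and the function names are hypothetical.

```python
def fully_conserved(alignment: list[str]) -> dict[int, str]:
    """Map column index -> residue for columns where all sequences agree.

    Columns containing a gap character ('-') are never counted as conserved.
    """
    if not alignment:
        return {}
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = {}
    for i in range(width):
        column = {seq[i] for seq in alignment}
        if len(column) == 1 and "-" not in column:
            conserved[i] = column.pop()
    return conserved


def shared_conserved(aln_a: list[str], aln_b: list[str]) -> tuple[int, int]:
    """Return (shared, considered).

    `considered` counts columns fully conserved in at least one subfamily;
    `shared` counts those conserved in both with the same residue.
    """
    a = fully_conserved(aln_a)
    b = fully_conserved(aln_b)
    considered = sorted(set(a) | set(b))
    shared = [i for i in considered if i in a and a[i] == b.get(i)]
    return len(shared), len(considered)


# Toy alignments, not real sequences.
subfamily_1 = ["GKSTA", "GKSTV", "GKSTL"]
subfamily_2 = ["GKTTA", "GKTTA"]
print(shared_conserved(subfamily_1, subfamily_2))
```

A high shared-to-considered ratio would point towards equivalent functions and a low one towards different functions, in the spirit of the two examples above; any threshold for that decision would have to be chosen by benchmarking.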

Performance in Merging Subfamilies into Functional Groups. 10 experimentally identified enzyme functions in this family. [Figure axes: number of functional groups predicted, error rate.]

GEMMA – Predicting Functional Groups in CATH Superfamilies. Benchmarked on 12 large enzyme families in CATH: 6-10 fold reduction in the number of functional subfamilies. [Figure: a structural superfamily in which subfamilies of close relatives predicted to have similar function (>60% identity) are merged into a functional group.]

Summary: More than half the domains in UniProt can be assigned to families of known structure. Analysis of some very large structural families revealed how secondary structure insertions can modulate functions. Functional groups can be identified in diverse families by comparing multiple features (e.g. residue conservation, predicted secondary structure).

CATH and Gene3D: Lesley Greene, Ian Sillitoe, Tony Lewis, Ollie Redfern, Alison Cuff, Tim Dallman, Mark Dibley, Sarah Addou, Stathis Sidderis, Russell Marsden, Dave Lee, Juan Ranea, Ilhem Diboun, Adam Reid, Corin Yeats. MRC, Wellcome Trust, NIH, EU (Biosapiens, Embrace, Enfin), BBSRC.