Classification of protein and domain families. Sequence to function. Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences.

Presentation transcript:

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Outline: Classification of protein and domain families; Sequence to function; Domain structures; Domain structure predictions; Structure to function

CATH hierarchy: Class; Architecture; Topology or Fold Group (1100); Homologous Superfamily (2100); Sequence Family. 40,000 domain entries. ~100,000 domains of known structure in CATH. Gene3D: ~2 million sequences from genomes assigned to CATH superfamilies and functionally annotated.

Gene3D: Domain structure annotations in genome sequences. ~5 million protein sequences from 560 completed genomes and UniProt are scanned against a library of HMM models and sequences for CATH, Pfam and NewFam superfamilies. ~2 million domain sequences assigned to CATH superfamilies.

Gene3D
(1) Cluster ~5 million sequences into protein superfamilies
(2) Map domains onto the sequences using HMM technology (CATH & Pfam domains)
Result: >200,000 protein superfamilies; ~10,000 domain superfamilies (2100 of known structure)

Proportion of genome sequences which can be assigned to domain families of known structure in CATH or SCOP (figure; series: HMM prediction, threading prediction)

Annotation levels for an average genome (figure; scale 0 to 100%): predicted to belong to structural superfamilies using HMM or threading techniques; many predicted to be transmembrane; many belonging to small species-specific families

Target selection strategy for PSI-2 (figure; axes: families ordered by size, percentage of domain sequences). Known structure (CATH - MEGA); unknown structure (BIG - Pfam). Adam Godzik - JCSG, Andras Fiser - NYSGC, Burkhard Rost - NESG

Correlation of sequence and structural variability of CATH families with the number of different functional groups (figure; axes: population in genomes (x 1000), structural diversity)

Structural diversity in the CATH Domain Superfamily P-loop hydrolases Cutinase Cocaine esterase Acetylcholinesterase

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Outline: Sequence to function; Domain structures; Domain structure predictions

Sequence identity thresholds for 90% conservation of enzyme function (to 3 EC levels) (figure; axes: sequence identity threshold for 90% conservation, number of families, number of sequences; highly variable families marked)

N-fold increase in functional annotation for sequences in Gene3D (figure; series: general thresholds, family-specific thresholds; axis: N-fold increase in coverage)
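The idea of inheriting an enzyme annotation only above a sequence identity threshold can be sketched as follows. This is a minimal illustration, not the Gene3D protocol: the function names, the `family_thresholds` lookup and the example threshold values are assumptions, and the 60% default is simply the general threshold quoted elsewhere in these slides. Identity is computed over aligned, non-gap columns of a pre-computed pairwise alignment.

```python
from typing import Dict, Optional


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(a == b for a, b in columns) / len(columns)


def inherit_ec(
    query: str,
    relative: str,
    relative_ec: str,
    family: str,
    family_thresholds: Dict[str, float],  # hypothetical per-family cut-offs
    general_threshold: float = 60.0,      # assumed general cut-off
) -> Optional[str]:
    """Return the relative's EC number truncated to 3 levels, or None."""
    # Prefer a family-specific threshold when one is known.
    threshold = family_thresholds.get(family, general_threshold)
    if percent_identity(query, relative) >= threshold:
        return ".".join(relative_ec.split(".")[:3])
    return None
```

A family with a permissive threshold lets more sequences inherit an annotation than the general cut-off would, which is the coverage gain the slide reports.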

Gene3D: functional information from GO, COGS, KEGG, EC, FunCat, MINT, IntAct, ComplexDB. Page annotations (screenshot): link to UniProt; links to GO; links to different levels in the Gene3D protein family; link to InterPro; links to CATH/Pfam; links to KEGG; “S” indicates you can search the term against Gene3D; get an XML version of this page.

Functional annotation of structures using EC, GO, KEGG, FunCat resources (figure; series: non-PSI PDBs, PSI PDBs; categories: 0 terms, 1 term, 2 terms, 3 terms, 4 terms)

Phylogenetic trees derived from multiple sequence alignments can be used to infer functionally related proteins
Tree Determinants - Valencia
Evolutionary Trace - Lichtarge
Funshift - Sonnhammer
SCI-PHY - Sjolander

Methods exploiting information on sequence conserved residue positions: Scorecons - Thornton; Protein Keys - Sander. From a multiple sequence alignment of relatives from a functional group, score conservation for each position in the alignment using an entropy measure (1 = highly conserved, 0 = unconserved). Conserved positions mapped onto a structural model indicate a putative functional site.
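An entropy-based conservation score of this kind can be sketched as below. This is an illustrative normalised Shannon entropy, not the Scorecons implementation: the function name, the treatment of gaps as an extra symbol, and the normalisation by log(min(N, K)) for N sequences and K symbol types are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def conservation_scores(alignment: List[str], alphabet_size: int = 21) -> List[float]:
    """Score each alignment column from 0 (unconserved) to 1 (fully conserved).

    alphabet_size=21 assumes 20 amino acids plus the gap character.
    """
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    scores = []
    for i in range(n_cols):
        column = [seq[i] for seq in alignment]
        total = len(column)
        counts = Counter(column)
        # Shannon entropy of the residue frequencies in this column.
        entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
        # Largest entropy reachable with this many sequences and symbols.
        max_entropy = math.log(min(total, alphabet_size))
        scores.append(1.0 if max_entropy == 0 else 1.0 - entropy / max_entropy)
    return scores
```

For example, `conservation_scores(["ACD", "ACE", "ACF"])` scores the first two columns as fully conserved and the third, where every sequence differs, as unconserved.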

GeMMA: compares sequence profiles (HMMs) between subfamilies within a superfamily of known structure (CATH). Starting from sequence subfamilies (80% seq. id), it clusters sequence relatives predicted to have similar structures/functions, even at low levels of sequence identity, into putative structure-function groups.

GeMMA v SCI-PHY using gold annotated sequences in Babbitt benchmark. Measures: purity (high is best); edit distance (low is best); VI distance (low is best); deviation from no. singletons (low is best)
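Two of these measures can be sketched for a predicted clustering against gold annotations. The definitions below are common textbook forms and are assumptions here: purity as the fraction of sequences that share the majority gold label of their cluster, and VI (variation of information) as H(pred | gold) + H(gold | pred). The benchmark may define purity differently, for example as the fraction of clusters that are entirely pure.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def purity(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of items carrying the majority gold label of their predicted cluster."""
    joint = Counter(zip(pred, gold))
    best = {}
    for (cluster, _label), n in joint.items():
        best[cluster] = max(best.get(cluster, 0), n)
    return sum(best.values()) / len(pred)


def variation_of_information(pred: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """VI distance between two partitions; 0 means they are identical."""
    n = len(pred)
    pred_counts = Counter(pred)
    gold_counts = Counter(gold)
    joint = Counter(zip(pred, gold))
    vi = 0.0
    for (cluster, label), c in joint.items():
        p_joint = c / n
        # Each term adds to H(gold | pred) and H(pred | gold).
        vi -= p_joint * (
            math.log(p_joint / (pred_counts[cluster] / n))
            + math.log(p_joint / (gold_counts[label] / n))
        )
    return vi
```

Merging two functional groups into one cluster lowers purity and raises VI, so the two measures penalise under-splitting in the directions the slide indicates.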

Functional annotation coverage using different strategies (figure; axis: coverage of superfamily (%); series: experimental annotations, inherit functions at 60% seq. id., inherit functions by GeMMA)

Gene3D Biominer methods: protein interactions and gene networks
Phylotuner: correlation of domain occurrence profiles
GOSS: semantic similarity calculation between protein pairs
CODA: domain fusion analysis
HiPPI: homology inheritance of protein-protein physical interaction data
GECO: correlation of gene expression data
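Correlating domain occurrence profiles, as named for Phylotuner above, can be illustrated with a plain Pearson correlation between per-genome domain counts. This is a sketch under assumptions: the profile representation (one count per genome, same genome order for every domain) and the choice of Pearson correlation are illustrative and not taken from the Phylotuner method itself.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length occurrence profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-occurrence signal
    return cov / math.sqrt(var_x * var_y)


# Hypothetical counts of three domain families across four genomes.
profiles = {
    "domA": [1, 0, 2, 0],
    "domB": [2, 0, 4, 0],
    "domC": [0, 3, 0, 1],
}
```

Here `pearson(profiles["domA"], profiles["domB"])` is close to 1, suggesting the two families tend to occur together, while domA and domC are negatively correlated.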

Protein Family Resources and Protocols for Structural and Functional Annotation of Genome Sequences. Outline: Domain structures; Domain structure predictions; Structure to function

Methods for Assessing Structural Novelty. CATHEDRAL: structure comparison (Redfern et al., PLoS Comput. Biol. 2007)

Structural clusters in the Aminoacyl tRNA synthetases-like family (figure; clustered by structure similarity score): Aminoacyl tRNA synthetases; DNA-binding, stress-related; Argininosuccinate lyases; Gln-hydrolyzing synthases; Nucleotidyl-transferases

1bkzA dypA00 Galectin binding superfamily

Identifying functional groups in domain superfamilies: Aminoacyl tRNA synthetases-like
1dnpA00: Deoxyribodipyrimidine photo-lyases
1ej2A00: Nucleotidylyl-transferases
1n3lA01: AA tRNA synthetase, Class I
1o97D01: Electron transfer flavoprotein

Exploiting 3D Templates to Represent Functional Relatives
JESS - Thornton
GASP - Babbitt
SPASM - Kleywegt
PINTS - Russell
DRESPAT - Sarawagi
pvSOAR - Joachimiak

SITESEER (Laskowski and Thornton): match 3-residue templates and assess relevance of hits by looking at residues within the local environment. Figure colours: green and purple, identical residues; orange and white, similar residues.
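Matching a 3-residue template against a structure can be sketched as a search for residue triples of the right types whose pairwise distances agree with the template. This is a simplified illustration and not the SITESEER algorithm: it assumes one representative coordinate per residue, a hypothetical distance tolerance `tol`, and it omits the local-environment assessment described above.

```python
import math
from itertools import permutations
from typing import List, Tuple

Residue = Tuple[str, Tuple[float, float, float]]  # (residue type, representative coordinate)


def match_template(template: List[Residue], residues: List[Residue], tol: float = 1.0):
    """Return index triples in `residues` that match a 3-residue template."""
    if len(template) != 3:
        raise ValueError("template must contain exactly three residues")
    pairs = ((0, 1), (0, 2), (1, 2))
    hits = []
    for triple in permutations(range(len(residues)), 3):
        # Residue types must match the template in order.
        if any(residues[i][0] != template[k][0] for k, i in enumerate(triple)):
            continue
        # All three inter-residue distances must agree within the tolerance.
        if all(
            abs(
                math.dist(residues[triple[a]][1], residues[triple[b]][1])
                - math.dist(template[a][1], template[b][1])
            )
            <= tol
            for a, b in pairs
        ):
            hits.append(triple)
    return hits
```

Distances alone cannot tell a site from its mirror image, and the exhaustive search is cubic in the number of residues, so a practical tool would need a superposition step and some pre-filtering.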

FLORA: 3D templates for functional groups. From multiple structure alignments of functional subgroups in the superfamily, identify vectors between amino acids that are highly conserved and distinctive for the functional subgroup.
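The first half of this idea, finding inter-residue vectors that are conserved across a functional subgroup, can be sketched as follows. This is a loose illustration and not the FLORA method: it assumes the structures are already superposed and aligned position by position, uses a hypothetical spread cut-off `max_sd`, and does not test whether a vector is distinctive relative to other subgroups.

```python
from itertools import combinations
from statistics import pstdev
from typing import List, Tuple

Coord = Tuple[float, float, float]


def conserved_vectors(structures: List[List[Coord]], max_sd: float = 0.5):
    """Return position pairs (i, j) whose i->j vector varies little across structures.

    structures[s][p] is the coordinate of aligned position p in structure s.
    """
    n_pos = len(structures[0])
    if any(len(s) != n_pos for s in structures):
        raise ValueError("all structures must cover the same aligned positions")

    conserved = []
    for i, j in combinations(range(n_pos), 2):
        # Vector from position i to position j in every structure.
        vectors = [tuple(s[j][k] - s[i][k] for k in range(3)) for s in structures]
        # Largest per-axis standard deviation across the subgroup.
        spread = max(pstdev([v[k] for v in vectors]) for k in range(3))
        if spread <= max_sd:
            conserved.append((i, j))
    return conserved
```

A full template would also need the second criterion from the slide, keeping only vectors that differ from those seen in other functional subgroups of the superfamily.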

FLORA: 3D templates for functional groups. localFLORA: single site; globalFLORA: multiple sites

FLORA: Performance in recognising functionally related homologues. Benchmark of 36 diverse enzyme groups (from 12 families)

Performance of FLORA: benchmarked on 36 large enzyme families

FLORA: 3D Templates for Structure-Function Groups in Domain Families
1dnpA01: Deoxyribodipyrimidine photo-lyases
1ej2A00: Nucleotidylyl-transferases
1q77A00: Unknown function (MCSG)
1n3lA01: AA tRNA synthetases
1o97D01: Electron transfer flavoprotein

Fold and structural motifs: SSM fold search; surface clefts; residue conservation; DNA-binding HTH motifs; nest analysis
Sequence scans: sequence motifs (PROSITE, BLOCKS, SMART, Pfam, etc); sequence search vs PDB; sequence search vs UniProt; superfamily HMM library; gene neighbours
n-residue templates: enzyme active sites; ligand binding sites; DNA binding sites; reverse templates

Function Prediction for Proteins of ‘Putative’ or Unknown Function
Table columns: Class; Sequence Evidence; Structure Evidence; Sequence + Structure; Neither Successful
Table rows: Putative (57); Unknown (132)
Values as extracted: 95*, 69*, 57*, 25
* Numbers refer to results where the top hit is classed as ‘Strong’ or ‘Moderate’
Structural data provides relatively more information for proteins about which there is less knowledge.
These predictions need to be experimentally validated.