A brief on: Domain Families & Classification

The discovery of domains in protein structures
Domains at the sequence level
Examples of “Domain Resources”
Domain fusion
Supra-domains
Signaling domains and cell function
InterPro
Evolution by Protein Domains

Classification to Families
We can classify proteins into families by:
–A. Sequence (motifs; proteins)
–B. Structure
–C. Function (annotation)
–D. Evolution
Approaches range from automatic and large scale to manual and high quality.

Sequence Based Classification
Proteins as a unit, or proteins as a combination of domains (functional, structural, sequence).
The Goal:
1. New annotation, new family, family connections (sub/super) …
2. Predictive power (given a new unknown sequence)

Protein Multiple Alignment (Structurally supported)

Domains can be recognized through sequence similarity
Q: What is the best way to ‘represent’ this low sequence similarity of ~70 aa?

Misannotation due to multidomain proteins (Smith and Zhang, Nat Biotechnol)
C is a multidomain protein that combines a domain of known function (kinase) with a domain of unknown function. A is similar to C, and C is similar to B, but A is not similar to B, so transferring the annotation through C can mislabel B as kinase-like.

Q: What is the best way to ‘represent’ this low sequence similarity of ~70 aa?
Profile, PSSM, regular expression, HMM, and more…
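One of the representations listed above, the PSSM, can be sketched in a few lines. This is a minimal illustration and not the procedure of any particular database: the toy alignment is hypothetical, the background frequencies are assumed uniform, and a simple additive pseudocount is assumed.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per column of an ungapped alignment."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm

def score(pssm, segment):
    """Sum the column scores of a segment that has the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Hypothetical toy alignment, for illustration only.
toy_alignment = ["GKSTL", "GKTTL", "GRSSL", "AKSTI"]
pssm = build_pssm(toy_alignment)
print(score(pssm, "GKSTL"), score(pssm, "WWWWW"))
```

A segment that resembles the alignment scores above zero, while an unrelated segment scores below zero.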

Multi-domain protein families
Impossible to find “evolutionary relatedness” without adding DOMAIN information…

How is a novel gene born?
Domains are the evolutionary units of sequence that comprise the gene coding regions. Most genes are built from more than one domain. Novel genes can be created by recombination of domains into new domain arrangements.

From glycolysis: correspondence between functional associations and genes linked by the fusion method
Pathway: Glycerone-P ⇄ Glyceraldehyde-3P (TIM); Glyceraldehyde-3P → Glycerate-1,3P2 (GAPDH); Glycerate-1,3P2 → Glycerate-3P (PGK1).
Separate genes: M. genitalium PGK, TIM, and GAPDH.
Fused genes: Thermotoga maritima PGK+TIM; Phytophthora infestans TIM+GAPDH.
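The fusion method named on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: each gene is written as a tuple of the enzyme components it encodes, the function name `fusion_links` is hypothetical, and the only data used are the organisms and enzymes listed on the slide.

```python
from itertools import combinations

def fusion_links(genomes):
    """Link two components when they are stand-alone genes in one genome
    and parts of a single fused gene in another genome."""
    standalone = {}  # component -> organisms that encode it as a separate gene
    fused = []       # (organism, gene) for every multi-component gene
    for organism, genes in genomes.items():
        for gene in genes:
            if len(gene) == 1:
                standalone.setdefault(gene[0], set()).add(organism)
            else:
                fused.append((organism, gene))
    links = {}
    for organism, gene in fused:
        for a, b in combinations(gene, 2):
            # Require both components as separate genes in the same genome.
            if standalone.get(a, set()) & standalone.get(b, set()):
                links.setdefault(frozenset((a, b)), []).append(organism)
    return links

genomes = {
    "M. genitalium": [("PGK",), ("TIM",), ("GAPDH",)],
    "Thermotoga maritima": [("PGK", "TIM")],
    "Phytophthora infestans": [("TIM", "GAPDH")],
}
links = fusion_links(genomes)
for pair, organisms in links.items():
    print(sorted(pair), "fused in", organisms)
```

Each predicted link corresponds to consecutive steps of the pathway, which is the correspondence the slide points to.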

False Transitivity of Local Alignment
Input proteins: CSKP_HUMAN, DLG3_MOUSE, MPP3_HUMAN, K6A1_MOUSE.
BLAST E-values shown in the figure: 8e-78, 2e-47, 9e-41, 1e-42 (pairwise similarities better than 1e-40).
If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN.
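The clustering pitfall described here can be reproduced with a small single-linkage sketch. The four E-values and the 1e-40 threshold come from the slide, but the transcript does not say which value belongs to which pair, so the pairing below is an assumption made purely for illustration.

```python
def single_linkage_clusters(evalues, threshold):
    """Merge proteins transitively whenever a pairwise E-value passes the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), evalue in evalues.items():
        root_a, root_b = find(a), find(b)
        if evalue <= threshold:
            parent[root_a] = root_b
    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return list(clusters.values())

# ASSUMPTION: the assignment of E-values to pairs is hypothetical.
evalues = {
    ("K6A1_MOUSE", "CSKP_HUMAN"): 1e-42,
    ("CSKP_HUMAN", "MPP3_HUMAN"): 8e-78,
    ("CSKP_HUMAN", "DLG3_MOUSE"): 2e-47,
    ("DLG3_MOUSE", "MPP3_HUMAN"): 9e-41,
}
clusters = single_linkage_clusters(evalues, threshold=1e-40)
print(clusters)
```

K6A1_MOUSE and MPP3_HUMAN end up in the same cluster even though no direct hit between them is listed, which is the false transitivity the slide warns about.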

Domain Classification (intro to a few systems)
Used terms: Motif = Domain = Signature = Profile = Seed; Family = Cluster.
These terms are used interchangeably. They are very (too) flexible.

Protein Sequence Domain Classification
DOMO, ADDA, EVEREST, InterPro, CDD, MetaFam, ProSite, Pfam, Blocks+, Profile, SBASE, TigrFam, eMotif, SMART, PRINTS, ProDom
Based on different principles and a different focus!

Integration: Data Fusion
InterPro: 13,000 entries, based on the UniProt DB.

Expert system
InterPro: >13,000 entries.
Pfam (2006): >8000 entries. Sequence coverage: Pfam-A 75%, Pfam-B 19%, other (remainder).

Examples: complexity in domains
Identification? Boundary? Composition?

Why domains and not proteins?
Reducing false transitivity.
Exposing mix-and-match evolution.
Immediate relevance to structural domain families.
Suggesting evolutionarily ‘robust units’.
Providing models for a family.
Why automatic?
Overcoming large amounts of data.
Unbiased identification of new families (even without an identified seed or 3D structural information).

Domains are the building blocks of evolution: some facts
Example: pyruvate kinase (PDB: 1pkn) has 3 domains; each occurs in diverse sets of protein families.
The number of domains in a protein ranges from 1 up to tens.
Structure-based domains are ~150 aa long.
Length varies: some are very short, others are long (> 500 aa).
The domain definition is somewhat blurred.
Determining domain boundaries is an unsolved problem.

What is a domain? You know it when you see one

Automatic vs Manual >13,000 entries

General approaches
Motif-based databases: Prosite, Prints, Blocks, eMotif, InterPro
Domain-based databases: Pfam, ProDom, Domo, Smart
Manual/semi-manual: Prosite
Semi-automatic: Pfam, Smart
Fully automatic: ProDom, Blocks, Domo, eMotif
They use different models (regular expressions, profiles, HMMs) and are based on each other.

Example of semi-automatic: Pfam (Nucleic Acids Research, 2007, 1–8)
1. Release 22.0 of Pfam contains 9318 protein families, which cover 73.2% of sequences and 50.8% of residues.
2. Pfam is now based on UniProtKB, NCBI GenPept, and metagenomics projects.
3. ~500 new Pfam-A families for PDB sequences and SCOP entries, increasing the amino acid coverage.
4. Clans are built manually (supported by literature, SCOP, ...): 283 clans comprising a total of 1808 Pfam-A families.
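The two coverage figures in point 1 most plausibly measure different things: the fraction of sequences with at least one family match, and the fraction of all amino acids that fall inside a match. A minimal sketch of both measures follows, assuming 1-based inclusive match coordinates; the sequence lengths and matches are hypothetical.

```python
def coverage(sequence_lengths, hits):
    """Return (sequence coverage, residue coverage) for a set of domain matches.

    sequence_lengths: {seq_id: length}
    hits: {seq_id: [(start, end), ...]} with 1-based inclusive coordinates
    """
    covered_sequences = 0
    covered_residues = 0
    for seq_id in sequence_lengths:
        positions = set()
        for start, end in hits.get(seq_id, []):
            positions.update(range(start, end + 1))  # overlaps are counted once
        if positions:
            covered_sequences += 1
        covered_residues += len(positions)
    total_residues = sum(sequence_lengths.values())
    return covered_sequences / len(sequence_lengths), covered_residues / total_residues

# Hypothetical toy data, for illustration only.
lengths = {"seq1": 100, "seq2": 200, "seq3": 100}
hits = {"seq1": [(1, 50)], "seq2": [(11, 60), (41, 110)]}
sequence_cov, residue_cov = coverage(lengths, hits)
print(sequence_cov, residue_cov)
```

Residue coverage is always the lower of the two, because a sequence counts as covered even when only part of it matches a family.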

The Power of Integration
Pfam, Prosite, SMART, PRINTS, TigrFam, ProDom, InterPro, SCOP, CATH, FSSP, GO, ENZ, KEGG

Proteins were found to have spatially distinct structural units
Example: transferase (methyltransferase), PDB: 1adm.
Structural domains provide a “clean” definition.

In 1974, Michael Rossmann observed that structural domains can recur in different structural contexts.
1ht0: an alcohol dehydrogenase
1i0z: a lactate dehydrogenase
Rossmann fold

Domains can recur in multiple copies in the same protein
Example: fibronectin (PDB: 1fnf)

Structural definition of domains
A distinct, compact, and stable protein structural unit that folds independently of other such units.

Recurrent domains in diphtheria toxin (1ddt)
The diphtheria toxin is made up of three domains, each of which is involved in a different stage of infection (receptor binding, membrane penetration, and catalysis of ADP-ribosylation of elongation factor 2). A structural neighbor is depicted next to each domain of diphtheria toxin (middle).

Dominant domain fold types. Holm and Sander. PROTEINS: Structure, Function, and Genetics 33:88–96 (1998)

SCOP: a structural classification of proteins (updated from Murzin et al. J. Mol. Biol. 247)
Counts shown on the slide: 701; 1,110; 1,940; 44,327 (most likely folds, superfamilies, families, and domains, respectively).
Families are in turn grouped into superfamilies where sequence similarity is still recognizable and basic biochemical properties are conserved. Superfamilies and families are monophyletic (derive from a common ancestor).

Sequence
Biology predominantly proceeds by decomposing proteins into their domains.
Protein sequence families are constructed at the domain level.

Prosite
A dictionary of functional and structural motifs and domains.
Valuable biological information on each family.
Each motif/domain/family is represented as a regular expression, a rule, or a profile.
Models are generated from (usually published) multiple alignments and manually calibrated to ensure selectivity and sensitivity.
Patterns do not always cover complete domains, whereas profiles usually span the whole domain.
As of June 2002, it contains 1800 patterns and profiles describing 1200 families or domains.
Example pattern: G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C
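To make the pattern notation concrete, here is a minimal sketch that translates a pattern written in this style into a Python regular expression. It is not an official Prosite tool: the function name `prosite_to_regex` is hypothetical, only `x`, literal residues, `[allowed]` sets, `{forbidden}` sets and repeat counts are handled, and commas inside braces (as written on the slide) are assumed to be separators that can be dropped. The test sequences are hypothetical.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a Prosite-style pattern into a Python regular expression."""
    pieces = []
    for element in pattern.split("-"):
        # Separate an optional repeat suffix such as "(2)" or "(2,3)".
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                # any residue
            piece = "."
        elif core.startswith("{"):     # forbidden residues
            # Assumption: commas inside the braces are ignored.
            piece = "[^" + core[1:-1].replace(",", "") + "]"
        else:                          # literal residue or [allowed set]
            piece = core
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)

regex = prosite_to_regex("G-x(2,3)-[MLIV]-x-P-{K,H}-x(2)-C")
print(regex)
print(bool(re.search(regex, "AAGTTLSPAQQCAA")))  # hypothetical sequence
```

Written this way, a pattern scan is an ordinary regular-expression search over each sequence.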

Detecting domains at the sequence level (from the SMART database)

Fusion links: glycyl-tRNA synthetase
E. coli: glyQ and glyS (separate genes)
C. trachomatis: CT796 (fusion protein)
The fact that glyQ and glyS interact could have been predicted from the fusion protein CT796.

InterPro
An integrated resource of protein sites and functional domains.
“The good thing about standards is that there are so many of them to choose from…”

Introducing InterPro…

InterPro entry for a zinc finger domain

Search by taxonomy:

Example search results for the human protein Sirt1:

Alignment view.

HMM-Logo view.

iPfam: a database of domain-domain interactions based on PDB entries.

Notable advantages: links from the leading databases (UniProt, PDB, InterPro). Manual curation of the division into families. HMM-based search for global and local sequences. A collection of the domain architectures in which the protein appears. Phylogenetic and taxonomic trees for finding known homologous proteins. Graphical display of HMMs and alignments. The entire database can be downloaded.

Super-families of domains in InterPro (analogous to superfamilies in SCOP)

Some domains actually contain other domains!

Signaling and Multicellularity
One of the key problems of becoming a multicellular organism is solving the problem of cell signaling.

Signal transduction
Phosphorylation can reversibly alter the activity of an enzyme through the combined action of a protein kinase and a protein phosphatase.
Figure labels: inactive, active, inactive; P; kinase; phosphatase.

Tyrosine phosphorylation is a major mechanism of transmembrane signaling.
Protein tyrosine kinases (PTKs) add phosphate to tyrosines.
Pawson and Scott. Scientific American (2000)

SH2 domains (Src homology 2)
SH2 domains are modules of ~100 amino acids that bind to specific phosphotyrosine (pY)-containing peptide motifs.
The Pawson Lab

The SH2 domain is found embedded in a wide variety of metazoan proteins that regulate functionally diverse processes.
Pawson, T. et al., Trends in Cell Biology Vol. 11 No. 12, December 2001

Protein modules for the assembly of signaling complexes
Several modular domains have been identified that recognize specific sequences on their target acceptor proteins.
Pawson & Scott. Science (1997)

Adaptor proteins
One way receptors may amplify their signaling is to use adaptor proteins that provide additional docking sites for modular signaling proteins.

Supra-domains in Src and Abl
The order of domains in the polypeptide chains of Src and Abl, and diagrams of their assembled, autoinhibited states. In both cases, the SH3-SH2 clamp fixes the bilobed kinase domain in an inactive conformation. The domain color codes are SH3, yellow; SH2, green; kinase small lobe, dark blue; kinase large lobe, light blue. The activation loop in the large lobe is red. Connector, linker, and N- and C-terminal extensions are black. In Bcr/Abl, gene fusion has replaced the Abl cap by a long segment of Bcr.
Harrison, S. C. (2003). Cell, 112, 737–740.

Supra-domains: evolutionary units larger than single domains
A supra-domain is defined as a domain combination in a particular N-to-C-terminal orientation that occurs in at least two different domain architectures in different proteins with: (i) different types of domains at the N- and C-terminal ends of the combination; or (ii) different types of domains at one end and no domain at the other.
Figure: supra-domains of size 2 and 3, flanked at the N-terminal end and the C-terminal end; each row represents a different domain architecture.
Vogel C. J Mol Biol (3) :
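The definition above can be turned into a small search over domain architectures. This is a minimal sketch of one strict reading of the definition and not the published procedure: architectures are tuples of domain names ordered from N to C terminus, `None` stands for "no domain at this end", the function names are hypothetical, and the domain names in the example are placeholders.

```python
from itertools import combinations

def flanked_combinations(architecture, size):
    """Yield (combination, N-terminal neighbour, C-terminal neighbour)."""
    for start in range(len(architecture) - size + 1):
        end = start + size
        left = architecture[start - 1] if start > 0 else None
        right = architecture[end] if end < len(architecture) else None
        yield tuple(architecture[start:end]), left, right

def differ(a, b):
    """True when both flanking positions hold a domain and the domains differ."""
    return a is not None and b is not None and a != b

def find_supra_domains(architectures, sizes=(2, 3)):
    contexts = {}  # combination -> {(architecture, left, right), ...}
    for arch in set(map(tuple, architectures)):
        for size in sizes:
            for combo, left, right in flanked_combinations(arch, size):
                contexts.setdefault(combo, set()).add((arch, left, right))
    found = set()
    for combo, seen in contexts.items():
        for (arch1, l1, r1), (arch2, l2, r2) in combinations(seen, 2):
            if arch1 == arch2:
                continue  # must occur in two different architectures
            both_ends = differ(l1, l2) and differ(r1, r2)             # case (i)
            one_end = ((l1 is None and l2 is None and differ(r1, r2))
                       or (r1 is None and r2 is None and differ(l1, l2)))  # case (ii)
            if both_ends or one_end:
                found.add(combo)
    return sorted(found)

# Hypothetical architectures with placeholder domain names.
architectures = [
    ("A", "X", "Y", "B"),
    ("C", "X", "Y", "D"),
    ("X", "Z", "B"),
    ("X", "Z", "D"),
    ("A", "Q", "R"),
]
supra = find_supra_domains(architectures)
print(supra)
```

Here X-Y qualifies under case (i) and X-Z under case (ii); combinations seen in only one architecture are not reported.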

Supra-domains: evolutionary units larger than single domains
The P-loop containing nucleotide triphosphate (NTP) hydrolase domain and the translation protein domain occur as one combination in several different translation factors. This supra-domain occurs in 35 different domain architectures, and five of these are given here.
Chothia C. Science : Vogel C. J Mol Biol (3) :

The building blocks: modular interaction domains in signal transduction Pawson & Nash. Science (2003)
