Published by Isabella Boyd. Modified over 9 years ago.
1
MGM workshop. 19 Oct 2010
Functional annotation: data sources
Konstantinos Mavrommatis, Kmavrommatis@lbl.gov
2
Let’s get started…
Information from databases is used to predict the function of a protein (functional annotation):
- Product name
- Enzyme catalog number
- Domain architecture
- …
3
But what is function?
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
- Molecular/enzymatic: methyltransferase
  - Reaction: methylation
  - Substrate: cobalt-precorrin-4
  - Ligand: S-adenosyl-L-methionine
- Metabolic: cobalamin biosynthesis
- Physiological: maintenance of healthy nerve and red blood cells, through B12
4
Functional annotation
- Predict the biochemistry and physiology of an organism based on its genome sequence.
- Explain known biochemical and physiological properties.
5
Homologs/Orthologs/Paralogs
6
Function prediction
Function transfer by homology
- Homology implies a common evolutionary origin, not retention of similarity in any properties.
- Homology ≠ similarity of function.
Punta & Ofran. PLOS Comp Biol. 2008
7
Trust transfer of annotation?
Punta & Ofran. PLOS Comp Biol. 2008
8
Dos and Don’ts
Type                     | Don’t         | Do
Homology                 | Same function | Probability for same function
Orthology                | Same function | Probability for same function
Paralogy                 | Same function | Probability for same function
Sequence similarity      | Same function | Probability for same function
High sequence similarity | Same function | Probability for same function
Same sequence            | Same function | Probability for same function
9
What if nothing is similar?
- Subcellular localization
- Gene context
- Special features
- Prediction of binding residues (DISIS, bindN)
[Figure labels: Cytoplasm, S ~ S, Periplasm]
10
Genome annotation
Model pathway
Annotation should make sense
[Figure: pathway diagrams with Substrates A, B, C and D and Enzymes 1, 2 and 3; one version is marked "?" and another is marked "✓"]
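The idea that an annotation should make sense in the context of a pathway can be sketched as a simple gap check. This is only an illustration: it assumes a linear pathway in which Enzyme 1, Enzyme 2 and Enzyme 3 catalyze the steps A to B, B to C and C to D respectively (the slide figure does not survive well enough to confirm this), and the function name `missing_steps` is hypothetical.

```python
from typing import List, Set, Tuple

# (substrate, product, enzyme) for each step.
# ASSUMPTION: the enzyme-to-step assignment below is illustrative only.
Step = Tuple[str, str, str]

PATHWAY: List[Step] = [
    ("Substrate A", "Substrate B", "Enzyme 1"),
    ("Substrate B", "Substrate C", "Enzyme 2"),
    ("Substrate C", "Substrate D", "Enzyme 3"),
]


def missing_steps(pathway: List[Step], annotated: Set[str]) -> List[Step]:
    """Return the pathway steps whose enzyme was not found in the annotation."""
    return [step for step in pathway if step[2] not in annotated]


# A genome annotated with Enzyme 1 and Enzyme 3 only: the pathway has a gap.
incomplete = missing_steps(PATHWAY, {"Enzyme 1", "Enzyme 3"})
# A genome annotated with all three enzymes: the pathway is complete.
complete = missing_steps(PATHWAY, {"Enzyme 1", "Enzyme 2", "Enzyme 3"})

print("gaps:", incomplete)
print("gaps:", complete)
```

A gap such as the missing Enzyme 2 step is a prompt to re-examine the annotation (a missed gene, a misassigned function, or a genuinely absent pathway), not proof of an error.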
11
Annotation should make sense
12
Databases
Databases used for the analysis of biological molecules.
Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
Why bother?
- Store information.
- Organize data.
- Predict features (genes, functions...).
- Understand relationships (metabolic reconstruction).
13
Primary nucleotide databases
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
Archive containing all sequences from:
- genome projects
- sequencing centers
- individual scientists
- patent offices
The sequences are exchanged between the three centers on a daily basis.
Database is doubling every 10 months.
Sequences from >140,000 different species.
1400 new species added every month.

Year | Base pairs     | Sequences
2004 | 44,575,745,176 | 40,604,319
2005 | 56,037,734,462 | 52,016,762
2006 | 69,019,290,705 | 64,893,747
2007 | 83,874,179,730 | 80,388,382
2008 | 99,116,431,942 | 98,868,465
14
Primary protein sequence databases
Contain coding sequences derived from the translation of nucleotide sequences.
- GenBank: valid translations (CDS) from nucleotide GenBank entries.
- UniProtKB/TrEMBL (1996): automatic CDS translations from EMBL. TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.
15
Errors in databases
There are many errors in the primary sequence databases:
- In the sequences themselves:
  - Sequencing errors.
  - Cloning vector sequences.
- In the annotations:
  - Inaccuracies, omissions, and even mistakes.
  - Inconsistencies between some fields.
16
Redundancy
Redundancy is a major problem.
Entries are partially or entirely duplicated, e.g. 20% of vertebrate sequences in GenBank.
[Figure: partial and complete sequence duplications]
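A minimal sketch of how complete and partial duplication could be detected in a small set of entries. It treats an entry as redundant when its sequence is identical to, or fully contained in, a longer entry that has already been kept. The entry names and sequences are made up for illustration, and the approach ignores reverse complements, partial overlaps and near-identical sequences, all of which real redundancy-reduction tools have to handle.

```python
from typing import Dict


def remove_redundant(entries: Dict[str, str]) -> Dict[str, str]:
    """Keep only entries that are not identical to, or contained in, a kept entry."""
    kept: Dict[str, str] = {}
    # Visit the longest sequences first so that shorter, contained ones are dropped.
    # Ties are broken by name to keep the result deterministic.
    for name, seq in sorted(entries.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if any(seq in other for other in kept.values()):
            continue  # complete duplicate or fragment of a kept entry
        kept[name] = seq
    return kept


# Hypothetical entries: e2 duplicates e1 entirely, e3 is a fragment of e1.
entries = {
    "e1": "ATTGACTATTGACA",
    "e2": "ATTGACTATTGACA",
    "e3": "GACTATTG",
    "e4": "CGTGATATAGCCG",
}

non_redundant = remove_redundant(entries)
print(sorted(non_redundant))
```

The pairwise substring test is quadratic in the number of entries, which is fine for a demonstration but not for a database the size of GenBank.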
17
NCBI Derivative Sequence Data
[Figure: sequence fragments flowing between Labs, GenBank, Curators, Algorithms, UniGene, RefSeq and Genome Assembly]
18
RefSeq
- Curated transcripts and proteins, reviewed by NCBI staff.
- Model transcripts and proteins, generated by computer algorithms.
- Assembled genomic regions (contigs).
- Chromosome records.
19
Secondary protein databases
UniProt/SWISS-PROT (1986) (http://ca.expasy.org/spro)
- a curated protein sequence database
- high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)
- a minimal level of redundancy
- high level of integration with other databases
20
Classification databases
Groups (families/clusters) of proteins based on…
- Overall sequence similarity.
- Local sequence similarity.
- Presence/absence of specific features (active site, signal peptides…).
- Structural similarity.
- ...
These groups contain proteins with similar properties.
- Specific function, enzymatic activity.
- General function.
- Evolutionary relationship.
- …
21
Overall sequence similarity
22
Clusters of orthologous groups (COGs)
- COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
- Each cluster has representatives of at least 3 lineages.
- A function (specific or broad) has been assigned to each COG.
http://www.ncbi.nlm.nih.gov/COG/
23
How it works
[Figure labels: Reciprocal best hit, Bidirectional best hit, Blast best hit, Unidirectional best hit, COG1, COG2]
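The difference between a bidirectional (reciprocal) best hit and a unidirectional best hit can be sketched as follows. This is an illustration of the general idea only, not the COG construction procedure itself: the hit tuples, protein names and scores are invented, and real pipelines would parse them from BLAST output.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (query, subject, score); higher score is better


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Map each query to its single highest-scoring subject."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical searches of genome A proteins against genome B, and the reverse.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0), ("a2", "b2", 180.0), ("a3", "b1", 120.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a2", 175.0), ("b2", "a1", 88.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

Here `a3` has `b1` as its best hit, but the best hit of `b1` is `a1`, so `a3` to `b1` is only a unidirectional best hit and is not reported as a pair.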
24
Profiles & Pfam
One method for classifying proteins into groups exploits similarities within regions (domains/profiles), which contain valuable information.
These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.
25
Regions similarity
26
Pfam
http://pfam.sanger.ac.uk
HMMs of protein alignments: local (for domains) or global (covering the whole protein).
27
TIGRfam
- Full length alignments.
- Domain alignments.
- Equivalogs: families of proteins with specific function.
- Superfamilies: families of homologous genes.
- HMMs
http://www.tigr.org/TIGRFAMs/
28
KEGG orthology
29
Composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro.
- Release 30.0 (Dec10) contains 21178 entries.
- Central annotation resource, with pointers to its satellite dbs.
http://www.ebi.ac.uk/interpro/
30
* It is up to the user to decide if the annotation is correct *
31
ENZYME
32
ENZYME
http://ca.expasy.org/enzyme/
33
KEGG
Contains information about biochemical pathways and protein interactions.
http://www.kegg.com
34
Functional annotation
[Figure: IMG annotation flowchart. Recoverable labels: gene; COGs; KO terms; Pfam; TIGRfam; IMG; PSI-BLAST 1e-2; BLASTp <1e-10, >45% id, >70% length; hmmsearch (BLAST preprocessing); BLASTp evalue<10, 20 best hits; IMG term; TIGRfam; COG; COG + pfam; Pfam; product name (based on translation tables); hypothetical; YES/NO decision branches]
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
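One of the cutoffs visible in the flowchart (BLASTp <1e-10, >45% id, >70% length) can be sketched as a hit filter. This is a hedged illustration, not the IMG implementation: the slide does not say which step the cutoff belongs to or how "length" is measured, so the sketch assumes all three conditions must hold and that length means the fraction of the query covered by the alignment. The `BlastHit` class and the example hits are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    subject: str
    evalue: float
    percent_identity: float
    alignment_length: int
    query_length: int


# Cutoffs as printed on the slide; their interpretation is an assumption.
MAX_EVALUE = 1e-10
MIN_IDENTITY = 45.0   # percent
MIN_COVERAGE = 0.70   # ASSUMPTION: alignment length / query length


def passes_thresholds(hit: BlastHit) -> bool:
    """True if the hit satisfies all three cutoffs at once."""
    coverage = hit.alignment_length / hit.query_length
    return (
        hit.evalue < MAX_EVALUE
        and hit.percent_identity > MIN_IDENTITY
        and coverage > MIN_COVERAGE
    )


hits = [
    BlastHit("protA", 1e-50, 62.0, 290, 300),  # satisfies all cutoffs
    BlastHit("protB", 1e-5, 80.0, 295, 300),   # e-value too high
    BlastHit("protC", 1e-30, 38.0, 280, 300),  # identity too low
    BlastHit("protD", 1e-40, 55.0, 150, 300),  # covers only half the query
]

accepted = [hit.subject for hit in hits if passes_thresholds(hit)]
print(accepted)
```

Even a hit that clears all three cutoffs only raises the probability of shared function; as the earlier slides stress, it does not guarantee it.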
35
Sequencing projects & Metadata
http://www.genomesonline.org
36
Literature search
PubMed
http://www.ncbi.nlm.nih.gov/Pubmed
37
Specialized databases
- A large number of databases are devoted to specific organisms.
- For some model organisms there are often concurrent systems.
- These databases are typically associated with sequencing or mapping projects.
38
Other specialized databases
- Signal transduction, regulation, protein-protein interactions
  - TRANSFAC (Transcription Factor database)
  - BRITE (Biomolecular Relations in Information Transmission and Expression database)
  - DIP (Database of Interacting Proteins)
  - BIND (Biomolecular Interaction Network database)
  - BioCarta
- Biochemical pathways
  - KLOTHO (Biochemical Compounds Declarative database)
  - BRENDA (enzyme information system)
  - LIGAND (similar to Enzyme but with more information for substrates)
- Gene order and co-occurrence
  - STRING
- 3D structures
  - PDB (Protein Data Bank)
  - MMDB (Molecular Modelling Data Base)
  - NRL_3D (Non-Redundant Library of 3D Structures)
  - SCOP (Structural Classification of Proteins)
- Polymorphism
  - ALFRED (Allele Frequency Database)
- Molecular interactions
  - DIP (Database of Interacting Proteins)
  - BIND (Biomolecular Interaction Network Database)
- Gene expression
  - GXD (Mouse Gene Expression Database)
  - The Stanford Microarray Database
- Mapping
  - GDB (Genome Data Base)
  - EMG (Encyclopedia of Mouse Genome)
  - MGD (Mouse Genome Database)
  - INE (Integrated Rice Genome Explorer)
- Protein quantification
  - SWISS-2DPAGE
  - PDD (Protein Disease Database)
  - Sub2D (B. subtilis 2D Protein Index)
39
List of databases
http://www.oxfordjournals.org/nar/database/c
40
Databanks interconnection
- Not all databases are updated regularly.
- Changes of annotation in one database are not reflected in others.
41
Summary
- Gene annotation should make sense in the context of the organism.
- We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more…
- They help predict the function, or the network of functions.
- Systems are required that integrate the information from several databases, visualize it, and allow the data to be handled in an intuitive way.
QUESTIONS?