MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis Head of Omics group
Outline. Genome annotation (functional): How do we know it is correct? How do we do it? Data collections: protein families, pathway collections.
Genome annotation: the process of identifying the locations and functions of coding sequences. Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF). Levels of function: molecular/enzymatic (methyltransferase); reaction (methylation); substrate (cobalt-precorrin-4); ligand (S-adenosyl-L-methionine); metabolic (cobalamin biosynthesis); physiological (maintenance of healthy nerve and red blood cells, through B12).
Functional annotation helps make sense out of nonsense, but it only directs us to the potential of the organism.
Function prediction is mainly based on homology detection. Homology implies a common evolutionary origin, not retention of similarity in any property. Homology ≠ similarity of function. [Figure: function transfer by homology; alignment labels: conservative amino acid substitution, low-complexity region, gap (insertion or deletion).]
Function transfer based on homology is error prone (Punta & Ofran, PLoS Comput Biol, 2008).
Limits in transfer of annotation based on homology (Punta & Ofran, PLoS Comput Biol, 2008).
If no similarity is detected, use alternative methods to predict function: subcellular localization, gene context, special sequence motifs and features. [Figure labels: cytoplasm, periplasm, S~S.]
Genome annotation: annotation should make sense in the context of the cell metabolism. [Figure: model pathway with Substrates A, B, C, and D and Enzymes 1, 2, and 3; steps are marked ? or ✓.]
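The idea that an annotation has to make sense in its metabolic context can be illustrated with a minimal sketch. The pathway, the enzyme names, and the set of annotated enzymes below are hypothetical placeholders, not data from the workshop; the function only reports which steps of a pathway have no annotated enzyme, which is one simple way to flag steps that deserve a second look.

```python
def missing_steps(pathway, annotated):
    """Return the pathway steps (enzyme names) that have no annotated gene.

    pathway:   ordered list of enzyme names required by the pathway
    annotated: set of enzyme names found by functional annotation
    """
    return [enzyme for enzyme in pathway if enzyme not in annotated]


# Hypothetical example: a three-step pathway where one enzyme was not annotated.
pathway = ["Enzyme 1", "Enzyme 2", "Enzyme 3"]
annotated = {"Enzyme 1", "Enzyme 3"}

gaps = missing_steps(pathway, annotated)
print(gaps)
```

A reported gap does not prove the gene is absent; it may simply have been missed or misannotated, so it is a candidate for a targeted search.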
Annotation should make sense. Missing genes may be present.
Genome annotation: the process of identifying the locations and functions of coding sequences. It helps prediction, is error prone, and has to make sense.
There are multiple datasources to help organize information and facilitate annotation: sequence databases, protein classification databases, specialized databases.
Primary databases store raw information from various sources. EMBL/GenBank/DDBJ: archive containing all sequences from all sources. GenBank/UniProt contain translations of sequences.
Year | Base pairs | Sequences
… | …,575,745,176 | 40,604,…
… | …,037,734,462 | 52,016,…
… | …,019,290,705 | 64,893,…
… | …,874,179,730 | 80,388,…
… | …,116,431,942 | 98,868,465
Primary databases accumulate errors in sequences and annotations. In the sequences themselves: sequencing errors; cloning vector sequences. In the annotations: inaccuracies, omissions, and even mistakes; inconsistencies between some fields; redundancy.
IMG uses RefSeq as its primary source. [Figure labels: GenBank, UniGene, RefSeq, Genome Assembly; labs, curators, algorithms.]
Protein families use different methods to classify proteins: COG/KOG, Pfam, TIGRfam, KEGG Orthologs, InterPro.
What are COGs/KOGs? How much can I trust them?
[Figure labels: reciprocal best hit, bidirectional best hit, BLAST best hit, unidirectional best hit; COG1, COG2.]
>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism]. Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)
Query: 4 SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
FPK F +G + A FQ E +DW VWVHD I LVSGD PE ++
Sbjct: 3 KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60
Query: 64 YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
Y+ A +MGL+ R ++EWSRIFP
Sbjct: 61 YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV
Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
N++ +R Y +F +LKARGI + YH+ LPLW+ P GW + +TV
Sbjct: 95 --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG GWENRETV 142
Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
FAR+AA +F D + T NEPNVV GY+ G PP V +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200
Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
AHA A A+K I+ K +GII + PL+DK D KA E A+ F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260
Query: LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
L + DL + +D+IG+NYY+ + VK + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
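The difference between a unidirectional and a bidirectional (reciprocal) best hit can be sketched in a few lines. This is only the pairwise step under simplifying assumptions, not the full COG construction procedure; the gene names and bit scores are hypothetical, and the input is assumed to be a dictionary of (query, subject) pairs mapped to bit scores from two all-against-all searches.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: dict {(query, subject): bit_score}
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores: genome A searched against genome B, and the reverse.
a_vs_b = {("a1", "b1"): 388, ("a1", "b2"): 120, ("a2", "b2"): 200, ("a3", "b2"): 250}
b_vs_a = {("b1", "a1"): 380, ("b2", "a3"): 240, ("b2", "a2"): 190}

print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

In this toy input, a2 has b2 as its best hit, but b2 prefers a3, so a2 to b2 is only a unidirectional best hit and is left out of the result.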
Pfam is based on the detection of domains. HMMs of protein alignments: local (for domains) or global (covering the whole protein).
TIGRfam. HMMs built from full-length alignments and domain alignments. Equivalogs: families of proteins with specific function. Superfamilies: families of homologous genes.
How can we search Pfam and TIGRfam? Hits to other models.
Query: BChl_A [M=357]
Accession: PF
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
--- full sequence best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Sequence Description
tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1
Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
! e
Alignments for each domain:
== domain 1 score: 10.5 bits; conditional E-value: 1.1e-05
BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
fs+ g+v+si+ w l ++ + e gfr + iev v+g l v +d
tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL
********* *********************99864 PP
GA, gathering method: search threshold to build the full alignment.
TC, trusted cutoff: lowest sequence score and domain score of a match in the full alignment.
NC, noise cutoff: highest sequence score and domain score of a match not in the full alignment.
[Figure labels: noise cutoff, gathering cutoff, trusted cutoff.]
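One way to read the three cutoffs is as bands on the bit-score axis. The sketch below is an illustration only: it compares a single score against the cutoffs, whereas the definitions above involve both a sequence score and a domain score, and the cutoff values used here are hypothetical rather than those of any real Pfam or TIGRfam model. The labels for each band are also this sketch's own wording.

```python
def classify_hit(score, ga, tc, nc):
    """Place a bit score relative to the noise (NC), gathering (GA), and trusted (TC) cutoffs.

    Assumes nc < ga <= tc, which follows from how the cutoffs are defined.
    """
    if score >= tc:
        return "trusted"    # at or above the lowest score seen in the full alignment
    if score >= ga:
        return "gathered"   # passes the threshold used to build the full alignment
    if score > nc:
        return "uncertain"  # above the best known non-member, below the threshold
    return "noise"          # no better than the best-scoring non-member


# Hypothetical cutoffs for an imaginary model (bit scores).
GA, TC, NC = 25.0, 25.5, 24.5

for score in (10.5, 24.8, 25.2, 60.0):
    print(score, classify_hit(score, GA, TC, NC))
```

Under these made-up cutoffs, a domain score of 10.5 bits, like the one in the example output above, falls in the noise band even though its conditional E-value looks small.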
InterPro: composite pattern databases. To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource, InterPro. Release 30.0 (Dec10) contains … entries. Central annotation resource, with pointers to its satellite dbs.
KEGG orthology (Xizeng Mao et al., Bioinformatics, volume 21, 2005). Assignment criteria: E-value < 10^-5; rank ≤ 5; ≥ 70% of query length; ≥ 30% identity.
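The thresholds on this slide can be applied as a simple filter over similarity-search hits. The sketch below is an assumption about how they combine: it requires all four criteria at once and then takes the best-ranked surviving hit, which the slide does not spell out. The `Hit` record, its field names, and the KO identifiers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    ko: str                # KO term of the matched reference protein (hypothetical IDs below)
    evalue: float          # E-value of the hit
    rank: int              # 1 = best hit for this query
    query_coverage: float  # fraction of the query length covered by the alignment
    identity: float        # percent identity of the alignment


def passes_thresholds(hit: Hit) -> bool:
    """Apply the four slide thresholds together (combining them with AND is an assumption)."""
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and hit.query_coverage >= 0.70
        and hit.identity >= 30.0
    )


def assign_ko(hits: List[Hit]) -> Optional[str]:
    """Return the KO term of the best-ranked hit that passes, or None if none passes."""
    kept = [hit for hit in hits if passes_thresholds(hit)]
    if not kept:
        return None
    return min(kept, key=lambda hit: hit.rank).ko


hits = [
    Hit("K00002", 1e-3, 1, 0.90, 45.0),   # fails the E-value threshold
    Hit("K00001", 1e-50, 2, 0.85, 41.0),  # passes all four thresholds
    Hit("K00003", 1e-20, 3, 0.50, 60.0),  # fails the query-length threshold
]
print(assign_ko(hits))
```

Returning `None` when nothing passes keeps the gene unassigned rather than forcing a weak match onto it.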
ENZYME
Pathway collections: KEGG. Contains information about biochemical pathways and protein interactions.
Pathway collections: MetaCyc
Functional annotation
RNA structural and functional annotation are coupled. SILVA alignments of rRNAs are used to generate models. Covariance models for each RNA class are used to predict genes.
There is a plethora of specialized databases that one needs to search.
In most cases databases are interconnected, but not all databases are updated regularly. Changes of annotation in one database are not reflected in others.
There are multiple datasources to help organize information and facilitate annotation. Sequence databases: contain sequences deposited by various sources. Protein classification databases: utilize sequence homology or other criteria to group together proteins (COG, Pfam, TIGRfam, InterPro, KO terms). Specialized databases: start by searching for available resources.
Questions? Genome annotation (functional): How do we know it is correct? How do we do it? Data collections: protein families, pathway collections.