Published by Madeline Harris. Modified over 9 years ago.
1
MGM workshop. 19 Oct 2010
Functional annotation: data sources
Konstantinos Mavrommatis
Head of Omics group, DOE-JGI
Kmavrommatis@lbl.gov
2
Outline
Genome annotation (functional)
  How do we know it is correct?
  How do we do it?
Data collections
  Protein families
  Pathway collections
3
Genome annotation: the process of identifying the locations and functions of coding sequences.
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
Function can be described at several levels:
  Molecular/enzymatic: methyltransferase
    Reaction: methylation
    Substrate: cobalt-precorrin-4
    Ligand: S-adenosyl-L-methionine
  Metabolic: cobalamin biosynthesis
  Physiological: maintenance of healthy nerve and red blood cells, through B12
4
Functional annotation helps make sense out of nonsense,
but it only points to the potential of the organism.
5
Function prediction is mainly based on homology detection
Homology implies a common evolutionary origin, not retention of similarity in any of the proteins' properties.
Homology ≠ similarity of function.
Function transfer by homology
[Alignment figure labels: conservative amino acid substitution, low complexity region, gap (insertion or deletion)]
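The alignment features named on this slide (identical positions, substitutions, gaps) can be counted directly from a pairwise alignment. The following is a minimal sketch, assuming two already-aligned sequences of equal length with "-" as the gap character. The function name and the example sequences are hypothetical, and percent identity is computed over all alignment columns, which is only one of several conventions.

```python
def alignment_stats(aligned_a: str, aligned_b: str) -> dict:
    """Count identities and gap columns in a pairwise alignment.

    Both strings must already be aligned (same length, '-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")

    columns = len(aligned_a)
    identities = 0
    gaps = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            gaps += 1          # insertion or deletion in one sequence
        elif x == y:
            identities += 1    # same residue in both sequences

    return {
        "columns": columns,
        "identities": identities,
        "gaps": gaps,
        # Assumption: identity is reported over all alignment columns.
        "percent_identity": 100.0 * identities / columns,
    }


# Hypothetical toy alignment, not taken from the slides.
stats = alignment_stats("MKT-AY", "MRTGAY")
```

A high percent identity supports homology, but, as the slide stresses, it does not by itself establish that the two proteins share a function.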
6
Function transfer based on homology is error-prone
Punta & Ofran. PLOS Comp Biol. 2008
7
Limits in transfer of annotation based on homology
Punta & Ofran. PLOS Comp Biol. 2008
8
If no similarity is detected, use alternative methods to predict function
Subcellular localization
Gene context
Special sequence motifs and features
[Figure labels: Cytoplasm, S ~ S, Periplasm]
9
Genome annotation: model pathway
Annotation should make sense in the context of the cell metabolism
[Figure: a pathway through Substrate A, Substrate B, Substrate C and Substrate D, catalyzed by Enzyme 1, Enzyme 2 and Enzyme 3. Steps with no annotated enzyme are marked "?"; the complete pathway is marked "✓".]
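The consistency check described on this slide can be sketched as a lookup of each pathway step against the set of annotated enzymes. This is an illustrative sketch only: the function name, the data layout, and the order in which the enzymes act are assumptions, since the slide text does not state which enzyme catalyzes which step.

```python
def find_pathway_gaps(pathway_steps, annotated_enzymes):
    """Return the pathway steps whose enzyme has no annotated gene.

    pathway_steps: list of (substrate, enzyme, product) tuples.
    annotated_enzymes: set of enzyme names found in the genome annotation.
    """
    return [step for step in pathway_steps if step[1] not in annotated_enzymes]


# Assumed step order; the labels are the placeholder names from the slide.
pathway = [
    ("Substrate A", "Enzyme 1", "Substrate B"),
    ("Substrate B", "Enzyme 2", "Substrate C"),
    ("Substrate C", "Enzyme 3", "Substrate D"),
]

# Hypothetical annotation in which only Enzyme 2 was found.
partial = find_pathway_gaps(pathway, {"Enzyme 2"})

# Hypothetical annotation in which all three enzymes were found.
complete = find_pathway_gaps(pathway, {"Enzyme 1", "Enzyme 2", "Enzyme 3"})
```

Steps returned as gaps are candidates for a closer look: the gene may be missing from the organism, or it may be present but not yet annotated.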
10
Annotation should make sense.
Genes that appear to be missing may in fact be present.
11
Genome annotation: the process of identifying the locations and functions of coding sequences.
Helps prediction.
Is error-prone.
Has to make sense.
12
There are multiple data sources to help organize information and facilitate annotation
Sequence databases
Protein classification databases
Specialized databases
13
Primary databases store raw information from various sources
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
Archive containing all sequences from all sources
GenBank/UniProt contain translations of sequences.

Year  Base pairs      Sequences
2004  44,575,745,176  40,604,319
2005  56,037,734,462  52,016,762
2006  69,019,290,705  64,893,747
2007  83,874,179,730  80,388,382
2008  99,116,431,942  98,868,465
14
Primary databases accumulate errors in sequences and annotations
In the sequences themselves:
  Sequencing errors.
  Cloning vector sequences.
In the annotations:
  Inaccuracies, omissions, and even mistakes.
  Inconsistencies between some fields.
Redundancy.
15
IMG uses RefSeq as its primary source
[Figure labels: Labs, GenBank, Curators, Algorithms, UniGene, RefSeq, Genome Assembly]
16
Protein families use different methods to classify proteins
COG/KOG
Pfam
TIGRfam
KEGG Orthologs
InterPro
17
What are COGs/KOGs? How much can I trust them?
[Figure labels: Reciprocal best hit, Bidirectional best hit, Blast best hit, Unidirectional best hit, COG1, COG2]

>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query: 4   SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
           FPK F +G + A FQ E +DW VWVHD I LVSGD PE ++
Sbjct: 3   KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query: 64  YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
           Y+ A +MGL+ R ++EWSRIFP
Sbjct: 61  YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94

Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
           N++ +R Y +F +LKARGI + YH+ LPLW+ P GW + +TV
Sbjct: 95  --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142

Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
           FAR+AA +F D + T NEPNVV GY+ G PP ++ + + +V +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
           AHA A A+K I+ K +GII + PL+DK D KA E A+ F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query: 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
           L + DL + +D+IG+NYY+ + VK + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
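The difference between a bidirectional (reciprocal) best hit and a unidirectional best hit can be sketched in a few lines. This is a minimal illustration that assumes the hits are already available as (query, subject, bit score) tuples, for example parsed from BLAST output. The function names and gene identifiers are hypothetical, and the sketch covers only the pairwise step, not the full procedure used to build COGs.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical hits between genome A and genome B.
a_vs_b = [("a1", "b1", 388.0), ("a1", "b2", 120.0), ("a2", "b1", 95.0)]
b_vs_a = [("b1", "a1", 380.0), ("b2", "a1", 118.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
```

In this toy data, a1 and b1 are each other's best hit, while a2 has b1 as its best hit without the relationship being returned; the second case is the unidirectional best hit, which is weaker evidence for assigning a gene to a family.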
18
Pfam is based on the detection of domains
HMMs of protein alignments: local (for domains) or global (covering the whole protein)
http://pfam.sanger.ac.uk
19
TIGRfam
Full-length alignments.
Domain alignments.
Equivalogs: families of proteins with a specific function.
Superfamilies: families of homologous genes.
HMMs
http://www.tigr.org/TIGRFAMs/
20
How can we search Pfam and TIGRfam?
Hits to other models

Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence               Description
    ------- ------ -----    ------- ------ -----   ---- --  --------               -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84

  Alignments for each domain:
  == domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
                             fs+ g+v+si+ w l ++ + e gfr + iev v+g l v +d
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
                             899999*********98877765....3569*********************99864 PP

GA, Gathering method: search threshold to build the full alignment.
TC, Trusted Cutoff: lowest sequence score and domain score of a match in the full alignment.
NC, Noise Cutoff: highest sequence score and domain score of a match not in the full alignment.
[Figure labels: Noise cutoff, Gathering cutoff, Trusted cutoff]
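The three cutoffs defined above can be used to place a hit's bit score relative to a model's thresholds. The following is a hedged sketch: the function name and the threshold values are hypothetical (they are not the real cutoffs of PF02327), and it simplifies by using a single score, whereas the slide notes that each cutoff has both a sequence score and a domain score.

```python
def classify_score(score: float, ga: float, tc: float, nc: float) -> str:
    """Place a bit score relative to a model's NC, GA and TC cutoffs.

    Assumes nc <= ga <= tc, which follows from the definitions on the slide:
    NC is the best score left out of the full alignment, GA is the inclusion
    threshold, and TC is the worst score included in the full alignment.
    """
    if not nc <= ga <= tc:
        raise ValueError("expected nc <= ga <= tc")
    if score >= tc:
        return "at or above trusted cutoff"
    if score >= ga:
        return "at or above gathering threshold"
    if score > nc:
        return "between noise cutoff and gathering threshold"
    return "at or below noise cutoff"


# Hypothetical thresholds chosen only for illustration.
GA, TC, NC = 25.0, 25.3, 24.8

# 10.5 bits is the domain score shown in the search output on this slide.
weak_hit = classify_score(10.5, ga=GA, tc=TC, nc=NC)
strong_hit = classify_score(120.0, ga=GA, tc=TC, nc=NC)
```

With thresholds of this kind, a hit such as the 10.5-bit match of a glycoside hydrolase to the BChl_A model would be treated as noise even though its E-value looks small.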
21
InterPro: composite pattern databases
To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
Release 30.0 (Dec10) contains 21178 entries
Central annotation resource, with pointers to its satellite databases
http://www.ebi.ac.uk/interpro/
22
KEGG orthology
Criteria (shown twice in the figure):
  E-value < 10^-5
  rank ≤ 5
  ≥ 70% of query length
  ≥ 30% identity
Xizeng Mao et al. Bioinformatics 21 (2005) 3787-3793
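The four criteria on this slide can be read as a filter applied to each similarity hit. Below is a minimal sketch of such a filter. The class, field and function names are hypothetical, the example values are invented, and reading "≥ 70% query length" as the aligned fraction of the query is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject: str                # identifier of the matched sequence
    evalue: float               # E-value of the hit
    rank: int                   # 1 = best hit for this query
    aligned_query_length: int   # number of query residues in the alignment
    percent_identity: float     # identity over the alignment, in percent


def passes_ko_criteria(hit: Hit, query_length: int) -> bool:
    """Apply the four thresholds listed on the slide to one hit."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    coverage = hit.aligned_query_length / query_length
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and coverage >= 0.70      # assumption: fraction of the query aligned
        and hit.percent_identity >= 30.0
    )


# Hypothetical hits for a 400-residue query.
good = Hit("subject_1", 1e-30, 1, 350, 45.0)
short = Hit("subject_2", 1e-30, 2, 200, 45.0)   # covers only 50% of the query
accepted = [h.subject for h in (good, short) if passes_ko_criteria(h, 400)]
```

Combining coverage and identity thresholds with the E-value guards against assigning a KO term on the basis of a single shared domain.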
23
ENZYME
24
Pathway collections: KEGG
Contains information about biochemical pathways and protein interactions.
http://www.kegg.com
25
Pathway collections: MetaCyc
26
Functional annotation
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf
27
RNA structural and functional annotation are coupled
SILVA alignments of rRNAs are used to generate models
Covariance models for each RNA class are used to predict genes
28
There is a plethora of specialized databases that one needs to search
http://www.oxfordjournals.org/nar/database/c
29
In most cases databases are interconnected, but not all databases are updated regularly.
Changes of annotation in one database are not reflected in others.
30
There are multiple data sources to help organize information and facilitate annotation
Sequence databases
  Contain sequences deposited by various sources
Protein classification databases
  Use sequence homology or other criteria to group proteins together
  COG, Pfam, TIGRfam, InterPro, KO terms
Specialized databases
  Start by searching for available resources
31
Questions?
Genome annotation (functional)
  How do we know it is correct?
  How do we do it?
Data collections
  Protein families
  Pathway collections