MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis Head of Omics group

Functional annotation: Datasources
Konstantinos Mavrommatis, Head of Omics group, DOE-JGI
Kmavrommatis@lbl.gov

Outline
Genome annotation (Functional)
• How do we know it is correct?
• How do we do it?
• Data collections
• Protein families
• Pathway collections

Genome annotation: The process of identifying the locations and functions of coding sequences.
cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
• molecular/enzymatic (methyltransferase)
• Reaction (methylation)
• Substrate (cobalt-precorrin-4)
• Ligand (S-adenosyl-L-methionine)
• metabolic (cobalamin biosynthesis)
• physiological (maintenance of healthy nerve and red blood cells, through B12)

Functional annotation helps make sense out of nonsense. But it only directs us to the potential of the organism.

Function prediction is mainly based on homology detection
• Homology implies a common evolutionary origin, not retention of similarity in any of their properties.
• Homology ≠ similarity of function.
• Function transfer by homology
[Figure: sequence alignment. Labels: conservative amino acid substitution, low complexity region, gap (insertion or deletion)]
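The alignment features named on this slide (identical positions, conservative substitutions, gaps) can be counted with a few lines of Python. This is a minimal sketch, not the scoring used by any particular search tool: the function name `alignment_stats` and the amino acid groups in `CONSERVATIVE_GROUPS` are assumptions made for illustration, and real programs rely on substitution matrices instead of fixed groups.

```python
# Illustrative groups of chemically similar residues (an assumption for this
# sketch; real tools use substitution matrices rather than fixed groups).
CONSERVATIVE_GROUPS = [
    set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ"),
]


def alignment_stats(aligned_a, aligned_b):
    """Count identities, conservative substitutions and gaps in a pairwise alignment.

    Both inputs are already-aligned strings of equal length, with '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    identities = conservative = gaps = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            identities += 1
        elif any(a in group and b in group for group in CONSERVATIVE_GROUPS):
            conservative += 1

    length = len(aligned_a)
    return {
        "length": length,
        "identities": identities,
        "positives": identities + conservative,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length if length else 0.0,
    }


# Toy alignment, made up for this example.
stats = alignment_stats("MKV-LD", "MRVALE")
print(stats)
```

A high identity value only measures sequence similarity; as the slide stresses, it does not by itself guarantee that the two proteins share a function.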

Function transfer based on homology is error prone
Punta & Ofran. PLOS Comp Biol. 2008

Limits in transfer of annotation based on homology
Punta & Ofran. PLOS Comp Biol. 2008

If no similarity is detected, use alternative methods to predict function
• Subcellular localization
• Gene context
• Special sequence motifs, features
[Figure labels: Cytoplasm, S~S, Periplasm]

Genome annotation
Annotation should make sense in the context of the cell metabolism
[Figure: model pathway. Labels: Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 1, Enzyme 2, Enzyme 3; "?" and "✓" marks]
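One way to check that an annotation "makes sense" in a metabolic context is to compare the enzymes a pathway needs with the functions actually annotated in the genome. The sketch below is only an illustration: the assignment of enzymes to steps in `MODEL_PATHWAY` is hypothetical (the slide figure does not survive in the transcript), and `missing_steps` is a name invented here.

```python
# Hypothetical linear pathway; which enzyme catalyzes which step is an
# assumption made for this example.
MODEL_PATHWAY = [
    {"substrate": "Substrate A", "product": "Substrate B", "enzyme": "Enzyme 1"},
    {"substrate": "Substrate B", "product": "Substrate C", "enzyme": "Enzyme 2"},
    {"substrate": "Substrate C", "product": "Substrate D", "enzyme": "Enzyme 3"},
]


def missing_steps(pathway, annotated_functions):
    """Return the pathway steps whose enzyme is absent from the annotation."""
    annotated = set(annotated_functions)
    return [step for step in pathway if step["enzyme"] not in annotated]


# Toy genome annotation in which Enzyme 2 was not found.
gaps = missing_steps(MODEL_PATHWAY, {"Enzyme 1", "Enzyme 3"})
for step in gaps:
    print(f'{step["substrate"]} -> {step["product"]}: {step["enzyme"]} not annotated')
```

A reported gap does not prove the gene is absent; it marks a step worth re-examining, since the gene may be present but missed or annotated under a different name.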

Annotation should make sense. Missing genes may be present.

Genome annotation: The process of identifying the locations and functions of coding sequences.
• Helps prediction.
• Is error prone.
• Has to make sense.

There are multiple datasources to help organize information and facilitate annotation
• Sequence databases
• Protein classification databases
• Specialized databases

Primary databases store raw information from various sources
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
• Archive containing all sequences from all sources
• GenBank/UniProt contain translations of sequences.

Year    Base pairs        Sequences
2004    44,575,745,176    40,604,319
2005    56,037,734,462    52,016,762
2006    69,019,290,705    64,893,747
2007    83,874,179,730    80,388,382
2008    99,116,431,942    98,868,465
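The growth figures on this slide can be turned into year-over-year growth factors and mean sequence lengths with simple arithmetic. The numbers below are copied from the slide; the names `SEQUENCE_DB_GROWTH`, `yearly_growth` and `mean_sequence_length` are introduced here for illustration only.

```python
# (year, base pairs, sequences) as listed on the slide.
SEQUENCE_DB_GROWTH = [
    (2004, 44_575_745_176, 40_604_319),
    (2005, 56_037_734_462, 52_016_762),
    (2006, 69_019_290_705, 64_893_747),
    (2007, 83_874_179_730, 80_388_382),
    (2008, 99_116_431_942, 98_868_465),
]


def yearly_growth(rows):
    """Return (year, base-pair growth factor, sequence growth factor) vs. the previous year."""
    growth = []
    for (_, bp_prev, seq_prev), (year, bp, seq) in zip(rows, rows[1:]):
        growth.append((year, bp / bp_prev, seq / seq_prev))
    return growth


def mean_sequence_length(rows):
    """Return {year: average number of base pairs per sequence record}."""
    return {year: bp / seq for year, bp, seq in rows}


for year, bp_factor, seq_factor in yearly_growth(SEQUENCE_DB_GROWTH):
    print(f"{year}: base pairs x{bp_factor:.2f}, sequences x{seq_factor:.2f}")
```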

Primary databases accumulate errors in sequences and annotations
• In the sequences themselves:
  - Sequencing errors.
  - Cloning vector sequences.
• In the annotations:
  - Inaccuracies, omissions, and even mistakes.
  - Inconsistencies between some fields.
• Redundancy.

IMG uses RefSeq as its primary source
[Figure labels: GenBank, UniGene, RefSeq, Genome Assembly, Labs, Curators, Algorithms]

Protein families use different methods to classify proteins
• COG/KOG
• Pfam
• TIGRfam
• KEGG Orthologs
• InterPro

What are COGs/KOGs? How much can I trust them?
[Figure labels: Reciprocal best hit, Bidirectional best hit, Blast best hit, Unidirectional best hit, COG1, COG2]

>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query: 4 SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
         FPK F +G + A FQ E +DW VWVHD I LVSGD PE ++
Sbjct: 3 KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query: 64 YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
          Y+ A +MGL+ R ++EWSRIFP
Sbjct: 61 YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV-------------------------- 94

Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
           N++ +R Y +F +LKARGI + YH+ LPLW+ P GW + +TV
Sbjct: 95 --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG----------GWENRETV 142

Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
           FAR+AA +F D + T NEPNVV GY+ G PP ++ + + +V +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
           AHA A A+K I+ K +GII + PL+DK D KA E A+ F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query: 301 --------------LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
           L + DL + +D+IG+NYY+ + VK + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
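The difference between a unidirectional and a bidirectional (reciprocal) best hit, named in the figure labels above, can be shown with a short Python sketch. It assumes the search results are already available as `(query, subject, bit_score)` tuples; the function names and the toy hit lists are made up for illustration, and this pairwise check is not the full procedure used to build COGs.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bit_score) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(hits_a_vs_b, hits_b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    best_ab = best_hits(hits_a_vs_b)
    best_ba = best_hits(hits_b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data: a1 and b1 are reciprocal best hits; a2 -> b2 is only unidirectional,
# because the best hit of b2 is a1.
genome_a_vs_b = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0)]
genome_b_vs_a = [("b1", "a1", 195.0), ("b2", "a1", 140.0), ("b2", "a2", 80.0)]

print(bidirectional_best_hits(genome_a_vs_b, genome_b_vs_a))
```

Requiring the hit in both directions is stricter than taking the plain best hit of each query, which is why unidirectional best hits deserve less trust when transferring a function.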

Pfam is based on the detection of domains
HMMs of protein alignments: local (for domains) or global (covering the whole protein)
http://pfam.sanger.ac.uk

TIGRfam
• Full length alignments.
• Domain alignments.
• Equivalogs: families of proteins with specific function.
• Superfamilies: families of homologous genes.
• HMMs
http://www.tigr.org/TIGRFAMs/

How can we search Pfam and TIGRfam?
Hits to other models

Query:       BChl_A  [M=357]
Accession:   PF02327.12
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Sequence                Description
    ------- ------ -----    ------- ------ -----   ---- --  --------                -----------
    0.00014   11.2   0.0    0.00024   10.5   0.0    1.2  1  tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1

Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA  Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
   #    score  bias  c-Evalue  i-Evalue hmmfrom  hmm to    alifrom  ali to    envfrom  env to     acc
 ---   ------ ----- --------- --------- ------- -------    ------- -------    ------- -------    ----
   1 !   10.5   0.0   1.1e-05   0.00024     217     273 ..     255     307 ..     240     321 .. 0.84

  Alignments for each domain:
  == domain 1  score: 10.5 bits;  conditional E-value: 1.1e-05
                  BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
                             fs+ g+v+si+ w l ++ + e gfr + iev v+g l v +d
  tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL 307
                             899999*********98877765....3569*********************99864 PP

GA  Gathering method: Search threshold to build the full alignment.
TC  Trusted Cutoff: Lowest sequence score and domain score of match in the full alignment.
NC  Noise Cutoff: Highest sequence score and domain score of match not in full alignment.
[Figure labels: Noise cutoff, Gathering cutoff, Trusted cutoff]
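The three cutoffs defined on this slide partition the bit-score axis, so a hit can be labeled by where its score falls. The sketch below follows the slide's definitions but simplifies them to a single sequence-level score; `classify_hit` is a name invented here, and the cutoff values in the example are placeholders, not the real cutoffs of PF02327 or any other model.

```python
def classify_hit(score, gathering, trusted, noise):
    """Label a bit score relative to a model's GA, TC and NC cutoffs.

    Assumes noise <= gathering <= trusted, as implied by the definitions above.
    """
    if not noise <= gathering <= trusted:
        raise ValueError("expected noise <= gathering <= trusted")
    if score >= trusted:
        return "at or above trusted cutoff"
    if score >= gathering:
        return "at or above gathering cutoff"
    if score > noise:
        return "between noise and gathering cutoffs"
    return "at or below noise cutoff"


# Placeholder cutoffs (hypothetical values, chosen only for the example).
GA, TC, NC = 25.0, 25.5, 24.5

# 10.5 bits is the domain score of the weak hit shown on the slide.
for score in (10.5, 24.8, 25.2, 80.0):
    print(score, "->", classify_hit(score, GA, TC, NC))
```

With these placeholder cutoffs, the 10.5-bit hit from the slide falls below the noise cutoff, which fits the slide's framing of it as a hit to another model and not a family member.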

InterPro. Composite pattern databases
• To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
• Release 30.0 (Dec10) contains 21178 entries
• Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/

KEGG orthology
Xizeng Mao et al. Bioinformatics, Volume 21 (2005), 3787-3793
• E-value < 10^-5
• rank ≤ 5
• ≥ 70% query length
• ≥ 30% identity
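The four thresholds on this slide can be applied as a simple filter over search hits. This is a hedged sketch: the `Hit` class and `passes_ko_criteria` are hypothetical names, and reading "≥ 70% query length" as the fraction of the query covered by the alignment is an assumption about what the slide means.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    evalue: float          # E-value of the hit
    rank: int              # position of the hit in the sorted hit list (1 = best)
    query_coverage: float  # aligned fraction of the query length, 0.0 to 1.0 (assumed meaning)
    identity: float        # percent identity of the alignment


def passes_ko_criteria(hit):
    """Apply the four thresholds listed on the slide."""
    return (
        hit.evalue < 1e-5
        and hit.rank <= 5
        and hit.query_coverage >= 0.70
        and hit.identity >= 30.0
    )


# Toy hits, made up for the example.
hits = [
    Hit(evalue=1e-40, rank=1, query_coverage=0.95, identity=52.0),  # passes
    Hit(evalue=1e-8, rank=2, query_coverage=0.50, identity=45.0),   # too short
    Hit(evalue=1e-3, rank=3, query_coverage=0.90, identity=35.0),   # E-value too high
]
print([passes_ko_criteria(h) for h in hits])
```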

ENZYME

Pathway collections: KEGG
• Contains information about biochemical pathways and protein interactions.
http://www.kegg.com

Pathway collections: MetaCyc

Functional annotation
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf

RNA structural and functional annotation are coupled
• SILVA alignments of rRNAs are used to generate models
• Covariance models for each RNA class are used to predict genes

There is a plethora of specialized databases that one needs to search
http://www.oxfordjournals.org/nar/database/c

In most cases databases are interconnected, but not all databases are updated regularly. Changes of annotation in one database are not reflected in others.

There are multiple datasources to help organize information and facilitate annotation
• Sequence databases
  - Contain sequences deposited by various sources
• Protein classification databases
  - Utilize sequence homology or other criteria to group together proteins
  - COG, Pfam, TIGRfam, InterPro, KO terms
• Specialized databases
  - Start by searching for available resources

Questions?
Genome annotation (Functional)
• How do we know it is correct?
• How do we do it?
• Data collections
• Protein families
• Pathway collections
