MGM workshop. 19 Oct 2010 Functional annotation Datasources Konstantinos Mavrommatis Head of Omics group


Outline
Genome annotation (Functional)
- How do we know it is correct?
- How do we do it?
- Data collections
- Protein families
- Pathway collections

Genome annotation: The process of identifying the locations and functions of coding sequences.
Example: cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
- molecular/enzymatic (methyltransferase)
- Reaction (methylation)
- Substrate (cobalt-precorrin-4)
- Ligand (S-adenosyl-L-methionine)
- metabolic (cobalamin biosynthesis)
- physiological (maintenance of healthy nerve and red blood cells, through B12).

Functional annotation helps make sense out of nonsense, but it only directs us to the potential of the organism.

Function prediction is mainly based on homology detection
- Homology
  - implies a common evolutionary origin.
  - does not imply retention of similarity in any of their properties.
- Homology ≠ similarity of function.
- Function transfer by homology
Alignment figure labels: conservative amino acid substitution; low complexity region; gap (insertion or deletion).
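The alignment features named on this slide (identical positions, gaps) can be counted directly from a pairwise alignment. The following is a minimal sketch, assuming the two sequences are already aligned, have equal length, and use `-` for gaps; the function name `alignment_stats` is hypothetical and is not part of any tool mentioned in the talk.

```python
def alignment_stats(query: str, subject: str) -> tuple[int, int, int]:
    """Return (identities, gap_columns, alignment_length) for two aligned strings.

    Assumption: both strings come from the same pairwise alignment,
    so they have equal length and use '-' as the gap character.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")

    identities = 0
    gap_columns = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gap_columns += 1      # insertion or deletion in one sequence
        elif q == s:
            identities += 1       # identical residue in both sequences
    return identities, gap_columns, len(query)


if __name__ == "__main__":
    ident, gaps, length = alignment_stats("HDE--IPG", "HDPENIAS")
    print(f"Identities = {ident}/{length}, Gaps = {gaps}/{length}")
```

A high identity fraction supports homology, but as the slide stresses, it does not by itself guarantee that the function is shared.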

Function transfer based on homology is error prone
(Punta & Ofran. PLOS Comp Biol. 2008)

Limits in transfer of annotation based on homology
(Punta & Ofran. PLOS Comp Biol. 2008)

If no similarity is detected, use alternative methods to predict function
- Subcellular localization
- Gene context
- Special sequence motifs and features
Figure labels: Cytoplasm; S ~ S; Periplasm.

Annotation should make sense in the context of the cell metabolism
Figure panels: "Model pathway" and "Genome annotation". Labels: Substrate A, Substrate B, Substrate C, Substrate D; Enzyme 1, Enzyme 2, Enzyme 3; "?" and "✓" marks.

Annotation should make sense. Missing genes may be present.
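One way to act on this is to compare the enzymes annotated in a genome against the steps of a reference pathway and flag the gaps. The sketch below is illustrative only: the linear order A → B → C → D via Enzymes 1 to 3 is an assumption about the model pathway figure, and the function names are hypothetical.

```python
# Assumed layout of the model pathway: (substrate, enzyme, product) per step.
MODEL_PATHWAY = [
    ("Substrate A", "Enzyme 1", "Substrate B"),
    ("Substrate B", "Enzyme 2", "Substrate C"),
    ("Substrate C", "Enzyme 3", "Substrate D"),
]


def missing_steps(pathway, annotated_enzymes):
    """Enzymes required by the pathway that the genome annotation lacks."""
    return [enzyme for _, enzyme, _ in pathway if enzyme not in annotated_enzymes]


def flanked_gaps(pathway, annotated_enzymes):
    """Missing steps whose upstream and downstream neighbours are both annotated.

    Such gaps are good candidates for a gene that is present in the genome
    but was missed or mis-annotated.
    """
    gaps = []
    for i in range(1, len(pathway) - 1):
        enzyme = pathway[i][1]
        before = pathway[i - 1][1]
        after = pathway[i + 1][1]
        if (enzyme not in annotated_enzymes
                and before in annotated_enzymes
                and after in annotated_enzymes):
            gaps.append(enzyme)
    return gaps


if __name__ == "__main__":
    annotated = {"Enzyme 1", "Enzyme 3"}
    print("Missing:", missing_steps(MODEL_PATHWAY, annotated))
    print("Likely overlooked:", flanked_gaps(MODEL_PATHWAY, annotated))
```

A flanked gap is only a hint: the gene may be absent, replaced by a non-homologous enzyme, or simply not called by the gene finder.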

Genome annotation: The process of identifying the locations and functions of coding sequences.
- Helps prediction.
- Is error prone.
- Has to make sense.

MGM workshop. 19 Oct 2010 There are multiple datasources to help organize information and facilitate annotation  Sequence databases  Protein classification databases  Specialized databases

Primary databases store raw information from various sources
EMBL/GenBank/DDBJ
- Archive containing all sequences from all sources
- GenBank/UniProt contain translations of sequences.

Year   Base pairs        Sequences
       ,575,745,176      40,604,
       ,037,734,462      52,016,
       ,019,290,705      64,893,
       ,874,179,730      80,388,
       ,116,431,942      98,868,465

Primary databases accumulate errors in sequences and annotations
- In the sequences themselves:
  - Sequencing errors.
  - Cloning vector sequences.
- In the annotations:
  - Inaccuracies, omissions, and even mistakes.
  - Inconsistencies between some fields.
- Redundancy.

IMG is using RefSeq as its primary source
Figure labels: GenBank; UniGene; RefSeq; Genome Assembly; Labs; Curators; Algorithms.

Protein families use different methods to classify proteins
- COG/KOG
- Pfam
- TIGRfam
- KEGG Orthologs
- InterPro

What are COGs/KOGs? How much can I trust them?
Figure labels: reciprocal best hit; bidirectional best hit; BLAST best hit; unidirectional best hit; COG1; COG2.

Example hit against COG2723:
>gnl|COG|2723 COG2723, BglB, Beta-glucosidase/6-phospho-beta-glucosidase/beta-galactosidase [Carbohydrate transport and metabolism].
Length = 460
Score = 388 bits (998), Expect = e-132
Identities = 176/503 (34%), Positives = 251/503 (49%), Gaps = 75/503 (14%)

Query: 4 SFPKSFRFGWSQAGFQSEMGTPGSEDPNTDWYVWVHDPENIASGLVSGDLPEHGPGYWGL 63
FPK F +G + A FQ E +DW VWVHD I LVSGD PE ++
Sbjct: 3 KFPKDFLWGGATAAFQVEGAWNEDGKGPSDWDVWVHDE--IPGRLVSGDPPEEASDFYHR 60

Query: 64 YRMFHDNAVKMGLDIARINVEWSRIFPKPMPDPPQGNVEVKGNDVLAVHVDENDLKRLDE 123
Y+ A +MGL+ R ++EWSRIFP
Sbjct: 61 YKEDIALAKEMGLNAFRTSIEWSRIFPNGDGGEV

Query: 124 AANQEAVRHYREIFSDLKARGIHFILNFYHWPLPLWVHDPIRVRKGDLSGPTGWLDVKTV 183
N++ +R Y +F +LKARGI + YH+ LPLW+ P GW + +TV
Sbjct: 95 --NEKGLRFYDRLFDELKARGIEPFVTLYHFDLPLWLQKPYG GWENRETV 142

Query: 184 INFARFAAYTAWKFDDLADEYSTMNEPNVVHSNGYMWVKSGFPPSYLNFELSRRVMVNLI 243
FAR+AA +F D + T NEPNVV GY+ G PP V +++
Sbjct: 143 DAFARYAATVFERFGDKVKYWFTFNEPNVVVELGYL--YGGHPPGIVDPKAAYQVAHHML 200

Query: 244 QAHARAYDAVKAISKK-PIGIIYANSSFTPLTDK--DAKAVELAEYDSRWIFFDAIIKGE 300
AHA A A+K I+ K +GII + PL+DK D KA E A+ F DA +KGE
Sbjct: 201 LAHALAVKAIKKINPKGKVGIILNLTPAYPLSDKPEDVKAAENADRFHNRFFLDAQVKGE 260

Query: LMGVTRDDL----KGRLDWIGVNYYSRTVVKLIGEKSYVSIPGYGY 342
L + DL + +D+IG+NYY+ + VK + GYG
Sbjct: 261 YPEYLEKELEENGILPEIEDGDLEILKENTVDFIGLNYYTPSRVK---AAEPRYVSGYGP 317
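The difference between a unidirectional and a bidirectional (reciprocal) best hit can be made concrete with a short sketch. It assumes the similarity search between two genomes has already been run in both directions and reduced to `(query, subject, score)` tuples; the input format and the function names are hypothetical, and this is not the actual COG construction procedure.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, score) tuples (assumed format).
    On ties, the first hit seen is kept.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    a_vs_b = [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)]
    b_vs_a = [("b1", "a1", 198), ("b2", "a1", 140), ("b2", "a2", 80)]
    # a2 -> b2 is only a unidirectional best hit, because b2 prefers a1.
    print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Unidirectional best hits such as `a2 -> b2` are the ones that deserve the most suspicion when an annotation is transferred.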

Pfam is based on the detection of domains
- HMMs of protein alignments: local (for domains) or global (covering the whole protein)

TIGRfam
- Full length alignments.
- Domain alignments.
- Equivalogs: families of proteins with specific function.
- Superfamilies: families of homologous genes.
- HMMs

How can we search Pfam and TIGRfam?

Example search output (hits to other models):
Query: BChl_A [M=357]
Accession: PF
Description: Bacteriochlorophyll A protein
Scores for complete sequences (score includes all domains):
--- full sequence best 1 domain --- -#dom-
E-value score bias E-value score bias exp N Sequence Description
tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1
Domain annotation for each sequence (and alignments):
>> tr|E0STV9|E0STV9_IGNAA Glycoside hydrolase family 1 OS=Ignisphaera aggregans (strain DSM)
# score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to acc
! e
Alignments for each domain:
== domain 1 score: 10.5 bits; conditional E-value: 1.1e-05
BChl_A 217 fshagsgvvdsisrwaelfpveklnkpasveagfrsdsqgievkvdgelpgvsvdag 273
fs+ g+v+si+ w l ++ + e gfr + iev v+g l v +d
tr|E0STV9|E0STV9_IGNAA 255 FSKKPIGIVESIASWIPLREGDR----EAAEKGFRYNLWPIEVAVNGYLDDVYRDDL
********* *********************99864 PP

Cutoffs:
- GA, Gathering method: Search threshold to build the full alignment.
- TC, Trusted Cutoff: Lowest sequence score and domain score of match in the full alignment.
- NC, Noise Cutoff: Highest sequence score and domain score of match not in full alignment.
Figure labels: Noise cutoff; Gathering cutoff; Trusted cutoff.
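The three cutoffs can be read as bands on the bit-score scale. The sketch below places a single score into one of those bands, assuming the ordering NC ≤ GA ≤ TC that follows from the definitions above. The function name, the band labels, and the threshold values in the example are hypothetical; real thresholds are model-specific and must be taken from the model itself.

```python
def classify_hit(score: float, ga: float, tc: float, nc: float) -> str:
    """Place a bit score relative to a model's NC, GA and TC cutoffs.

    Assumption: nc <= ga <= tc, as implied by the cutoff definitions.
    A single score is used here; the sequence score and the domain score
    would each be checked against their own thresholds in practice.
    """
    if not nc <= ga <= tc:
        raise ValueError("expected nc <= ga <= tc")
    if score >= tc:
        return "trusted"            # at or above the trusted cutoff
    if score >= ga:
        return "above_gathering"    # would be included in the full alignment
    if score > nc:
        return "below_gathering"    # between noise and gathering cutoffs
    return "noise"                  # at or below the noise cutoff


if __name__ == "__main__":
    # Hypothetical thresholds; 10.5 bits is the domain score shown in the example hit.
    print(classify_hit(10.5, ga=25.0, tc=25.4, nc=24.8))
```

With thresholds in this range, the 10.5-bit hit of a glycoside hydrolase to the BChl_A model would be discarded despite its small conditional E-value.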

InterPro: composite pattern databases
- To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
- Release 30.0 (Dec10) contains entries
- Central annotation resource, with pointers to its satellite dbs

KEGG orthology
Xizeng Mao et al. Bioinformatics, Volume 21 (2005)
Thresholds shown in the figure: E-value < 10^-5; rank ≤ 5; ≥ 70% query length; ≥ 30% identity
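The four thresholds on this slide can be applied as a simple filter over similarity-search hits. The sketch below is an illustration under stated assumptions: "≥ 70% query length" is read as the alignment covering at least 70% of the query, the `Hit` record and function names are hypothetical, and the KO identifiers in the example are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    ko: str                 # KO term attached to the subject gene
    evalue: float           # E-value of the hit
    rank: int               # 1 = best hit for this query
    identity: float         # percent identity over the alignment
    query_coverage: float   # percent of the query covered (assumed meaning)


def passes_ko_thresholds(hit: Hit) -> bool:
    """Apply the slide's thresholds: E < 1e-5, rank <= 5, >= 70% length, >= 30% identity."""
    return (hit.evalue < 1e-5
            and hit.rank <= 5
            and hit.query_coverage >= 70.0
            and hit.identity >= 30.0)


def assign_ko(hits: list[Hit]) -> Optional[str]:
    """Return the KO term of the best-ranked hit that passes all thresholds, if any."""
    passing = [h for h in hits if passes_ko_thresholds(h)]
    if not passing:
        return None
    return min(passing, key=lambda h: h.rank).ko


if __name__ == "__main__":
    hits = [
        Hit("KO_placeholder_1", evalue=1e-3, rank=1, identity=45.0, query_coverage=90.0),
        Hit("KO_placeholder_2", evalue=1e-50, rank=2, identity=34.0, query_coverage=85.0),
    ]
    print(assign_ko(hits))
```

Taking the best-ranked passing hit is one possible tie-breaking rule; other schemes (for example, requiring agreement among several passing hits) would fit the same thresholds.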

ENZYME

Pathway collections: KEGG
- Contains information about biochemical pathways and protein interactions.

Pathway collections: MetaCyc

Functional annotation

RNA structural and functional annotation are coupled
- SILVA alignments of rRNAs are used to generate models
- Covariance models for each RNA class are used to predict genes

There is a plethora of specialized databases that one needs to search

In most cases databases are interconnected, but not all databases are updated regularly. Changes of annotation in one database are not reflected in others.

MGM workshop. 19 Oct 2010 There are multiple datasources to help organize information and facilitate annotation  Sequence databases  Contain sequences deposited by verious sources  Protein classification databases  Utilize sequence homology or other criteria to group together proteins  COG, Pfam, TIGRfam, InterPro, KO terms  Specialized databases  Start by searching for available resources

Questions?
Genome annotation (Functional)
- How do we know it is correct?
- How do we do it?
- Data collections
- Protein families
- Pathway collections