Annotations, Subsystems based approach

Presentation transcript:

Annotations, Subsystems based approach Rob Edwards Argonne National Labs San Diego State University

The Problem (I)
The 1,000th genome will be sequenced soon.
[Figure: number of complete genomes (axis to 5,000) by year, 1996 to 2008]
www.nmpdr.org www.theseed.org

The Problem (II)
Growth of sequencing versus annotation.

The SEED and NMPDR
The SEED: 504 complete genomes
- Archaea: 30
- Bacteria: 445
- Eukarya: 29
NMPDR: 62 complete genomes
- Campylobacter: 10
- Listeria: 16
- Staphylococcus: 11
- Streptococcus: 14
- Vibrio

The questions
- How do you generate consistent annotations for 1,000 genomes?
- How do you measure consistent annotations for 1,000 genomes?

Basic biology
[Figure: the genes lacI, lacZ, lacY, lacA]

Different types of clustering
[Figure: two panels, each labeled < 80%]

Occurrence of clustering in different genomes
[Figure: fraction of genes in clusters (axis 0 to 1) and number of genomes (axis 0 to 120), by group. Series: clusters of genes w/ maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Groups: Aquificae, Chlamydiae, Chloroflexi, Deinococcus-Thermus, Firmicutes, Actinobacteria, Bacteroidetes, Cyanobacteria, Spirochaetes, Thermotogae, Proteobacteria, Average]

The Subsystems Approach to Annotation
- Subsystem: a generalization of “pathway”; a collection of functional roles jointly involved in a biological process or complex
- Functional role: the abstract biological function of a gene product
  - atomic, or user-defined
  - examples: 6-phosphofructokinase (EC 2.7.1.11); LSU ribosomal protein L31p; Streptococcal virulence factors
  - does not (usually) contain “putative”, “thermostable”, etc.
- Populated subsystem: the complete spreadsheet of functions and roles across organisms

Thermotoga: ~40% of genes covered in 304 different subsystems

Subsystems developed based on
- Wet lab
- Chromosomal context
- Metabolic context
- Phylogenetic context
- Microarray data
- Proteomics data
- …

How do we measure annotations?

Natural Metrics
- Number of subsystems defined
- Number of functional roles defined
- Number of genes connected to functional roles
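The three counts above can be sketched over a populated subsystem, treated here as a nested table of subsystem, functional role, genome, and genes. This is only an illustrative sketch: the layout of `subsystems`, the subsystem names, and the genome and gene identifiers are assumptions made for the example, not the SEED's actual data model. The two role names are taken from the slides.

```python
# Hypothetical populated subsystems: subsystem -> functional role -> genome -> gene IDs.
# Subsystem names, genome names, and gene IDs are invented for illustration.
subsystems = {
    "Histidine biosynthesis": {
        "Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)": {
            "genome_A": ["gene_A1"],
            "genome_B": ["gene_B7"],
        },
    },
    "Glycolysis": {
        "6-phosphofructokinase (EC 2.7.1.11)": {
            "genome_A": ["gene_A2", "gene_A3"],
            "genome_B": [],  # role not found in this genome
        },
    },
}


def natural_metrics(subsystems):
    """Count subsystems, distinct functional roles, and distinct genes connected to a role."""
    roles = set()
    genes = set()
    for role_table in subsystems.values():
        for role, by_genome in role_table.items():
            roles.add(role)
            for genome, gene_ids in by_genome.items():
                # A gene is identified by (genome, gene ID) so IDs may repeat across genomes.
                genes.update((genome, gene_id) for gene_id in gene_ids)
    return {
        "subsystems": len(subsystems),
        "functional_roles": len(roles),
        "genes_connected": len(genes),
    }


metrics = natural_metrics(subsystems)
print(metrics)
```

Using sets means a role or gene that appears in more than one subsystem is counted once, which is one possible reading of "number of genes connected to functional roles".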

Annotations for NMPDR Genomes

Applied Metrics
- Number of solid connections of gene to functional role, where “solid” is:
  - supported by experimental data
  - connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
  - only gene in genome connected to a functional role in an active variant of a subsystem
- Reactions, GO terms, articles, cross-references to other databases (number and diversity)

Applied Metrics

The Importance of Consistency
- Consistency: same genes connected to same functional role
- Enables communication
- Required for most comparative genomics assays

hisA
FIG function:
- Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC 5.3.1.16)
Other functions in RefSeq:
- phosphoribosylformimino-5-aminoimidazole carboxamide
- phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
- phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
- 1-(5-phosphoribosyl)-5-[(5- phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
- N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5- phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'- phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
- N-(5'-phospho-L-ribosyl-formimino)-5-amino-1- (5'-phosphoribosyl)-4- imidazolecarboxamide isomerase
- Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]

Measuring Consistency
- Define a set of protein families such that each family contains genes playing the same function
- Attach functional roles to protein families
- Measure the consistency of the annotations made to genes within each family
  - "consistency" is the odds that two proteins from the same family have the same function
- Evaluate both families and functions.
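One way to make the consistency definition above concrete is to read "odds" as a probability: the fraction of distinct protein pairs within a family that carry the same annotation. The sketch below takes that reading as an assumption; the function names, the family names, and the exact-string comparison of annotations are illustrative choices, not the method used by the SEED or NMPDR.

```python
from collections import Counter
from math import comb


def family_consistency(annotations):
    """Fraction of distinct protein pairs in one family with identical annotation strings.

    Returns None for families with fewer than two proteins, where no pair exists.
    """
    n = len(annotations)
    if n < 2:
        return None
    same_pairs = sum(comb(count, 2) for count in Counter(annotations).values())
    return same_pairs / comb(n, 2)


def database_consistency(families):
    """Pool pairs over all families: matching pairs divided by all within-family pairs."""
    same_pairs = 0
    all_pairs = 0
    for annotations in families.values():
        same_pairs += sum(comb(count, 2) for count in Counter(annotations).values())
        all_pairs += comb(len(annotations), 2)
    return same_pairs / all_pairs if all_pairs else None


# Hypothetical families; the annotation strings are placeholders.
families = {
    "family_1": ["role X", "role X", "role X", "role Y"],
    "family_2": ["role Z", "role Z"],
}
per_family = {name: family_consistency(a) for name, a in families.items()}
overall = database_consistency(families)
```

Because annotations are compared as exact strings, spelling variants of one function (such as the RefSeq names listed for hisA) count as disagreements, which is the kind of inconsistency the metric is meant to expose.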

Consistency among databases

Number of RefSeq proteins in families

How to measure accuracy
- If everything were called “hypothetical protein”, the database would be 100% consistent
- Need to measure accuracy (specificity) as well as consistency
- Sample 100 proteins at random from the “curated” set (i.e. those that are believed to be correct)
- Manually inspect annotations to score correctness
- TIGR/SEED joint project for annotation consistency server
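The sampling step above might be sketched as follows. The manual inspection is represented by a caller-supplied `is_correct` function, which is a stand-in for a human curator's judgement. The function name, the protein IDs, and the fixed random seed are assumptions for illustration only.

```python
import random


def estimate_accuracy(curated_ids, is_correct, sample_size=100, seed=0):
    """Draw a random sample without replacement and return (fraction correct, sample).

    `is_correct` stands in for manual inspection of one protein's annotation.
    """
    ids = list(curated_ids)
    if not ids:
        raise ValueError("curated set is empty")
    rng = random.Random(seed)  # fixed seed so the same sample can be re-inspected
    sample = rng.sample(ids, min(sample_size, len(ids)))
    n_correct = sum(1 for protein_id in sample if is_correct(protein_id))
    return n_correct / len(sample), sample


# Hypothetical curated set of 500 protein IDs; every annotation is scored correct here.
curated = [f"protein_{i}" for i in range(500)]
accuracy, sample = estimate_accuracy(curated, is_correct=lambda protein_id: True)
```

A sample of 100 gives only a rough estimate of accuracy, so the result is best read alongside the consistency figures and not as a precise rate.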

Acknowledgements
SEED: http://www.theseed.org/
NMPDR: http://www.nmpdr.org/
RAST: http://www.nmpdr.org/anno-server
metaRAST: http://metagenomics.theseed.org/
ANL: Rick Stevens, Bob Olsen, Folker Meyer, Daniela Bartels, Tobi Paczian, Daniel Paarmann, Terry Disz
FIG: Veronika Vonstein, Ross Overbeek, Annotators