Annotations, Subsystems based approach Rob Edwards Argonne National Labs San Diego State University.

Presentation transcript:

Annotations, Subsystems based approach Rob Edwards Argonne National Labs San Diego State University

The Problem (I)
[Figure: number of complete genomes by year]
The 1,000th genome will be sequenced soon.

The Problem (II) Growth of sequencing versus annotation.

The SEED and NMPDR
The SEED: 504 complete genomes
– Archaea: 30
– Bacteria: 445
– Eukarya: 29
NMPDR: 62 complete genomes
– Campylobacter: 10
– Listeria: 16
– Staphylococcus: 11
– Streptococcus: 14
– Vibrio: 11

The questions:
How do you generate consistent annotations for 1,000 genomes?
How do you measure consistent annotations for 1,000 genomes?

Basic biology
[Figure: lacZ, lacI, lacY, lacA]

Different types of clustering < 80 %

Occurrence of clustering in different genomes
[Figure: per-group values for Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, Proteobacteria, and the average. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group. Axes: fraction of genes in clusters; number of genomes.]

The Subsystems Approach to Annotation
Subsystem is a generalization of “pathway”
– collection of functional roles jointly involved in a biological process or complex
Functional Role is the abstract biological function of a gene product
– atomic, or user-defined
– examples: 6-phosphofructokinase (EC ); LSU ribosomal protein L31p; Streptococcal virulence factors
– does not (usually) contain “putative”, “thermostable”, etc.
Populated subsystem is the complete spreadsheet of functions and roles across organisms
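The "spreadsheet" view of a populated subsystem can be sketched as a small data structure: one column per functional role, one row per genome, and gene identifiers in the cells. This is only an illustrative sketch, not the SEED's actual data model; the class name `Subsystem`, its methods, and the genome and gene identifiers in the usage note are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Subsystem:
    """A toy subsystem: a named set of functional roles plus gene assignments."""

    name: str
    roles: list[str]
    # genome -> functional role -> gene identifiers playing that role
    rows: dict[str, dict[str, list[str]]] = field(default_factory=dict)

    def assign(self, genome: str, role: str, gene: str) -> None:
        """Connect a gene in a genome to one of this subsystem's roles."""
        if role not in self.roles:
            raise ValueError(f"{role!r} is not a role in subsystem {self.name!r}")
        self.rows.setdefault(genome, {}).setdefault(role, []).append(gene)

    def spreadsheet(self) -> list[list[str]]:
        """Return the populated subsystem as rows of cells (header first)."""
        table = [["genome"] + self.roles]
        for genome in sorted(self.rows):
            cells = [
                ", ".join(self.rows[genome].get(role, [])) or "-"
                for role in self.roles
            ]
            table.append([genome] + cells)
        return table
```

A usage example with made-up identifiers: after `s = Subsystem("toy", ["role 1", "role 2"])` and `s.assign("genome A", "role 1", "gene_17")`, calling `s.spreadsheet()` gives a header row followed by one row for `genome A`, with `-` marking the role that has no gene assigned.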

Thermotoga: ~40% of genes covered in 304 different subsystems

Subsystems developed based on
– Wet lab
– Chromosomal context
– Metabolic context
– Phylogenetic context
– Microarray data
– Proteomics data
– …

How do we measure annotations?

Natural Metrics
– Number of subsystems defined
– Number of functional roles defined
– Number of genes connected to functional roles
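The three natural metrics are plain counts, so they can be sketched in a few lines. The input format here, a list of `(gene, functional_role, subsystem)` tuples, is an assumption made for illustration. Because it only sees connections, this sketch cannot count subsystems or roles that have been defined but have no genes attached.

```python
def natural_metrics(connections):
    """Count distinct subsystems, functional roles, and connected genes.

    `connections` is an iterable of (gene, functional_role, subsystem)
    tuples; this layout is hypothetical, chosen only for the example.
    """
    connections = list(connections)
    return {
        "subsystems": len({subsystem for _, _, subsystem in connections}),
        "functional_roles": len({role for _, role, _ in connections}),
        "connected_genes": len({gene for gene, _, _ in connections}),
    }
```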

Annotations for NMPDR Genomes

Applied Metrics
Number of solid connections of gene to functional role, where “solid” is
1. supported by experimental data
2. connected to functional role and in chromosomal cluster with genes implementing functional roles from the same subsystem
3. only gene in genome connected to a functional role in an active variant of a subsystem
Reactions, GO terms, Articles, Other database cross-references (number and diversity)
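One way to read the three "solid" criteria is as a filter over gene-to-role connections. The sketch below assumes that meeting any one criterion is enough; the slide does not say whether they combine with "or" or "and". The `Connection` record and its boolean fields are hypothetical stand-ins for evidence that would come from curation and genome context.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Connection:
    """A gene connected to a functional role (all field names are illustrative)."""

    gene: str
    genome: str
    role: str
    subsystem: str
    has_experimental_evidence: bool = False      # criterion 1
    clustered_with_same_subsystem: bool = False  # criterion 2
    subsystem_variant_active: bool = False       # needed for criterion 3


def solid_connections(connections):
    """Return the connections that satisfy at least one 'solid' criterion."""
    connections = list(connections)
    # How many genes in each genome are connected to each role of each subsystem.
    per_role = Counter((c.genome, c.subsystem, c.role) for c in connections)
    solid = []
    for c in connections:
        only_gene_for_role = per_role[(c.genome, c.subsystem, c.role)] == 1
        if (
            c.has_experimental_evidence
            or c.clustered_with_same_subsystem
            or (only_gene_for_role and c.subsystem_variant_active)
        ):
            solid.append(c)
    return solid
```

The applied metric is then `len(solid_connections(connections))`.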

Applied Metrics

The Importance of Consistency
Consistency: same genes connected to same functional role
– Enables communication
– Required for most comparative genomics assays

hisA
FIG function: Phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase (EC )
Other functions in RefSeq:
– phosphoribosylformimino-5-aminoimidazole carboxamide
– phosphoribosylformimino-5-aminoimidazole carboxamide ribotide isomerase
– phosphoribosylformimino-5-aminoimidazole carboxamide ribotide...
– 1-(5-phosphoribosyl)-5-[(5-phosphoribosylamino)methylideneamino] imidazole-4-carboxamide isomerase
– N-(5-phospho-L-ribosyl-formimino)-5-amino-1-(5-phosphoribosyl)-4-imidazolecarboxamide isomerase
– N-(5'-phospho-L-ribosyl-formimino)-5-amino-1-(5'-phosphoribosyl)-4-imidazolecarboxamide isomerase
– Phosphoribosyl isomerase A [1-[5-phosphoribosyl]-5-[[5-phosphoribosylamino]methylideneamino] imidazole-4-carboxamide isomerase]

Measuring Consistency
Define a set of protein families such that each family contains genes playing the same function
Attach functional roles to protein families
Measure the consistency of the annotations made to genes within each family
1. “consistency” is the odds that two proteins from the same family have the same function
2. Evaluate both families and functions
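A minimal sketch of this measure, reading "odds" as the probability that two distinct proteins drawn from one family carry the same function string. That reading is an assumption, as are the function names and the choice to pool pairs across families. Function strings are compared exactly, so spelling variants such as the hisA names count as different functions.

```python
from collections import Counter
from math import comb


def family_consistency(functions):
    """Probability that two distinct proteins in one family share a function.

    `functions` holds one annotation string per protein in the family.
    Returns None for families with fewer than two proteins (no pairs).
    """
    n = len(functions)
    if n < 2:
        return None
    same_pairs = sum(comb(k, 2) for k in Counter(functions).values())
    return same_pairs / comb(n, 2)


def pooled_consistency(families):
    """Same probability, pooled over all within-family pairs.

    `families` maps a family identifier to its list of annotation strings.
    """
    same_pairs = total_pairs = 0
    for functions in families.values():
        same_pairs += sum(comb(k, 2) for k in Counter(functions).values())
        total_pairs += comb(len(functions), 2)
    return same_pairs / total_pairs if total_pairs else None


print(family_consistency(["A", "A", "A", "B"]))               # 0.5
print(family_consistency(["hypothetical protein"] * 4))       # 1.0
```

In the first example, 3 of the 6 possible pairs agree. The second shows that a family annotated uniformly with an uninformative label still scores as perfectly consistent.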

Consistency among databases

Number of RefSeq proteins in families

How to measure accuracy
If everything was called “hypothetical protein” the database would be 100% consistent
Need to measure accuracy (specificity) as well as consistency
– Sample 100 proteins at random from “curated” set (i.e. that are believed to be correct)
– Manually inspect annotations to score correctness
TIGR/SEED joint project for annotation consistency server
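The sampling step can be sketched as follows: draw a reproducible random sample from the curated set, then compute the fraction that manual inspection scored as correct. The function names, the fixed seed, and the `{protein_id: bool}` score format are assumptions for illustration. The manual scoring itself happens outside the code.

```python
import random


def sample_for_review(curated_ids, k=100, seed=0):
    """Pick up to k protein identifiers at random, without replacement."""
    ids = sorted(curated_ids)  # sort so that a given seed gives the same sample
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


def accuracy(scores):
    """Fraction of reviewed annotations judged correct.

    `scores` maps protein identifier -> True (correct) or False (incorrect),
    as recorded by a human reviewer.
    """
    if not scores:
        raise ValueError("no reviewed annotations to score")
    return sum(1 for ok in scores.values() if ok) / len(scores)
```

With a sample of 100, each reviewed protein moves the estimate by one percentage point, so the result is a coarse estimate of accuracy rather than a precise figure.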

Acknowledgements
FIG: Veronika Vonstein, Ross Overbeek, Annotators
ANL: Rick Stevens, Bob Olsen, Folker Meyer, Daniela Bartels, Tobi Paczian, Daniel Paarmann, Terry Disz
SEED: NMPDR: RAST: metaRAST: