Tools for comparative genomics and expert annotations.


Goals of this Presentation
- Introduce microbiologists to the power of NMPDR and SEED
- Enable users to interact with data
- Invite experts to participate in construction of subsystems
- Capture expert annotations via the annotation clearinghouse

What is NMPDR?
- Beautified, read-only version of the SEED

What is the SEED?
- Editable environment for assignment of function in the context of systems biology
- Intended to clean up the legacy of errors created by similarity-based, automated assignment of function
- Manual assignment of function based on integrated evidence: sequence similarity, functional clusters, phylogenetic and metabolic profiles
- Developed for the project to annotate 1000 genomes

When Will We Have 1000 Complete Genomes?
- Depends on what is meant by “complete”
  - Many sequencing projects will stop without “finishing” or “closing” the genome in one contiguous sequence for each replicon
- A genome is essentially complete when:
  - % of genome accurately sequenced
  - 10X coverage by 454 method; 5X coverage by Sanger method
  - Assembly places 70% of data in contigs of at least 20 kbp
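The assembly criterion on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming that "data" means assembled bases and that contig lengths are already available as a list of integers. The function names are hypothetical. The accuracy and coverage criteria are not modeled here.

```python
def fraction_in_long_contigs(contig_lengths, min_length=20_000):
    """Return the fraction of assembled bases found in contigs >= min_length."""
    total = sum(contig_lengths)
    if total == 0:
        return 0.0
    long_total = sum(n for n in contig_lengths if n >= min_length)
    return long_total / total


def meets_assembly_criterion(contig_lengths, min_fraction=0.70, min_length=20_000):
    """True when at least min_fraction of the bases sit in long contigs."""
    return fraction_in_long_contigs(contig_lengths, min_length) >= min_fraction


# Illustrative contig lengths in base pairs (made-up numbers).
lengths = [50_000, 30_000, 15_000, 5_000]
fraction = fraction_in_long_contigs(lengths)
passes = meets_assembly_criterion(lengths)
```

With these made-up lengths, 80,000 of 100,000 bases are in contigs of at least 20 kbp, so the assembly passes the 70% threshold.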

Bacterial Genome Facts
- First two complete genomes in 1995 were bacterial pathogens
- 2913 genomes started as of Sept., 2007
  - 63% of total are bacteria; 50% of bacteria are pathogens
- 4434 genomes started as of January, 2009
  - 51% bacteria
- Value depends on accuracy of annotation

Complete Genome Projects

What is an Annotation?
- Identification of a nucleotide string that could potentially encode a protein
  - Open reading frames (ORFs) computed from stop and start codons, codon bias, promoters and RBS
- Assignment of a name to that gene
  - Usually that of the known protein with the most similar sequence, computed from translated BLAST
- Prediction of functional role for that gene
  - Function of most similar protein not always established with experimental evidence
  - Most similar protein may not have known function
  - Most similar ORF may or may not be expressed
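The first step above, finding candidate ORFs from start and stop codons, can be illustrated with a deliberately simplified scan. This sketch makes several assumptions: it looks only at the forward strand, treats ATG as the only start codon, and ignores codon bias, promoters and ribosome binding sites, all of which real gene callers use. The function name `find_orfs` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for ATG...stop stretches on the forward strand.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs


toy_sequence = "CCATGAAATTTTAGGG"  # made-up sequence for illustration
orfs = find_orfs(toy_sequence)
```

In the toy sequence the only ORF is `ATGAAATTTTAG`, found in the third reading frame.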

Problems with Standard Annotations
- 42% of H. influenzae ORFs assigned no function in 1995
  - about half of those had no sequence match in GenBank
  - the rest matched “hypothetical proteins” in E. coli
- 58% of H. influenzae ORFs assigned function of a significantly similar sequence
- What was in GenBank to compare with in 1995?
  - 7% of all GenBank entries were bacterial, 16% of those, E. coli
  - many “conserved hypotheticals” added to database
- Paralogous members of protein families may not be properly discriminated
- Significantly similar enzymes may act on different substrates
- Assignments are transitive, many times removed from experimental data

Subsystems Annotations vs. Pipelines or Protein Families
- What is subsystems annotation?
  - humans integrating evidence within a comparative framework
- What’s wrong with “genome-at-a-time” pipelines?
  - automated assignment of archived annotations to new genomes
  - propagates uninformative and incorrect annotations
- What’s wrong with annotation based on protein families?
  - emphasizes structural and phylogenetic evidence
  - ignores metabolic and chromosomal contexts
  - leads to ambiguity for members of large families, e.g. transporters

What is a Subsystem?
- Subsystem is a generalization of pathway
  - Collection of functional roles jointly involved in a biological process or complex
  - metabolic, signaling, regulatory, structural
- Functional role is the abstract biological function of a gene product
  - Atomic or fundamental; examples:
    - 6-phosphofructokinase (EC )
    - LSU ribosomal protein L31p
    - cell division protein FtsZ
- Inclusion of gene in subsystem is only by functional role
- Controlled vocabulary …

Expert-Defined Subsystems
- Curator is a researcher with first-hand knowledge of the biological system
- Functional roles defined and grouped into subsystem and subsets by curator
  - universal groups of roles include all organisms
  - functional variants are subsets of roles found in a limited number of organisms
  - often represent alternative paths or nonorthologous replacement
- Semi-automated assignment of function based on manual groundwork, sequence homology, and functional clustering

Subsystem Primer
- Describe your subsystem in 150 words or fewer: why should these functions be considered together?
  - define the emergent properties of the system
- Provide or link to a diagram that illustrates this subsystem
  - define the graph or network
- List the reactions or relationships between these functional roles
  - define the edges
- List the exact names and abbreviations of these functional roles
  - define the nodes
- List the ID numbers (GenBank, SwissProt, or any identifying alias) of genes that play these roles in one or more exemplar genomes
  - examples of nodes
- Provide one or more references that support the assignment of function for the exemplar genes
  - provide evidence

Populated Subsystems
- Two-dimensional integration of functional roles with genomes
- Spreadsheet
  - Columns of functional roles
  - Rows of organisms
  - Cells of annotated genes
- Table of functional roles with GO terms
- Diagram
- Curator notes and citations

Simple Example: Histidine Degradation Subsystem
- Conversion of histidine to glutamate is organizing principle
- Functional roles defined in table:

Subsystem Diagram
- Three functional variants
- Universal subset has three roles, followed by three alternative paths from IV to VI

Subsystem Spreadsheet
- Column headers taken from table of functional roles
- Rows are selected genomes, or organisms
- Cells are populated with specific, annotated genes
- Shared background color indicates proximity of genes
- Functional variants defined by the annotated roles
- Variant code -1 indicates subsystem is not functional

Role columns: HutH, HutU, HutI, GluF, HutG, NfoD, ForI

Organism                       Variant  Annotated genes (in the order listed)
Bacteroides thetaiotaomicron   1        Q8A4B3  Q8A4A9  Q8A4B1  Q8A4B0
Desulfotalea psychrophila      1        gi  gi  gi  gi
Halobacterium sp.              2        Q9HQD5  Q9HQD8  Q9HQD6  Q9HQD7
Deinococcus radiodurans        2        Q9RZ06  Q9RZ02  Q9RZ05  Q9RZ04
Bacillus subtilis              2        P10944  P25503  P42084  P42068
Caulobacter crescentus         3        P58082  Q9A9MI  P58079  Q9A9M0
Pseudomonas putida             3        Q88CZ7  Q88CZ6  Q88CZ9  Q88D00
Xanthomonas campestris         3        Q8PAA7  P58988  Q8PAA6  Q8PAA8
Listeria monocytogenes
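The spreadsheet layout (organisms as rows, functional roles as columns, annotated genes in the cells) maps naturally onto a nested dictionary. The sketch below is illustrative only. The organism names and gene IDs are placeholders. The rule linking roles to variant codes is an assumption drawn from the talk's statement that GluF and HutG are required for variants 1 and 2 and that NfoD indicates variant 3. It is not the curators' actual procedure.

```python
ROLES = ["HutH", "HutU", "HutI", "GluF", "HutG", "NfoD", "ForI"]
UNIVERSAL = ("HutH", "HutU", "HutI")                # roles shared by all variants
VARIANT_MARKER = {"GluF": 1, "HutG": 2, "NfoD": 3}  # assumed role -> variant code


def infer_variant(row):
    """Guess a variant code from the roles annotated for one organism.

    Returns -1 (subsystem not functional) when a universal role is missing
    or when no variant-specific role is annotated.
    """
    if not all(row.get(role) for role in UNIVERSAL):
        return -1
    for role, code in VARIANT_MARKER.items():
        if row.get(role):
            return code
    return -1


# Rows are organisms, keys are functional roles, values are gene IDs.
# All names below are placeholders, not real accessions.
spreadsheet = {
    "organism_A": {"HutH": "a1", "HutU": "a2", "HutI": "a3", "GluF": "a4"},
    "organism_B": {"HutH": "b1", "HutU": "b2", "HutI": "b3", "HutG": "b4"},
    "organism_C": {"HutH": "c1", "HutU": "c2", "HutI": "c3", "NfoD": "c4"},
    "organism_D": {"HutH": "d1"},
}

variants = {org: infer_variant(row) for org, row in spreadsheet.items()}
```

Keeping the cells keyed by role name rather than by column position makes it easy to spot empty cells, which is how a missing role such as ForI becomes visible.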

Missing Genes Noticed by Subsystems Annotation
- No genes were annotated “ForI (EC ) Formiminoglutamic iminohydrolase” when the Histidine Degradation subsystem was populated
- Organisms missing ForI convert His to Glu
- Candidate genes that could perform the role “ForI” must be identified
- Strategy for finding genes is based on chromosomal clustering and occurrence profiling

Finding Genes that Cluster with NfoD
- Red gene in graphic and table is NfoD of Xanthomonas
- Genes pictured in gray boxes are located near NfoD in four or more species
- Advanced controls expand the display of homologous regions in other genomes
- Functional Coupling score links to a table of homologous pairs in other genomes
- Cluster button finds the biggest clusters in other species when the gene is not clustered in the subject genome

What are Pinned Regions?
- Focus gene is number 1, colored red
- Most frequently co-localized homolog numbered 2, colored green
- Sets of homologous genes presented in the same color with the same numerical label; BLASTP cut-off e-value = 1e-20
- Numerical labels correspond to rank-ordered frequency of co-localization with the focus gene
- Number of regions, size of region, and cut-off can be reset by the user
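The numbering scheme described here, where labels follow the rank-ordered frequency of co-localization with the focus gene, can be sketched as a simple counting step. The example assumes that homologous genes have already been grouped into families (for instance by the BLASTP cut-off mentioned above) and that each genome's region is given as a list of family names. The function name and input layout are hypothetical, not the NMPDR implementation.

```python
from collections import Counter


def rank_colocalized_families(regions, focus_family):
    """Assign numeric labels by how often each family sits near the focus gene.

    regions maps genome name -> list of family names found in the window
    around that genome's copy of the focus gene. The focus family gets
    label 1; other families are numbered from 2 by descending frequency.
    """
    counts = Counter()
    for families in regions.values():
        if focus_family in families:
            # Count each family once per genome.
            counts.update(set(families) - {focus_family})
    # Ties are broken alphabetically so the labels are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    labels = {focus_family: 1}
    for rank, (family, _count) in enumerate(ranked, start=2):
        labels[family] = rank
    return labels


# Made-up regions from three genomes.
regions = {
    "genome_1": ["NfoD", "HutC", "HutH"],
    "genome_2": ["NfoD", "HutC"],
    "genome_3": ["NfoD", "HutC", "HutH", "other"],
}
labels = rank_colocalized_families(regions, "NfoD")
```

In this toy input HutC appears next to NfoD in all three genomes, so it receives label 2, matching the "most frequently co-localized homolog" rule on the slide.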

Candidate ForI in Context with NfoD
- Compare Regions around NfoD, red, center
  - HutC, the regulator, is green, 2
  - HutH, the first functional role in the subsystem, is blue, 4
  - Candidate ForI is teal, 6, originally annotated as “conserved hypothetical”

Annotation of ForI EC
- Metabolic context proves need for role
  - Organisms missing annotated ForI degrade His to Glu
- Chromosomal context points to candidate
  - Clusters with NfoD and other genes in subsystem
- Occurrence context supports candidate
  - Organisms containing NfoD lack GluF and HutG, required for functional variants 1 and 2, respectively
  - Organisms containing candidate ForI also contain NfoD, indicating functional variant 3
- Phylogenetic trees of candidate ForI genes are coherent
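The occurrence argument above amounts to two set checks: every organism with the candidate also has NfoD, and no organism with NfoD has GluF or HutG. A minimal sketch, assuming presence/absence data is available as a mapping from organism to the set of roles found in its genome. The organism names, the `candidate` label and the function name are hypothetical.

```python
def occurrence_support(presence, candidate, partner, excluded):
    """Check whether occurrence profiles are consistent with the candidate.

    presence maps organism -> set of roles present in that genome.
    Returns True when (1) at least one organism has the candidate,
    (2) every organism with the candidate also has the partner role, and
    (3) no organism with the partner role has any of the excluded roles.
    """
    with_candidate = {org for org, roles in presence.items() if candidate in roles}
    if not with_candidate:
        return False
    co_occurs = all(partner in presence[org] for org in with_candidate)
    with_partner = {org for org, roles in presence.items() if partner in roles}
    exclusive = all(not (presence[org] & set(excluded)) for org in with_partner)
    return co_occurs and exclusive


# Made-up presence/absence profiles.
presence = {
    "org_1": {"HutH", "HutU", "HutI", "NfoD", "candidate"},
    "org_2": {"HutH", "HutU", "HutI", "NfoD", "candidate"},
    "org_3": {"HutH", "HutU", "HutI", "GluF"},
    "org_4": {"HutH", "HutU", "HutI", "HutG"},
}
supported = occurrence_support(presence, "candidate", "NfoD", ["GluF", "HutG"])
```

A consistent profile only supports the candidate; as the slides note, the metabolic, chromosomal and phylogenetic evidence are weighed together.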

Subsystems Allow Bioinformatics to Inform Bench Research
- Subsystems point to missing or alternative genes
- Bioinformatic predictions need to be tested at the bench
- ForI candidate now verified experimentally
- Connections forged between bench and bioinformatics

How is NMPDR distinct from NCBI?
- Corrected, functional annotations, manually curated in context of systems biology
- Multiple starting points for accessing data
  - gene or protein name, subsystem, organism
- Search results downloadable as names or sequences
- Interactive tools for comparative analysis
  - Compare regions: adjust size of region, number of genomes
  - Subsystems: browse phylogenetic distribution of biological system; color spreadsheet and diagram
  - Functional clusters: find genes with conserved proximity
  - BLASTP Hits: select and align interesting sequences
  - Signature genes: find genes in common or that distinguish user-selected groups of genomes; groups may contain one or many
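The idea behind the signature genes tool, finding genes shared by or distinguishing two groups of genomes, reduces to set operations once genes are expressed as comparable identifiers. This sketch assumes each genome is represented as a set of protein family names. The function name and the exact definitions of "common" and "distinguishing" are assumptions for illustration, not the tool's documented behavior.

```python
def signature_genes(group_a, group_b):
    """Compare two groups of genomes, each a dict of genome -> set of families.

    in_common: families present in every genome of both groups.
    only_a:    families in every genome of group A and in no genome of group B.
    only_b:    families in every genome of group B and in no genome of group A.
    """
    core_a = set.intersection(*group_a.values())
    core_b = set.intersection(*group_b.values())
    any_a = set.union(*group_a.values())
    any_b = set.union(*group_b.values())
    return {
        "in_common": core_a & core_b,
        "only_a": core_a - any_b,
        "only_b": core_b - any_a,
    }


# Made-up family content; a group may hold one genome or many.
group_a = {"genome_1": {"f1", "f2", "f3"}, "genome_2": {"f1", "f2", "f4"}}
group_b = {"genome_3": {"f1", "f5"}}
result = signature_genes(group_a, group_b)
```

Here `f1` is common to all genomes, `f2` distinguishes group A, and `f5` distinguishes group B.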

Exploration of physical, genomic context
- Compare Regions graphic
  - Focus protein highlighted red
  - Color-matched orthologs allow comparative analysis of functional clustering and chromosomal rearrangements
  - Redraw the display with a different number of genomes or a different region size
- Compare Regions table
  - Table is sortable and filterable with active column headings
  - Genes with conserved proximity shown with functional coupling scores, fc-sc
- fc-sc (functional coupling score)
  - Measures conservation of gene proximity and phylogenetic distance
  - Link returns table listing pairs of proximal orthologs
- CL (find best clusters)
  - Finds clusters containing the focus protein in other genomes
  - Useful for genes without functional coupling scores, fc-sc

Exploration of functional, biological context
- Populated Subsystem Spreadsheet
  - Columns represent functional roles, mouse over header for definition
  - Genomes (rows) shown may be filtered and sorted by name or taxonomic group
  - Cells populated with specific, annotated genes linked to context pages
  - Functional variants defined by the annotated roles
  - Variant codes defined in notes tab
  - Diagram of subsystem often provided
- Protein families
  - FIGfams taken from single column of functional roles
  - Links to structures, orthologs, literature

NMPDR Services
- Essential Genes on Genomic Scale
  - Experimentally verified in genome-wide scans of 10 important model organisms
- Drug targets pipeline to in silico screening
  - essential in at least one of the NMPDR pathogens
  - included in subsystems by our curators
  - orthologs in the Protein Data Bank
  - orthologs in a substantial number of bacterial priority pathogens
- Targets search: flexible search forms for discovering novel targets based on computed attributes
  - physical characteristics such as MW, pI
  - subcellular location
  - transmembrane regions and signal peptides
  - subsystem, pathway, reaction
  - structural motifs, protein families
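The four drug-target criteria listed above combine as a logical AND over per-protein attributes. The following sketch shows one way to express that filter. The `Protein` record, its field names, and the threshold of five priority pathogens for "a substantial number" are all assumptions made for illustration; the slides do not give the pipeline's actual data model or cut-off.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    protein_id: str
    essential_in: int                 # NMPDR pathogens in which the gene is essential
    in_subsystem: bool                # included in a curated subsystem
    has_pdb_ortholog: bool            # ortholog with a structure in the Protein Data Bank
    priority_pathogen_orthologs: int  # priority pathogens carrying an ortholog


def is_candidate_target(protein, min_priority_orthologs=5):
    """Apply the four criteria from the slide; the threshold is an assumed value."""
    return (
        protein.essential_in >= 1
        and protein.in_subsystem
        and protein.has_pdb_ortholog
        and protein.priority_pathogen_orthologs >= min_priority_orthologs
    )


# Placeholder records, not real proteins.
proteins = [
    Protein("prot_1", essential_in=2, in_subsystem=True, has_pdb_ortholog=True,
            priority_pathogen_orthologs=8),
    Protein("prot_2", essential_in=0, in_subsystem=True, has_pdb_ortholog=True,
            priority_pathogen_orthologs=9),
    Protein("prot_3", essential_in=1, in_subsystem=True, has_pdb_ortholog=False,
            priority_pathogen_orthologs=6),
]
targets = [p.protein_id for p in proteins if is_candidate_target(p)]
```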

Related NMPDR Services
- RAST Genome annotation server
  - Automated annotation of essentially complete genome sequences in a small set of long sequence contigs
  - View results in comparative context with other genomes
- MG-RAST Metagenome annotation server
  - Automated annotation of a very large set of very short DNA sequences
  - View results in comparative context with other data sets
- Annotation Clearinghouse
  - Tool to credit experts with annotation of specific genes and to share annotations with other databases
  - Input is a two-column table of gene IDs and annotations vouched for by expert
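A two-column table of gene IDs and annotations, as described for the Annotation Clearinghouse, is easy to validate before submission. The sketch below assumes the table is tab-delimited with no header row. The slides do not specify the delimiter or file format, so treat both as assumptions. The gene IDs and the function name are placeholders.

```python
import csv
import io


def read_annotations(handle):
    """Read a two-column, tab-delimited table into {gene_id: annotation}.

    Blank lines are skipped; any other line without exactly two columns
    raises ValueError so malformed rows are caught early.
    """
    annotations = {}
    for line_no, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row or all(not cell.strip() for cell in row):
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        gene_id, annotation = (cell.strip() for cell in row)
        annotations[gene_id] = annotation
    return annotations


# Placeholder gene IDs paired with functional role names.
example = io.StringIO(
    "gene_0001\tHistidine ammonia-lyase\n"
    "gene_0002\tUrocanate hydratase\n"
)
table = read_annotations(example)
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the in-memory `StringIO` object.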

Who is NMPDR?
- Fellowship for Interpretation of Genomes (FIG): Ross Overbeek, Veronika Vonstein, Gordon Pusch, Bruce Parrello, Rob Edwards, Andrei Osterman, Michael Fonstein, Svetlana Gerdes, Olga Zagnitko, Olga Vassieva, Yakov Kogan, Irina Goltsman
- Argonne National Laboratory: Rick Stevens, Terry Disz, Robert Olson, Folker Meyer, Elizabeth Glass, Chris Henry, Jared Wilkening
- Computation Institute at University of Chicago: Daniela Bartels, Michael Kubal, William Mihalo, Tobias Paczian, Andreas Wilke, Alex Rodriguez, Mark D'Souza, Rami Aziz
- University of Illinois at Urbana; Hope College: Gary J. Olsen, Claudia Reich, Leslie McNeil; Aaron Best, Matt DeJongh
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, Contract HHSN C.