Tools for comparative genomics and expert annotations.


1 Tools for comparative genomics and expert annotations

2 www.nmpdr.org Goals of this Presentation
- Introduce microbiologists to the power of NMPDR and SEED
- Enable users to interact with data
- Invite experts to participate in construction of subsystems
- Capture expert annotations via the annotation clearinghouse

3 What is NMPDR?
- Beautified, read-only version of the SEED
What is the SEED?
- Editable environment for assignment of function in the context of systems biology
- Intended to clean up legacy of errors created by similarity-based, automated assignment of function
- Manual assignment of function based on integrated evidence: sequence similarity, functional clusters, phylogenetic and metabolic profiles
- Developed for the project to annotate 1000 genomes

4 When Will We Have 1000 Complete Genomes?
- Depends on what is meant by “complete”
  - Many sequencing projects will stop without “finishing” or “closing” the genome in one contiguous sequence for each replicon
- A genome is essentially complete when:
  - 95-99% of the genome is accurately sequenced
  - 10X coverage by 454 method; 5X coverage by Sanger method
  - Assembly places 70% of the data in contigs of at least 20 kbp
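The "essentially complete" criteria above can be sketched as a simple check. This is a minimal illustration, not an NMPDR tool: the function names and the method keys ("454", "sanger") are assumptions, and the 95-99% accuracy criterion is left out because it cannot be computed from contig lengths alone.

```python
# Coverage thresholds taken from the slide; the dictionary keys are assumed labels.
REQUIRED_COVERAGE = {"454": 10.0, "sanger": 5.0}


def fraction_in_long_contigs(contig_lengths, min_length=20_000):
    """Fraction of assembled bases that sit in contigs of at least min_length bp."""
    total = sum(contig_lengths)
    if total == 0:
        return 0.0
    long_total = sum(n for n in contig_lengths if n >= min_length)
    return long_total / total


def is_essentially_complete(contig_lengths, coverage, method):
    """Apply the coverage and contig-size criteria (accuracy is not checked here)."""
    enough_coverage = coverage >= REQUIRED_COVERAGE[method]
    enough_long = fraction_in_long_contigs(contig_lengths) >= 0.70
    return enough_coverage and enough_long
```

For example, an assembly with contigs of 50, 30, 10 and 10 kbp has 80% of its data in contigs of at least 20 kbp, so it passes at 10X coverage by the 454 method but fails at 8X.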

5 Bacterial Genome Facts
- First two complete genomes in 1995 were bacterial pathogens
- 2913 genomes started as of Sept., 2007
  - 63% of total are bacteria; 50% of bacteria are pathogens
- 4434 genomes started as of January, 2009
  - 51% bacteria
- Value depends on accuracy of annotation

6 Complete Genome Projects

7 What is an Annotation?
- Identification of a nucleotide string that could potentially encode a protein
  - Open reading frames (ORFs) computed from stop and start codons, codon bias, promoters and RBS
- Assignment of a name to that gene
  - Usually that of the known protein with the most similar sequence, computed from translated BLAST
- Prediction of functional role for that gene
  - Function of most similar protein not always established with experimental evidence
  - Most similar protein may not have known function
  - Most similar ORF may or may not be expressed
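The first step above, finding ORFs from start and stop codons, can be sketched as follows. This is a deliberately naive illustration: it scans only the forward strand, treats ATG as the only start codon, and ignores codon bias, promoters and RBS, all of which real gene callers use. The function name and the `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs, end exclusive.

    An ORF runs from the first ATG in a reading frame to the next in-frame
    stop codon, and its length in codons includes both start and stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close the ORF and look for the next start
    return sorted(orfs)
```

Calling `find_orfs("CCATGAAATAGGG")` reports one ORF spanning `ATGAAATAG`.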

8 Problems with Standard Annotations
- 42% of H. influenzae ORFs assigned no function in 1995
  - about half of those had no sequence match in GenBank
  - the rest matched “hypothetical proteins” in E. coli
- 58% of H. influenzae ORFs assigned function of a significantly similar sequence
- What was in GenBank to compare with in 1995?
  - 7% of all GenBank entries were bacterial; 16% of those were E. coli
  - many “conserved hypotheticals” added to database
- Paralogous members of protein families may not be properly discriminated
- Significantly similar enzymes may act on different substrates
- Assignments are transitive, many times removed from experimental data

9 Subsystems Annotations vs. Pipelines or Protein Families
- What is subsystems annotation?
  - humans integrating evidence within a comparative framework
- What’s wrong with “genome-at-a-time” pipelines?
  - automated assignment of archived annotations to new genomes
  - propagates uninformative and incorrect annotations
- What’s wrong with annotation based on protein families?
  - emphasizes structural and phylogenetic evidence
  - ignores metabolic and chromosomal contexts
  - leads to ambiguity for members of large families, e.g. transporters

10 What is a Subsystem?
- Subsystem is a generalization of pathway
  - Collection of functional roles jointly involved in a biological process or complex: metabolic, signaling, regulatory, structural
- Functional role is the abstract biological function of a gene product
  - Atomic or fundamental; examples:
    - 6-phosphofructokinase (EC 2.7.1.11)
    - LSU ribosomal protein L31p
    - cell division protein FtsZ
- Inclusion of gene in subsystem is only by functional role
- Controlled vocabulary …

11 Expert-Defined Subsystems
- Curator is researcher with first-hand knowledge of biological system
- Functional roles defined and grouped into subsystem and subsets by curator
  - universal groups of roles include all organisms
  - functional variants are subsets of roles found in a limited number of organisms
    - often represent alternative paths or nonorthologous replacement
- Semi-automated assignment of function based on manual groundwork, sequence homology, and functional clustering

12 Subsystem Primer
- Describe your subsystem in 150 words or less: why should these functions be considered together?
  - define the emergent properties of the system
- Provide or link to a diagram that illustrates this subsystem
  - define the graph or network
- List the reactions or relationships between these functional roles
  - define the edges
- List the exact names and abbreviations of these functional roles
  - define the nodes
- List the id numbers (GenBank, SwissProt, or any identifying alias) of genes that play these roles in one or more exemplar genomes
  - examples of nodes
- Provide one or more references that support the assignment of function for the exemplar genes
  - provide evidence

13 Populated Subsystems
- Two-dimensional integration of functional roles with genomes
- Spreadsheet
  - Columns of functional roles
  - Rows of organisms
  - Cells of annotated genes
- Table of functional roles with GO terms
- Diagram
- Curator notes and citations

14 Simple Example: Histidine Degradation Subsystem
- Conversion of histidine to glutamate is the organizing principle
- Functional roles defined in table:

15 Subsystem Diagram
- Three functional variants
- Universal subset has three roles, followed by three alternative paths from IV to VI

16 Subsystem Spreadsheet
- Column headers taken from table of functional roles
- Rows are selected genomes, or organisms
- Cells are populated with specific, annotated genes
- Shared background color indicates proximity of genes
- Functional variants defined by the annotated roles
- Variant code -1 indicates subsystem is not functional
Organism                      Variant  HutH        HutU        HutI        GluF        HutG        NfoD        ForI
Bacteroides thetaiotaomicron  1        Q8A4B3      Q8A4A9      Q8A4B1      Q8A4B0      -           -           -
Desulfotalea psychrophila     1        gi51246205  gi51246204  gi51246203  gi51246202  -           -           -
Halobacterium sp.             2        Q9HQD5      Q9HQD8      Q9HQD6      -           Q9HQD7      -           -
Deinococcus radiodurans       2        Q9RZ06      Q9RZ02      Q9RZ05      -           Q9RZ04      -           -
Bacillus subtilis             2        P10944      P25503      P42084      -           P42068      -           -
Caulobacter crescentus        3        P58082      Q9A9MI      P58079      -           -           Q9A9M0      -
Pseudomonas putida            3        Q88CZ7      Q88CZ6      Q88CZ9      -           -           Q88D00      -
Xanthomonas campestris        3        Q8PAA7      P58988      Q8PAA6      -           -           Q8PAA8      -
Listeria monocytogenes        -        -           -           -           -           -           -           -
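The way variant codes follow from annotated roles can be sketched as a small rule. This is an illustration under stated assumptions, not the SEED's own logic: the slides say the universal subset has three roles and that GluF, HutG and NfoD mark variants 1, 2 and 3, but treating HutH, HutU and HutI as the universal roles is inferred from the spreadsheet, and the function name is hypothetical.

```python
# Assumed universal roles (inferred from the spreadsheet rows).
UNIVERSAL_ROLES = {"HutH", "HutU", "HutI"}
# Role that distinguishes each functional variant, per the slides.
VARIANT_ROLE = {"GluF": 1, "HutG": 2, "NfoD": 3}


def infer_variant(roles_present):
    """Return the variant code for a genome, or -1 if the subsystem looks non-functional."""
    roles = set(roles_present)
    if not UNIVERSAL_ROLES <= roles:
        return -1  # a universal role is missing
    for role, code in VARIANT_ROLE.items():
        if role in roles:
            return code
    return -1  # universal roles present but no known alternative path
```

A genome annotated with HutH, HutU, HutI and HutG, like the Bacillus subtilis row, gets variant 2; a genome with none of the roles gets -1.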

17 Missing Genes Noticed by Subsystems Annotation
- No genes were annotated “ForI (EC 3.5.3.13) Formiminoglutamic iminohydrolase” when the Histidine Degradation subsystem was populated
- Organisms missing ForI convert His to Glu
- Candidate genes that could perform the role “ForI” must be identified
- Strategy for finding genes is based on chromosomal clustering and occurrence profiling

18 Finding Genes that Cluster with NfoD
- Red gene in graphic and table is NfoD of Xanthomonas
- Genes pictured in gray boxes are located near NfoD in four or more species
- Advanced controls expand display of homologous regions in other genomes
- Functional Coupling score links to table of homologous pairs in other genomes
- Cluster button finds biggest clusters in other species when not clustered in subject genome

19 What are Pinned Regions?
- Focus gene is number 1, colored red
- Most frequently co-localized homolog numbered 2, colored green
- Sets of homologous genes presented in the same color with the same numerical label; BLASTP cut-off e-val = 1e-20
- Numerical labels correspond to rank-ordered frequency of co-localization with the focus gene
- Number of regions, size of region, and cut-off can be re-set by user
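The numbering scheme described above can be sketched as a frequency ranking. This is a rough illustration, not the NMPDR implementation: it assumes genes have already been grouped into homolog sets (for example by the BLASTP cut-off), that each region is given as a list of those set names, and that ties are broken alphabetically. The function name is hypothetical.

```python
from collections import Counter


def rank_labels(regions, focus):
    """Label the focus set 1 and other homolog sets 2, 3, ... by co-localization frequency."""
    counts = Counter()
    for families in regions:
        if focus in families:
            # Count each homolog set once per region that contains the focus gene.
            counts.update(set(families) - {focus})
    ordered = sorted(counts, key=lambda family: (-counts[family], family))
    labels = {focus: 1}
    for rank, family in enumerate(ordered, start=2):
        labels[family] = rank
    return labels
```

With three toy regions in which HutC appears beside NfoD three times and HutH twice, HutC is labeled 2 and HutH 3.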

20 Candidate ForI in Context with NfoD
- Compare Regions around NfoD, red, center
  - HutC, the regulator, is green, 2
  - HutH, the first functional role in the subsystem, is blue, 4
  - Candidate ForI is teal, 6, originally annotated as “conserved hypothetical”

21 Annotation of ForI EC 3.5.3.13
- Metabolic context proves need for role
  - Organisms missing annotated ForI degrade His to Glu
- Chromosomal context points to candidate
  - Clusters with NfoD and other genes in subsystem
- Occurrence context supports candidate
  - Organisms containing NfoD lack GluF and HutG, required for functional variants 1 and 2, respectively
  - Organisms containing candidate ForI also contain NfoD, indicating functional variant 3
- Phylogenetic trees of candidate ForI genes are coherent
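The occurrence-context argument above can be sketched as a check over presence/absence profiles. This is a toy illustration under assumptions: each genome is represented as a set of role names, and the function name and the label "ForI_candidate" are hypothetical.

```python
def occurrence_support(profiles, candidate, partner, excluded):
    """True if every genome carrying the candidate also has the partner role
    and none of the excluded roles.

    profiles maps a genome name to the set of roles annotated in that genome.
    """
    carriers = [roles for roles in profiles.values() if candidate in roles]
    if not carriers:
        return False  # no genome carries the candidate, so there is no evidence
    excluded = set(excluded)
    return all(partner in roles and not (excluded & roles) for roles in carriers)
```

For the ForI case, the call would be `occurrence_support(profiles, "ForI_candidate", "NfoD", ["GluF", "HutG"])`, which holds only when the candidate is confined to genomes that look like variant 3.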

22 Subsystems Allow Bioinformatics to Inform Bench Research
- Subsystems point to missing or alternative genes
- Bioinformatic predictions need to be tested at the bench
- ForI candidate now verified experimentally
- Connections forged between bench and bioinformatics

23 How is NMPDR distinct from NCBI?
- Corrected, functional annotations, manually curated in context of systems biology
- Multiple starting points for accessing data
  - gene or protein name, subsystem, organism
- Search results downloadable as names or sequences
- Interactive tools for comparative analysis
  - Compare regions: adjust size of region, number of genomes
  - Subsystems: browse phylogenetic distribution of biological system; color spreadsheet and diagram
  - Functional clusters: find genes with conserved proximity
  - BLASTP Hits: select and align interesting sequences
  - Signature genes: find genes in common or that distinguish user-selected groups of genomes; groups may contain one or many
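The idea behind signature genes, finding what one group of genomes shares and another lacks, can be sketched with set operations. This is a simplified illustration, not the NMPDR tool: it assumes each genome is already reduced to a set of gene-family or role names, whereas the real tool has to decide which genes count as "the same". The function name is hypothetical.

```python
def signature_genes(group_a, group_b):
    """Genes present in every genome of group_a and absent from every genome of group_b.

    Each group is a list of sets; a group may hold one genome or many.
    """
    common_a = set.intersection(*group_a) if group_a else set()
    any_b = set.union(*group_b) if group_b else set()
    return common_a - any_b
```

With an empty second group the result is simply the genes common to the first group.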

24 Exploration of physical, genomic context
- Compare Regions graphic
  - Focus protein highlighted red
  - Color-matched orthologs allow comparative analysis of functional clustering and chromosomal rearrangements
  - Redraw the display with different number of genomes or different size region
- Compare Regions table
  - Table is sortable and filterable with active column headings
  - Genes with conserved proximity shown with functional coupling scores, fc-sc
- fc-sc (functional coupling score)
  - Measures conservation of gene proximity and phylogenetic distance
  - Link returns table listing pairs of proximal orthologs
- CL (find best clusters)
  - Finds clusters containing the focus protein in other genomes
  - Useful for genes without functional coupling scores, fc-sc

25 Exploration of functional, biological context
- Populated Subsystem Spreadsheet
  - Columns represent functional roles, mouse over header for definition
  - Genomes (rows) shown may be filtered and sorted by name or taxonomic group
  - Cells populated with specific, annotated genes linked to context pages
  - Functional variants defined by the annotated roles
  - Variant codes defined in notes tab
  - Diagram of subsystem often provided
- Protein families
  - FIGfams taken from single column of functional roles
  - Links to structures, orthologs, literature

26 NMPDR Services
- Essential Genes on Genomic Scale
  - Experimentally verified in genome-wide scans of 10 important model organisms
- Drug targets pipeline to in silico screening
  - essential in at least one of the NMPDR pathogens
  - included in subsystems by our curators
  - orthologs in the Protein Data Bank
  - orthologs in a substantial number of bacterial priority pathogens
- Targets search: flexible search forms for discovering novel targets based on computed attributes
  - physical characteristics such as MW, pI
  - subcellular location
  - transmembrane regions and signal peptides
  - subsystem, pathway, reaction
  - structural motifs, protein families

27 Related NMPDR Services
- RAST Genome annotation server
  - Automated annotation of essentially complete genome sequences in a small set of long sequence contigs
  - View results in comparative context with other genomes
- MG-RAST Metagenome annotation server
  - Automated annotation of a very large set of very short DNA sequences
  - View results in comparative context with other data sets
- Annotation Clearinghouse
  - Tool to credit experts with annotation of specific genes and to share annotations with other databases
  - Input is a two-column table of gene IDs and annotations vouched for by expert
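The Annotation Clearinghouse input described above, a two-column table of gene IDs and annotations, can be read with a few lines of code. This is a sketch under assumptions: the slide does not state the delimiter, so tab-separated text is assumed here, and the function name is hypothetical.

```python
def parse_annotation_table(text):
    """Parse two-column, tab-separated text into a {gene_id: annotation} dict.

    Blank lines are skipped; any other line without exactly two columns is an error.
    """
    annotations = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(fields)}")
        gene_id, annotation = (field.strip() for field in fields)
        annotations[gene_id] = annotation
    return annotations
```

A line such as `P10944<TAB>Histidine ammonia-lyase (EC 4.3.1.3)` becomes one entry keyed by the gene ID.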

28 Who is NMPDR?
- Fellowship for Interpretation of Genomes (FIG): Ross Overbeek, Veronika Vonstein, Gordon Pusch, Bruce Parrello, Rob Edwards, Andrei Osterman, Michael Fonstein, Svetlana Gerdes, Olga Zagnitko, Olga Vassieva, Yakov Kogan, Irina Goltsman
- Argonne National Laboratory: Rick Stevens, Terry Disz, Robert Olson, Folker Meyer, Elizabeth Glass, Chris Henry, Jared Wilkening
- Computation Institute at University of Chicago: Daniela Bartels, Michael Kubal, William Mihalo, Tobias Paczian, Andreas Wilke, Alex Rodriguez, Mark D'Souza, Rami Aziz
- University of Illinois at Urbana; Hope College: Gary J. Olsen, Claudia Reich, Leslie McNeil; Aaron Best, Matt DeJongh
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, Contract HHSN266200400042C.

