Pre-SIG Genome Annotation Database Operations Suzanna Lewis FlyBase/Berkeley Drosophila Genome Project Gene Ontology Consortium.


Having it all
  Complete: every occurrence is found
  Precise: every occurrence is accurate
  Comprehensive: all types of features
  Richly described: biological functional data

Contradictions and Complications
  Assembly errors
  Missed gene merges
  Missed gene splits
  Complex adjustments
  Dicistronic genes
  Overlaps and intersections

Assembly dependencies

Missed merges

Missed Splits

Splerges

Dicistronic Genes

Shared 5' UTR

Shared UTRs

Having it all
  Complete: every occurrence is found
  Precise: every occurrence is accurate
  Comprehensive: all types of features
  Richly described: biological functional data
  deleted: 41, new: 179, merges: 31, splits: 26, reinstated: 32

Broad Institute, TIGR, JGI, Baylor College of Medicine, Washington University, FlyBase, Ensembl
GMOD contributors meeting, March 2004

The Essentials
  Visualization and manual editors
  Combiners
  Full-length cDNA sequences
  High-quality assemblies
  Annotation standards and verification
  Evidence tracking and versioning
  Open source software components and standards are critical to long-term success

Annotation verification
  Community input
  On-line error reporting
  Curation of the literature
  Confirmation by comparison to cDNA sequences

SWISSPROT Comparison
  Perfect match: 100% identity over 100% of lengths
  Single AA substitutions: 99% identity over 100% of lengths
  Together these two categories account for 75% of all genes with a SWISSPROT cognate (2,771 out of 3,687).

SWISSPROT Comparison
  Significant mismatch: spans of 40 residues or 20% of peptide length, with at least 97% sequence identity
  No match: poor or empty matches
  These remaining differences were due to lingering annotation errors or errors in the reported DNA sequence from SWISSPROT.
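The four comparison categories can be read as a small decision rule. The sketch below is one possible reading, not the procedure that was actually run: the function name, its parameters, and in particular the interpretation of "significant mismatch" (an aligned span of at least 40 residues or 20% of the peptide, at 97% identity or better) are assumptions made for illustration.

```python
def classify_match(identity_pct, coverage_pct, span_len=0, peptide_len=0):
    """Assign a comparison category to one annotated peptide (hypothetical helper).

    identity_pct: percent identity of the aligned region
    coverage_pct: percent of both sequence lengths covered by the alignment
    span_len:     length in residues of the aligned span
    peptide_len:  length in residues of the annotated peptide
    """
    full_length = coverage_pct >= 100.0
    if full_length and identity_pct >= 100.0:
        return "perfect match"
    if full_length and identity_pct >= 99.0:
        return "single AA substitutions"
    # Assumed reading of the slide: a span qualifies if it reaches 40 residues
    # or 20% of the peptide length, at 97% identity or better.
    long_enough = span_len >= 40 or (peptide_len > 0 and span_len >= 0.2 * peptide_len)
    if identity_pct >= 97.0 and long_enough:
        return "significant mismatch"
    return "no match"


# Share of genes in the first two categories, from the counts on the slide.
share_pct = 100.0 * 2771 / 3687

if __name__ == "__main__":
    print(classify_match(100.0, 100.0))
    print(classify_match(98.0, 60.0, span_len=120, peptide_len=200))
    print(round(share_pct))
```

The thresholds are kept as literals so they can be compared directly against the slide text; a real pipeline would take them from whatever alignment tool produced the identity and coverage figures.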

Analysis of 8687 cDNAs (full inserts)

Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

Annotation of all types of features
  Protein-coding: 13,410
  tRNA: 291
  microRNA: 23
  snRNA: 32
  snoRNA: 29
  Pseudogenes: 19
  Non-coding RNA: 36
  Transposons: 1,572
  Promoters: (thousands)
  TSS: (thousands)
  P element insertions: (thousands)
  Total: 15,412
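A quick arithmetic check shows how the total on this slide is composed. The sketch below sums only the feature types that have an exact count; the dictionary name is an illustrative choice, and the three categories given only as "(thousands)" are left out because the slide gives no number for them.

```python
# Exact counts as listed on the slide; "(thousands)" categories are omitted
# because no figure is given for them.
FEATURE_COUNTS = {
    "Protein-coding": 13410,
    "tRNA": 291,
    "microRNA": 23,
    "snRNA": 32,
    "snoRNA": 29,
    "Pseudogenes": 19,
    "Non-coding RNA": 36,
    "Transposons": 1572,
}

total = sum(FEATURE_COUNTS.values())

if __name__ == "__main__":
    print(total)
```

The sum matches the stated total of 15,412, which indicates that promoters, TSS, and P element insertions are not included in that figure.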

Having it all Complete: every occurrence is found Precise: every occurrence is accurate Comprehensive: all types of features Richly described: biological functional data

How to find what you need
  Databases: FlyBase, MGI, SGD
  By name? Cappuccino, BNI1, Formin 2
  By database ID? FBgn, S, MGI:
  By function? Actin binding

But in 1998 there was a problem… None of the organism databases used standard terminology to describe biological function.

For example
  "translation" versus "protein synthesis"
  It will be difficult for you, and even harder for a computer, to find functionally equivalent gene products.
  You want all gene products that are involved in bacterial protein synthesis, but the sequences are significantly different from those in humans.

How to best describe biology?
  Natural language: highly expressive, ambiguous in meaning, hard to compute on
  Structured representation: limited in expressivity, precise, may be computed on
  We needed to find a middle ground that supports and enables both.

The aims of GO
  1. To develop comprehensive shared vocabularies.
  2. To use the vocabularies to describe the gene products held in different databases.
  3. To provide access to the vocabularies, the annotations, and associated data.
  4. To provide software tools to assist biological researchers.

The early key decisions
  The vocabulary itself requires a serious and ongoing effort.
  Carefully define every concept.
  Initially keep things as simple as possible and only use a minimally sufficient data representation.
  Focus initially on molecular aspects that are shared between many organisms.

A sequence is not equal to a gene
  Physically a gene is composed of sequences: DNA, RNA, and protein
  Different strains, ESTs, cDNAs, alleles…
  A fully characterized gene has multiple sequence references

GO is NOT a gene nomenclature system
  Communities decide upon the official gene name or symbol, and their community databases maintain these data.
  Sequence repositories (i.e. GenBank/EMBL/DDBJ/SwissProt) provide sequence identifiers and protein names.
  Proteins may be named differently than genes, e.g. HUGO and UniProt IDs.

GO encompasses descriptions for all functional molecular entities
  A gene product may be either a functional RNA or a protein
  Protein, tRNA, miRNA, snRNA, rRNA, …

The breakdown of work
  Task 1: Building the ontology, a computable description of the biological world
  Task 2: Describing your gene (annotation)
    Protein structure
    Phenotype
    Expression data
    Function, process, localization…

Vocabulary and relationships
  Look up concept to accurately express biology
  Your gene product
  Refer to representative sequences
  Gene nomenclature decisions
  Sequence DB
  Choose approved name and synonyms
  Collect what is known from the literature

GO databases: distributed and centralized
  Support cross-database queries
    By having a mutual understanding of the definition and meaning of any word used to describe a gene product
  Provide database access to a common repository of annotations
    By submitting a summary of gene products that have been annotated
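A cross-database query becomes a simple lookup once every database describes function with the same term. The sketch below is illustrative only: the record layout and function name are assumptions, the three formin genes and the "actin binding" term come from the earlier slide, and the fourth record is a made-up placeholder.

```python
from collections import defaultdict

# Hypothetical pooled annotation records: (database, gene symbol, shared term).
ANNOTATIONS = [
    ("FlyBase", "Cappuccino", "actin binding"),
    ("SGD", "BNI1", "actin binding"),
    ("MGI", "Formin 2", "actin binding"),
    ("FlyBase", "placeholder-gene", "translation"),  # invented filler record
]


def genes_with_function(annotations, term):
    """Return {database: [gene symbols]} for every record carrying `term`."""
    hits = defaultdict(list)
    for database, symbol, annotated_term in annotations:
        if annotated_term == term:
            hits[database].append(symbol)
    return dict(hits)


if __name__ == "__main__":
    for database, symbols in genes_with_function(ANNOTATIONS, "actin binding").items():
        print(database, symbols)
```

The exact string comparison only works because all three databases are assumed to use the identical term; with free-text descriptions the same query would need fuzzy matching or manual mapping.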

If we build it…
  What is a term?
  Definition of term concepts
  How to represent and manage the concepts
  Biological scope
  Annotation

What is a term?
  Must have a stable ID
  May have synonyms
  Has relationships to other terms
  Can be made obsolete
  Can be split or merged
  Must have a definition
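These requirements map naturally onto a small record type. The following is a minimal sketch, assuming Python dataclasses; the class, field names, the `merge_terms` helper, and the `EX:` identifiers are all hypothetical and do not reflect GO's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    term_id: str                                         # stable ID, never reused
    name: str
    definition: str                                      # required
    definition_refs: list = field(default_factory=list)  # sources for the definition
    synonyms: list = field(default_factory=list)
    relationships: list = field(default_factory=list)    # (relation, target term ID)
    is_obsolete: bool = False
    replaced_by: list = field(default_factory=list)      # filled in on merge

    def __post_init__(self):
        if not self.term_id:
            raise ValueError("a term must have a stable ID")
        if not self.definition:
            raise ValueError("a term must have a definition")


def merge_terms(kept, retired):
    """Merge `retired` into `kept`; the retired ID is kept and marked obsolete."""
    for label in [retired.name, *retired.synonyms]:
        if label != kept.name and label not in kept.synonyms:
            kept.synonyms.append(label)
    retired.is_obsolete = True
    retired.replaced_by = [kept.term_id]
    return kept


if __name__ == "__main__":
    a = Term("EX:0000001", "protein synthesis", "Illustrative definition.")
    b = Term("EX:0000002", "translation", "Illustrative definition.")
    merge_terms(a, b)
    print(a.synonyms, b.is_obsolete, b.replaced_by)
```

Keeping the retired term as an obsolete record, rather than deleting it, is what lets existing references to its ID continue to resolve.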

Definitions
  Purpose is to remove ambiguity of interpretation and alternate meanings
  All definitions are supported by cross-references to the source(s)

Annotating gene products
  Expert curation is accepted from any group that can provide an ID and evidence
  Each annotation must be supported by evidence, including a cross-reference
  A gene can be annotated with multiple terms
  Annotation is at the finest possible granularity
  Guidelines
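The annotation rules can be sketched as a record that refuses to exist without evidence and a cross-reference. This is an illustration under stated assumptions: the class, its field names, the identifiers, and the free-text evidence labels are hypothetical and are not GO's actual annotation format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    gene_product_id: str   # ID supplied by the contributing group
    term_id: str           # vocabulary term being asserted
    evidence: str          # kind of support, e.g. "direct assay" (placeholder label)
    reference: str         # cross-reference to the supporting source

    def __post_init__(self):
        for field_name in ("gene_product_id", "term_id", "evidence", "reference"):
            if not getattr(self, field_name):
                raise ValueError(f"annotation is missing {field_name}")


def terms_by_gene_product(annotations):
    """Group term IDs per gene product; one product may carry several terms."""
    grouped = defaultdict(set)
    for annotation in annotations:
        grouped[annotation.gene_product_id].add(annotation.term_id)
    return dict(grouped)


if __name__ == "__main__":
    records = [
        Annotation("GP:1", "EX:0000001", "direct assay", "REF:0001"),
        Annotation("GP:1", "EX:0000002", "sequence similarity", "REF:0002"),
    ]
    print(terms_by_gene_product(records))
```

Validating at construction time means an unsupported annotation never enters the collection, which mirrors the rule that every annotation must carry evidence.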

GO functional analysis
  Sequence similarity
  Literature harvesting
  Motif analysis
  Expression studies
  Interaction studies

Current work
  Low coverage genome sequencing
  Multiple species genome comparisons
  SO, the Sequence Ontology

e.g. What is a pseudogene?
  Human: a sequence similar to a known protein but containing frameshift(s) and/or stop codons which disrupt the ORF.
  Neisseria: a gene that is inactive but may be activated by translocation (e.g. by gene conversion) to a new chromosome site. Note: some would call such a gene a cassette in yeast.

SO is useful if you want to:
  Annotate sequence using consistent terminology for the same features across genomes.
  Enable practical querying and comparisons between sequence databases.
  Describe and propagate features at all levels of the sequence from genomic to mature protein.

Thank You to…
  Curators-Berkeley: Sima Misra, Josh Kaminker, Simon Prochnik, Chris Smith, Jon Tupy
  Curators-Harvard: Lynn Crosby, Bev Matthews, Kathy Campbell, Pavel Hradecky, Yanmei Huang, Leyla Bayraktaroglu
  Curators-Cambridge: Gillian Millburn, Rachel Drysdale, Chihiro Yamada
  Curators-SWISSPROT: Eleanor Whitfield
  Software-Berkeley: Chris Mungall, Ben Berman, Joe Carlson, Mark Gibson, Nomi Harris, George Hartzell, Brad Marshall, John Richter, ShengQiang Shu
  Software-Harvard: David Emmert
  Software-Cambridge: Aubrey de Grey
  Software-Ensembl: Michelle Clamp, Vivek Iyer, Steve Searle

and Thanks to…
  Christopher Mungall, John Day-Richter, Brad Marshall, Karen Eilbeck, Mark Yandell, George Hartzell, David Hill, Joel Richardson, GO Curators, Michael Ashburner, Judith Blake, J. Michael Cherry

We want and depend on you!
  Corrections to the peptides
  Functional annotation
  Corrections and additions to GO