A Sequence Ontology Suzanna Lewis Berkeley Drosophila Genome Project



Semantic Interpretation: What is communication? The transmission of information from a source to a receiver by means of encoding and decoding processes (including language). But what is meant, what is said, what is heard, and what is understood are not always the same thing. This has a simple consequence: we can communicate only to the extent that we share rules of usage and have a reciprocal understanding of the meaning.

Working towards a shared language for the description of sequence. Hey, know what I figured out? The meaning of words isn’t a fixed thing! Any word can mean anything! By giving words new meanings, ordinary English can become an exclusionary code! Two researchers can be divided by the same language! To that end, we’re inventing new definitions for common words, so we’ll be unable to communicate. Don’t you think that is totally excellent?

How to best describe biology? Natural language: highly expressive, ambiguous, hard to compute on. Why would I want to compute on it? Database searching, data mining, knowledge transfer.

The aims of SO 1. Develop a shared set of terms and concepts to annotate biological sequences. 2. Apply these in our separate projects to provide consistent query capabilities between them. 3. Provide a software resource to assist in the application and distribution of SO. 4. Meet the GOBO criteria.

SO: Phase I. To provide a structured controlled vocabulary for the description of primary annotations of nucleic acid sequence, useful for the annotations shared by a DAS server. To provide a structured representation of these annotations within genomic databases, making it possible to query, for example, all genes whose transcripts are edited, or trans-spliced, or bound by a particular protein.

Simple GFF: What is a transcript?
Source    Type  Group
Sanger    exon  Sequence Em:AP C22.2.mRNA
EBI       exon  dJ68O2.C22.1.mRNA
WUSTL     CDS   gene_is "001"; transcript_id "001.1";
Gadfly    exon  genegrp=CG18090; transgrp=CG18090-RA;
Wormbase  exon  Sequence "C27C7.7"

What is a pseudogene? Human: a sequence similar to a known protein but containing frameshift(s) and/or stop codons that disrupt the ORF. Neisseria: a gene that is inactive but may be activated by translocation (e.g. by gene conversion) to a new chromosome site. Note: such a gene would be called a “cassette” in yeast.

SO so far: 1280 terms. Top levels: structural variation, locatable features, other sequence attributes.

Approach. Determine the top-level orthogonal categories: domain, site, sequence type, location. Specify the specializations: homeo domain, phosphorylation site, DNA/RNA/AA. Define inter-relationships between orthogonal categories: ison, defines.

[Figure: ontology diagram. Labels as extracted: primary transcript, DNA, sequence, RNA, nucleic acid sequence, processed transcript, nucleic acid sequence region, sequence region, DNA region, gene region, transcript region, exon, RNA region. Relationship labels: defines, ison.]

GFF After
Source    Type  Group
Sanger    exon  transcript "Em:AP C22.2"
EBI       exon  transcript "dJ68O2.C22.1"
WUSTL     CDS   gene "001"; transcript "001.1";
Gadfly    exon  gene "CG18090"; transcript "CG18090-RA";
Wormbase  exon  transcript "C27C7.7"
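Once every source writes the group column as tag "value" pairs, one small parser can serve all of them. The following is a minimal sketch, assuming straight double quotes and semicolon-separated pairs; the function name parse_group is hypothetical and not part of any GFF library.

```python
import re
from typing import Dict

# One tag followed by one double-quoted value, e.g. gene "CG18090"
_PAIR = re.compile(r'(\w+)\s+"([^"]*)"')


def parse_group(group: str) -> Dict[str, str]:
    """Parse a normalized group column into a tag -> value mapping."""
    return {tag: value for tag, value in _PAIR.findall(group)}


if __name__ == "__main__":
    print(parse_group('gene "CG18090"; transcript "CG18090-RA";'))
```

With a shared vocabulary of tags, a query such as "all exons of transcript X" no longer needs source-specific handling.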

SO long(term) 1. Formalize the current phrase-based ontology to a description logic 2. Provide DAML+OIL/OWL representations 3. Add declarative rules and constraints to ensure consistency of annotations and aid annotation. 4. Extend the ontology so that it can be used as a full sequence knowledge base.

Description logics will make the ontology easier to maintain. For example, they will enable cross-products within the ontology. Now: separate terms "tRNA alanyl", "tRNA coding gene alanyl", "tRNA primary transcript alanyl". With a description logic: the tRNA class has a 'slot' for amino acid and a slot for anticodon. 'Restrictions' effectively say "any instance of class tRNA that has the amino-acid slot value of alanine is of the class 'tRNA alanyl'". 'Checks' test for inconsistency between anticodon, amino acid and class.
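The slot, restriction, and check idea can be imitated in ordinary code to show what a reasoner would do. This is only a sketch in Python, not a description logic: the class TRNA, the two lookup tables, and the name "tRNA glycyl" are illustrative assumptions, and the anticodon table holds just two entries.

```python
from dataclasses import dataclass
from typing import List

# Illustrative two-entry table (anticodons written 5'->3' as RNA).
# AGC reads the alanine codon GCU; GCC reads the glycine codon GGC.
ANTICODON_TO_AMINO_ACID = {"AGC": "alanine", "GCC": "glycine"}

# Hypothetical restriction table: amino-acid slot value -> derived class name.
AMINO_ACID_TO_CLASS = {"alanine": "tRNA alanyl", "glycine": "tRNA glycyl"}


@dataclass
class TRNA:
    amino_acid: str  # the amino-acid slot
    anticodon: str   # the anticodon slot


def classify(trna: TRNA) -> str:
    """Derive the most specific class from the amino-acid slot value."""
    return AMINO_ACID_TO_CLASS.get(trna.amino_acid, "tRNA")


def check(trna: TRNA) -> List[str]:
    """Return inconsistencies between the anticodon and amino-acid slots."""
    expected = ANTICODON_TO_AMINO_ACID.get(trna.anticodon)
    if expected is not None and expected != trna.amino_acid:
        return [
            f"anticodon {trna.anticodon} implies {expected}, "
            f"but the amino-acid slot is {trna.amino_acid}"
        ]
    return []
```

The point of the sketch is that "tRNA alanyl" is never asserted by hand; it follows from the slot value, and the check flags annotations whose slots disagree.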

Computable definitions Human-readable text definitions are always desirable. But, lengthy text definitions will always be open to interpretation. …besides, much of the data will be provided by programmers, and programmers never read the instructions. If programmers write their own code for assigning these, this opens the possibility of inconsistencies of interpretation of the concept. Computable definitions/constraints are essential wherever possible to provide a set of declarative rules for checking and inference.

A SO Knowledge Base? SO could eventually be used not just as a way of categorizing sequence features, but as the data model for storing sequence and sequence feature data. Accomplish this by adding a few slots to the top level feature class - for instance for start and end coordinates. One could then have an entire sequence database in DAML+OIL/OWL format.

Declarative representation for spatial definitions. Rules involving mathematical constructs cannot usually be expressed in a Description Logic. There needs to be a declarative representation of these rules, because enforcing them with a program written in an imperative language is difficult to sustain. Declarative languages specify *what* is to be done, rather than *how* it should be done.

Give me 500 bases upstream of all 5’ exons. Define a 5’ exon as the first exon on the five prime end of a transcribed region. It would be very tedious for a curator to have to specifically annotate exons as being “5’ exon” as opposed to the more general “exon”. There is no need for them to do this, as it is computable from rules.
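As an illustration of how such a rule could be computed rather than curated, here is a minimal Python sketch. It assumes 1-based inclusive coordinates and strands written as "+" or "-"; the names Exon, five_prime_exon, and upstream_region are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Exon(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def five_prime_exon(exons: List[Exon], strand: str) -> Exon:
    """Pick the exon at the 5' end of the transcribed region."""
    if strand == "+":
        return min(exons, key=lambda e: e.start)
    if strand == "-":
        return max(exons, key=lambda e: e.end)
    raise ValueError(f"unknown strand: {strand!r}")


def upstream_region(exon: Exon, strand: str, length: int = 500) -> Tuple[int, int]:
    """Return (start, end) of the region immediately upstream of the exon.

    On the plus strand the region is clipped at position 1. On the minus
    strand it is not clipped, because the sequence length is not known here.
    """
    if strand == "+":
        return (max(1, exon.start - length), exon.start - 1)
    if strand == "-":
        return (exon.end + 1, exon.end + length)
    raise ValueError(f"unknown strand: {strand!r}")
```

The curator annotates only "exon"; which exon is the 5' one falls out of the coordinates and the strand.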

Give me all the dicistronic genes. Define a dicistronic gene in terms of the cardinality of the transcript to open-reading-frame relationship and their spatial arrangement.
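A sketch of what such a computable definition might look like, in Python. The cardinality rule (exactly two ORFs per transcript) follows the slide; reading "spatial arrangement" as "the two ORFs do not overlap" is an assumption made only for illustration, as is the name is_dicistronic.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, in transcript coordinates


def is_dicistronic(orfs: List[Interval]) -> bool:
    """True if the transcript carries exactly two non-overlapping ORFs."""
    if len(orfs) != 2:  # cardinality of the transcript-to-ORF relationship
        return False
    first, second = sorted(orfs)
    return first[1] < second[0]  # assumed spatial rule: no overlap
```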

Give me all 3’ exons that overlap 5’ untranslated regions. Define “exons with overlapping UTRs” as a spatial relationship coupled with being “parts of” different genes and being non-coding.
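The definition combines an interval test with a part-of test, which can be sketched as follows. The Feature record, the kind labels "three_prime_exon" and "five_prime_UTR", and the reading of "non-coding" as a flag on the exon are all assumptions for illustration.

```python
from typing import List, NamedTuple, Tuple


class Feature(NamedTuple):
    gene: str     # the gene this feature is part of
    kind: str     # hypothetical labels: "three_prime_exon", "five_prime_UTR"
    start: int    # inclusive
    end: int      # inclusive
    coding: bool  # assumed flag: does the feature contain coding sequence?


def overlaps(a: Feature, b: Feature) -> bool:
    """Spatial rule: two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def exons_overlapping_utrs(features: List[Feature]) -> List[Tuple[Feature, Feature]]:
    """Pairs of (non-coding 3' exon, 5' UTR) that overlap and belong to different genes."""
    exons = [f for f in features if f.kind == "three_prime_exon" and not f.coding]
    utrs = [f for f in features if f.kind == "five_prime_UTR"]
    return [(e, u) for e in exons for u in utrs if e.gene != u.gene and overlaps(e, u)]
```

The pairwise loop is quadratic; it is meant to show the rule, not to scale to a whole genome.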

Loose and Flexible. Rules are meant purely to ensure consistency. There will always be fuzzy areas where we want to allow freedom, because normal biology is like that. Constraints are NOT meant to perform any predictive function; they just provide a consistent definition.

A single framework that integrates other biological ontologies. One could have a 'knowledge base' centered around the genome. This KB would be amenable to reasoning. This is a significant change from relational, OO, or XML modeling; however, it is compatible with all of these. SO could be a framework for integrating data with other ontologies: product features would have slots for standard GO annotations, and variation features would have slots into phenotypic ontologies.

Build in a Bayesian Belief Network. Probabilities may be assigned to annotations or used to suggest new annotations. Define a model for binding sites and regulatory regions based on weight matrices, proximity to starts of genes, and so forth. A curator can interactively explore and ask questions like “OK, I have evidence for there being such and such a binding site here; what if I alter the priors, how does that affect other nodes in the network (statements in the knowledge base) pertaining to pathways?”

To paraphrase Brunelleschi on the importance of tools, circa 1425: I am accustomed to think about and construct in my mind some unheard of invention making it possible to create great and wonderful things.

GOBO Criteria 1. The ontologies are "open" and can be used by all without any constraint other than that their origin must be acknowledged. 2. The ontologies are in, or can be instantiated in, the GO syntax, extensions of this syntax, or DAML+OIL. 3. The ontologies are orthogonal to other ontologies already lodged with GOBO. 4. The ontologies share a unique identifier space. 5. The ontologies include definitions of their terms.

Giving it a go. Sanger Institute: Richard Durbin, Tim Hubbard. EBI: Michael Ashburner, Ewan Birney. Mouse Genome Database: Judith Blake, Carol Bult. BDGP: Chris Mungall, Brad Marshall, John Richter, ShengQiang Shu. Wormbase: Lincoln Stein.