OBOL: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall GO Advisors Meeting Denver, CO May.

Slides:



Advertisements
Similar presentations
Obol: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall Berkeley Drosophila Genome Project.
Advertisements

Semantics Static semantics Dynamic semantics attribute grammars
CSA2050: DCG I1 CSA2050 Introduction to Computational Linguistics Lecture 8 Definite Clause Grammars.
Grammars, Languages and Parse Trees. Language Let V be an alphabet or vocabulary V* is set of all strings over V A language L is a subset of V*, i.e.,
Of 27 lecture 7: owl - introduction. of 27 ece 627, winter ‘132 OWL a glimpse OWL – Web Ontology Language describes classes, properties and relations.
Statistical NLP: Lecture 3
Grammars.
Background information Formal verification methods based on theorem proving techniques and model­checking –to prove the absence of errors (in the formal.
GRAMMAR & PARSING (Syntactic Analysis) NLP- WEEK 4.
Application of OBO Foundry Principles in GO Chris Mungall Lawrence Berkeley Labs NCBO GO Consortium.
Automated tools to help construction of Trait Ontologies Chris Mungall Monarch Initiative Gene.
Iowa State University Animal Science Department Bioinformatics & Computational Biology Program - 01/16/06 1 Overview of Animal Trait Ontology and PATO.
C. Varela; Adapted w/permission from S. Haridi and P. Van Roy1 Declarative Computation Model Defining practical programming languages Carlos Varela RPI.
An interactive environment for creating and validating syntactic rules Panagiotis Bouros*, Aggeliki Fotopoulou, Nicholas Glaros Institute for Language.
Chapter 3 Program translation1 Chapt. 3 Language Translation Syntax and Semantics Translation phases Formal translation models.
Artificial Intelligence 2004 Natural Language Processing - Syntax and Parsing - Language Syntax Parsing.
Meaning and Language Part 1.
GO Ontology Editing Workshop: Using Protege and OWL Hinxton Jan 2012.
Editing Description Logic Ontologies with the Protege OWL Plugin.
1.3 Executing Programs. How is Computer Code Transformed into an Executable? Interpreters Compilers Hybrid systems.
11 CS 388: Natural Language Processing: Syntactic Parsing Raymond J. Mooney University of Texas at Austin.
Grammars.
ICS611 Introduction to Compilers Set 1. What is a Compiler? A compiler is software (a program) that translates a high-level programming language to machine.
1 CS 385 Fall 2006 Chapter 14 Understanding Natural Language (omit 14.4)
GO and OBO: an introduction. Jane Lomax EMBL-EBI What is the Gene Ontology? What is OBO? OBO-Edit demo & practical What is the Gene Ontology? What is.
A Z Approach in Validating ORA-SS Data Models Scott Uk-Jin Lee Jing Sun Gillian Dobbie Yuan Fang Li.
OBOL Open Bio-Ontology Language GO Meeting Stanford Jan 2004.
Ming Fang 6/12/2009. Outlines  Classical logics  Introduction to DL  Syntax of DL  Semantics of DL  KR in DL  Reasoning in DL  Applications.
Imports, MIREOT Contributors: Carlo Torniai, Melanie Courtot, Chris Mungall, Allen Xiang.
Open Biomedical Ontologies. Open Biomedical Ontologies (OBO) An umbrella project for grouping different ontologies in biological/medical field –a repository.
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
Design Pattern Interpreter By Swathi Polusani. What is an Interpreter? The Interpreter pattern describes how to define a grammar for simple languages,
Principles and Practice of Ontology Development: Making Definitions Computable Chris Mungall LBL.
Gene Ontology Consortium
C H A P T E R TWO Syntax and Semantic.
ISBN Chapter 3 Describing Semantics -Attribute Grammars -Dynamic Semantics.
1 Compiler Construction (CS-636) Muhammad Bilal Bashir UIIT, Rawalpindi.
TextBook Concepts of Programming Languages, Robert W. Sebesta, (10th edition), Addison-Wesley Publishing Company CSCI18 - Concepts of Programming languages.
For Wednesday Read chapter 23 Homework: –Chapter 22, exercises 1,4, 7, and 14.
GO terms implicitly refer to other term cysteine biosynthesis myoblast fusion hydrogen ion transporter activity snoRNA catabolism wing disc pattern formation.
1 Definite Clause Grammars for Language Analysis – A Survey of the Formalism and a Comparison with Augmented Transition Networks 인공지능 연구실 Hee keun Heo.
3.2 Semantics. 2 Semantics Attribute Grammars The Meanings of Programs: Semantics Sebesta Chapter 3.
Comp 311 Principles of Programming Languages Lecture 3 Parsing Corky Cartwright August 28, 2009.
Rules, Movement, Ambiguity
Artificial Intelligence: Natural Language
To Boldly GO… Amelia Ireland GO Curator EBI, Hinxton, UK.
OilEd An Introduction to OilEd Sean Bechhofer. Topics we will discuss Basic OilEd use –Defining Classes, Properties and Individuals in an Ontology –This.
+ From OBO to OWL and back again – a tutorial David Osumi-Sutherland, Virtual Fly Brain/FlyBase Chris Mungall – GO/LBL.
◦ Process of describing the structure of phrases and sentences Chapter 8 - Phrases and sentences: grammar1.
Parsing and Code Generation Set 24. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program,
Approach to building ontologies A high-level view Chris Wroe.
Concepts and Realization of a Diagram Editor Generator Based on Hypergraph Transformation Author: Mark Minas Presenter: Song Gu.
Formal Verification. Background Information Formal verification methods based on theorem proving techniques and model­checking –To prove the absence of.
GRAMMARS & PARSING. Parser Construction Most of the work involved in constructing a parser is carried out automatically by a program, referred to as a.
Understanding Naturally Conveyed Explanations of Device Behavior Michael Oltmans and Randall Davis MIT Artificial Intelligence Lab.
NATURAL LANGUAGE PROCESSING
Composing Music with Grammars. grammar the whole system and structure of a language or of languages in general, usually taken as consisting of syntax.
Mechanisms for Requirements Driven Component Selection and Design Automation 최경석.
Comp 411 Principles of Programming Languages Lecture 3 Parsing
Describing Syntax and Semantics
Advanced Computer Systems
System Software Unit-1 (Language Processors) A TOY Compiler
SOFTWARE DESIGN AND ARCHITECTURE
Web Ontology Language for Service (OWL-S)
Syntax Analysis Sections :.
ece 720 intelligent web: ontology and beyond
Natural Language - General
Representation, Syntax, Paradigms, Types
The Gene Ontology: an evolution
COMPILER CONSTRUCTION
Presentation transcript:

OBOL: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall GO Advisors Meeting Denver, CO May 2004

Outline ● Motivation – combinatorial issues with composite terms ● Approaches: annotation-time term composition vs tools for maintenance of large DAGs ● The OBOL System – Term decomposition using grammars – Generating computable logical class definitions – Rules and reasoning over class definitions ● Initial Results ● Strategies for using OBOL within GO/OBO

Manual maintenance of GO ● GO is 3 DAGs of over 16k terms ● Large DAGs of terms are hard to maintain ● “cross-products” produce combinatorial explosions and highly connected sub-graphs ● GO terms include OBO terms – eg oxygen binding; wing development ● Zipf's Law – Many terms not yet used in annotation (Ogren, pers. Comm.)

Example combinatorial explosion contraction muscle contraction smooth muscle contraction negative regulation of contraction Negative regulation of muscle contraction regulation regulation of contraction regulation of muscle contraction regulation of smooth muscle contraction negative regulation of smooth muscle contraction Negative regulation part_of “implicit” terms negative regulation negative regulation of muscle contraction actual GO terms implicit anatomical terms not shown affects part_of affects cell motility

One Approach: Properties ● One extreme solution is to remove composite terms from ontology altogether – Generate “anonymous” composite terms at annotation- time via property/slot restrictions to atomic terms ● binding ^ affects(interleukin-18) ● contraction ^ affects(muscle+type(smooth)) – These bindings constitute a class definition – But it is still necessary to make statements about composite terms in the ontology ● macrophage activation is_a immune cell activity ● fibrinolysis is_a negative regulation of blood coagulation

Another approach: Computationally aided ontology maintenance ● GO terms exhibit regularity in their syntactic structure – Substring relationships highly correlated with actual relationships – regulation of smooth muscle contraction – smooth muscle contraction – muscle contraction – contraction Ogren PV, Cohen KB, Acquaah-Mensah GK, Eberlein J, Hunter L The compositional structure of Gene Ontology terms. Pac Symp Biocomput 9:

OBOL: Syntax and Semantics ● What about using the term syntax to get at the meaning of the term? GO/OBO Term Class Definition may involve relationships to other OBO terms Inferences “interleukin binding” binding^affects(interleukin) “interleukin binding” is_a “cytokine binding” inferred from: interleukin is_a cytokine

How OBOL Works ● Term names are parsed using a grammar, generating parse trees ● Parse trees are turned into class definitions using transformation rules and property definitions – Both steps are reversible ● Inferences are made on the class definitions ● Implemented in Prolog

Computational Grammars ● A collection of transformation rules for parsing (decomposing) and generating (composing) sequences of symbols (ie words). ● A language grammar operates on words, which must be categorised into word senses. – e.g. noun, adjective, preposition ● GO/OBO term grammars require very few word senses

A simple OBO term grammar Term --> NP e.g. negative regulation of smooth muscle conraction NP --> NP PP e.g. negative regulation of smooth muscle conraction NP --> NOUN e.g. muscle NP --> ADJ NP e.g. smooth muscle NP --> NP NP e.g. smooth muscle contraction PP --> PREP NP e.g. of smooth muscle contraction This is a subset of the whole OBO grammar: Concrete nouns are treated the same way as abstract nouns (contraction is treated as a noun, even though it is the inflecte form of the verb contract)

A Parse Tree for a GO Term negative regulation of smooth muscle contraction ADJ NOUN P ADJ NOUN NOUN NP NP NP NP -> ADJ NP* NP NP -> NP NP* PP PP -> P NP* NP NP -> NP* PP smooth muscle contraction ADJ NOUN NOUN NP NP -> NOUN NP* NP NP -> ADJ NP* alternate parses are possible (parse forests)

Atomic ontologies: OBO wordlists ● A term grammar requires a vocabulary of words ● These words correspond to atomic terms from the OBO ontologies ● Currently the wordlists are generated semi- automatically ● Relational adjectives are paired with the appropriate noun – [cytosolic, cytosol], [coated, coat] ● OBOL exhibits graceful degradation with incomplete wordlists – unrecognised words treated as orphan nouns

Making Logical Class Definitions from Parse Trees ● A Class definition is a compound term with property/slot restrictions ● contraction^affects(muscle^type(smooth)) ● A class definition can be exported using either OBO or OWL format ● A class definition can be generated from a parse tree using transformation rules and property/slot definitions

Properties guide class tree building Property: affects_cell_type domain: biological_process range: cell_type context: modifier(np) Property: regulates domain: regulation range: biological_process context: preposition(of) The property definitions above constrain how one class (the domain, or subject) can relate to another (the range, or object) in a given grammatical context Muscle contraction Contraction^affects(muscle) Regulation of muscle contraction Regulation^regulates(contraction^affects(muscle))

Reasoning over class definitions ● We can use classdef rules to... – place new terms in the correct place in the DAG – check for missing relationships in the DAG – find inconsistencies between ontologies ● Method: – Inference rules implemented in Prolog – Interactive use or as DAG-Edit plugin – OR export to OWL and use Protege + Racer [not tested yet]

OBOL Architecture API reasone r class builder obo DCG word db pan-obo ontology db obo file ont Y obo file ont Z obo file ont X word file X word file Z word file Y property defs command line/ prolog interpreter java/prolog bridge DAG Edit curator OBO and GO wordlists grey boxes indicate prolog modules obo/owl logical defs other reasoners?

Initial Results From biological_process and cellular_component only ( 3630/9067 have unique parses) Example suspected missing relationships: nucleolar chromatin part_of nucleolus clathrin-coated vesicle has_part clathrin coat chromoplast membrane is_a plastid membrane nuclear microtubule part_of nucleus vitamin E biosynthesis is_a vitamin E metabolism OBOL doesn't currently check for inverses!

Using OBOL with GO/OBO ● The complete OBOL system can be implemented within GO/OBO in a variety of ways – “Behind the scenes” ● GO curators maintain same mode of working and receive periodic auto-generated reports – Within DAG-Edit ● GO curator uses OBOL interactively – To transition GO to a more “formal” ontology

(1)Using OBOL Behind the Scenes – Maintain facade of narrative approach, whilst implementing a combinatorial approach behind the scenes ● Curators carry on working their current mode ● OBOL is used periodically to check the ontology ● suggested edits are submitted to curators en-masse ● OBOL is occasionally invoked on-demand to produce a new subgraph of cross-products (eg development vs anatomy) – Longer feedback cycle is less efficient – We are ready to go in this mode now

(2)Using OBOL from DAG-Edit ● OBOL is invoked by curators from DAG-Edit ● suggested corrections can be highlighted ● new composite terms can be automatically placed in the DAG – Errors can be spotted immediately ● Slot/property based annotation ● non-curators producing annotations (instances) can create new classdefs on-the-fly ● OBOL can check their validity and automatically create the subsumption path

(3)Using OBOL to recast ontologies ● OBOL is used as a one-off to help generate classdefs for all terms in an ontology ● classdefs are then maintained by curator ● Subsequent uses of OBOL are in reverse-mode; automatic generation of term names from classdefs ● OBOL or other reasoner invoked from DAG-Edit to spot mistakes and generate subsumption paths – DAG-Edit and OBO format now supports classdefs (aka complete definitions) – Major transition; more or less work for curators?

Problems to address ● Parsing issues: chemicals, wordy terms ● Logic issues: sensu ● Supporting OBO orthogonal ontologies ● Difficulties with biochemical ontology; plurals ● No generic anatomy ontology (as yet) ● No protein/complex ontology – Can we use OBOL to help construct these other ontologies? ● YES

What next? ● Better coverage: – refine slot/property definitions ● Grammar for human-readable text definitions ● Extend inference rules?? ● e.g. Non-monotonic reasoning (cell HAS-PART nucleus EXCEPT erythroctye) ● Generate OWL from derived class definitions ● distribute (with caveats) to logic community ● Use protege+racer as reasoner

Conclusions ● OBOL can help with the combinatorial explosion in a number of ways ● Initial results with incomplete wordlists and property definitions are promising ● Combining a term grammar with reasoning is powerful and offers significant advantages over either purely syntactic or semantic approaches ● Should OBO focus more the atomic units of the ontologies?

Acknowledgements ● Suzanna Lewis ● John Richter ● Brad Marshall ● Karen Eilbeck ● David Hill ● Joel Richardson ● GO Curators ● Michael Ashburner ● Chris Wroe ● Robert Stevens

The OBO Universe (partial) Phenotyp e SequenceAnatomy Cell Protein Chemical Compone nt Process Function Fish-Anat Fly-Anat

Detecting inconsistencies Y binding X binding Z binding X Z Y

Prolog as an ontology language ● % DATABASE OF FACTS ● isa(carb_binding, binding). ● isa(polysac_binding, carb_binding). ● isa(chitin_binding, polysac_binding) ● isa(cellulose_binding, polysac_binding). ● % INFERENCE RULES ● isaT(X,Y):- isa(X,Y). ● isaT(X,Y):-isa(X,Z), isaT(Z,Y). ● ?- isaT(chitin_binding, binding). ● YES ● ?-isaT(X, polysac_binding). ● X=carb_binding. ● X=chitin_binding. ● X=cellulose_binding. ● ?-isaT(chitin_binding, cellulose_binding). ● NO ● ?-isaT(X,Y). % returns all paths

is_a regulation qualifier regulates is_a affects_cell_type is_a has_type contraction muscle negative smooth class(regulation qualifier=class(negative ) regulates=class(contraction affects_cell_type=class(muscle has_type=class(smooth )))) Prolog internal representation

Prolog Grammar Implementation ● Prolog: the classic logic programming language – High-level declarative language, natural choice for ontologies; built in “database” – Definite Clause Grammars (DCGs)part of the language; DCGs allow passing data up the parse tree ● XSB Prolog – Uses tabling (more efficient, less re-calculation) – Tabling + DCGs = chart parsing (Earley's algorithm) ● This means we can have left-recursive grammars

A Formal Grammar for OBO terms ● All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?) ● A NOUN-PHRASE is (recursively) made from – a NOUN (includes inflected verbs; eg binding) – an ADJECTIVE followed by a NOUN-PHRASE eg inner membrane – a NOUN-PHRASE preceeded by a NOUN-PHRASE acting as ADJECTIVE; eg clathrin coat – a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg regulation of transcription – an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; eg clathrin-coated vesicle ● Precedence rules are also required to prune parse forest ● Simple but effective

● A formal grammar is a set of production rules operating over terminal symbols (eg words) and non-terminal symbols (eg word/phrase categories) ● The rules determine how sequences of symbols can be transformed, making a parse tree Sentence -> Subject Verb Object Subject -> Article Noun Object -> Article Noun Article -> a | the Verb -> ate | chased Noun -> cat | banana | mouse GENERATING: Start at top ('sentence') and apply rules until all symbols are terminal PARSING: Start at bottom – a sequence of terminals; apply rules, combining symbols if necessary Parse tree for a simple sentence, “the cat ate the banana”

negative regulation of smooth muscle contraction ADJ NOUN P ADJ NOUN NOUN NP NP NP NP -> ADJ NOUN* NP NP -> NP NP* PP PP -> P NP* NP NP -> NP* PP Parse tree Class Definition (shown as DAG) is_a regulation qualifier regulates is_a affects_cell_type is_a has_type contraction muscle negative smooth thick lines inidcate stem terms the parse tree shows the synctactic structure we need new OBO format (or OWL) to represent cross-products (aka intersections / complete defs) DAG definition is minimal – non-definitional relationships to other terms not shown the classdef shows the logical structure recurse down tree applying grammatical context rules to get property fillers X XX X X negative regulation of smooth muscle contraction smooth muscle contraction smooth muscle

COPII-coated vesicle membrane class(membrane “COPII-coated vesicle membrane” part_of=class(vesicle “COPII-coated vesicle” has_part=class(coat “COPII coat” made_from=class(COPII “COPII”)))) class/term name shown in quotes; these can be derived by reversion the transformation requires use of inverse properties (has_part vs part_of) - supported in new OBO format. the above classdef is consistent with what is in the GO cellular_component ontology

Inference of intermediate terms and IS_As – example rule class(regulation process_regulated=R qual=Q) is_a class(regulation process_regulated=R' qual=Q) R is_a R' FORALL classdef pairs IFF the stem-class is the same AND all the property-values in the restriction-list are identical EXCEPT for one property, in which the property- values are linked by an isa, THEN the classdefs are linked by an isa class(C P1=V1 P2=V2..Px=Vx Pn=Vn) is_a class(C P1=V1 P2=V2..Px=Vx' Pn=Vn) Vx is_a Vx'