Presentation is loading. Please wait.

Presentation is loading. Please wait.

Obol: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall Berkeley Drosophila Genome Project.

Similar presentations


Presentation on theme: "Obol: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall Berkeley Drosophila Genome Project."— Presentation transcript:

1 Obol: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall Berkeley Drosophila Genome Project / GO Consortium

2 Obol Obol is a system for discovering and reasoning over hidden knowledge in ontologies Obol is a system for discovering and reasoning over hidden knowledge in ontologies Obol is useful for helping maintain cross- products in the Gene Ontology Obol is useful for helping maintain cross- products in the Gene Ontology Obol works by parsing syntax and semantics from GO and OBO terms Obol works by parsing syntax and semantics from GO and OBO terms

3 Motivation: Ontology Maintenance GO: 3 ontologies, 16k terms, 23k relationships GO: 3 ontologies, 16k terms, 23k relationships OBO: cell, biochemical, sequence and multiple anatomical ontologies OBO: cell, biochemical, sequence and multiple anatomical ontologies Many GO terms are combinatorial (cross- products) Many GO terms are combinatorial (cross- products) regulation of neutrophil differentiation regulation of neutrophil differentiation No explicit links between ontologies No explicit links between ontologies Difficult to maintain manually Difficult to maintain manually

4 Some Sample GO terms regulation of neutrophil differentiation. neutrophil differentiation. granuloctye differentiation. smooth muscle contraction. nucleolar chromatin. nucleolus. oxygen transport. negative regulation of interleukin-2 biosynthesis. oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, reduced iron-sulfur protein as one donor, and incorporation of one atom of oxygen.

5 Graph complexity interleukin-2 biosynthesis cytokine biosynthesis regulation of interleukin-2 biosynthesis regulation of cytokine biosynthesis regulation of biosynthesis negative regulation of interleukin-2 biosynthesis negative regulation of cytokine biosynthesis negative regulation of biosynthesis part-of is-a

6 Automatic inference of relationships Some relationships can be derived computationally… Some relationships can be derived computationally… …provided we have complete logical definitions …provided we have complete logical definitions regulation ^ (regtype:negative) ^ (regprocess:biosynthesis ^ (makes:interleukin-2)) Tools exist for reasoning over these logical definitions, but…

7 Generating logical definitions Generating and maintaining logical definitions for GO/OBO is non-trivial Generating and maintaining logical definitions for GO/OBO is non-trivial Obol exploits the highly regular grammatical structure of GO term names Obol exploits the highly regular grammatical structure of GO term names regulation of X, never X regulation regulation of X, never X regulation Y biosynthesis, never biosynthesis of Y Y biosynthesis, never biosynthesis of Y no stemming required no stemming required Obol derives candidate class definitions from term names, and performs basic reasoning over them Obol derives candidate class definitions from term names, and performs basic reasoning over them

8 Obol: parsing and reasoning GO/OBO Term Lexical string Class Definition(s) may involve relationships to other OBO terms Inferences using definitions and existing ontologies interleukin-2 biosynthesis biosynthesis^(makes:interleukin-2) interleukin-2 biosynthesis is_a cytokine biosynthesis inferred from: interleukin-2 is_a cytokine

9 How Obol Works term names are broken into lexical tokens (words) using a tokeniser term names are broken into lexical tokens (words) using a tokeniser tokens are parsed using a grammar, generating parse trees tokens are parsed using a grammar, generating parse trees parse trees are turned into class definitions using transformation rules and property definitions parse trees are turned into class definitions using transformation rules and property definitions transformation is reversible transformation is reversible class definitions are reasoned over class definitions are reasoned over implemented in XSB Prolog implemented in XSB Prolog

10 Word tokens Obol uses an atomic vocabulary of word tokens Obol uses an atomic vocabulary of word tokens tokens are partitioned by ontology domain tokens are partitioned by ontology domain cell, anatomy, biological process, etc cell, anatomy, biological process, etc tokens have a grammatical type tokens have a grammatical type adj, noun, prep, relational adj, special adj, noun, prep, relational adj, special vocabularies need not be correct or complete vocabularies need not be correct or complete

11 Computational Grammars formal grammars can elucidate sentence structure formal grammars can elucidate sentence structure grammars transform token lists into parse trees grammars transform token lists into parse trees multiple parses may be possible multiple parses may be possible parses are reversible parses are reversible a grammar is a collection of transformation rules a grammar is a collection of transformation rules

12 A simple OBO term grammar Term --> NP e.g. negative regulation of interleukin-2 biosynthesis NP --> NP PP e.g. negative regulation of interleukin-2 biosynthesis NP --> NOUN e.g. interleukin;; regulation;; biosynthesis NP --> NP-TOK e.g. interleukin-2 NP --> ADJ NP e.g. negative regulation NP --> NP NP e.g. interleukin-2 biosynthesis PP --> PREP NP e.g. of interleukin-2 biosynthesis (subset of the whole OBO grammar)

13 Applying grammar rules negative regulation of interleukin-2 biosynthesis adj nounprepnountoknoun np np -> n np np -> adj np npnp -> np-tok np np -> np np pp pp -> p np np np -> np pp term -> np

14 Generating Class Definitions A parse tree shows the syntax structure of a term A parse tree shows the syntax structure of a term A class definition is a description of the meaning of a term A class definition is a description of the meaning of a term An Obol classdef is a cross product (intersection) of necessary and sufficient conditions An Obol classdef is a cross product (intersection) of necessary and sufficient conditions Classdefs are generated from parse trees using tree transform rules and property descriptions Classdefs are generated from parse trees using tree transform rules and property descriptions Classdefs can be exported using obo or OWL format Classdefs can be exported using obo or OWL format

15 Property definitions guide class construction Property name: makes domain: biosynthesis range: substance grammar: np_modifier interleukin-2 np biosynthesis biosynthesis^(makes:interleukin-2)

16 Property definitions guide class construction negative npadj np regulation regulation^(regtype:negative) Property name: regtype domain: regulation range: neg/pos grammar: np_modifier

17 Property definitions guide class construction regulation^ (regtype:negative) np pp biosynthesis^ (makes:IL-2) regulation^ (regtype:negative)^ (regprocess:biosynthesis^(makes:IL-2)) Property name: regprocess domain: regulation range: biological_process grammar: prep(of) np of

18 Unparseable terms and multi- parse terms biological process molecular function cellular component single-token terms excluded from this analysis

19 Reasoning over class definitions Using class definitions, we can: Using class definitions, we can: autocreate parentage for new terms autocreate parentage for new terms check for missing relationships check for missing relationships find inconsistencies between ontologies find inconsistencies between ontologies generate implicit orthogonal ontologies generate implicit orthogonal ontologies Method: Method: Use native OBOL rules (via prolog or DAG- Edit) Use native OBOL rules (via prolog or DAG- Edit) OR use external reasoner; eg RACER, FaCT OR use external reasoner; eg RACER, FaCT

20 Finding missing relationships Obol is run periodically on GO to check for missing IS A and PART OF relationships Obol is run periodically on GO to check for missing IS A and PART OF relationships Multiple parses produce false-positives Multiple parses produce false-positives 223 missing relationships added to GO 223 missing relationships added to GO ToDo: increase specificity by improving vocabularies and property definitions ToDo: increase specificity by improving vocabularies and property definitions

21 Obol sample report nucleolar chromatin PART OF nucleus clathrin-coated vesicle HAS PART clathrin coat chromoplast membrane IS A plastid membrane nuclear microtubule PART OF nucleus vitamin E biosynthesis IS A vitamin E metabolism uracil permease activity IS A permease activity chloroplast envelope IS A plastid envelope negative regulation of lipid biosynthesis IS A negative regulation of lipid metabolism ketone body metabolism IS A ketone metabolism dense nuclear body IS A nuclear body false positive! inverse present

22 Aligning to the OBO cell ontology cardioblast differentiation cardiac cell differentiation cardioblast mesodermal cell animal cell cardiac muscle cell muscle cell DEVELOPS FROM most differentiation terms align precisely; some dont… ???

23 Deriving existing GO relationships FunctionProcessComponent all relationships8002136131691 obol-derivable relationships 4793055346 non-trivially derivable relationships 0108963

24 Obol as an ontology curation tool Obol can be used by GO curators in a variety of ways Obol can be used by GO curators in a variety of ways Behind the scenes Behind the scenes Iterative Iterative GO curator receives periodic suggestion reports GO curator receives periodic suggestion reports Continuous Continuous GO curator uses OBOL interactively via DAG-Edit plugin GO curator uses OBOL interactively via DAG-Edit plugin To help the transition to a fully specified ontology To help the transition to a fully specified ontology GO curators then maintain class definitions GO curators then maintain class definitions Obol as a search tool? Obol as a search tool?

25 Problems to address Integration with curation process Integration with curation process Memory usage Memory usage Syntax parsing Syntax parsing chemical terms, long terms chemical terms, long terms Dealing with and, or and not Dealing with and, or and not Generating text definitions Generating text definitions Word list maintenance Word list maintenance solution: integrate with ontology maintenance solution: integrate with ontology maintenance Ontology dependencies Ontology dependencies protein and generic anatomy ontologies needed protein and generic anatomy ontologies needed Obol can be used to help generate these Obol can be used to help generate these

26 Conclusions Obol is useful for maintainng large GO- style ontologies Obol is useful for maintainng large GO- style ontologies combination of semantic parsing with reasoning is powerful combination of semantic parsing with reasoning is powerful benefits of both GO-style ontology development and formal reasoning benefits of both GO-style ontology development and formal reasoning

27 Acknowledgements Berkeley/GO John Richter Brad Marshall Karen Eilbeck Suzanna Lewis Gerry Rubin Jackson Labs/GO David Hill Joel Richardson Judith Blake GO Curators Midori Harris Jennifer Clark Amelia Ireland Jane Lomax Manchester Chris Wroe Robert Stevens Phillip Lord J Michael Cherry Michael Ashburner all the GO Consortium

28 The OBO Universe (partial) Phe noty pe Seq uenc e Anat omy Cell Prot ein Che mica l Com pone nt Proc ess Func tion Fish- Anat Fly- Anat

29 Detecting inconsistencies Y bindi ng X bindi ng Z bindi ng X Z Y

30 Prolog as an ontology language % DATABASE OF FACTS % DATABASE OF FACTS isa(carb_binding, binding). isa(carb_binding, binding). isa(polysac_binding, carb_binding). isa(polysac_binding, carb_binding). isa(chitin_binding, polysac_binding) isa(chitin_binding, polysac_binding) isa(cellulose_binding, polysac_binding). isa(cellulose_binding, polysac_binding). % INFERENCE RULES % INFERENCE RULES isaT(X,Y):- isa(X,Y). isaT(X,Y):- isa(X,Y). isaT(X,Y):- isa(X,Z), isaT(Z,Y). isaT(X,Y):- isa(X,Z), isaT(Z,Y). ?- isaT(chitin_binding, binding). YES ?-isaT(X, polysac_binding). X=carb_binding. X=chitin_binding. X=cellulose_binding. ?-isaT(chitin_binding, cellulose_binding). NO ?-isaT(X,Y). % returns all paths

31 regulation regulates contraction muscle negative smooth class(regulation qualifier=class(ne gative ) regulates=class(c ontraction affects_cell_type =class(muscle has_type=class(s mooth )))) Prolog internal representation

32 Prolog Grammar Implementation Prolog: the classic logic programming language Prolog: the classic logic programming language High-level declarative language, natural choice for ontologies; built in database High-level declarative language, natural choice for ontologies; built in database Definite Clause Grammars (DCGs)part of the language; DCGs allow passing data up the parse tree Definite Clause Grammars (DCGs)part of the language; DCGs allow passing data up the parse tree XSB Prolog XSB Prolog Uses tabling (more efficient, less re-calculation) Uses tabling (more efficient, less re-calculation) Tabling + DCGs = chart parsing (Earley's algorithm) Tabling + DCGs = chart parsing (Earley's algorithm) This means we can have left-recursive grammars This means we can have left-recursive grammars

33 A Formal Grammar for OBO terms All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?) All(?) GO/OBO terms are NOUN-PHRASES (exception: phenotypes?) A NOUN-PHRASE is (recursively) made from A NOUN-PHRASE is (recursively) made from a NOUN (includes inflected verbs; eg binding) a NOUN (includes inflected verbs; eg binding) an ADJECTIVE followed by a NOUN-PHRASE eg inner membrane an ADJECTIVE followed by a NOUN-PHRASE eg inner membrane a NOUN-PHRASE preceeded by a NOUN-PHRASE acting as ADJECTIVE; eg clathrin coat a NOUN-PHRASE preceeded by a NOUN-PHRASE acting as ADJECTIVE; eg clathrin coat a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg regulation of transcription a NOUN-PHRASE then PREPOSITION then NOUN-PHRASE; eg regulation of transcription an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; eg clathrin-coated vesicle an (optional) NOUN-PHRASE then a RELATIONAL ADJECTIVE then a NOUN-PHRASE; eg clathrin-coated vesicle Precedence rules are also required to prune parse forest Precedence rules are also required to prune parse forest Simple but effective Simple but effective

34 A formal grammar is a set of production rules operating over terminal symbols (eg words) and non-terminal symbols (eg word/phrase categories) A formal grammar is a set of production rules operating over terminal symbols (eg words) and non-terminal symbols (eg word/phrase categories) The rules determine how sequences of symbols can be transformed, making a parse tree The rules determine how sequences of symbols can be transformed, making a parse tree Sentence -> Subject Verb Object Subject -> Article Noun Object -> Article Noun Article -> a | the Verb -> ate | chased Noun -> cat | banana | mouse GENERATING: Start at top ('sentence') and apply rules until all symbols are terminal PARSING: Start at bottom – a sequence of terminals; apply rules, combining symbols if necessary Parse tree for a simple sentence, the cat ate the banana

35 negative regulation of smooth muscle contraction ADJ NOUN P ADJ NOUN NOUN NP NP NP NP -> ADJ NOUN* NP NP -> NP NP* PP PP -> P NP* NP NP -> NP* PP Parse tree Class Definition (shown as DAG) regulation regulates contraction muscle negative smooth thick lines inidcate stem terms the parse tree shows the synctactic structure we need new OBO format (or OWL) to represent cross-products (aka intersections / complete defs) DAG definition is minimal – non-definitional relationships to other terms not shown the classdef shows the logical structure recurse down tree applying grammatical context rules to get property fillers X XX X X negative regulation of smooth muscle contraction smooth muscle contraction smooth muscle

36 COPII-coated vesicle membrane class(membrane COPII-coated vesicle membrane part_of=class(ves icle COPII-coated vesicle has_part=class(c oat COPII coat made_from=class (COPII COPII)))) class/term name shown in quotes; these can be derived by reversion the transformation requires use of inverse properties (has_part vs part_of) - supported in new OBO format. the above classdef is consistent with what is in the GO cellular_component ontology

37 Outline Motivation – combinatorial issues with composite terms Motivation – combinatorial issues with composite terms Approaches: annotation-time term composition vs tools for maintenance of large DAGs Approaches: annotation-time term composition vs tools for maintenance of large DAGs The OBOL System The OBOL System Term decomposition using grammars Term decomposition using grammars Generating computable logical class definitions Generating computable logical class definitions Rules and reasoning over class definitions Rules and reasoning over class definitions Initial Results Initial Results Strategies for using OBOL within GO/OBO Strategies for using OBOL within GO/OBO

38 Manual maintenance of GO GO is 3 DAGs of over 16k terms GO is 3 DAGs of over 16k terms Large DAGs of terms are hard to maintain Large DAGs of terms are hard to maintain cross-products produce combinatorial explosions and highly connected sub- graphs cross-products produce combinatorial explosions and highly connected sub- graphs GO terms include OBO terms GO terms include OBO terms eg oxygen binding; wing development eg oxygen binding; wing development Zipf's Law Zipf's Law Many terms not yet used in annotation (Ogren, pers. Comm.) Many terms not yet used in annotation (Ogren, pers. Comm.)

39 Example combinatorial explosion contraction muscle contraction smooth muscle contraction negative regulation of contraction Negative regulation of muscle contraction regulati on regulation of contraction regulation of muscle contraction regulation of smooth muscle contraction negative regulation of smooth muscle contraction Negative regulation part_of implici t terms negative regulation negative regulation of muscle contraction actual GO terms implicit anatomical terms not shown affects part_of affects cell motility

40 One Approach: Properties One extreme solution is to remove composite terms from ontology altogether One extreme solution is to remove composite terms from ontology altogether Generate anonymous composite terms at annotation-time via property/slot restrictions to atomic terms Generate anonymous composite terms at annotation-time via property/slot restrictions to atomic terms binding ^ affects(interleukin-18) binding ^ affects(interleukin-18) contraction ^ affects(muscle+type(smooth)) contraction ^ affects(muscle+type(smooth)) These bindings constitute a class definition These bindings constitute a class definition But it is still necessary to make statements about composite terms in the ontology But it is still necessary to make statements about composite terms in the ontology macrophage activation is_a immune cell activity macrophage activation is_a immune cell activity fibrinolysis is_a negative regulation of blood coagulation fibrinolysis is_a negative regulation of blood coagulation

41 Another approach: Computationally aided ontology maintenance GO terms exhibit regularity in their syntactic structure GO terms exhibit regularity in their syntactic structure Substring relationships highly correlated with actual relationships Substring relationships highly correlated with actual relationships – regulation of smooth muscle contraction – smooth muscle contraction – muscle contraction – contraction Ogren PV, Cohen KB, Acquaah- Mensah GK, Eberlein J, Hunter L. 2004. The composition al structure of Gene Ontology terms. Pac Symp Biocomput 9: 214-215.

42 OBOL: Syntax and Semantics What about using the term syntax to get at the meaning of the term? What about using the term syntax to get at the meaning of the term? GO/OBO Term Class Definition may involve relationships to other OBO terms Inferences interleukin binding binding^affects(interleukin) interleukin binding is_a cytokine binding inferred from: interleukin is_a cytokine

43 Inference of intermediate terms and IS_As – example rule class(regulation process_regulated =R qual=Q) is_a class(regulation process_regulated =R' qual=Q) R is_a R' FORALL classdef pairs IFF the stem- class is the same AND all the property-values in the restriction-list are identical EXCEPT for one property, in which the property-values are linked by an isa, THEN the classdefs are linked by an isa class(C P1=V1 P2=V2..Px=Vx Pn=Vn) is_a class(C P1=V1 P2=V2..Px=Vx' Pn=Vn) Vx is_a Vx'


Download ppt "Obol: Open Bio-Ontology Language Using grammars to extract and use implicit knowledge in the GO and OBO Chris Mungall Berkeley Drosophila Genome Project."

Similar presentations


Ads by Google