Chemical named entity recognition and literature mark-up. Colin Batchelor, Informatics Department, Royal Society of Chemistry


1 Chemical named entity recognition and literature mark-up. Colin Batchelor, Informatics Department, Royal Society of Chemistry. batchelorc@rsc.org

2 Overview. Project Prospect: what we find and how we find it. RDF: how should we be disseminating it? Next steps: basics for a chemical ontology.


9 Project Prospect: What do we find? Chemical compounds; chemical terms from the IUPAC Gold Book; gene products (function, process, location); nucleotide and polypeptide sequence terms; cell types.

10 Project Prospect: How do we find it? For compound names: ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007), ~20% PubChem, ~20% ChemDraw. For compound numbers: ~70% author ChemDraw, ~30% editors.


12 RDF in an RSS reader

13 RDF: how we do it now. We use the Content module from RSS 1.0: http://web.resource.org/rss/1.0/modules/content. But in what sense does an article "contain" pyridine or base pairs? We would much rather have proper RDF predicates, e.g. is_about, mentions.
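To make the wish for proper predicates concrete, here is a minimal sketch of what is_about and mentions statements could look like as N-Triples. The `http://example.org/prospect#` namespace, the function names, and the choice of which entity the article is "about" are all assumptions made for illustration; they are not an RSC vocabulary or data from the talk. The article link and the two ChEBI identifiers are taken from other slides in this deck.

```python
# Hypothetical namespace for the predicates the slide asks for;
# this is NOT a published RSC vocabulary.
PROSPECT = "http://example.org/prospect#"
# OBO-style PURL prefix for ChEBI terms.
CHEBI = "http://purl.obolibrary.org/obo/CHEBI_"


def triple(subject: str, predicate: str, obj: str) -> str:
    """Serialise one statement as an N-Triples line (all three terms are IRIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def describe_article(article_iri: str, is_about: list[str], mentions: list[str]) -> list[str]:
    """Emit one triple per entity, keeping 'is_about' and 'mentions' distinct."""
    lines = [triple(article_iri, PROSPECT + "is_about", CHEBI + i) for i in is_about]
    lines += [triple(article_iri, PROSPECT + "mentions", CHEBI + i) for i in mentions]
    return lines


# Article link as shown on the "what it looks like now" slide; the
# about/mentions assignment below is purely illustrative.
article = "http://xlink.rsc.org/?DOI=b716356h&RSS=1"
for line in describe_article(article, is_about=["16069"], mentions=["24780"]):
    print(line)
```

The point of the sketch is only that the relationship between document and entity becomes an explicit predicate, rather than being implied by embedding the entity in a content block.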

14 RDF: what it looks like now. [… title] http://xlink.rsc.org/?DOI=b716356h&RSS=1 [… blah] [… human-readable stuff …] [… dublin core stuff …]

15 Basics for a chemical ontology. 1. Unambiguous representation of objects of chemical discourse. 2. Proper parthood relations.

16 Basics for a chemical ontology: 1. Objects of chemical discourse. Must be able to represent and clearly distinguish: compounds; classes of compound; parts of molecules; mixtures. Would be nice to have: disambiguation cues for the first three.

17 Imidazole

18 An imidazole

19 The imidazole side-chain/group/ring

20 Can ChEBI handle this? ☺ Imidazoles (!) (CHEBI:24780). ☺ Imidazole (CHEBI:16069). ☹ Imidazole ring: not yet. ☹ Imidazolyl group: not yet (but methyl, benzyl, etc.). … and there are no disambiguation cues.

21 Disambiguation. One Sense per Discourse (Gale et al. 1992): … this doesn't hold at all. One Sense per Collocation (Yarowsky 1993): … matches our intuitions.

22 Disambiguation: What a one sense per collocation feature set might look like. CLASS: w(–1) = a, an, the, this; w(0) plural (bit of a cheat, as not a collocation). PART: w(–1) = bridging, terminal; w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more); w(+1)w(+2) = building block, protecting group, side chain.
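The feature set above can be sketched as a small rule-based labeller. This is an illustration only, not the system described in the talk: the function names, the tokeniser, the order in which the rules fire, the "ends in s" plural test, and the COMPOUND fallback are all assumptions. The cue words are copied from the slide, whose PART list is explicitly incomplete.

```python
import re

# Cue words copied from the slide; its PART list continues ("and many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def tokenize(text: str) -> list[str]:
    """Naive tokeniser: runs of letters and digits, so 'side-chain' splits in two."""
    return re.findall(r"[A-Za-z0-9]+", text)


def classify(tokens: list[str], i: int) -> str:
    """Label the chemical name at tokens[i] as PART, CLASS or COMPOUND."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""
    # Assumption: PART cues are checked first, so that a phrase such as
    # "the imidazole side chain" is not captured by the determiner rule.
    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Crude plural test standing in for the slide's w(0) plural feature.
    if prev in CLASS_PREV or tokens[i].lower().endswith("s"):
        return "CLASS"
    # Assumption: anything without a cue is read as the specific compound.
    return "COMPOUND"


for text, i in [("Imidazole was added", 0),
                ("an imidazole", 1),
                ("the imidazole side-chain", 1)]:
    print(text, "->", classify(tokenize(text), i))
```

The three test phrases mirror the Imidazole / An imidazole / The imidazole side-chain slides. Note that "ring" and "group" are not in the cue list reproduced on the slide, so those phrases would need the fuller list to be labelled PART.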

23 Basics for a chemical ontology: 2. Parthood relations. Parthood in ChEBI means at least three things. Is necessarily chemically part of: carbonyl group part_of carbonyl compounds.

24 Basics for a chemical ontology: 2. Parthood relations. Is possibly chemically part of: lead(2+) part_of lead diacetate (most lead(2+) isn't); electron part_of muonium (!).

25 Basics for a chemical ontology: 2. Parthood relations. Is part of a mixture: kanamycin A part_of kanamycin.

26 Basics for a chemical ontology: 2. Parthood relations. Solution 1: define relationships according to the pattern "all instances of X have a relationship with some Y" (Smith et al., Relations in biomedical ontologies, 2005). Carbonyl compound has_part carbonyl group. Lead diacetate has_part lead(2+) (?!). Muonium has_part electron. Kanamycin has_part kanamycin A (?!).
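The all-some pattern can be checked mechanically on a toy set of individuals, which also shows why the direction of the relation matters. Everything in this sketch is hypothetical: the individuals, their kinds, and the helper names are invented for illustration and are not ChEBI data. It also deliberately ignores the molecule-versus-sample question raised by the "(?!)" marks.

```python
# Hypothetical individuals and the kind each one instantiates.
kind = {
    "salt_1": "lead diacetate",
    "salt_2": "lead diacetate",
    "salt_3": "some other lead salt",
    "ion_1": "lead(2+)",
    "ion_2": "lead(2+)",
    "ion_3": "lead(2+)",
}

# Instance-level parthood, stored as whole -> set of parts.
has_part = {"salt_1": {"ion_1"}, "salt_2": {"ion_2"}, "salt_3": {"ion_3"}}

# The inverse relation, part -> set of wholes.
part_of: dict[str, set[str]] = {}
for whole, parts in has_part.items():
    for part in parts:
        part_of.setdefault(part, set()).add(whole)


def instances(k: str) -> list[str]:
    """All individuals of the given kind."""
    return [name for name, value in kind.items() if value == k]


def all_some(x_kind: str, relation: dict[str, set[str]], y_kind: str) -> bool:
    """True if every instance of x_kind is related to at least one instance of y_kind."""
    return all(
        any(kind[target] == y_kind for target in relation.get(x, ()))
        for x in instances(x_kind)
    )


# Every lead diacetate individual has a lead(2+) part in this toy data ...
print(all_some("lead diacetate", has_part, "lead(2+)"))
# ... but not every lead(2+) individual is part of a lead diacetate.
print(all_some("lead(2+)", part_of, "lead diacetate"))
```

In this toy data the has_part direction satisfies the pattern while the part_of direction does not, which is the "most lead(2+) isn't" observation restated at the instance level.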

27 Basics for a chemical ontology: 2. Parthood relations. Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships. Carbonyl compound molecule has_part carbonyl substituent. Muonium atom has_part electron. Kanamycin has_component kanamycin A. Lead diacetate has_component lead(2+) (?!).
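One way to picture Solution 2 is to tag each term with the level it lives at and let that tag choose the relation name. This is a sketch under stated assumptions, not a proposal from the talk: the `Level` and `Term` types, the `relate` helper, and the level assigned to each term are one reading of the four examples on the slide.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    MOLECULAR = "molecular"
    SAMPLE = "sample"


@dataclass(frozen=True)
class Term:
    name: str
    level: Level  # assumption: each term sits at exactly one level


def relate(whole: Term, part_name: str) -> str:
    """Pick has_part for molecular-level wholes, has_component for sample-level ones."""
    relation = "has_part" if whole.level is Level.MOLECULAR else "has_component"
    return f"{whole.name} {relation} {part_name}"


# The four examples from the slide; the level of each whole is an assumption.
examples = [
    (Term("Carbonyl compound molecule", Level.MOLECULAR), "carbonyl substituent"),
    (Term("Muonium atom", Level.MOLECULAR), "electron"),
    (Term("Kanamycin", Level.SAMPLE), "kanamycin A"),
    (Term("Lead diacetate", Level.SAMPLE), "lead(2+)"),
]
for whole, part in examples:
    print(relate(whole, part))
```

The sketch makes the cost of the approach visible: every term needs a level, and a case like lead diacetate, which the slide marks "(?!)", forces a decision that the name alone does not settle.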

28 Open questions. How do we represent the relationship between named entities and documents? How do we integrate ontologies and word-sense disambiguation? What is the best way of distinguishing molecules and samples?

29 Acknowledgements. University of Cambridge: Peter Corbett. OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo). www.projectprospect.org


