Chemical named entity recognition and literature mark-up. Colin Batchelor, Informatics Department, Royal Society of Chemistry
2 Overview. Project Prospect: what we find and how we find it. RDF: how should we be disseminating it? Next steps: basics for a chemical ontology.
9 Project Prospect: What do we find? Chemical compounds; chemical terms from the IUPAC Gold Book; gene products (function, process, location); nucleotide and polypeptide sequence terms; cell types.
10 Project Prospect: How do we find it? For compound names: ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007), ~20% PubChem, ~20% ChemDraw. For compound numbers: ~70% author ChemDraw, ~30% editors.
12 RDF in an RSS reader
13 RDF: how we do it now. We use the content module from RSS. But in what sense does an article contain pyridine or base pairs? We would much rather have proper RDF predicates, e.g. is_about, mentions.
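To make the wish for proper predicates concrete, here is a minimal sketch that writes article-to-compound statements as N-Triples lines using is_about and mentions. The namespace http://example.org/prospect#, the article IRI and the helper names are hypothetical and are not part of any RSC feed; the ChEBI identifiers are the ones for imidazole and imidazoles quoted later in this talk, written as OBO-style IRIs, which is also an assumption about how they would be referenced.

```python
# Hypothetical vocabulary namespace: is_about and mentions are the predicates
# wished for on the slide, not terms from an existing published schema.
EX = "http://example.org/prospect#"
# Assumed IRI pattern for ChEBI entities (OBO-style PURL).
CHEBI = "http://purl.obolibrary.org/obo/CHEBI_"


def triple(subject, predicate, obj):
    """Serialise one statement as an N-Triples line (all three terms are IRIs)."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def annotate(article_iri, mentions=(), is_about=()):
    """Build one triple per entity, keeping 'mentions' and 'is_about' distinct."""
    lines = [triple(article_iri, EX + "mentions", CHEBI + str(i)) for i in mentions]
    lines += [triple(article_iri, EX + "is_about", CHEBI + str(i)) for i in is_about]
    return lines


# Placeholder article IRI, for illustration only.
article = "http://example.org/articles/placeholder-article"
triples = annotate(article, mentions=[16069], is_about=[24780])
print("\n".join(triples))
```

The point of the design is only that the predicate carries the relationship, so a reader of the feed no longer has to guess what "contains" means.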
14 RDF: what it looks like now. [… title] [… blah] [… human-readable stuff] [… Dublin Core stuff …]
15 Basics for a chemical ontology: 1. Unambiguous representation of objects of chemical discourse. 2. Proper parthood relations.
16 Basics for a chemical ontology: 1. Objects of chemical discourse. Must be able to represent and clearly distinguish: compounds; classes of compound; parts of molecules; mixtures. Would be nice to have: disambiguation cues for the first three.
17 Imidazole
18 An imidazole
19 The imidazole side-chain/group/ring
20 Can ChEBI handle this? ☺ Imidazoles (!) (CHEBI:24780). ☺ Imidazole (CHEBI:16069). ☹ Imidazole ring: not yet. ☹ Imidazolyl group: not yet (but methyl, benzyl, etc.). … and there are no disambiguation cues.
21 Disambiguation. One Sense per Discourse (Gale et al. 1992) … this doesn't hold at all. One Sense per Collocation (Yarowsky 1993) … matches our intuitions.
22 Disambiguation: What a one-sense-per-collocation feature set might look like. CLASS: w(–1) = a, an, the, this; w(0) plural (bit of a cheat, as not a collocation). PART: w(–1) = bridging, terminal; w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more); w(+1)w(+2) = building block, protecting group, side chain.
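The feature set above can be sketched as a small rule-based tagger. This is an illustration under stated assumptions, not the method used in Project Prospect: the word lists are only the ones shown on the slide (the "and many more" entries are not filled in), the plural test is a crude suffix check, PART cues are assumed to win over CLASS cues when both fire, and COMPOUND is assumed as the default sense when no cue fires. The function names are hypothetical.

```python
# Cue words copied from the slide; the real lists are longer ("and many more").
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT2 = {("building", "block"), ("protecting", "group"), ("side", "chain")}


def is_plural(word):
    """Crude plural test (assumption): ends in a single 's'."""
    return word.endswith("s") and not word.endswith("ss")


def cues(tokens, i):
    """Return the set of sense cues that fire for the chemical name at tokens[i]."""
    low = [t.lower() for t in tokens]
    prev = low[i - 1] if i > 0 else ""
    nxt = low[i + 1] if i + 1 < len(low) else ""
    nxt2 = tuple(low[i + 1:i + 3])  # w(+1)w(+2); shorter than 2 near the end
    fired = set()
    if prev in PART_PREV or nxt in PART_NEXT or nxt2 in PART_NEXT2:
        fired.add("PART")
    if prev in CLASS_PREV or is_plural(low[i]):
        fired.add("CLASS")
    return fired


def guess_sense(tokens, i):
    """Assumed precedence: PART beats CLASS; no cue at all means COMPOUND."""
    fired = cues(tokens, i)
    if "PART" in fired:
        return "PART"
    if "CLASS" in fired:
        return "CLASS"
    return "COMPOUND"


print(guess_sense("an imidazole was added".split(), 1))
print(guess_sense("the imidazole side chain".split(), 1))
print(guess_sense("imidazole was added".split(), 0))
```

In a trained system the output of `cues` would more plausibly feed a classifier as features than drive a fixed precedence rule; the rule is used here only to keep the sketch short.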
23 Basics for a chemical ontology: 2. Parthood relations. Parthood in ChEBI means at least three things. Is necessarily chemically part of: carbonyl group part_of carbonyl compounds.
24 Basics for a chemical ontology: 2. Parthood relations. Is possibly chemically part of: lead(2+) part_of lead diacetate (most lead(2+) isn't); electron part_of muonium (!).
25 Basics for a chemical ontology: 2. Parthood relations. Is part of a mixture: kanamycin A part_of kanamycin.
26 Basics for a chemical ontology: 2. Parthood relations. Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y (Smith et al., Relations in biomedical ontologies, 2005). Carbonyl compound has_part carbonyl group; lead diacetate has_part lead(2+) (?!); muonium has_part electron; kanamycin has_part kanamycin A (?!).
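The all-some pattern in Solution 1 can be checked mechanically over instance data. The sketch below uses invented toy instances (the identifiers are hypothetical, and one lead(2+) ion is deliberately left outside any lead diacetate) to show why the direction of the relationship matters. It does not settle the doubts flagged with (?!) above.

```python
def all_some(instances_of_x, related, instances_of_y):
    """True if every instance of X is related to at least one instance of Y."""
    return all(
        any((x, y) in related for y in instances_of_y)
        for x in instances_of_x
    )


# Toy instance data, invented for illustration only.
lead_diacetate = {"pb_oac2_1", "pb_oac2_2"}
lead_ions = {"pb_a", "pb_b", "pb_c"}  # pb_c belongs to some other compound

# Instance-level has_part pairs (whole, part) and the inverse part_of pairs.
has_part = {("pb_oac2_1", "pb_a"), ("pb_oac2_2", "pb_b")}
part_of = {(part, whole) for (whole, part) in has_part}

# Every lead diacetate instance has some lead(2+) as a part ...
diacetate_has_lead = all_some(lead_diacetate, has_part, lead_ions)
# ... but not every lead(2+) is part of some lead diacetate.
lead_part_of_diacetate = all_some(lead_ions, part_of, lead_diacetate)

print(diacetate_has_lead, lead_part_of_diacetate)
```

Under this reading, "lead(2+) part_of lead diacetate" fails as a class-level statement for the reason given on the earlier slide: most lead(2+) isn't.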
27 Basics for a chemical ontology: 2. Parthood relations. Solution 2 (for discussion): distinguish molecular-level relationships from sample-level relationships. Carbonyl compound molecule has_part carbonyl substituent; muonium atom has_part electron; kanamycin has_component kanamycin A; lead diacetate has_component lead(2+) (?!).
28 Open questions. How do we represent the relationship between named entities and documents? How do we integrate ontologies and word-sense disambiguation? What is the best way of distinguishing molecules and samples?
29 Acknowledgements. University of Cambridge: Peter Corbett. OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo).