Chemical named entity recognition and literature mark-up Colin Batchelor Informatics Department Royal Society of Chemistry

Presentation transcript:

2 Overview
- Project Prospect: what we find and how we find it.
- RDF: how should we be disseminating it?
- Next steps: basics for a chemical ontology.

9 Project Prospect: What do we find?
- Chemical compounds
- Chemical terms from the IUPAC Gold Book
- Gene products: function, process, location
- Nucleotide and polypeptide sequence terms
- Cell types

10 Project Prospect: How do we find it?
For compound names:
- ~60% Oscar (Corbett and Murray-Rust 2006, Batchelor and Corbett 2007)
- ~20% PubChem
- ~20% ChemDraw
For compound numbers:
- ~70% author ChemDraw
- ~30% editors

12 RDF in an RSS reader

13 RDF: how we do it now
- Content module from RSS
- In what sense does an article contain pyridine or base pairs?
- We would much rather have proper RDF predicates, e.g. is_about, mentions.
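To make the contrast concrete, here is a minimal sketch of what article-level statements with explicit predicates could look like when serialised as N-Triples. Only the predicate names is_about and mentions come from the slide; the namespace, the article IRI and the entity IRIs are hypothetical placeholders, and the helper functions are illustrative rather than anything Project Prospect is known to use.

```python
# Hypothetical namespace: only the names "mentions" and "is_about" come from the slide.
EX = "http://example.org/prospect#"


def triple(subject, predicate, obj):
    """Format one N-Triples statement from three IRIs."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def annotate(article_iri, mentioned, about=()):
    """Return N-Triples lines linking an article to the entities it
    mentions and (optionally) the entities it is about."""
    lines = [triple(article_iri, EX + "mentions", e) for e in mentioned]
    lines += [triple(article_iri, EX + "is_about", e) for e in about]
    return lines


# Placeholder IRIs, invented for illustration only.
article = "http://example.org/articles/a1"
pyridine = "http://example.org/entities/pyridine"
base_pair = "http://example.org/entities/base_pair"

for line in annotate(article, [pyridine, base_pair], about=[pyridine]):
    print(line)
```

The point of the sketch is that "mentions" and "is_about" say different things about the same article, which a generic "contains" relation cannot express.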

14 RDF: what it looks like now [… title] [… blah] [… human-readable stuff [… dublin core stuff …]

15 Basics for a chemical ontology
1. Unambiguous representation of objects of chemical discourse
2. Proper parthood relations

16 Basics for a chemical ontology: 1. Objects of chemical discourse
Must be able to represent and clearly distinguish:
- Compounds
- Classes of compound
- Parts of molecules
- Mixtures
Would be nice to have:
- Disambiguation cues for the first three

17 Imidazole

18 An imidazole

19 The imidazole side-chain/group/ring

20 Can ChEBI handle this?
- ☺ Imidazoles (!) (CHEBI:24780)
- ☺ Imidazole (CHEBI:16069)
- ☹ Imidazole ring: not yet
- ☹ Imidazolyl group: not yet (but methyl, benzyl, etc.)
… and there are no disambiguation cues

21 Disambiguation
- One Sense per Discourse (Gale et al. 1992) … this doesn't hold at all
- One Sense per Collocation (Yarowsky 1993) … matches our intuitions

22 Disambiguation: What a one sense per collocation feature set might look like
CLASS:
- w(–1) = a, an, the, this
- w(0) plural (bit of a cheat, as not a collocation)
PART:
- w(–1) = bridging, terminal
- w(+1) = backbone, bridge, chain, core, dyad, fluorophore, fragment, framework (and many more)
- w(+1)w(+2) = building block, protecting group, side chain
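The feature set above can be sketched as a small rule-based classifier. The cue words are the ones listed on the slide; everything else is an assumption made for illustration: the function name, the COMPOUND fallback label, letting PART cues take precedence over CLASS cues (so that "the imidazole side chain" comes out as a part), and the crude "ends in s" test for plurals.

```python
# Cue words taken from the slide (the slide notes there are many more).
CLASS_PREV = {"a", "an", "the", "this"}
PART_PREV = {"bridging", "terminal"}
PART_NEXT = {"backbone", "bridge", "chain", "core", "dyad",
             "fluorophore", "fragment", "framework"}
PART_NEXT_BIGRAMS = {("building", "block"), ("protecting", "group"),
                     ("side", "chain")}


def classify(tokens, i):
    """Guess the sense of the chemical name at tokens[i].

    Returns 'PART', 'CLASS' or 'COMPOUND'. The COMPOUND default and the
    precedence of PART over CLASS are assumptions, not from the slide.
    """
    word = tokens[i].lower()
    prev = tokens[i - 1].lower() if i > 0 else ""
    nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    nxt2 = tokens[i + 2].lower() if i + 2 < len(tokens) else ""

    if prev in PART_PREV or nxt in PART_NEXT or (nxt, nxt2) in PART_NEXT_BIGRAMS:
        return "PART"
    # Naive plural test: a real system would use a proper morphological check.
    if prev in CLASS_PREV or word.endswith("s"):
        return "CLASS"
    return "COMPOUND"


print(classify("An imidazole was added".split(), 1))
print(classify("the imidazole side chain".split(), 1))
print(classify("Imidazole was added".split(), 0))
```

In practice such cues would more likely feed a statistical model as features than act as hard rules; the rule form just makes the collocation windows w(–1), w(0), w(+1) and w(+1)w(+2) easy to see.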

23 Basics for a chemical ontology: 2. Parthood relations
Parthood in ChEBI means at least three things:
Is necessarily chemically part of:
- Carbonyl group part_of carbonyl compounds

24 Basics for a chemical ontology: 2. Parthood relations
Is possibly chemically part of:
- Lead(2+) part_of lead diacetate (most lead(2+) isn't)
- Electron part_of muonium (!)

25 Basics for a chemical ontology: 2. Parthood relations
Is part of a mixture:
- Kanamycin A part_of kanamycin

26 Basics for a chemical ontology: 2. Parthood relations
Solution 1: define relationships according to the pattern: all instances of X have a relationship with some Y. (Smith et al., Relations in biomedical ontologies, 2005)
- Carbonyl compound has_part carbonyl group
- Lead diacetate has_part lead(2+) (?!)
- Muonium has_part electron
- Kanamycin has_part kanamycin A (?!)
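The all-some pattern can be checked mechanically on toy data. The sketch below is illustrative only: the instance identifiers and the function name are invented, and the data simply encode the slide's observation that most lead(2+) is not part of lead diacetate, whereas every lead diacetate does contain some lead(2+).

```python
def holds_all_some(instances_of_x, related, instances_of_y):
    """True if every instance of X is related to at least one instance of Y.

    `related` is a set of (x, y) pairs at the instance level.
    """
    return all(
        any((x, y) in related for y in instances_of_y)
        for x in instances_of_x
    )


# Toy instance data (invented identifiers).
lead_ions = {"pb_a", "pb_b"}          # pb_b sits in some other lead salt
lead_diacetates = {"pbac_1"}
part_of = {("pb_a", "pbac_1")}
has_part = {(whole, part) for part, whole in part_of}

# "lead(2+) part_of lead diacetate" fails the all-some reading ...
print(holds_all_some(lead_ions, part_of, lead_diacetates))
# ... while "lead diacetate has_part lead(2+)" satisfies it.
print(holds_all_some(lead_diacetates, has_part, lead_ions))
```

This is why Solution 1 reverses the direction of the assertions: the class-level statement is only made in the direction where the all-some reading holds.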

27 Basics for a chemical ontology: 2. Parthood relations
Solution 2 (for discussion): Distinguish molecular-level relationships from sample-level relationships
- Carbonyl compound molecule has_part carbonyl substituent
- Muonium atom has_part electron
- Kanamycin has_component kanamycin A
- Lead diacetate has_component lead(2+) (?!)
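One way to read Solution 2 is as a typing constraint on the subject of each relation. The sketch below is a rough illustration under that assumption: the Entity class, the "molecular"/"sample" level tags and the rule table are all hypothetical, and the slide itself marks the lead diacetate case as doubtful, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    level: str  # "molecular" or "sample" (hypothetical tags)


# Assumed rule: has_part is asserted of molecular-level entities,
# has_component of sample-level entities.
SUBJECT_LEVEL = {"has_part": "molecular", "has_component": "sample"}


def is_well_typed(subject, relation):
    """True if the relation is allowed for the subject's level."""
    return SUBJECT_LEVEL[relation] == subject.level


muonium_atom = Entity("muonium atom", "molecular")
kanamycin = Entity("kanamycin", "sample")

print(is_well_typed(muonium_atom, "has_part"))       # molecular-level parthood
print(is_well_typed(kanamycin, "has_component"))     # sample-level composition
print(is_well_typed(kanamycin, "has_part"))          # rejected under this rule
```

A check like this would flag "kanamycin has_part kanamycin A" from Solution 1 as a level mismatch, which is the distinction the slide proposes for discussion.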

28 Open questions
- How do we represent the relationship between named entities and documents?
- How do we integrate ontologies and word-sense disambiguation?
- What is the best way of distinguishing molecules and samples?

29 Acknowledgements
- University of Cambridge: Peter Corbett
- OBO Foundry: Chris Mungall (Berkeley), Barry Smith (Buffalo)
