ChEBI, text mining and ontological best practice Colin Batchelor Royal Society of Chemistry 2008-05-19

Presentation transcript:

ChEBI, text mining and ontological best practice Colin Batchelor Royal Society of Chemistry

2 What is text mining? Marti Hearst, Berkeley: “Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.” Can ChEBI help?

3 Overview
• Reasoning
• ChEBI as dictionary
• Regular polysemy in chemistry
• Some possible solutions

4 Reasoning

5 Reasoning is using the logical structure of an ontology to automatically infer facts about the world which have not been explicitly added by a human being. Computers have no real-world knowledge beyond what we tell them.

6 Logical structure: properties of relations
We only have time to look at transitivity and is_a.
Smith et al., “Relations in Biomedical Ontologies”, Genome Biol., 2005, 6, R46.

Relation  Transitive  Symmetric  Reflexive  Antisymmetric
is_a      Yes         No         Yes
part_of   Yes         No         Yes

7 ChEBI’s is_a is not transitive (1)
If a relation R is transitive, then: if a R b and b R c, then a R c.
• glutathione is_a cofactor
• cofactor is_a biological role
therefore glutathione is_a biological role

8 ChEBI’s is_a is not transitive (2)
• water is_a amphiprotic solvent
• amphiprotic solvent is_a protophilic solvent (*)
• protophilic solvent is_a Bronsted base (*)
• Bronsted base is_a base
• base is_a biological role
therefore water is_a base
therefore water is_a biological role
* How come “protophilic solvent” and “Bronsted base” only have one child each?

9 ChEBI’s is_a is not transitive (3)
• N-hydroxy-L-aspartic acid is_a hydroxamic acids
• hydroxamic acids is_a organic functional classes
therefore N-hydroxy-L-aspartic acid is_a organic functional classes
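
The three chains above can be reproduced mechanically. The following is a minimal sketch, assuming the is_a assertions are held as plain (child, parent) pairs copied from the slides; the function name `ancestors` is hypothetical and nothing here reads the real chebi.obo file. It shows what a reasoner concludes if it treats is_a as transitive.

```python
from collections import defaultdict

# is_a assertions quoted on slides 7 to 9, as (child, parent) pairs.
IS_A = [
    ("glutathione", "cofactor"),
    ("cofactor", "biological role"),
    ("water", "amphiprotic solvent"),
    ("amphiprotic solvent", "protophilic solvent"),
    ("protophilic solvent", "Bronsted base"),
    ("Bronsted base", "base"),
    ("base", "biological role"),
    ("N-hydroxy-L-aspartic acid", "hydroxamic acids"),
    ("hydroxamic acids", "organic functional classes"),
]


def ancestors(term, edges):
    """Return every term reachable from `term` by following is_a upwards."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    seen = set()
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    for term in ("glutathione", "water", "N-hydroxy-L-aspartic acid"):
        print(term, "is_a", sorted(ancestors(term, IS_A)))
```

Under these assumptions the closure for water includes both "base" and "biological role", which is exactly the kind of inference the slides object to.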

10 is_a has many meanings!
1. An amount of a compound has a biological role: tris is_a buffer.*
2. An amount of a compound has an application: sodium dodecyl sulfate is_a detergent.*
3. A less-abstract type is an example of a more abstract type: propane is_a alkanes.
4. ?!: metals is_a atoms.*
* Not a property of a lone atom or molecule!

11 Computers need facts about the world, not about ChEBI curation

12 ChEBI as dictionary

13 Evaluating name–structure conversion with ChEBI
ChEBI release 37 (26 September 2007) contains annotated entities, of which 8486 have InChI strings.
We use OSCAR3 (oscar3-chem.sourceforge.net) for name–structure conversion.
We convert chebi.obo to an XML file, each paragraph containing either a ChEBI name or an IUPAC name.
The layered structure of the InChI lets us give partial credit for incomplete matches.
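
Partial credit from InChI layers could be scored along these lines. This is a sketch under stated assumptions, not the evaluation code used for the talk: the function names, the level labels and the order in which the levels are tried are all hypothetical, and the example strings (alanine, with an illustrative fixed-hydrogen suffix) are for demonstration only. It relies on the InChI convention that layers are separated by "/" and identified by a one-letter prefix (b, t, m, s for stereo; f for fixed hydrogens).

```python
STEREO_PREFIXES = ("b", "t", "m", "s")  # double-bond, tetrahedral and stereo-type layers


def layers(inchi):
    """Split 'InChI=1/formula/c.../h...' into its slash-separated layers."""
    return inchi.split("=", 1)[1].split("/")


def without_fixed_h(parts):
    """Drop the fixed-hydrogen layer (/f) and everything that follows it."""
    kept = []
    for part in parts:
        if part.startswith("f"):
            break
        kept.append(part)
    return kept


def without_stereo(parts):
    """Drop stereo layers; the first two parts are the version and the formula."""
    return parts[:2] + [p for p in parts[2:] if not p.startswith(STEREO_PREFIXES)]


def match_level(expected, found):
    """Classify how closely two InChI strings agree (labels are hypothetical)."""
    a, b = layers(expected), layers(found)
    if a == b:
        return "exact"
    if without_stereo(a) == without_stereo(b):
        return "ignoring stereo"
    if without_fixed_h(a) == without_fixed_h(b):
        return "ignoring fixed-H"
    return "no match"


# Illustrative strings only.
ALA = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"
L_ALA = ALA + "/t2-/m0/s1"
ALA_FIXED_H = ALA + "/f/h5H"
ETHANOL = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

if __name__ == "__main__":
    print(match_level(L_ALA, ALA))
    print(match_level(ALA, ALA_FIXED_H))
    print(match_level(ALA, ETHANOL))
```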

14 Results: IUPAC names
Total                                                8447
Identified as chemical                               8255 (97.73%)
With InChI (upper bound)                             1810 (21.43%)
Matching InChI, disregarding fixed hydrogen layer    1734 (20.53%)
Matching InChI, disregarding stereo                  1176
Matching InChI, exact (lower bound)                  1174 (13.90%)
Not all of name matched                              1024
Name identified as two or more separate names         974 (11.53%)

15 Results: ChEBI names
Total                                                8146
Identified as chemical                               7173 (88.06%)
With InChI (upper bound)                             1036 (12.72%)
Matching InChI, disregarding fixed hydrogen layer     953 (11.70%)
Matching InChI, disregarding stereo                   637
Matching InChI, exact (lower bound)                   628 (7.71%)
Not all of name matched                               764
Name identified as two or more separate names         373 (4.58%)
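
The percentages on the two results slides are each count divided by that slide's total. A small sketch, using only counts quoted on the slides (the dictionary keys and function names are hypothetical), recomputes them so the two name sets can be compared side by side.

```python
# Counts copied from the two results slides; keys are shorthand labels.
COUNTS = {
    "IUPAC names": {"total": 8447, "identified": 8255, "with_inchi": 1810, "exact": 1174},
    "ChEBI names": {"total": 8146, "identified": 7173, "with_inchi": 1036, "exact": 628},
}


def percent(count, total):
    """Percentage of `total`, rounded to two decimal places."""
    return round(100 * count / total, 2)


def summary(counts):
    """Express every count except the total as a percentage of the total."""
    return {key: percent(value, counts["total"])
            for key, value in counts.items() if key != "total"}


if __name__ == "__main__":
    for label, counts in COUNTS.items():
        print(label, summary(counts))
```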

16 Regular polysemy

17 Regular polysemy
… where words stand for multiple things in a consistent way.
Examples:
• Brand names
• Grinding
• Figure–ground
• Exact–class–part polysemy in chemistry
Peter Corbett, Colin Batchelor and Ann Copestake (2008), “Pyridines, pyridine and pyridine rings”, Proc. BERBMTM08 at LREC 2008, Marrakech, Morocco.

18 Regular polysemy
Brand names: “Learning to buy a Renault and talk to BMW”
Grinding: “The squirrel scampered down the path and kept stopping and looking at the officers to check they were behind” vs. “[…] the trick was to serve squirrel fresh and not to leave it hanging like other game”

19 Regular polysemy
Figure–ground
• Audrey Hepburn painted the door (figure)
• Audrey Hepburn walked through the door (ground)
• The Incredible Hulk walked through the door (ambiguous)

20 Methyl, the radical (exact)

21 Methyl, the group (part)

22 Can ChEBI handle methyl?
methyl group (CHEBI:32875)      YES
methyl radical (CHEBI:29309)    YES

23 Imidazole (exact)

24 An imidazole (class)

25 imidazole side-chain/group/ring (part)

26 Can ChEBI handle imidazole?
imidazoles (CHEBI:24780)    YES
imidazole (CHEBI:16069)     YES
imidazole ring              not yet
imidazolyl group            not yet

27 Mapping exact, class and part to entries in ChEBI
Tests:
1. Has InChI: exact
2. Name is plural: class
3. Ends in –yl, “group” or “residue”: part
Test 2 doesn’t work for applications or roles. Test 3 is brittle.
I would much rather use the logical structure of the ontology.
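
The three tests can be sketched as a single function. This is an illustration only: the function name `classify` is hypothetical, the plural check is a naive "ends in s" rule, and the `has_inchi` flags in the usage lines are assumed for the example rather than looked up in ChEBI.

```python
PART_ENDINGS = ("yl", "group", "residue")


def classify(name, has_inchi):
    """Apply the three slide tests in order; return None if none fires."""
    lowered = name.lower().strip()
    if has_inchi:                        # Test 1: has InChI -> exact
        return "exact"
    if lowered.endswith("s"):            # Test 2: name is plural (naive) -> class
        return "class"
    if lowered.endswith(PART_ENDINGS):   # Test 3: -yl, "group" or "residue" -> part
        return "part"
    return None


if __name__ == "__main__":
    # has_inchi values below are assumptions made for illustration.
    for name, has_inchi in [("imidazole", True), ("imidazoles", False),
                            ("imidazolyl group", False), ("imidazole ring", False)]:
        print(name, "->", classify(name, has_inchi))
```

The last example shows how brittle the surface tests are: "imidazole ring" is a part, but none of the three tests recognises it.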

28 Some possible solutions

29 Some possible solutions (1)
• ChEBI must represent facts about the world rather than about itself.
Examples:
• If unclassified compounds have a structure, they should be in the molecular structure tree rather than the unclassifieds tree.
• “organic functional classes” is a tool for assigning nomenclature. No chemical compound is an “organic functional class”.

30 Some possible solutions (2)
ChEBI must distinguish between what is always true and what is only sometimes true.
Example:
• Replace some is_a relationships with has_biological_role and has_application.
We need ChEBI to represent parts of molecules that aren’t substituents. They should all be descendants of molecular part (a new term), as should amino acid residues and nucleoside residues.
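
A sketch of what the proposed split buys a reasoner, assuming relations are stored as (subject, relation, object) triples. The relation names has_biological_role and has_application come from the slide; the data layout and the function names are hypothetical. Only is_a is followed transitively, so roles are no longer inherited as if they were types.

```python
# Triples illustrating the proposal, using examples from the slides.
EDGES = [
    ("glutathione", "has_biological_role", "cofactor"),
    ("cofactor", "is_a", "biological role"),
    ("sodium dodecyl sulfate", "has_application", "detergent"),
    ("propane", "is_a", "alkanes"),
]


def is_a_ancestors(term, edges):
    """Follow only is_a edges upwards from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for subject, relation, obj in edges:
            if subject == current and relation == "is_a" and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


def biological_roles(term, edges):
    """Roles asserted for `term`, plus the is_a ancestors of those roles."""
    direct = {obj for subject, relation, obj in edges
              if subject == term and relation == "has_biological_role"}
    roles = set(direct)
    for role in direct:
        roles |= is_a_ancestors(role, edges)
    return roles


if __name__ == "__main__":
    print(is_a_ancestors("glutathione", EDGES))    # no longer typed as a role
    print(biological_roles("glutathione", EDGES))  # still retrievable as roles
```

With this layout, glutathione keeps its link to "cofactor" but is not inferred to be a biological role itself.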

31 Questions?