Semantic Web Data Repositories from Chemistry e-Thesis Data Mining. Alan Tonge, SPECTRa-T Project. Open Repositories 2008, Southampton University, 2 April 2008.

Presentation transcript:


Project Overview. SPECTRa-T: Submission, Preservation and Exposure of Chemistry Teaching and Research Data – in Theses. A 12-month project between the University of Cambridge and Imperial College London to develop text- and data-mining tools to extract chemical data from e-theses. Part of the JISC Digital Repositories programme.

Background. Chemistry is an experimental science. Synthetic organic chemistry is the basis of the pharmaceutical and agrochemical industries. Where does the information to make this molecule come from? Systematic name: Ethyl 4,5-epoxy-hex-2-enolate. Molecular formula: C8H12O3.

Search: chemical patent and journal abstracting services, e.g. Chemical Abstracts (9000+ journals, 12,000 structures/day); Beilstein (180 core journals); patents (CAS, Derwent, MDL; 400,000 per annum). Academic chemistry publications are largely derived from PhD theses (perhaps ~10K published per year worldwide). Synthetic theses contain preparations, but only 20% are published in detail.

List of starting materials and reagents. Recipe: reaction conditions and work-up. Product characterization: spectroscopic and physical properties.

Sample preparation from synthetic chemistry thesis

The Problem: ~80% of (academic) synthetic preparations remain locked in theses, and manual abstraction (cf. journals/patents) is not an option. The Solution: OSCAR3, automatic high-throughput chemical name and chemical term recognition. OSCAR (Open Source Chemistry Analysis Routines) is an extensible open source framework which can identify much of the chemical terminology in electronic articles. Semantic Web: deposit the extracted terms in a searchable RDF triplestore.

OSCAR Name recognition: 1. Dictionary of chemical names/terms (ChEBI Ontology) 2. Rules; chemical suffix filters 3. Regular expressions to recognise: data, formulae
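The three recognition stages listed above can be illustrated with a minimal sketch. This is not OSCAR3's implementation: the tiny dictionary, the suffix list, the formula pattern and the function name `recognise` are all assumptions made for illustration, and OSCAR3 itself draws on the ChEBI ontology rather than a hand-written set.

```python
import re

# Hypothetical mini-dictionary standing in for a real term list such as ChEBI.
DICTIONARY = {"chloroform", "benzene"}

# Illustrative suffix filter: endings that often mark a chemical name.
SUFFIXES = ("ane", "ene", "yne", "ol", "one", "ate", "ide")

# Illustrative pattern for a molecular formula: two or more element symbols,
# each optionally followed by a count, e.g. C8H12O3.
FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*){2,}")


def recognise(text):
    """Label each whitespace-separated token by the first stage that matches."""
    hits = []
    for raw in text.split():
        token = raw.strip(".;:")
        if not token:
            continue
        if token.lower() in DICTIONARY:
            hits.append((token, "dictionary"))
        elif FORMULA.fullmatch(token) and any(ch.isdigit() for ch in token):
            hits.append((token, "formula"))
        elif len(token) > 4 and token.lower().endswith(SUFFIXES):
            hits.append((token, "suffix"))
    return hits


sentence = "The product C8H12O3 was extracted with chloroform and dried over toluene."
print(recognise(sentence))
```

A real recogniser has to handle multi-word names, punctuation inside names and many false positives, which is why a suffix filter on its own is only a rough first pass.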

Input: PDF (legacy format). PDF is the de facto format for electronic document deposition in digital repositories. PDF is a page description format, optimized for human, not machine, readability. Problems: irregular word order; line breaks (loss of continuous text; paragraphs difficult to identify); loss of subscripts and superscripts; non-printing characters; erroneous character assignment with OCR.

Workflow: PDF -> UTF-8 text -> OSCAR3 -> SAF XML -> XSLT -> RDF statements. OSCAR used 'as is' on PDF e-theses gives 5000 terms per thesis (80% duplicates) and cannot identify chemical objects (spectra assignments; properties). Programmatic modifications to: remove linebreaks from extended chemical names; remove text fragments derived from figures and tables; correct whitespace in chemical names.
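As a rough sketch of the kind of clean-up described here, the following shows linebreak removal, whitespace correction and de-duplication of extracted terms. The function names and the regular expressions are assumptions for illustration, not the project's actual code, and the rules are deliberately simplistic.

```python
import re


def rejoin_broken_names(text):
    """Join a chemical name that was broken after a hyphen at a line end."""
    return re.sub(r"-\s*\n\s*", "-", text)


def normalise_whitespace(name):
    """Collapse whitespace runs and drop spaces around hyphens, commas and brackets."""
    name = re.sub(r"\s+", " ", name).strip()
    return re.sub(r"\s*([\-,\(\)\[\]])\s*", r"\1", name)


def unique_terms(terms):
    """Keep the first occurrence of each term, comparing case-insensitively."""
    seen = set()
    result = []
    for term in terms:
        cleaned = normalise_whitespace(term)
        key = cleaned.lower()
        if key not in seen:
            seen.add(key)
            result.append(cleaned)
    return result


print(rejoin_broken_names("Ethyl 4,5-epoxy-hex-\n2-enolate"))
print(unique_terms(["CDCl3", "cdcl3", "1-Benzyloxy- but-3-yne", "1-Benzyloxy-but-3-yne"]))
```

With roughly 80% of the extracted terms reported as duplicates, a de-duplication step of this sort would shrink the term list considerably before anything is written to the triplestore.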

Input: MS Office Open XML ('docx'). No information loss from the student's deposited thesis (written with MS software). Identification of experimental sections is no longer a problem, so chemical objects (COs) can be extracted and converted into Chemical Markup Language (CML). Workflow: DocX -> extract chemical terms (OSCAR3) -> RDF statements; DocX -> extract chemical objects -> CML data files -> data repository URI; the RDF statements and the repository URIs are linked together.
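To show why docx keeps paragraph structure that PDF loses, here is a minimal sketch that reads paragraphs straight from the XML inside a .docx package using only the Python standard library. The helper names (`docx_paragraphs`, `section_after`, `demo_docx`) and the "heading equals Experimental" heuristic are assumptions for illustration; the project's own section detection is not described in the slides.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each non-empty paragraph in a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter("{%s}p" % W_NS):
        text = "".join(node.text or "" for node in para.iter("{%s}t" % W_NS))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def section_after(paragraphs, heading):
    """Hypothetical heuristic: everything after the paragraph equal to `heading`."""
    for index, text in enumerate(paragraphs):
        if text.strip().lower() == heading.lower():
            return paragraphs[index + 1:]
    return []


def demo_docx():
    """Build an in-memory archive holding only word/document.xml (not a full Word file)."""
    body = (
        '<w:document xmlns:w="%s"><w:body>'
        "<w:p><w:r><w:t>Introduction</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Background text.</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Experimental</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Compound 1 </w:t></w:r><w:r><w:t>was prepared.</w:t></w:r></w:p>"
        "<w:p></w:p>"
        "</w:body></w:document>"
    ) % W_NS
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


paragraphs = docx_paragraphs(demo_docx())
print(section_after(paragraphs, "Experimental"))
```

Because each paragraph arrives as a discrete element, the text of a preparation stays continuous, which is the property that PDF extraction loses.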

Sample preparation from synthetic chemistry thesis

CML infra-red assignments (film). CML C-13 NMR assignments.

RDF (Resource Description Framework) is a component of the Semantic Web. It is based upon the idea of making statements about resources/data in the form of a subject-predicate-object (or resource-property-value) expression, called a triple, e.g. My_thesis has_chemical_entity 2,4-dinitrobenzene. The value of one property can in turn be used as the resource for another.
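The triple structure, and the way one statement's value becomes the next statement's resource, can be sketched with plain Python tuples. This is only an illustration of the data model, not an RDF library: the `Triple` type, the `follow` helper and the `has_molecular_formula` predicate are assumptions added here, while the first statement is the example given above.

```python
from collections import namedtuple

# A statement in subject-predicate-object form.
Triple = namedtuple("Triple", "subject predicate object")

triples = [
    # The example statement from the slide.
    Triple("My_thesis", "has_chemical_entity", "2,4-dinitrobenzene"),
    # Hypothetical second statement: the object above is reused as a subject.
    Triple("2,4-dinitrobenzene", "has_molecular_formula", "C6H4N2O4"),
]


def follow(resource, statements):
    """Collect statements reachable from `resource` (assumes the data has no cycles)."""
    found = []
    for statement in statements:
        if statement.subject == resource:
            found.append(statement)
            found.extend(follow(statement.object, statements))
    return found


for statement in follow("My_thesis", triples):
    print(statement.subject, statement.predicate, statement.object)
```

Chaining statements this way is what lets a query start at a thesis and arrive at the properties of a compound it mentions.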

RDF TRIPLESTORE ENTRY
<rdf:RDF xmlns:dc=" xmlns:dcrdf=" xmlns:rdf=" xmlns:spectra-t="
- CDCl3 ClC([2H])(Cl)Cl InChI=1/CHCl3/c2-1(3)4/h1H/i1D
- 1-Benzyloxy-but-3-yne C#CCCOCC1=CC=CC=C1 InChI=1/C11H12O/c /h1,4-8H,3,9-10H
(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one

SPARQL QUERY
PREFIX st:
PREFIX dcrdf:
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical.
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author.
  ?thesis st:hasChemicalName ?annot.
  ?annot st:chemicalName ?chemical.
  ?annot st:hasHNMRSpectrum ?hnmr.
  FILTER regex(?chemical, ".*bicyclo.*").
}

RESULT
5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene N.R.Champness
5-Acetyl-bicyclo[4.2.1]nona-4,7-diene N.R.Champness
5-Phenyl-bicyclo[4.2.1]nona-3,7-diene N.R.Champness
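The matching logic of the query above can be imitated in plain Python to make the joins explicit. This is a stand-in for a SPARQL engine, not a use of one: the identifiers `thesis:1`, `annot:N` and `hnmr:N` are invented, and the triples are made-up sample data that merely reuse compound names and the author shown on the slide.

```python
import re

# Made-up sample data; identifiers such as thesis:1 and annot:1 are hypothetical.
TRIPLES = [
    ("thesis:1", "dcrdf:creator", "N.R.Champness"),
    ("thesis:1", "st:hasChemicalName", "annot:1"),
    ("annot:1", "st:chemicalName", "5-Acetyl-bicyclo[4.2.1]nona-4,7-diene"),
    ("annot:1", "st:hasHNMRSpectrum", "hnmr:1"),
    ("thesis:1", "st:hasChemicalName", "annot:2"),
    ("annot:2", "st:chemicalName", "1-Benzyloxy-but-3-yne"),
    ("annot:2", "st:hasHNMRSpectrum", "hnmr:2"),
    ("thesis:1", "st:hasChemicalName", "annot:3"),
    ("annot:3", "st:chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
]


def objects(subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def bicyclo_with_hnmr():
    """Pairs of (chemical name, author) matching the WHERE pattern and the regex filter."""
    results = []
    for thesis, predicate, author in TRIPLES:
        if predicate != "dcrdf:creator":
            continue
        for annot in objects(thesis, "st:hasChemicalName"):
            if not objects(annot, "st:hasHNMRSpectrum"):
                continue  # the pattern requires an H NMR spectrum
            for chemical in objects(annot, "st:chemicalName"):
                if re.search("bicyclo", chemical):
                    results.append((chemical, author))
    return results


print(bicyclo_with_hnmr())
```

In this sample only the first annotation survives: the second has a spectrum but no "bicyclo" in its name, and the third has the name but no spectrum.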

Caveats (proof of concept): single subject area (synthetic organic chemistry); single institution; docx (limited variation in document structure); limited thesis availability. Solutions: domain ontology development; make your e-theses public! Message to repository managers: PDF is a limited format for data extraction from e-theses; docx allows chemical data object extraction (~80% precision/recall).

Acknowledgements. Project Director: Peter Morgan (UL Cambridge). Chemistry leads: Henry Rzepa, Peter Murray-Rust. Developers: Jim Downing, Diana Stewart, Joe Townsend, Matt Harvey. Project Manager: Alan Tonge.

SPECTRa Tools Workshop, Autumn 2008, Unilever Centre, Cambridge, UK. Contact: Peter Murray-Rust, Peter Morgan.