Semantic Chemical Publishing Nick Day*, Peter Corbett, Peter Murray-Rust Unilever Centre for Molecular Informatics, University of Cambridge, UK. March.

Slides:



Advertisements
Similar presentations
Chemical named entity recognition and literature mark-up Colin Batchelor Informatics Department Royal Society of Chemistry
Advertisements

Visualisation of chemical data Brian McMahon Research & Development Officer International Union of Crystallography 5 Abbey Square Chester CH1 2HU
1ETD 2008_Morgan_The SPECTRa-T Project Extracting and re-using research data from chemistry e-theses: the SPECTRa-T project.
S.J. Coles a*, M.B. Hursthouse a, R.A. Stephenson a, P. Cliff b, E. Lyon b, M. Patel b J. Downing c & P. Murray-Rust.
© S.J. Coles 2006 Usability WS, NeSC Jan 06 Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing.
S.J. Coles a*, J.G. Frey a, M.B. Hursthouse a, L. Carr b & C.J. Gutteridge b. a School of Chemistry, University of Southampton, UK.; b School of Electronics.
© S.J. Coles 2006 Digital Repositories as a Mechanism for the Capture, Management and Dissemination of Chemical Data Simon Coles School of Chemistry, University.
RCUK, Octiber Archiving research data and research publications. Dr Leslie Carr, Intelligence, Agents Multimedia, University of Southampton Dr Simon.
Data Curation in Crystallography: Publisher Perspectives JISC Data Cluster Consultation Workshop CCLRC, Didcot, Oxon 10 October 2006.
Federation eCrystals Federation: Open Repositories for Data-driven Science Dr Liz Lyon, UKOLN, University of Bath, UK Dr Simon Coles, University of Southampton,
CCPN project modeling framework University of Cambridge European Bioinformatics Institute MSD group.
Publisher perspective eBank/R4L/SPECTRa Joint Consultation Workshop London Metropole Hotel 20 October 2006.
The SPECTRa Project : A wider chemistry picture Alan Tonge & Jim Downing A Digital Repository for the Chemical Community.
© S.J. Coles 2006 Institutional Data Repositories for Chemistry Simon Coles School of Chemistry, University of Southampton, U.K.
EBankII Workshop 1 Making Scientific Data Openly Available Simon Coles School of Chemistry, University of Southampton.
The CLARION Project for the Infrastructure for Integration in Structural Sciences (I2S2) mtg, Rutherford Labs, 11 th February 2010 CLARION – Chemical Laboratory.
Accessing the data: going beyond what the author wanted to tell you Brian McMahon International Union of Crystallography 5 Abbey Square, Chester CH1 2HU,
Open 24 hours a day 7 days a week The Library. Your subject librarian…. Linda Humphreys x5248 Library 4.04.
Open Access; Open Data I590 Spring Budapest Open Access Initiative Based on: –Self archiving by authors –Open Access journals, e.g., BioMed Central.
Changing methods of data sharing in crystallography Professor John R Helliwell Imperial College, June 28th, 2006 The University of Manchester
Data activities of the International Union of Crystallography Brian McMahon IUCr 5 Abbey Square Chester CH1 2HU
28 October 2005Jeremy Frey, University of Southampton1 “The CombeChem Experience” CICC Workshop 28 October 2005 Bloomington Indiana.
Community Grids Lab CICC Activities Geoffrey Fox, Marlon Pierce Indiana University.
Recent developments 1) Tests (outlier analysis) and Bug fixing ( with Paul) 2) Regeneration of Values of Bonds and Bond-angles existing all structures.
Christoph Steinbeck Cologne University Bioinformatics Center (CUBIC) Folie 1 16:39:56 Reviving Analytical Data of the Past with Open Submission Databases.
Open Notebook Science using Blogs and Wikis Jean-Claude Bradley E-Learning Coordinator College of Arts and Sciences Drexel University March 27, 2007 American.
Open Source/Open Notebook Science: Doing science with blogs and wikis Jan 20, 2007 Drexel University Chemistry Department.
University of Southampton, U.K.
Click to edit Master subtitle style JISC XYZ Project Principal Investigator: Peter Murray-Rust Project Team: Nick England, Brian Brooks Unilever Centre,
EPrints Workshop, January eBank UK: Dissemination of research data using EPrints Simon Coles, School of Chemistry, University of Southampton.
Capturing Chemistry in XML/CML J. A. Townsend *, S. E. Adams *, J. M. Goodman *, P. Murray-Rust *, C. A. Waudby * Capturing Chemistry in XML/CML ACS March.
Crystallographic Data Publication at Source International Union of Crystallography Peter R. Strickland and Brian McMahon IUCr 5 Abbey Square Chester CH1.
Alan Tonge Semantic Web Data Repositories from Chemistry e-Thesis Data Mining Open Repositories 2008 Southampton University 2 April 2008 SPECTRa-T Project.
William Y. Arms Corporation for National Research Initiatives March 22, 1999 Object models, overlay journals, and virtual collections.
Service update Elin Stangeland Repository Manager.
Computational Chemistry Robots ACS Sep 2005 Computational Chemistry Robots ACS Sep 2005 Computational Chemistry Robots J. A. Townsend, P. Murray-Rust,
Ontologies in Physical Science Onto Workshop, ed.ac.uk An #animalgarden production Peter Murray-Rust, University of Cambridge & Open Knowledge.
Carl Lagoze, Cornell University Prasenjit Mitra, William Brouwer (Penn State University) Mark Borkum (University of Southampton)
Moving forward our shared data agenda: a view from the publishing industry ICSTI, March 2012.
Approaches for extraction and “digital chromatography” of chemical data: A perspective from the RSC.
Information Integration Intelligence with TopBraid Suite SemTech, San Jose, Holger Knublauch
1 The BT Digital Library A case study in intelligent content management Paul Warren
Information Need Question Understanding Selecting Sources Information Retrieval and Extraction Answer Determina tion Answer Presentation This work is supported.
Programming the Web Web = Computer Network + Hypertext.
Royal Society of Chemistry activities to develop a data repository for chemistry-specific data Aileen Day, Alexey Pshenichnov, Ken Karapetyan, Colin Batchelor,
EBank UK: linking scientific data, scholarly communication and learning Michael Day and Rachel Heery UKOLN, University of Bath
Flexible Text Mining using Interactive Information Extraction David Milward
PLoS ONE Application Journal Publishing System (JPS) First application built on Topaz application framework Web 2.0 –Uses a template engine to display.
Chemistry Add-in for Word OR 10 Joe Townsend University of Cambridge
RSC eBook Collection April 2007 RSC eBook Collection Over 700 Books c. 8,000 chapters c. 250,000 pages 10,000 items - tables.
Markup in Atomic and Molecular Simulations: Implementation & Issues Jon Wakelin Dept. Earth Sciences University of Cambridge.
1 Bridging the gap between the paper past and digital future.
CBSOR,Indian Statistical Institute 30th March 07, ISI,Kokata 1 Digital Repository support for Consortium Dr. Devika P. Madalli Documentation Research &
Scientific Applications of XML Arvind Hulgeri, Shantanu Godbole
Open Archive Initiative – Protocol for metadata Harvesting (OAI-PMH) Surinder Kumar Technical Director NIC, New Delhi
Introduction to the Semantic Web and Linked Data Module 1 - Unit 2 The Semantic Web and Linked Data Concepts 1-1 Library of Congress BIBFRAME Pilot Training.
Introduction to the Semantic Web and Linked Data
Vendor Session: ChemSpider, from Royal Society of Chemistry.
ChEBI, text mining and ontological best practice Colin Batchelor Royal Society of Chemistry
1 Web 2.0 and Grids for Scholarly Research Peking University July Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories.
Taming the Big Data in Computational Chemistry #euroCRIS2015 Barcelona 9-11-XI-2015 Carles Bo ICIQ (BIST) -
CombeDay Making Data Openly Available Simon Coles.
CS276B Text Information Retrieval, Mining, and Exploitation Practical 1 Jan 14, 2003.
General & Background InformationPractical & Useful DataDetailed, Original Research Encyclopedias Dictionaries Reference Texts Books Safety Information.
Chemoinformatics and eScience
Introduction an Open Source, Open Data international collaboration, based entirely in the internet started following a CECAM meeting in Zaragoza:
Open Notebook Science And the Library
‘The eCrystals Federation’ Management and Publication of Small Molecule Structure Data for the Whole Crystallographic Community S.J. Colesa*, J.G. Freya,
Extracting Recipes from Chemical Academic Papers
Developing Institutional Data Repositories
Presentation transcript:

Semantic Chemical Publishing Nick Day*, Peter Corbett, Peter Murray-Rust Unilever Centre for Molecular Informatics, University of Cambridge, UK. March 27 th, 2007 All software Open Source *

Overview What is ‘semantic chemistry’ and markup? OSCAR3 – robotic analysis of chemistry in free text recognition of chemical names name-2-structure chemical verbs, adjectives and reaction names terminologies (e.g. techniques) RSC Project Prospect … CrystalEye – creating semantic chemistry from crystallography: High-throughput robotic harvesting Re-use using CIF2CML Dissemination through CMLRSS

The Semantic Web “People keep asking what Web 3.0 is. I think maybe when you've got an overlay of scalable vector graphics […] on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource.” - Tim Berners-Lee, A 'more revolutionary' Web (2006) Tim Berners-LeeA 'more revolutionary' Web

…Let’s change the vision to chemistry…

The Chemical Semantic Web “…when you've got an overlay of [CML, InChI and chemical ontologies - everything well-defined and marked up] - on Web 2.0 and access to a semantic Web integrated across a huge space of data, you'll have access to an unbelievable data resource.” [our adaptations] … what are chemical semantics and CML?… …..

Implicit and explicit semantics Implicit semantics “Compound 2a melted at 119 o C” humans are good at interpreting this; machines see just a string. Explicit semantics <cml:scalar dictRef=“prop:mpt” units=“units:celsius” dataType=“xsd:float” >119 4 namespaces, 3 dictionaries propertyDictionary unitsDictionary W3CSchema CML Schema Molecules in CML/InChI

UCC’s approach to creating Semantic Chemistry Authoring tools for theses and collaboration with publishers XML-ization (through FoX) of Comp. Chem. codes (MOPAC, CASTEP, SIESTA, GULP, ABINIT, DL_POLY GAMESS…) Capturing/conversion of CML data at source (SPECTRa) Rich clients (Bioclipse) Legacy Conversion (OpenBabel, CDK, JUMBO…) Intelligent Ontologies (Golem) (today we will cover the following…) Chemical Linguistics and text-mining (OSCAR3) Legacy Conversion – CIF Many of these semantic chemical components are now deployed or prototyped…

Chemical semantic framework at UCC prototyped Under development legacy CIF2CML OSCAR3 CrystalEye Golem SPECTRa JUMBO FoX OSCAR1 Units

“OSCAR” OSCAR1 + CheckCML (2003, 2004, 2005, 2006) Student projects supported by RSC SciBorg ( )EPSRC project (Computer Lab, Chemistry, Cambridge) OSCAR3 (Peter Corbett) Recognition of chemical entities. Name2structure, chemical diagrams, canonical identifiers Chemical heuristics to parse article full-text Links to ontologies and molecular databases. Open source High-throughput – 500, 000 PubMed abstracts parsed Substructure and similarity search on corpora

OSCAR3 Concepts - Example All markup is automatic

…how can this be used for publishing?... UCC and RSC have been collaborating on transferring this technology to journal articles… Project Prospect (2007) adds semantics…

Project Prospect (RSC) Typical HTML paperSemantics confined to hyperlinks Prospect adds more semantics…

… Prospect markup includes: CML (Chemical Markup Language) InChI IUPAC Gold Book Gene Ontology …and more coming … …the marked-up semantic paper…

Project Prospect RSC … but not all chemistry is in free text …

GaussianCIF Chemistry is also “Data” … some examples of data taken from theses…

Semantic Chemistry and the Datument The document is only part of the scientific record We can transform the experimental data to CML. The integrated result is a datument… Experimental Text / HTML Compounds and Analytical data Crystallography Reactions Comp chem output Measurements and properties Supplemental and Other data + P. Murray-Rust, The complete chemical E-publication, 216th ACS National Meeting, Boston, August (1998), CINF-033. Peter Murray-Rust, Henry S. Rzepa and Michael Wright, Development of Chemical Markup Language (CML) as a System for Handling Complex Chemical Content, New J. Chem., 2001, P. Murray-Rust and H. S. Rzepa, "The Next Big Thing: From Hypermedia to Datuments", J. Digital Inf., 2004, 5, article 248, article 248, XML Datument

The Datument Experimental CMLCore, InChI CMLReact CMLSpect AnIML CMLCryst CMLComp CMLTable graphs O O O O OSCAR3 CrystalEye

Chemical Crystallography Universally published as CIF. Complete output of structure experiment, standard supplementary data for article full-text. > 10, 000 CIFs published online per year > 30, 000 unpublished per year (e.g. theses)

CrystalEye The aim: To automatically create semantic chemistry from crystallography (CIFs) published on the Web.

CIF CrystalEye CIF CML Aggregation Web spider checks publishers and repositories every day. Currently over 60,000 validated CIF files. TOC Article CIF IUCr, RSC publishers OAI-PMH Thesis Record CIF Cambridge, Imperial repositories CML Dept. Crystallographic Service CIF Browse/Search Interface SPECTRa SPECTRa-T

Marking up and Validation CIFDOI InChI SMILES CDKjni-InChI CML * CheckCIF CheckCIF HTML CheckCIF XML Raw crystal structure CML CIFDOM Legacy2CML JUMBO Validated, Disorder resolved, Unique molecules, Bond orders and charges. 2D image Stereochemistry added

CML * JUMBO Ring nucleiMetal centres CrystalEye: Re-Use through XML/CML Automatic generation of fragments ca. 1 million fragments with 50,000 different chemical types Open Access via automatically generated HTML Here are examples of a ring-nucleus and a metal centre… Ligands Metal Clusters Linkers Cu centre Cu(OH 2 )(C 2 HO 2 Cl 2 ) 2 (C 6 H 6 N 2 O) 2

CrystalEye for humans Generated content FragmentsEntriesAnnotations CrystalEye Data sourcesRepository Publisher CIF Services Substruct. searchData searchBrowse Browsing links back to publications HTML

CrystalEye Webpage Demo Let’s assume we’re interested in Cu-N bonds: Browse to title page View structure data Explore fragments Inspect bond lengths

CMLRSS newsfeeds How can the chemist find every Cu-N bond immediately? CrystalEye uses RSS 1, RSS 2 and Atom 1.0 to create both RSS and CMLRSS feeds. Web browsingUsing RSS = contains Cu-N = no Cu-N What we’ve just been doing CrystalEye Structure: TOCCu-N feed *read* RSS Reader …

CML-RSS feeds Feeds for: Journal …Acta Cryst. E, Dalton compound class …organometallic atoms in structures …Os, Ir, Pt bonds in structures …Zn-N, Cu-N Thousands of feeds, but the robot does the work. Client-side software can filter and reorganize the feeds For a chemist who specialises in Cu-N chemistry, CrystalEye can alert them EVERY DAY to new examples. RSS CMLRSS

CrystalEye: KnowledgeBase not DataBase Aggregation by robots, not humans All types of chemistry (organic, inorganic, etc…) Social computing - aggregates COD*, theses Software validation by robots, not humans, Open and free; Goes live in April… * Crystallographic Open Database

Chemical semantic framework at UCC prototyped Under development CIF2CML FoX GOLEM SPECTRa CrystalEye JUMBO OSCAR3 legacy

Cambridge Semantic Chemistry Released: CML – XML for chemistry JUMBO – library for CML OSCAR1/CheckCML – data validation OSCAR3 – text mining FoX – Fortran XML Soon: CrystalEye – crystallographic knowledgebase SPECTRa – chemical repositories Later… Golem - ontologies CMLUnits

Acknowledgements CML - Henry Rzepa OSCAR1 – Sam Adams, Joe Townsend, Fraser Norton, Justin Davies, Richard Marsh, Jonathan Goodman OSCAR 3 – Ann Copestake, SimoneTeufel (Computer Lab) SPECTRa – Jim Downing, Alan Tonge, Peter Morgan Software - CDK, Jmol, jni-InChI and many Blue Obelisk contributions Timo Hannay (NPG), Richard Kidd (RSC), Colin Batchelor (RSC), Brian McMahon (IUCr)

Thankyou.