Data enhancing the Royal Society of Chemistry publication archive Antony Williams, Colin Batchelor, Peter Corbett, Ken Karapetyan and Valery Tkachenko.

Presentation transcript:

Data enhancing the Royal Society of Chemistry publication archive Antony Williams, Colin Batchelor, Peter Corbett, Ken Karapetyan and Valery Tkachenko ACS Dallas March 2014

Data Enhancing the RSC Archive: Publications summarise data acquisition, analysis and conclusions. Much detail is in the data. Improved navigation includes data access. Reanalysis of data is limited in PDFs.

Text Mining: The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6, thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel equipped with a mechanical stirrer, thermometer and reflux condenser. The reaction mixture was heated at reflux with stirring, for a period of about one-half hour. After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea as a solid residue.

How is DERA going? TEXT: We have text-mined all 21st century articles… >100k articles from Mostly marked up with XML, more structured, easier to handle. Markup mostly published onto the HTML forms of the articles. Required multiple iterations based on dictionaries, markup, OSCAR extraction. New visualization approaches in development.
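A dictionary pass of the kind mentioned on this slide might look roughly like the sketch below. It is an illustration only, not the RSC pipeline and not OSCAR: the function name `tag_chemicals` and the three-entry dictionary are assumptions made for the example.

```python
import re
from typing import Iterable, List, Tuple


def tag_chemicals(text: str, dictionary: Iterable[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for every dictionary name found in text."""
    # Try longer names first so a long name wins over a shorter name it starts with.
    names = sorted(set(dictionary), key=len, reverse=True)
    if not names:
        return []
    alternatives = "|".join(re.escape(name) for name in names)
    # Lookarounds reject matches that start or end inside a longer word.
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE)
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


# Hypothetical dictionary; a real one would hold many thousands of names.
sentence = "thionyl chloride (5 ml) and benzene (50 ml) were charged into a glass reaction vessel"
hits = tag_chemicals(sentence, ["benzene", "chloride", "thionyl chloride"])
print(hits)
```

The character offsets are kept so that matches could later be written back as markup around the original text. A plain dictionary lookup like this misses systematic names that are not in the list, which is one reason a name-recognition tool is used alongside it.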

Chemical Validation and Standardization

The RSC Data Repository

Text-Mining

ChemSpider Reactions

Reactions: We will put reactions from our databases into the Reactions Repository. We will use “Reaction Validation” procedures to clean up Daniel Lowe’s USPTO patent set of over a million extracted reactions. We will move ChemSpider SyntheticPages content to the Reactions Repository. We will use the RXNO Ontology to classify the reactions.

Reaction Deposition/Validation

ESI – Text Spectra

Lots of “Textual Spectra”

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)
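To show what extracting a "textual spectrum" involves, here is a minimal parsing sketch for a 1H NMR string in the format above. The regular expressions, the `Peak` class and the function name `parse_1h_nmr` are assumptions made for illustration; they are not the extraction algorithms used in the project, and real reports vary far more in layout than this handles.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Peak:
    shift_ppm: float          # chemical shift; midpoint when a range is reported
    multiplicity: str         # e.g. "d", "t", "dd", "m"
    n_hydrogens: int          # integration, e.g. 4 for "4H"
    j_hz: Optional[float]     # first coupling constant found, if any


# Start of one signal, e.g. "4.24 (d, 1H" or "7.18–7.94 (m, 11H".
HEAD = re.compile(r"(\d+\.\d+)(?:\s*[–-]\s*(\d+\.\d+))?\s*\(\s*([A-Za-z]+)\s*,\s*(\d+)\s*H")
# Coupling constant, e.g. "J = 4.8 Hz" or "Jb = 10.8 Hz".
COUPLING = re.compile(r"J\w*\s*=\s*(\d+(?:\.\d+)?)\s*Hz")


def parse_1h_nmr(text: str) -> List[Peak]:
    heads = list(HEAD.finditer(text))
    peaks = []
    for i, m in enumerate(heads):
        # Everything up to the start of the next signal belongs to this one.
        tail_end = heads[i + 1].start() if i + 1 < len(heads) else len(text)
        tail = text[m.end():tail_end]
        low = float(m.group(1))
        high = float(m.group(2)) if m.group(2) else low
        j = COUPLING.search(tail)
        peaks.append(Peak((low + high) / 2, m.group(3), int(m.group(4)),
                          float(j.group(1)) if j else None))
    return peaks


spectrum = ("1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), "
            "4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), "
            "4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), "
            "6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)")
peaks = parse_1h_nmr(spectrum)
for peak in peaks:
    print(peak)
```

On this string the sketch finds seven signals integrating to 21 hydrogens in total. Assignments such as "C(11b)H" are ignored here, although a fuller parser would keep them.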

13C NMR (CDCl3, 100 MHz): δ = (CH3), (CH, benzylic methane), (CH, benzylic methane), (CH2), (CH2), , , , , , , , , , (ArCH), 99.42, , , , , , , , (ArC)

How is DERA going? Text Spectra: Overall progress is good. Improved algorithms for extraction of spectra. Extraction of associated compound name with spectrum – name to structure conversion now MestreLabs have provided us with batch conversion tool. Work in progress – manual and automated validation. In theory auto-assignment also.

Visualization of Spectra: For spectra associated with compounds we would like to view “interactive spectra”.

JavaScript viewer with Jmol

Figure Spectra into “Real Spectra”? We are turning text into structures. We are turning text into spectra. And we are turning figures into spectra.
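One simple way to picture "turning text into spectra" is to draw a line shape at each reported shift. The sketch below sums Lorentzian lines on a ppm grid. It is a rough illustration under stated assumptions: the function names, the fixed line width and the choice to scale peak height by hydrogen count are all invented for the example, multiplet splitting is ignored, and this is not the batch conversion tool mentioned in the talk.

```python
from typing import List, Sequence, Tuple


def lorentzian(x: float, centre: float, hwhm: float) -> float:
    """Lorentzian line with height 1 at its centre and half-width hwhm at half maximum."""
    return hwhm ** 2 / ((x - centre) ** 2 + hwhm ** 2)


def simulate_spectrum(
    peaks: Sequence[Tuple[float, float]],   # (shift in ppm, relative intensity)
    start: float = 0.0,
    stop: float = 10.0,
    step: float = 0.01,
    hwhm: float = 0.02,                      # assumed line width, in ppm
) -> Tuple[List[float], List[float]]:
    n_points = int(round((stop - start) / step)) + 1
    xs = [start + i * step for i in range(n_points)]
    ys = [sum(height * lorentzian(x, shift, hwhm) for shift, height in peaks) for x in xs]
    return xs, ys


# Two signals taken from a textual spectrum: shift and number of hydrogens.
xs, ys = simulate_spectrum([(2.57, 4), (4.24, 1)])
tallest = xs[ys.index(max(ys))]
print(f"{len(xs)} points, tallest line near {tallest:.2f} ppm")
```

The resulting x/y arrays are the kind of data an interactive viewer could plot. NMR spectra are conventionally displayed with the ppm axis decreasing from left to right, so a plotting step would reverse the x axis.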

Turn “Figures” Into Data (slides showing a FIGURE alongside the EXTRACTED DATA)

How is DERA going? Figures: Validation tests performed with William Brouwer. Good enough to proceed with larger test set. Ready to run process across larger collection. Focus on 21st century articles only for now.

Early Test Experiments
Input: 74 supplementary data documents / 3444 pages
Output: p2t extracted content in 1069 page instances
578 molecules: ~10% false positives (e.g., classifies Bruker logo as chemical object); ~20% false negatives (e.g., missing some symbols from structure)
1151 spectra: >80% of peaks extracted to within 1-2 decimal places (ppm)
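A figure such as ">80% of peaks extracted" implies comparing extracted peak positions with reference values. The sketch below shows one way such a score could be computed. The greedy nearest-peak matching, the 0.05 ppm tolerance and the function name are assumptions for illustration; the slides do not say how the reported percentage was actually measured, and the peak lists are invented.

```python
from typing import List, Sequence


def fraction_matched(extracted: Sequence[float], reference: Sequence[float],
                     tol: float = 0.05) -> float:
    """Fraction of reference peaks that have an unused extracted peak within tol ppm."""
    if not reference:
        return 0.0
    remaining: List[float] = sorted(extracted)
    matched = 0
    for ref in sorted(reference):
        if not remaining:
            break
        # Greedy choice: the closest extracted peak that has not been used yet.
        best = min(remaining, key=lambda x: abs(x - ref))
        if abs(best - ref) <= tol:
            matched += 1
            remaining.remove(best)
    return matched / len(reference)


# Hypothetical numbers: five reference peaks, four extracted from a figure.
reference_peaks = [2.57, 4.24, 4.35, 4.47, 6.95]
extracted_peaks = [2.56, 4.25, 4.36, 6.80]
score = fraction_matched(extracted_peaks, reference_peaks)
print(f"{score:.0%} of reference peaks recovered")
```

Greedy matching is simple but can pair peaks sub-optimally in crowded regions, so an assignment-based matcher would be a safer choice for dense spectra.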

Validating Spectra: How will we check data consistency? How do we know the structure and the spectra match? Comparing image to spectrum is NOT enough!!! Predict spectra, use spectral verification, use algorithmic checking. Flag “dodgy data” and use crowdsourcing for data checking. MULTIPLE prediction technologies now available – VERIFICATION is tougher.
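As one small example of "algorithmic checking", the total 1H integration in a textual spectrum can be compared with the hydrogen count of the molecular formula. The sketch below is an assumed, deliberately simple check and not the validation procedure used in the project: the function names are invented, and the formula parser handles only plain formulae without brackets, charges or isotopes.

```python
import re
from typing import Iterable


def hydrogen_count(formula: str) -> int:
    """Count H atoms in a simple formula such as "C2H6O"."""
    total = 0
    # Each token is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if element == "H":
            total += int(count) if count else 1
    return total


def integration_matches(integrals: Iterable[int], formula: str, tolerance: int = 0) -> bool:
    """True when the summed integrals agree with the formula's H count within tolerance."""
    return abs(sum(integrals) - hydrogen_count(formula)) <= tolerance


# Ethanol, C2H6O: CH3 (3H), CH2 (2H) and OH (1H).
print(integration_matches([3, 2, 1], "C2H6O"))               # all protons reported
print(integration_matches([3, 2], "C2H6O"))                  # OH signal not reported
print(integration_matches([3, 2], "C2H6O", tolerance=1))     # allow one missing proton
```

A mismatch is only a flag for review, since exchangeable protons such as OH or NH are often absent from reported spectra. That is why the sketch takes a tolerance, and why flagged records would still need the manual or crowdsourced checking described above.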

What are we extracting? Compounds from compound names. Reactions from the text. Spectral extraction – from figures and text. Extraction of data from “tables” – not only CSV files but literal tables in the publication – specifically data from MedChemComm as proof of concept.

Building out the technology: We are presently open-sourcing a chemical registration system developed for OpenPHACTS. We will then open-source the Chemical Validation and Standardization Platform. We are working with Bob Hanson and Bob Lancashire on Jmol/JSpecView Open Source. We will deliver a set of Open Source widgets for structure handling/visualization.

JavaScript viewer: NMR, MS, IR

Grand Target: Fingers crossed to get 21st century spectra converted. Spectra associated with compounds will go into ChemSpider. Spectra converted from Figures but without compound association will be captured with Figures into the Data Repository. Focus on IR, Raman, UV-Vis & 1D NMR.

DERA is FINE for an archive. The WRONG WAY otherwise! We should NOT be mining data out of future publications. Structures should be submitted “correctly”. Spectra should be in digital spectral formats, not images. ESI should be RICH and interactive. Data should be open, available, with metadata and provenance.

We can solve for Authors here. Will it be used though???

Advanced ESI

Conclusions: Great progress in mining the archive, and 21st century articles are being enhanced on the publishing platform iteratively. Spectral Data is the next focus – directly connected to our work on the data repository. Reaction extraction, processing and validation from articles is progressing more slowly. Results are content, software components and Open Source contributions.

Acknowledgments: Bill Brouwer (Plot2Txt development); Carlos Cobas and Santi Dominguez; Bob Hanson and Bob Lancashire (Jmol/JSpecView JavaScript version); Leah McEwan and Will Dichtel; ACD/Labs (provider of spectroscopy tools).

Thank you ORCID: Twitter: Personal Blog: SLIDES: