from scientific literature Principal Scientist (Chemoinformatics)

Slides:



Advertisements
Similar presentations
EBooks and Audiobooks. This class will give you an overview of eBooks and electronic Audiobooks available from the Library. We will also explain the basic.
Advertisements

Don’t Type it! OCR it! How to use an online OCR..
© S.J. Coles 2006 Usability WS, NeSC Jan 06 Enabling the reusability of scientific data: Experiences with designing an open access infrastructure for sharing.
S.J. Coles a*, J.G. Frey a, M.B. Hursthouse a, L. Carr b & C.J. Gutteridge b. a School of Chemistry, University of Southampton, UK.; b School of Electronics.
Delivering textual resources. Overview Getting the text ready – decisions & costs Structures for delivery Full text Marked-up Image and text Indexed How.
DOCUMENT TYPES. Digital Documents Converting documents to an electronic format will preserve those documents, but how would such a process be organized?
Version 6.1. Old Vs New Input Format - PDF Meta-Scan & Structured Folder Output Features & Benefits Examples – Other Verticals Content.
How the University Library can help you with your term paper
CAPTURE SOFTWARE Please take a few moments to review the following slides. Please take a few moments to review the following slides. The filing of documents.
CAPTURE SOFTWARE Please take a few moments to review the following slides. Please take a few moments to review the following slides. The filing of documents.
The Library behind the scene How does it work ? The Library behind the scenes 1 JINR / CERN Grid and advanced information systems 2012 Anne Gentil-Beccot.
Extraction of text data and hyperlink structure from scanned images of mathematical journals Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
Providing Online Access to the HKUST University Archives: EAD to INNOPAC Sintra Tsang and K.T. Lam The Hong Kong University of Science and Technology 7th.
XP Browser and Basics1. XP Browser and Basics2 Learn about Web browser software and Web pages The Web is a collection of files that reside.
Tunable Machine Vision-Based Strategy for Automated Annotation of Chemical Database ChemReader Jungkap Park, Gus R. Rosania, and Kazuhiro Saitou University.
Data Input How do I transfer the paper map data and attribute data to a format that is usable by the GIS software? Data input involves both locational.
1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.
Browser and Basics Tutorial 1. Learn about Web browser software and Web pages The Web is a collection of files that reside on computers, called.
Reducing Costs and Expanding XML Submissions with PDF to JATS Conversion by Keishi KATOH ( 加藤圭志 ) DIGITAL COMMUNICATIONS Co Ltd.
Developing Health Geographic Information Systems (HGIS) for Khorasan Province in Iran (Technical Report) S.H. Sanaei-Nejad, (MSc, PhD) Ferdowsi University.
Moving forward our shared data agenda: a view from the publishing industry ICSTI, March 2012.
Aniko T. Valko, Keymodule Ltd.
Automatic Subject Classification and Topic Specific Search Engines -- Research at KnowLib Anders Ardö and Koraljka Golub DELOS Workshop, Lund, 23 June.
Collaborative Approach to Open Access: Experience from Bioline International Leslie Chan Associate Director Bioline International University of Toronto.
Planning a digital library How to Build a Digital Library Ian H. Witten and David Bainbridge.
Living in a Digital World Discovering Computers Fundamentals, 2010 Edition.
Lakeland Click arrow to advance show. Click on the “A” under “Listed By Name.” (“A” for Academic Search Database)
European Patent Office PCT Minimum Documentation EPO views on a new definition Gérard Giroud, Principal Director PD Tools European Patent Office WIPO,Geneva.
Data input 1: - Online data sources -Map scanning and digitizing GIS 4103 Spring 06 Adina Racoviteanu.
PAN-European Exploitation of the Results of the Libraries Programme - EXPLOIT German Libraries Institute Berlin EXPLOIT 1 Electronic Access, Document Ordering.
Section 8.1 Create a custom theme Design a color scheme Use shared borders Section 8.2 Identify types of graphics Identify and compare graphic formats.
For web 2.0.  Digital media files that is made available for download via web syndication.  It is a way to receive audio/video files over the internet.
Scientific Applications of XML Arvind Hulgeri, Shantanu Godbole
COLLECTING Software. Why use Software with Hardware? Software used for collecting includes the software that interfaces with hardware collection device.
Using EBSCOhost databases Access via MyAthens Click on the EBSCOhost link.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
1 CHBE 594 Lect 16 Scifinder Scholar. 2 Objective Strengths and weaknesses of scifinder scholar How to use it to look up molecules or reactions.
Taming the Big Data in Computational Chemistry #euroCRIS2015 Barcelona 9-11-XI-2015 Carles Bo ICIQ (BIST) -
AHM04: Sep 2004 Nottingham CCLRC e-Science Centre eMinerals: Environment from the Molecular Level Managing simulation data Lisa Blanshard e- Science Data.
By: Kem Forbs Advanced Google Search. Tips and Tricks Keywords: adding additional terms or keywords can redefine your search and make the most relevant.
Automation Living in a Paper Oriented World and The Steps to Automation.
Text2PTO: Modernizing Patent Application Filing A Proposal for Submitting Text Applications to the USPTO.
A SCRIPT FOR ARCHIVING DIGITAL RESEARCH DATA IMPROVING ACCURACY AND EFFICIENCY IN THE DATAVERSE NETWORK ABSTRACT SUMMARY Rachel Carriere, Thu-Mai Christian,
1 ACCESSING THE PURDUE LIBRARY DATABASES AND ONLINE JOURNALS September 14, 2006.
A Bibliographic Management Software NORSHUHADA SAIDIN REFERENCE & RESEARCH DIVISION PERPUSTAKAAN KEJURUTERAAN UNIVERSITI SAINS MALAYSIA.
Section 8.1 Section 8.2 Create a custom theme Design a color scheme
Input & Output Devices ASHIMA KALRA.
WV DOT Scanning Project
2.01 Understand Digital Raster Graphics
2.01 Understand Digital Raster Graphics
7th Annual Hong Kong Innovative Users Group Meeting
The CompTox Chemistry Dashboard: an informational data hub at the
PLOS Facilitating Text & Data Mining The Role the Publisher Can Play
S.Rajeswari Head , Scientific Information Resource Division
2.01 Understand Digital Raster Graphics
Software for scientific calculations
Getting the Most out of the PDBe
CICC Combines Grid Computing with Chemical Informatics
2.01 Understand Digital Raster Graphics
DIGITAL LIBRARY.
Meet the speakers: Sergey Adonin
Aniko T. Valko, Keymodule Ltd.
Introduction of KNS55 Platform
DATABASES By: Hanna Ben-Or Phone:
Computational Nanotechnology
2.01 Understand Digital Raster Graphics
Presentation By: Amanda Lekovski & Erin Vassarotti
Research Institute for Future Media Computing
Ann Arbor, March 19, 2002 Masakazu Suzuki (Kyushu University)
Ton Spek Utrecht University The Netherlands Vienna –ECM
Presentation transcript:

from scientific literature Principal Scientist (Chemoinformatics) ChemEngine An automated chemical data harvesting tool for molecular inventory and chemical computing from scientific literature M. Karthikeyan Ph.D Principal Scientist (Chemoinformatics) CSIR-National Chemical Laboratory Pune, India

Open Source in Chemoinformatics ChemXtreme (JCIM 2005) GoogleAPI (MSDS) ChemStar (JCIM2008) (Pubchem 18m) ChemRobot (2011 * Denver) (Img2Str) ChemInfoCloud (2014 * San Fransisco) ChemScreener (CCHTS 2015) Virtual Screening ChemEngine (2016 * San Diego) Poster CINF Why Open Source?

ChemEngine Digital access to chemical journals resulted in a vast array of molecular information that is now available in the supplementary material files in PDF format. However, extracting this molecular information, generally from a PDF document format is a daunting task. Here we present an approach to harvest 3D molecular data from the supporting information of scientific research articles that are normally available from publisher’s resources. In order to demonstrate the feasibility of extracting truly computable molecules from PDF file formats in a fast and efficient manner, we have developed an application, ChemEngine. This program recognizes textual patterns from the supplementary data and generates standard molecular structure data (bond matrix, atomic coordinates) that can be subjected to a multitude of computational processes automatically.

ChemEngine The methodology has been demonstrated via three case studies on different formats of coordinates data stored in supplementary information files, wherein ChemEngine selectively harvested the atomic coordinates and interpreted them as molecules with high accuracy. The reusability of extracted molecular coordinate data was demonstrated by computing Single Point Energies (SPEs) that were in close agreement with the original computed data provided with the articles. It is envisaged that the methodology will enable large scale conversion of molecular information from supplementary files available in PDF format into a collection of ready- to- compute molecular data to create an automated workflow for advanced computational processes. Software Source code Project file is available for downloading at the Sourceforge site. FREE !

Re-usable Chemical Structures When chemical structures are stored in truly computable format with atoms and bond matrices (vector format-Cartesian co-ordinates), they can be processed electronically for computational and informatics purposes. However while transforming/storing the files in PDF (Printable/Portable Document/Data Format) that are usually used for the convenience of printing and reading, the valuable and re-usable molecular data is totally lost and buried in scientific literature as documents and seldom used for further computational studies.

Image to Structure Transforming the raster images into vector graphics followed by identification of relevant pixel information associated with atoms and bonds of a molecule is a cumbersome job. Tools have also been developed to harvest molecular data from images using web camera, scanned images wherein the raster graphics data was transformed into vector graphics to eventually retrieve the atoms and bonds information for the generation of truly computable and re-usable chemical structures. Several attempts have been made to convert molecular images into truly computable format such as ChemRobot, OSRA, ChemReader, CLiDE, but only partial success has been achieved.

ChemEngine FREE ! https://sourceforge.net/projects/chemengine/files/?source=navbar

Chem Engine We have developed an application, ChemEngine that reads all the files stored in the pdf format to generate computable molecular structures. To demonstrate the efficiency of the program, supporting material data files of three different formats available freely from the publishers were selected and data was parsed using PDF readers to extract the textual information.

The Challenge Supporting materials of molecular data files also include brief description of molecules, computed data, plots, page numbers, document information, manuscript bibliographic details etc. as a single document in pdf format that makes harvesting the molecular data extremely difficult as these have to be excluded while parsing the PDF file.

General Workflow

General Workflow Input Data Format Pattern Recognition Regular Expression Generate Coordinates

Pattern Recognition Pseudo code: (Co-ordinate Text).matches(“Regular Expression Pattern with Delimiter Definition”); For Example: Delimiter: Comma String_Data.matches("^[A-Za-z0-9]{1,2}\\,[0]{0,1}[\\,]{0,1}-{0,1}.{1,2}[0-9]{1,10}\\,-{0,1}.{1,2}[0-9]{1,10}.{1,}") Delimiter: Space String_Data.matches("^[A-Za-z0-9]{1,2}\\s+[0]{0,1}[\\s+]{0,1}-{0,1}.{1,2}[0-9]{0,10}\\s+-{0,1}.{1,2}[0-9]{0,10}.{1,}")

Harvesting Molecular Data Bond matrix data (Computed)  Coordinates Data (From PDF) (To be Processed) Non- Molecular Data (To be Excluded) 0:---- Mol_0 1 C1 H2 1.09011 1.200 -0.109889 Mol_0 2 C1 H3 1.08923 1.200 -0.110769 Mol_0 3 C1 H4 1.08923 1.200 -0.110769 Mol_0 4 C1 S5 1.831868 1.55 0.281868 Mol_0 5 S5 H6 1.34383 1.410 -0.066162    C -0.04781100 1.16216400 0.00000000 H -1.09556300 1.46309200 0.00000000 H 0.43082600 1.55738100 0.89506100 H 0.43082600 1.55738100 -0.89506100 S -0.04781100 -0.66970400 0.00000000 H 1.28575000 -0.83557700 0.00000000    S1 SUPPORTING INFORMATION Thiol-Ene Click Chemistry: Computational and Kinetic Analysis of the Influence of Alkene Functionality. Brian H. Northrop* and Roderick N. Coffey Department of Chemistry Wesleyan University, Middletown, Connecticut 06459.

AS X Y Z AN X Y Z AS X Y Z AS Delimiter (Space) Delimiter (Comma) AN H -2.1028 0.8352 -0.4693 H -0.7923 1.2464 -1.7030 6 -1.587291 -1.111365 4.067000 6 -0.971341 0.146218 4.067000 6 0.409526 0.261943 4.067000 7 1.580845 0.365064 4.067000 1 -2.671280 -1.185650 4.067000 1 -1.001643 -2.026790 4.067000 C,0,0.2760228632,-1.0699067694,1.2573191608 C,0,0.1782250457,1.2145223184,1.2478315394 C,0,0.178225046,1.2145223184,-1.2478315394 C,0,0.2760228634,-1.0699067694,-1.2573191608 C,0,-1.15402211,-0.680015634,-1.5218932206 C,0,-1.2153634407,0.7116626186,-1.5239653892 AS Delimiter (Space) Delimiter (Comma) AN Delimiter (Space) AS ChemEngine Pattern Recognition AN X Y Z C -1.1744 0.5417 -0.9605 S -0.0501 -0.0786 0.1566 C 1.0931 -1.1415 -0.7978 C 1.1877 1.1781 0.7204 H -2.1028 0.8352 -0.4693 H -0.7923 1.2464 -1.7030 Coordinate matrix C1 S2 1.7019797266712668 1.55 0.1519797266712668 C1 H5 1.0905715244769594 1.2000000000000002 -0.1094284755230408 C1 H6 1.0926613153214495 1.2000000000000002 -0.10733868467855068 Optional CT 3D MOLECULE

Bond Recognition (a) A2 A1 (b) (c) Bond Recognition r1 r2 l1 r1’ r2’ 0.35 Å r'1 r‘2 (a) (b) (c)

Compare the Bond distance ChemEngine Gaussian 09

Computed Conformational Energy

Case Studies Entry Case Study N= molecules Regular Expression pattern Format & Delimit er 1 Epoxide formation from sulfur ylides and aldehydes 29 ^[A-Za-z0-9]{1,2}\\s+- {0,1}.{1,2}[0-9]{1,8}\\s+- {0,1}.{1,2}[0-9]{1,8}.{1,} PDF Space 2 Thiol ene click chemistry 115 Text 3 Design of tetra(arenediyl)bis(allyl) derivatives for cope rearrangement transition states 55 ^[A-Za-z0- 9]{1,2}\\,[0]{0,1}[\\,]{0,1}- {0,1}.{1,2}[0-9]{1,10}\\,- {0,1}.{1,2}[0-9]{1,10}.{1,} Comma

ChemEngine

Harvesting Molecular Data

Graphical User Interface

Acknowledgement The Director, CSIR-National Chemical Laboratory, Pune Department of Science and Technology for International Travel Grant ACS-CINF Division SourceForge

Thank You