Chemical Entity extraction using the chemicalize.org-technology Josef Scheiber Novartis Pharma AG – NITAS/TMS.



Where the story of this project started... Dreirosenbrücke Novartis Campus A day in October 2008 Some time around 7:45 in the morning...

Vision for text mining: integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
[Figure: connection table]

Mining for Chemical Knowledge - Rationale
Information on compounds targeting GPCRs
HELP
Information explosion
Source: Banville, Debra L. Mining chemical structural information from the drug literature. Drug Discovery Today, Number 1/2, Jan. 2006, p. 35-42

Example: Project Prospect – Royal Society of Chemistry
Enhancing journal articles with chemical features
This helps you identify other articles that discuss the same molecule

Mining for Chemical Knowledge – Focus for today
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
[Figure: connection table]

A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )
Vardenafil (2003, Bayer) – 1.24 billion (USD 1.6 billion)
Sildenafil (1998, Pfizer) – 11.7 billion (USD 15.1 billion)
Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database

Conventional Database Building

Facts – current standard
... (ACS) owes most of its wealth to its two 'information services' divisions: the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million (82% of the society's revenue) and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily...
Source: ACS homepage

Facts
- Established application
- Straightforward use
- De facto gold standard
- Unique data source
- Very costly
- No structure export for a reasonable price
- Very limited in large-scale follow-up analysis
- Most recent patents not available

Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now – What would be the perfect solution?
All patent offices require applicants to provide all claimed structures in a machine-readable version, available for one-click download

Text extraction
Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in a machine-readable format
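The three steps in this definition (find names, convert to structures, emit machine-readable output) can be sketched in a few lines. This is only a toy illustration, not the chemicalize.org method: the lexicon `NAME_TO_SMILES` and the function `extract_structures` are hypothetical names, and a simple dictionary lookup stands in for a real name-to-structure converter.

```python
import json
import re

# Hypothetical toy lexicon; a real system would call a name-to-structure converter.
NAME_TO_SMILES = {
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
    "ethanol": "CCO",
}


def extract_structures(text):
    """Return one record per distinct lexicon name found in text, in order of first mention."""
    # Try longer names first so multi-word names are preferred at a given position.
    names = sorted(NAME_TO_SMILES, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    records, seen = [], set()
    for match in pattern.finditer(text):
        name = match.group(1).lower()
        if name not in seen:
            seen.add(name)
            records.append(
                {"name": name, "smiles": NAME_TO_SMILES[name], "offset": match.start()}
            )
    return records


sample = (
    "The residue was dissolved in ethanol, washed with acetic acid and dried. "
    "Ethanol was removed."
)
# JSON serves here as the machine-readable export format.
print(json.dumps(extract_structures(sample), indent=2))
```

Each record keeps the character offset of the first mention, so a structure can be traced back to the place in the text where it was found.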

Mining for Chemical Knowledge
Technologies from providers

Text entity recognition
(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- Chemaxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator
(b) Converters (names to connection table)
- CambridgeSoft name=struct
- Openeye Lexichem
- Chemaxon

Image recognition
- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader

The objective
To give NIBR scientists a tool that offers sophisticated text analysis methods and thereby leverages the methods of TMS

Mining for Chemical Knowledge – Novartis Tools – the chemicalize-technology is working under the hood! Clipboard Analysis Patent text Identified structures View structure onMouseOver Export to other applications

Mining for Knowledge – Novartis Tools Input example: J Med Chem Paper

Mining for Chemical Knowledge – Use Case
A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project
- Identification of core scaffold
- Analysis of substitution patterns
This enables the identification of the compounds most representative of a competitor patent

Example – A text-based patent
Automated text extraction: 452 compounds
Reference: 636 compounds
71%
A patent example

Example – An image-based patent
Text extraction is not suitable for this case: it finds only a meager 40 molecules, against 1129 in the reference. Why?
An entirely image-based patent example
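The two examples can be compared with a simple ratio of extracted to reference compounds. The sketch below assumes that every extracted compound also appears in the reference set, which the slides do not state; `extraction_rate` is a hypothetical helper name.

```python
def extraction_rate(found, reference):
    """Fraction of reference compounds recovered, assuming all found compounds are in the reference."""
    return found / reference


text_patent = extraction_rate(452, 636)    # text-based patent example
image_patent = extraction_rate(40, 1129)   # entirely image-based patent example

print(f"text-based patent:  {text_patent:.1%}")   # prints 71.1%
print(f"image-based patent: {image_patent:.1%}")  # prints 3.5%
```

The gap between roughly 71% and under 4% is what motivates combining text extraction with image recognition for image-based patents.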

Language issues – e.g. Japanese patents

Encountered problems
- OCR (Optical Character Recognition)!! USPTO and WIPO patents are now available as full text in most cases
- Typos!
- Name2Struct problems (less of an issue here)
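One common way to soften OCR errors and typos is to match each extracted token against a vocabulary of known names before name-to-structure conversion. The sketch below uses Python's standard `difflib` for this; it is an illustration of the general idea, not the approach used in the Novartis tool, and `KNOWN_NAMES`, `correct_name` and the 0.8 cutoff are assumptions.

```python
import difflib

# Hypothetical vocabulary of known chemical names.
KNOWN_NAMES = ["acetone", "benzene", "pyridine", "toluene"]


def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None if nothing is close enough."""
    matches = difflib.get_close_matches(token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(correct_name("benzcne"))  # OCR confusion of 'e' and 'c'
print(correct_name("acet0ne"))  # OCR confusion of 'o' and '0'
print(correct_name("xyzzy"))    # no plausible match
```

A higher cutoff reduces wrong corrections at the cost of leaving more damaged names unresolved; for systematic names, where a single character changes the structure, a conservative setting is the safer choice.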

IBM initiative
Patent Mining / ChemVerse database (Steve Boyer)
- The objective is to automatically extract all molecules from all available patents and make them searchable in a database
- They leverage cloud computing and have access to all full-text patents
- They annotate the molecules with information from freely available databases
- This is going in absolutely the right direction

Future ideas: Patent Analysis
- Markush translation, Image+Target
- Ranking capabilities of outcome for user
- Blurred dictionaries for translating terms like aryl, cycloalkyl etc.
- Select, annotate as entity
- On-the-fly error correction
- Result goes in a database
- Crowdsourcing efforts to improve and store results
- Suggest functionality

To enable true Patinformatics analyses... Definition by Tony Trippe:

Acknowledgements
Alex Fromm, Katia Vella, Olivier Kreim, Therese Vachon, Daniel Cronenberger, Pierre Parisot, Martin Romacker, Nicolas Grandjean, NITAS/TMS, Clayton Springer, Naeem Yusuff, Bharat Lagu
And many other people in different divisions of NIBR for their support