Published by Forrest Mather. Modified over 10 years ago.
1
Chemical entity extraction using the chemicalize.org technology. Josef Scheiber, Novartis Pharma AG – NITAS/TMS
2
Where the story of this project started... Dreirosenbrücke, Novartis Campus. A day in October 2008, some time around 7:45 in the morning...
3
Vision for text mining: integration of chemical and biological knowledge
4
Mining for Chemical Knowledge - Rationale
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
[Figure: connection table]
5
Mining for Chemical Knowledge - Rationale: Information explosion. [Figure: information on compounds targeting GPCRs, labelled "HELP"] Source: Banville, Debra L. Mining chemical structural information from the drug literature. Drug Discovery Today, Number 1/2, Jan. 2006, p. 35-42
6
Example: Project Prospect – Royal Society of Chemistry. Enhancing journal articles with chemical features. This helps you identify other articles that discuss the same molecule.
7
Mining for Chemical Knowledge – Focus for today
- Make text corpora searchable for chemistry
- Generate chemistry databases for use in research, based on scientific papers or patents
- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications
- Patent analysis for MedChem projects
[Figure: connection table]
8
A use case for successful patent mining (molecules you sometimes find in your inbox ;-) ): Vardenafil (2003, Bayer) – 1.24 billion (USD 1.6 billion); Sildenafil (1998, Pfizer) – 11.7 billion (USD 15.1 billion). Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database.
9
Conventional Database Building
10
Facts – current standard... (ACS) owes most of its wealth to its two 'information services' divisions: the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million (82% of the society's revenue) and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily... Source: ACS homepage
11
Facts
- Established application
- Straightforward use
- De facto gold standard
- Unique data source
- Very costly
- No structure export for a reasonable price
- Very limited in large-scale follow-up analysis
- Most recent patents not available
12
Not data (search), but integration, analysis and insight, leading to decisions and discovery
13
Now – What would be the perfect solution? All patent offices would require applicants to provide all claimed structures in a machine-readable version, available for one-click download.
14
Text extraction – Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures, and make them available in a machine-readable format.
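The definition above can be illustrated with a minimal sketch. It is not the chemicalize.org technology: the lookup table `NAME_TO_SMILES`, the function `extract_structures`, and the sample sentence are all hypothetical, and a fixed dictionary stands in for a real name-to-structure converter. It only shows the three steps: find the names, convert them to structures (SMILES here), and emit machine-readable records.

```python
import re

# Toy name-to-SMILES table (illustrative assumption; a real system would use
# a name-to-structure converter instead of a fixed lookup).
NAME_TO_SMILES = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C",
    "ethanol": "CCO",
}

# One case-insensitive pattern that matches any dictionary name as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, NAME_TO_SMILES)) + r")\b",
    re.IGNORECASE,
)


def extract_structures(text):
    """Return (name, SMILES) pairs for dictionary names found in text,
    in order of first mention and without duplicates."""
    seen = []
    for match in _PATTERN.finditer(text):
        name = match.group(1).lower()
        if name not in seen:
            seen.append(name)
    return [(name, NAME_TO_SMILES[name]) for name in seen]


sample = ("The composition comprises Aspirin and caffeine dissolved in "
          "ethanol; aspirin is preferred.")
records = extract_structures(sample)
for name, smiles in records:
    # Tab-separated output is one simple machine-readable format.
    print(f"{name}\t{smiles}")
```

A dictionary lookup only finds names it already knows; systematic (IUPAC) names in patents need a rule-based name-to-structure step, which is where the converters listed later in the deck come in.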
15
Mining for Chemical Knowledge – Technologies from providers
Text entity recognition
(a) Extractors (IUPAC names)
- TEMIS Chemical Entity Relationships Skill Cartridge
- Accelrys Pipeline Pilot extractor (Notiora)
- Fraunhofer (ProMiner Chemistry)
- Chemaxon (chemicalize.org)
- Oscar (Corbett, Murray-Rust et al.)
- SureChem
- IBM ChemFrag Annotator
(b) Converters (names to connection table)
- CambridgeSoft name=struct
- Openeye Lexichem
- Chemaxon
Image recognition
- OSRA (NIH)
- Clide Pro (Keymodule Ltd.)
- Fraunhofer chemoCR
- ChemReader
16
The objective: to give NIBR scientists a tool with sophisticated text analysis methods, thereby leveraging the methods of TMS
17
Mining for Chemical Knowledge – Novartis Tools – the chemicalize technology is working under the hood! Clipboard analysis; patent text; identified structures; view structure on mouse-over; export to other applications
18
Mining for Knowledge – Novartis Tools Input example: J Med Chem Paper
19
Mining for Chemical Knowledge – Use Case: A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project. Identification of the core scaffold; analysis of substitution patterns. This enables the identification of the compounds most representative of a competitor patent.
20
Example – A text-based patent: automated text extraction found 452 compounds; the reference contains 636 compounds (71%). [Figure: a patent example]
21
Example – An image-based patent: text extraction is not suitable for this case. It finds only a meager 40 molecules, against 1129 in the reference. Why? [Figure: an entirely image-based patent example]
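The figures from the two patent examples can be checked with a short calculation. This is a sketch under one assumption the slides do not state: that every extracted compound also appears in the reference set, so that extracted divided by reference is a coverage fraction. The function name `coverage` is illustrative.

```python
def coverage(extracted, reference):
    """Fraction of the reference set that was recovered.

    Assumes every extracted compound is also in the reference set
    (an assumption; the slides give only the two counts).
    """
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return extracted / reference


# Counts taken from the two patent examples.
text_based = coverage(452, 636)
image_based = coverage(40, 1129)

print(f"text-based patent:  {text_based:.1%}")
print(f"image-based patent: {image_based:.1%}")
```

Under that assumption the text-based patent comes out at roughly 71%, matching the slide, while the image-based patent stays below 4%, which is why text extraction alone is not enough there.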
22
Language issues – e.g. Japanese patents
23
Encountered problems
- OCR (Optical Character Recognition)!! USPTO and WIPO patents are now available as full text in most cases
- Typos!
- Name2Struct problems (less of an issue here)
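One common way to soften OCR errors and typos is fuzzy matching of each token against a list of known names. The sketch below uses `difflib.get_close_matches` from Python's standard library. It is not the method used in the project: the vocabulary `KNOWN_NAMES`, the function `correct_name`, and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical reference vocabulary of known compound names.
KNOWN_NAMES = ["sildenafil", "vardenafil", "tadalafil"]


def correct_name(token, cutoff=0.8):
    """Return the closest known name for a possibly garbled token,
    or None when nothing is similar enough."""
    matches = difflib.get_close_matches(
        token.lower(), KNOWN_NAMES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


print(correct_name("sildenafi1"))  # OCR read the final "l" as "1"
print(correct_name("vardenafll"))  # OCR read "i" as "l"
print(correct_name("benzene"))     # not close to any known name
```

The cutoff trades missed corrections against false ones: set too low, unrelated names get "corrected" into known drugs. This approach also only helps for names already in the vocabulary, not for novel systematic names.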
24
IBM initiative: Patent Mining / ChemVerse database (Steve Boyer). The objective is to automatically extract all molecules from all available patents and make them searchable in a database. They leverage cloud computing and have access to all full-text patents. They annotate the molecules with information from freely available databases. This is absolutely going in the right direction.
25
Future ideas: Patent Analysis
- Markush translation, Image+Target
- Ranking capabilities of outcome for user
- Blurred dictionaries for translating terms like aryl, cycloalkyl etc.
- Select, annotate as entity
- On-the-fly error correction
- Result goes in a database
- Crowdsourcing efforts to improve and store results
- Suggest functionality
26
To enable true Patinformatics analyses... Definition by Tony Trippe:
27
Acknowledgements: Alex Fromm, Katia Vella, Olivier Kreim, Therese Vachon, Daniel Cronenberger, Pierre Parisot, Martin Romacker, Nicolas Grandjean; NITAS/TMS; Clayton Springer, Naeem Yusuff, Bharat Lagu; and many other people in different divisions of NIBR, for their support