Www.digitalchemistry.co.uk Recent and Current Developments in Handling Markush Structures from Chemical Patents Dr John M. Barnard Scientific Director.

Slides:



Advertisements
Similar presentations
Recuperação de Informação B Cap. 10: User Interfaces and Visualization 10.1,10.2,10.3 November 17, 1999.
Advertisements

August 2010, ACS National meeting, Boston Representation of Markush structures from molecules towards patents Szabolcs Csepregi Solutions for Cheminformatics.
May, 2008 Presenting: Szabolcs Csepregi The ChemAxon Markush project overview and development discussion.
Version 5.3, April 2010 The ChemAxon Markush project overview and development discussion.
SOMA2 – Drug Design Environment. Drug design environment – SOMA2 The SOMA2 project Tekes (National Technology Agency of Finland) DRUG2000 program.
UGM, June, 2007 Szabolcs Csepregi Markush: Whats new, development discussions.
Solutions for Cheminformatics
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Dr. Matthew Wright Product Director.
Chemical Entity extraction using the chemicalize.org-technology Josef Scheiber Novartis Pharma AG – NITAS/TMS.
Darrell W. Gunter EVP / CMO Collexis Holdings, Inc. March 23, 2010 Spring Conference CONTENT: Uncovering the Value and Benefits of Semantic Technology.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
Personalia: Pre-Sheffield Batchelor’s degree in Chemistry at Oxford Pre-university job in my local public library system Chemistry or information science?
Bonn-Aachen International Center for Information Technology Evaluation of different benchmark sets and evaluation methods for automatic extraction of chemical.
1 DAFFODIL Effective Support for Using Digital Libraries Norbert Fuhr University of Duisburg-Essen, Germany.
Network Management Overview IACT 918 July 2004 Gene Awyzio SITACS University of Wollongong.
Information Retrieval in Practice
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
Scientific Workflows Systems : In Drug discovery informatics Presented By: Tumbi Muhammad Khaled 3 rd Semester Department of Pharmacoinformatics.
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
Automatic Image Annotation and Retrieval using Cross-Media Relevance Models J. Jeon, V. Lavrenko and R. Manmathat Computer Science Department University.
SciFinder ® : Part of the process™ 2006 Edition. SciFinder ® : Part of the process™ 2006 Edition SciFinder ® 2006 provides new, powerful capabilities.
Tunable Machine Vision-Based Strategy for Automated Annotation of Chemical Database ChemReader Jungkap Park, Gus R. Rosania, and Kazuhiro Saitou University.
1 CS 430 / INFO 430 Information Retrieval Lecture 24 Usability 2.
User interface design Designing effective interfaces for software systems Objectives To suggest some general design principles for user interface design.
1 BrainWave Biosolutions Limited Accelerating Life Science Research through Technology.
Copyright © Allyn & Bacon 2008 POWER PRACTICE Chapter 5 Administrative Software START This multimedia product and its contents are protected under copyright.
1 Chemical Structure Representation and Search Systems Lecture 5. Nov 13, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.
1 Chemical Structure Representation and Search Systems Lecture 3. Nov 4, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software.
SciFinder Web Version Pootorn R. Book Promotion & Service Co.,Ltd. Thailand.
Aniko T. Valko, Keymodule Ltd.
Information Need Question Understanding Selecting Sources Information Retrieval and Extraction Answer Determina tion Answer Presentation This work is supported.
Mihir Daptardar Software Engineering 577b Center for Systems and Software Engineering (CSSE) Viterbi School of Engineering 1.
Fusing database rankings in similarity-based virtual screening Peter Willett, University of Sheffield.
Scientific Data and Electronic Publishing Renze Brandsma, Head, Digital Production Centre University of Amsterdam Maarten Hoogerwerf, Project Manager,
Representing Markush Structures from Patents and Combinatorial Libraries Dr John M. Barnard Scientific Director Digital Chemistry.
Information Visualization: Ten Years in Review Xia Lin Drexel University.
Software Architecture
Elsevier Confidential Reaxys – an introduction. Elsevier Confidential 2  A brand new workflow tool for synthetic and medical chemists  An extensive.
Scientific Applications of XML Arvind Hulgeri, Shantanu Godbole
BAA - Big Mechanism using SIRA Technology Chuck Rehberg CTO at Trigent Software and Chief Scientist at Semantic Insights™
Benchmarking ontology-based annotation tools for the Semantic Web Diana Maynard University of Sheffield, UK.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
2005/12/021 Content-Based Image Retrieval Using Grey Relational Analysis Dept. of Computer Engineering Tatung University Presenter: Tienwei Tsai ( 蔡殿偉.
Methods for Automatic Evaluation of Sentence Extract Summaries * G.Ravindra +, N.Balakrishnan +, K.R.Ramakrishnan * Supercomputer Education & Research.
Dendral: A Case Study Lecture 25.
Introduction to Information Retrieval Example of information need in the context of the world wide web: “Find all documents containing information on computer.
Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.
Information Retrieval
A Classification-based Approach to Question Answering in Discussion Boards Liangjie Hong, Brian D. Davison Lehigh University (SIGIR ’ 09) Speaker: Cho,
The experience of student use of eBooks on mobile devices Amy Devenney, University of Huddersfield Maggie Sarjantson, University of Hull.
Digital Video Library Network Supervisor: Prof. Michael Lyu Student: Ma Chak Kei, Jacky.
Combining Text and Image Queries at ImageCLEF2005: A Corpus-Based Relevance-Feedback Approach Yih-Cheng Chang Department of Computer Science and Information.
Margret Plank 17th International Conference on Grey Literature 1st and 2nd December 2015, Amsterdam (Netherlands) Move beyond text – How TIB manages the.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Computational Approach for Combinatorial Library Design Journal club-1 Sushil Kumar Singh IBAB, Bangalore.
Reaxys – The Highlights. Slide 2 What is Reaxys? A brand new workflow solution for research chemists and scientists from related disciplines An extensive.
A Semi-Automated Digital Preservation System based on Semantic Web Services Jane Hunter Sharmin Choudhury DSTC PTY LTD, Brisbane, Australia Slides by Ananta.
Ingenuity Pathway Analysis Alex Pico. Description "IPA is a software application that enables researchers to analyze and understand the complex biological.
Representation and searching of molecules in chemical patents Presented at IRF Symposium 2007, 8 th November 2007, Vienna Peter Willett, University of.
Page 1 Computer-aided Drug Design —Profacgen. Page 2 The most fundamental goal in the drug design process is to determine whether a given compound will.
Information Retrieval in Practice
Automation of systematic reviews: the reviewer’s viewpoint
from scientific literature Principal Scientist (Chemoinformatics)
CSc4730/6730 Scientific Visualization
Aniko T. Valko, Keymodule Ltd.
Multimedia Information Retrieval
E-Science Life-Cycle A. D. Smith – September 26, 2011.
Jonathan Griffin, Managing Director, IFIS Publishing &
Digital Chemistry Ltd and Barnard Chemical Information Ltd
Presentation transcript:

Recent and Current Developments in Handling Markush Structures from Chemical Patents Dr John M. Barnard Scientific Director Digital Chemistry Ltd., UK Presented at GDCh 7 th German Conference on Chemoinformatics Goslar, Germany, 6-8 November 2011 Entire presentation Copyright © Digital Chemistry Ltd., 2011

2 Markush structures in patents Existing commercial systems for search – launched in the late 1980s Markush DARC (MMS) MARPAT – little change since then Much recent and current activity – data mining of patent text – new databases of specific structures from patents – new search software – integration of patent data into drug discovery informatics

3 Chemical patent claims Markush claim Claimed example compounds

4 Chemical patents in two dimensions Single molecules individually exemplified in the patent, or enumerated from a Markush structure Specific structures Markush structure Manual input and curation Automatic analysis of full text patent Current "gold standard" for accuracy and retrieval performance. Difficult to visualise and complicated to search Expensive and time consuming Easy to search using conventional structure search systems Cover the scope and claims of the patent more comprehensively Attractive approach given free availability of raw material Active area of research and development

5 Databases and data mining software Specific Structures Markush Structures Manually Curated Automatically Extracted MMS MARPAT SureChem CA Registry Derwent Chemistry Resource Reaxys IBM CLiDE chemoCR Databases Data-mining software DecrIPt ChemProspector

6 Specific structure databases Manually-curated databases of specific structures from patents have a long history – Chemical Abstracts Registry (from 1907) – Derwent Chemistry Resource (linked to WPI) – etc. Some newer databases use combination of automatic extraction and manual curation – Elsevier Reaxys database incorporates former MDL Patent Chemistry Database (patents from 1976) Can be searched by conventional full structure and substructure search systems

7 Mining patents for specific structures Free availability of full-text patent documents since 1990s has encouraged data mining to extract specific structures Commercial and open-source software used for both steps – use of multiple NT programs can improve quality Recent work in both academic and commercial environments – Cambridge University (OSCAR, OPSIN) – SURECHEM database (Macmillan/Digital Science) – IBM Patent database Raw patent documents Systematic nomenclature Connection tables Chemical name recognition Nomenclature translation

8 Databases and data mining software Specific Structures Markush Structures Manually Curated Automatically Extracted MMS MARPAT SureChem CA Registry Derwent Chemistry Resource Reaxys IBM CLiDE chemoCR Databases Data-mining software DecrIPt ChemProspector

9 Mining patents for Markush structures much more complicated than for specific structures requires analysis of both structure diagrams and text, and of semantic relationships between them

10 Mining patents for Markush structures Seminal work at Sheffield University (mid-1990s) – based on analysis of Derwent Abstracts Now a very active area of research and development CLiDE Pro Leeds University / KeyModule Ltd., UK Extension of original chemical OCR software graphical regions identified and separated from text connection tables generated from graphical regions links established between R-groups and references in text Fraunhofer SCAI, Germany Work based on ChemoCR program for analysis of documents with images similar workflow to CLiDE Extended by Haupt (2009) to reconstruct Markush structures from patents, in an extended SMILES notation utilises open-source UIMA to provide framework for text analysis limited success in initial results ChemProspector Work at InfoChem GmbH under German government THESEUS project; some success but much work still to do

11 Prospects for automatic analysis Can automatically-extracted databases supplant manually curated ones? – manually-curated databases remain “gold standard” – automatically extracted ones are gaining ground, especially for specific structures nomenclature translation software has improved source nomenclature quality remains an issue Markush structure extraction still has a long way to go – issues of generic nomenclature translation – likely to remain a need for manual intervention to resolve ambiguities but automation could do much of the “donkey work” might also assist with transcription and other errors that creep in during manual curation

12 Searching chemical patents Specific structure databases can be searched by conventional (sub)structure and similarity – limited to those specific structures extracted from patent Main Markush search systems now showing their age – only available online – clunky interfaces – difficult to visualise complete structures New systems in the pipeline – improved visualisation – in-house deployment – new search algorithms Thomson Reuters making MMS database available for in-house use Chemical Abstracts Service retaining control over MARPAT database Potential for new (automatically- extracted?) databases

13 Markush Visualization Interactive desktop application for visualization of MMS structures, developed at Roche using Pipeline Pilot Deng et al., JCIM 2011, 51,

14 In-house Markush searching New commercial software for substructure search of Markush databases – ChemAxon, Digital Chemistry – able to handle homology variation (“R1= alkyl, heteroaryl” etc.) – intended to enable Thomson MMS database to be searched in-house, rather than online via Questel Advantages – improved visualisation of Markush – partial enumeration of specific molecules covered – end-user chemist access to patent databases – integration with drug discovery informatics systems rather than separate patent search department }

15 Alternative approaches to search Substructure search of Markush databases presents several challenges – lack of suitable available databases – absence of standard exchange formats – complexities of matching specific structures against homology-variant groups Various attempts made to “finesse” the search have been proposed and developed vs. R84 is a substituted or unsubstituted, mono-, di- or polycyclic, aromatic or non- aromatic, carbocylic or heterocyclic ring system, or...

16 Alternative approaches - Daylight Proposed by Dave Weininger at MUG Meeting, 1998 Collaborative work with Derwent Information (now Thomson Reuters) Never developed further MMS Markush structures Selective enumeration of specific molecules Asymmetric (“Tversky”) similarity search Daylight fingerprints

17 Alternative approaches - DecrIPt Developed by DecrIPt LLC, founded in USA 1996 search principles originally developed in collaboration with Pfizer Inc. described in US patent applications in 2009 and 2010 MMS Markush structures Random enumeration of specific molecules Identify Markushes for which large proportion of specifics have high similarity to query molecule Structure fingerprints

18 Alternative approaches – IBM SIMPLE Developed by IBM to perform searches across specific molecules mined from free-text chemical patents (nomenclature extraction and translation) available to subscribers only principles described in some conference presentations and in 2011 US patent application Similarity calculation (Tanimoto coefficient) Integer vector encoding Feature analysis (equivalent to substructural fragments) Specific molecules Canonical InChI representation

19 Alternative approaches - Accelrys Developed in association with Notiora consultants (B. Bartell) based on specific molecules mined from free-text patents described at 2009 ICIC conference presentation Specific molecules Extended-connectivity fingerprints Maximum common substructure search “Aggregate asymmetric screen search” Principle is to match fragments from different specific molecules mentioned the same patent against a target query molecule

20 Practicalities of searching Established systems generally regarded as insufficient to cope with the complexity of modern patents – in many cases high recall is required – searchers have to put up with poor precision Recall and precision are not usually relevant to chemical structure search deterministic isomorphism algorithms give 100% recall and 100% precision Situation with Markush structures is more ambiguous some parts of Markush may be more important than others (“what the patent teaches”) meaning that hits may have degrees of relevance ranked output might help bring most relevant hits to the top of long lists

21 Retrieval system evaluation Systematic evaluation of performance of search systems may be useful TREC-CHEM track recently started under auspices of long-running Text-REtrieval Conferences – multi-year experiment using standard set of patents and queries with relevance judgements – commercial databases not included (at least as yet) – most participants are academic groups using e.g. nomen- clature identification and translation software Evaluation of retrieval systems with graded relevance judgements is in its infancy

22 Conclusions and Prospects Data mining of specific structures from patents increasingly proving useful – AZ Chemistry Connect (Muresan et al., DDT 2011, 16, Automatic mining of Markush descriptions still in its infancy – likely to be used ultimately in semi-automatic curation systems New generation of Markush search software appearing, using existing curated databases – may be deployed in-house and online, with improved visualisation – may feature relevance ranking of hits Similarity-based search approaches may find some applications, but are unlikely to provide high-recall high-precision search results

23 Contact details Dr John M. Barnard Scientific Director, Digital Chemistry Ltd. 46 Uppergate Road, Sheffield S6 6BX, UK j +44 (0) G. M. Downs and J. M. Barnard, “Chemical patent information systems,” Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 1, no. 5, pp , Free download from until end of (DOI: /wcms.41)